cs.CV [Total: 135]
cs.CL [Total: 61]
cs.GR [Total: 2]
cs.AI [Total: 8]
cs.RO [Total: 5]
cs.SD [Total: 1]
cs.NI [Total: 1]
stat.ML [Total: 1]
cs.IR [Total: 1]
eess.IV [Total: 2]
cs.LG [Total: 10]
cs.CR [Total: 1]
cs.MA [Total: 1]
cs.MM [Total: 1]

cs.CV

[1] Random Direct Preference Optimization for Radiography Report Generation cs.CV | cs.AI | cs.CL

Valentin Samokhin, Boris Shirokikh, Mikhail Goncharov, Dmitriy Umerenkov, Maksim Bobrin

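TL;DR: The paper proposes a model-agnostic framework based on Random Direct Preference Optimization (DPO) that improves the accuracy of radiography report generation (RRG) without additional data or human annotation.

Details

Motivation: Existing radiography report generation methods fall short of the quality required in real clinical settings, while the success of general-purpose vision-language models (VLMs) suggests that alignment techniques could improve RRG.

Result: Experiments on three state-of-the-art models show that the method improves clinical performance metrics by up to 5% without additional training data.

Insight: Random sampling can build contrastive training pairs efficiently without reward models or human preference annotation, which suggests a new direction for few-shot learning in the medical domain.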

Abstract: Radiography Report Generation (RRG) has gained significant attention in medical image analysis as a promising tool for alleviating the growing workload of radiologists. However, despite numerous advancements, existing methods have yet to achieve the quality required for deployment in real-world clinical settings. Meanwhile, large Visual Language Models (VLMs) have demonstrated remarkable progress in the general domain by adopting training strategies originally designed for Large Language Models (LLMs), such as alignment techniques. In this paper, we introduce a model-agnostic framework to enhance RRG accuracy using Direct Preference Optimization (DPO). Our approach leverages random contrastive sampling to construct training pairs, eliminating the need for reward models or human preference annotations. Experiments on supplementing three state-of-the-art models with our Random DPO show that our method improves clinical performance metrics by up to 5%, without requiring any additional training data.
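
The abstract names DPO and "random contrastive sampling" but gives no formulas, so the following is only a minimal sketch. It uses the standard DPO loss for a single preference pair and assumes, purely for illustration, that the rejected report for each study is a report drawn at random from a different study. The function names `dpo_loss` and `random_pairs` are hypothetical and not taken from the paper.

```python
import math
import random


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def random_pairs(reports, rng):
    """ASSUMPTION: the rejected report is a random report from another study."""
    if len(reports) < 2:
        raise ValueError("need at least two reports to form a pair")
    pairs = []
    for i, chosen in enumerate(reports):
        j = rng.randrange(len(reports) - 1)
        if j >= i:  # skip index i so a report is never paired with itself
            j += 1
        pairs.append((chosen, reports[j]))
    return pairs


pairs = random_pairs(["report A", "report B", "report C"], random.Random(0))
```

When the policy assigns the same log-probabilities as the reference model, the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the chosen report.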

[2] Improving Autism Detection with Multimodal Behavioral Analysis cs.CV | cs.LG

William Saakyan, Matthias Norden, Lola Eversmann, Simon Kirsch, Muyu Lin

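TL;DR: The paper presents a multimodal behavioral analysis that combines facial expressions, voice prosody, head motion, heart rate variability, and gaze behavior to improve autism detection accuracy, with a notable gain from its gaze features.

Details

Motivation: Current autism diagnosis relies on complex clinical assessment that is costly and resource-intensive. Video-based computer-aided diagnostic methods show promise but suffer from weak gaze features and limited real-world generalizability.

Result: The new gaze descriptors raise gaze-based classification accuracy from 64% to 69%. The late-fusion multimodal model reaches a final classification accuracy of 74%, supporting the effectiveness of multimodal behavioral analysis for autism detection.

Insight: Variability in gaze behavior is an important feature for autism detection, and combining modalities lifts accuracy above the gaze-only model (74% versus 69%). The study supports the development of automated, video-based autism screening tools.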

Abstract: Due to the complex and resource-intensive nature of diagnosing Autism Spectrum Condition (ASC), several computer-aided diagnostic support methods have been proposed to detect autism by analyzing behavioral cues in patient video data. While these models show promising results on some datasets, they struggle with poor gaze feature performance and lack of real-world generalizability. To tackle these challenges, we analyze a standardized video dataset comprising 168 participants with ASC (46% female) and 157 non-autistic participants (46% female), making it, to our knowledge, the largest and most balanced dataset available. We conduct a multimodal analysis of facial expressions, voice prosody, head motion, heart rate variability (HRV), and gaze behavior. To address the limitations of prior gaze models, we introduce novel statistical descriptors that quantify variability in eye gaze angles, improving gaze-based classification accuracy from 64% to 69% and aligning computational findings with clinical research on gaze aversion in ASC. Using late fusion, we achieve a classification accuracy of 74%, demonstrating the effectiveness of integrating behavioral markers across multiple modalities. Our findings highlight the potential for scalable, video-based screening tools to support autism assessment.
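
The abstract says only that the modalities are combined "using late fusion". One common form of late fusion, assumed here for illustration, is a weighted average of per-modality class probabilities followed by a threshold. The per-modality scores below are made-up numbers, and `late_fusion` is a hypothetical helper rather than the authors' code.

```python
def late_fusion(probs_by_modality, weights=None):
    """Weighted average of per-modality probabilities for the positive class."""
    names = sorted(probs_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total = sum(weights[name] for name in names)
    return sum(weights[name] * probs_by_modality[name] for name in names) / total


# Illustrative (invented) outputs of five independently trained classifiers.
scores = {"face": 0.62, "voice": 0.55, "head": 0.48, "hrv": 0.51, "gaze": 0.70}
fused = late_fusion(scores)
label = int(fused >= 0.5)  # 1 = positive class, 0 = negative class
```

Because each modality is classified separately before fusion, a weak modality can be down-weighted or dropped without retraining the others.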

[3] KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache cs.CV | cs.AI

Wanshun Xu, Long Zhuang

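TL;DR: KV-Efficient VLA is a model-agnostic memory compression framework that uses a chunked KV cache and a recurrent gating module to selectively retain useful context, improving the inference efficiency and memory use of vision-language-action (VLA) models.

Details

Motivation: The scalability of VLA models is limited by the quadratic cost of attention and the unbounded growth of KV memory, especially in long-horizon inference. Recent methods improve generalization by scaling backbone architectures but neglect the inference efficiency needed for real-time deployment.

Result: Theoretical analysis indicates up to 1.21x inference speedup and a 36% reduction in KV memory, with minimal impact on task success.

Insight: The method integrates into existing autoregressive and hybrid VLA models without changes to training pipelines or downstream control logic, offering an efficient option for long-horizon inference.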

Abstract: Vision-Language-Action (VLA) models promise unified robotic perception and control, yet their scalability is constrained by the quadratic cost of attention and the unbounded growth of key-value (KV) memory during long-horizon inference. While recent methods improve generalization through scaling backbone architectures, they often neglect the inference inefficiencies critical to real-time deployment. In this work, we present KV-Efficient VLA, a model-agnostic memory compression framework that addresses these limitations by introducing a lightweight, training-friendly mechanism to selectively retain high-utility context. Our method partitions the KV cache into fixed size chunks and employs a recurrent gating module to summarize and filter historical context according to learned utility scores. This design preserves recent fine-grained detail while aggressively pruning stale, low-relevance memory, all while maintaining causality. Theoretically, KV-Efficient VLA yields up to 1.21x inference speedup and 36% KV memory reduction, with minimal impact on task success. Our method integrates seamlessly into existing autoregressive and hybrid VLA stacks, enabling scalable inference without modifying training pipelines or downstream control logic.
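
The chunk-and-filter idea described in the abstract can be sketched in a few lines. This is a toy version under stated assumptions: cache entries are plain numbers standing in for key-value pairs, and the learned recurrent gate is replaced by a caller-supplied `utility` function. The names `split_chunks` and `prune_kv` are hypothetical.

```python
def split_chunks(entries, size):
    """Partition the cache into fixed-size chunks, oldest first."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def prune_kv(entries, chunk_size, keep_recent, utility, threshold):
    """Keep the most recent chunks intact; drop older chunks with low utility."""
    chunks = split_chunks(entries, chunk_size)
    recent = chunks[-keep_recent:] if keep_recent > 0 else []
    older = chunks[:len(chunks) - len(recent)]
    # ASSUMPTION: `utility` stands in for the learned recurrent gating score.
    kept = [chunk for chunk in older if utility(chunk) >= threshold]
    # Order is preserved, so the pruned cache stays causal.
    return [entry for chunk in kept + recent for entry in chunk]


def mean(chunk):
    return sum(chunk) / len(chunk)


cache = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.5, 0.5]
pruned = prune_kv(cache, chunk_size=2, keep_recent=1, utility=mean, threshold=0.5)
```

In this toy run half of the entries are dropped; that figure depends entirely on the invented scores and says nothing about the 36% reduction reported in the abstract.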

[4] Phrase-grounded Fact-checking for Automatically Generated Chest X-ray Reports cs.CV | cs.AI

Razi Mahmood, Diego Machado-Reyes, Joy Wu, Parisa Kaviani, Ken C. L. Wong

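TL;DR: The paper proposes a phrase-grounded fact-checking model (FC model) that detects erroneous findings and their indicated locations in automatically generated chest X-ray reports. A multi-label cross-modal contrastive regression network is trained on a synthetic dataset and shows high accuracy on multiple X-ray datasets.

Details

Motivation: Large vision-language models (VLMs) can now generate realistic-looking radiology reports, but factual errors and hallucinations produced during inference hinder clinical use.

Result: The model is accurate and robust on multiple X-ray datasets, and its error detection reaches a concordance correlation coefficient of 0.997 with ground-truth-based verification.

Insight: The method offers a practical tool for error detection in clinical radiology workflows and shows the potential of synthetic datasets for training high-quality fact-checking models.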

Abstract: With the emergence of large-scale vision language models (VLM), it is now possible to produce realistic-looking radiology reports for chest X-ray images. However, their clinical translation has been hampered by the factual errors and hallucinations in the produced descriptions during inference. In this paper, we present a novel phrase-grounded fact-checking model (FC model) that detects errors in findings and their indicated locations in automatically generated chest radiology reports. Specifically, we simulate the errors in reports through a large synthetic dataset derived by perturbing findings and their locations in ground truth reports to form real and fake findings-location pairs with images. A new multi-label cross-modal contrastive regression network is then trained on this dataset. We present results demonstrating the robustness of our method in terms of accuracy of finding veracity prediction and localization on multiple X-ray datasets. We also show its effectiveness for error detection in reports of SOTA report generators on multiple datasets achieving a concordance correlation coefficient of 0.997 with ground truth-based verification, thus pointing to its utility during clinical inference in radiology workflows.
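
The 0.997 figure is a concordance correlation coefficient. For readers unfamiliar with it, here is a short sketch of Lin's definition, which penalizes both poor correlation and a systematic offset between two sets of scores. The sample values are invented, and the abstract does not say which estimator variant the authors used; this sketch uses population (1/n) moments.

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient with population moments."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov / denom


perfect = concordance_ccc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The second pair is perfectly correlated, yet its coefficient is 4/7 because of the constant offset of one, which is why the measure suits agreement checks better than Pearson correlation.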

Jason Jordan, Mohammadreza Akbari Lor, Peter Koulen, Mei-Ling Shyu, Shu-Ching Chen

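TL;DR: The paper proposes MDF-MLLM, which improves disease classification from retinal fundus images through cross-modal feature alignment and multi-depth feature fusion.

Details

Motivation: Existing multimodal large language models (MLLMs) struggle to capture the low-level spatial details needed to diagnose retinal diseases. The authors propose a new multimodal deep learning architecture that integrates image and text information to improve classification accuracy.

Result: MDF-MLLM reaches 94% accuracy on the dual-type disease classification task, a 56% relative improvement over the baseline's 60%. Recall and F1-score improve by up to 67% and 35%, respectively.

Insight: Multi-depth feature fusion improves the model's spatial reasoning, particularly for inherited diseases that come with rich clinical text. The architecture is modular, and future work targets a larger pool of diseases and segmentation tasks.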

Abstract: This study aimed to enhance disease classification accuracy from retinal fundus images by integrating fine-grained image features and global textual context using a novel multimodal deep learning architecture. Existing multimodal large language models (MLLMs) often struggle to capture low-level spatial details critical for diagnosing retinal diseases such as glaucoma, diabetic retinopathy, and retinitis pigmentosa. This model development and validation study was conducted on 1,305 fundus image-text pairs compiled from three public datasets (FIVES, HRF, and StoneRounds), covering acquired and inherited retinal diseases, and evaluated using classification accuracy and F1-score. The MDF-MLLM integrates skip features from four U-Net encoder layers into cross-attention blocks within a LLaMA 3.2 11B MLLM. Vision features are patch-wise projected and fused using scaled cross-attention and FiLM-based U-Net modulation. Baseline MLLM achieved 60% accuracy on the dual-type disease classification task. MDF-MLLM, with both U-Net and MLLM components fully fine-tuned during training, achieved a significantly higher accuracy of 94%, representing a 56% improvement. Recall and F1-scores improved by as much as 67% and 35% over baseline, respectively. Ablation studies confirmed that the multi-depth fusion approach contributed to substantial gains in spatial reasoning and classification, particularly for inherited diseases with rich clinical text. MDF-MLLM presents a generalizable, interpretable, and modular framework for fundus image classification, outperforming traditional MLLM baselines through multi-scale feature fusion. The architecture holds promise for real-world deployment in clinical decision support systems. Future work will explore synchronized training techniques, a larger pool of diseases for more generalizability, and extending the model for segmentation tasks.
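
The "56% improvement" in the abstract is easiest to read as a relative gain over the 60% baseline rather than a difference in percentage points. A small worked check, using a hypothetical helper `relative_gain`:

```python
def relative_gain(baseline, new):
    """Improvement expressed as a fraction of the baseline value."""
    return (new - baseline) / baseline


absolute_points = 0.94 - 0.60          # difference in accuracy, about 0.34
relative = relative_gain(0.60, 0.94)   # about 0.567
```

The absolute gain is 34 percentage points, while the relative gain is roughly 56.7%, which matches the abstract's figure once truncated.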

[6] Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models cs.CV | cs.AI

Xingkai Peng, Jun Jiang, Meng Tong, Shuai Li, Weiming Zhang

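TL;DR: Multimodal Prompt Decoupling Attack (MPDA) is an attack on the safety filters of text-to-image (T2I) models that uses the image modality to separate out harmful semantics.

Details

Motivation: Existing jailbreak attacks on T2I models mainly manipulate the text prompt. The image modality remains largely unexplored, and text-based methods have difficulty bypassing safety filters.

Result: MPDA bypasses the safety filters of T2I models and generates NSFW images that are semantically consistent with the original unsafe prompts.

Insight: The image modality proves effective for bypassing safety filters, and coordinated multimodal attacks may be a direction for future research.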

Abstract: Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored. Moreover, text-based methods face challenges in bypassing the model’s safety filters. In response to these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which utilizes image modality to separate the harmful semantic components of the original unsafe prompt. MPDA follows three core steps: firstly, a large language model (LLM) decouples unsafe prompts into pseudo-safe prompts and harmful prompts. The former are seemingly harmless sub-prompts that can bypass filters, while the latter are sub-prompts with unsafe semantics that trigger filters. Subsequently, the LLM rewrites the harmful prompts into natural adversarial prompts to bypass safety filters, which guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, the visual language model generates image captions, providing a new pathway to guide the LLM in iterative rewriting and refining the generated content.

[7] A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision–Revised cs.CV | cs.AI

Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu

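TL;DR: The paper proposes a multi-task mutual learning method that combines supervision from salient object detection, foreground contour detection, and edge detection to produce more complete saliency maps and more precise boundaries.

Details

Motivation: Predictions from existing salient object detection methods are often incomplete because of the internal complexity of objects, and their boundaries are inaccurate because of strides in convolution and pooling operations.

Result: Experiments on seven challenging datasets show state-of-the-art results in both salient object detection and edge detection.

Insight: Intertwined multi-task supervision and mutual learning address blurred boundaries and incomplete regions in saliency detection, and may inform related tasks.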

Abstract: Though deep learning techniques have made great progress in salient object detection recently, the predicted saliency maps still suffer from incomplete predictions due to the internal complexity of objects and inaccurate boundaries caused by strides in convolution and pooling operations. To alleviate these issues, we propose to train saliency detection networks by exploiting the supervision from not only salient object detection, but also foreground contour detection and edge detection. First, we leverage salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlight. Second, the foreground contour and edge detection tasks guide each other simultaneously, thereby leading to precise foreground contour prediction and reducing the local noises for edge prediction. In addition, we develop a novel mutual learning module (MLM) which serves as the building block of our method. Each MLM consists of multiple network branches trained in a mutual learning manner, which improves the performance by a large margin. Extensive experiments on seven challenging datasets demonstrate that the proposed method has delivered state-of-the-art results in both salient object detection and edge detection.

[8] MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation cs.CV | cs.AI

Zhicheng Du, Qingyang Shi, Jiasheng Lu, Yingshan Liang, Xinyu Zhang

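TL;DR: MAJORScore is a new metric for multimodal relevance that uses a multimodal joint representation to score the relevance of three or more modalities.

Details

Motivation: Existing metrics only support relevance analysis between two modalities, which limits the evaluation of multimodal similarity.

Result: Extensive experiments show that, compared with existing methods, MAJORScore increases by 26.03%-64.29% for consistent modalities and decreases by 13.28%-20.54% for inconsistent ones.

Insight: MAJORScore offers a more reliable metric for evaluating large-scale multimodal datasets and multimodal model performance.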

Abstract: The multimodal relevance metric is usually borrowed from the embedding ability of pretrained contrastive learning models for bimodal data, which is used to evaluate the correlation between cross-modal data (e.g., CLIP). However, the commonly used evaluation metrics are only suitable for the associated analysis between two modalities, which greatly limits the evaluation of multimodal similarity. Herein, we propose MAJORScore, a brand-new evaluation metric for the relevance of multiple modalities ($N$ modalities, $N\ge3$) via multimodal joint representation for the first time. The ability of multimodal joint representation to integrate multiple modalities into the same latent space can accurately represent different modalities at one scale, providing support for fair relevance scoring. Extensive experiments have shown that MAJORScore increases by 26.03%-64.29% for consistent modality and decreases by 13.28%-20.54% for inconsistence compared to existing methods. MAJORScore serves as a more reliable metric for evaluating similarity on large-scale multimodal datasets and multimodal model performance evaluation.

[9] Safety Assessment of Scaffolding on Construction Site using AI cs.CV | cs.AI

Sameer Prabhu, Amit Patwardhan, Ramin Karim

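TL;DR: The paper explores using artificial intelligence (AI) and digitization to improve the accuracy of scaffolding safety inspection and thereby construction safety. A cloud-based platform processes and analyzes point cloud data to enable automated monitoring of scaffolding structures.

Details

Motivation: Scaffolding inspection is currently mostly visual and manual, which is time-consuming and prone to human error that can lead to unsafe conditions. The study proposes an AI-based automated solution to improve efficiency and accuracy.

Result: The system detects structural modifications to scaffolding by comparing certified reference data with recent point cloud data, which may reduce the time and effort of manual inspection and improve site safety.

Insight: The study points to the potential of AI and digitization for construction safety, in particular the combination of point cloud analysis and automated monitoring, as a technical reference for similar applications.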

Abstract: In the construction industry, safety assessment is vital to ensure both the reliability of assets and the safety of workers. Scaffolding, a key structural support asset, requires regular inspection to detect and identify alterations from the design rules that may compromise its integrity and stability. At present, inspections are primarily visual and are conducted by the site manager or accredited personnel to identify deviations. However, visual inspection is time-intensive and can be susceptible to human errors, which can lead to unsafe conditions. This paper explores the use of Artificial Intelligence (AI) and digitization to enhance the accuracy of scaffolding inspection and contribute to safety improvement. A cloud-based AI platform is developed to process and analyse the point cloud data of the scaffolding structure. The proposed system detects structural modifications through comparison and evaluation of certified reference data with the recent point cloud data. This approach may enable automated monitoring of scaffolding, reducing the time and effort required for manual inspections while enhancing safety on a construction site.
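
The abstract describes comparing certified reference data with a recent point cloud but does not say how. One simple way to do such a comparison, assumed here only for illustration, is to flag scanned points whose nearest reference point is farther away than a tolerance. The function `deviating_points`, the coordinates, and the tolerance are all hypothetical.

```python
import math


def deviating_points(reference, scan, tolerance):
    """Return scanned points farther than `tolerance` from every reference point."""
    flagged = []
    for point in scan:
        nearest = min(math.dist(point, ref) for ref in reference)
        if nearest > tolerance:
            flagged.append(point)
    return flagged


# Invented coordinates in metres: three reference nodes and a recent scan.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.0), (2.0, 0.5, 0.0)]
moved = deviating_points(reference, scan, tolerance=0.1)
```

This brute-force search is quadratic in the number of points; real point clouds would need a spatial index, and swapping the two arguments would additionally reveal reference members that are missing from the scan.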

[10] Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis cs.CV | cs.AI

Aleksa Jelaca, Ying Jiao, Chang Tian, Marie-Francine Moens

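TL;DR: The paper proposes an automatic prompt generation method for counterfactual and creative text-to-image synthesis. Three components (an image evaluator, a supervised prompt rewriter, and a DPO-trained ranker) provide control over counterfactual size.

Details

Motivation: Current text-to-image generation lacks fine-grained controllability, especially counterfactual controllability, which matters for creative and exploratory applications.

Result: Experiments show that the method outperforms state-of-the-art baselines and ChatGPT-4o on counterfactual image generation.

Insight: Automated prompt optimization can improve the counterfactual controllability of text-to-image models and lays groundwork for future research.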

Abstract: Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability, defined as the capacity to deliberately generate images that contradict common-sense patterns, remains a major challenge but plays a crucial role in enabling creativity and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual size text-image dataset and enhance the image evaluator by extending Grounded SAM with refinements, achieving a 114 percent improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.

[11] In silico Deep Learning Protocols for Label-Free Super-Resolution Microscopy: A Comparative Study of Network Architectures and SNR Dependence cs.CV | cs.AI

Shiraz S Kaderuppan, Jonathan Mar, Andrew Irvine, Anurag Sharma, Muhammad Ramadan Saifuddin

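TL;DR: The paper compares two deep learning architectures (O-Net and Theta-Net) for label-free super-resolution microscopy and analyzes their performance under different signal-to-noise ratios (SNRs).

Details

Motivation: The lateral resolution of optical microscopy is limited to about 200 nm, and existing super-resolution techniques require costly modules or specialized techniques. The study aims to achieve economical, label-free super-resolution microscopy through deep learning.

Result: O-Net performs better at high SNR and Theta-Net is better suited to low SNR, so matching the model architecture to the image SNR is important for performance.

Insight: The model architecture should be chosen to match the image SNR to obtain the best super-resolution quality; the two networks are complementary options for label-free optical nanoscopy.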

Abstract: The field of optical microscopy spans across numerous industries and research domains, ranging from education to healthcare, quality inspection and analysis. Nonetheless, a key limitation often cited by optical microscopists refers to the limit of its lateral resolution (typically defined as ~200nm), with potential circumventions involving either costly external modules (e.g. confocal scan heads, etc) and/or specialized techniques [e.g. super-resolution (SR) fluorescent microscopy]. Addressing these challenges in a normal (non-specialist) context thus remains an aspect outside the scope of most microscope users & facilities. This study thus seeks to evaluate an alternative & economical approach to achieving SR optical microscopy, involving non-fluorescent phase-modulated microscopical modalities such as Zernike phase contrast (PCM) and differential interference contrast (DIC) microscopy. Two in silico deep neural network (DNN) architectures which we developed previously (termed O-Net and Theta-Net) are assessed on their abilities to resolve a custom-fabricated test target containing nanoscale features calibrated via atomic force microscopy (AFM). The results of our study demonstrate that although both O-Net and Theta-Net seemingly performed well when super-resolving these images, they were complementary (rather than competing) approaches to be considered for image SR, particularly under different image signal-to-noise ratios (SNRs). High image SNRs favoured the application of O-Net models, while low SNRs inclined preferentially towards Theta-Net models. These findings demonstrate the importance of model architectures (in conjunction with the source image SNR) on model performance and the SR quality of the generated images where DNN models are utilized for non-fluorescent optical nanoscopy, even where the same training dataset & number of epochs are being used.

Yinfeng Yu, Hailong Zhang, Meiling Zhu

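TL;DR: DMTF-AVN is a dynamic multi-target fusion method that uses a refined Transformer mechanism to selectively fuse audio-visual information and achieves state-of-the-art performance on robot navigation tasks.

Details

Motivation: Existing audio-visual navigation methods often overlook deeper perceptual context, so information fusion is inefficient. The paper addresses this with a multi-target architecture and a refined Transformer mechanism.

Result: The method outperforms existing methods on the SR, SPL, and SNA metrics on the Replica and Matterport3D datasets and shows strong generalizability.

Insight: The results suggest that dynamic, selective fusion is key to multimodal navigation and that the refined Transformer mechanism improves how efficiently information is used.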

Abstract: Audiovisual embodied navigation enables robots to locate audio sources by dynamically integrating visual observations from onboard sensors with the auditory signals emitted by the target. The core challenge lies in effectively leveraging multimodal cues to guide navigation. While prior works have explored basic fusion of visual and audio data, they often overlook deeper perceptual context. To address this, we propose the Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Extensive experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA). Furthermore, the model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation. The code and videos are available at https://github.com/zzzmmm-svg/DMTF.
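
Two of the metrics named in the abstract have widely used definitions in embodied navigation: success rate (SR) is the fraction of successful episodes, and SPL (success weighted by path length) scales each success by the ratio of shortest-path length to the length actually travelled. The sketch below follows those common definitions, on the assumption that the paper uses them unchanged; the episode data are invented, and SNA is omitted because the abstract does not define it.

```python
def success_rate(episodes):
    """Fraction of episodes that reached the goal."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Each tuple: (reached goal?, shortest path length, path length taken).
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

A detour halves the credit for the second episode, so SPL (0.5) is lower than SR (about 0.67) on the same data.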

[13] Assessing the Alignment of Popular CNNs to the Brain for Valence Appraisal cs.CV

Laurent Mertens, Elahe’ Yargholi, Laura Van Hove, Hans Op de Beeck, Jan Van den Stock

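TL;DR: The paper studies how well popular CNN architectures correspond to the human brain on a valence appraisal task. It finds that CNNs struggle to go beyond simple visual processing and do not match higher-order brain processing, and it proposes the Object2Brain framework to analyze how object classes affect CNN-to-human correlations.

Details

Motivation: To examine whether CNNs align with the human brain on a more complex process (social cognition), filling a gap left by prior work that focused mainly on general visual perception.

Result: CNNs struggle to reflect higher-order brain processing on the valence appraisal task, and different CNN architectures show different object class sensitivities.

Insight: The limitations of CNNs on complex cognitive tasks suggest that their correspondence with the human brain may be limited to low-level visual processing.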

Abstract: Convolutional Neural Networks (CNNs) are a popular type of computer model that have proven their worth in many computer vision tasks. Moreover, they form an interesting study object for the field of psychology, with shown correspondences between the workings of CNNs and the human brain. However, these correspondences have so far mostly been studied in the context of general visual perception. In contrast, this paper explores to what extent this correspondence also holds for a more complex brain process, namely social cognition. To this end, we assess the alignment between popular CNN architectures and both human behavioral and fMRI data for image valence appraisal through a correlation analysis. We show that for this task CNNs struggle to go beyond simple visual processing, and do not seem to reflect higher-order brain processing. Furthermore, we present Object2Brain, a novel framework that combines GradCAM and object detection at the CNN-filter level with the aforementioned correlation analysis to study the influence of different object classes on the CNN-to-human correlations. Despite similar correlation trends, different CNN architectures are shown to display different object class sensitivities.

[14] Debugging Concept Bottleneck Models through Removal and Retraining cs.CV | cs.LG

Eric Enouen, Sainyam Galhotra

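TL;DR: The paper proposes an interpretable debugging framework with a two-step process (Removal and Retraining) that addresses misalignment between Concept Bottleneck Models (CBMs) and expert reasoning, in particular shortcuts learned from biased data.

Details

Motivation: Existing CBMs let experts validate predictions and intervene on concepts, but such interventions cannot fix systemic misalignment between the model and expert reasoning, for example when the model learns shortcuts from biased data.

Result: CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.

Insight: The interpretability of CBMs can serve as a bridge that converts high-level user feedback into low-level training signals, improving the model's robustness and its alignment with experts.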

Abstract: Concept Bottleneck Models (CBMs) use a set of human-interpretable concepts to predict the final task label, enabling domain experts to not only validate the CBM’s predictions, but also intervene on incorrect concepts at test time. However, these interventions fail to address systemic misalignment between the CBM and the expert’s reasoning, such as when the model learns shortcuts from biased data. To address this, we present a general interpretable debugging framework for CBMs that follows a two-step process of Removal and Retraining. In the Removal step, experts use concept explanations to identify and remove any undesired concepts. In the Retraining step, we introduce CBDebug, a novel method that leverages the interpretability of CBMs as a bridge for converting concept-level user feedback into sample-level auxiliary labels. These labels are then used to apply supervised bias mitigation and targeted augmentation, reducing the model’s reliance on undesired concepts. We evaluate our framework with both real and automated expert feedback, and find that CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.

[15] ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data cs.CV | cs.RO | eess.IV

Anja Sheppard, Tyler Smithline, Andrew Scheffer, David Smith, Advaith V. Sethuraman

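TL;DR: ShipwreckFinder is an open-source QGIS plugin that uses a deep learning model to automatically detect shipwrecks in multibeam sonar data.

Details

Motivation: Manually inspecting bathymetric data for shipwrecks is time-consuming and depends on expert analysis, so an automated tool is needed.

Result: ShipwreckFinder shows better segmentation performance than a deep learning-based ArcGIS toolkit and a classical inverse sinkhole detection method.

Insight: Synthetic data can improve the model's generalization, and open-source tooling helps advance shipwreck detection research and applications.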

Abstract: In this paper, we introduce ShipwreckFinder, an open-source QGIS plugin that detects shipwrecks from multibeam sonar data. Shipwrecks are an important historical marker of maritime history, and can be discovered through manual inspection of bathymetric data. However, this is a time-consuming process and often requires expert analysis. Our proposed tool allows users to automatically preprocess bathymetry data, perform deep learning inference, threshold model outputs, and produce either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The backbone of this open-source tool is a deep learning model, which is trained on a variety of shipwreck data from the Great Lakes and the coasts of Ireland. Additionally, we employ synthetic data generation in order to increase the size and diversity of our dataset. We demonstrate superior segmentation performance with our open-source tool and training pipeline as compared to a deep learning-based ArcGIS toolkit and a more classical inverse sinkhole detection method. The open-source tool can be found at https://github.com/umfieldrobotics/ShipwreckFinderQGISPlugin.
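
The post-processing steps listed in the abstract (threshold the model output, then emit either a mask or a bounding box) can be illustrated on a tiny grid. This is a sketch under stated assumptions, not the plugin's code: the probability grid is invented, the helpers `threshold_mask` and `bounding_box` are hypothetical, and a single box is drawn around all positive pixels rather than one per connected component.

```python
def threshold_mask(prob, threshold=0.5):
    """Convert a grid of per-pixel probabilities into a binary mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob]


def bounding_box(mask):
    """Smallest (row_min, col_min, row_max, col_max) box covering all 1-pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None  # nothing detected
    return (min(rows), min(cols), max(rows), max(cols))


# Invented model output for a 3x3 bathymetry tile.
prob = [
    [0.1, 0.2, 0.1],
    [0.1, 0.8, 0.9],
    [0.0, 0.7, 0.2],
]
mask = threshold_mask(prob, threshold=0.5)
box = bounding_box(mask)
```

Raising the threshold trades recall for precision, which is presumably why the tool exposes it as a user-controlled step.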

[16] TUN3D: Towards Real-World Scene Understanding from Unposed Images cs.CV | eess.IV

Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova

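TL;DR: TUN3D requires neither depth supervision nor ground-truth camera poses and is the first method to jointly address layout estimation and 3D object detection from multi-view image input, advancing indoor scene understanding.

Details

Motivation: Existing methods mostly rely on point cloud input, but most consumer cameras lack depth sensors, which limits practical scene understanding. TUN3D addresses this gap.

Result: The method achieves state-of-the-art results on three scene understanding benchmarks, with a particularly large lead in layout estimation.

Insight: TUN3D shows the potential of scene understanding from visual data alone, which is useful for practical applications.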

Abstract: Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d .

[17] Large AI Model-Enabled Generative Semantic Communications for Image Transmission cs.CV | cs.AI | cs.IT | math.IT

Qiyu Ma, Wanli Ni, Zhijin Qin

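TL;DR: The paper proposes a generative semantic communication system that separates key and non-key image regions, combining an image-oriented semantic encoder with image-to-text modeling to improve the semantic fidelity and visual quality of image transmission.

Details

Motivation: Existing methods do not fully account for differences in importance between image regions, which can degrade the reconstruction of visually critical content.

Result: Simulation results show that the proposed system outperforms traditional methods in semantic fidelity and visual quality.

Insight: Distinguishing region importance and using generative AI can improve semantic communication performance, while a lightweight deployment strategy (model quantization and low-rank adaptation fine-tuning) eases the storage and compute demands of large models.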

Abstract: The rapid development of generative artificial intelligence (AI) has introduced significant opportunities for enhancing the efficiency and accuracy of image transmission within semantic communication systems. Despite these advancements, existing methodologies often neglect the difference in importance of different regions of the image, potentially compromising the reconstruction quality of visually critical content. To address this issue, we introduce an innovative generative semantic communication system that refines semantic granularity by segmenting images into key and non-key regions. Key regions, which contain essential visual information, are processed using an image oriented semantic encoder, while non-key regions are efficiently compressed through an image-to-text modeling approach. Additionally, to mitigate the substantial storage and computational demands posed by large AI models, the proposed system employs a lightweight deployment strategy incorporating model quantization and low-rank adaptation fine-tuning techniques, significantly boosting resource utilization without sacrificing performance. Simulation results demonstrate that the proposed system outperforms traditional methods in terms of both semantic fidelity and visual quality, thereby affirming its effectiveness for image transmission tasks.

Nabeel Nisar Bhat, Maksim Karnaukh, Stein Vandenbroeke, Wouter Lemoine, Jakob Struye

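TL;DR: mmHSense provides open, labeled, multimodal mmWave datasets for human sensing research in Integrated Sensing and Communication (ISAC) systems, covering applications such as gesture recognition and person identification, and demonstrates parameter-efficient fine-tuning.

Details

Motivation: ISAC research lacks public multimodal mmWave datasets, which limits the development and validation of human sensing techniques.

Result: The datasets support a range of human sensing tasks, and parameter-efficient fine-tuning significantly reduces computational complexity while maintaining performance on prior tasks.

Insight: Open datasets with a multimodal design are an important resource for ISAC research, and parameter-efficient fine-tuning shows that models can be adapted across tasks.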

Abstract: This article presents mmHSense, a set of open labeled mmWave datasets to support human sensing research within Integrated Sensing and Communication (ISAC) systems. The datasets can be used to explore mmWave ISAC for various end applications such as gesture recognition, person identification, pose estimation, and localization. Moreover, the datasets can be used to develop and advance signal processing and deep learning research on mmWave ISAC. This article describes the testbed, experimental settings, and signal features for each dataset. Furthermore, the utility of the datasets is demonstrated through validation on a specific downstream task. In addition, we demonstrate the use of parameter-efficient fine-tuning to adapt ISAC models to different tasks, significantly reducing computational complexity while maintaining performance on prior tasks.

[19] Skeleton Sparsification and Densification Scale-Spaces cs.CV | eess.IV

Julia Gierke, Pascal Peter

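TL;DR: The paper introduces skeleton sparsification and densification scale-spaces, which sparsify the skeleton to simplify shapes hierarchically, address the noise sensitivity of classical skeletons, and satisfy key scale-space properties.

Details

Motivation: Classical skeletons such as the medial axis are sensitive to noise: small boundary changes cause unwanted expansions of the skeleton. Existing pruning methods mitigate this but lack a systematic framework for hierarchical simplification.

Result: Proof-of-concept experiments demonstrate the framework on robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

Insight: Beyond addressing noise sensitivity, the sparsification and densification scale-spaces provide new shape representations suited to a range of practical applications.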

Abstract: The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. This allows inverse progression from coarse to fine scales and can even reach beyond the original skeleton to produce overcomplete shape representations with relevancy for practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

[20] JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation cs.CVPDF

Md Jueal Mia, M. Hadi Amini

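TL;DR: The paper proposes JaiLIP, a loss-guided image perturbation method for jailbreaking vision-language models (VLMs). It minimizes a joint objective to produce effective, imperceptible adversarial images, and it elicits toxic outputs markedly more often than existing methods.

Details

Motivation: VLMs perform well on multimodal reasoning tasks, but concerns about misuse and safety are growing. Existing attacks are unstable and produce visible perturbations, so a more effective jailbreak method is needed.

Result: Experiments show that JaiLIP's adversarial images are highly effective and imperceptible, outperform existing methods at producing toxic output, and demonstrate the attack's practicality in a specific domain (transportation).

Insight: The study highlights the practical challenges of image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.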

Abstract: Vision-Language Models (VLMs) have remarkable abilities in generating multimodal reasoning tasks. However, potential misuse or safety alignment concerns of VLMs have increased significantly due to different categories of attack vectors. Among various attack vectors, recent studies have demonstrated that image-based perturbations are particularly effective in generating harmful outputs. In the literature, many existing techniques have been proposed to jailbreak VLMs, leading to unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in producing toxicity. Moreover, we have evaluated our method in the transportation domain to demonstrate the attack's practicality beyond toxic text generation in a specific domain. Our findings emphasize the practical challenges of image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.

[21] QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models cs.CVPDF

Jian Liu, Chunshi Wang, Song Guo, Haohan Weng, Zhen Zhou

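TL;DR: QuadGPT is the first autoregressive framework for end-to-end quadrilateral mesh generation. It casts the task as sequence prediction and uses unified tokenization and RL fine-tuning to substantially improve the topological and geometric quality of the quads.

Details

Motivation: Existing methods first generate triangle meshes and then convert them to quads, which yields poor topology. QuadGPT is proposed to generate high-quality quad meshes directly.

Result: Experiments show that QuadGPT significantly outperforms previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality.

Insight: Combining large-scale autoregressive models with topology-aware RL fine-tuning can efficiently produce high-quality structured 3D assets.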

Abstract: The generation of quadrilateral-dominant meshes is a cornerstone of professional 3D content creation. However, existing generative models generate quad meshes by first generating triangle meshes and then merging triangles into quadrilaterals with some specific rules, which typically produces quad meshes with poor topology. In this paper, we introduce QuadGPT, the first autoregressive framework for generating quadrilateral meshes in an end-to-end manner. QuadGPT formulates this as a sequence prediction paradigm, distinguished by two key innovations: a unified tokenization method to handle mixed topologies of triangles and quadrilaterals, and a specialized Reinforcement Learning fine-tuning method tDPO for better generation quality. Extensive experiments demonstrate that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality. Our work establishes a new benchmark for native quad-mesh generation and showcases the power of combining large-scale autoregressive models with topology-aware RL refinement for creating structured 3D assets.

[22] DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation cs.CV | cs.AI | cs.LGPDF

Jiaqi Liu, Lan Zhang, Xiaoyong Yuan

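TL;DR: DyME is a dynamic multi-concept erasure framework that uses bi-level orthogonal LoRA adapters to overcome the limits of static erasure in multi-concept settings, markedly improving erasure effectiveness and generation quality.

Details

Motivation: Existing concept erasure methods struggle with multiple concepts and cannot adapt to varying erasure requests, which lowers both erasure success and generation quality.

Result: On ErasureBench-H and standard datasets, DyME significantly outperforms existing methods and preserves higher generation quality under multi-concept erasure.

Insight: Combining dynamically composed adapters with orthogonality constraints effectively resolves interference in multi-concept erasure and gives models more flexible erasure capability.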

Abstract: Text-to-image diffusion models (DMs) inadvertently reproduce copyrighted styles and protected visual concepts, raising legal and ethical concerns. Concept erasure has emerged as a safeguard, aiming to selectively suppress such concepts through fine-tuning. However, existing methods do not scale to practical settings where providers must erase multiple and possibly conflicting concepts. The core bottleneck is their reliance on static erasure: a single checkpoint is fine-tuned to remove all target concepts, regardless of the actual erasure needs at inference. This rigid design mismatches real-world usage, where requests vary per generation, leading to degraded erasure success and reduced fidelity for non-target content. We propose DyME, an on-demand erasure framework that trains lightweight, concept-specific LoRA adapters and dynamically composes only those needed at inference. This modular design enables flexible multi-concept erasure, but naive composition causes interference among adapters, especially when many or semantically related concepts are suppressed. To overcome this, we introduce bi-level orthogonality constraints at both the feature and parameter levels, disentangling representation shifts and enforcing orthogonal adapter subspaces. We further develop ErasureBench-H, a new hierarchical benchmark with brand-series-character structure, enabling principled evaluation across semantic granularities and erasure set sizes. Experiments on ErasureBench-H and standard datasets (e.g., CIFAR-100, Imagenette) demonstrate that DyME consistently outperforms state-of-the-art baselines, achieving higher multi-concept erasure fidelity with minimal collateral degradation.
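
The idea of composing only the requested adapters and keeping their subspaces orthogonal can be illustrated with a small sketch. This is not the paper's implementation: the penalty form (sum of squared inner products between basis vectors of different adapters), the `compose` helper, and the toy `bank` are all assumptions made for illustration, and the feature-level constraint mentioned in the abstract is not modeled here.

```python
from itertools import combinations
from typing import Dict, List

Vector = List[float]
Adapter = List[Vector]  # hypothetical: an adapter is a list of basis vectors for its subspace


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthogonality_penalty(adapters: List[Adapter]) -> float:
    """Sum of squared inner products between basis vectors of different adapters.

    The value is zero exactly when every pair of adapter subspaces is orthogonal.
    """
    total = 0.0
    for first, second in combinations(adapters, 2):
        for u in first:
            for v in second:
                total += dot(u, v) ** 2
    return total


def compose(bank: Dict[str, Adapter], requested: List[str]) -> List[Adapter]:
    """Select only the adapters needed for this request (on-demand composition)."""
    return [bank[name] for name in requested if name in bank]


# Toy adapter bank with rank-1 adapters in a 3-dimensional space (illustrative values).
bank = {
    "concept_a": [[1.0, 0.0, 0.0]],
    "concept_b": [[0.0, 1.0, 0.0]],
    "concept_c": [[1.0, 1.0, 0.0]],
}

clean = orthogonality_penalty(compose(bank, ["concept_a", "concept_b"]))
overlapping = orthogonality_penalty(compose(bank, ["concept_a", "concept_c"]))
print(clean, overlapping)
```

In this toy setting, `concept_a` and `concept_b` do not interfere, while `concept_a` and `concept_c` share a direction and incur a positive penalty, which is the kind of overlap a parameter-level constraint would discourage during training.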

[23] VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding cs.CV | cs.CLPDF

Abdul Waheed, Zhen Wu, Dareen Alharthi, Seungone Kim, Bhiksha Raj

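TL;DR: VideoJudge is a small multimodal large language model (MLLM) for evaluating video understanding models. Trained through the interplay between a generator and an evaluator, it outperforms larger MLLM baselines on several benchmarks.

Details

Motivation: Existing metrics for video understanding models (BLEU, ROUGE, etc.) fail to reflect human judgment precisely, and human evaluation is costly, so the paper explores the potential of MLLMs as evaluators.

Result: VideoJudge-7B outperforms larger MLLM baselines (such as Qwen2.5-VL) on 3 of 4 benchmarks, and the experiments confirm the importance of video input for evaluation.

Insight: Providing video input is crucial for evaluating video understanding tasks, whereas long chain-of-thought reasoning does not improve performance; MLLM judges outperform text-only LLM judges on video tasks.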

Abstract: Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the fineness of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a 3B- and 7B-sized MLLM judge specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator’s rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3 models) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluation of video understanding tasks.
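
The generator-evaluator filtering step described in the abstract can be sketched as a simple loop. The function names, the integer rating scale, and the toy generator and evaluator below are hypothetical stand-ins; in the paper both roles are played by models, and the exact prompting and matching criteria are not given here.

```python
from typing import Callable, List, Tuple


def bootstrap_filter(
    prompts: List[str],
    target_ratings: List[int],
    generate: Callable[[str, int], str],
    evaluate: Callable[[str, str], int],
) -> List[Tuple[str, str, int]]:
    """Keep (prompt, response, rating) triples only when the evaluator agrees with the target."""
    kept = []
    for prompt in prompts:
        for target in target_ratings:
            response = generate(prompt, target)          # response conditioned on a target rating
            if evaluate(prompt, response) == target:     # discard responses the evaluator rates differently
                kept.append((prompt, response, target))
    return kept


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str, target: int) -> str:
    return f"{prompt} :: quality={target}"


def toy_evaluate(prompt: str, response: str) -> int:
    rating = int(response.rsplit("=", 1)[1])
    # Pretend the evaluator disagrees with the generator whenever the target was 3.
    return 2 if rating == 3 else rating


pairs = bootstrap_filter(["describe the clip"], [1, 2, 3], toy_generate, toy_evaluate)
print(len(pairs))
```

Only the responses whose evaluator rating matches the requested rating survive, so the surviving triples can serve as rated training examples without human labels.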

[24] Residual Vector Quantization For Communication-Efficient Multi-Agent Perception cs.CV | cs.ROPDF

Dereje Shenkut, B. V. K Vijaya Kumar

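TL;DR: ReVQom is a communication-efficient feature codec for multi-agent collaborative perception. It compresses features with a bottleneck network and residual vector quantization, sharply reducing the communication payload while preserving perception accuracy.

Details

Motivation: Multi-agent collaborative perception (CP) improves scene understanding by sharing information, but communication bandwidth limits its scalability. An efficient compression method is needed to reduce the amount of data transmitted.

Result: On the DAIR-V2X dataset, ReVQom achieves 273x to 1365x compression, matches or outperforms raw-feature CP at 18 bpp, and degrades gracefully at very low bandwidth.

Insight: ReVQom offers a communication-efficient solution for multi-agent collaborative perception and moves vehicle-to-everything (V2X) systems closer to practical deployment.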

Abstract: Multi-agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end-to-end method that compresses feature dimensions via a simple bottleneck network followed by multi-stage residual vector quantization (RVQ). This allows only per-pixel code indices to be transmitted, reducing payloads from 8192 bits per pixel (bpp) of uncompressed 32-bit float features to 6-30 bpp per agent with minimal accuracy loss. On DAIR-V2X real-world CP dataset, ReVQom achieves 273x compression at 30 bpp to 1365x compression at 6 bpp. At 18 bpp (455x), ReVQom matches or outperforms raw-feature CP, and at 6-12 bpp it enables ultra-low-bandwidth operation with graceful degradation. ReVQom allows efficient and accurate multi-agent collaborative perception with a step toward practical V2X deployment.
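
The compression figures in the abstract follow from simple arithmetic, sketched below. The raw payload of 8192 bpp and the 6, 18, and 30 bpp operating points come from the abstract; the codebook size of 64 entries (6 bits per stage) is a hypothetical choice used only to show how multi-stage RVQ index payloads add up.

```python
import math

RAW_BPP = 8192            # from the abstract: uncompressed 32-bit float features per pixel
CHANNELS = RAW_BPP // 32  # implied number of feature channels per pixel


def rvq_bpp(num_stages: int, codebook_size: int) -> float:
    """Bits per pixel when each RVQ stage transmits one codebook index per pixel."""
    return num_stages * math.log2(codebook_size)


def compression_ratio(bpp: float) -> float:
    """Ratio of the raw payload to the compressed payload."""
    return RAW_BPP / bpp


for bpp in (30, 18, 6):
    print(f"{bpp} bpp -> about {int(compression_ratio(bpp))}x")

# Hypothetical configuration: 3 stages with 64-entry codebooks give 18 bpp.
print(rvq_bpp(3, 64))
```

Truncating 8192/30, 8192/18, and 8192/6 reproduces the 273x, 455x, and 1365x figures quoted in the abstract.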

[25] Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Moderation cs.CVPDF

Zixuan Wang, Yu Sun, Hongwei Wang, Baoyu Jing, Xiang Shen

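TL;DR: The paper proposes a reasoning-enhanced pretraining method for multimodal large language models (MLLMs) aimed at short video content moderation. Three pretraining tasks (Caption, VQA, and CoT) address the data distribution gap and the complexity of issue definitions, significantly improving model performance.

Details

Motivation: With the rapid growth of short video platforms, conventional per-issue classification models require large amounts of human-labeled data and lack cross-issue generalization, so a unified moderation approach is needed.

Result: Experiments show significant gains in both zero-shot and supervised fine-tuning settings, along with strong generalization to previously unseen issues.

Insight: Well-designed multimodal pretraining tasks (Caption, VQA, CoT) can address the distribution gap and issue complexity, making the model more flexible to apply.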

Abstract: Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) Caption, to enhance the MLLM’s perception of video details; (2) Visual Question Answering (VQA), to deepen the MLLM’s understanding of issue definitions and annotation guidelines; (3) Chain-of-Thought (CoT), to enhance the MLLM’s reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM’s performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.

[26] Learning GUI Grounding with Spatial Reasoning from Visual Feedback cs.CV | cs.CLPDF

Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang

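TL;DR: The paper reframes GUI grounding as an interactive search task and uses visual feedback with multi-step reinforcement learning to improve prediction accuracy.

Details

Motivation: GUI grounding is traditionally treated as mapping natural language instructions to coordinate predictions, but existing vision-language models perform poorly on high-resolution GUI images with complex layouts.

Result: Accuracy rises to 93.9% on ScreenSpot-v2 and 56.5% on ScreenSpot-Pro, and 95% of instances are solved within two steps.

Insight: Reframing the task as interactive and adding visual feedback significantly improves GUI grounding, and the model adapts its number of steps to handle harder cases.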

Abstract: Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task: given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing high-resolution GUI images with complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves the GUI grounding accuracy and achieves state-of-the-art results on ScreenSpot-v2 (from 88.8% to 93.9%) and ScreenSpot-Pro (from 26.8% to 56.5%). Moreover, we observe that GUI-Cursor learns to solve the problem within two steps for 95% of instances and can adaptively conduct more steps on more difficult examples.
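
The interactive search loop can be sketched as follows. Everything here is a simplified stand-in: the toy policy knows the target centre and closes 90% of the remaining distance per step, whereas in the paper the moves come from a VLM looking at the rendered cursor, and the `inside` check acts as an oracle that the real system does not have.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def inside(point: Point, box: Box) -> bool:
    left, top, right, bottom = box
    return left <= point[0] <= right and top <= point[1] <= bottom


def interactive_search(
    start: Point,
    target: Box,
    policy: Callable[[Point, List[Point]], Point],
    max_steps: int = 5,
) -> Tuple[Point, int]:
    """Move the cursor step by step, conditioning each move on the movement history."""
    cursor = start
    history = [cursor]
    for step in range(1, max_steps + 1):
        cursor = policy(cursor, history)
        history.append(cursor)
        if inside(cursor, target):  # oracle stop condition, used only in this sketch
            return cursor, step
    return cursor, max_steps


def make_toy_policy(target: Box, gain: float = 0.9) -> Callable[[Point, List[Point]], Point]:
    """Hypothetical policy: close a fixed fraction of the gap to the target centre."""
    cx = (target[0] + target[2]) / 2
    cy = (target[1] + target[3]) / 2

    def policy(cursor: Point, history: List[Point]) -> Point:
        return (cursor[0] + gain * (cx - cursor[0]), cursor[1] + gain * (cy - cursor[1]))

    return policy


target = (900.0, 500.0, 960.0, 540.0)
final, steps = interactive_search((0.0, 0.0), target, make_toy_policy(target))
print(final, steps)
```

The point of the sketch is that an imprecise first move can be corrected by a second, shorter move once the cursor position is fed back, rather than requiring a single exact coordinate prediction.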

[27] X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning cs.CVPDF

Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao

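TL;DR: X-CoT is an explainable text-to-video retrieval framework based on LLM chain-of-thought (CoT) reasoning. By refining semantic understanding and generating detailed reasoning steps, it addresses the data quality and interpretability problems of conventional embedding models.

Details

Motivation: Conventional text-to-video retrieval systems rely on embedding models and cosine similarity, so low-quality data is hard to identify and the rankings come without explanation. The paper aims to improve retrieval performance and interpretability through LLM CoT reasoning.

Result: X-CoT achieves strong retrieval performance and produces detailed rationales. It also helps analyze model behavior and assess data quality.

Insight: LLM reasoning can address the interpretability problem of conventional retrieval systems, and enriched data annotation is key to improving model performance.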

Abstract: Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning in place of the embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates analysis of model behavior and data quality. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
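
A ranking built from pairwise comparison steps, each returning a rationale, can be sketched as below. The `prefer` judge is a hypothetical word-overlap heuristic standing in for an LLM call, and the candidate captions are invented for illustration; the repository linked above should be consulted for the actual method.

```python
from functools import cmp_to_key
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (video id, text annotation)


def rank_by_pairwise(
    query: str,
    candidates: List[Candidate],
    prefer: Callable[[str, Candidate, Candidate], Tuple[Candidate, str]],
) -> Tuple[List[Candidate], List[str]]:
    """Sort candidates with a pairwise judge and collect one rationale per comparison."""
    rationales: List[str] = []

    def compare(a: Candidate, b: Candidate) -> int:
        winner, why = prefer(query, a, b)
        rationales.append(why)
        return -1 if winner == a else 1

    return sorted(candidates, key=cmp_to_key(compare)), rationales


def toy_prefer(query: str, a: Candidate, b: Candidate) -> Tuple[Candidate, str]:
    """Stand-in judge: prefer the annotation sharing more words with the query."""
    words = set(query.split())
    score_a = len(words & set(a[1].split()))
    score_b = len(words & set(b[1].split()))
    winner = a if score_a >= score_b else b
    return winner, f"{winner[0]} preferred ({score_a} vs {score_b} shared words)"


query = "a dog catches a frisbee in a park"
candidates = [
    ("v1", "a cat sleeps on a sofa"),
    ("v2", "a dog runs on a beach"),
    ("v3", "a dog catches a frisbee"),
]
ranked, rationales = rank_by_pairwise(query, candidates, toy_prefer)
print([video_id for video_id, _ in ranked])
```

Because every comparison leaves a rationale behind, the final ordering can be traced back to individual decisions, which is what a cosine-similarity score alone does not offer.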

[28] Unsupervised Defect Detection for Surgical Instruments cs.CVPDF

Joseph Huang, Yichi Zhang, Jingxi Yu, Wei Chen, Seunghyun Hwang

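TL;DR: The paper proposes an unsupervised defect detection method for surgical instruments that addresses the poor transfer of conventional methods to the surgical domain.

Details

Motivation: Surgical instrument safety depends on reliable visual defect detection, but manual inspection is error-prone and existing automated methods work poorly in the surgical domain.

Result: The method reliably detects fine-grained defects in surgical instrument images.

Insight: Unsupervised methods need targeted adaptation for specific domains such as surgical instruments; background interference and domain shift are the main challenges.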

Abstract: Ensuring the safety of surgical instruments requires reliable detection of visual defects. However, manual inspection is prone to error, and existing automated defect detection methods, typically trained on natural/industrial images, fail to transfer effectively to the surgical domain. We demonstrate that simply applying or fine-tuning these approaches leads to issues: false positive detections arising from textured backgrounds, poor sensitivity to small, subtle defects, and inadequate capture of instrument-specific features due to domain shift. To address these challenges, we propose a versatile method that adapts unsupervised defect detection methods specifically for surgical instruments. By integrating background masking, a patch-based analysis strategy, and efficient domain adaptation, our method overcomes these limitations, enabling the reliable detection of fine-grained defects in surgical instrument imagery.

[29] No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models cs.CV | cs.AI | cs.LGPDF

Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya

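TL;DR: The paper proposes LSEP (Linear SEParability), a new regularization for training diffusion models. Without external encoders or representation alignment, it promotes linear separability directly in the network's learning dynamics, improving training efficiency and generation quality.

Details

Motivation: Alignment-based methods depend on large pretrained encoders that are computationally expensive, which limits the training efficiency of diffusion models. The paper aims to remove this dependence by improving the linear separability of intermediate-layer representations.

Result: The method yields substantial gains on flow-based transformer architectures such as SiTs, reaching an FID of 1.46 on 256×256 ImageNet.

Insight: Linear separability can serve as a standalone optimization objective in place of representation alignment, simplifying training and improving generation quality.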

Abstract: Efficient training strategies for large-scale diffusion models have recently emphasized the importance of improving discriminative feature representations in these models. A central line of work in this direction is representation alignment with features obtained from powerful external encoders, which improves the representation quality as assessed through linear probing. Alignment-based approaches show promise but depend on large pretrained encoders, which are computationally expensive to obtain. In this work, we propose an alternative regularization for training, based on promoting the Linear SEParability (LSEP) of intermediate layer representations. LSEP eliminates the need for an auxiliary encoder and representation alignment, while incorporating linear probing directly into the network’s learning dynamics rather than treating it as a simple post-hoc evaluation tool. Our results demonstrate substantial improvements in both training efficiency and generation quality on flow-based transformer architectures such as SiTs, achieving an FID of 1.46 on $256 \times 256$ ImageNet dataset.

[30] X-Streamer: Unified Human World Modeling with Audiovisual Interaction cs.CVPDF

You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song

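TL;DR: X-Streamer is an end-to-end multimodal human world modeling framework whose unified architecture lets digital human agents carry on unbounded interactions across text, speech, and video.

Details

Motivation: Current digital human agents separate understanding from generation in multimodal interaction; X-Streamer aims to close this gap with a unified architecture.

Result: X-Streamer runs in real time on two A100 GPUs and sustains long, consistent multimodal video chat.

Insight: By unifying the understanding and generation modules, X-Streamer shows the importance of cross-modal alignment and context retention in multimodal interaction and suggests a direction for future digital human modeling.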

Abstract: We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker’s hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.

[31] What Happens Next? Anticipating Future Motion by Generating Point Trajectories cs.CV | cs.AI | cs.LGPDF

Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

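TL;DR: The paper proposes predicting future object motion from a single image by generating dense point trajectories rather than pixels, outperforming prior regressors and generators.

Details

Motivation: Video generators perform poorly at predicting motion from a single image, mainly because they must generate pixels rather than model motion directly. The paper targets this problem.

Result: Experiments show that prediction accuracy and diversity on both simulated and real-world data are significantly better than those of existing methods, and that the method is effective in downstream tasks such as robotics.

Insight: Generating motion trajectories (rather than pixels) directly from an image models scene dynamics more efficiently and suggests a new approach to forecasting future motion and physical interaction.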

Abstract: We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

[32] Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis cs.CV | cs.AIPDF

Sai Varun Kodathala, Rakesh Vunnam

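TL;DR: The study compares two self-supervised learning architectures, DINOv3 and V-JEPA2, on video action analysis, revealing their strengths and weaknesses in spatiotemporal feature extraction and informing the choice of model for a given task.

Details

Motivation: To understand how different self-supervised architectures (independent spatial modeling versus joint temporal modeling) differ in video action recognition performance, and to guide practical use.

Result: DINOv3 performs better on clustering and static action recognition (Silhouette score 0.31), while V-JEPA2 is more stable across all action types (performance variance 0.094).

Insight: Independent spatial modeling suits static actions such as pose recognition, while temporal modeling gives balanced performance on dynamic actions; the architecture should be chosen according to the task.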

Abstract: This study presents a comprehensive comparative analysis of two prominent self-supervised learning architectures for video action recognition: DINOv3, which processes frames independently through spatial feature extraction, and V-JEPA2, which employs joint temporal modeling across video sequences. We evaluate both approaches on the UCF Sports dataset, examining feature quality through multiple dimensions including classification accuracy, clustering performance, intra-class consistency, and inter-class discrimination. Our analysis reveals fundamental architectural trade-offs: DINOv3 achieves superior clustering performance (Silhouette score: 0.31 vs 0.21) and demonstrates exceptional discrimination capability (6.16x separation ratio) particularly for pose-identifiable actions, while V-JEPA2 exhibits consistent reliability across all action types with significantly lower performance variance (0.094 vs 0.288). Through action-specific evaluation, we identify that DINOv3’s spatial processing architecture excels at static pose recognition but shows degraded performance on motion-dependent actions, whereas V-JEPA2’s temporal modeling provides balanced representation quality across diverse action categories. These findings contribute to the understanding of architectural design choices in video analysis systems and provide empirical guidance for selecting appropriate feature extraction methods based on task requirements and reliability constraints.

[33] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment cs.CV | cs.LGPDF

Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique

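TL;DR: The paper proposes VLCE, a multimodal framework for generating detailed descriptions of disaster imagery. It combines CNN-LSTM and Vision Transformer (ViT) architectures and uses external semantic knowledge bases (ConceptNet and WordNet) to improve description accuracy. Experiments show that VLCE markedly outperforms baseline models on informativeness while keeping competitive semantic alignment.

Details

Motivation: Rapid damage assessment after a disaster is essential, but conventional methods are slow and dangerous. Although satellite and UAV images offer broad views, existing computer vision methods usually produce only classification labels or segmentation masks and cannot describe a disaster scene fully. A system is therefore needed that generates context-rich, information-dense descriptions of disaster imagery.

Result: VLCE reaches up to 95.33% on InfoMetIC, markedly outperforming baseline models while keeping competitive semantic alignment.

Insight: 1. Combining images with external knowledge markedly improves the informativeness and accuracy of disaster image descriptions. 2. Dedicated architectures for each data source (satellite and UAV) can optimize performance. 3. VLCE offers a feasible route to automating disaster assessment.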

Abstract: Immediate damage assessment is essential after natural catastrophes; yet, conventional hand evaluation techniques are sluggish and perilous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, so constraining their capacity to deliver a thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.

[34] A Data-driven Typology of Vision Models from Integrated Representational Metrics cs.CV | cs.AIPDF

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

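TL;DR: Using multiple representational similarity metrics, the paper reveals shared and distinctive features across vision model families and proposes a data-driven typology of vision models, showing that architecture and training objective jointly shape representational structure.

Details

Motivation: Large vision models differ widely in architecture and training paradigm, yet there is no systematic way to separate what their representations share from what is distinctive. The paper aims to fill this gap and give vision models a principled classification.

Result: 1. Geometry and tuning features strongly discriminate model families, while linearly decodable information is more widely shared; 2. SNF fusion markedly sharpens family separation; 3. Clustering shows self-supervised models grouping together across architectures, while supervised models cluster by architecture.

Insight: Architectural modernization and reconstruction-based training may lead to convergent representational structure, indicating that a model's computational strategy is shaped jointly by its architecture and training objective.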

Abstract: Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.

[35] FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction cs.CVPDF

Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, Yonggang Qi

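TL;DR: FantasyWorld jointly models video latents and an implicit 3D field, combining geometric and video priors to achieve high-quality 3D-aware video generation.

Details

Motivation: Current video foundation models lack explicit 3D grounding, which limits their spatial consistency and their usefulness for downstream 3D reasoning tasks.

Result: Experiments show that FantasyWorld outperforms baselines in multi-view coherence and style consistency, and ablations confirm the contribution of the unified backbone and cross-branch information exchange.

Insight: Jointly modeling video and 3D information yields more consistent 3D-aware video representations that support downstream tasks without per-scene optimization.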

Abstract: High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.

[36] MORPH: Shape-agnostic PDE Foundation Models cs.CV | cs.AI | cs.LG | physics.comp-phPDF

Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Ayan Biswas

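TL;DR: MORPH is a shape-agnostic, autoregressive PDE foundation model. It uses a convolutional vision transformer architecture to handle multi-dimensional, multi-resolution data, and relies on component-wise convolution, inter-field cross-attention, and axial attention for efficient and expressive computation.

Details

Motivation: Scientific observation data is often heterogeneous and multimodal, and conventional methods struggle to handle it efficiently. MORPH aims to provide a flexible and powerful backbone model for such data.

Result: MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization, and matches or surpasses the state of the art on many tasks.

Insight: MORPH demonstrates an efficient way to handle heterogeneous scientific data and offers a scalable, data-efficient approach to scientific machine learning.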

Abstract: We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D–3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attention, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning.
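
The saving from factorizing attention along axes can be made concrete by counting query-key score pairs. The grid size below (8 time steps on a 32 by 32 spatial grid) is an arbitrary example, not a configuration from the paper, and the count ignores heads, channels, and constant factors.

```python
def full_attention_pairs(t: int, h: int, w: int) -> int:
    """Every token attends to every token in the T x H x W grid."""
    n = t * h * w
    return n * n


def axial_attention_pairs(t: int, h: int, w: int) -> int:
    """Every token attends separately along its time, height, and width axes."""
    n = t * h * w
    return n * (t + h + w)


t, h, w = 8, 32, 32  # hypothetical grid for illustration
full = full_attention_pairs(t, h, w)
axial = axial_attention_pairs(t, h, w)
print(full, axial, round(full / axial, 1))
```

For this grid the factorized form computes roughly two orders of magnitude fewer attention scores, which is the computational motivation the abstract gives for axial attention.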

[37] MS-YOLO: Infrared Object Detection for Edge Deployment via MobileNetV4 and SlideLoss cs.CVPDF

Jiali Zhang, Thomas S. White, Haoliang Zhang, Wenqing Hu, Donald C. Wunsch II

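TL;DR: MS-YOLO is an infrared object detection method based on MobileNetV4 and SlideLoss that improves computational efficiency and handles class imbalance, making it suitable for edge deployment.

Details

Motivation: Infrared object detection has advantages in low light and adverse weather, but it faces class imbalance, thermal noise, and computational constraints.

Result: On the FLIR ADAS V2 dataset, MS-YOLO attains competitive mAP and superior precision at a cost of 6.7 GFLOPs.

Insight: Efficient backbones and dynamic loss functions can markedly improve the practicality of infrared object detection and its suitability for edge deployment.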

Abstract: Infrared imaging has emerged as a robust solution for urban object detection under low-light and adverse weather conditions, offering significant advantages over traditional visible-light cameras. However, challenges such as class imbalance, thermal noise, and computational constraints can significantly hinder model performance in practical settings. To address these issues, we evaluate multiple YOLO variants on the FLIR ADAS V2 dataset, ultimately selecting YOLOv8 as our baseline due to its balanced accuracy and efficiency. Building on this foundation, we present MS-YOLO (MobileNetV4 and SlideLoss based on YOLO), which replaces YOLOv8’s CSPDarknet backbone with the more efficient MobileNetV4, reducing computational overhead by 1.5% while sustaining high accuracy. In addition, we introduce SlideLoss, a novel loss function that dynamically emphasizes under-represented and occluded samples, boosting precision without sacrificing recall. Experiments on the FLIR ADAS V2 benchmark show that MS-YOLO attains competitive mAP and superior precision while operating at only 6.7 GFLOPs. These results demonstrate that MS-YOLO effectively addresses the dual challenge of maintaining high detection quality while minimizing computational costs, making it well-suited for real-time edge deployment in urban environments.

[38] Motion-Aware Transformer for Multi-Object Tracking cs.CV

Xu Yang, Gady Agam

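TL;DR: MATR is a motion-aware Transformer for multi-object tracking that explicitly predicts inter-frame object motion to reduce query collisions, significantly improving tracking performance.

Details

Motivation: Existing DETR-based frameworks process detection and tracking queries jointly within the same decoder layer, which causes conflicts and degrades association accuracy.

Result: MATR achieves state-of-the-art results on DanceTrack, SportsMOT, and BDD100k: HOTA improves by more than 9 points on DanceTrack, SportsMOT reaches 72.2 HOTA, and BDD100k reaches 54.7 mTETA and 41.6 mHOTA.

Insight: Explicitly modeling motion within an end-to-end Transformer is an effective way to improve multi-object tracking, because it avoids query conflicts.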

Abstract: Multi-object tracking (MOT) in videos remains challenging due to complex object motions and crowded scenes. Recent DETR-based frameworks offer end-to-end solutions but typically process detection and tracking queries jointly within a single Transformer Decoder layer, leading to conflicts and degraded association accuracy. We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object movements across frames to update track queries in advance. By reducing query collisions, MATR enables more consistent training and improves both detection and association. Extensive experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics. On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional data and reaches a new state-of-the-art score of 71.3 with supplementary data. MATR also achieves state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without relying on external datasets. These results demonstrate that explicitly modeling motion within end-to-end Transformers offers a simple yet highly effective approach to advancing multi-object tracking.

[39] DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining cs.CV

Shuning Sun, Jialang Lu, Xiang Chen, Jichao Wang, Dianjie Lu

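TL;DR: DeLiVR is an efficient video deraining method that injects spatiotemporal Lie-group differential biases directly into the network's attention scores, addressing the shortcomings of traditional methods that rely on optical flow or heuristic alignment.

Details

Motivation: Existing video deraining methods rely on optical flow or heuristic alignment, which are computationally expensive and not robust enough, whereas Lie groups provide a continuous representation of geometric transformations that is well suited to enforcing spatiotemporal consistency in video modeling.

Result: The method's effectiveness is validated on publicly available benchmarks.

Insight: Lie groups provide a principled tool for geometric consistency in video modeling, significantly improving the efficiency and robustness of deraining.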

Abstract: Videos captured in the wild often suffer from rain streaks, blur, and noise. In addition, even slight changes in camera pose can amplify cross-frame mismatches and temporal artifacts. Existing methods rely on optical flow or heuristic alignment, which are computationally expensive and less robust. To address these challenges, Lie groups provide a principled way to represent continuous geometric transformations, making them well-suited for enforcing spatial and temporal consistency in video modeling. Building on this insight, we propose DeLiVR, an efficient video deraining method that injects spatiotemporal Lie-group differential biases directly into attention scores of the network. Specifically, the method introduces two complementary components. First, a rotation-bounded Lie relative bias predicts the in-plane angle of each frame using a compact prediction module, where normalized coordinates are rotated and compared with base coordinates to achieve geometry-consistent alignment before feature aggregation. Second, a differential group displacement computes angular differences between adjacent frames to estimate a velocity. This bias computation combines temporal decay and attention masks to focus on inter-frame relationships while precisely matching the direction of rain streaks. Extensive experimental results demonstrate the effectiveness of our method on publicly available benchmarks.
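The rotation-bounded relative bias described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the angle is bounded with a `tanh`, the "comparison" between rotated and base coordinates is a negative squared distance, and all names (`bounded_angle`, `rotation_bias`, `max_angle`, `scale`) are hypothetical. The paper's actual prediction module and bias form may differ.

```python
import math


def bounded_angle(raw, max_angle=math.pi / 6):
    """Squash an unbounded prediction into [-max_angle, max_angle]."""
    return max_angle * math.tanh(raw)


def grid_coords(height, width):
    """Normalized (x, y) coordinates in [-1, 1], row-major order."""
    def axis(n):
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(x, y) for y in axis(height) for x in axis(width)]


def rotation_bias(height, width, theta, scale=1.0):
    """Additive attention bias of shape (H*W, H*W).

    Query positions are rotated by the in-plane angle theta (an SO(2)
    element) and compared with the unrotated base positions of the keys.
    Pairs that coincide after rotation get bias 0; others are penalized.
    """
    base = grid_coords(height, width)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(c * x - s * y, s * x + c * y) for x, y in base]
    return [
        [-scale * ((rx - bx) ** 2 + (ry - by) ** 2) for bx, by in base]
        for rx, ry in rotated
    ]
```

Adding such a matrix to the attention logits before the softmax steers each query toward keys that are geometrically aligned after compensating for the estimated rotation, without computing optical flow.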

[40] On the Status of Foundation Models for SAR Imagery cs.CV | eess.IV

Nathan Inkawhich

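TL;DR: This paper examines the viability of foundation AI/ML models for Synthetic Aperture Radar (SAR) imagery tasks, tests the limitations of existing visual foundation models, and proposes improving SAR target feature extraction through self-supervised fine-tuning.

Details

Motivation: Inspired by large-scale self-supervised learning (SSL) models in the natural image domain, the author aims to apply this technology to SAR imagery tasks to improve the robustness and transferability of target recognition.

Result: AFRL-DINOv2 performs strongly on SAR target recognition, significantly outperforming SARATR-X, and adapts well under limited labeled data and extended operating conditions.

Insight: Although existing visual foundation models perform poorly on SAR tasks off the shelf, self-supervised fine-tuning significantly improves SAR target feature extraction and recognition; further research is still needed to close the gap with the natural image domain.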

Abstract: In this work we investigate the viability of foundational AI/ML models for Synthetic Aperture Radar (SAR) object recognition tasks. We are inspired by the tremendous progress being made in the wider community, particularly in the natural image domain where frontier labs are training huge models on web-scale datasets with unprecedented computing budgets. It has become clear that these models, often trained with Self-Supervised Learning (SSL), will transform how we develop AI/ML solutions for object recognition tasks - they can be adapted downstream with very limited labeled data, they are more robust to many forms of distribution shift, and their features are highly transferable out-of-the-box. For these reasons and more, we are motivated to apply this technology to the SAR domain. In our experiments we first run tests with today’s most powerful visual foundational models, including DINOv2, DINOv3 and PE-Core and observe their shortcomings at extracting semantically-interesting discriminative SAR target features when used off-the-shelf. We then show that Self-Supervised finetuning of publicly available SSL models with SAR data is a viable path forward by training several AFRL-DINOv2s and setting a new state-of-the-art for SAR foundation models, significantly outperforming today’s best SAR-domain model SARATR-X. Our experiments further analyze the performance trade-off of using different backbones with different downstream task-adaptation recipes, and we monitor each model’s ability to overcome challenges within the downstream environments (e.g., extended operating conditions and low amounts of labeled data). We hope this work will inform and inspire future SAR foundation model builders, because despite our positive results, we still have a long way to go.

[41] UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments cs.CV | cs.AI | cs.CL | cs.HC | cs.LG

Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu

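TL;DR: UISim is an image-based interactive UI simulator aimed at the challenges of UI development and testing in dynamic mobile environments; it uses a two-stage method to predict and generate new UI states, significantly improving the realism and coherence of UI simulation.

Details

Motivation: The dynamic and diverse nature of real-world mobile environments complicates UI development and the training of AI agents to interact with UIs; existing methods rely on physical devices or static screenshot analysis and do not scale.

Result: Experiments show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states.

Insight: UISim offers an efficient tool for UI development and AI agent training; its image synthesis and interactive capabilities could enable advanced applications such as UI navigation and task planning.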

Abstract: Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.

[42] LFA-Net: A Lightweight Network with LiteFusion Attention for Retinal Vessel Segmentation cs.CV | cs.AI

Mehwish Mehmood, Ivor Spence, Muhammad Fahim

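TL;DR: The paper proposes LFA-Net, a lightweight retinal vessel segmentation network with a new LiteFusion-Attention module that captures local and global context efficiently at low computational cost, suited to resource-constrained clinical environments.

Details

Motivation: Computational resources are limited in clinical settings, yet lightweight retinal vessel segmentation is crucial for early diagnosis. Existing deep learning models segment small vessels poorly and are computationally expensive.

Result: On the DRIVE, STARE, and CHASE_DB datasets, LFA-Net achieves Dice scores of 83.28%, 87.44%, and 84.50% and Jaccard indices of 72.85%, 79.31%, and 74.70%, respectively.

Insight: Through a lightweight design and an efficient attention mechanism, LFA-Net maintains high performance while substantially reducing computational requirements, making it suitable for practical clinical use.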

Abstract: Lightweight retinal vessel segmentation is important for the early diagnosis of vision-threatening and systemic diseases, especially in a real-world clinical environment with limited computational resources. Although segmentation methods based on deep learning are improving, existing models still struggle with small vessel segmentation and high computational costs. To address these challenges, we propose a new vascular segmentation network, LFA-Net, which incorporates a newly designed attention module, LiteFusion-Attention. This attention module incorporates residual learning connections, Vision Mamba-inspired dynamics, and modulation-based attention, enabling the model to capture local and global context efficiently and in a lightweight manner. LFA-Net offers high performance with 0.11 million parameters, 0.42 MB memory size, and 4.46 GFLOPs, which makes it ideal for resource-constrained environments. We validated our proposed model on DRIVE, STARE, and CHASE_DB with outstanding performance in terms of Dice scores of 83.28%, 87.44%, and 84.50% and Jaccard indices of 72.85%, 79.31%, and 74.70%, respectively. The code of LFA-Net is available online at https://github.com/Mehwish4593/LFA-Net.
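For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how the Dice score and Jaccard index are computed for one pair of binary masks. The function name and the flattened-list mask representation are assumptions made for illustration; the paper's evaluation code may differ (for example in how scores are averaged over images).

```python
def dice_and_jaccard(pred, target):
    """Dice score and Jaccard index for two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)  # overlap
    pred_pos = sum(1 for p in pred if p)
    true_pos = sum(1 for t in target if t)
    union = pred_pos + true_pos - tp
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2 * tp / (pred_pos + true_pos) if (pred_pos + true_pos) else 1.0
    jaccard = tp / union if union else 1.0
    return dice, jaccard


dice, jaccard = dice_and_jaccard([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(round(dice, 4), jaccard)
```

For a single mask pair the two are related by J = D / (2 - D); averages taken over many images need not satisfy this identity exactly.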

[43] Incorporating Scene Context and Semantic Labels for Enhanced Group-level Emotion Recognition cs.CV

Qing Zhu, Wangdong Guo, Qirong Mao, Xiaohua Huang, Xiuyan Shao

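TL;DR: This paper proposes a new framework that combines visual scene context with semantic labels, using multi-scale scene encoding and an emotion tree structure to improve group-level emotion recognition (GER).

Details

Motivation: Existing methods underestimate the importance of visual scene context in modeling relationships between individuals and overlook the key role that the semantic information in emotion labels plays in fully understanding emotions.

Result: The method is validated on three widely used GER datasets, showing performance competitive with state-of-the-art methods.

Insight: Visual scene context and emotion-label semantics are both crucial for group-level emotion recognition; combining them effectively can significantly improve performance.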

Abstract: Group-level emotion recognition (GER) aims to identify holistic emotions within a scene involving multiple individuals. Existing methods underestimate the importance of visual scene contextual information in modeling individual relationships. Furthermore, they overlook the crucial role of semantic information from emotional labels for a complete understanding of emotions. To address these limitations, we propose a novel framework that incorporates visual scene context and label-guided semantic information to improve GER performance. It involves a visual context encoding module that leverages multi-scale scene information to diversely encode individual relationships. Complementarily, the emotion semantic encoding module utilizes group-level emotion labels to prompt a large language model to generate nuanced emotion lexicons. These lexicons, in conjunction with the emotion labels, are then refined into comprehensive semantic representations using a structured emotion tree. Finally, similarity-aware interaction is proposed to align and integrate visual and semantic information, thereby generating enhanced group-level emotion representations and improving the performance of GER. Experiments on three widely adopted GER datasets demonstrate that our proposed method achieves competitive performance compared to state-of-the-art methods.

[44] KG-SAM: Injecting Anatomical Knowledge into Segment Anything Models via Conditional Random Fields cs.CV

Yu Li, Da Chang, Xi Xiao

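TL;DR: KG-SAM is a framework that combines a medical knowledge graph with a Conditional Random Field (CRF) to improve the Segment Anything Model (SAM) for medical image segmentation, increasing segmentation accuracy and reliability.

Details

Motivation: Medical image segmentation suffers from ambiguous boundaries, insufficient modeling of anatomical relationships, and a lack of uncertainty quantification, so applying SAM directly has limited effect. The authors propose KG-SAM to address these challenges.

Result: KG-SAM achieves an average Dice score of 82.69% on prostate segmentation, and 78.05% and 79.68% on abdominal MRI and CT segmentation, respectively.

Insight: By combining domain knowledge with uncertainty quantification, KG-SAM significantly improves the performance and reliability of medical image segmentation, making it suited to high-stakes clinical applications.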

Abstract: While the Segment Anything Model (SAM) has achieved remarkable success in image segmentation, its direct application to medical imaging remains hindered by fundamental challenges, including ambiguous boundaries, insufficient modeling of anatomical relationships, and the absence of uncertainty quantification. To address these limitations, we introduce KG-SAM, a knowledge-guided framework that synergistically integrates anatomical priors with boundary refinement and uncertainty estimation. Specifically, KG-SAM incorporates (i) a medical knowledge graph to encode fine-grained anatomical relationships, (ii) an energy-based Conditional Random Field (CRF) to enforce anatomically consistent predictions, and (iii) an uncertainty-aware fusion module to enhance reliability in high-stakes clinical scenarios. Extensive experiments across multi-center medical datasets demonstrate the effectiveness of our approach: KG-SAM achieves an average Dice score of 82.69% on prostate segmentation and delivers substantial gains in abdominal segmentation, reaching 78.05% on MRI and 79.68% on CT. These results establish KG-SAM as a robust and generalizable framework for advancing medical image segmentation.

[45] UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models cs.CV

Lan Chen, Yuchao Gu, Qi Mao

TL;DR: UniVid proposes a unified vision task framework by fine-tuning a pre-trained video diffusion transformer, leveraging visual sentences to represent tasks and generalize across modalities and sources without task-specific modifications.

Details

Motivation: Inspired by large language models unifying linguistic tasks, UniVid aims to extend this unification to vision tasks, avoiding costly task-specific pre-training by utilizing pre-trained video generation models.

Result: UniVid generalizes well across modalities (images/videos) and sources (natural/annotated data), despite being trained only on natural videos, showcasing its potential as a unified vision model.

Insight: Pre-trained video generation models can serve as a scalable foundation for vision tasks by leveraging temporal sequence dependencies, with task switching achieved through simple sentence order reversal.

Abstract: Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM’s uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.

[46] CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones cs.CV | cs.LG

Wenyi Gong, Mieszko Lis

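TL;DR: The paper proposes CubistMerge, a token merging method that efficiently reduces the number of tokens in ViTs while preserving spatial structure, and that applies to diverse ViT architectures.

Details

Motivation: Modern ViT architectures (such as SAM and DINOv3) rely on spatial designs (such as window attention and relative positional embeddings), and existing token merging methods break this spatial structure, so a method is needed that reduces the token count while preserving spatial integrity.

Result: The method achieves speedups of 1.25x on SAM-H and 1.15x on DeiT-B with minimal performance loss (a 0.7% mIOU drop and no top-1 accuracy drop, respectively).

Insight: Preserving spatial structure is crucial to ViT performance; CubistMerge offers an efficient token merging scheme for spatially sensitive ViT architectures.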

Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatial-aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
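One plausible reading of the max-magnitude-per-dimension representation is sketched below: when several tokens are merged, each feature dimension keeps the value with the largest absolute magnitude instead of the mean. The function name and the plain-list token format are assumptions for illustration; the paper's implementation details (tie-breaking, which tokens are grouped) are not given in the abstract.

```python
def max_magnitude_merge(tokens):
    """Merge tokens by keeping, per dimension, the entry of largest |value|.

    tokens: non-empty list of equal-length lists of floats.
    Ties are resolved in favor of the earlier token (an arbitrary choice).
    """
    if not tokens:
        raise ValueError("need at least one token to merge")
    dim = len(tokens[0])
    return [max((tok[d] for tok in tokens), key=abs) for d in range(dim)]


merged = max_magnitude_merge([[0.2, -3.0, 1.0], [-0.5, 2.0, 1.5]])
print(merged)  # [-0.5, -3.0, 1.5]
```

Compared with averaging, which would give [-0.15, -0.5, 1.25] for the same pair, this keeps strong activations (including their sign) from being diluted by weaker neighbors.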

[47] Training-Free Multimodal Deepfake Detection via Graph Reasoning cs.CV | cs.CY

Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen

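TL;DR: The paper proposes GASP-ICL, a training-free multimodal deepfake detection framework that uses graph reasoning to address cross-modal inconsistency and task-aligned retrieval, significantly improving detection performance.

Details

Motivation: Multimodal deepfake detection (MDD) requires capturing subtle forgery cues and resolving cross-modal inconsistencies; although large vision-language models (LVLMs) are strong at multimodal reasoning, they remain limited on MDD.

Result: In experiments on four forgery types, GASP-ICL surpasses strong baselines without fine-tuning the LVLM.

Insight: Injecting task-aware knowledge and applying graph reasoning over cross-sample relations effectively improves multimodal deepfake detection while avoiding additional training overhead.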

Abstract: Multimodal deepfake detection (MDD) aims to uncover manipulations across visual, textual, and auditory modalities, thereby reinforcing the reliability of modern information systems. Although large vision-language models (LVLMs) exhibit strong multimodal reasoning, their effectiveness in MDD is limited by challenges in capturing subtle forgery cues, resolving cross-modal inconsistencies, and performing task-aligned retrieval. To this end, we propose Guided Adaptive Scorer and Propagation In-Context Learning (GASP-ICL), a training-free framework for MDD. GASP-ICL employs a pipeline to preserve semantic relevance while injecting task-aware knowledge into LVLMs. We leverage an MDD-adapted feature extractor to retrieve aligned image-text pairs and build a candidate set. We further design the Graph-Structured Taylor Adaptive Scorer (GSTAS) to capture cross-sample relations and propagate query-aligned signals, producing discriminative exemplars. This enables precise selection of semantically aligned, task-relevant demonstrations, enhancing LVLMs for robust MDD. Experiments on four forgery types show that GASP-ICL surpasses strong baselines, delivering gains without LVLM fine-tuning.

[48] Prompt-guided Representation Disentanglement for Action Recognition cs.CV

Tianci Wu, Guangming Zhu, Jiang Lu, Siyuan Wang, Ning Wang

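TL;DR: The paper proposes the ProDA framework, which uses prompt-guided disentangled representations to separate specified actions from multi-action scenes, improving action recognition with spatio-temporal scene graphs and a dynamic prompt module.

Details

Motivation: Existing methods struggle to model interactions between different objects in multi-action scenes, so a method is needed that can effectively disentangle specified actions from complex scenes.

Result: Experiments show that ProDA outperforms existing methods on video action recognition.

Insight: Disentangling the representation of a specified action helps model action features more precisely in multi-action scenes, and the dynamic prompt module makes the model more flexible.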

Abstract: Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces a Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with state-of-the-art methods. Our code can be found at https://github.com/iamsnaping/ProDA.git.

[49] DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images cs.CV | cs.CL

Dwip Dalal, Gautam Vashishtha, Anku Ranui, Aishwarya Reganti, Parth Patwa

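TL;DR: The paper proposes DeHate, a Stable Diffusion-based multimodal approach for detecting and removing hateful content in images, using the DAAM module to generate hate attention maps and blur the hateful regions. It also releases a dataset and describes a shared task.

Details

Motivation: Harmful online content threatens public discourse and a healthy digital environment, so an efficient multimodal method is needed to detect and remove hateful content in images.

Result: DeHate performs well on hate content detection and removal, setting a new standard for ethical AI applications in social media.

Insight: Combining a multimodal approach with textual prompts allows hateful content in images to be identified and removed more precisely, advancing the use of AI for ethical and socially responsible purposes.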

Abstract: The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response to this, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced, stable diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images, thereby generating detailed hate attention maps, which are used to blur these regions from the image, thereby removing the hateful sections of the image. We release this data set as a part of the dehate shared task. This paper also describes the details of the shared task. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.

[50] MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning cs.CV

Lihao Zheng, Jiawei Chen, Xintian Shen, Hao Ma, Tao Wei

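TL;DR: MIRG-RL is a framework that addresses multi-image understanding and grounding with reinforcement learning, combining supervised fine-tuning with image-aware reinforcement learning, and reaches SOTA performance on multi-image benchmarks.

Details

Motivation: Current large vision-language models face two main challenges in multi-image reasoning and grounding: a lack of cross-image reasoning capability and insufficient cross-image reference reward modeling.

Result: MIRG-RL reaches 64.82% on multi-image grounding benchmarks, exceeding the previous best method by 1%.

Insight: Optimizing multi-image reasoning and grounding with reinforcement learning demonstrates the benefit of combining supervised learning with RL; the lightweight dataset construction and the dual reward design offer a new approach to resolving cross-image ambiguity.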

Abstract: Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these issues, we propose a unified framework - Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization, progressively developing multi-image reasoning capabilities. Furthermore, we innovatively propose a method for constructing the trajectory data, which integrates object-level and image-level annotation information, and use this method to generate a lightweight reasoning-enhanced dataset. To effectively resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments demonstrate that MIRG-RL achieves state-of-the-art (SOTA) performance in multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks - exceeding the previous best method by 1%. The code and dataset have been released at https://github.com/ZEUS2035/MIRG-RL.

[51] LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE cs.CV

Yu Shang, Lei Jin, Yiding Ma, Xin Zhang, Chen Gao

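TL;DR: LongScape proposes a hybrid framework that combines diffusion denoising with autoregressive generation, using an action-guided chunking mechanism and a Context-aware Mixture-of-Experts (CMoE) to achieve stable, high-quality long-horizon video generation.

Details

Motivation: Current video-based world models suffer from temporal inconsistency and loss of visual detail in long-horizon generation; LongScape aims to solve these problems.

Result: Experiments show that LongScape generates long-horizon videos stably and outperforms existing methods in visual quality and consistency.

Insight: The combination of the chunking mechanism and adaptive expert models is the key to long-horizon generation.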

Abstract: Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: https://github.com/tsinghua-fib-lab/Longscape.

[52] MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation cs.CV

Yu Shang, Yangcheng Yu, Xin Zhang, Xin Jin, Haisheng Su

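TL;DR: MoWM proposes a mixture-of-world-models framework that combines features from a latent model and a pixel model for embodied action planning, significantly improving task success rates and generalization.

Details

Motivation: Existing video generation world models rely too heavily on pixel-level reconstruction, which introduces visual redundancy, while latent world models are compact but overlook fine-grained details. MoWM aims to combine the strengths of both to improve the precision and generalization of embodied planning.

Result: MoWM achieves the highest task success rates and strong generalization on the CALVIN benchmark.

Insight: Combining latent and pixel features matters for embodied planning; future research could further explore the complementarity of different feature spaces.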

Abstract: Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel space model. This design allows MoWM to highlight the informative visual details needed for action decoding. Extensive evaluations on the CALVIN benchmark demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.

[53] DiTraj: training-free trajectory control for video diffusion transformer cs.CV | cs.AI

Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen

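TL;DR: DiTraj is a training-free trajectory control framework designed for DiT-based video generation, achieving efficient video trajectory control through foreground-background separation guidance and a modified position embedding (STD-RoPE).

Details

Motivation: Existing trajectory control methods either require substantial training resources or apply only to U-Net models, and so fail to exploit the superior performance of DiT. DiTraj aims to overcome these limitations with a simple and effective solution.

Result: Experiments show that DiTraj outperforms existing methods in both video quality and trajectory controllability.

Insight: DiTraj shows that trajectories in DiT models can be controlled effectively without training, offering a new approach to control in Transformer-based video generation.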

Abstract: Diffusion Transformer (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, and so do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object’s trajectory, we propose foreground-background separation guidance: we use a Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only foreground tokens’ position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.

[54] A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG-Enhanced Design cs.CV

Zichen Zhang, Kunlong Zhang, Hongwei Ruan, Yiming Luo

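TL;DR: This paper comprehensively evaluates Transformer-based question answering models and retrieval-augmented generation (RAG) designs for multi-hop question answering, and proposes a hybrid retrieval method that significantly improves performance.

Details

Motivation: Multi-hop reasoning, where answering requires combining several pieces of evidence, remains difficult for question answering systems; this paper aims to address it by improving the retrieval strategy.

Result: On the HotpotQA dataset, the hybrid method achieves relative improvements of 50% in exact match and 47% in F1 score over the cosine similarity baseline.

Insight: Hybrid retrieval improves entity recall and evidence complementarity but remains limited in handling distractors and temporal reasoning; it offers a practical zero-shot solution for multi-hop question answering.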

Abstract: Transformer-based models have advanced the field of question answering, but multi-hop reasoning, where answers require combining evidence across multiple passages, remains difficult. This paper presents a comprehensive evaluation of retrieval strategies for multi-hop question answering within a retrieval-augmented generation framework. We compare cosine similarity, maximal marginal relevance, and a hybrid method that integrates dense embeddings with lexical overlap and re-ranking. To further improve retrieval, we adapt the EfficientRAG pipeline for query optimization, introducing token labeling and iterative refinement while maintaining efficiency. Experiments on the HotpotQA dataset show that the hybrid approach substantially outperforms baseline methods, achieving a relative improvement of 50 percent in exact match and 47 percent in F1 score compared to cosine similarity. Error analysis reveals that hybrid retrieval improves entity recall and evidence complementarity, while remaining limited in handling distractors and temporal reasoning. Overall, the results suggest that hybrid retrieval-augmented generation provides a practical zero-shot solution for multi-hop question answering, balancing accuracy, efficiency, and interpretability.
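Two of the retrieval strategies compared in the abstract, cosine similarity and maximal marginal relevance (MMR), can be sketched in a few lines. This is a minimal illustration on toy embedding vectors; the function names, the `lam` trade-off parameter, and the example vectors are assumptions, and the paper's hybrid method (dense embeddings plus lexical overlap and re-ranking) is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query, docs, k, lam=0.5):
    """Greedy MMR: pick k document indices balancing relevance and novelty.

    lam = 1.0 reduces to plain cosine ranking; smaller values penalize
    candidates that are similar to documents already selected.
    """
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr_select(query, docs, k=2, lam=1.0))  # pure relevance: [0, 1]
print(mmr_select(query, docs, k=2, lam=0.3))  # favors novelty: [0, 2]
```

In a multi-hop setting the second behavior is the interesting one: a passage that nearly duplicates the first hit adds little, while a less similar passage may carry the complementary evidence needed for the second hop.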

[55] Dynamic Novel View Synthesis in High Dynamic Range cs.CV

Kaixuan Zhang, Zhipeng Xiong, Minxian Li, Mingwu Ren, Jiankang Deng

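TL;DR: This paper introduces the HDR DNVS problem, which focuses on learning a high dynamic range (HDR) model of a dynamic 3D scene from low dynamic range (LDR) images, and proposes HDR-4DGS, which uses a dynamic tone-mapping module to achieve spatiotemporally consistent radiance and color translation.

Details

Motivation: Scene synthesis with dynamic elements found in the real world, such as moving objects and changing lighting, is not adequately handled by existing methods, so a method is needed that jointly models temporal radiance variation and HDR/LDR translation.

Result: HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity.

Insight: The key to the dynamic tone-mapping module is that it explicitly connects the HDR and LDR domains and maintains temporal consistency by dynamically adapting the tone-mapping functions, which supports radiance modeling in complex dynamic scenes.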

Abstract: High Dynamic Range Novel View Synthesis (HDR NVS) seeks to learn an HDR 3D model from Low Dynamic Range (LDR) training images captured under conventional imaging conditions. Current methods primarily focus on static scenes, implicitly assuming all scene elements remain stationary and non-living. However, real-world scenarios frequently feature dynamic elements, such as moving objects, varying lighting conditions, and other temporal events, thereby presenting a significantly more challenging scenario. To address this gap, we propose a more realistic problem named HDR Dynamic Novel View Synthesis (HDR DNVS), where the additional dimension "Dynamic" emphasizes the necessity of jointly modeling temporal radiance variations alongside sophisticated 3D translation between LDR and HDR. To tackle this complex, intertwined challenge, we introduce HDR-4DGS, a Gaussian Splatting-based architecture featuring an innovative dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by dynamically adapting tone-mapping functions according to the evolving radiance distributions across the temporal dimension. As a result, HDR-4DGS achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances. Extensive experiments demonstrate that HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity. Source code will be released.

[56] Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization cs.CV | cs.AIPDF

Boyang Liu, Yifan Hu, Senjie Jin, Shihan Dou, Gonglei Shi

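TL;DR: The paper proposes the Aes-R1 framework, which combines reinforcement learning with a chain-of-thought data construction pipeline to improve multimodal large language models on aesthetic scoring and reasoning tasks.

Details

Motivation: Aesthetic scoring is highly subjective and high-quality cross-modal reasoning data is scarce, so existing methods struggle to produce accurate and interpretable aesthetic scores.

Result: Experiments show that Aes-R1 improves PLCC/SRCC by 47.9%/34.8%, respectively, and outperforms comparable baselines.

Insight: Chain-of-thought data and the joint optimization strategy significantly improve performance on aesthetic tasks, with robustness in low-data and out-of-distribution scenarios in particular.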

Abstract: Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start. After teaching the model to generate structured explanations prior to scoring, we then employ the Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone’s average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. More ablation studies validate Aes-R1’s robust generalization under limited supervision and in out-of-distribution scenarios.
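
The two reported metrics are the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SRCC) between predicted and reference scores. They line up loosely with the two things RAPO optimizes: absolute score accuracy and relative ordering. The sketch below computes both with the standard library only; the function names and the toy scores are illustrative, not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation (PLCC) between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy scores (hypothetical): the ordering is perfect, the values are not linear.
predicted = [1.0, 2.0, 3.0, 4.0]
reference = [1.0, 4.0, 9.0, 16.0]
print(f"PLCC={pearson(predicted, reference):.4f}")
print(f"SRCC={spearman(predicted, reference):.4f}")
```

A model can reach a perfect SRCC while its PLCC stays below 1, which is why the two are usually reported together.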

[57] StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing cs.CV | cs.MMPDF

Liyang Chen, Tianze Zhou, Xu He, Boshi Tang, Zhiyong Wu

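TL;DR: StableDub proposes a visual dubbing framework that combines lip-habit modeling with occlusion-robust synthesis, addressing the shortcomings of existing methods in lip-habit resemblance and occlusion handling, and improves performance through an efficient training architecture.

Details

Motivation: Existing visual dubbing methods have two main problems: (1) audio-only driving paradigms cannot accurately capture a speaker's lip habits; (2) conventional blind-inpainting methods tend to produce visual artifacts under occlusion. These problems limit practical application.

Result: Experiments show that StableDub outperforms other methods in lip-habit resemblance, occlusion robustness, audio-lip sync, video quality, and resolution consistency.

Insight: 1. Lip-habit modeling is crucial for visual dubbing quality; 2. An occlusion-aware strategy effectively reduces visual artifacts; 3. The hybrid architecture is advantageous in low-resource scenarios.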

Abstract: The visual dubbing task aims to generate mouth movements synchronized with the driving audio, which has seen significant progress in recent years. However, two critical deficiencies hinder their wide application: (1) Audio-only driving paradigms inadequately capture speaker-specific lip habits, which fail to generate lip movements similar to the target avatar; (2) Conventional blind-inpainting approaches frequently produce visual artifacts when handling obstructions (e.g., microphones, hands), limiting practical deployment. In this paper, we propose StableDub, a novel and concise framework integrating lip-habit-aware modeling with occlusion-robust synthesis. Specifically, building upon the Stable-Diffusion backbone, we develop a lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometries and object appearances under occlusion, we introduce the occlusion-aware training strategy by explicitly exposing the occlusion objects to the inpainting process. By incorporating the proposed designs, the model eliminates the necessity for cost-intensive priors in previous methods, thereby exhibiting superior training efficiency on the computationally intensive diffusion-based backbone. To further optimize training efficiency from the perspective of model architecture, we introduce a hybrid Mamba-Transformer architecture, which demonstrates the enhanced applicability in low-resource research scenarios. Extensive experimental results demonstrate that StableDub achieves superior performance in lip habit resemblance and occlusion robustness. Our method also surpasses other methods in audio-lip sync, video quality, and resolution consistency. We expand the applicability of visual dubbing methods from comprehensive aspects, and demo videos can be found at https://stabledub.github.io.

[58] Drag4D: Align Your Motion with Text-Driven 3D Scene Generation cs.CVPDF

Minjun Kang, Inkyu Shin, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon

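TL;DR: Drag4D is an interactive framework that combines text-driven 3D scene generation with object motion control, letting users define trajectories for 3D objects and integrate them seamlessly into a high-quality 3D background.

Details

Motivation: Existing methods are limited in aligning generated object motion with a 3D scene; Drag4D addresses this through interactive control and a multi-stage pipeline.

Result: Drag4D achieves high-quality object motion and background alignment in generated 3D scenes, validating the effectiveness of its unified architecture.

Insight: By integrating multimodal techniques (such as 3D reconstruction and motion diffusion models) in stages, Drag4D demonstrates the potential of interactive 3D content generation.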

Abstract: We introduce Drag4D, an interactive framework that integrates object motion control within text-driven 3D scene generation. This framework enables users to define 3D trajectories for the 3D objects generated from a single image, seamlessly integrating them into a high-quality 3D background. Our Drag4D pipeline consists of three stages. First, we enhance text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views, resulting in dense and visually complete 3D reconstructions. In the second stage, given a reference image of the target object, we introduce a 3D copy-and-paste approach: the target instance is extracted in a full 3D mesh using an off-the-shelf image-to-3D model and seamlessly composited into the generated 3D scene. The object mesh is then positioned within the 3D scene via our physics-aware object position learning, ensuring precise spatial alignment. Lastly, the spatially aligned object is temporally animated along a user-defined 3D trajectory. To mitigate motion hallucination and ensure view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories. We demonstrate the effectiveness of our unified architecture through evaluations at each stage and in the final results, showcasing the harmonized alignment of user-controlled object motion within a high-quality 3D background.

[59] Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers cs.CVPDF

Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh

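TL;DR: Syncphony proposes a diffusion-Transformer-based audio-to-video generation method that achieves high-fidelity, temporally synchronized video generation through a motion-aware loss and audio sync guidance.

Details

Motivation: Existing text-to-video and image-to-video methods struggle to control the precise timing of motion, whereas the temporal nature of audio provides natural timing cues for video generation.

Result: On the AVSync15 and The Greatest Hits datasets, Syncphony outperforms existing methods in both synchronization accuracy and visual quality.

Insight: Audio signals can serve as an effective temporal condition for video generation, and auxiliary components such as audio sync guidance can effectively improve synchronization.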

Abstract: Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: https://jibin86.github.io/syncphony_project_page

[60] LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation cs.CVPDF

Yixiao Liu, Yizhou Yang, Jinwen Li, Jun Tao, Ruoyu Li

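TL;DR: The paper proposes LG-CD, a language-guided change detection model that combines visual and textual information to significantly improve the accuracy and robustness of remote sensing change detection.

Details

Motivation: Existing deep-learning-based change detection methods focus mainly on unimodal visual information and neglect the rich semantic information provided by multimodal data such as text.

Result: Experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) show that LG-CD outperforms state-of-the-art change detection methods.

Insight: Introducing multimodal information such as text can enable generalized change detection.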

Abstract: Remote Sensing Change Detection (RSCD) typically identifies changes in land cover or surface conditions by analyzing multi-temporal images. Currently, most deep learning-based methods primarily focus on learning unimodal visual information, while neglecting the rich semantic information provided by multimodal data such as text. To address this limitation, we propose a novel Language-Guided Change Detection model (LG-CD). This model leverages natural language prompts to direct the network’s attention to regions of interest, significantly improving the accuracy and robustness of change detection. Specifically, LG-CD utilizes a visual foundational model (SAM2) as a feature extractor to capture multi-scale pyramid features from high-resolution to low-resolution across bi-temporal remote sensing images. Subsequently, multi-layer adapters are employed to fine-tune the model for downstream tasks, ensuring its effectiveness in remote sensing change detection. Additionally, we design a Text Fusion Attention Module (TFAM) to align visual and textual information, enabling the model to focus on target change regions using text prompts. Finally, a Vision-Semantic Fusion Decoder (V-SFD) is implemented, which deeply integrates visual and semantic information through a cross-attention mechanism to produce highly accurate change detection masks. Our experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods. Furthermore, our approach provides new insights into achieving generalized change detection by leveraging multimodal information.

[61] Taming Flow-based I2V Models for Creative Video Editing cs.CV | cs.MMPDF

Xianghao Kong, Hansheng Chen, Yuwei Guo, Lvmin Zhang, Gordon Wetzstein

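TL;DR: The paper proposes IF-V2V, an inversion-free method that adapts existing flow-matching-based I2V models for video editing through vector field rectification and motion-aware initialization, achieving efficient, high-quality editing.

Details

Motivation: Existing video editing methods usually require complex inversion or optimization, which limits their ability to exploit the latest I2V models, so a lightweight, inversion-free solution is needed.

Result: Experiments show that the method outperforms existing approaches in editing quality and consistency, providing a lightweight plug-and-play solution.

Insight: By avoiding inversion and introducing motion-aware noise, IF-V2V shows how existing I2V models can be used efficiently for video editing without sacrificing performance.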

Abstract: Although image editing techniques have advanced significantly, video editing, which aims to manipulate videos according to user intent, remains an emerging challenge. Most existing image-conditioned video editing methods either require inversion with model-specific design or need extensive optimization, limiting their capability of leveraging up-to-date image-to-video (I2V) models to transfer the editing capability of image editing models to the video domain. To this end, we propose IF-V2V, an Inversion-Free method that can adapt off-the-shelf flow-matching-based I2V models for video editing without significant computational overhead. To circumvent inversion, we devise Vector Field Rectification with Sample Deviation to incorporate information from the source video into the denoising process by introducing a deviation term into the denoising vector field. To further ensure consistency with the source video in a model-agnostic way, we introduce Structure-and-Motion-Preserving Initialization to generate motion-aware temporally correlated noise with structural information embedded. We also present a Deviation Caching mechanism to minimize the additional computational cost for denoising vector rectification without significantly impacting editing quality. Evaluations demonstrate that our method achieves superior editing quality and consistency over existing approaches, offering a lightweight plug-and-play solution to realize visual creativity.

[62] Multi-View Crowd Counting With Self-Supervised Learning cs.CVPDF

Hong Mo, Xiong Zhang, Tengfei Shi, Zhongbo Wu

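TL;DR: The paper proposes SSLCounter, a self-supervised learning (SSL) framework for multi-view crowd counting that uses neural volumetric rendering to reduce reliance on annotated data and performs well in both accuracy and data efficiency.

Details

Motivation: Conventional multi-view counting (MVC) methods rely on large amounts of annotated data, and fully supervised learning (FSL) is costly. The paper aims to reduce this reliance through self-supervised learning.

Result: Across multiple MVC benchmarks, SSLCounter not only achieves state-of-the-art performance but also remains competitive with only 70% of the training data.

Insight: Combining self-supervised learning with neural volumetric rendering offers a new approach to data-efficient MVC and an alternative to conventional supervised learning.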

Abstract: Multi-view counting (MVC) methods have attracted significant research attention and stimulated remarkable progress in recent years. Despite their success, most MVC methods have focused on improving performance by following the fully supervised learning (FSL) paradigm, which often requires large amounts of annotated data. In this work, we propose SSLCounter, a novel self-supervised learning (SSL) framework for MVC that leverages neural volumetric rendering to alleviate the reliance on large-scale annotated datasets. SSLCounter learns an implicit representation w.r.t. the scene, enabling the reconstruction of continuous geometric shape and the complex, view-dependent appearance of their 2D projections via differential neural rendering. Owing to its inherent flexibility, the key idea of our method can be seamlessly integrated into existing frameworks. Notably, extensive experiments show that SSLCounter not only achieves state-of-the-art performance but also remains competitive when trained on only 70% of the training data, showcasing its superior data efficiency across multiple MVC benchmarks.

[63] Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding cs.CVPDF

Vahid Mirjalili, Ramin Giahi, Sriram Kollipara, Akshay Kekuda, Kehui Yao

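TL;DR: The paper presents a systematic benchmark for evaluating object-centric spatial reasoning in foundation models, revealing the gap between localization accuracy and true spatial understanding in current models.

Details

Motivation: Spatial understanding is a critical capability for vision foundation models, but most existing benchmarks focus on localization accuracy and overlook how objects are arranged and related. The paper aims to fill this gap.

Result: The study finds that detector-style models localize precisely but lack relational reasoning, while vision-language models provide coarse spatial layout but struggle with fine-grained spatial context.

Insight: The work exposes the limits of foundation models' spatial understanding and calls for a new generation of spatially aware models.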

Abstract: Spatial understanding is a critical capability for vision foundation models. While recent advances in large vision models or vision-language models (VLMs) have expanded recognition capabilities, most benchmarks emphasize localization accuracy rather than whether models capture how objects are arranged and related within a scene. This gap is consequential; effective scene understanding requires not only identifying objects, but reasoning about their relative positions, groupings, and depth. In this paper, we present a systematic benchmark for object-centric spatial reasoning in foundation models. Using a controlled synthetic dataset, we evaluate state-of-the-art vision models (e.g., GroundingDINO, Florence-2, OWLv2) and large VLMs (e.g., InternVL, LLaVA, GPT-4o) across three tasks: spatial localization, spatial reasoning, and downstream retrieval tasks. We find a stable trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning, while VLMs like SmolVLM and GPT-4o provide coarse layout cues and fluent captions but struggle with fine-grained spatial context. Our study highlights the gap between localization and true spatial understanding, and points toward the need for spatially-aware foundation models in the community.

[64] PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning cs.CVPDF

Jiahao Zhang, Bowen Wang, Hong Liu, Yuta Nakashima, Hajime Nagahara

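TL;DR: PANICL proposes a training-free framework that uses multiple in-context pairs to mitigate over-reliance on a single prompt in visual in-context learning, reducing bias and improving stability.

Details

Motivation: Visual in-context learning (VICL) often over-relies on a single in-context pair, leading to biased and unstable predictions. The authors propose the PANICL framework to address this.

Result: Experiments show that PANICL outperforms baselines on a variety of vision tasks, is strongly robust to domain shifts (such as dataset and label-space shifts), and generalizes to other VICL models.

Insight: PANICL shows that introducing multiple in-context pairs effectively mitigates dependence on a single prompt, and that the method is general, applying across tasks and models.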

Abstract: Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. However, VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs. PANICL smooths assignment scores across pairs, reducing bias without requiring additional training. Extensive experiments on a variety of tasks, including foreground segmentation, single object detection, colorization, multi-object segmentation, and keypoint detection, demonstrate consistent improvements over strong baselines. Moreover, PANICL exhibits strong robustness to domain shifts, including dataset-level shift (e.g., from COCO to Pascal) and label-space shift (e.g., FSS-1000), and generalizes well to other VICL models such as SegGPT, Painter, and LVM, highlighting its versatility and broad applicability.
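
The abstract does not spell out how scores are smoothed, so the following is only a sketch of the general idea, under the assumption that each of the k in-context pairs yields a score vector over candidate outputs for a query patch and that the vectors are combined by a (weighted) mean before the final choice. The names `smooth_scores` and `argmax` and the toy numbers are hypothetical, not PANICL's actual interface.

```python
def smooth_scores(per_pair_scores, weights=None):
    """Combine score vectors from k in-context pairs by a weighted mean.

    per_pair_scores: list of k lists, each holding one score per candidate.
    weights: optional list of k weights; defaults to a uniform average.
    """
    k = len(per_pair_scores)
    if weights is None:
        weights = [1.0 / k] * k
    n = len(per_pair_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, per_pair_scores))
        for c in range(n)
    ]


def argmax(xs):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


# Toy scores (hypothetical): the first pair alone favors candidate 1,
# but the two other pairs agree on candidate 2.
pairs = [
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.1, 0.3, 0.6],
]
print(argmax(pairs[0]))              # choice driven by a single prompt
print(argmax(smooth_scores(pairs)))  # choice after averaging over k prompts
```

Averaging lets agreement among several prompts override one atypical prompt, which is the bias-reduction effect described above. It requires no training, only extra forward passes.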

[65] Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach cs.CVPDF

Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou

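TL;DR: This paper proposes a customized visual emotion evaluation method for multimodal large language models (MLLMs) that addresses limitations of existing evaluation methods, including narrow emotion taxonomies and neglect of context, and builds emotion-centric statements through an automated pipeline.

Details

Motivation: MLLMs perform inconsistently on visual emotion perception tasks, partly because of limitations in evaluation methods, such as incomplete emotion taxonomies and the neglect of plausible responses and contextual factors.

Result: The study shows that MLLMs perform relatively well in emotion interpretation and context-based judgment but remain weaker in understanding perception subjectivity, with a significant gap relative to humans.

Insight: Improving the emotional intelligence of MLLMs requires comprehensive evaluation and contextual understanding; future work should focus on perception subjectivity and the gap to human emotion judgment.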

Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.

[66] MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning cs.CVPDF

Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang

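TL;DR: MultiCrafter uses spatially disentangled attention and identity-aware reinforcement learning to address attribute leakage in multi-subject image generation and to improve alignment with human aesthetic preferences.

Details

Motivation: Existing multi-subject image generation methods rely on simple reconstruction-based objectives, which causes severe attribute leakage and fails to satisfy nuanced human preferences, so a more effective solution is needed.

Result: Experiments show that MultiCrafter significantly improves subject fidelity and aligns better with human aesthetic preferences.

Insight: The root cause of attribute leakage is entanglement of attention regions, which spatial disentanglement and an MoE architecture can effectively address; reinforcement learning further improves alignment with human preferences.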

Abstract: Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, leading both to severe attribute leakage that compromises subject fidelity and to a failure to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is a significant entanglement of attention between different subjects during the generation process. Therefore, we introduce explicit positional supervision to separate attention regions for each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention region of different subjects in diverse scenarios, we employ a Mixture-of-Experts architecture to enhance the model’s capacity, allowing different experts to focus on different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism to accurately assess multi-subject fidelity and a more stable training strategy tailored for the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.

[67] PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data cs.CVPDF

Zhe Zhu, Le Wan, Rui Xu, Yiheng Zhang, Honghua Chen

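TL;DR: PartSAM is the first promptable part segmentation model trained on native 3D data; with a triplane-based dual-branch encoder and diverse large-scale annotated data, it significantly improves open-world part segmentation.

Details

Motivation: Existing methods rely on supervision transferred from 2D foundation models and struggle to capture intrinsic 3D geometry, resulting in surface-only understanding and limited decomposition ability. PartSAM aims to solve these problems by training directly on native 3D data.

Result: PartSAM outperforms existing methods by large margins on multiple benchmarks, demonstrating strong open-world part understanding.

Insight: Training on native 3D data is key to improving part segmentation, and the promptable design offers a new direction for 3D foundation models.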

Abstract: Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding. Our code and model will be released soon.

[68] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning cs.CV | cs.AIPDF

Zilun Zhang, Zian Guan, Tiancheng Zhao, Haozhan Shen, Tianyu Li

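TL;DR: Geo-R1 proposes a reinforcement fine-tuning paradigm that improves few-shot geospatial referring expression understanding by explicitly generating reasoning chains, and performs well with limited annotated data.

Details

Motivation: Geospatial referring expression understanding requires reasoning over complex object-context relationships, and conventional supervised fine-tuning (SFT) generalizes poorly when large-scale annotated data is unavailable.

Result: On three few-shot geospatial referring benchmarks, Geo-R1 substantially outperforms SFT baselines and shows strong cross-dataset generalization.

Insight: Explicit reasoning chains not only improve few-shot performance but also enhance interpretability and robustness.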

Abstract: Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) on multimodal large language models achieves strong performance with massive labeled datasets, it struggles in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 requires the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then leverage these rationales to localize target objects. This “reason first, then act” process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at http://geo-r1.github.io.

[69] Benchmarking and Mitigate Psychological Sycophancy in Medical Vision-Language Models cs.CV | cs.AIPDF

Zikun Guo, Xinyue Xu, Pei Xiang, Shu Yang, Xin Han

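TL;DR: The paper studies psychological sycophancy in medical vision-language models (VLMs) and proposes a new clinical benchmark dataset and a lightweight mitigation strategy (VIPER) that reduces reliance on non-evidentiary content.

Details

Motivation: Medical VLMs are increasingly used in clinical workflows, but they often behave sycophantically, prioritizing the user's phrasing or perceived authority over evidence-based reasoning. This can bias clinical decisions.

Result: Experiments show that VLMs are generally vulnerable to sycophancy, and that VIPER significantly reduces it and outperforms baselines.

Insight: 1. Model size is not strongly correlated with sycophancy; 2. Imitation and expert-provided corrections are the most effective triggers; 3. Evidence-based defenses are needed for safe clinical use.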

Abstract: Vision-language models (VLMs) are increasingly integrated into clinical workflows, but they often exhibit sycophantic behavior, prioritizing alignment with user phrasing, social cues, or perceived authority over evidence-based reasoning. This study evaluates clinical sycophancy in medical visual question answering through a novel clinically grounded benchmark. We propose a medical sycophancy dataset constructed from PathVQA, SLAKE, and VQA-RAD, stratified by type, organ system, and modality, using psychologically motivated pressure templates that cover various forms of sycophancy. In our adversarial experiments on various VLMs, we found that these models are generally vulnerable, exhibiting significant variations in the occurrence of adversarial responses, with weak correlations to model accuracy or size. Imitation and expert-provided corrections were found to be the most effective triggers, suggesting that the models possess a bias mechanism independent of visual evidence. To address this, we propose Visual Information Purification for Evidence based Response (VIPER), a lightweight mitigation strategy that filters non-evidentiary content (for example, social pressures) and then generates constrained, evidence-first answers. This framework reduces sycophancy on average and outperforms baselines while maintaining interpretability. Our benchmark analysis and mitigation framework lay the groundwork for robust deployment of medical VLMs in real-world clinician interactions, emphasizing the need for evidence-anchored defenses.

[70] Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm cs.CVPDF

Zeyu Wang, Baiyu Chen, Kun Yan, Hongjing Piao, Hao Xue

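TL;DR: The paper proposes GLARIFY, which integrates user gaze data to resolve ambiguity in multimodal queries to vision-language models and significantly improves model performance.

Details

Motivation: On devices such as smart glasses, using attention data such as gaze in multimodal queries introduces ambiguity (for example from pronouns or gaze noise), and existing methods handle only static images and cannot capture dynamic attention.

Result: Experiments show that GLARIFY significantly outperforms baseline models and that aligning with human attention makes the interaction paradigm more practical.

Insight: Gaze data is dynamic and noisy and needs dedicated handling; combining synthetic data with a CoT process can effectively resolve multimodal ambiguity.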

Abstract: With the rise in popularity of smart glasses, users’ attention has been integrated into Vision-Language Models (VLMs) to streamline multi-modal querying in daily scenarios. However, leveraging gaze data to model users’ attention may introduce ambiguity challenges: (1) users’ verbal questions become ambiguous by using pronouns or skipping context, (2) humans’ gaze patterns can be noisy and exhibit complex spatiotemporal relationships with their spoken questions. Previous work considers only a single image as visual input, failing to capture the dynamic nature of the user’s attention. In this work, we introduce GLARIFY, a novel method to leverage spatiotemporal gaze information to enhance the model’s effectiveness in real-world applications. Initially, we analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users’ gaze patterns. We then utilized GPT-4o to design an automatic data synthesis pipeline to generate the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process to handle noisy gaze patterns. Finally, we designed a heatmap module to incorporate gaze information into cutting-edge VLMs while preserving their pretrained knowledge. We evaluated GLARIFY using a hold-out test set. Experiments demonstrate that GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.

[71] From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs cs.CV | cs.CLPDF

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Weili Guan

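TL;DR: This paper systematically studies spatial bias in large vision-language models (LVLMs) and proposes BaPA, a simple and effective mechanism that mitigates it, improving spatial robustness and multimodal task performance.

Details

Motivation: LVLMs perform well on many multimodal tasks, but their robustness to spatial variation is understudied. The paper aims to expose and resolve the output inconsistency that arises when visual information appears at different positions in the input image.

Result: BaPA significantly improves the spatial robustness of LVLMs and yields gains on multiple multimodal benchmarks. Information-flow analysis shows that BaPA produces more balanced attention allocation.

Insight: Spatial bias stems mainly from the unbalanced design of position embeddings in the language model, not from the vision encoder. Assigning identical position embeddings to image tokens effectively mitigates the problem and improves overall visual understanding.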

Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we present a systematic study of the spatial bias of LVLMs, focusing on how models respond when identical key visual information is placed at different locations within an image. Through a carefully designed probing dataset, we demonstrate that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a fundamental limitation in their spatial-semantic understanding. Further analysis shows that this phenomenon originates not from the vision encoder, which reliably perceives and interprets visual content across positions, but from the unbalanced design of position embeddings in the language model component. In particular, the widely adopted position embedding strategies, such as RoPE, introduce imbalance during cross-modal interaction, leading image tokens at different positions to exert unequal influence on semantic understanding. To mitigate this issue, we introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens, promoting a more balanced integration of visual information. Extensive experiments show that BaPA enhances the spatial robustness of LVLMs without retraining and further boosts their performance across diverse multimodal benchmarks when combined with lightweight fine-tuning. Further analysis of information flow reveals that BaPA yields balanced attention, enabling more holistic visual understanding.
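
The core of BaPA as described above is that every image token receives the same position embedding. A minimal sketch at the level of position indices follows, assuming a boolean mask marks which tokens come from the image. The function names are hypothetical, and the choice that text after the image continues from the shared index plus one is an assumption made for illustration, not a detail stated in the abstract.

```python
def sequential_position_ids(is_image):
    """Conventional scheme: every token gets the next index."""
    return list(range(len(is_image)))


def balanced_position_ids(is_image):
    """All tokens of a contiguous image run share one position index.

    Assumption (not from the paper): tokens after the image continue
    counting from the shared index + 1.
    """
    ids = []
    next_id = 0
    shared = 0
    prev_was_image = False
    for flag in is_image:
        if flag:
            if not prev_was_image:
                # First token of an image run claims one index for the run.
                shared = next_id
                next_id += 1
            ids.append(shared)
        else:
            ids.append(next_id)
            next_id += 1
        prev_was_image = flag
    return ids


# Toy sequence: two text tokens, three image tokens, two text tokens.
mask = [False, False, True, True, True, False, False]
print(sequential_position_ids(mask))
print(balanced_position_ids(mask))
```

With rotary embeddings such as RoPE, attention depends on the difference between position indices, so under the sequential scheme image tokens sit at different relative distances from any given text token. Under the shared-index scheme those distances are equal, which matches the balanced influence the abstract describes.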

[72] Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation cs.CVPDF

Abdelrahman Eldesokey, Aleksandar Cvejic, Bernard Ghanem, Peter Wonka

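TL;DR: The paper proposes a new method for disentangling visual and semantic features from pre-trained diffusion models; with a contrastive architecture and automatically constructed image pairs, it quantifies and localizes visual inconsistencies.

Details

Motivation: Diffusion model backbones are known to encode rich semantic features, but their visual features remain underexplored. Isolating them is challenging because annotated datasets are lacking.

Result: Experiments show that VSM outperforms global-feature methods such as CLIP and DINO in quantifying visual inconsistency and can localize inconsistent regions.

Insight: Visual features in diffusion models can be extracted separately and used to improve generation tasks, providing a new tool for consistency problems in generative models.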

Abstract: We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision–language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project page: https://abdo-eldesokey.github.io/mind-the-glitch/

[73] WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM cs.CV | cs.SDPDF

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao

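TL;DR: WAVE is the first embedding model built on a multimodal large language model (LLM). Through hierarchical feature fusion and joint multi-task training, it creates a unified representation for text, audio, and video, and performs strongly in cross-modal retrieval and prompt-aware embedding generation.

Details

Motivation: The use of multimodal LLMs for dynamic modalities such as audio and video remains underexplored. WAVE aims to fill this gap by providing a unified representation space that supports cross-modal tasks.

Result: WAVE sets a new state of the art on the MMEB-v2 video benchmark, achieves superior results in audio and video-to-audio retrieval, and significantly outperforms existing embedding models in multimodal question answering.

Insight: 1. Joint multi-task training significantly improves multimodal representation quality; 2. Prompt-aware embedding generation adds flexibility for multimodal tasks.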

Abstract: While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.

[74] ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LGPDF

Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim

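TL;DR: ERGO proposes an efficient inference method for vision-language models: a two-stage "coarse-to-fine" pipeline that reduces the computational cost of high-resolution images while preserving visual detail.

Details

Motivation: Existing large vision-language models incur excessive computational overhead on high-resolution images, and existing methods tend to lose key visual information after downsampling, which causes reasoning to fail.

Result: On the V* benchmark, ERGO surpasses Qwen2.5-VL-7B by 4.7 points while using only 23% of the vision tokens, with a 3x inference speedup.

Insight: Combining reasoning-driven perception with multimodal context can substantially reduce computational overhead while maintaining or improving model performance.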

Abstract: Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of “thinking with images” models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage “coarse-to-fine” reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
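
The second stage of the coarse-to-fine pipeline needs to turn a region found on the downsampled image into a full-resolution crop. The sketch below shows only that coordinate mapping, with an optional margin standing in for the "expand to cover ambiguous areas" behaviour. The function name, the `margin` parameter, and the rounding policy are assumptions for illustration, not ERGO's implementation.

```python
import math


def map_box_to_full_resolution(box, scale, full_size, margin=0.0):
    """Map a box from downsampled coordinates to a full-resolution crop.

    box:       (x0, y0, x1, y1) on the downsampled image
    scale:     full-resolution size divided by downsampled size
    full_size: (width, height) of the original image
    margin:    hypothetical relative padding added on every side
    """
    x0, y0, x1, y1 = box
    width, height = full_size
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    # Round outwards so the crop never cuts into the selected region,
    # then clamp to the image bounds.
    left = max(0, math.floor((x0 - pad_x) * scale))
    top = max(0, math.floor((y0 - pad_y) * scale))
    right = min(width, math.ceil((x1 + pad_x) * scale))
    bottom = min(height, math.ceil((y1 + pad_y) * scale))
    return left, top, right, bottom
```

Only the pixels inside the returned box would be passed to the second reasoning stage, which is where the reduction in vision tokens comes from.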

[75] DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints cs.CVPDF

Sungmin Woo, Sangyoun Lee

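TL;DR: DualFocus improves learning-based Depth-from-Focus (DFF) by introducing dual variational constraints over the spatial and focal dimensions, increasing depth estimation accuracy and robustness in complex scenes.

Details

Motivation: Existing learning-based DFF methods perform poorly in complex scenes (such as fine textures or abrupt depth changes), where focus cues can become ambiguous or misleading. DualFocus addresses this by modeling focus changes over both the spatial and focal dimensions.

Result: Experiments on four public datasets show that DualFocus outperforms existing methods in both depth accuracy and perceptual quality, particularly in complex scenes.

Insight: Explicitly modeling the physical behavior of focus along the spatial and focal dimensions effectively improves DFF in complex scenes and offers a new direction for learning-based depth estimation.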

Abstract: Depth-from-Focus (DFF) enables precise depth estimation by analyzing focus cues across a stack of images captured at varying focal lengths. While recent learning-based approaches have advanced this field, they often struggle in complex scenes with fine textures or abrupt depth changes, where focus cues may become ambiguous or misleading. We present DualFocus, a novel DFF framework that leverages the focal stack’s unique gradient patterns induced by focus variation, jointly modeling focus changes over spatial and focal dimensions. Our approach introduces a variational formulation with dual constraints tailored to DFF: spatial constraints exploit gradient pattern changes across focus levels to distinguish true depth edges from texture artifacts, while focal constraints enforce unimodal, monotonic focus probabilities aligned with physical focus behavior. These inductive biases improve robustness and accuracy in challenging regions. Comprehensive experiments on four public datasets demonstrate that DualFocus consistently outperforms state-of-the-art methods in both depth accuracy and perceptual quality.
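
To make the focal constraint concrete: for a single pixel, the focus probability across the stack should rise to one peak and then fall. The sketch below checks that property and reads out a depth as the probability-weighted focus distance. Both helper names and the expected-value readout are assumptions for illustration; the abstract does not describe how DualFocus enforces the constraint or decodes depth.

```python
def is_unimodal(probs):
    """True if probs rises (non-strictly) to a single peak and then falls."""
    peak = max(range(len(probs)), key=probs.__getitem__)
    rising = all(probs[i] <= probs[i + 1] for i in range(peak))
    falling = all(probs[i] >= probs[i + 1] for i in range(peak, len(probs) - 1))
    return rising and falling


def expected_depth(probs, focus_distances):
    """Hypothetical readout: probability-weighted average of focus distances."""
    return sum(p * d for p, d in zip(probs, focus_distances))
```

A per-pixel profile with two separated peaks fails the check, which is the kind of physically implausible output the focal constraint is meant to rule out.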

[76] Rate-Distortion Optimized Communication for Collaborative Perception cs.CVPDF

Genjia Liu, Anning Hu, Yue Hu, Wenjun Zhang, Siheng Chen

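TL;DR: The paper proposes RDcomm, a multi-agent collaborative perception framework grounded in rate-distortion theory. Task entropy discrete coding and mutual-information-driven message selection balance communication efficiency against task performance and greatly reduce communication volume.

Details

Motivation: In multi-agent collaborative perception, how to share visual information efficiently under limited bandwidth still lacks theoretical support; the paper aims to fill this theoretical gap.

Result: On 3D object detection and BEV segmentation, RDcomm achieves state-of-the-art accuracy on the DAIR-V2X and OPV2V datasets while reducing communication volume by up to 108 times.

Insight: Designing the communication strategy under theoretical guidance can both improve task performance and substantially reduce communication overhead.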

Abstract: Collaborative perception emphasizes enhancing environmental understanding by enabling multiple agents to share visual information with limited bandwidth resources. While prior work has explored the empirical trade-off between task performance and communication volume, a significant gap remains in the theoretical foundation. To fill this gap, we draw on information theory and introduce a pragmatic rate-distortion theory for multi-agent collaboration, specifically formulated to analyze performance-communication trade-off in goal-oriented multi-agent systems. This theory concretizes two key conditions for designing optimal communication strategies: supplying pragmatically relevant information and transmitting redundancy-less messages. Guided by these two conditions, we propose RDcomm, a communication-efficient collaborative perception framework that introduces two key innovations: i) task entropy discrete coding, which assigns features with task-relevant codeword-lengths to maximize the efficiency in supplying pragmatic information; ii) mutual-information-driven message selection, which utilizes mutual information neural estimation to approach the optimal redundancy-less condition. Experiments on 3D object detection and BEV segmentation demonstrate that RDcomm achieves state-of-the-art accuracy on DAIR-V2X and OPV2V, while reducing communication volume by up to 108 times. The code will be released.

[77] Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors cs.CVPDF

Youxu Shi, Suorong Yang, Dong Liu

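TL;DR: The paper proposes a training-free, self-supervised method that reduces hallucinations in multimodal large language models (MLLMs) using dual anchors: a positive anchor from the original image and a negative anchor generated from the caption.

Details

Motivation: Although MLLMs perform well on vision-language tasks, they still tend to produce hallucinated content that is inconsistent with the visual evidence, and existing mitigation methods usually require additional fine-tuning or sacrifice scalability.

Result: The method significantly reduces object-, attribute-, and relation-level hallucinations across multiple benchmarks (for example, a reduction of over 5% for LLaVA-v1.5-7B on CHAIR) and generalizes across architectures.

Insight: The method has almost no side effects on hallucination-free captions, which demonstrates its robustness and plug-and-play practicality.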

Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet they remain highly susceptible to hallucinations, producing content that is fluent but inconsistent with visual evidence. Such hallucinations, spanning objects, attributes, and relations, persist even in larger models, while existing mitigation approaches often require additional finetuning, handcrafted priors, or trade-offs that compromise informativeness and scalability. To address this limitation, we propose a training-free, self-supervised method for hallucination mitigation. Our approach introduces a novel hallucination amplification mechanism: a caption is projected into the visual space via a text-to-image model to reveal implicit hallucination signals, serving as a negative anchor, while the original image provides a positive anchor. Leveraging these dual anchors, we edit decoder hidden states by pulling representations toward faithful semantics and pushing them away from hallucination directions. This correction requires no human priors or additional training costs, ensuring both effectiveness and efficiency. Extensive experiments across multiple benchmarks show that our method significantly reduces hallucinations at the object, attribute, and relation levels while largely preserving recall and caption richness, e.g., achieving a hallucination reduction of over 5% using LLaVA-v1.5-7B on CHAIR. Furthermore, results on diverse architectures, including LLaVA-NEXT-7B, Cambrian-8B, and InstructBLIP-7B, validate strong cross-architecture generalization. More importantly, when applied to hallucination-free captions, our method introduces almost no side effects, underscoring its robustness and practical plug-and-play applicability. The implementation will be publicly available.
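
The "pull toward the positive anchor, push away from the negative anchor" edit can be pictured as a simple vector update on a hidden state. The sketch below is one plausible linear form under that reading; the function name, the step sizes `alpha` and `beta`, and the absence of any normalization are assumptions, and the paper's actual editing rule may differ.

```python
def edit_hidden_state(hidden, positive_anchor, negative_anchor, alpha=0.1, beta=0.1):
    """Shift a hidden state toward the positive anchor and away from the negative one.

    All three inputs are equal-length lists of floats. alpha and beta are
    hypothetical step sizes, not values taken from the paper.
    """
    return [
        h + alpha * p - beta * n
        for h, p, n in zip(hidden, positive_anchor, negative_anchor)
    ]
```

Because the update only touches activations at inference time, a scheme of this kind needs no gradient steps, which is consistent with the training-free claim in the abstract.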

[78] CoFFT: Chain of Foresight-Focus Thought for Visual Language Models cs.CVPDF

Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang

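TL;DR: CoFFT is a training-free method that emulates human visual cognition and iterates through three stages to improve the reasoning of vision language models, yielding clear performance gains.

Details

Motivation: Existing vision language models are easily distracted by complex and redundant visual input, producing irrelevant reasoning or even hallucinations. CoFFT addresses this through more precise discovery and processing of the relevant regions.

Result: Across multiple benchmarks, CoFFT improves models such as Qwen2.5-VL by 3.1-5.8%, with controllable computational overhead.

Insight: CoFFT shows that iteratively adjusting visual focus is important for improving model reasoning, pointing to a new direction for optimizing vision language models.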

Abstract: Despite significant advances in Vision Language Models (VLMs), they remain constrained by the complexity and redundancy of visual input. When images contain large amounts of irrelevant information, VLMs are susceptible to interference, thus generating excessive task-irrelevant reasoning processes or even hallucinations. This limitation stems from their inability to precisely discover and process the required regions during reasoning. To address this limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs’ visual reasoning by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation: generates diverse reasoning samples to explore potential reasoning paths, where each sample contains several reasoning steps; (2) Dual Foresight Decoding: rigorously evaluates these samples based on both visual focus and reasoning progression, adding the first step of the optimal sample to the reasoning process; (3) Visual Focus Adjustment: precisely adjusts visual focus toward regions most beneficial for future reasoning, before returning to stage (1) to generate subsequent reasoning samples until reaching the final answer. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.

[79] Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics cs.CV | cs.AI | cs.HC | cs.ROPDF

Saurav Jha, Stefan K. Ehrlich

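TL;DR: The paper proposes a lightweight multimodal reasoning framework that combines the Qwen2.5-VL-3B-Instruct model with a SmolAgent orchestration layer for clinical scene understanding in healthcare robotics. It supports chain-of-thought reasoning and dynamic tool invocation, with competitive accuracy and improved robustness compared to existing vision-language models.

Details

Motivation: Healthcare robots need robust multimodal perception and reasoning in dynamic clinical environments. Existing vision-language models (VLMs) fall short in temporal reasoning, uncertainty estimation, and structured outputs, and so cannot meet the needs of robotic planning.

Result: The framework shows competitive accuracy and improved robustness on the Video-MME benchmark and a custom clinical dataset, and is applicable to robot-assisted surgery, patient monitoring, and decision support.

Insight: A lightweight multimodal framework can compensate for the weaknesses of existing VLMs in temporal reasoning and structured output, offering a feasible solution for robotic applications in clinical settings.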

Abstract: Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.

Yuki Sakai, Ryosuke Furuta, Juichun Yen, Yoichi Sato

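TL;DR: The EgoInstruct paper presents a new egocentric video dataset of face-to-face instruction and benchmarks multimodal large language models (MLLMs) on its tasks, showing their potential for instructional scenes.

Details

Motivation: Face-to-face instructional scenes have not been studied systematically in computer vision, mainly because suitable datasets and analytical techniques are lacking. The paper aims to fill this gap.

Result: Experiments show that MLLMs outperform conventional task-specific models even without task-specific fine-tuning.

Insight: Multimodal large language models show promise for holistic understanding of instructional scenes, especially in integrating verbal and nonverbal information.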

Abstract: Analyzing instructional interactions between an instructor and a learner who are co-present in the same physical space is a critical problem for educational support and skill transfer. Yet such face-to-face instructional scenes have not been systematically studied in computer vision. We identify two key reasons: i) the lack of suitable datasets and ii) limited analytical techniques. To address this gap, we present a new egocentric video dataset of face-to-face instruction and provide ground-truth annotations for two fundamental tasks that serve as a first step toward a comprehensive understanding of instructional interactions: procedural step segmentation and conversation-state classification. Using this dataset, we benchmark multimodal large language models (MLLMs) against conventional task-specific models. Since face-to-face instruction involves multiple modalities (speech content and prosody, gaze and body motion, and visual context), effective understanding requires methods that handle verbal and nonverbal communication in an integrated manner. Accordingly, we evaluate recently introduced MLLMs that jointly process images, audio, and text. This evaluation quantifies the extent to which current machine learning models understand face-to-face instructional scenes. In experiments, MLLMs outperform specialized baselines even without task-specific fine-tuning, suggesting their promise for holistic understanding of instructional interactions.

[81] High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling cs.CV | cs.SDPDF

Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

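TL;DR: DAVIS is a diffusion-based audio-visual separation framework that tackles sound separation through generative learning, moving beyond the limits of conventional mask-regression methods.

Details

Motivation: Existing sound separation methods are usually based on mask regression and struggle to capture complex data distributions, which limits high-quality separation across diverse sound categories.

Result: On the AVE and MUSIC datasets, both DAVIS variants surpass existing methods and achieve higher-quality sound separation.

Insight: Generative models perform well on sound separation. Diffusion models in particular can capture complex data distributions effectively, offering a new route to high-quality separation of diverse sound categories.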

Abstract: We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS circumvents these issues by leveraging potent generative modeling paradigms, specifically Denoising Diffusion Probabilistic Models (DDPM) and the more recent Flow Matching (FM), integrated within a specialized Separation U-Net architecture. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information. The inherent nature of its generative objective makes DAVIS particularly adept at producing high-quality sound separations for diverse sound categories. We present comparative evaluations of DAVIS, encompassing both its DDPM and Flow Matching variants, against leading methods on the standard AVE and MUSIC datasets. The results affirm that both variants surpass existing approaches in separation quality, highlighting the efficacy of our generative framework for tackling the audio-visual source separation task.
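
For readers unfamiliar with Flow Matching, one common formulation trains a network to predict the velocity along a straight path from a noise sample to a data sample. The sketch below builds that training pair for a single example. The linear path and the helper name are assumptions; the abstract does not say which path DAVIS uses, and the conditioning on mixed audio and visual features is omitted.

```python
def flow_matching_pair(noise, target, t):
    """Return (x_t, velocity) for a linear path from noise (t=0) to target (t=1).

    noise, target: equal-length lists of floats (e.g. flattened spectrogram bins)
    t:             interpolation time in [0, 1]
    """
    # Point on the straight line between the noise sample and the clean target.
    x_t = [(1 - t) * n + t * x for n, x in zip(noise, target)]
    # Along a straight path the velocity is constant: target minus noise.
    velocity = [x - n for n, x in zip(noise, target)]
    return x_t, velocity
```

A model would be trained to regress `velocity` from `x_t`, `t`, and the conditioning signals; at inference, integrating the predicted velocity from a noise sample yields the separated spectrogram.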

[82] SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection cs.CVPDF

Inzamamul Alam, Md Tanvir Islam, Simon S. Woo

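TL;DR: SpecXNet is a dual-domain convolutional network that combines spatial and spectral features and uses its DDFC and DFA modules for robust deepfake detection.

Details

Motivation: Existing methods focus only on spatial or only on frequency-domain features, which limits their generalization to unseen manipulations.

Result: The model performs strongly in cross-dataset and unseen-manipulation scenarios while remaining feasible in real time.

Insight: Unified spatial-spectral learning markedly improves the robustness and generalization of deepfake detection.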

Abstract: The increasing realism of content generated by GANs and diffusion models has made deepfake detection significantly more challenging. Existing approaches often focus solely on spatial or frequency-domain features, limiting their generalization to unseen manipulations. We propose the Spectral Cross-Attentional Network (SpecXNet), a dual-domain architecture for robust deepfake detection. The core Dual-Domain Feature Coupler (DDFC) decomposes features into a local spatial branch for capturing texture-level anomalies and a global spectral branch that employs Fast Fourier Transform to model periodic inconsistencies. This dual-domain formulation allows SpecXNet to jointly exploit localized detail and global structural coherence, which are critical for distinguishing authentic from manipulated images. We also introduce the Dual Fourier Attention (DFA) module, which dynamically fuses spatial and spectral features in a content-aware manner. Built atop a modified XceptionNet backbone, we embed the DDFC and DFA modules within a separable convolution block. Extensive experiments on multiple deepfake benchmarks show that SpecXNet achieves state-of-the-art accuracy, particularly under cross-dataset and unseen manipulation scenarios, while maintaining real-time feasibility. Our results highlight the effectiveness of unified spatial-spectral learning for robust and generalizable deepfake detection. To ensure reproducibility, we released the full code on GitHub: https://github.com/inzamamulDU/SpecXNet.
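
Why a Fourier branch helps with "periodic inconsistencies" can be seen on a toy 1D signal: a repeating artifact that is spread thinly across the spatial domain shows up as a single sharp peak in the spectrum. The sketch below uses a plain discrete Fourier transform from the standard library. It is a 1D illustration only, with hypothetical helper names, and is not the 2D FFT over feature maps that SpecXNet's spectral branch applies.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each DFT bin of a real-valued 1D signal (O(n^2) reference DFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def dominant_frequency(signal):
    """Index of the strongest non-DC bin in the first half of the spectrum."""
    mags = dft_magnitudes(signal)
    half = len(signal) // 2
    return max(range(1, half + 1), key=lambda k: mags[k])
```

For a 16-sample cosine that repeats three times, the strongest bin is bin 3, so a global spectral feature exposes the period directly even though no single sample looks anomalous.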

[83] Large Material Gaussian Model for Relightable 3D Generation cs.CVPDF

Jingrui Ye, Lingting Zhu, Runze Zhang, Zeyu Hu, Yingda Yin

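TL;DR: The paper proposes the Large Material Gaussian Model (MGM), a framework that generates high-quality 3D content with PBR materials. It addresses the inability of existing models to produce material properties and supports dynamic relighting.

Details

Motivation: As demand for 3D content grows, current models cannot generate material properties, which limits realistic rendering under diverse lighting. The paper aims to fill this gap.

Result: Experiments show that the materials generated by MGM are better than those of baseline methods in both visual quality and material modeling, and that they support practical rendering applications.

Insight: 1. Material modeling is key to making 3D content more realistic; 2. Gaussian representations are well suited to multi-channel material modeling and provide flexibility for dynamic lighting.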

Abstract: The increasing demand for 3D assets across various industries necessitates efficient and automated methods for 3D content creation. Leveraging 3D Gaussian Splatting, recent large reconstruction models (LRMs) have demonstrated the ability to efficiently achieve high-quality 3D rendering by integrating multiview diffusion for generation and scalable transformers for reconstruction. However, existing models fail to produce the material properties of assets, which are crucial for realistic rendering in diverse lighting environments. In this paper, we introduce the Large Material Gaussian Model (MGM), a novel framework designed to generate high-quality 3D content with Physically Based Rendering (PBR) materials, i.e., albedo, roughness, and metallic properties, rather than merely producing RGB textures with uncontrolled light baking. Specifically, we first fine-tune a new multiview material diffusion model conditioned on input depth and normal maps. Utilizing the generated multiview PBR images, we explore a Gaussian material representation that not only aligns with 2D Gaussian Splatting but also models each channel of the PBR materials. The reconstructed point clouds can then be rendered to acquire PBR attributes, enabling dynamic relighting by applying various ambient light maps. Extensive experiments demonstrate that the materials produced by our method not only exhibit greater visual appeal compared to baseline methods but also enhance material modeling, thereby enabling practical downstream rendering applications.

[84] Self-Supervised Point Cloud Completion based on Multi-View Augmentations of Single Partial Point Cloud cs.CVPDF

Jingjing Lu, Huilong Pi, Yunchuan Qin, Zhuo Tang, Ruihui Li

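TL;DR: The paper proposes a self-supervised point cloud completion method based on multi-view augmentations of a single partial point cloud. It removes the dependence of existing methods on ground truth or complete point clouds and improves completion quality by introducing a Mamba module.

Details

Motivation: In point cloud completion, supervised methods depend on ground truth, which limits generalization; unsupervised methods require complete point clouds; and existing self-supervised methods have weak signals that lead to poor predictions. The paper aims for a more effective self-supervised method through multi-view augmentation and Mamba.

Result: The method achieves state-of-the-art results on synthetic and real-world datasets, confirming its effectiveness.

Insight: Combining multi-view augmentation with Mamba significantly improves the quality of self-supervised point cloud completion and suggests a new direction for learning without ground truth.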

Abstract: Point cloud completion aims to reconstruct complete shapes from partial observations. Although current methods have achieved remarkable performance, they still have some limitations: Supervised methods heavily rely on ground truth, which limits their generalization to real-world datasets due to the synthetic-to-real domain gap. Unsupervised methods require complete point clouds to compose unpaired training data, and weakly-supervised methods need multi-view observations of the object. Existing self-supervised methods frequently produce unsatisfactory predictions due to the limited capabilities of their self-supervised signals. To overcome these challenges, we propose a novel self-supervised point cloud completion method. We design a set of novel self-supervised signals based on multi-view augmentations of the single partial point cloud. Additionally, to enhance the model’s learning ability, we are the first to incorporate Mamba into the self-supervised point cloud completion task, encouraging the model to generate point clouds with better quality. Experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art results.

[85] MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models cs.CVPDF

Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser

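TL;DR: MultiMat is a multimodal program synthesis framework that uses large multimodal models to process both visual and textual graph representations, improving the generation of procedural material graphs.

Details

Motivation: Procedural material graphs are essential in computer graphics, but creating them requires professional skill. Existing neural program synthesis methods use only textual representations and ignore the visual-spatial nature of the graphs.

Result: The method outperforms text-only baselines, with higher visual quality and fidelity, and sets a new state of the art.

Insight: Multimodal input (visual plus textual) captures the nature of procedural material graphs more accurately and improves synthesis.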

Abstract: Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structures and intermediate states provide an intuitive understanding and workflow for interactive appearance modeling. Creating such graphs is a challenging task and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

[86] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing cs.CV | cs.CLPDF

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang

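TL;DR: MinerU2.5 is a 1.2B-parameter vision-language model that parses high-resolution documents efficiently through a two-stage decoupled strategy (global layout analysis followed by local content recognition) while keeping computation low.

Details

Motivation: Existing document parsing models are computationally expensive on high-resolution inputs and struggle to handle global layout and local detail at the same time.

Result: The model achieves state-of-the-art performance on multiple benchmarks with significantly lower computational overhead than other models.

Insight: Combining the decoupled strategy with the data engine lets the model handle global structure while preserving local detail, offering a new approach to efficient document parsing.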

Abstract: We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

[87] Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models cs.CVPDF

Jiaqi Liu, Lang Sun, Ronghao Fu, Bo Yang

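TL;DR: The paper proposes the Perceptually-Grounded Geo-CoT framework, which uses multi-step reasoning to address the weak reasoning of remote sensing vision-language models. A two-stage alignment strategy (SFT and GRPO) and the Geo-CoT380k dataset are used to train the RSThinker model, which shows markedly better reasoning and performance.

Details

Motivation: Existing vision-language models for remote sensing lack verifiable reasoning steps because of end-to-end training, which causes complex analytical tasks to fail. The paper addresses this with a verifiable multi-step reasoning framework.

Result: RSThinker significantly outperforms existing methods across a range of tasks and outputs verifiable reasoning traces.

Insight: Verifiable multi-step reasoning is key to improving remote sensing vision-language models, and the two-stage alignment strategy effectively combines data-driven learning with policy optimization.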

Abstract: Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Reward Policy Optimization (GRPO) to refine the model’s reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.
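
The second alignment stage scores several sampled responses per prompt and updates the policy toward the better ones. The sketch below shows the group-normalized advantage that is commonly associated with the GRPO acronym, where each response's reward is standardized against the other responses for the same prompt. This is an assumption about the formulation: the abstract expands GRPO as "Group Reward Policy Optimization" and does not spell out the objective, and the function name and `eps` are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples for the same prompt.

    rewards: list of scalar rewards, one per sampled response
    eps:     small constant that avoids division by zero when all rewards tie
    """
    mean = sum(rewards) / len(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above their group's average receive a positive advantage and are reinforced, so no separate value network is needed to provide the baseline.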

[88] Polysemous Language Gaussian Splatting via Matching-based Mask Lifting cs.CV | cs.AIPDF

Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li

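TL;DR: MUSplat is a training-free framework that lifts multi-granularity 2D semantics into 3D Gaussian Splatting scenes and enables open-vocabulary querying through semantic matching, addressing the limitations of conventional methods.

Details

Motivation: Existing methods depend on costly per-scene retraining, are designed to be monosemous and so cannot represent complex multi-concept semantics, and are vulnerable to cross-view semantic inconsistencies.

Result: MUSplat outperforms existing methods on open-vocabulary 3D object selection and semantic segmentation, and reduces scene adaptation time from hours to minutes.

Insight: A training-free framework can greatly improve efficiency, and combining multi-granularity semantic representation with a vision-language model strengthens semantic consistency.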

Abstract: Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. However, mainstream methods suffer from three key flaws: (i) their reliance on costly per-scene retraining prevents plug-and-play application; (ii) their restrictive monosemous design fails to represent complex, multi-concept semantics; and (iii) their vulnerability to cross-view semantic inconsistencies corrupts the final semantic representation. To overcome these limitations, we introduce MUSplat, a training-free framework that abandons feature optimization entirely. Leveraging a pre-trained 2D segmentation model, our pipeline generates and lifts multi-granularity 2D masks into 3D, where we estimate a foreground probability for each Gaussian point to form initial object groups. We then optimize the ambiguous boundaries of these initial groups using semantic entropy and geometric opacity. Subsequently, by interpreting the object’s appearance across its most representative viewpoints, a Vision-Language Model (VLM) distills robust textual features that reconcile visual inconsistencies, enabling open-vocabulary querying via semantic matching. By eliminating the costly per-scene training process, MUSplat reduces scene adaptation time from hours to mere minutes. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, MUSplat outperforms established training-based frameworks while simultaneously addressing their monosemous limitations.
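
Open-vocabulary querying via semantic matching can be reduced, at its simplest, to comparing a query embedding with one stored text feature per object group and returning the closest group. The sketch below does this with cosine similarity. The function names, the use of cosine similarity, and the toy two-dimensional features are assumptions for illustration; MUSplat's actual matching procedure is not specified in the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_query(query_feature, object_features):
    """Return the name of the object group whose text feature best matches the query.

    object_features: dict mapping a group name to its (hypothetical) text feature
    """
    return max(
        object_features,
        key=lambda name: cosine_similarity(query_feature, object_features[name]),
    )
```

Since the stored features are text-derived and computed once per object group, answering a new query only requires embedding the query and running this comparison, with no per-scene optimization.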

[89] UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective cs.CVPDF

Jun He, Yi Lin, Zilong Huang, Jiacong Yin, Junyan Ye

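TL;DR: UrbanFeel is a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on urban development understanding and subjective environmental perception. With 14.3K visual questions across three cognitive dimensions (Static Scene Perception, Temporal Change Understanding, and Subjective Environmental Perception), it reveals how MLLM performance varies across tasks.

Details

Motivation: Urban development affects more than half of the global population, yet there is no benchmark that systematically evaluates MLLMs on temporal evolution and subjective perception in urban environments. UrbanFeel fills this gap.

Result: Gemini-2.5 Pro performs best overall and approaches human expert level. MLLMs do well on static tasks but decline on temporal reasoning tasks, and several models reach human-level or even higher consistency in subjective perception.

Insight: MLLMs show considerable potential for urban scene understanding, especially in pixel-level change detection and subjective evaluation, but temporal reasoning still needs improvement.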

Abstract: Urban development impacts over half of the global population, making human-centered understanding of its structural and perceptual changes essential for sustainable development. While Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various domains, existing benchmarks that explore their performance in urban environments remain limited, lacking systematic exploration of temporal evolution and subjective perception of the urban environment that aligns with human perception. To address these limitations, we propose UrbanFeel, a comprehensive benchmark designed to evaluate the performance of MLLMs in urban development understanding and subjective environmental perception. UrbanFeel comprises 14.3K carefully constructed visual questions spanning three cognitively progressive dimensions: Static Scene Perception, Temporal Change Understanding, and Subjective Environmental Perception. We collect multi-temporal single-view and panoramic street-view images from 11 representative cities worldwide, and generate high-quality question-answer pairs through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation. Through extensive evaluation of 20 state-of-the-art MLLMs, we observe that Gemini-2.5 Pro achieves the best overall performance, with its accuracy approaching human expert levels and narrowing the average gap to just 1.5%. Most models perform well on tasks grounded in scene understanding. In particular, some models even surpass human annotators in pixel-level change detection. However, performance drops notably in tasks requiring temporal reasoning over urban development. Additionally, in the subjective perception dimension, several models reach human-level or even higher consistency in evaluating dimensions such as beauty and safety.

[90] A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation cs.CV

Jiaping Yu, Muli Yang, Jiapeng Ji, Jiexi Yan, Cheng Deng

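TL;DR: The paper proposes EXCL, which tackles source-free unsupervised domain adaptation with a Dual Experts framework and the RAIN optimization pipeline, exploiting the structure of the target data and complementary knowledge without access to source data.

Details

Motivation: Existing SFUDA methods either rely only on the source model's predictions or fine-tune large multimodal models, and both neglect the latent structure of the target data and complementary knowledge. The paper proposes a cooperative learning method to address this.

Result: Experiments on four benchmark datasets show that EXCL matches state-of-the-art performance.

Insight: Cooperative learning makes better use of the latent structure and complementary knowledge in the target data, enabling effective source-free unsupervised domain adaptation.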

Abstract: Source-Free Unsupervised Domain Adaptation (SFUDA) addresses the realistic challenge of adapting a source-trained model to a target domain without access to the source data, driven by concerns over privacy and cost. Existing SFUDA methods either exploit only the source model’s predictions or fine-tune large multimodal models, yet both neglect complementary insights and the latent structure of target data. In this paper, we propose Experts Cooperative Learning (EXCL). EXCL contains the Dual Experts framework and the Retrieval-Augmentation-Interaction optimization pipeline. The Dual Experts framework places a frozen source-domain model (augmented with Conv-Adapter) and a pretrained vision-language model (with a trainable text prompt) on equal footing to mine consensus knowledge from unlabeled target samples. To effectively train these plug-in modules under purely unsupervised conditions, we introduce Retrieval-Augmented-Interaction (RAIN), a three-stage pipeline that (1) collaboratively retrieves pseudo-source and complex target samples, (2) separately fine-tunes each expert on its respective sample set, and (3) enforces learning object consistency via a shared learning result. Extensive experiments on four benchmark datasets demonstrate that our approach matches state-of-the-art performance.

[91] FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing cs.CV

Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong

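TL;DR: FlashEdit is an efficient text-guided image editing framework that decouples speed, structure, and semantics to enable real-time editing, sharply reducing the latency of prior multi-step methods.

Details

Motivation: Diffusion-based text-guided image editing achieves excellent quality, but its iterative process causes prohibitive latency that limits practical use.

Result: Experiments show that FlashEdit completes an edit in under 0.2 seconds, a speedup of more than 150x, while maintaining high background consistency and structural integrity.

Insight: Decoupling speed, structure, and semantics in the editing process is key to efficient image editing; selective feature modification and suppression of semantic leakage are effective ways to improve precision.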

Abstract: Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150$\times$ speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.
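
The background-preservation idea in point (2) of the abstract, modifying features only inside the edit region, can be illustrated with a mask-gated blend. This is a minimal sketch and not FlashEdit's implementation: the function name `shield_blend`, the nested-list feature maps, and the binary mask format are all assumptions made for illustration.

```python
def shield_blend(original, edited, mask):
    """Keep original features outside the edit region, edited ones inside.

    original, edited: 2D lists of floats with the same shape (hypothetical
    stand-ins for feature maps).
    mask: 2D list of 0/1 flags, where 1 marks the edit region (assumed format).
    """
    return [
        [e if m else o for o, e, m in zip(o_row, e_row, m_row)]
        for o_row, e_row, m_row in zip(original, edited, mask)
    ]


original = [[1.0, 1.0], [1.0, 1.0]]
edited = [[9.0, 9.0], [9.0, 9.0]]
mask = [[0, 1], [0, 0]]  # only the top-right position is editable
blended = shield_blend(original, edited, mask)
```

Because positions outside the mask are copied from `original` unchanged, the background is preserved by construction in this toy version.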

[92] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks cs.CV | cs.AI

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Lijun Wang

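TL;DR: The authors introduce Neural-MedBench, a neurology benchmark focused on deep clinical reasoning, intended to address the narrow focus of existing medical AI benchmarks on classification accuracy.

Details

Motivation: Current vision-language models (VLMs) perform well on standard medical benchmarks, but their actual clinical reasoning ability is in doubt. Existing datasets mainly emphasize classification accuracy, so models may appear proficient yet fail at critical diagnostic reasoning.

Result: Evaluation shows that mainstream VLMs (such as GPT-4o and Claude-4) drop sharply on reasoning tasks, with errors dominated by reasoning failures rather than perceptual errors.

Insight: Deep reasoning evaluation is essential to the clinical trustworthiness of medical AI; future evaluation should combine the breadth of large datasets with the rigor of compact, depth-oriented benchmarks.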

Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

Yujian Yuan, Changjie Wu, Xinyuan Chang, Sijin Wang, Hang Zhang

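TL;DR: UniMapGen is a generative framework that builds large-scale maps from multi-modal data. It addresses the reliance of traditional methods on costly data collection and the inherent drawbacks of satellite data, and achieves state-of-the-art performance on the OpenSatMap dataset.

Details

Motivation: Traditional large-scale map construction relies on expensive data collection vehicles and manual annotation, while satellite-based methods suffer from occlusion, outdated imagery, and inefficient vectorization. UniMapGen aims to overcome these limitations.

Result: It achieves the best performance on the OpenSatMap dataset and can infer occluded roads and predict roads missing from the annotations.

Insight: A generative framework with multi-modal inputs and global state updates can improve the efficiency and accuracy of map construction, especially at large scale.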

Abstract: Large-scale map construction is foundational for critical applications such as autonomous driving and navigation systems. Traditional large-scale map construction approaches mainly rely on costly and inefficient special data collection vehicles and labor-intensive annotation processes. While existing satellite-based methods have demonstrated promising potential in enhancing the efficiency and coverage of map construction, they exhibit two major limitations: (1) inherent drawbacks of satellite data (e.g., occlusions, outdatedness) and (2) inefficient vectorization from perception-based methods, resulting in discontinuous and rough roads that require extensive post-processing. This paper presents a novel generative framework, UniMapGen, for large-scale map construction, offering three key innovations: (1) representing lane lines as a discrete sequence and establishing an iterative strategy to generate more complete and smooth map vectors than traditional perception-based methods; (2) proposing a flexible architecture that supports multi-modal inputs, enabling dynamic selection among BEV, PV, and text prompt, to overcome the drawbacks of satellite data; and (3) developing a state update strategy for global continuity and consistency of the constructed large-scale map. UniMapGen achieves state-of-the-art performance on the OpenSatMap dataset. Furthermore, UniMapGen can infer occluded roads and predict roads missing from dataset annotations. Our code will be released.

[94] GS-2M: Gaussian Splatting for Joint Mesh Reconstruction and Material Decomposition cs.CV

Dinh Minh Nguyen, Malte Avenhaus, Thomas Lindemeier

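TL;DR: GS-2M is a unified method based on 3D Gaussian Splatting for jointly reconstructing meshes and decomposing materials from multi-view images. It addresses the limitations of existing methods on highly reflective surfaces and reduces reliance on priors from external models.

Details

Motivation: Existing methods usually treat mesh reconstruction and material decomposition separately and struggle with highly reflective surfaces. They rely on external model priors or sophisticated neural components, which limits their efficiency and scalability. GS-2M aims to address these problems in a unified way.

Result: The method is validated on widely used datasets, with results comparable to state-of-the-art methods, and it delivers triangle meshes and material components usable in downstream tasks.

Insight: Jointly optimizing geometry and material attributes handles highly reflective surfaces better, and supervision based on photometric variation provides a more reliable signal for roughness estimation.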

Abstract: We propose a unified solution for mesh reconstruction and material decomposition from multi-view images based on 3D Gaussian Splatting, referred to as GS-2M. Previous works handle these tasks separately and struggle to reconstruct highly reflective surfaces, often relying on priors from external models to enhance the decomposition results. Conversely, our method addresses these two problems by jointly optimizing attributes relevant to the quality of rendered depth and normals, maintaining geometric details while being resilient to reflective surfaces. Although contemporary works effectively solve these tasks together, they often employ sophisticated neural components to learn scene properties, which hinders their performance at scale. To further eliminate these neural components, we propose a novel roughness supervision strategy based on multi-view photometric variation. When combined with a carefully designed loss and optimization process, our unified framework produces reconstruction results comparable to state-of-the-art methods, delivering triangle meshes and their associated material components for downstream tasks. We validate the effectiveness of our approach with widely used datasets from previous works and qualitative comparisons with state-of-the-art surface reconstruction methods.

[95] MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning cs.CV | cs.RO

Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong

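TL;DR: The paper proposes a task-driven approach to tabletop scene generation that uses 3D spatial reasoning to produce physically plausible scenes matching a task description. The authors also release MesaTask-10K, a dataset of approximately 10,700 synthetic scenes, and present MesaTask, a framework built on a large language model (LLM) and DPO.

Details

Motivation: Existing tabletop scene generation methods depend on time-consuming manual design or purely random layouts, and lack task relevance and physical plausibility. The paper aims to bridge the gap between task instructions and scene generation and to provide high-quality data for robot training.

Result: Experiments show that MesaTask outperforms baselines at generating physically plausible scenes that conform to the task description.

Insight: Task-driven scene generation needs to combine semantic reasoning with physical constraints, and pairing an LLM with DPO shows strong potential.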

Abstract: The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/
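
The abstract mentions DPO without detail. As a reminder of what the standard Direct Preference Optimization objective looks like for one preference pair, here is a minimal sketch. How MesaTask builds its preferred and rejected layouts is not stated above, so the inputs here are plain log-probabilities, and the function name `dpo_loss` and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written as softplus(-margin) for stability
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss falls when the policy favors the chosen response more than
# the reference model does.
good = dpo_loss(-1.0, -5.0, -2.0, -2.0)
bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

When policy and reference agree exactly, the margin is zero and the loss equals log 2.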

[96] Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models cs.CV

Michael Jungo, Andreas Fischer

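TL;DR: The paper examines rule-based reinforcement learning for document image classification and shows that it generalizes better to out-of-distribution data.

Details

Motivation: Reinforcement learning is not yet widely used in document analysis and its potential is underexplored, even though its reasoning capabilities may benefit downstream tasks. The paper studies the effect of rule-based reinforcement learning on document image classification.

Result: Reinforcement learning tends to generalize better to out-of-distribution data than conventional approaches.

Insight: In document image classification, reinforcement learning can improve not only performance but also adaptation to unseen data, which suggests potential for broader applications.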

Abstract: Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emerging properties of reinforcement learning, particularly the enhanced reasoning capabilities. We study the effects of rule-based reinforcement learning on the task of Document Image Classification, which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distribution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.
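
A "simple verifiable reward" for classification can be as small as a string check on the model's final answer. The sketch below illustrates the idea only; the `<answer>...</answer>` tag convention, the case-insensitive match, and the function name `classification_reward` are assumptions and are not taken from the paper or its repository.

```python
import re


def classification_reward(completion, gold_label):
    """Return 1.0 if the label inside <answer>...</answer> matches, else 0.0.

    The tag format is a hypothetical output convention for this sketch.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion,
                      flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return 0.0  # no parsable answer, so no reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_label.strip().lower() else 0.0


reward = classification_reward(
    "<think>Looks like a bill.</think><answer> Invoice </answer>", "invoice")
```

Since the reward depends only on a rule applied to the output text, it needs no learned reward model.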

[97] Jailbreaking on Text-to-Video Models via Scene Splitting Strategy cs.CV | cs.AI

Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim

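TL;DR: The paper proposes SceneSplit, a jailbreak attack on text-to-video (T2V) models that bypasses safety mechanisms by splitting a harmful narrative into multiple individually benign scenes. Iterative scene manipulation and a strategy library strengthen the attack, which markedly raises attack success rates on several T2V models.

Details

Motivation: As T2V models advance rapidly, their safety risks are becoming apparent. Jailbreak attacks on T2V models remain little studied, and the paper aims to fill this gap and expose their vulnerabilities.

Result: Experiments on Luma Ray2, Hailuo, and Veo2 yield average attack success rates of 77.2%, 84.1%, and 78.2%, respectively, significantly outperforming the existing baseline.

Insight: The paper exposes a weakness in T2V safety mechanisms: restructuring the narrative can bypass safety filtering. This points to new directions for improving T2V model safety.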

Abstract: Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack’s overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.

[98] HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models cs.CV | cs.AI | cs.LG

Seyedmorteza Sadat, Farnood Salehi, Romann M. Weber

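TL;DR: The paper proposes history-guided sampling (HiGS), a momentum-based sampling method that applies a weighted average of past predictions at the current sampling step, improving the quality and efficiency of diffusion model generation with no extra training and practically no additional computation.

Details

Motivation: Diffusion models have made remarkable progress in image generation, but their outputs can still look unrealistic and lack detail when few neural function evaluations (NFEs) or low guidance scales are used. The key question is how to improve sampling efficiency and generation quality without adding computation.

Result: Experiments show that HiGS consistently improves generation quality across multiple diffusion models, reaching a state-of-the-art FID of 1.61 on ImageNet 256x256 with only 30 steps (instead of 250).

Insight: History information can effectively guide the diffusion sampling process, improving generation quality and efficiency without additional computation or training.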

Abstract: While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using fewer neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances the quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256×256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.
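
The abstract says HiGS steers sampling with the difference between the current prediction and a weighted average of past predictions. A toy version of that update is sketched below. The exponentially decaying weights, the `weight` and `decay` parameters, the extrapolation direction, and the function name are all assumptions; the actual HiGS update may differ in every one of these details.

```python
def history_guided_prediction(current, history, weight=0.5, decay=0.8):
    """Adjust the current prediction using a weighted average of past ones.

    current: list of floats (stand-in for a model prediction).
    history: list of earlier predictions, oldest first.
    """
    if not history:
        return list(current)  # nothing to compare against yet
    # Assumed weighting: the most recent prediction counts the most.
    weights = [decay ** (len(history) - 1 - i) for i in range(len(history))]
    total = sum(weights)
    average = [
        sum(w * past[j] for w, past in zip(weights, history)) / total
        for j in range(len(current))
    ]
    # Move further along the direction (current - historical average).
    return [c + weight * (c - a) for c, a in zip(current, average)]


guided = history_guided_prediction([1.0, 2.0], [[0.0, 0.0]], weight=0.5)
```

The update reuses predictions the sampler already computed, which is consistent with the abstract's claim of practically no extra computation.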

[99] Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation cs.CV

Jinpeng Lu, Linghan Cai, Yinda Chen, Guo Tang, Songhan Jiang

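TL;DR: The paper proposes VeloxSeg, an efficient, lightweight method for 3D medical image segmentation that combines a dual-stream CNN-Transformer architecture with Paired Window Attention and Johnson-Lindenstrauss (JL) lemma-guided convolution, improving both efficiency and robustness.

Details

Motivation: Lightweight 3D medical image segmentation faces a conflict between efficiency and robustness, particularly with complex anatomical structures and heterogeneous modalities.

Result: On multimodal benchmarks, VeloxSeg improves Dice by 26% and increases GPU and CPU throughput by 11x and 48x, respectively.

Insight: The approach uses the JL lemma to guide feature extraction from high-dimensional data, and balances efficiency and performance through the dual-stream architecture and knowledge transfer.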

Abstract: Lightweight 3D medical image segmentation remains constrained by a fundamental “efficiency / robustness conflict”, particularly when processing complex anatomical structures and heterogeneous modalities. In this paper, we study how to redesign the framework based on the characteristics of high-dimensional 3D images, and explore data synergy to overcome the fragile representation of lightweight methods. Our approach, VeloxSeg, begins with a deployable and extensible dual-stream CNN-Transformer architecture composed of Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC). For each 3D image, we invoke a “glance-and-focus” principle, where PWA rapidly retrieves multi-scale information, and JLC ensures robust local feature extraction with minimal parameters, significantly enhancing the model’s ability to operate under a low computational budget. An extension of the dual-stream architecture then incorporates modal interaction into the multi-scale image-retrieval process, allowing VeloxSeg to model heterogeneous modalities efficiently. Finally, Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior extracted by a self-supervised network into the segmentation network, yielding stronger representations than baselines at no extra inference cost. Experimental results on multimodal benchmarks show that VeloxSeg achieves a 26% Dice improvement, alongside increasing GPU throughput by 11x and CPU by 48x. Code is available at https://github.com/JinPLu/VeloxSeg.
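
For readers unfamiliar with the Johnson-Lindenstrauss lemma that JLC is named after: it states that a random linear projection into a much lower dimension approximately preserves pairwise distances with high probability. The sketch below demonstrates the lemma itself with a Gaussian projection. It does not reproduce how VeloxSeg uses the lemma to design convolutions, and the dimensions, seeds, and function names are illustrative assumptions.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Project points with a Gaussian matrix scaled by 1/sqrt(target_dim)."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(m * x for m, x in zip(row, p)) for row in matrix]
            for p in points]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(512)] for _ in range(5)]
projected = random_projection(points, target_dim=128)
# Ratio of projected to original distance; the lemma says it stays near 1.
ratio = distance(projected[0], projected[1]) / distance(points[0], points[1])
```

Here 512-dimensional points are reduced to 128 dimensions, and the distance ratio stays close to 1 even though the projection matrix is random.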

[100] RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer cs.CV

Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou

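TL;DR: RAPID^3 proposes tri-level reinforced acceleration policies that, without modifying the base Diffusion Transformer (DiT), use lightweight policy heads to achieve image-wise sampling acceleration and substantially faster inference.

Details

Motivation: Current training-free acceleration methods typically rely on a uniform heuristic or a manually designed adaptive strategy, which cannot preserve generation quality for every image. Dynamic neural networks allow per-image adaptation but are costly to fine-tune. RAPID^3 aims to address these limitations.

Result: On DiT backbones such as Stable Diffusion 3 and FLUX, RAPID^3 achieves nearly 3x faster sampling while maintaining competitive generation quality.

Insight: Jointly optimizing with reinforcement learning and adversarial training enables efficient per-image adaptive acceleration without modifying the base model, suggesting a path toward real-time use of large diffusion models.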

Abstract: Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers, a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads - Step-Skip, Cache-Reuse, and Sparse-Attention - observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model’s distribution. Across state-of-the-art DiT backbones, including Stable Diffusion 3 and FLUX, RAPID^3 achieves nearly 3x faster sampling with competitive generation quality.
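
The policy heads are trained with Group Relative Policy Optimization (GRPO). The core of GRPO is that each sampled rollout's advantage is its reward normalized against the other rollouts in the same group, so no separate value network is needed. A minimal sketch of that normalization follows; the reward values, the use of the population standard deviation, and the `eps` constant are assumptions, and how RAPID^3 forms its groups or rewards is not specified above.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three rollouts generated from the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Rollouts that beat their group average receive a positive advantage and are reinforced, while below-average rollouts are discouraged.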

[101] Pedestrian Attribute Recognition via Hierarchical Cross-Modality HyperGraph Learning cs.CV | cs.AI

Xiao Wang, Shujuan Wu, Xiaoxia Cheng, Changwei Bi, Jin Tang

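TL;DR: The paper proposes a hierarchical cross-modal hypergraph learning method based on a multi-modal knowledge graph, which mines the relationships between visual features and text and between attributes and visual samples to improve pedestrian attribute recognition accuracy.

Details

Motivation: Existing pedestrian attribute recognition (PAR) methods do not fully exploit attribute knowledge and contextual information, and model the relationship between vision and semantics inadequately. The work aims to fill this gap by constructing a multi-modal knowledge graph.

Result: Experiments on multiple PAR benchmark datasets validate the effectiveness of the knowledge graph and lay a foundation for knowledge-guided pedestrian attribute recognition.

Insight: Modeling multi-modal relationships with a knowledge graph can improve PAR performance, with particular potential for mining contextual information.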

Abstract: Current Pedestrian Attribute Recognition (PAR) algorithms typically focus on mapping visual features to semantic labels or attempt to enhance learning by fusing visual and attribute information. However, these methods fail to fully exploit attribute knowledge and contextual information for more accurate recognition. Although recent works have started to consider using attribute text as additional input to enhance the association between visual and semantic information, these methods are still in their infancy. To address the above challenges, this paper proposes the construction of a multi-modal knowledge graph, which is utilized to mine the relationships between local visual features and text, as well as the relationships between attributes and extensive visual context samples. Specifically, we propose an effective multi-modal knowledge graph construction method that fully considers the relationships among attributes and the relationships between attributes and vision tokens. To effectively model these relationships, this paper introduces a knowledge graph-guided cross-modal hypergraph learning framework to enhance the standard pedestrian attribute recognition framework. Comprehensive experiments on multiple PAR benchmark datasets have thoroughly demonstrated the effectiveness of our proposed knowledge graph for the PAR task, establishing a strong foundation for knowledge-guided pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR

[102] CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process cs.CV

Arman Akbari, Jian Gao, Yifei Zou, Mei Yang, Jinru Duan

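TL;DR: CircuitSense is a circuit-system benchmark for evaluating visual comprehension and symbolic reasoning. Its 8,006+ problems cover the design workflow from component level to system level and reveal the limitations of multimodal large language models in visual-to-mathematical reasoning.

Details

Motivation: Engineering design requires reasoning from visual input to mathematics, yet how well existing multimodal large language models handle this has not been fully explored. CircuitSense aims to fill this gap by evaluating models across the full circuit design workflow.

Result: Closed-source models exceed 85% accuracy on perception tasks such as component recognition but fall below 19% on symbolic reasoning tasks. Models with stronger symbolic reasoning perform better on design tasks.

Insight: Symbolic reasoning is a core indicator of engineering design competence, and current multimodal models fall markedly short on it, which calls for further research.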

Abstract: Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present CircuitSense, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.

[103] Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results cs.CV

Yasmina Kheddache, Marc Lalonde

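TL;DR: The paper studies the effectiveness of large multimodal models (LMMs) at detecting disinformation and proposes a GPT-4o-based approach covering prompt engineering, a multimodal analysis framework, and evaluation criteria.

Details

Motivation: The rapid spread of multimodal disinformation challenges digital platforms and calls for effective automated detection tools.

Result: The paper evaluates GPT-4o's detection ability on multiple datasets and identifies its strengths and limitations.

Insight: LMMs show potential for multimodal disinformation detection, but their stability and reliability still need improvement.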

Abstract: The proliferation of disinformation, particularly in multimodal contexts combining text and images, presents a significant challenge across digital platforms. This study investigates the potential of large multimodal models (LMMs) in detecting and mitigating false information. We propose to approach multimodal disinformation detection by leveraging the advanced capabilities of the GPT-4o model. Our contributions include: (1) the development of an optimized prompt incorporating advanced prompt engineering techniques to ensure precise and consistent evaluations; (2) the implementation of a structured framework for multimodal analysis, including a preprocessing methodology for images and text to comply with the model’s token limitations; (3) the definition of six specific evaluation criteria that enable a fine-grained classification of content, complemented by a self-assessment mechanism based on confidence levels; (4) a comprehensive performance analysis of the model across multiple heterogeneous datasets (Gossipcop, Politifact, Fakeddit, MMFakeBench, and AMMEBA), highlighting GPT-4o’s strengths and limitations in disinformation detection; (5) an investigation of prediction variability through repeated testing, evaluating the stability and reliability of the model’s classifications; and (6) the introduction of confidence-level and variability-based evaluation methods. These contributions provide a robust and reproducible methodological framework for automated multimodal disinformation analysis.
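
Contribution (5), measuring prediction variability through repeated testing, can be illustrated with a simple agreement statistic over repeated runs on the same item. The abstract does not define the paper's variability measure, so the majority-vote agreement rate below, the label strings, and the function name are assumptions chosen for illustration.

```python
from collections import Counter


def prediction_stability(labels):
    """Return the majority label and the share of runs that agree with it."""
    counts = Counter(labels)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(labels)


# Hypothetical outputs from four repeated runs on one post.
majority, agreement = prediction_stability(["fake", "fake", "real", "fake"])
```

An agreement rate near 1.0 indicates a stable classification, while a rate near 0.5 on a binary task suggests the verdict is close to arbitrary.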

[104] GPT-4 for Occlusion Order Recovery cs.CV | I.4.5

Kaziwa Saleh, Zhyar Rzgar K Rostam, Sándor Szénási, Zoltán Vámossy

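TL;DR: The paper uses a pre-trained GPT-4 model to predict occlusion order, inferring occlusion relationships from an image via a specifically designed prompt. The model works in a zero-shot fashion, needs no annotated data, and can be integrated directly into existing frameworks.

Details

Motivation: Predicting occlusion relationships remains a significant challenge for current vision models on complex, dense real-world images. The paper aims to use GPT-4's advanced capabilities, drawing on semantic context, visual patterns, and commonsense knowledge, to predict occlusion order more accurately.

Result: Evaluation on the COCOA and InstaOrder datasets shows that the method is more accurate than baseline models, particularly in the zero-shot setting.

Insight: The multimodal reasoning ability of large language models such as GPT-4 can overcome limitations of traditional vision models and offers a new approach to occlusion handling.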

Abstract: Occlusion remains a significant challenge for current vision models to robustly interpret complex and dense real-world images and scenes. To address this limitation and to enable accurate prediction of the occlusion order relationship between objects, we propose leveraging the advanced capability of a pre-trained GPT-4 model to deduce the order. By providing a specifically designed prompt along with the input image, GPT-4 can analyze the image and generate order predictions. The response can then be parsed to construct an occlusion matrix which can be utilized in assisting with other occlusion handling tasks and image understanding. We report the results of evaluating the model on COCOA and InstaOrder datasets. The results show that by using semantic context, visual patterns, and commonsense knowledge, the model can produce more accurate order predictions. Unlike baseline methods, the model can reason about occlusion relationships in a zero-shot fashion, which requires no annotated training data and can easily be integrated into occlusion handling frameworks.
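
The step of parsing the model's response into an occlusion matrix can be sketched as follows. The response format used here (one "i occludes j" statement per object pair, with integer object indices) is a hypothetical convention; the abstract does not specify the prompt or the output format, and the function name is likewise illustrative.

```python
import re


def parse_occlusion_matrix(response, num_objects):
    """Build a matrix where matrix[i][j] == 1 means object i occludes j.

    Assumes statements of the form "<i> occludes <j>" in the response.
    """
    matrix = [[0] * num_objects for _ in range(num_objects)]
    pairs = re.findall(r"(\d+)\s+occludes\s+(\d+)", response)
    for occluder, occludee in pairs:
        i, j = int(occluder), int(occludee)
        # Ignore self-occlusion and indices outside the object list.
        if i != j and i < num_objects and j < num_objects:
            matrix[i][j] = 1
    return matrix


response = "0 occludes 1\n2 occludes 1"
matrix = parse_occlusion_matrix(response, 3)
```

The bounds check matters in practice, because a language model may mention objects that are not in the supplied list.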

[105] Gradient-based multi-focus image fusion with focus-aware saliency enhancement cs.CV

Haoyu Li, XiaoSong Li

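TL;DR: The paper proposes a gradient-based multi-focus image fusion method that uses significant boundary enhancement and gradient-domain modeling to improve fused image quality, outperforming 12 state-of-the-art methods on four public datasets.

Details

Motivation: Existing multi-focus image fusion methods preserve focus boundary details poorly, which leads to blurred transitions and loss of focused detail.

Result: Experiments on four public datasets show that the method outperforms 12 existing methods in both subjective and objective evaluations.

Insight: Combining gradient-domain modeling with salient feature extraction effectively addresses boundary blur in multi-focus image fusion.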

Abstract: Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications covering surveillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and loss of focused detail. To solve this problem, we propose an MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus information. In particular, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively preserve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the initial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary information across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. Our code is available at https://github.com/Lihyua/GICI
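
Tenengrad is a classic focus measure: it sums (or averages) the squared Sobel gradient magnitude, so sharper regions score higher. The sketch below computes it over a whole grayscale image in pure Python. Averaging over interior pixels and omitting a threshold are simplifying assumptions, and the paper's saliency-map construction on top of this measure is not reproduced.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def tenengrad(image):
    """Mean squared Sobel gradient magnitude over interior pixels."""
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            total += gx * gx + gy * gy
            count += 1
    return total / count if count else 0.0


# A hard edge versus the same intensity range spread into a smooth ramp.
sharp = [[0, 0, 255, 255] for _ in range(4)]
blurred = [[0, 85, 170, 255] for _ in range(4)]
```

Comparing `tenengrad(sharp)` with `tenengrad(blurred)` shows the hard edge scoring higher, which is the property a fusion method can use to decide which source image is in focus at a given location.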

[106] Text Adversarial Attacks with Dynamic Outputs cs.CV

Wenqiang Wang, Siyuan Liang, Xiao Yan, Xiaochun Cao

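TL;DR: The paper proposes the Textual Dynamic Outputs Attack (TDOA), a text adversarial attack method for dynamic-output scenarios that uses a clustering-based surrogate model and a farthest-label targeted attack strategy to improve attack effectiveness and adaptability.

Details

Motivation: Existing text adversarial attack methods mostly target static scenarios and struggle with dynamic outputs and limited queries, so a more adaptable and efficient method is needed.

Result: Validated on four datasets and eight victim models, TDOA reaches a maximum attack success rate of 50.81% with a single query per text and a maximum ASR of 82.68% in static scenarios. It also performs well on generative tasks.

Insight: The strategy of converting dynamic-output scenarios can extend to other tasks, and the farthest-label strategy exposes model fragility under coarse-grained labels.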

Abstract: Text adversarial attack methods are typically designed for static scenarios with fixed numbers of output labels and a predefined label space, relying on extensive querying of the victim model (query-based attacks) or the surrogate model (transfer-based attacks). To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model’s coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static output scenarios, reaching a maximum ASR of 82.68%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.

[107] Integrating Background Knowledge in Medical Semantic Segmentation with Logic Tensor Networks cs.CV | cs.LG

Luca Bergamin, Giovanna Maria Dimitri, Fabio Aiolli

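TL;DR: The paper proposes combining Logic Tensor Networks (LTNs) with deep learning to incorporate medical background knowledge into semantic segmentation, improving segmentation performance, especially when training data is scarce.

Details

Motivation: Deep-learning-based semantic segmentation of medical images still has shortcomings, particularly in the presence of noise and artifacts. The paper argues that integrating medical background knowledge into the segmentation model's loss function can further improve performance.

Result: Experiments show that LTNs improve baseline segmentation performance, especially when training data is scarce. The authors argue that the approach is general enough to extend to other medical segmentation tasks.

Insight: Neurosymbolic methods such as LTNs offer a new direction for medical image analysis by effectively combining domain knowledge with deep learning to improve model performance and robustness.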

Abstract: Semantic segmentation is a fundamental task in medical image analysis, aiding medical decision-making by helping radiologists distinguish objects in an image. Research in this field has been driven by deep learning applications, which have the potential to scale these systems even in the presence of noise and artifacts. However, these systems are not yet perfected. We argue that performance can be improved by incorporating common medical knowledge into the segmentation model’s loss function. To this end, we introduce Logic Tensor Networks (LTNs) to encode medical background knowledge using first-order logic (FOL) rules. The encoded rules span from constraints on the shape of the produced segmentation, to relationships between different segmented areas. We apply LTNs in an end-to-end framework with a SwinUNETR for semantic segmentation. We evaluate our method on the task of segmenting the hippocampus in brain MRI scans. Our experiments show that LTNs improve the baseline segmentation performance, especially when training data is scarce. Despite being in its preliminary stages, we argue that neurosymbolic methods are general enough to be adapted and applied to other medical semantic segmentation tasks.
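
To make the idea of a logic rule inside a loss function concrete, here is a minimal sketch. It assumes a single rule of the form "for all x, A(x) implies B(x)", the Reichenbach fuzzy implication, and mean aggregation over samples; these are common fuzzy-logic choices, not necessarily the ones used in the paper, and the function names are hypothetical.

```python
def implies(a, b):
    """Reichenbach fuzzy implication on truth values in [0, 1]."""
    return 1.0 - a + a * b

def rule_loss(antecedent, consequent):
    """Loss for the rule 'for all x: A(x) -> B(x)'.

    `antecedent` and `consequent` hold per-sample truth degrees in [0, 1],
    for example predicted probabilities. The rule's satisfaction is the mean
    implication value (an assumed aggregator), and the loss is 1 - satisfaction.
    """
    satisfaction = [implies(a, b) for a, b in zip(antecedent, consequent)]
    return 1.0 - sum(satisfaction) / len(satisfaction)

# Rule fully satisfied: loss is zero.
loss_ok = rule_loss([1.0, 1.0], [1.0, 1.0])
# Rule fully violated: antecedent true, consequent false.
loss_bad = rule_loss([1.0], [0.0])
```

A term like this would be added to the usual segmentation loss, so that predictions violating the background knowledge are penalized even when labels are scarce.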

[108] Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models cs.CVPDF

Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun

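TL;DR: The paper proposes VARE and S-VARE, methods for precisely erasing unsafe concepts in visual autoregressive (VAR) models while preserving generation quality, addressing the failure of existing methods to generalize to VARs.

Details

Motivation: The rapid progress of VAR models has brought new opportunities for text-to-image generation but also raised safety concerns. Existing concept erasure techniques are designed mainly for diffusion models and do not transfer directly to VAR models because of their next-scale token prediction paradigm, leaving safety needs unmet.

Result: Experiments show that VARE and S-VARE erase unsafe concepts precisely while preserving generation quality and diversity, outperforming existing methods.

Insight: Targeted adjustment of the visual tokens associated with unsafe concepts can achieve the safety goal without significantly degrading generation quality, offering a path toward safe deployment of VAR models.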

Abstract: The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework VARE that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left by earlier methods in autoregressive text-to-image generation.

[109] RAU: Reference-based Anatomical Understanding with Vision Language Models cs.CV | cs.AIPDF

Yiwei Li, Yikang Liu, Jiaqi Guo, Lin Zhao, Zheyuan Zhang

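TL;DR: RAU is a vision-language model (VLM) framework for reference-based identification, localization, and segmentation of anatomical structures in medical images. It combines relative spatial reasoning with the fine-grained segmentation capability of SAM2 and improves performance, with particularly strong generalization.

Details

Motivation: Anatomical understanding in medical images requires large amounts of expert-labeled data, which is costly to obtain. RAU uses an annotated reference image to reduce this dependence on labels while exploiting the spatial reasoning ability of VLMs.

Result: Across four datasets (two in-distribution and two out-of-distribution), RAU outperforms a SAM2 fine-tuning baseline, with more accurate segmentation and localization and stronger generalization.

Insight: Combining the spatial reasoning of VLMs with the segmentation capability of SAM2 is an effective approach to medical image analysis that also reduces reliance on annotated data.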

Abstract: Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images, trained on a moderately sized dataset. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline using the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization ability makes it scalable to out-of-distribution datasets, a property crucial for medical image applications. To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.

[110] FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing cs.CVPDF

Hossein Kashiani, Niloufar Alipour Talemi, Fatemeh Afghah

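TL;DR: FreqDebias is a consistency-driven frequency debiasing framework that addresses spectral bias in deepfake detectors, using Fo-Mixup augmentation and dual consistency regularization to improve generalization.

Details

Motivation: Deepfake detectors often learn biases from limited training data and therefore generalize poorly to novel forgery types. The authors identify spectral bias in the frequency domain: over-reliance on specific frequency bands limits generalization.

Result: FreqDebias significantly improves cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.

Insight: Mitigating frequency-domain bias is key to generalizable deepfake detection; dynamic augmentation together with local and global consistency constraints improves model robustness.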

Abstract: Deepfake detectors often struggle to generalize to novel forgery types due to biases learned from limited training data. In this paper, we identify a new type of model bias in the frequency domain, termed spectral bias, where detectors overly rely on specific frequency bands, restricting their ability to generalize across unseen forgeries. To address this, we propose FreqDebias, a frequency debiasing framework that mitigates spectral bias through two complementary strategies. First, we introduce a novel Forgery Mixup (Fo-Mixup) augmentation, which dynamically diversifies frequency characteristics of training samples. Second, we incorporate a dual consistency regularization (CR), which enforces both local consistency using class activation maps (CAMs) and global consistency through a von Mises-Fisher (vMF) distribution on a hyperspherical embedding space. This dual CR mitigates over-reliance on certain frequency components by promoting consistent representation learning under both local and global supervision. Extensive experiments show that FreqDebias significantly enhances cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.

[111] LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer cs.CVPDF

Song Fei, Tian Ye, Lujia Wang, Lei Zhu

TL;DR: LucidFlux proposes a caption-free universal image restoration framework using a large diffusion transformer, achieving superior performance by conditioning degraded inputs strategically without relying on text prompts.

Details

Motivation: Existing image restoration methods often oversmooth or hallucinate under unknown degradation mixtures, highlighting the need for a more robust and caption-free approach.

Result: LucidFlux outperforms baselines on synthetic and real-world benchmarks, with ablations confirming its component necessity.

Insight: The study shows that strategic conditioning, rather than parameter addition or text prompts, is key to robust caption-free image restoration.

Abstract: Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics – conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone’s hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or MLLM captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on – rather than adding parameters or relying on text prompts – is the governing lever for robust and caption-free universal image restoration in the wild.

Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu

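TL;DR: The paper improves the interpretability of multimodal large language models (MLLMs) by leveraging intra-modal interaction, introducing MSEA for the visual modality and ARC for the textual modality to make explanations more accurate and coherent.

Details

Motivation: Existing interpretability research focuses mainly on cross-modal attribution and overlooks intra-modal dependencies. This leads to fragmented visual explanations and spurious activations in textual explanations, reducing the fidelity of explanations of model behavior.

Result: Experiments show that MSEA and ARC consistently outperform existing methods, yielding more faithful and fine-grained explanations of model behavior.

Insight: Intra-modal interaction is a key factor in multimodal interpretability; dynamically adjusting receptive fields and weighting context by relevance improves explanation quality.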

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-k prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
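
As a rough illustration of relevance measured through top-k prediction rankings, the sketch below assumes that alignment is approximated by the Jaccard overlap of two top-k index sets. The abstract does not give the actual ARC formula, so this is only a stand-in, and the function names are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring vocabulary entries."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def topk_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k index sets of two score vectors.

    Used here as an assumed proxy for how well the prediction rankings at a
    context token and at the current token agree (k must be at least 1).
    """
    a, b = set(top_k(scores_a, k)), set(top_k(scores_b, k))
    return len(a & b) / len(a | b)

# Toy next-token scores at a context position and at the current position.
context_scores = [0.1, 0.7, 0.2, 0.0]
current_scores = [0.6, 0.3, 0.05, 0.05]
relevance = topk_overlap(context_scores, current_scores, k=2)
```

A context token with low overlap would be treated as less relevant, and its activation could be down-weighted accordingly.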

[113] $γ$-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition cs.CVPDF

Mishal Fatima, Shashank Agnihotri, Marius Bock, Kanchana Vaishnavi Gandikota, Kristof Van Laerhoven

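TL;DR: The paper proposes γ-Quant, a learnable non-linear quantization method for low-bit pattern recognition, aimed at reducing the data transmission and energy cost of sensor data.

Details

Motivation: Pattern recognition models are usually built on high-bit, pre-processed data, which may not be optimal for automated tasks, especially under low-bandwidth and energy-constrained conditions. The paper explores how learnable quantization can improve recognition performance on low-bit data.

Result: Experiments show that data quantized to as few as 4 bits can perform on par with raw 12-bit data.

Insight: Non-linear quantization can substantially improve task performance on low-bit data, which is relevant for pattern recognition under low-bandwidth and energy-constrained conditions.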

Abstract: Most pattern recognition models are developed on pre-processed data. In computer vision, for instance, RGB images processed through image signal processing (ISP) pipelines designed to cater to human perception are the most frequent input to image analysis networks. However, many modern vision tasks operate without a human in the loop, raising the question of whether such pre-processing is optimal for automated analysis. Similarly, human activity recognition (HAR) on body-worn sensor data commonly takes normalized floating-point data arising from a high-bit analog-to-digital converter (ADC) as an input, despite such an approach being highly inefficient in terms of data transmission, significantly affecting the battery life of wearable devices. In this work, we target low-bandwidth and energy-constrained settings where sensors are limited to low-bit-depth capture. We propose γ-Quant, i.e., the task-specific learning of a non-linear quantization for pattern recognition. We exemplify our approach on raw-image object detection as well as HAR of wearable data, and demonstrate that raw data with a learnable quantization using as few as 4 bits can perform on par with the use of raw 12-bit data. All code to reproduce our experiments is publicly available via https://github.com/Mishalfatima/Gamma-Quant
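
A non-linear quantizer of this kind can be sketched as follows. The sketch assumes a power-law curve with a fixed exponent `gamma` applied before uniform rounding; in the paper the non-linearity is learned per task, and the exact parameterization is not given in the abstract, so the functions below are hypothetical.

```python
def gamma_quantize(x, gamma, bits):
    """Map a normalized sample x in [0, 1] to an integer code of `bits` bits.

    A power-law curve x ** gamma is applied first (assumed form), so that
    gamma < 1 spends more codes on small values than linear quantization.
    """
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), 1.0)  # clamp to the valid range
    return round((x ** gamma) * levels)

def gamma_dequantize(code, gamma, bits):
    """Invert the curve to recover an approximate normalized sample."""
    levels = (1 << bits) - 1
    return (code / levels) ** (1.0 / gamma)

# A dark sample under 4-bit quantization, with and without the curve.
code_curved = gamma_quantize(0.04, gamma=0.5, bits=4)
code_linear = gamma_quantize(0.04, gamma=1.0, bits=4)
```

With `gamma=0.5` the dark sample lands on code 3 and decodes back to roughly 0.04, whereas linear quantization puts it on code 1, which decodes to about 0.067.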

[114] SSVIF: Self-Supervised Segmentation-Oriented Visible and Infrared Image Fusion cs.CVPDF

Zixian Zhao, Xingchen Zhang

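TL;DR: The paper proposes SSVIF, a self-supervised training framework for segmentation-oriented visible and infrared image fusion that removes the need for the labeled data required by existing task-oriented methods.

Details

Motivation: Application-oriented VIF methods require datasets labeled for downstream tasks, which are costly to acquire. The authors aim for a self-supervised method that learns high-level semantic features without labels.

Result: On public datasets, SSVIF outperforms traditional VIF methods and rivals supervised segmentation-oriented methods.

Insight: Without any labeled data, a self-supervised consistency task can improve the semantic information in fused images.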

Abstract: Visible and infrared image fusion (VIF) has gained significant attention in recent years due to its wide application in tasks such as scene segmentation and object detection. VIF methods can be broadly classified into traditional VIF methods and application-oriented VIF methods. Traditional methods focus solely on improving the quality of fused images, while application-oriented VIF methods additionally consider the performance of downstream tasks on fused images by introducing task-specific loss terms during training. However, compared to traditional methods, application-oriented VIF methods require datasets labeled for downstream tasks (e.g., semantic segmentation or object detection), making data acquisition labor-intensive and time-consuming. To address this issue, we propose a self-supervised training framework for segmentation-oriented VIF methods (SSVIF). Leveraging the consistency between feature-level fusion-based segmentation and pixel-level fusion-based segmentation, we introduce a novel self-supervised task, cross-segmentation consistency, that enables the fusion model to learn high-level semantic features without the supervision of segmentation labels. Additionally, we design a two-stage training strategy and a dynamic weight adjustment method for effective joint learning within our self-supervised framework. Extensive experiments on public datasets demonstrate the effectiveness of our proposed SSVIF. Remarkably, although trained only on unlabeled visible-infrared image pairs, our SSVIF outperforms traditional VIF methods and rivals supervised segmentation-oriented ones. Our code will be released upon acceptance.

[115] Bézier Meets Diffusion: Robust Generation Across Domains for Medical Image Segmentation cs.CVPDF

Chen Li, Meilong Xu, Xiaoling Hu, Weimin Lyu, Chao Chen

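TL;DR: The paper proposes a framework that combines Bézier curves with diffusion models for cross-domain medical image generation, improving the robustness of segmentation models.

Details

Motivation: Domain gaps between medical imaging modalities lead to poor model generalization. Existing methods often rely on GAN-based style transfer, which struggles in regions with high variability.

Result: Experiments show that the generated images are realistic and significantly improve segmentation performance on the target domain.

Insight: Combining a geometric method (Bézier curves) with a probabilistic one (diffusion) is effective for cross-domain generation, particularly in medical imaging.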

Abstract: Training robust learning algorithms across different medical imaging modalities is challenging due to the large domain gap. Unsupervised domain adaptation (UDA) mitigates this problem by using annotated images from the source domain and unlabeled images from the target domain to train the deep models. Existing approaches often rely on GAN-based style transfer, but these methods struggle to capture cross-domain mappings in regions with high variability. In this paper, we propose a unified framework, Bézier Meets Diffusion, for cross-domain image generation. First, we introduce a Bézier-curve-based style transfer strategy that effectively reduces the domain gap between source and target domains. The transferred source images enable the training of a more robust segmentation model across domains. Thereafter, using pseudo-labels generated by this segmentation model on the target domain, we train a conditional diffusion model (CDM) to synthesize high-quality, labeled target-domain images. To mitigate the impact of noisy pseudo-labels, we further develop an uncertainty-guided score matching method that improves the robustness of CDM training. Extensive experiments on public datasets demonstrate that our approach generates realistic labeled images, significantly augmenting the target domain and improving segmentation performance.
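
One simple way a Bézier curve can restyle image intensities is as a monotone-looking tone curve. The sketch below assumes normalized intensities in [0, 1], a cubic curve with fixed endpoints 0 and 1, and the intensity itself used as the curve parameter; how the paper chooses or fits its control points is not stated in the abstract, so `p1` and `p2` here are hypothetical free parameters.

```python
def cubic_bezier(t, p0, p1, p2, p3):
    """Evaluate a one-dimensional cubic Bézier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    return s**3 * p0 + 3 * s**2 * t * p1 + 3 * s * t**2 * p2 + t**3 * p3

def remap_intensity(pixels, p1, p2):
    """Remap normalized intensities through a Bézier tone curve.

    Endpoints are pinned to 0 and 1 so black stays black and white stays
    white; the inner control values p1 and p2 shape the contrast in between.
    """
    return [cubic_bezier(v, 0.0, p1, p2, 1.0) for v in pixels]

pixels = [0.0, 0.5, 1.0]
identity = remap_intensity(pixels, 1.0 / 3.0, 2.0 / 3.0)  # leaves values unchanged
brightened = remap_intensity(pixels, 1.0, 1.0)            # lifts mid-tones
```

With control values 1/3 and 2/3 the curve reduces to the identity, which makes it easy to check an implementation before trying other control points.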

[116] PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning cs.CVPDF

Xiangmo Zhao, Nan Yang, Yang Wang, Zhanwen Liu

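TL;DR: PSTTS is a plug-and-play module for efficient event-based spatio-temporal representation learning that discards redundant spatial and temporal tokens, substantially reducing computational overhead.

Details

Motivation: Mainstream methods convert event streams into sequences of event frames but neglect their spatial sparsity and inter-frame motion redundancy, leading to computational inefficiency. Token sparsification methods designed for RGB videos do not suit event data because they rely on unreliable intermediate representations and ignore event noise.

Result: Evaluated on the HARDVS, DailyDVS-200, and SeACT datasets, PSTTS improves efficiency while maintaining task accuracy; on DailyDVS-200 it reduces FLOPs by 29-43.6% and increases FPS by 21.6-41.3%.

Insight: The spatio-temporal distribution of raw event data can be used to identify redundancy directly, avoiding the limitations of intermediate representations; the two stages address spatial noise and temporal redundancy respectively.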

Abstract: Mainstream event-based spatio-temporal representation learning methods typically process event streams by converting them into sequences of event frames, achieving remarkable performance. However, they neglect the high spatial sparsity and inter-frame motion redundancy inherent in event frame sequences, leading to significant computational overhead. Existing token sparsification methods for RGB videos rely on unreliable intermediate token representations and neglect the influence of event noise, making them ineffective for direct application to event data. In this paper, we propose Progressive Spatio-Temporal Token Selection (PSTTS), a Plug-and-Play module for event data without introducing any additional parameters. PSTTS exploits the spatio-temporal distribution characteristics embedded in raw event data to effectively identify and discard spatio-temporal redundant tokens, achieving an optimal trade-off between accuracy and efficiency. Specifically, PSTTS consists of two stages, Spatial Token Purification and Temporal Token Selection. Spatial Token Purification discards noise and non-event regions by assessing the spatio-temporal consistency of events within each event frame to prevent interference with subsequent temporal redundancy evaluation. Temporal Token Selection evaluates the motion pattern similarity between adjacent event frames, precisely identifying and removing redundant temporal information. We apply PSTTS to four representative backbones UniformerV2, VideoSwin, EVMamba, and ExACT on the HARDVS, DailyDVS-200, and SeACT datasets. Experimental results demonstrate that PSTTS achieves significant efficiency improvements. Specifically, PSTTS reduces FLOPs by 29-43.6% and increases FPS by 21.6-41.3% on the DailyDVS-200 dataset, while maintaining task accuracy. Our code will be available.

[117] Group Critical-token Policy Optimization for Autoregressive Image Generation cs.CVPDF

Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan

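TL;DR: This paper proposes GCPO, which identifies critical image tokens in autoregressive generation and applies policy optimization to them to improve generation quality. Experiments show that GCPO outperforms existing methods on multiple text-to-image benchmarks.

Details

Motivation: Existing RLVR-based autoregressive image generation methods usually optimize all tokens uniformly and ignore the differing contributions of individual tokens to training. The paper addresses how to identify critical tokens and optimize them effectively.

Result: GCPO outperforms existing methods such as GRPO on multiple text-to-image benchmarks while using only 30% of the image tokens.

Insight: Identifying critical tokens and weighting them dynamically can improve both the efficiency and quality of autoregressive image generation. Finer-grained token selection strategies are a possible direction for future work.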

Abstract: Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens for RLVR’s training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose Group Critical-token Policy Optimization (GCPO), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: (1) Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; (2) Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions; (3) RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.

[118] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation cs.CVPDF

Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu

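TL;DR: EAGLE is a lightweight black-box framework for explaining autoregressive token generation in multimodal large language models (MLLMs). It quantifies the relative influence of the visual modality and language priors and uses greedy search for faithful and efficient attribution.

Details

Motivation: MLLMs align visual inputs with language well, but how much generated tokens depend on the visual modality is unclear, which limits interpretability and reliability.

Result: Experiments show that EAGLE outperforms existing methods in faithfulness, localization, and hallucination diagnosis while using less GPU memory.

Insight: EAGLE provides a practical tool for MLLM interpretability and sheds light on how visual evidence and language priors interact in model decisions.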

Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.

[119] Color Names in Vision-Language Models cs.CVPDF

Alexandra Gomez-Villa, Pablo Hernández-Cámara, Muhammad Atif Butt, Valero Laparra, Jesus Malo

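TL;DR: The paper presents the first systematic evaluation of color naming in vision-language models (VLMs), finding high accuracy on prototypical colors, a significant drop on non-prototypical colors, and an independent effect of language model architecture on color naming.

Details

Motivation: Color is a fundamental dimension of human visual perception and communication, so understanding whether VLMs name colors as humans do is crucial for effective human-AI interaction.

Result: VLMs perform well on prototypical colors and poorly on non-prototypical ones; language model architecture significantly influences color naming.

Insight: Color naming is influenced by language model architecture independently of visual processing, and training is imbalanced across languages.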

Abstract: Color serves as a fundamental dimension of human visual perception and a primary means of communicating about objects and scenes. As vision-language models (VLMs) become increasingly prevalent, understanding whether they name colors like humans is crucial for effective human-AI interaction. We present the first systematic evaluation of color naming capabilities across VLMs, replicating classic color naming methodologies using 957 color samples across five representative models. Our results show that while VLMs achieve high accuracy on prototypical colors from classical studies, performance drops significantly on expanded, non-prototypical color sets. We identify 21 common color terms that consistently emerge across all models, revealing two distinct approaches: constrained models using predominantly basic terms versus expansive models employing systematic lightness modifiers. Cross-linguistic analysis across nine languages demonstrates severe training imbalances favoring English and Chinese, with hue serving as the primary driver of color naming decisions. Finally, ablation studies reveal that language model architecture significantly influences color naming independent of visual processing capabilities.

[120] EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model cs.CVPDF

Andrii Litvynchuk, Ivan Livinsky, Anand Ravi, Nima Kalantari, Andrii Tsarov

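TL;DR: EfficientDepth is a fast, detail-preserving monocular depth estimation model that combines a transformer with a lightweight convolutional decoder and uses multi-stage optimization and a new loss function to estimate depth efficiently.

Details

Motivation: Existing monocular depth estimation methods fall short in geometric consistency, fine detail, robustness, and computational efficiency, and so do not meet the requirements of 3D reconstruction and view synthesis.

Result: Experiments show that EfficientDepth performs comparably to or better than state-of-the-art models with significantly reduced computational resources.

Insight: Pairing a transformer with a lightweight convolutional decoder balances computational efficiency against detail preservation, and multi-stage optimization plus the new loss function further improve the model's practicality.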

Abstract: Monocular depth estimation (MDE) plays a pivotal role in various computer vision applications, such as robotics, augmented reality, and autonomous driving. Despite recent advancements, existing methods often fail to meet key requirements for 3D reconstruction and view synthesis, including geometric consistency, fine details, robustness to real-world challenges like reflective surfaces, and efficiency for edge devices. To address these challenges, we introduce a novel MDE system, called EfficientDepth, which combines a transformer architecture with a lightweight convolutional decoder, as well as a bimodal density head that allows the network to estimate detailed depth maps. We train our model on a combination of labeled synthetic and real images, as well as pseudo-labeled real images, generated using a high-performing MDE method. Furthermore, we employ a multi-stage optimization strategy to improve training efficiency and produce models that emphasize geometric consistency and fine detail. Finally, in addition to commonly used objectives, we introduce a loss function based on LPIPS to encourage the network to produce detailed depth maps. Experimental results demonstrate that EfficientDepth achieves performance comparable to or better than existing state-of-the-art models, with significantly reduced computational resources.

[121] Category Discovery: An Open-World Perspective cs.CVPDF

Zhenqi He, Yuanpei Liu, Kai Han

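TL;DR: This survey, "Category Discovery: An Open-World Perspective", reviews research on the category discovery (CD) task, discussing methods and challenges with a focus on three core components: representation learning, label assignment, and class number estimation.

Details

Motivation: Category discovery aims to automatically categorize unlabeled data containing instances of unseen classes and is an important problem in open-world learning. However, the performance of existing methods and the challenges they face across settings had not been systematically summarized.

Result: The survey finds that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training benefit category discovery, while label assignment design, class number estimation, and multi-object scenarios remain challenging.

Insight: Key insights include the importance of pretrained models and the difficulty of class number estimation and label assignment. Future work should address multi-object scenarios and robustness.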

Abstract: Category discovery (CD) is an emerging open-world learning task, which aims at automatically categorizing unlabelled data containing instances from unseen classes, given some labelled data from seen classes. This task has attracted significant attention over the years and has led to a rich body of literature trying to address the problem from different perspectives. In this survey, we provide a comprehensive review of the literature, and offer detailed analysis and in-depth discussion on different methods. Firstly, we introduce a taxonomy for the literature by considering two base settings, namely novel category discovery (NCD) and generalized category discovery (GCD), and several derived settings that are designed to address the extra challenges in different real-world application scenarios, including continual category discovery, skewed data distribution, federated category discovery, etc. Secondly, for each setting, we offer a detailed analysis of the methods encompassing three fundamental components, representation learning, label assignment, and estimation of class number. Thirdly, we benchmark all the methods and distill key insights showing that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training are all beneficial for category discovery, while challenges remain in the design of label assignment, the estimation of class numbers, and scaling to complex multi-object scenarios. Finally, we discuss the key insights from the literature so far and point out promising future research directions. We compile a living survey of the category discovery literature at https://github.com/Visual-AI/Category-Discovery.

[122] HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection cs.CVPDF

Mohammad Mahdi Hemmatyar, Mahdi Jafari, Mohammad Amin Yousefi, Mohammad Reza Nemati, Mobin Azadani

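TL;DR: HyCoVAD is a hybrid model that combines self-supervised learning (SSL) with a large language model (LLM) for complex video anomaly detection. The SSL module captures low-level spatiotemporal patterns in video and the LLM validates semantic context, improving performance while reducing LLM computation.

Details

Motivation: Complex video anomaly detection requires modeling intricate relationships and temporal dependencies among multiple entities. Existing SSL methods lack semantic understanding, while LLMs are computationally expensive and lack fine-grained spatial localization. HyCoVAD aims to combine the strengths of both.

Result: On the ComplexVAD dataset, HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5%, while reducing LLM computation.

Insight: Combining the low-level modeling of SSL with the semantic reasoning of LLMs improves complex anomaly detection, and adaptive thresholding helps control computational cost.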

Abstract: Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies, which are events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD, Hybrid Complex Video Anomaly Detection, a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with an LLM validator. The SSL module is built upon an nnFormer backbone, a transformer-based model for image segmentation. Trained with multiple proxy tasks, it learns from video frames to identify those suspected of containing anomalies. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.

Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie

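TL;DR: JanusVLN is a vision-and-language navigation framework that uses a dual implicit neural memory to decouple semantic and spatial information, avoiding the redundancy of explicit memory and improving navigation efficiency.

Details

Motivation: Existing vision-and-language navigation methods rely on explicit semantic memory, such as textual cognitive maps or stored historical visual frames, which causes spatial information loss and computational redundancy. Inspired by implicit scene representation in human navigation, JanusVLN introduces a dual implicit memory to address these problems.

Result: JanusVLN outperforms more than 20 recent methods and achieves state-of-the-art performance. Its success rate improves by 10.5-35.5 over methods that use multiple data types as input and by 3.6-10.8 over methods that use more RGB training data.

Insight: Dual implicit memory is a new paradigm for VLN: decoupling semantic and spatial information improves navigation efficiency and suggests directions for future research.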

Abstract: Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain’s semantic understanding and the right brain’s spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding windows, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Our project page: https://miv-xjtu.github.io/JanusVLN.github.io/.
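
The fixed-size memory described here, which keeps key-value entries only for the initial tokens and a sliding window of recent tokens, can be sketched as an index-selection rule. The function name and the parameters `n_initial` and `window` are hypothetical, and the sizes used by the authors are not given in the abstract.

```python
def retained_positions(seq_len, n_initial, window):
    """Token positions whose KV cache entries are kept.

    Keeps the first `n_initial` positions plus the most recent `window`
    positions; everything in between is dropped, so the cache size stays
    bounded by n_initial + window regardless of sequence length.
    """
    head = range(min(n_initial, seq_len))
    tail = range(max(seq_len - window, 0), seq_len)
    return sorted(set(head) | set(tail))

# After 10 tokens, with 2 initial slots and a window of 3.
kept = retained_positions(10, n_initial=2, window=3)
```

Here positions 0, 1, 7, 8, and 9 are kept. For short sequences the two ranges overlap and nothing is dropped.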

[124] SpikeMatch: Semi-Supervised Learning with Temporal Dynamics of Spiking Neural Networks cs.CV

Jini Yang, Beomseok Oh, Seungryong Kim, Sunok Kim

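TL;DR: SpikeMatch is the first semi-supervised learning framework that exploits the temporal dynamics of spiking neural networks. It uses the leakage factor to generate diverse pseudo-labels and improves performance when labels are limited.

Details

Motivation: Spiking neural networks (SNNs) attract attention for their biological plausibility and energy efficiency, but semi-supervised learning methods for them remain underexplored. This paper aims to fill that gap.

Result: SpikeMatch outperforms existing SSL methods adapted to SNN backbones across several standard benchmarks.

Insight: The temporal dynamics of SNNs are promising for semi-supervised learning and can effectively mitigate confirmation bias.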

Abstract: Spiking neural networks (SNNs) have recently been attracting significant attention for their biological plausibility and energy efficiency, but semi-supervised learning (SSL) methods for SNN-based models remain underexplored compared to those for artificial neural networks (ANNs). In this paper, we introduce SpikeMatch, the first SSL framework for SNNs that leverages the temporal dynamics through the leakage factor of SNNs for diverse pseudo-labeling within a co-training framework. By utilizing agreement among multiple predictions from a single SNN, SpikeMatch generates reliable pseudo-labels from weakly-augmented unlabeled samples to train on strongly-augmented ones, effectively mitigating confirmation bias by capturing discriminative features with limited labels. Experiments show that SpikeMatch outperforms existing SSL methods adapted to SNN backbones across various standard benchmarks.
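To make the two ingredients named in the abstract concrete, the following minimal sketch shows (a) how a leakage factor changes the output of a leaky integrate-and-fire neuron, and (b) an agreement rule for accepting a pseudo-label. Both functions are illustrative assumptions: the hard-reset neuron model and the unanimous-agreement rule are simplifications, not the paper's exact formulation.

```python
from typing import List, Optional


def lif_spike_count(inputs: List[float], leak: float, threshold: float = 1.0) -> int:
    """Count spikes of a leaky integrate-and-fire neuron with hard reset (assumed model)."""
    v, spikes = 0.0, 0
    for x in inputs:
        v = leak * v + x       # the membrane potential decays by `leak` each step
        if v >= threshold:
            spikes += 1
            v = 0.0            # hard reset after a spike
    return spikes


def agreed_pseudo_label(predictions: List[int]) -> Optional[int]:
    """Return the label only if every prediction agrees, otherwise None (assumed rule)."""
    return predictions[0] if len(set(predictions)) == 1 else None


inputs = [0.6, 0.6, 0.6, 0.6]
no_leak = lif_spike_count(inputs, leak=1.0)
strong_leak = lif_spike_count(inputs, leak=0.5)
```

The same input produces different spike counts under different leakage factors, which is one way a single SNN can yield several distinct predictions to compare.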

[125] Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting cs.CV | cs.AI | cs.CL

Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe

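TL;DR: This paper explores 2D Gaussian Splatting (2DGS) as an alternative visual representation for vision-language alignment. It targets the costly transmission of RGB images and the sequence-length explosion of patch-based tokenization, and presents an efficient 2DGS implementation together with an adapted CLIP training framework.

Details

Motivation: Conventional vision-language models rely on RGB image encoders, which brings two inefficiencies: (i) transmitting dense RGB data from edge devices to the cloud is costly; (ii) patch-based tokenization produces overly long sequences. 2DGS, a compact and spatially adaptive representation, is a potential alternative.

Result: On large DataComp subsets, 2DGS encoders achieve meaningful zero-shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels. Accuracy currently trails RGB encoders, but the results establish 2DGS as a viable multimodal representation.

Insight: 2DGS can balance semantic capability and transmission efficiency, offering a new direction for edge-cloud learning. The paper also pinpoints the current architectural bottlenecks and paths to improvement.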

Abstract: Modern vision language pipelines are driven by RGB vision encoders trained on massive image text corpora. While these pipelines have enabled impressive zero shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge cloud learning.
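As a rough illustration of what "parameterizing an image by a set of anisotropic Gaussians" means, the sketch below renders a tiny single-channel image from a list of Gaussians. The tuple layout, the additive blending, and the use of one intensity channel instead of color are all assumptions made for brevity; the paper's batched CUDA pipeline is not reproduced here.

```python
import math
from typing import List, Tuple

# Hypothetical layout: (mean_x, mean_y, a, b, c, intensity),
# where [[a, b], [b, c]] is the inverse covariance of the Gaussian.
Gaussian = Tuple[float, float, float, float, float, float]


def render(gaussians: List[Gaussian], width: int, height: int) -> List[List[float]]:
    """Additively blend Gaussians into a height x width single-channel image."""
    image = [[0.0] * width for _ in range(height)]
    for mx, my, a, b, c, intensity in gaussians:
        for y in range(height):
            for x in range(width):
                dx, dy = x - mx, y - my
                q = a * dx * dx + 2 * b * dx * dy + c * dy * dy
                image[y][x] += intensity * math.exp(-0.5 * q)
    return image


# One Gaussian centred at (2, 2), narrower along x than along y.
splats = [(2.0, 2.0, 1.0, 0.0, 0.25, 1.0)]
img = render(splats, width=5, height=5)
```

Here six numbers describe a 25-pixel image; the compression reported in the abstract comes from the same idea applied with many Gaussians to much larger images.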

[126] LongLive: Real-time Interactive Long Video Generation cs.CV

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao

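TL;DR: LongLive is a real-time, interactive long video generation framework. Its frame-level autoregressive design and KV-recache mechanism address efficiency and consistency, and it supports videos up to 240 seconds long.

Details

Motivation: Long video generation is challenging in both efficiency and quality: diffusion models are inefficient because of bidirectional attention, while autoregressive models degrade in quality on long videos. Interactive requirements, such as streaming prompt inputs, add further complexity.

Result: LongLive sustains 20.7 FPS on a single NVIDIA H100, supports videos up to 240 seconds, and runs INT8-quantized inference with only marginal quality loss.

Insight: KV-recache and streaming long tuning are the key designs; together with short window attention and a frame sink, they address efficiency and consistency in long video generation.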

Abstract: We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shortened to frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
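The combination of short window attention with a frame sink can be pictured as a frame-level attention mask. The sketch below is a minimal illustration under stated assumptions: the function name and the parameters `n_sink` and `window` are hypothetical, and the real model operates on tokens within frames rather than on whole frames.

```python
from typing import List


def frame_sink_window_mask(n_frames: int, n_sink: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k."""
    mask = []
    for q in range(n_frames):
        row = []
        for k in range(n_frames):
            causal = k <= q              # never attend to future frames
            in_window = q - k < window   # recent frames, including q itself
            is_sink = k < n_sink         # the first frames stay visible to every query
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


mask = frame_sink_window_mask(n_frames=6, n_sink=1, window=2)
last_row = mask[5]
```

Each query frame sees at most `n_sink + window` key frames regardless of video length, which is what keeps per-frame cost constant while the sink frames preserve a long-range anchor.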

[127] SPARK: Synergistic Policy And Reward Co-Evolving Framework cs.CV | cs.LG

Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong

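TL;DR: SPARK is a framework in which policy and reward co-evolve. It addresses the high cost and reward-policy mismatch of RLHF, and recycles the rollouts and correctness signals that RLVR discards to improve performance.

Details

Motivation: RLHF relies on human preferences, which is costly and can cause reward-policy mismatch, while RLVR wastes supervision signals. SPARK addresses both by simultaneously training the policy model itself as a generative reward model.

Result: SPARK yields significant gains on multiple LLMs and LVLMs; for example, SPARK-VL-7B improves by an average of 9.7% on 7 reasoning benchmarks.

Insight: By training the model as its own generative reward model, SPARK creates a self-improving loop that delivers efficient, stable gains without relying on an external reward model.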

Abstract: Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
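The recycling step, turning rollouts and their correctness signals into reward-model training data, might look roughly like the sketch below. The function name, the dictionary fields, and the pairing of every correct response with every incorrect one are illustrative assumptions; the paper's reflection-conditioned objective is not covered.

```python
from itertools import product
from typing import Dict, List, Tuple


def recycle_rollouts(prompt: str, rollouts: List[Tuple[str, bool]]) -> Dict[str, list]:
    """Turn (response, is_correct) rollouts into pointwise and pairwise examples."""
    pointwise = [
        {"prompt": prompt, "response": response, "label": int(ok)}
        for response, ok in rollouts
    ]
    correct = [response for response, ok in rollouts if ok]
    wrong = [response for response, ok in rollouts if not ok]
    pairwise = [
        {"prompt": prompt, "chosen": c, "rejected": w}
        for c, w in product(correct, wrong)
    ]
    return {"pointwise": pointwise, "pairwise": pairwise}


data = recycle_rollouts("What is 2 + 2?", [("4", True), ("5", False), ("22", False)])
```

No human preference labels appear anywhere in this data: the verifier's correctness signal supplies both the pointwise labels and the chosen/rejected ordering.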

[128] CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach cs.CV

Alexandre Lopes, Roberto Souza, Helio Pedrini

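TL;DR: CCNeXt is a new self-supervised stereo depth estimation method. It combines a modern CNN feature extractor with a novel windowed epipolar cross-attention module, balancing computational cost and accuracy while outperforming existing CNN and ViT methods.

Details

Motivation: Depth estimation is crucial for robotics and autonomous driving, but limited compute and the difficulty of acquiring ground-truth depth call for efficient, self-supervised methods. Stereo image pairs offer a solution that needs no labeled data.

Result: On the KITTI Eigen Split test data, CCNeXt achieves competitive metrics while being 10.18x faster than the current best model; on the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets it reaches SOTA on all metrics.

Insight: Introducing a windowed epipolar cross-attention module in the encoder lets CCNeXt exploit the spatial information in stereo pairs effectively, improving both the accuracy and the efficiency of depth estimation.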

Abstract: Depth Estimation plays a crucial role in recent applications in robotics, autonomous vehicles, and augmented reality. These scenarios commonly operate under constraints imposed by computational power. Stereo image pairs offer an effective solution for depth estimation, since only the disparity of pixels between the two images needs to be estimated to determine the depth in a known rectified system. Due to the difficulty in acquiring reliable ground-truth depth data across diverse scenarios, self-supervised techniques emerge as a solution, particularly when large unlabeled datasets are available. We propose a novel self-supervised convolutional approach that outperforms existing state-of-the-art Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) while balancing computational cost. The proposed CCNeXt architecture employs a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a comprehensive redesign of the depth estimation decoder. Our experiments demonstrate that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18x faster than the current best model, and achieves state-of-the-art results in all metrics on the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques. To ensure complete reproducibility, our project is accessible at https://github.com/alelopes/CCNext.
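The abstract notes that, in a rectified stereo system, estimating disparity is enough to recover depth. The standard relation is depth = focal length x baseline / disparity, shown below as a small worked example. The rig parameters are made-up illustrative values, not those of any dataset named above.

```python
def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth in metres from disparity in pixels for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 720 px focal length, 0.54 m baseline.
near = disparity_to_depth(disparity_px=72.0, focal_px=720.0, baseline_m=0.54)
far = disparity_to_depth(disparity_px=36.0, focal_px=720.0, baseline_m=0.54)
```

Halving the disparity doubles the estimated depth, so small disparity errors matter most for distant objects.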

[129] UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning cs.CV | I.2.6; I.2.7; I.2.8; I.4.8; I.5.4

Hongyu Chen, Guangrun Wang

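TL;DR: UML-CoT combines Unified Modeling Language (UML) with Chain-of-Thought (CoT) reasoning to improve structured reasoning and planning in robotic room-cleaning tasks.

Details

Motivation: Existing CoT methods rely on unstructured text and lack interpretability and executability, while existing structured CoT methods (such as scene or logic graphs) cannot model higher-order relations or abstract behaviors.

Result: On the new MRoom-30k benchmark, UML-CoT outperforms unstructured CoT in interpretability, planning coherence, and execution success.

Insight: UML is a more expressive and actionable structured reasoning formalism that suits embodied tasks.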

Abstract: Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), but its reliance on unstructured text limits interpretability and executability in embodied tasks. Prior work has explored structured CoTs using scene or logic graphs, yet these remain fundamentally limited: they model only low-order relations, lack constructs like inheritance or behavioral abstraction, and provide no standardized semantics for sequential or conditional planning. We propose UML-CoT, a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic CoTs and executable action plans. UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. Our three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. We evaluate UML-CoT on MRoom-30k, a new benchmark of cluttered room-cleaning scenarios. UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.

[130] LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision cs.CV | cs.CL

Debargha Ganguly, Sumit Kumar, Ishwar Balappanawar, Weicong Chen, Shashank Kambhatla

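TL;DR: Labeling Copilot is a deep research agent for computer vision that automates the curation and annotation of high-quality datasets. Its three core capabilities (Calibrated Discovery, Controllable Synthesis, and Consensus Annotation) improve both the efficiency and the quality of dataset curation.

Details

Motivation: Curating high-quality datasets is a key bottleneck in deploying computer vision systems, and traditional approaches struggle to balance data quality, diversity, and cost. Labeling Copilot tackles this with an agent-driven toolchain.

Result: (1) On COCO, Consensus Annotation averages 14.2 candidate proposals per image (against 7.4 ground-truth objects) and reaches an annotation mAP of 37.1%. (2) On Open Images, it discovers 903 new bounding-box categories. (3) Calibrated Discovery is up to 40x more computationally efficient than alternatives with equivalent sample efficiency.

Insight: An agent-driven toolchain combined with an optimized multi-model consensus mechanism can curate large-scale datasets efficiently and scalably.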

Abstract: Curating high-quality, domain-specific datasets is a major bottleneck for deploying robust vision systems, requiring complex trade-offs between data quality, diversity, and cost when researching vast, unlabeled data lakes. We introduce Labeling Copilot, the first data curation deep research agent for computer vision. A central orchestrator agent, powered by a large multimodal language model, uses multi-step reasoning to execute specialized tools across three core capabilities: (1) Calibrated Discovery sources relevant, in-distribution data from large repositories; (2) Controllable Synthesis generates novel data for rare scenarios with robust filtering; and (3) Consensus Annotation produces accurate labels by orchestrating multiple foundation models via a novel consensus mechanism incorporating non-maximum suppression and voting. Our large-scale validation proves the effectiveness of Labeling Copilot’s components. The Consensus Annotation module excels at object discovery: on the dense COCO dataset, it averages 14.2 candidate proposals per image, nearly double the 7.4 ground-truth objects, achieving a final annotation mAP of 37.1%. On the web-scale Open Images dataset, it navigated extreme class imbalance to discover 903 new bounding box categories, expanding its capability to over 1500 total. Concurrently, our Calibrated Discovery tool, tested at a 10-million sample scale, features an active learning strategy that is up to 40x more computationally efficient than alternatives with equivalent sample efficiency. These experiments validate that an agentic workflow with optimized, scalable tools provides a robust foundation for curating industrial-scale datasets.
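A consensus mechanism that combines non-maximum suppression with voting could be sketched as follows. This is a simplified illustration, not the paper's algorithm: the greedy clustering, the IoU threshold, and the rule "keep a box only if at least `min_votes` distinct models proposed an overlapping box" are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # x1, y1, x2, y2
Detection = Tuple[int, Box, float]        # model_id, box, confidence score


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def consensus(detections: List[Detection], iou_thr: float = 0.5, min_votes: int = 2) -> List[Box]:
    """Greedy NMS across models; keep a box only if enough distinct models agree."""
    remaining = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        cluster = [d for d in remaining if iou(best[1], d[1]) >= iou_thr]
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_thr]
        voters = {best[0]} | {d[0] for d in cluster}
        if len(voters) >= min_votes:
            kept.append(best[1])
    return kept


detections = [
    (0, (0, 0, 10, 10), 0.9),
    (1, (1, 1, 10, 10), 0.8),    # overlaps the first box, different model
    (2, (50, 50, 60, 60), 0.7),  # proposed by a single model only
]
agreed = consensus(detections)
```

Suppression removes duplicate proposals for the same object, while the vote count filters out boxes that only one model believes in.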

[131] Scale-Wise VAR is Secretly Discrete Diffusion cs.CV | cs.LG

Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel

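TL;DR: This paper shows that Visual Autoregressive Generation (VAR), when equipped with a Markovian attention mask, is mathematically equivalent to a discrete diffusion model. It proposes the SRDD framework, which combines the strengths of autoregressive transformers and diffusion models to improve efficiency and generation quality.

Details

Motivation: Autoregressive transformers perform well in visual generation, but their relationship to diffusion models has not been made explicit. This paper aims to uncover the connection between VAR and discrete diffusion so that the advantages of both can be combined.

Result: Across multiple datasets, SRDD improves VAR's convergence speed, inference cost, and zero-shot reconstruction, and produces higher-quality generations.

Insight: There is a deep mathematical connection between autoregressive transformers and diffusion models, and this unified view opens new optimization directions for visual generation.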

Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency and unified architecture with language and vision. Among them, next scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion. We term this reinterpretation as Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion such as iterative refinement and reduce architectural inefficiencies into VAR, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion based perspective of VAR leads to consistent gains in efficiency and generation.
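The difference between a standard scale-wise causal mask and a Markovian one can be sketched at the level of token blocks. The abstract does not spell out the mask, so the definition used here (a token attends to its own scale and the immediately preceding scale only) is an assumption for illustration, as are the function names.

```python
from typing import List


def scale_ids(tokens_per_scale: List[int]) -> List[int]:
    """Scale index of every token, e.g. [1, 2, 3] -> [0, 1, 1, 2, 2, 2]."""
    return [s for s, n in enumerate(tokens_per_scale) for _ in range(n)]


def block_causal_mask(tokens_per_scale: List[int]) -> List[List[bool]]:
    """Each token attends to its own scale and all coarser scales."""
    ids = scale_ids(tokens_per_scale)
    return [[k <= q for k in ids] for q in ids]


def markovian_mask(tokens_per_scale: List[int]) -> List[List[bool]]:
    """Assumed Markovian variant: own scale plus the immediately preceding scale."""
    ids = scale_ids(tokens_per_scale)
    return [[q - 1 <= k <= q for k in ids] for q in ids]


scales = [1, 2, 3]  # tokens per scale, coarse to fine
full = block_causal_mask(scales)
markov = markovian_mask(scales)
```

Under the Markovian mask, the finest scale no longer sees the coarsest one directly, so each refinement step depends only on the previous state, which is the structural property shared with a diffusion chain.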

[132] Hierarchical Representation Matching for CLIP-based Class-Incremental Learning cs.CV | cs.AI

Zhen-Hao Wen, Yan Wang, Ji Feng, Han-Jia Ye, De-Chuan Zhan

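TL;DR: This paper proposes HiErarchical Representation MAtchiNg (HERMAN), a CLIP-based method for Class-Incremental Learning (CIL). It uses LLMs to generate hierarchical textual descriptors and matches them against CLIP features from multiple layers, enabling precise discrimination of visual concepts while resisting forgetting.

Details

Motivation: Existing CIL methods rely on a single template (such as "a photo of a [CLASS]"), which ignores the hierarchical nature of visual concepts and limits discrimination. In addition, CLIP's current feature mapping uses only the last layer and neglects the hierarchical information in earlier layers.

Result: HERMAN outperforms existing methods on multiple CIL benchmarks.

Insight: Hierarchical textual descriptions and multi-layer feature matching are key to improving CLIP's performance in incremental learning.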

Abstract: Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as “a photo of a [CLASS]”, which overlook the hierarchical nature of visual concepts. For example, recognizing “cat” versus “car” depends on coarse-grained cues, while distinguishing “cat” from “lion” requires fine-grained details. Similarly, the current feature mapping in CLIP relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.

[133] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs cs.CV | cs.AI | cs.CL

Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu

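TL;DR: The paper introduces DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces in generated videos, and trains multimodal language models as reward models that mimic human judgments and localizations.

Details

Motivation: Video generation models are advancing rapidly, yet whether humans can detect deepfake traces (spatiotemporally grounded visual artifacts) in generated videos has been largely overlooked. This work aims to fill that gap.

Result: The 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Task difficulty increases from natural-language explanation to spatial grounding to temporal labeling.

Insight: Binary real-versus-fake classification is substantially easier than fine-grained fake trace detection, and within the latter temporal labeling is the hardest task. Foregrounding human-perceived traces supports more socially aware and trustworthy video generation.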

Abstract: Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension has been largely overlooked: whether humans can detect deepfake traces within a generated video, i.e., spatiotemporally grounded visual artifacts that reveal a video as machine generated. We introduce DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.

[134] CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning cs.CV | cs.AI | cs.CL

Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang

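TL;DR: The paper proposes CapRL, which uses reinforcement learning (RL) to overcome the limited generality and diversity of supervised fine-tuning (SFT) for image captioning, and uses a vision-free LLM to verify caption quality.

Details

Motivation: Conventional SFT depends on expensive data annotated by humans or proprietary models, which limits a model's ability to generate diverse and general descriptions. CapRL redefines caption quality so that RL can be applied, overcoming this limitation.

Result: Within the Prism evaluation framework, CapRL performs comparably to Qwen2.5-VL-72B and exceeds the baseline by an average of 8.4%. Pretraining on CapRL-annotated captions yields substantial gains across 12 benchmarks.

Insight: 1. RL can be applied to subjective tasks, provided the reward function is well designed. 2. Caption quality can be measured indirectly through its utility for conveying information. 3. The decoupled pipeline avoids the complexity of end-to-end training.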

Abstract: Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a “good” caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
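The utility-based reward described in the abstract (score a caption by how well a vision-free model answers multiple-choice questions from it) can be sketched as below. The `keyword_answerer` function is a toy stand-in for the vision-free LLM, and all names and the question format are hypothetical.

```python
from typing import Callable, List, Tuple

MCQ = Tuple[str, List[str], int]  # question, options, index of the correct option


def caption_reward(caption: str, questions: List[MCQ],
                   answer_fn: Callable[[str, str, List[str]], int]) -> float:
    """Fraction of questions answered correctly using only the caption."""
    if not questions:
        return 0.0
    correct = sum(answer_fn(caption, q, opts) == gold for q, opts, gold in questions)
    return correct / len(questions)


def keyword_answerer(caption: str, question: str, options: List[str]) -> int:
    """Toy stand-in for a vision-free LLM: pick the first option the caption mentions."""
    for i, option in enumerate(options):
        if option.lower() in caption.lower():
            return i
    return 0  # fall back to guessing the first option


questions = [
    ("Which animal is shown?", ["dog", "cat"], 1),
    ("What color is the sofa?", ["red", "blue"], 0),
]
detailed = caption_reward("A cat sleeping on a red sofa", questions, keyword_answerer)
vague = caption_reward("A sofa", questions, keyword_answerer)
```

The answerer never sees the image, so a caption earns reward only for information it actually states, which makes the reward verifiable even though caption quality itself is subjective.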

[135] RefAM: Attention Magnets for Zero-Shot Referral Segmentation cs.CV

Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari

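TL;DR: This paper proposes RefAM, which exploits attention scores from diffusion transformers to reach a new SOTA on zero-shot referring segmentation without fine-tuning or additional training.

Details

Motivation: Existing referring segmentation methods usually require fine-tuning or a composition of multiple models, which is costly and complex. This work uses the rich semantic information in diffusion models by extracting their attention features directly for downstream tasks.

Result: On zero-shot referring image and video segmentation, RefAM consistently outperforms prior methods and sets a new SOTA without fine-tuning or additional components.

Insight: 1. Attention scores from diffusion models can be used directly for downstream tasks without extra training. 2. Noise and redundancy in attention can be reduced with simple strategies, such as filtering stop words. 3. Zero-shot methods show promise for referring segmentation.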

Abstract: Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.
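The "stop words as attention magnets" observation suggests a simple filtering step: drop the attention maps of stop-word tokens before aggregating. The sketch below illustrates only that step, with flattened toy maps and an illustrative stop-word list; GAS handling and attention redistribution from the abstract are not modeled.

```python
from typing import List, Tuple

STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and"}  # illustrative list


def grounding_map(token_maps: List[Tuple[str, List[float]]]) -> List[float]:
    """Average per-token attention maps after discarding stop-word tokens."""
    kept = [attn for token, attn in token_maps if token.lower() not in STOP_WORDS]
    if not kept:
        raise ValueError("no content tokens left after filtering")
    n = len(kept)
    return [sum(values) / n for values in zip(*kept)]


# Toy flattened 2x2 attention maps, one per prompt token.
token_maps = [
    ("the", [0.9, 0.9, 0.9, 0.9]),   # diffuse, high attention everywhere
    ("red", [0.1, 0.8, 0.1, 0.0]),
    ("car", [0.0, 0.6, 0.2, 0.0]),
]
heatmap = grounding_map(token_maps)
peak = heatmap.index(max(heatmap))
```

Removing the diffuse stop-word map leaves a heatmap that is dominated by the location the content words agree on.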

cs.CL

[136] Influence Guided Context Selection for Effective Retrieval-Augmented Generation cs.CL | cs.AI

Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang

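TL;DR: This paper proposes a new way to assess context quality in Retrieval-Augmented Generation (RAG): the Contextual Influence Value (CI value), which quantifies the quality of each retrieved context. The metric integrates query-aware relevance, list-aware uniqueness, and generator-aware alignment, and significantly improves performance.

Details

Motivation: Conventional RAG is limited by poor-quality contexts containing irrelevant or noisy information. Existing methods select contexts using predefined quality metrics, but fail to use the query, the context list, and the generator holistically.

Result: Experiments show that the method significantly outperforms existing baselines, filtering poor-quality contexts while preserving critical information.

Insight: Jointly considering the query, the context list, and the generator gives a more accurate assessment of context quality, and the CI value allows selection without complex hyperparameter tuning: contexts with positive CI values are simply retained.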

Abstract: Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
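The CI value is described as the performance drop when a context is removed from the list, with positive-valued contexts retained. A leave-one-out sketch under that description follows. The `toy_utility` function is a made-up stand-in for generator performance, and the paper's surrogate model for predicting CI values at inference time is not shown.

```python
from typing import Callable, List, Sequence


def ci_values(contexts: Sequence[str], utility: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out influence: utility with all contexts minus utility without context i."""
    full = utility(list(contexts))
    return [
        full - utility([c for j, c in enumerate(contexts) if j != i])
        for i in range(len(contexts))
    ]


def select_contexts(contexts: Sequence[str], utility: Callable[[List[str]], float]) -> List[str]:
    """Keep only contexts whose removal would hurt, i.e. CI value > 0."""
    return [c for c, v in zip(contexts, ci_values(contexts, utility)) if v > 0]


def toy_utility(ctxs: List[str]) -> float:
    """Hypothetical generator score for the query 'What is the capital of France?'."""
    helpful = 1.0 if any("Paris" in c for c in ctxs) else 0.0
    misleading = 0.5 * sum("Lyon is the capital" in c for c in ctxs)
    return helpful - misleading


contexts = [
    "Paris is the capital of France.",
    "Bananas are yellow.",
    "Lyon is the capital of France.",
]
values = ci_values(contexts, toy_utility)
selected = select_contexts(contexts, toy_utility)
```

The irrelevant context gets a CI value of zero and the misleading one a negative value, so the positive-value rule drops both without any tuned threshold.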

[137] How Large Language Models Need Symbolism cs.CL | cs.AI

Xiaotie Deng, Hanyu Li

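TL;DR: The paper argues that the future of large language models (LLMs) requires more than scaling: human-crafted symbols are needed to guide their intuition and enable genuine discovery.

Details

Motivation: Current LLMs rely mainly on data scale and compute. Their intuition is powerful but undirected, and may not lead to genuine discovery. The authors therefore propose human-crafted symbols as a "compass" for LLM cognition.

Result: The paper reports no experimental results. It argues conceptually that introducing symbolism can compensate for the limitations of purely data-driven models and improve their capacity for discovery.

Insight: The future of AI lies not only in scaling but also in guidance from symbolism. Models driven purely by intuition may lose direction, whereas symbols can guide that intuition toward more efficient and meaningful discovery.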

Abstract: We argue that AI’s future requires more than scaling. To unlock genuine discovery, large language models need a compass: human-crafted symbols to guide their powerful but blind intuition.

[138] One Model, Many Morals: Uncovering Cross-Linguistic Misalignments in Computational Moral Reasoning cs.CL | cs.AI

Sualeha Farid, Jayden Lin, Zean Chen, Shivani Kumar, David Jurgens

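TL;DR: This paper systematically studies the moral reasoning of large language models (LLMs) across languages and cultures, finds significant inconsistencies in their moral judgments across languages, and proposes a structured typology of moral reasoning errors to encourage more culturally aware AI.

Details

Motivation: As LLMs are deployed in multilingual and multicultural settings, a key question is whether moral reasoning learned predominantly from English-language pretraining data generalizes to other contexts.

Result: LLMs' moral judgments are significantly inconsistent across languages, often reflecting cultural misalignment. The paper distills its findings into a structured typology of moral reasoning errors.

Insight: The linguistic and cultural skew of pretraining data shapes an LLM's moral reasoning, which calls for more culturally aware AI models.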

Abstract: Large Language Models (LLMs) are increasingly deployed in multilingual and multicultural environments where moral reasoning is essential for generating ethically appropriate responses. Yet, the dominant pretraining of LLMs on English-language data raises critical concerns about their ability to generalize judgments across diverse linguistic and cultural contexts. In this work, we systematically investigate how language mediates moral decision-making in LLMs. We translate two established moral reasoning benchmarks into five culturally and typologically diverse languages, enabling multilingual zero-shot evaluation. Our analysis reveals significant inconsistencies in LLMs’ moral judgments across languages, often reflecting cultural misalignment. Through a combination of carefully constructed research questions, we uncover the underlying drivers of these disparities, ranging from disagreements to reasoning strategies employed by LLMs. Finally, through a case study, we link the role of pretraining data in shaping an LLM’s moral compass. Through this work, we distill our insights into a structured typology of moral reasoning errors that calls for more culturally-aware AI.

[139] LLM-Based Support for Diabetes Diagnosis: Opportunities, Scenarios, and Challenges with GPT-5 cs.CL

Gaurav Kumar Gupta, Nirajan Acharya, Pranal Pande

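TL;DR: This paper explores how large language models such as GPT-5 can support diabetes diagnosis, demonstrates their potential in scenarios such as symptom recognition and laboratory interpretation, and stresses the importance of reproducible evaluation frameworks in healthcare.

Details

Motivation: Diabetes diagnosis is hampered by difficult early recognition and vague symptoms. LLMs such as GPT-5 produce structured, interpretable outputs that could support clinical decision-making.

Result: On synthetic cases, GPT-5 shows strong alignment with ADA-defined criteria, suggesting it may serve as a dual-purpose tool for clinicians and patients, while underscoring the need for reproducible evaluation frameworks.

Insight: With interpretable generation, GPT-5 may support diabetes diagnosis, but evaluation and validation of such models in healthcare must be rigorous.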

Abstract: Diabetes mellitus is a major global health challenge, affecting over half a billion adults worldwide with prevalence projected to rise. Although the American Diabetes Association (ADA) provides clear diagnostic thresholds, early recognition remains difficult due to vague symptoms, borderline laboratory values, gestational complexity, and the demands of long-term monitoring. Advances in large language models (LLMs) offer opportunities to enhance decision support through structured, interpretable, and patient-friendly outputs. This study evaluates GPT-5, the latest generative pre-trained transformer, using a simulation framework built entirely on synthetic cases aligned with ADA Standards of Care 2025 and inspired by public datasets including NHANES, Pima Indians, EyePACS, and MIMIC-IV. Five representative scenarios were tested: symptom recognition, laboratory interpretation, gestational diabetes screening, remote monitoring, and multimodal complication detection. For each, GPT-5 classified cases, generated clinical rationales, produced patient explanations, and output structured JSON summaries. Results showed strong alignment with ADA-defined criteria, suggesting GPT-5 may function as a dual-purpose tool for clinicians and patients, while underscoring the importance of reproducible evaluation frameworks for responsibly assessing LLMs in healthcare.

[140] A State-of-the-Art SQL Reasoning Model using RLVR cs.CL | cs.AI | cs.DB | cs.LG

Alnur Ali, Ashutosh Baheti, Jonathan Chang, Ta-Chung Chi, Brandon Cui

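TL;DR: The paper presents a SQL reasoning model trained with reinforcement learning with verifiable rewards (RLVR). Using a simple, general-purpose training recipe, it achieves state-of-the-art accuracy on the BIRD benchmark.

Details

Motivation: Enterprise customers need custom reasoning models that can incorporate organization-specific knowledge. Reinforcement learning is promising here, especially in the RLVR setting where the reward function is verifiable.

Result: On the BIRD private test set, the model reaches 73.56% accuracy without self-consistency and 75.68% with self-consistency; in the latter case it also requires fewer generations than the second-best approach.

Insight: Because the framework is simple and general, it is broadly applicable to enterprise domains such as business intelligence, data science, and coding.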

Abstract: Developing custom reasoning models via Reinforcement Learning (RL) that can incorporate organization-specific knowledge has great potential to address problems faced by enterprise customers. In many of these problems, the reward function is verifiable, a setting termed RL with Verifiable Rewards (RLVR). We apply RLVR to a popular data science benchmark called BIRD that measures the ability of an AI agent to convert a natural language query for a database to SQL executions. We apply a simple and general-purpose training recipe involving careful prompt and model selection, a warm-up stage using our offline RL approach called TAO, followed by rigorous online RLVR training. With no additional training data beyond the BIRD training set and no use of proprietary models, our very first submission to the BIRD leaderboard reached state-of-the-art accuracy on the private test set: 73.56% without self-consistency and 75.68% with self-consistency. In the latter case, our model also required fewer generations than the second-best approach. While BIRD is only a proxy task, the simplicity of our framework makes it broadly applicable to enterprise domains such as business intelligence, data science, and coding.
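A verifiable reward for text-to-SQL can be computed by executing the predicted query and comparing its result with that of a reference query. The sketch below uses Python's built-in `sqlite3` module. The multiset comparison and the 0/1 reward are assumptions for illustration, not the benchmark's official scoring code, and the table is made up.

```python
import sqlite3
from collections import Counter


def execution_reward(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> float:
    """Return 1.0 if both queries yield the same multiset of rows, else 0.0."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # invalid SQL earns no reward
    gold = conn.execute(gold_sql).fetchall()
    return 1.0 if Counter(predicted) == Counter(gold) else 0.0


# Small in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0), (3, 40.0)])

gold = "SELECT id FROM orders WHERE amount >= 25"
match = execution_reward(conn, "SELECT id FROM orders WHERE amount > 20", gold)
broken = execution_reward(conn, "SELECT id FROM ordrs", gold)
wrong = execution_reward(conn, "SELECT id FROM orders", gold)
```

Two differently written queries receive full reward when their results match, which is what makes the signal verifiable without a learned reward model. In practice, model-generated SQL should run against a read-only or disposable database, since this sketch executes it directly.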

[141] Learning to Reason with Mixture of Tokens cs.CL | cs.AI | cs.LG

Adit Jain, Brendan Rappazzo

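TL;DR: The paper investigates mixture-of-token generation (MoT-G) in reinforcement learning with verifiable rewards (RLVR). By preserving the model's probability distribution over candidate tokens, it substantially improves performance on reasoning tasks.

Details

Motivation: Current RLVR methods sample only discrete tokens at each reasoning step, discarding the rich distributional information the model holds over candidate tokens. This may unnecessarily constrain the reasoning search space.

Result: With Qwen2.5-1.5B, MoT-G achieves 5-35% gains on 7 out of 10 tasks and reaches comparable accuracy with half the number of trajectories.

Insight: MoT-G's benefits may stem from maintaining higher hidden-state entropy throughout reasoning and from promoting exploration in token space.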

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading approach for improving large language model (LLM) reasoning capabilities. Most current methods follow variants of Group Relative Policy Optimization, which samples multiple reasoning completions, scores them relative to each other, and adjusts the policy accordingly. However, these approaches invariably sample discrete tokens at each reasoning step, discarding the rich distributional information in the model’s probability distribution over candidate tokens. While preserving and utilizing this distributional information has proven beneficial in non-RL settings, current RLVR methods seem to be unnecessarily constraining the reasoning search space by not using this information. To address this limitation, we investigate mixture-of-token generation (MoT-G) in RLVR. We present a unified framework that generalizes existing MoT-G approaches, including existing training-free methods that construct mixture embeddings as weighted sums over token embeddings, and extend RLVR to operate directly in this continuous mixture space for generating chain-of-thought. Evaluating two MoT-G variants on Reasoning-Gym, a suite of reasoning-intensive language tasks, we find that MoT-G methods achieve substantial improvements (5-35% gains on 7 out of 10 tasks) compared to standard decoding with the Qwen2.5-1.5B model, while reaching comparable accuracy with half the number of trajectories, suggesting improved training efficiency. Through comprehensive hidden-state and token-level analyses, we provide evidence that MoT-G’s benefits may stem from its ability to maintain higher hidden-state entropy throughout the reasoning process and promote exploration in token space.
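The abstract mentions training-free methods that build a mixture embedding as a weighted sum over token embeddings. A minimal sketch of that construction follows, assuming the weights are the renormalized top-k softmax probabilities. The function name `mixture_embedding`, the `top_k` truncation, and the toy embedding table are illustrative assumptions, not the paper's exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_embedding(logits, embedding_table, top_k=2):
    """Weighted sum of the top-k token embeddings, with weights renormalized.

    Instead of committing to one sampled token, the next input is a convex
    combination of the most likely candidates' embeddings.
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in top)
    dim = len(embedding_table[0])
    mixed = [0.0] * dim
    for i in top:
        weight = probs[i] / mass
        for d in range(dim):
            mixed[d] += weight * embedding_table[i][d]
    return mixed


# Toy vocabulary of three tokens with 2-dimensional embeddings (assumed values).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two equally likely candidates: the mixture sits halfway between them.
mixed_even = mixture_embedding([2.0, 2.0, -5.0], embedding_table, top_k=2)

# With top_k=1 the mixture collapses to ordinary discrete decoding.
mixed_argmax = mixture_embedding([0.0, 3.0, 1.0], embedding_table, top_k=1)
```

Setting `top_k=1` recovers the usual discrete choice, which makes the contrast with standard decoding easy to see: the mixture keeps information about runner-up tokens that sampling would throw away.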

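[142] Dual-Head Reasoning Distillation: Improving Classifier Accuracy with Train-Time-Only Reasoning cs.CL | cs.AI

Jillian Xu, Dylan Zhou, Vinay Shukla, Yang Yang, Junrui Ruan

TL;DR: The paper proposes Dual-Head Reasoning Distillation (DHRD), which supervises a reasoning head during training and uses only the classification head at test time, improving classification accuracy without slowing inference.

Details

Motivation: Chain-of-Thought (CoT) prompting often improves classification accuracy, but generating rationales significantly reduces throughput. A method is needed that improves accuracy while keeping inference efficient.

Result: On seven SuperGLUE tasks, DHRD yields relative gains of 0.65-5.47% over pooled baselines, with larger gains on entailment/causal tasks. Inference throughput matches pooled classifiers and exceeds CoT decoding by 96-142 times in QPS.

Insight: Supervising reasoning at training time and disabling the reasoning head at test time balances accuracy and efficiency. The dual-head design offers a general template for similar tasks.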

Abstract: Chain-of-Thought (CoT) prompting often improves classification accuracy, but it introduces a significant throughput penalty with rationale generation (Wei et al., 2022; Cheng and Van Durme, 2024). To resolve this trade-off, we introduce Dual-Head Reasoning Distillation (DHRD), a simple training method for decoder-only language models (LMs) that adds (i) a pooled classification head used during training and inference and (ii) a reasoning head supervised by teacher rationales used only in training. We train with a loss function that is a weighted sum of label cross-entropy and token-level LM loss over input-plus-rationale sequences. On seven SuperGLUE tasks, DHRD yields relative gains of 0.65-5.47% over pooled baselines, with notably larger gains on entailment/causal tasks. Since we disable the reasoning head at test time, inference throughput matches pooled classifiers and exceeds CoT decoding on the same backbones by 96-142 times in QPS.
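The training objective described above is a weighted sum of label cross-entropy and token-level LM loss. The sketch below shows one way such a loss could be combined, working on already-normalized probabilities for brevity. The additive form `cls_loss + lm_weight * lm_loss`, the name `dual_head_loss`, and the default weight are assumptions; the abstract does not state the exact weighting scheme.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under `probs`."""
    return -math.log(probs[target])


def dual_head_loss(class_probs, label, token_probs, token_targets, lm_weight=0.5):
    """Label cross-entropy plus `lm_weight` times the mean token-level LM loss.

    class_probs:   classification head output for one example
    token_probs:   per-position next-token distributions from the reasoning head
    token_targets: teacher rationale token ids, one per position
    """
    cls_loss = cross_entropy(class_probs, label)
    lm_loss = sum(
        cross_entropy(p, t) for p, t in zip(token_probs, token_targets)
    ) / len(token_targets)
    return cls_loss + lm_weight * lm_loss


# Toy example: binary classifier and a two-token rationale over a 2-token vocabulary.
class_probs = [0.5, 0.5]
token_probs = [[0.25, 0.75], [0.5, 0.5]]
token_targets = [0, 1]

train_loss = dual_head_loss(class_probs, 0, token_probs, token_targets, lm_weight=0.5)
cls_only = dual_head_loss(class_probs, 0, token_probs, token_targets, lm_weight=0.0)
```

Setting `lm_weight=0.0` leaves only the classification term, which mirrors the test-time situation in which the reasoning head is disabled.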

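[143] On Code-Induced Reasoning in LLMs cs.CL | cs.PL

Abdul Waheed, Zhen Wu, Carolyn Rosé, Daphne Ippolito

TL;DR: Through systematic experiments, the paper examines how code data improves the reasoning ability of large language models (LLMs). It finds that models are more vulnerable to structural perturbations than to semantic ones, and that appropriate abstractions such as pseudocode can retain or even improve performance.

Details

Motivation: To identify which properties of code data (structural or semantic) contribute most to LLM reasoning gains, in order to guide training-data design.

Result: LLMs are more sensitive to structural perturbations, particularly on math and code tasks. Abstractions such as pseudocode can match or even exceed original code, and syntactic style yields task-specific gains: Python favors natural language reasoning, while lower-level languages such as Java and Rust favor math.

Insight: The structural properties of code matter most for LLM reasoning. Training data can use abstractions such as pseudocode to cut token cost while preserving performance, and should account for the match between task and programming language.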

Abstract: Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets in ten programming languages and apply controlled perturbations that selectively disrupt structural or semantic properties of code. We then finetune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, while encoding the same information with fewer tokens without adhering to original syntax can often retain or even improve performance. Remarkably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we aim to provide insight into how different properties of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.

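[144] Vision Language Models Cannot Plan, but Can They Formalize? cs.CL

Muyu He, Yuxi Zheng, Yuchen Liu, Zijian An, Bill Cai

TL;DR: The paper examines why vision language models (VLMs) struggle with multimodal planning and proposes a suite of VLM-as-formalizer pipelines, showing that they outperform end-to-end plan generation, including on open-vocabulary tasks and tasks with low-quality images.

Details

Motivation: Existing VLMs handle simple multimodal planning tasks but fail on long-horizon ones. The paper repositions VLMs as formalizers, mirroring how text-only LLMs translate planning problems into PDDL.

Result: VLM-as-formalizer greatly outperforms end-to-end plan generation, but limited visual capability remains the main bottleneck. Intermediate textual representations help somewhat, though their gains are inconsistent.

Insight: In multimodal planning the limiting factor is vision rather than language. Future work can focus on extracting multimodal information more effectively to support complex planning.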

Abstract: The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs. Instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language like the Planning Domain Definition Language (PDDL), which can call a formal solver to derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce, usually involving gross simplifications such as predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate those on an existing benchmark while presenting another two that for the first time account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations. While generating intermediate, textual representations such as captions or scene graphs partially compensates for the performance, their inconsistent gain leaves headroom for future research directions on multimodal planning formalization.

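[145] “Be My Cheese?”: Assessing Cultural Nuance in Multilingual LLM Translations cs.CL

Madison Van Doren, Cory Holland

TL;DR: The paper assesses how well multilingual AI models culturally adapt English idioms and puns when translating them into a range of global languages, and finds that the handling of cultural nuance still has clear room for improvement.

Details

Motivation: To evaluate multilingual large language models (LLMs) on culturally loaded language such as idioms and puns, addressing the underexplored question of cultural appropriateness in existing research.

Result: LLMs generally produce grammatically correct translations, but cultural adaptation is weak; idioms and puns in particular often require human refinement.

Insight: Current multilingual AI systems have limitations in real-world localisation. The authors recommend larger-scale studies to deliver generalisable insights.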

Abstract: This pilot study explores the localisation capabilities of state-of-the-art multilingual AI models when translating figurative language, such as idioms and puns, from English into a diverse range of global languages. It expands on existing LLM translation research and industry benchmarks, which emphasise grammatical accuracy and token-level correctness, by focusing on cultural appropriateness and overall localisation quality - critical factors for real-world applications like marketing and e-commerce. To investigate these challenges, this project evaluated a sample of 87 LLM-generated translations of e-commerce marketing emails across 24 regional dialects of 20 languages. Human reviewers fluent in each target language provided quantitative ratings and qualitative feedback on faithfulness to the original’s tone, meaning, and intended audience. Findings suggest that, while leading models generally produce grammatically correct translations, culturally nuanced language remains a clear area for improvement, often requiring substantial human refinement. Notably, even high-resource global languages, despite topping industry benchmark leaderboards, frequently mistranslated figurative expressions and wordplay. This work challenges the assumption that data volume is the most reliable predictor of machine translation quality and introduces cultural appropriateness as a key determinant of multilingual LLM performance - an area currently underexplored in existing academic and industry benchmarks. As a proof of concept, this pilot highlights limitations of current multilingual AI systems for real-world localisation use cases. Results of this pilot support the opportunity for expanded research at greater scale to deliver generalisable insights and inform deployment of reliable machine translation workflows in culturally diverse contexts.

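[146] Multi-Objective Reinforcement Learning for Large Language Model Optimization: Visionary Perspective cs.CL | cs.AI | cs.LG | cs.MA

Lingxiao Kong, Cong Yang, Oya Deniz Beyan, Zeyd Boukhers

TL;DR: The paper discusses multi-objective reinforcement learning (MORL) for optimizing large language models (LLMs), introduces a MORL taxonomy, and analyzes the advantages and limitations of different MORL methods. The authors propose a vision for a MORL benchmarking framework and outline future directions, including meta-policy MORL that improves efficiency through a bi-level learning paradigm.

Details

Motivation: As LLMs are widely deployed, optimizing several objectives at once (such as accuracy, efficiency, and personalization) has become a key challenge. MORL offers a potential route, but systematic study and a benchmarking framework are lacking.

Result: The paper reports no experimental results; its analysis identifies the key challenges and research opportunities for MORL in LLM optimization.

Insight: MORL could become a key tool for optimizing LLMs, but its need for better efficiency and flexibility underscores the importance of research on meta-policy MORL.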

Abstract: Multi-Objective Reinforcement Learning (MORL) presents significant challenges and opportunities for optimizing multiple objectives in Large Language Models (LLMs). We introduce a MORL taxonomy and examine the advantages and limitations of various MORL methods when applied to LLM optimization, identifying the need for efficient and flexible approaches that accommodate personalization functionality and inherent complexities in LLMs and RL. We propose a vision for a MORL benchmarking framework that addresses the effects of different methods on diverse objective relationships. As future research directions, we focus on meta-policy MORL development that can improve efficiency and flexibility through its bi-level learning paradigm, highlighting key research questions and potential solutions for improving LLM performance.

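[147] OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja’s Rule cs.CL | cs.AI | cs.LG

Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati

TL;DR: OjaKV is a KV-cache compression framework that combines a hybrid storage policy with online subspace adaptation to relieve the memory bottleneck of long-context large language models.

Details

Motivation: Long-context capability is limited by the high memory footprint of the KV cache, and existing low-rank compression methods degrade when the data distribution shifts.

Result: OjaKV maintains or even improves zero-shot accuracy at high compression ratios, with its strongest gains on very long-context benchmarks that require complex reasoning.

Insight: Online subspace adaptation tracks context shifts effectively, and the hybrid storage policy preserves critical tokens, giving a practical solution for long-context inference without fine-tuning.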

Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model’s weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja’s algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
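Two pieces of the abstract lend themselves to a quick check: the 16GB cache-size figure and the Oja update itself. The sketch below first reproduces the arithmetic, assuming the publicly documented Llama-3.1-8B shape (32 layers, 8 key-value heads, head dimension 128) and 16-bit cache values. It then runs the classic rank-1 Oja rule on a synthetic stream. The rank-1 restriction, the learning rate, and the toy data are simplifying assumptions; OjaKV adapts a multi-dimensional projection basis, and its exact update schedule is not given here.

```python
import math
import random


def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch, bytes_per_value=2):
    """Keys plus values, for every layer, KV head, token, and sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * batch


# Assumed model shape: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
cache_gib = kv_cache_bytes(32, 8, 128, tokens=32 * 1024, batch=4) / 2**30


def oja_step(w, x, lr):
    """One Oja update: w <- w + lr * y * (x - y * w), with y = <w, x>."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]


# Synthetic stream whose dominant direction is `true_dir` (toy data).
rng = random.Random(0)
true_dir = [0.6, 0.8]
w = [1.0, 0.0]  # deliberately misaligned starting estimate
for _ in range(4000):
    scale = rng.gauss(0.0, 2.0)
    x = [scale * u + rng.gauss(0.0, 0.2) for u in true_dir]
    w = oja_step(w, x, lr=0.005)

# Cosine similarity (up to sign) between the estimate and the true direction.
norm = math.sqrt(sum(wi * wi for wi in w))
alignment = abs(sum(wi * ui for wi, ui in zip(w, true_dir))) / norm
```

Under these assumptions the cache works out to 16 GiB, matching the figure quoted in the abstract. The Oja loop touches each sample once and stores only the current estimate, which is why this family of updates suits online adaptation during prefilling and decoding.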

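[148] GRAB: A Risk Taxonomy–Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures cs.CL

Ying Li, Tiejun Ma

TL;DR: GRAB is a benchmark for unsupervised topic discovery in financial disclosures. Its labels are produced by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching, and it provides a standardized evaluation framework.

Details

Motivation: Risk categorization in financial disclosures matters for oversight and investment, yet no public benchmark evaluates unsupervised topic models on this task.

Result: The benchmark provides fixed dataset splits and several evaluation metrics (Accuracy, Macro-F1, and others) to support model comparison.

Insight: GRAB offers a reproducible, standardized evaluation tool for unsupervised topic discovery in finance, filling a gap in the field.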

Abstract: Risk categorization in 10-K risk disclosures matters for oversight and investment, yet no public benchmark evaluates unsupervised topic models for this task. We present GRAB, a finance-specific benchmark with 1.61M sentences from 8,247 filings and span-grounded sentence labels produced without manual annotation by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching. Labels are anchored in a risk taxonomy mapping 193 terms to 21 fine-grained types nested under five macro classes; the 21 types guide weak supervision, while evaluation is reported at the macro level. GRAB unifies evaluation with fixed dataset splits and robust metrics: Accuracy, Macro-F1, Topic BERTScore, and the entropy-based Effective Number of Topics. The dataset, labels, and code enable reproducible, standardized comparison across classical, embedding-based, neural, and hybrid topic models on financial disclosures.
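The abstract does not define the entropy-based Effective Number of Topics. A common entropy-based "effective number" is the exponential of the Shannon entropy of the topic distribution, and the sketch below assumes that definition. Both the formula and the function name `effective_number_of_topics` are assumptions rather than GRAB's documented metric.

```python
import math
from collections import Counter


def effective_number_of_topics(assignments):
    """exp(H) of the empirical topic distribution, with H in nats.

    Equals the topic count when sentences are spread evenly, and approaches 1
    when a single topic absorbs nearly everything.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Toy topic assignments for a handful of sentences (assumed labels).
even = effective_number_of_topics(["a", "b", "c", "d"] * 5)
collapsed = effective_number_of_topics(["a"] * 20)
skewed = effective_number_of_topics(["a"] * 19 + ["b"])
```

Read this way, the metric penalizes a topic model that nominally produces many topics but assigns almost every sentence to one of them.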

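[149] Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval cs.CL

Xiaojun Wu, Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang

TL;DR: The paper presents Think-on-Graph 3.0 (ToG-3), which uses a Multi-Agent Context Evolution and Retrieval (MACER) mechanism to dynamically build and refine a heterogeneous graph index, addressing the static nature of existing graph-based retrieval-augmented generation methods.

Details

Motivation: Existing graph-based retrieval-augmented generation methods depend on high-quality graph structures, but manual construction is expensive and automatic extraction is limited by the performance of lightweight LLMs. ToG-3 aims to overcome these limits through dynamic graph construction and evolving queries.

Result: ToG-3 outperforms the compared baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the effectiveness of the MACER components.

Insight: Evolving the query and the graph index together compensates for the weaknesses of a static graph, and multi-agent collaboration makes reasoning more flexible and precise.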

Abstract: Retrieval-Augmented Generation (RAG) and Graph-based RAG have become an important paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing approaches face a fundamental trade-off. While graph-based methods are inherently dependent on high-quality graph structures, they face significant practical constraints: manually constructed knowledge graphs are prohibitively expensive to scale, while automatically extracted graphs from corpora are limited by the performance of the underlying LLM extractors, especially when using smaller, locally deployed models. This paper presents Think-on-Graph 3.0 (ToG-3), a novel framework that introduces a Multi-Agent Context Evolution and Retrieval (MACER) mechanism to overcome these limitations. Our core innovation is the dynamic construction and refinement of a Chunk-Triplets-Community heterogeneous graph index, which incorporates a dual-evolution mechanism of Evolving Query and Evolving Sub-Graph for precise evidence retrieval. This approach addresses a critical limitation of prior Graph-based RAG methods, which typically construct a static graph index in a single pass without adapting to the actual query. A multi-agent system, comprising Constructor, Retriever, Reflector, and Responser agents, collaboratively engages in an iterative process of evidence retrieval, answer generation, sufficiency reflection, and, crucially, evolving the query and subgraph. This dual-evolving multi-agent system allows ToG-3 to adaptively build a targeted graph index during reasoning, mitigating the inherent drawbacks of static, one-time graph construction and enabling deep, precise reasoning even with lightweight LLMs. Extensive experiments demonstrate that ToG-3 outperforms compared baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the efficacy of the components of the MACER framework.

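[150] Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models cs.CL | cs.SD

Zhen Xiong, Yujun Cai, Zhecheng Li, Junsong Yuan, Yiwei Wang

TL;DR: The paper proposes Thinking-with-Sound (TwS), a framework that combines linguistic reasoning with on-the-fly audio-domain analysis to give Large Audio-Language Models (LALMs) an audio chain-of-thought (Audio CoT), improving their reasoning in complex acoustic scenarios.

Details

Motivation: Existing LALMs perform poorly in complex acoustic scenarios, in particular because they lack access to audio tools such as noise suppression and source separation. TwS addresses this with dynamic audio analysis and multimodal reasoning.

Result: On the MELD-Hard1k benchmark, TwS substantially improves LALM robustness: small models gain 24.73% absolute accuracy, and gains scale up to 36.61% for larger models.

Insight: Audio CoT is an effective way to improve robustness, delivering gains without retraining.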

Abstract: Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain 24.73% absolute accuracy, with improvements scaling consistently up to 36.61% for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.

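[151] SynerGen: Contextualized Generative Recommender for Unified Search and Recommendation cs.CL

Vianne R. Gao, Chen Xue, Marc Versage, Xie Zhou, Zhongruo Wang

TL;DR: SynerGen is a generative recommender that unifies search and recommendation in a single generative backbone, jointly optimized for retrieval and ranking, and improves performance on both.

Details

Motivation: Retrieve-then-rank pipelines in recommender systems suffer from mis-calibration and engineering overhead. Generative sequence models can unify retrieval and ranking, but they usually target either search or recommendation and struggle to serve both.

Result: SynerGen significantly outperforms strong generative recommender baselines and joint search-and-recommendation baselines on widely adopted recommendation and search benchmarks.

Insight: A single generative foundation model is viable for industrial-scale information access, and semantic signals from search and recommendation can reinforce each other.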

Abstract: The dominant retrieve-then-rank pipeline in large-scale recommender systems suffers from mis-calibration and engineering overhead due to its architectural split and differing optimization objectives. While recent generative sequence models have shown promise in unifying retrieval and ranking by auto-regressively generating ranked items, existing solutions typically address either personalized search or query-free recommendation, often exhibiting performance trade-offs when attempting to unify both. We introduce SynerGen, a novel generative recommender model that bridges this critical gap by providing a single generative backbone for both personalized search and recommendation, while simultaneously excelling at retrieval and ranking tasks. Trained on behavioral sequences, our decoder-only Transformer leverages joint optimization with InfoNCE for retrieval and a hybrid pointwise-pairwise loss for ranking, allowing semantic signals from search to improve recommendation and vice versa. We also propose a novel time-aware rotary positional embedding to effectively incorporate time information into the attention mechanism. SynerGen achieves significant improvements on widely adopted recommendation and search benchmarks compared to strong generative recommender and joint search and recommendation baselines. This work demonstrates the viability of a single generative foundation model for industrial-scale unified information access.

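[152] Navigating the Impact of Structured Output Format on Large Language Models through the Compass of Causal Inference cs.CL | cs.LG

Han Yuan, Yue Zhao, Li Zhang, Wuqiong Luo, Zheng Ma

TL;DR: The paper uses causal inference to study how structured output affects the generation quality of large language models (LLMs). It finds no causal impact in most scenarios; in a few, the effect depends on the concrete instructions.

Details

Motivation: Prior studies reach conflicting conclusions about the impact of structured output on generation quality and are limited by restricted test scenarios and weakly controlled comparisons. The paper aims for a finer-grained analysis through causal inference.

Result: Causal inference reveals no causal impact in 43 of 48 scenarios; of the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions.

Insight: The effect of structured output is not one-directional. It depends on the task and the instructions, and calls for causal analysis rather than coarse metrics.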

Abstract: Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs’ generation quality, often presenting one-way findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs’ generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public and one developed reasoning tasks, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o’s generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. In the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions.

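[153] Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment cs.CL | cs.AI

Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang

TL;DR: The paper proposes the Cultural Awareness Reward modeling Benchmark (CARB) to evaluate the cultural awareness of reward models used for large language model (LLM) alignment, and proposes the Think-as-Locals method to improve it.

Details

Motivation: Existing reward model evaluations fall short on cultural awareness because culturally relevant evaluation data is scarce, which hinders global LLM alignment.

Result: CARB exposes deficiencies in current reward models. Think-as-Locals mitigates interference from spurious features and improves culture-aware reward modeling.

Insight: Reward models rely on surface-level features rather than genuine cultural understanding; reinforcement learning can be used to elicit deeper, culturally grounded reasoning.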

Abstract: Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for further advancing global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant evaluation datasets. To fill this gap, we propose Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness and demonstrates a positive correlation between performance on CARB and downstream multilingual cultural alignment tasks. Further analysis identifies the spurious correlations within culture-aware reward modeling, wherein RM’s scoring relies predominantly on surface-level features rather than authentic cultural nuance understanding. To address these, we propose Think-as-Locals to elicit deeper culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR) and employ well-designed rewards to ensure accurate preference judgments and high-quality structured evaluation criteria generation. Experimental results validate its efficacy in mitigating spurious features interference and advancing culture-aware reward modeling.

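[154] Towards Minimal Causal Representations for Human Multimodal Language Understanding cs.CL

Menghua Jiang, Yuncheng Jiang, Haifeng Hu, Sijie Mai

TL;DR: The paper proposes a causality-based approach to multimodal language understanding, the Causal Multimodal Information Bottleneck (CaMIB), which uses the information bottleneck and causal disentanglement to improve out-of-distribution generalization and interpretability.

Details

Motivation: Existing multimodal language understanding methods rely on statistical correlations and are vulnerable to dataset biases, which degrades out-of-distribution generalization. The authors address this with causal reasoning.

Result: Experiments on multimodal sentiment analysis, humor detection, and sarcasm detection, including out-of-distribution test sets, demonstrate the effectiveness of CaMIB.

Insight: Causal reasoning can improve the robustness of multimodal models while making them more interpretable.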

Abstract: Human Multimodal Language Understanding (MLU) aims to infer human intentions by integrating related cues from heterogeneous modalities. Existing works predominantly follow a “learning to attend” paradigm, which maximizes mutual information between data and labels to enhance predictive performance. However, such methods are vulnerable to unintended dataset biases, causing models to conflate statistical shortcuts with genuine causal features and resulting in degraded out-of-distribution (OOD) generalization. To alleviate this issue, we introduce a Causal Multimodal Information Bottleneck (CaMIB) model that leverages causal principles rather than traditional likelihood. Concretely, we first apply the information bottleneck to filter unimodal inputs, removing task-irrelevant noise. A parameterized mask generator then disentangles the fused multimodal representation into causal and shortcut subrepresentations. To ensure global consistency of causal features, we incorporate an instrumental variable constraint, and further adopt backdoor adjustment by randomly recombining causal and shortcut features to stabilize causal estimation. Extensive experiments on multimodal sentiment analysis, humor detection, and sarcasm detection, along with OOD test sets, demonstrate the effectiveness of CaMIB. Theoretical and empirical analyses further highlight its interpretability and soundness.

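[155] ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models cs.CL

Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Guojun Yin

TL;DR: ResT reshapes token-level policy gradients for tool-use tasks: entropy-aware token reweighting stabilizes training and markedly improves the performance of large language models on tool use.

Details

Motivation: Reinforcement learning for tool use usually relies on sparse outcome rewards, which makes training inefficient. The paper establishes a theoretical link between policy entropy and training stability and uses it to derive a more efficient optimization method.

Result: On BFCL and API-Bank, ResT outperforms prior methods by up to 8.76%. Fine-tuned on a 4B base LLM, it surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.

Insight: In tool-use tasks, low-entropy tokens are the primary determinants of reward; dynamically adjusting token weights improves training efficiency and model performance.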

Abstract: Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.

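[156] Semantic Agreement Enables Efficient Open-Ended LLM Cascades cs.CL

Duncan Soiffer, Steven Kolawole, Virginia Smith

TL;DR: The paper proposes an LLM cascade based on semantic agreement: it measures how well model outputs agree in meaning to decide whether to call a larger model, significantly reducing cost and latency while preserving quality.

Details

Motivation: In open-ended text generation, cascade systems struggle to judge output reliability because quality lies on a continuous spectrum and several responses may be valid. The authors use semantic agreement to solve this without relying on model-internal confidence.

Result: Semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%.

Insight: Semantic agreement is a more reliable signal than token-level confidence for deciding when to defer in open-ended generation. The method offers a practical baseline for balancing cost and quality in deployment.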

Abstract: Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement – meaning-level consensus between ensemble outputs – as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
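The deferral rule described above can be sketched as follows: query a small ensemble, measure how much the outputs agree, and call the large model only when agreement is low. In this sketch, token-overlap (Jaccard) similarity stands in for a real semantic similarity measure, and models are plain callables. The names `cascade` and `mean_pairwise_agreement`, the threshold value, and the choice to return the first small-model output are all illustrative assumptions, not the authors' method.

```python
def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def mean_pairwise_agreement(outputs, similarity=jaccard):
    """Average similarity over all unordered pairs (needs at least two outputs)."""
    pairs = [
        similarity(outputs[i], outputs[j])
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
    ]
    return sum(pairs) / len(pairs)


def cascade(prompt, small_models, large_model, threshold=0.6):
    """Answer with the small ensemble when it agrees; otherwise defer."""
    outputs = [model(prompt) for model in small_models]
    if mean_pairwise_agreement(outputs) >= threshold:
        return outputs[0], "small"
    return large_model(prompt), "large"


# Toy black-box "models" (hypothetical callables returning fixed strings).
agreeing = [
    lambda p: "Paris is the capital of France",
    lambda p: "the capital of France is Paris",
]
disagreeing = [lambda p: "It is Lyon", lambda p: "Paris"]
large = lambda p: "Paris"

kept = cascade("Capital of France?", agreeing, large)
deferred = cascade("Capital of France?", disagreeing, large)
```

Because the rule only inspects output text, it needs no logits or other model internals, which is the property that lets this kind of cascade run across black-box APIs.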

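[157] Following the TRACE: A Structured Path to Empathetic Response Generation with Multi-Agent Models cs.CL | cs.MA

Ziqi Liu, Ziyang Zhou, Yilin Li, Haiyang Zhang, Yangbin Chen

TL;DR: The paper proposes TRACE, a framework that decomposes empathetic response generation into a pipeline of analysis and synthesis, uniting deep analysis with fluent generation and significantly improving empathetic dialogue.

Details

Motivation: Existing methods trade off the analytical depth of specialized models against the generative fluency of large language models (LLMs). The authors seek a structured approach that understands the situation before generating a response.

Result: TRACE significantly outperforms strong baselines in both automatic and LLM-based evaluations.

Insight: Structured task decomposition improves empathetic dialogue quality and makes the model more interpretable.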

Abstract: Empathetic response generation is a crucial task for creating more human-like and supportive conversational agents. However, existing methods face a core trade-off between the analytical depth of specialized models and the generative fluency of Large Language Models (LLMs). To address this, we propose TRACE, Task-decomposed Reasoning for Affective Communication and Empathy, a novel framework that models empathy as a structured cognitive process by decomposing the task into a pipeline for analysis and synthesis. By building a comprehensive understanding before generation, TRACE unites deep analysis with expressive generation. Experimental results show that our framework significantly outperforms strong baselines in both automatic and LLM-based evaluations, confirming that our structured decomposition is a promising paradigm for creating more capable and interpretable empathetic agents. Our code is available at https://anonymous.4open.science/r/TRACE-18EF/README.md.

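[158] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping cs.CL | cs.AI | cs.LG

Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang

TL;DR: The paper proposes RL-ZVP, an algorithm that extracts learning signals from zero-variance prompts in LLM reinforcement learning, and significantly improves performance on math reasoning.

Details

Motivation: Current methods such as GRPO learn only from prompts whose sampled responses differ in correctness and ignore zero-variance prompts, where every response receives the same reward. The paper argues that such prompts are not useless and can provide meaningful feedback for policy optimization.

Result: Across six math reasoning benchmarks, RL-ZVP improves over GRPO by up to 8.61 points in accuracy and 7.77 points in pass rate, and consistently outperforms baselines that filter out zero-variance prompts.

Insight: Zero-variance prompts hold untapped potential in RLVR; using them well can improve a model's reasoning ability.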

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward - so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
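To see why zero-variance prompts contribute nothing under group-relative normalization, consider the usual GRPO-style advantage, which standardizes each reward against the other samples drawn for the same prompt. The sketch below assumes population standard deviation and a small `eps` for numerical stability. It illustrates only the problem, not the RL-ZVP advantage shaping itself, whose entropy-guided details are not given in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A prompt with mixed outcomes yields positive and negative advantages.
mixed_group = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Zero-variance prompts: every response gets the same reward, so every
# advantage is exactly zero and the policy gradient for the prompt vanishes.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Both the all-correct and all-wrong groups collapse to zero advantage, so under this normalization a prompt the model always solves and one it never solves are equally invisible to the update.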

[159] QoNext: Towards Next-generation QoE for Foundation Models cs.CLPDF

Yijin Guo, Ye Shen, Farong Wen, Junying Wang, Zicheng Zhang

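TL;DR: QoNext proposes a new Quality of Experience (QoE) based framework for evaluating foundation models: it builds a database from controlled experiments and trains predictive models to estimate user experience.

Details

Motivation: Existing evaluation methods focus only on output correctness and overlook the interaction experience, so they cannot fully reflect user satisfaction.

Result: QoNext enables proactive, fine-grained evaluation and offers actionable guidance for optimizing foundation models in practice.

Insight: User satisfaction depends not only on output quality but also on the interaction experience; QoNext offers a new perspective on model evaluation.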

Abstract: Existing evaluations of foundation models, including recent human-centric approaches, fail to capture what truly matters: the user’s experience during interaction. Current methods treat evaluation as a matter of output correctness alone, overlooking that user satisfaction emerges from the interplay between response quality and interaction, which limits their ability to account for the mechanisms underlying user experience. To address this gap, we introduce QoNext, the first framework that adapts Quality of Experience (QoE) principles from networking and multimedia to the assessment of foundation models. QoNext identifies experiential factors that shape user experience and incorporates them into controlled experiments, where human ratings are collected under varied configurations. From these studies we construct a QoE-oriented database and train predictive models that estimate perceived user experience from measurable system parameters. Our results demonstrate that QoNext not only enables proactive and fine-grained evaluation but also provides actionable guidance for optimizing foundation models in productized services.

[160] AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition cs.CL | cs.AIPDF

Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu

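TL;DR: AutoSCORE is an automated scoring framework built on multi-agent large language models that improves scoring accuracy and interpretability through structured component recognition. It outperforms single-agent baselines on multiple benchmark datasets, with especially strong gains on complex multi-dimensional rubrics and on smaller LLMs.

Details

Motivation: Large language models (LLMs) used for automated scoring suffer from low accuracy, high prompt sensitivity, poor interpretability, and weak rubric alignment, which limits their use in real assessment practice.

Result: On datasets from the ASAP benchmark, AutoSCORE beats single-agent baselines in accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE), with the largest improvements on complex multi-dimensional rubrics and smaller LLMs.

Insight: Combining structured component recognition with a multi-agent design yields a scalable, reliable, and interpretable approach to automated scoring, particularly for complex scoring tasks.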

Abstract: Automated scoring plays a crucial role in education by reducing the reliance on human raters, offering scalable and immediate evaluation of student work. While large language models (LLMs) have shown strong potential in this task, their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment. These issues hinder the implementation of LLM-based automated scoring in assessment practice. To address the limitations, we propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition. With two agents, AutoSCORE first extracts rubric-relevant components from student responses and encodes them into a structured representation (i.e., Scoring Rubric Component Extraction Agent), which is then used to assign final scores (i.e., Scoring Agent). This design ensures that model reasoning follows a human-like grading process, enhancing interpretability and robustness. We evaluate AutoSCORE on four benchmark datasets from the ASAP benchmark, using both proprietary and open-source LLMs (GPT-4o, LLaMA-3.1-8B, and LLaMA-3.1-70B). Across diverse tasks and rubrics, AutoSCORE consistently improves scoring accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE) compared to single-agent baselines, with particularly strong benefits on complex, multi-dimensional rubrics, and especially large relative gains on smaller LLMs. These results demonstrate that structured component recognition combined with multi-agent design offers a scalable, reliable, and interpretable solution for automated scoring.
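The human-machine agreement metric named above, QWK (quadratic weighted kappa), can be sketched in plain Python as follows. This is the standard definition of the metric, not code from the paper; the function name and the integer score range arguments are choices made for this illustration.

```python
from typing import List


def quadratic_weighted_kappa(
    human: List[int], model: List[int], min_score: int, max_score: int
) -> float:
    """Agreement between two raters, penalizing disagreements by squared distance."""
    n = max_score - min_score + 1  # number of score categories (must be >= 2)
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    total = len(human)
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_human[i] * hist_model[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        return 1.0  # both raters used a single identical category
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why the metric is a common complement to MAE and RMSE in scoring studies.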

[161] Why Chain of Thought Fails in Clinical Text Understanding cs.CL | cs.AIPDF

Jiageng Wu, Kevin Xie, Bowen Gu, Nils Krüger, Kueiyu Joshua Lin

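TL;DR: This is the first large-scale systematic evaluation of Chain-of-Thought (CoT) prompting for clinical text understanding. It finds that 86.3% of models perform worse in the CoT setting, exposing the limits of CoT in the clinical domain.

Details

Motivation: Clinical applications demand both high accuracy and transparent reasoning. CoT performs well on other tasks, but its effectiveness on clinical text such as electronic health records (EHRs) has remained unclear.

Result: 86.3% of models degrade in the CoT setting, although more capable models remain relatively robust. In clinical tasks, CoT improves interpretability but may reduce reliability.

Insight: The study reveals a paradox of CoT in clinical settings: it increases transparency while potentially harming reliability, which calls for trustworthy reasoning methods better suited to the clinical domain.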

Abstract: Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.

[162] Debiasing Large Language Models in Thai Political Stance Detection via Counterfactual Calibration cs.CL | cs.AIPDF

Kasidit Sermsri, Teerapong Panboonyuen

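TL;DR: The paper proposes ThaiFACTUAL, a lightweight, model-agnostic debiasing framework that reduces the systematic bias of large language models in Thai political stance detection without fine-tuning. Using counterfactual data augmentation and rationale-based supervision, ThaiFACTUAL significantly reduces spurious correlations and improves zero-shot generalization and fairness.

Details

Motivation: In Thailand's low-resource and culturally complex political setting, stance detection is complicated by indirect language, polarized figures, and entangled sentiment and stance. LLMs often show systematic biases such as sentiment leakage and entity favoritism, which undermine fairness and reliability.

Result: Experiments show that ThaiFACTUAL significantly reduces spurious correlations, strengthens zero-shot generalization, and improves fairness across multiple LLMs.

Insight: Debiasing techniques for low-resource languages need to be culturally grounded; ThaiFACTUAL offers a reference point for other similar languages.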

Abstract: Political stance detection in low-resource and culturally complex settings poses a critical challenge for large language models (LLMs). In the Thai political landscape - marked by indirect language, polarized figures, and entangled sentiment and stance - LLMs often display systematic biases such as sentiment leakage and favoritism toward entities. These biases undermine fairness and reliability. We present ThaiFACTUAL, a lightweight, model-agnostic calibration framework that mitigates political bias without requiring fine-tuning. ThaiFACTUAL uses counterfactual data augmentation and rationale-based supervision to disentangle sentiment from stance and reduce bias. We also release the first high-quality Thai political stance dataset, annotated with stance, sentiment, rationales, and bias markers across diverse entities and events. Experimental results show that ThaiFACTUAL significantly reduces spurious correlations, enhances zero-shot generalization, and improves fairness across multiple LLMs. This work highlights the importance of culturally grounded debiasing techniques for underrepresented languages.

[163] GraphSearch: An Agentic Deep Searching Workflow for Graph Retrieval-Augmented Generation cs.CLPDF

Cehao Yang, Xiaojun Wu, Xueyuan Lin, Chengjin Xu, Xuhui Jiang

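TL;DR: GraphSearch proposes a novel agentic deep searching workflow that improves Graph Retrieval-Augmented Generation (GraphRAG) through a dual-channel retrieval strategy, addressing the shallow retrieval and inefficient use of graph data in existing methods.

Details

Motivation: Existing GraphRAG methods retrieve too shallowly and use structured graph data inefficiently, which limits reasoning over complex queries.

Result: On six multi-hop RAG benchmarks, GraphSearch clearly outperforms the traditional strategy, confirming its effectiveness.

Insight: A dual-channel retrieval strategy that combines semantic and relational information is key to improving GraphRAG performance, and the modular design leaves room for later extension.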

Abstract: Graph Retrieval-Augmented Generation (GraphRAG) enhances factual reasoning in LLMs by structurally modeling knowledge through graph-based representations. However, existing GraphRAG approaches face two core limitations: shallow retrieval that fails to surface all critical evidence, and inefficient utilization of pre-constructed structural graph data, which hinders effective reasoning from complex queries. To address these challenges, we propose GraphSearch, a novel agentic deep searching workflow with dual-channel retrieval for GraphRAG. GraphSearch organizes the retrieval process into a modular framework comprising six modules, enabling multi-turn interactions and iterative reasoning. Furthermore, GraphSearch adopts a dual-channel retrieval strategy that issues semantic queries over chunk-based text data and relational queries over structural graph data, enabling comprehensive utilization of both modalities and their complementary strengths. Experimental results across six multi-hop RAG benchmarks demonstrate that GraphSearch consistently improves answer accuracy and generation quality over the traditional strategy, confirming GraphSearch as a promising direction for advancing graph retrieval-augmented generation.

[164] Fuzzy Reasoning Chain (FRC): An Innovative Reasoning Framework from Fuzziness to Clarity cs.CL | cs.AIPDF

Ping Chen, Xiang Liu, Zhaoxiang Liu, Zezhou Chen, Xingpeng Zhang

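TL;DR: The FRC framework combines LLM semantic priors with fuzzy membership degrees. Through an explicit interaction between probabilistic and fuzzy reasoning, it gradually turns ambiguous inputs into clear decisions, improving interpretability and robustness on complex text.

Details

Motivation: Despite major progress in large language models, handling ambiguous, polysemous, or uncertain text remains difficult, and traditional probability-based methods struggle to capture conflicting or uncertain signals.

Result: Validated on sentiment analysis tasks, FRC achieves stable reasoning and knowledge transfer across model scales, supported by both theoretical analysis and experimental results.

Insight: Combining fuzzy and probabilistic reasoning offers a new way to handle complex language, and the explicit interaction mechanism strengthens interpretability and adaptability.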

Abstract: With the rapid advancement of large language models (LLMs), natural language processing (NLP) has achieved remarkable progress. Nonetheless, significant challenges remain in handling texts with ambiguity, polysemy, or uncertainty. We introduce the Fuzzy Reasoning Chain (FRC) framework, which integrates LLM semantic priors with continuous fuzzy membership degrees, creating an explicit interaction between probability-based reasoning and fuzzy membership reasoning. This transition allows ambiguous inputs to be gradually transformed into clear and interpretable decisions while capturing conflicting or uncertain signals that traditional probability-based methods cannot. We validate FRC on sentiment analysis tasks, where both theoretical analysis and empirical results show that it ensures stable reasoning and facilitates knowledge transfer across different model scales. These findings indicate that FRC provides a general mechanism for managing subtle and ambiguous expressions with improved interpretability and robustness.

[165] S2J: Bridging the Gap Between Solving and Judging Ability in Generative Reward Models cs.CLPDF

Shaoning Sun, Jiachen Yu, Zongqi Wang, Xuewei Yang, Tianle Gu

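TL;DR: The paper proposes S2J, a method that jointly optimizes the problem-solving and judging abilities of generative reward models (GRMs), substantially narrowing the gap between solving and judging and improving model performance.

Details

Motivation: GRMs show a significant gap between solving problems and judging correctness: on 14%-37% of queries, the model can solve the problem yet fails to judge it correctly. This gap hurts GRMs in practical use.

Result: Experiments show that S2J markedly narrows the solve-to-judge gap and achieves SOTA performance among GRMs built on the same base model, while using a smaller training dataset and no distillation from external models.

Insight: A GRM's solving and judging abilities are implicitly linked, and optimizing them jointly and explicitly can substantially improve performance, suggesting a new direction for optimizing generative reward models.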

Abstract: With the rapid development of large language models (LLMs), generative reward models (GRMs) have been widely adopted for reward modeling and evaluation. Previous studies have primarily focused on training specialized GRMs by optimizing them on preference datasets with the judgment correctness as supervision. While it’s widely accepted that GRMs with stronger problem-solving capabilities typically exhibit superior judgment abilities, we first identify a significant solve-to-judge gap when examining individual queries. Specifically, the solve-to-judge gap refers to the phenomenon where GRMs struggle to make correct judgments on some queries (14%-37%), despite being fully capable of solving them. In this paper, we propose the Solve-to-Judge (S2J) approach to address this problem. Specifically, S2J simultaneously leverages both the solving and judging capabilities on a single GRM’s output for supervision, explicitly linking the GRM’s problem-solving and evaluation abilities during model optimization, thereby narrowing the gap. Our comprehensive experiments demonstrate that S2J effectively reduces the solve-to-judge gap by 16.2%, thereby enhancing the model’s judgment performance by 5.8%. Notably, S2J achieves state-of-the-art (SOTA) performance among GRMs built on the same base model while utilizing a significantly smaller training dataset. Moreover, S2J accomplishes this through self-evolution without relying on more powerful external models for distillation.

[166] Think Right, Not More: Test-Time Scaling for Numerical Claim Verification cs.CLPDF

Primakov Chungkham, V Venktesh, Vinay Setty, Avishek Anand

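TL;DR: The paper explores scaling test-time compute (TTS) to improve how well large language models (LLMs) fact-check complex numerical claims. Generating multiple reasoning paths and training a verifier model (VERIFIERFC) to select the best one significantly improves LLM performance on numerical reasoning tasks.

Details

Motivation: Although LLMs have made great progress on reasoning tasks, they still fall short on real-world claims that require compositional and numerical reasoning. Models are prone to reasoning drift, which leads to misinterpretation; the paper aims to address this.

Result: The method improves performance by 18.8% over single-shot claim verification, and its adaptive mechanism is 1.8x more compute-efficient than standard TTS.

Insight: Multi-path reasoning combined with verifier-based selection effectively mitigates reasoning drift on complex numerical claims, improving both the accuracy and the efficiency of fact-checking.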

Abstract: Fact-checking real-world claims, particularly numerical claims, is inherently complex and requires multistep and numerical reasoning to verify diverse aspects of a claim. Although large language models (LLMs) including reasoning models have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand the nuances of numerical aspects, and are also susceptible to the reasoning drift issue, where the model is unable to contextualize diverse information, resulting in misinterpretation and backtracking of the reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on the task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VERIFIERFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the reasoning drift issue, leading to significant performance gains for fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://github.com/VenkteshV/VerifierFC
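The selection step described above (sample several reasoning paths, let a verifier pick one, and skip the extra sampling for simple claims) can be sketched roughly as follows. All callables, the `threshold` and `n_paths` parameters, and the single-sample fallback are hypothetical stand-ins for illustration; the abstract does not specify how complexity is estimated or how the verifier scores paths.

```python
from typing import Callable, Tuple

# A reasoning path is modeled here as (reasoning_text, verdict).
Path = Tuple[str, str]


def verify_claim(
    claim: str,
    sample_path: Callable[[str], Path],       # hypothetical: one LLM sample
    score_path: Callable[[str, str], float],  # hypothetical: verifier score
    complexity: Callable[[str], float],       # hypothetical: complexity estimate
    n_paths: int = 8,
    threshold: float = 0.5,
) -> str:
    """Return a verdict, scaling test-time compute only for complex claims."""
    if complexity(claim) < threshold:
        # Simple claim: a single reasoning path is assumed to be enough.
        return sample_path(claim)[1]

    # Complex claim: sample several paths and keep the verifier's favorite.
    paths = [sample_path(claim) for _ in range(n_paths)]
    best = max(paths, key=lambda path: score_path(claim, path[0]))
    return best[1]
```

The design choice worth noting is that the verifier scores the reasoning text rather than voting on verdicts, so a minority path with a well-supported argument can still win.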

[167] Universal Legal Article Prediction via Tight Collaboration between Supervised Classification Model and LLM cs.CL | cs.AIPDF

Xiao Chi, Wenlin Zhong, Yiquan Wu, Wei Wang, Kun Kuang

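TL;DR: Uni-LAP is a universal legal article prediction framework that tightly couples a supervised classification model (SCM) with a large language model (LLM) to overcome the limitations of existing methods. The SCM uses a novel Top-K loss function to generate candidate articles, and the LLM refines the final prediction through syllogism-inspired deductive reasoning. Experiments show that Uni-LAP performs strongly on datasets from multiple jurisdictions.

Details

Motivation: Existing approaches each have limitations: SCMs struggle to capture complex fact patterns, and LLMs underperform on predictive tasks. Existing methods are also usually tied to a specific jurisdiction and do not generalize. A universal framework that combines the strengths of both is therefore needed.

Result: On datasets from multiple jurisdictions, Uni-LAP outperforms existing baselines, demonstrating its effectiveness and generalizability.

Insight: 1. Combining an SCM with an LLM lets each offset the other's weaknesses; 2. The Top-K loss function helps improve the accuracy of candidate article generation; 3. Deductive reasoning can improve LLM performance on predictive tasks.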

Abstract: Legal Article Prediction (LAP) is a critical task in legal text classification, leveraging natural language processing (NLP) techniques to automatically predict relevant legal articles based on the fact descriptions of cases. As a foundational step in legal decision-making, LAP plays a pivotal role in determining subsequent judgments, such as charges and penalties. Despite its importance, existing methods face significant challenges in addressing the complexities of LAP. Supervised classification models (SCMs), such as CNN and BERT, struggle to fully capture intricate fact patterns due to their inherent limitations. Conversely, large language models (LLMs), while excelling in generative tasks, perform suboptimally in predictive scenarios due to the abstract and ID-based nature of legal articles. Furthermore, the diversity of legal systems across jurisdictions exacerbates the issue, as most approaches are tailored to specific countries and lack broader applicability. To address these limitations, we propose Uni-LAP, a universal framework for legal article prediction that integrates the strengths of SCMs and LLMs through tight collaboration. Specifically, in Uni-LAP, the SCM is enhanced with a novel Top-K loss function to generate accurate candidate articles, while the LLM employs syllogism-inspired reasoning to refine the final predictions. We evaluated Uni-LAP on datasets from multiple jurisdictions, and empirical results demonstrate that our approach consistently outperforms existing baselines, showcasing its effectiveness and generalizability.

[168] Multilingual Vision-Language Models, A Survey cs.CLPDF

Andrei-Alexandru Manea, Jindřich Libovický

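TL;DR: This survey covers multilingual vision-language models, analyzing 31 models and 21 benchmarks, with a focus on the trade-off between language neutrality and cultural awareness.

Details

Motivation: To examine the state, challenges, and trends of multilingual vision-language models, especially the consistency of cross-lingual representations and cultural adaptation.

Result: Training methods favor language neutrality, while cultural awareness requires diverse data; most benchmarks rely on translation and lack culturally grounded content.

Insight: Cross-lingual capabilities are inconsistent, and there are gaps between training objectives and evaluation goals; future work should pay more attention to cultural diversity and better evaluation methods.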

Abstract: This survey examines multilingual vision-language models that process text and images across languages. We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.

[169] R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning cs.CL | cs.AIPDF

Hongyu Shan, Mingyang Song, Chang Dai, Di Liang, Han Chen

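TL;DR: The R-Capsule framework compresses the high-level reasoning plan into a small set of latent tokens (a Reasoning Capsule), improving efficiency while keeping explicit reasoning steps and balancing efficiency, accuracy, and interpretability.

Details

Motivation: Chain-of-Thought (CoT) prompting helps large language models (LLMs) with complex reasoning, but its verbosity increases latency and memory usage and can propagate early errors. A more efficient yet transparent reasoning method is needed.

Result: On complex benchmarks, R-Capsule substantially reduces the number of visible reasoning tokens while maintaining or improving accuracy.

Insight: Combining latent and explicit reasoning can improve efficiency, and the Information Bottleneck principle provides a theoretical basis for designing efficient, interpretable models.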

Abstract: Chain-of-Thought (CoT) prompting helps Large Language Models (LLMs) tackle complex reasoning by eliciting explicit step-by-step rationales. However, CoT’s verbosity increases latency and memory usage and may propagate early errors across long chains. We propose the Reasoning Capsule (R-Capsule), a framework that aims to combine the efficiency of latent reasoning with the transparency of explicit CoT. The core idea is to compress the high-level plan into a small set of learned latent tokens (a Reasoning Capsule) while keeping execution steps lightweight or explicit. This hybrid approach is inspired by the Information Bottleneck (IB) principle, where we encourage the capsule to be approximately minimal yet sufficient for the task. Minimality is encouraged via a low-capacity bottleneck, which helps improve efficiency. Sufficiency is encouraged via a dual objective: a primary task loss for answer accuracy and an auxiliary plan-reconstruction loss that encourages the capsule to faithfully represent the original textual plan. The reconstruction objective helps ground the latent space, thereby improving interpretability and reducing the use of uninformative shortcuts. Our framework strikes a balance between efficiency, accuracy, and interpretability, thereby reducing the visible token footprint of reasoning while maintaining or improving accuracy on complex benchmarks. Our codes are available at: https://anonymous.4open.science/r/Reasoning-Capsule-7BE0

Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan

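TL;DR: The paper proposes Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that progressively compresses Chain-of-Thought (CoT) over multiple rounds to reduce latency while improving accuracy and making performance predictable.

Details

Motivation: CoT reasoning improves performance on complex tasks, but its verbose output significantly increases inference latency. The paper addresses this by compressing CoT.

Result: Compared with baselines, MACC improves accuracy by 5.6% on average, shortens CoT by an average of 47 tokens, significantly lowers latency, and its performance can be predicted reliably.

Insight: CoT compression can balance efficiency and performance through dynamic multiround adjustment, and test-time performance can be predicted in advance from interpretable features.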

Abstract: Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon, where overly small token budgets can paradoxically increase output length, to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance (accuracy and token length) can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released at https://github.com/Leon221220/MACC.

[171] Mixture of Detectors: A Compact View of Machine-Generated Text Detection cs.CLPDF

Sai Teja Lekkala, Yadagiri Annepaka, Arun Kumar Challa, Samatha Reddy Machireddy, Partha Pakray

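TL;DR: The paper proposes a mixture-of-detectors approach to machine-generated text detection and introduces a new dataset, BMAS English, which supports binary classification, multiclass classification, sentence-level segmentation, and adversarial attack detection.

Details

Motivation: As text generated by large language models (LLMs) comes ever closer to human writing, telling human and machine-generated text apart matters more and more. The paper targets the challenges of machine-generated text detection across different scenarios.

Result: The experiments demonstrate the usefulness of the BMAS English dataset and the strong performance of the proposed method on multi-task detection.

Insight: Machine-generated text detection is a multi-faceted challenge that needs a unified framework covering different scenarios. The BMAS English dataset provides a solid benchmark for future research.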

Abstract: Large Language Models (LLMs) are gearing up to surpass human creativity. The veracity of the statement needs careful consideration. In recent developments, critical questions arise regarding the authenticity of human work and the preservation of their creativity and innovative abilities. This paper investigates such issues by addressing machine-generated text detection across several scenarios, including document-level binary and multiclass classification or generator attribution, sentence-level segmentation to differentiate between human and AI contributions in collaborative text, and adversarial attacks aimed at reducing the detectability of machine-generated text. We introduce BMAS English, an English-language dataset that supports Binary classification of human and machine text; Multiclass classification, which not only identifies machine-generated text but also tries to determine its generator; Adversarial attack handling, since such attacks are commonly used to evade detection; and Sentence-level segmentation, which predicts the boundaries between human and machine-generated text. We believe that this paper will address previous work in Machine-Generated Text Detection (MGTD) in a more meaningful way.

[172] Context Parametrization with Compositional Adapters cs.CLPDF

Josip Jukić, Martin Tutek, Jan Šnajder

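TL;DR: CompAs proposes a meta-learning framework based on compositional adapters that maps context directly into adapter parameters, supports seamless integration of multiple chunks of information, lowers inference cost, and improves robustness to long-context instability.

Details

Motivation: Conventional approaches such as in-context learning and supervised fine-tuning are inefficient, incur training overhead, or sacrifice flexibility. CompAs aims to address these problems by generating adapter parameters compositionally.

Result: On multiple-choice and extractive question answering tasks, CompAs outperforms in-context learning and prior generator-based methods, especially as the number of inputs grows.

Insight: Compositional adapter generation is an efficient, scalable alternative, and reversible encoding additionally supports safety and security.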

Abstract: Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and a principled solution when the input exceeds the model’s context window. Furthermore, CompAs encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.
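To make "merged algebraically" concrete, here is a minimal sketch that assumes the merge is an element-wise sum of per-layer parameter deltas. The abstract does not state which algebraic operation CompAs uses, so the sum, the `Adapter` type, and the function name are all assumptions for illustration.

```python
from typing import Dict, List

# An adapter is modeled as a mapping from layer name to a flat list of deltas.
Adapter = Dict[str, List[float]]


def merge_adapters(adapters: List[Adapter]) -> Adapter:
    """Combine adapters generated from separate context chunks by summation
    (an assumed merge rule, not confirmed by the source)."""
    merged: Adapter = {}
    for adapter in adapters:
        for name, delta in adapter.items():
            if name not in merged:
                merged[name] = list(delta)
                continue
            if len(delta) != len(merged[name]):
                raise ValueError(f"shape mismatch for layer {name!r}")
            merged[name] = [a + b for a, b in zip(merged[name], delta)]
    return merged
```

Because addition is commutative, a merge of this kind gives the same result regardless of the order in which instructions, demonstrations, or retrieved passages are combined, and no long prompt has to be reprocessed.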

[173] When Does Reasoning Matter? A Controlled Study of Reasoning’s Contribution to Model Performance cs.CLPDF

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo

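TL;DR: Using a synthetic data distillation framework, the paper compares language models of different sizes on math and general-purpose tasks. It finds that reasoning significantly improves model performance and, as model size grows, surpasses the performance of IFT models.

Details

Motivation: To determine how much reasoning actually contributes to language model performance, in particular its effectiveness and cost efficiency across tasks and model scales.

Result: Reasoning significantly improves performance and, especially at larger model sizes, overcomes the performance limits of IFT models, although IFT remains preferable in training and inference cost.

Insight: Reasoning models stand out as scale increases, particularly on reasoning-intensive and open-ended tasks, which provides useful guidance for future model design.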

Abstract: Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite its empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes, on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.

[174] Thinking in Many Modes: How Composite Reasoning Elevates Large Language Model Performance with Limited Data cs.CL | cs.AIPDF

Zishan Ahmad, Saisubramaniam Gopalakrishnan

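TL;DR: The paper proposes Composite Reasoning (CR), which lets large language models (LLMs) dynamically combine multiple reasoning styles (such as deductive, inductive, and abductive) to solve complex problems better with limited data.

Details

Motivation: Existing LLMs rely on a single reasoning paradigm and struggle with complex problems that require diverse cognitive strategies. The authors propose CR to improve adaptability by dynamically combining several reasoning styles.

Result: On scientific and medical question-answering benchmarks, CR outperforms Chain-of-Thought (CoT) and DeepSeek-R1 style reasoning (SR), and shows higher sample efficiency and adaptability.

Insight: By cultivating diversity in their internal reasoning styles, LLMs gain more robust, adaptive, and efficient problem-solving abilities, especially when data is limited.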

Abstract: Large Language Models (LLMs), despite their remarkable capabilities, rely on singular, predominant reasoning paradigms, hindering their performance on intricate problems that demand diverse cognitive strategies. To address this, we introduce Composite Reasoning (CR), a novel reasoning approach empowering LLMs to dynamically explore and combine multiple reasoning styles like deductive, inductive, and abductive for more nuanced problem-solving. Evaluated on scientific and medical question-answering benchmarks, our approach outperforms existing baselines like Chain-of-Thought (CoT) and also surpasses the accuracy of DeepSeek-R1 style reasoning (SR) capabilities, while demonstrating superior sample efficiency and adequate token usage. Notably, CR adaptively emphasizes domain-appropriate reasoning styles. It prioritizes abductive and deductive reasoning for medical question answering, but shifts to causal, deductive, and inductive methods for scientific reasoning. Our findings highlight that by cultivating internal reasoning style diversity, LLMs acquire more robust, adaptive, and efficient problem-solving abilities.

[175] In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners cs.CLPDF

Jaehoon Kim, Kwangwook Seo, Dongha Lee

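TL;DR: The paper shows that distilling reasoning from large language models into small ones often works poorly because the large model's reasoning traces contain low-probability tokens that the small model cannot represent. It proposes Reverse Speculative Decoding (RSD) to address this and improve the small model's reasoning ability.

Details

Motivation: Conventional knowledge distillation performs poorly on reasoning tasks because the large model's reasoning traces contain low-probability tokens that the small model cannot represent, creating learning barriers. The paper aims to solve this and improve small models on reasoning tasks.

Result: Qwen3-0.6B trained on RSD-generated reasoning traces improves by 4.9% on major reasoning benchmarks, whereas conventional distillation degrades performance by 20.5%.

Insight: Low-probability tokens are the critical bottleneck in transferring reasoning ability, and RSD traces must be tailored to each student model's own representation distribution rather than being universally applicable.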

Abstract: Transferring reasoning capabilities from larger language models to smaller ones through supervised fine-tuning often fails counterintuitively, with performance degrading despite access to high-quality teacher demonstrations. We identify that this failure stems from distributional misalignment: reasoning traces from larger models contain tokens that are low probability under the student’s distribution, exceeding the internal representation capacity of smaller architectures and creating learning barriers rather than helpful guidance. We propose Reverse Speculative Decoding (RSD), a mechanism for generating student-friendly reasoning traces in which the teacher model proposes candidate tokens but the student model determines acceptance based on its own probability distributions, filtering low probability tokens. When applied to Qwen3-0.6B, direct distillation of s1K-1.1 reasoning trace data degrades average performance across major reasoning benchmarks by 20.5%, while the same model trained on RSD-generated reasoning traces achieves meaningful improvements of 4.9%. Our analysis reveals that low probability tokens constitute the critical bottleneck in reasoning ability transfer. However, cross-model experiments demonstrate that RSD traces are model-specific rather than universally applicable, indicating that distributional alignment must be tailored for each student architecture’s unique internal representation.
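The accept-or-filter loop described above can be sketched with toy models as follows. The teacher proposes a token and the student accepts it only if the student's own probability for that token clears a threshold. The fallback to the student's most likely token, the threshold value, and the callable interfaces are assumptions for this illustration; the abstract does not say what replaces a rejected token.

```python
from typing import Callable, Dict, List

Dist = Dict[str, float]  # token -> probability under the student


def rsd_generate(
    teacher_next: Callable[[List[str]], str],  # hypothetical teacher proposal
    student_dist: Callable[[List[str]], Dist],  # hypothetical student distribution
    prompt: List[str],
    threshold: float = 0.01,
    max_new_tokens: int = 64,
    eos: str = "<eos>",
) -> List[str]:
    """Generate a trace in which every token is plausible under the student."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        proposal = teacher_next(tokens)
        probs = student_dist(tokens)
        if probs.get(proposal, 0.0) >= threshold:
            token = proposal  # student accepts the teacher's token
        else:
            # Assumed fallback: use the student's own most likely token.
            token = max(probs, key=probs.get)
        tokens.append(token)
        if token == eos:
            break
    return tokens[len(prompt):]
```

The resulting trace still follows the teacher wherever the student finds the teacher's choice plausible, which matches the abstract's point that such traces are specific to one student model.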

[176] FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding cs.CL | cs.AI | cs.SEPDF

Haorui Chen, Chengze Li, Jia Li

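TL;DR: FeatBench is a benchmark that evaluates how well coding agents implement features in the “vibe coding” paradigm. It fills a gap left by existing benchmarks through pure natural language prompts, a rigorous data collection process, and comprehensive test cases, and it exposes how hard feature implementation remains for current systems.

Details

Motivation: Existing code generation benchmarks cannot effectively assess feature implementation in “vibe coding” scenarios, because they usually depend on code-level specifications or focus narrowly on issue solving.

Result: The highest success rate among the evaluated systems is only 29.94%, and the evaluation reveals the double-edged effect of an “aggressive implementation” strategy.

Insight: Feature implementation is a major challenge in vibe coding; “aggressive implementation” can cause failures but can also lead to better software design.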

Abstract: The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as “vibe coding,” where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent’s vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for “aggressive implementation,” a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
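One way to read the F2P and P2P test design is as a simple pass criterion: a task counts as resolved only if the previously failing tests now pass and the previously passing tests still pass. The sketch below assumes that criterion and a plain dictionary of test outcomes; the function name and data layout are hypothetical and not taken from the FeatBench code.

```python
from typing import Dict, Iterable


def task_resolved(
    results_after_patch: Dict[str, bool],  # test id -> passed after the agent's patch
    fail_to_pass: Iterable[str],           # tests that failed before the feature existed
    pass_to_pass: Iterable[str],           # tests that passed before and must keep passing
) -> bool:
    """True only if the feature works (F2P) and nothing regressed (P2P)."""
    feature_works = all(results_after_patch.get(t, False) for t in fail_to_pass)
    no_regression = all(results_after_patch.get(t, False) for t in pass_to_pass)
    return feature_works and no_regression
```

Treating a missing test result as a failure keeps the check conservative: a patch that breaks test collection cannot count as a success.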

[177] Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance cs.CL | cs.AI

Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song

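TL;DR: The paper proposes Safety Compliance, a framework that approaches large language model (LLM) safety from a legal-compliance perspective, and substantially improves safety-compliance performance with a new benchmark and GRPO training.

Details

Motivation: Existing LLM safety methods lack systematic protection and cannot cope with complex behaviors, so the paper adopts legal compliance (e.g., the EU AI Act and GDPR) as the standard for defining and measuring LLM safety.

Result: Experiments show that Compliance Reasoner improves average benchmark performance by 10.45% for the EU AI Act and 11.85% for GDPR.

Insight: Combining legal compliance standards with LLM safety provides a systematic protection framework, and quantitative benchmarks and optimization methods can then substantially improve model safety.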

Abstract: The proliferation of Large Language Models (LLMs) has demonstrated remarkable capabilities, elevating the critical importance of LLM safety. However, existing safety methods rely on ad-hoc taxonomies and lack rigorous, systematic protection, failing to ensure safety for the nuanced and complex behaviors of modern LLM systems. To address this problem, we approach LLM safety from a legal compliance perspective, which we call safety compliance. In this work, we posit relevant established legal frameworks as safety standards for defining and measuring safety compliance, including the EU AI Act and GDPR, which serve as core legal frameworks for AI safety and data security in Europe. To bridge the gap between LLM safety and legal compliance, we first develop a new benchmark for safety compliance by generating realistic LLM safety scenarios seeded with legal statutes. Subsequently, we align Qwen3-8B using Group Relative Policy Optimization (GRPO) to construct a safety reasoner, Compliance Reasoner, which effectively aligns LLMs with legal standards to mitigate safety risks. Our comprehensive experiments demonstrate that the Compliance Reasoner achieves superior performance on the new benchmark, with average improvements of +10.45% for the EU AI Act and +11.85% for GDPR.
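For readers unfamiliar with GRPO (commonly expanded as Group Relative Policy Optimization), its central step is scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization step under simplifying assumptions; the function name, the use of the population standard deviation, and the `eps` constant are illustrative choices, and it is not the training code used for Compliance Reasoner.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is the reward's distance from the group mean, scaled by
    the group's standard deviation (eps avoids division by zero when all
    rewards are equal).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 if judged compliant.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because advantages are computed within the group, no separate value network is needed to provide a baseline.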

[178] Beyond Textual Context: Structural Graph Encoding with Adaptive Space Alignment to alleviate the hallucination of LLMs cs.CL | cs.AI

Yifang Zhang, Pengfei Duan, Yiwen Yang, Shengwu Xiong

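TL;DR: The paper proposes SSKG-LLM, a model that mitigates LLM hallucination by combining the structural and semantic information of knowledge graphs, integrating structured knowledge through its KGR, KGE, and KGA modules.

Details

Motivation: Current approaches to LLM hallucination rely mainly on knowledge graphs but usually treat them as plain text and ignore structural information; in addition, the mismatch between the embedding spaces of knowledge graph encoders and language models hinders effective use of the knowledge.

Result: Experiments show that incorporating the structural information of knowledge graphs significantly improves the factual reasoning ability of LLMs.

Insight: Structural information in knowledge graphs plays an important role in mitigating hallucination, and adaptive space alignment is key to integrating structured knowledge.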

Abstract: Currently, the main approach for Large Language Models (LLMs) to tackle the hallucination issue is incorporating Knowledge Graphs (KGs). However, LLMs typically treat KGs as plain text, extracting only semantic information and limiting their use of the crucial structural aspects of KGs. Another challenge is the gap between the embedding spaces of KG encoders and LLM text embeddings, which hinders the effective integration of structured knowledge. To overcome these obstacles, we put forward SSKG-LLM, an innovative model architecture that is designed to efficiently integrate both the Structural and Semantic information of KGs into the reasoning processes of LLMs. SSKG-LLM incorporates the Knowledge Graph Retrieval (KGR) module and the Knowledge Graph Encoding (KGE) module to preserve semantics while utilizing structure. Then, the Knowledge Graph Adaptation (KGA) module is incorporated to enable LLMs to understand KG embeddings. We conduct extensive experiments and provide a detailed analysis to explore how incorporating the structural information of KGs can enhance the factual reasoning abilities of LLMs. Our code is available at https://github.com/yfangZhang/SSKG-LLM.

[179] Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection? cs.CL | cs.AI

Yifan Wang, Mayank Jobanputra, Ji-Ung Lee, Soyoung Oh, Isabel Valera

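TL;DR: This paper presents the first systematic study of the relationship between explainability and fairness in hate speech detection. It finds that input-based explanations can effectively detect bias and support debiasing during training, but are unreliable for selecting fair models.

Details

Motivation: NLP models often behave unfairly because of social bias in their training data, and their black-box nature hinders detecting and mitigating that bias. Existing research is mostly qualitative and lacks large-scale quantitative analysis.

Result: Input-based explanations effectively detect biased predictions and serve as useful supervision for debiasing during training, but they are unreliable for selecting fair models.

Insight: Explainability tools need to be applied selectively in fairness tasks: they work well in some settings (such as debiasing during training) but may be ineffective in others (such as model selection).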

Abstract: Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.

[180] Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs cs.CL | cs.AI | 03B10 | I.2.7; I.2.3

Felix Vossel, Till Mossakowski, Björn Gehrke

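TL;DR: The paper fine-tunes large language models (LLMs) to formalize natural language into first-order logic (FOL), comparing architectures (encoder-decoder vs. decoder-only) and training strategies and introducing several techniques and evaluation metrics. The fine-tuned Flan-T5-XXL outperforms models such as GPT-4o, highlighting the importance of predicate availability and model generalization.

Details

Motivation: Formalizing natural language into first-order logic is a key task for knowledge representation and formal methods, yet it remains challenging. The paper aims to raise the level of automation for this task by fine-tuning LLMs.

Result: The fine-tuned Flan-T5-XXL reaches 70% accuracy when given predicate lists, outperforming GPT-4o and other models, and generalizes to unseen logical arguments. Predicate availability improves performance by 15-20%.

Insight: 1. Predicate extraction is the main bottleneck. 2. T5 models outperform larger decoder-only LLMs. 3. Models perform well on datasets they were not specifically trained on.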

Abstract: Automating the translation of natural language to first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques like vocabulary extension, predicate conditioning, and multilingual training, introducing metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o and even the DeepSeek-R1-0528 model with CoT reasoning ability as well as symbolic systems like ccg2lambda. Key findings show: (1) predicate availability boosts performance by 15-20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (FOLIO dataset) without specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.

[181] Transformers Can Learn Connectivity in Some Graphs but Not Others cs.CL | cs.AI | cs.LG | cs.LO

Amit Roy, Abulhair Saparov

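TL;DR: The paper studies how well transformers infer connectivity (i.e., transitive relations) in different kinds of directed graphs. Transformers do well on low-dimensional “grid-like” graphs and poorly on high-dimensional or non-grid graphs, and increasing model scale improves generalization.

Details

Motivation: To study the ability of transformers to handle transitive relations (e.g., in causal inference), filling the previously unexplored gap of learning transitive relations from training examples, and to examine how model scale affects this task.

Result: Transformers learn connectivity effectively on low-dimensional grid-like graphs but do worse on high-dimensional grids and on non-grid graphs with many disconnected components; increasing model scale improves generalization on grid graphs.

Insight: A transformer's ability to learn connectivity depends closely on graph structure, and a low-dimensional embedding is key; model scale helps, but it does not overcome the difficulty of graphs with many disconnected components.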

Abstract: Reasoning capability is essential to ensure the factual correctness of the responses of transformer-based Large Language Models (LLMs), and robust reasoning about transitive relations is instrumental in many settings, such as causal inference. Hence, it is essential to investigate the capability of transformers in the task of inferring transitive relations (e.g., knowing A causes B and B causes C, then A causes C). The task of inferring transitive relations is equivalent to the task of connectivity in directed graphs (e.g., knowing there is a path from A to B, and there is a path from B to C, then there is a path from A to C). Past research focused on whether transformers can learn to infer transitivity from in-context examples provided in the input prompt. However, transformers’ capability to infer transitive relations from training examples and how scaling affects that ability remain unexplored. In this study, we seek to answer this question by generating directed graphs to train transformer models of varying sizes and evaluate their ability to infer transitive relations for various graph sizes. Our findings suggest that transformers are capable of learning connectivity on “grid-like” directed graphs where each node can be embedded in a low-dimensional subspace, and connectivity is easily inferable from the embeddings of the nodes. We find that the dimensionality of the underlying grid graph is a strong predictor of transformers’ ability to learn the connectivity task, where higher-dimensional grid graphs pose a greater challenge than low-dimensional grid graphs. In addition, we observe that increasing the model scale leads to increasingly better generalization to infer connectivity over grid graphs. However, if the graph is not a grid graph and contains many disconnected components, transformers struggle to learn the connectivity task, especially when the number of components is large.
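The equivalence between transitive inference and directed-graph connectivity, and the sense in which connectivity on a grid can be read off node coordinates, can be illustrated with a small sketch. The construction below is an assumption for illustration only (a 2D grid whose edges point right and down); the paper's actual graph generator is not described in the abstract, and `grid_graph` and `reachable` are hypothetical helper names.

```python
from collections import deque


def grid_graph(rows, cols):
    """Build a directed 2D grid: each node (r, c) points right and down."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            neighbours = []
            if c + 1 < cols:
                neighbours.append((r, c + 1))
            if r + 1 < rows:
                neighbours.append((r + 1, c))
            edges[(r, c)] = neighbours
    return edges


def reachable(edges, src, dst):
    """Breadth-first search: is there a directed path from src to dst?"""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


g = grid_graph(3, 3)
print(reachable(g, (0, 0), (2, 2)), reachable(g, (1, 1), (0, 2)))
```

In this particular grid, `(r1, c1)` reaches `(r2, c2)` exactly when `r1 <= r2` and `c1 <= c2`, so a coordinate-wise comparison of two low-dimensional node embeddings answers the connectivity query without any search.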

[182] The InviTE Corpus: Annotating Invectives in Tudor English Texts for Computational Modeling cs.CL

Sophie Spliethoff, Sanne Hoeken, Silke Schwandt, Sina Zarrieß, Özge Alaçam

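TL;DR: The paper introduces the InviTE corpus, nearly 2,000 Early Modern English sentences for studying invective language in the religious controversies of the Tudor period, and compares fine-tuned BERT-based models with zero-shot prompted large language models.

Details

Motivation: To apply natural language processing techniques to historical research, in particular the analysis of religious invectives during the Protestant Reformation in Tudor England.

Result: Models pre-trained on historical data and fine-tuned for the task perform best at invective detection.

Insight: Task-specific data annotation and model fine-tuning are crucial for performance on historical language tasks.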

Abstract: In this paper, we aim at the application of Natural Language Processing (NLP) techniques to historical research endeavors, particularly addressing the study of religious invectives in the context of the Protestant Reformation in Tudor England. We outline a workflow spanning from raw data, through pre-processing and data selection, to an iterative annotation process. As a result, we introduce the InviTE corpus – a corpus of almost 2000 Early Modern English (EModE) sentences, which are enriched with expert annotations regarding invective language throughout 16th-century England. Subsequently, we assess and compare the performance of fine-tuned BERT-based models and zero-shot prompted instruction-tuned large language models (LLMs), which highlights the superiority of models pre-trained on historical data and fine-tuned to invective detection.

[183] Conversational Implicatures: Modelling Relevance Theory Probabilistically cs.CL

Christoph Unger, Hendrik Buschmeier

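TL;DR: This paper explores how Bayesian probability theory can be applied to relevance-theoretic pragmatics in order to study the communication of implicit meaning.

Details

Motivation: With the application of Bayesian probability theory in cognitive science and advances in computational tools, pragmatics and semantics have taken a “probabilistic turn”. The paper aims to apply a similar Bayesian approach to relevance theory to better understand how implicit meaning is communicated.

Result: The study indicates that a Bayesian approach can be used to model the communication of implicit meaning within relevance theory.

Insight: A Bayesian probabilistic framework allows a more formal account of how implicit meaning is conveyed in pragmatics and points to new directions for future computational models.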

Abstract: Recent advances in Bayesian probability theory and its application to cognitive science in combination with the development of a new generation of computational tools and methods for probabilistic computation have led to a ‘probabilistic turn’ in pragmatics and semantics. In particular, the framework of Rational Speech Act theory has been developed to model broadly Gricean accounts of pragmatic phenomena in Bayesian terms, starting with fairly simple reference games and covering ever more complex communicative exchanges such as verbal syllogistic reasoning. This paper explores in which way a similar Bayesian approach might be applied to relevance-theoretic pragmatics (Sperber & Wilson, 1995) by studying a paradigmatic pragmatic phenomenon: the communication of implicit meaning by way of (conversational) implicatures.

[184] Exploratory Semantic Reliability Analysis of Wind Turbine Maintenance Logs using Large Language Models cs.CL

Max Malyi, Jonathan Shek, Andre Biscaya

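TL;DR: The paper proposes an exploratory framework that uses large language models (LLMs) for deep semantic analysis of wind turbine maintenance logs, going beyond conventional text classification to support failure mode identification, causal chain inference, comparative site analysis, and data quality auditing.

Details

Motivation: The unstructured text in wind turbine maintenance logs holds a wealth of operational intelligence that traditional quantitative analysis struggles to use. Existing machine learning approaches are usually limited to classification and cannot perform complex semantic analysis.

Result: The results show that LLMs can act as powerful “reliability co-pilots” that go beyond labelling to synthesise textual information and generate actionable, expert-level hypotheses.

Insight: LLMs can extract hidden semantic information from unstructured text, offering a new methodology for reliability analysis in wind energy and surfacing insights that traditional methods cannot reach.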

Abstract: A wealth of operational intelligence is locked within the unstructured free-text of wind turbine maintenance logs, a resource largely inaccessible to traditional quantitative reliability analysis. While machine learning has been applied to this data, existing approaches typically stop at classification, categorising text into predefined labels. This paper addresses the gap in leveraging modern large language models (LLMs) for more complex reasoning tasks. We introduce an exploratory framework that uses LLMs to move beyond classification and perform deep semantic analysis. We apply this framework to a large industrial dataset to execute four analytical workflows: failure mode identification, causal chain inference, comparative site analysis, and data quality auditing. The results demonstrate that LLMs can function as powerful “reliability co-pilots,” moving beyond labelling to synthesise textual information and generate actionable, expert-level hypotheses. This work contributes a novel and reproducible methodology for using LLMs as a reasoning tool, offering a new pathway to enhance operational intelligence in the wind energy sector by unlocking insights previously obscured in unstructured data.

[185] What Is The Political Content in LLMs’ Pre- and Post-Training Data? cs.CL | cs.AI | cs.CY

Tanise Ceron, Dmitry Nikolaev, Dominik Stammbach, Debora Nozza

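TL;DR: The paper analyzes the share of political content in the training data of the open-source LLM OLMO2 and its relationship to politically biased model output, finding that left-leaning content predominates and correlates strongly with the model's political bias.

Details

Motivation: Text generated by LLMs is politically biased, but the source of the bias is unclear. The study aims to shed light on how the bias arises by analyzing political content in the training data.

Result: Left-leaning documents predominate across datasets, and pre-training corpora contain significantly more politically engaged content than post-training data; left- and right-leaning documents discuss similar topics through distinct values and sources of legitimacy; the predominant stance in the training data correlates strongly with the model's political bias.

Insight: Political content analysis should be built into data curation pipelines, and filtering strategies should be documented transparently.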

Abstract: Large language models (LLMs) are known to generate politically biased text, yet how such biases arise remains unclear. A crucial step toward answering this question is the analysis of training data, whose political content remains largely underexplored in current LLM research. To address this gap, we present in this paper an analysis of the pre- and post-training corpora of OLMO2, the largest fully open-source model released together with its complete dataset. From these corpora, we draw large random samples, automatically annotate documents for political orientation, and analyze their source domains and content. We then assess how political content in the training data correlates with models’ stance on specific policy issues. Our analysis shows that left-leaning documents predominate across datasets, with pre-training corpora containing significantly more politically engaged content than post-training data. We also find that left- and right-leaning documents frame similar topics through distinct values and sources of legitimacy. Finally, the predominant stance in the training data strongly correlates with models’ political biases when evaluated on policy issues. These findings underscore the need to integrate political content analysis into future data curation pipelines as well as in-depth documentation of filtering strategies for transparency.

[186] Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding cs.CL | cs.AI

Ziheng Chi, Yifan Hou, Chenxi Pang, Shaobo Cui, Mubashara Akhtar

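TL;DR: Chimera is a comprehensive test suite for assessing whether vision-language models (VLMs) rely on shortcut learning in diagram understanding. Analyzing three kinds of shortcut behavior, the study finds that current VLMs' performance stems largely from shortcuts rather than genuine understanding.

Details

Motivation: Current VLMs score well on diagram-related benchmarks, but whether they genuinely understand and reason over diagrams remains in doubt. A stricter evaluation method is needed to reveal whether models depend on shortcuts.

Result: The strong performance of VLMs on diagram understanding stems largely from shortcut behavior: the visual-memorization shortcut has a slight impact, the knowledge-recall shortcut a moderate one, and the Clever-Hans shortcut contributes significantly.

Insight: Current VLMs have clear limitations in understanding complex visual inputs such as diagrams; existing evaluation protocols may mask models' true capabilities, and more rigorous evaluation is needed to drive genuine comprehension.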

Abstract: Diagrams convey symbolic information in a visual format rather than a linear stream of words, making them especially challenging for AI models to process. While recent evaluations suggest that vision-language models (VLMs) perform well on diagram-related benchmarks, their reliance on knowledge, reasoning, or modality shortcuts raises concerns about whether they genuinely understand and reason over diagrams. To address this gap, we introduce Chimera, a comprehensive test suite comprising 7,500 high-quality diagrams sourced from Wikipedia; each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions designed to assess four fundamental aspects of diagram comprehension: entity recognition, relation understanding, knowledge grounding, and visual reasoning. We use Chimera to measure the presence of three types of shortcuts in visual question answering: (1) the visual-memorization shortcut, where VLMs rely on memorized visual patterns; (2) the knowledge-recall shortcut, where models leverage memorized factual knowledge instead of interpreting the diagram; and (3) the Clever-Hans shortcut, where models exploit superficial language patterns or priors without true comprehension. We evaluate 15 open-source VLMs from 7 model families on Chimera and find that their seemingly strong performance largely stems from shortcut behaviors: visual-memorization shortcuts have slight impact, knowledge-recall shortcuts play a moderate role, and Clever-Hans shortcuts contribute significantly. These findings expose critical limitations in current VLMs and underscore the need for more robust evaluation protocols that benchmark genuine comprehension of complex visual inputs (e.g., diagrams) rather than question-answering shortcuts.

[187] Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning cs.CL | cs.AI | cs.LG

Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, Nezihe Merve Gürel

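TL;DR: This paper evaluates the performance and limitations of large language models (LLMs) on multilingual legal reasoning tasks, shows how challenging legal tasks are for LLMs, and presents an open-source evaluation framework.

Details

Motivation: As LLMs are increasingly used in legal workflows, understanding their capabilities and limits in multilingual, multi-jurisdictional, and adversarial settings becomes crucial.

Result: Accuracy on legal tasks (such as LEXam) is often below 50%, compared with over 70% on general-purpose tasks (such as XNLI). Gemini outperforms LLaMA by about 24 percentage points on average.

Insight: 1. Legal tasks are significantly more challenging for LLMs. 2. English yields more stable results but not always higher accuracy. 3. Performance in a language correlates with its syntactic similarity to English. 4. Adversarial vulnerability persists across languages.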

Abstract: In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta’s LLaMA, OpenAI’s ChatGPT, Google’s Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability are also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same task. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.

[188] NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use cs.CL

Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza

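TL;DR: NeLLCom-Lex is a neural-agent framework that studies the interplay between lexical systems and language use by simulating semantic change. It grounds agents in a real lexical system and simulates semantic change by manipulating their communicative needs. Experiments show that the agents reproduce human behavior in a color naming task.

Details

Motivation: Research on lexical semantic change has relied on observational or experimental methods, but the former cannot reveal causal mechanisms and the latter are hard to apply because of the long time spans involved. NeLLCom-Lex fills this gap by studying the mechanisms of semantic change through simulation.

Result: Trained neural agents reproduce human-like patterns in the color naming task to a remarkable extent, supporting the usefulness of the framework.

Insight: NeLLCom-Lex provides a controlled experimental platform for semantic change research and can help uncover the mechanisms underlying language evolution.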

Abstract: Lexical semantic change has primarily been investigated with observational and experimental methods; however, observational methods (corpus analysis, distributional semantic modeling) cannot get at causal mechanisms, and experimental paradigms with humans are hard to apply to semantic change due to the extended diachronic processes involved. This work introduces NeLLCom-Lex, a neural-agent framework designed to simulate semantic change by first grounding agents in a real lexical system (e.g. English) and then systematically manipulating their communicative needs. Using a well-established color naming task, we simulate the evolution of a lexical system within a single generation, and study which factors lead agents to: (i) develop human-like naming behavior and lexicons, and (ii) change their behavior and lexicons according to their communicative needs. Our experiments with different supervised and reinforcement learning pipelines show that neural agents trained to ‘speak’ an existing language can reproduce human-like patterns in color naming to a remarkable extent, supporting the further use of NeLLCom-Lex to elucidate the mechanisms of semantic change.

[189] Exploring Solution Divergence and Its Effect on Large Language Model Problem Solving cs.CL | cs.AI

Hang Li, Kaiqi Yang, Yucheng Chu, Hui Liu, Jiliang Tang

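TL;DR: This paper studies the divergence among solutions that a large language model (LLM) generates for a single problem and proposes it as a metric of LLM problem-solving ability. Higher solution divergence is positively related to better problem solving, and experiments show that the metric can support both supervised fine-tuning (SFT) and reinforcement learning (RL) strategies.

Details

Motivation: Existing methods for improving LLM problem solving rely mainly on SFT or RL and overlook how the divergence of generated solutions affects problem-solving ability. The paper studies solution divergence to offer a new perspective on LLM training and evaluation.

Result: Higher solution divergence is positively related to better problem-solving ability across models, and using the metric consistently improves success rates.

Insight: Solution divergence is a simple but effective tool that opens a new direction for LLM training and evaluation.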

Abstract: Large language models (LLMs) have been widely used for problem-solving tasks. Most recent work improves their performance through supervised fine-tuning (SFT) with labeled data or reinforcement learning (RL) from task feedback. In this paper, we study a new perspective: the divergence in solutions generated by LLMs for a single problem. We show that higher solution divergence is positively related to better problem-solving abilities across various models. Based on this finding, we propose solution divergence as a novel metric that can support both SFT and RL strategies. We test this idea on three representative problem domains and find that using solution divergence consistently improves success rates. These results suggest that solution divergence is a simple but effective tool for advancing LLM training and evaluation.

[190] InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models cs.CL | cs.AI

Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang

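TL;DR: The paper presents InfiR2, a comprehensive FP8 training recipe aimed at lowering the cost of training large language models. It uses a hybrid-granularity quantization strategy to preserve numerical fidelity while improving computational efficiency; experiments show it is stable and essentially lossless, with substantially lower training time and memory use.

Details

Motivation: The high cost of training large language models is a barrier to innovation, and although FP8 training can improve efficiency, no comprehensive open-source recipe has been available. The paper aims to fill this gap.

Result: Experiments, including continual pre-training on a 160B-token corpus, show performance on par with the BF16 baseline, with up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput.

Insight: FP8 training is a practical alternative to BF16 that substantially lowers training cost without sacrificing model performance.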

Abstract: The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.

[191] Think Socially via Cognitive Reasoning cs.CL

Jinfeng Zhou, Zheyu Chen, Shuai Wang, Quanyu Dai, Zhenhua Dong

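TL;DR: The paper proposes a new paradigm called “Cognitive Reasoning” that draws on the mechanisms of human social cognition, using a structured cognitive flow to strengthen the social reasoning of LLMs, and designs the CogFlow framework, which optimizes the model through supervised fine-tuning and reinforcement learning.

Details

Motivation: The conventional step-by-step reasoning paradigm is ill-suited to social situations, which typically involve ambiguous cues and open-ended outcomes; an approach closer to human social cognition is needed.

Result: Experiments show that CogFlow effectively enhances the social cognitive capabilities of LLMs, and even of humans, leading to more effective social decision-making.

Insight: With structured cognitive flows and progressive optimization, LLMs can better emulate human social reasoning, addressing the shortcomings of conventional logical reasoning in social settings.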

Abstract: LLMs trained for logical reasoning excel at step-by-step deduction to reach verifiable answers. However, this paradigm is ill-suited for navigating social situations, which induce an interpretive process of analyzing ambiguous cues that rarely yield a definitive outcome. To bridge this gap, we introduce Cognitive Reasoning, a paradigm modeled on human social cognition. It formulates the interpretive process into a structured cognitive flow of interconnected cognitive units (e.g., observation or attribution), which combine adaptively to enable effective social thinking and responses. We then propose CogFlow, a complete framework that instills this capability in LLMs. CogFlow first curates a dataset of cognitive flows by simulating the associative and progressive nature of human thought via tree-structured planning. After instilling the basic cognitive reasoning capability via supervised fine-tuning, CogFlow adopts reinforcement learning to enable the model to improve itself via trial and error, guided by a multi-objective reward that optimizes both cognitive flow and response quality. Extensive experiments show that CogFlow effectively enhances the social cognitive capabilities of LLMs, and even humans, leading to more effective social decision-making.

[192] Variational Reasoning for Language Models cs.CL | cs.AI | cs.LG

Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin

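TL;DR: The paper proposes a variational reasoning framework that treats a language model's thinking traces as latent variables and optimizes them through variational inference. It extends the ELBO to a multi-trace objective, proposes a forward-KL formulation that stabilizes training, and reveals an implicit weighting by model accuracy that biases training toward easier questions. Experiments across many tasks validate the approach.

Details

Motivation: Language models' performance on reasoning tasks is often unstable or lacks theoretical grounding; the paper aims to provide a more robust optimization framework by unifying variational inference with reinforcement learning methods.

Result: Validated on the Qwen 2.5 and Qwen 3 model families, the method improves reasoning ability and provides more stable training objectives.

Insight: 1) Variational inference and RL-style methods can be unified naturally; 2) the implicit weighting exposes a bias toward easier questions; 3) the forward-KL formulation helps training stability.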

Abstract: We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
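As background for the abstract above, the standard single-trace evidence lower bound can be written out as follows. The notation is an assumption made for illustration and is not taken from the paper: x is the question, z the latent thinking trace, y the answer, p_θ the language model, and q_φ the variational posterior over traces.

```latex
\log p_\theta(y \mid x)
  = \log \int p_\theta(z \mid x)\, p_\theta(y \mid x, z)\, dz
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x, y)\,\big\|\,p_\theta(z \mid x)\big)
```

The inequality follows from Jensen's inequality applied after multiplying and dividing by q_φ(z | x, y) inside the integral. The multi-trace objective and the forward-KL formulation described in the abstract build on this bound; their exact forms are given in the paper and are not reproduced here.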

[193] Language Models Can Learn from Verbal Feedback Without Scalar Rewards cs.CL | cs.AI | cs.LG

Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin

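TL;DR: The paper proposes the feedback-conditional policy (FCP), which lets language models learn directly from verbal feedback without compressing it into scalar rewards, preserving the richness of the feedback.

Details

Motivation: Conventional RL methods compress nuanced verbal feedback into scalar rewards, which loses information and induces scale imbalance. The paper aims to use verbal feedback directly through conditional generation, giving models a more expressive way to learn.

Result: The experiments indicate that FCP uses verbal feedback effectively and avoids the information loss of scalar rewards.

Insight: The paper reframes feedback-driven learning as conditional generation rather than reward optimization, opening a new direction for learning from feedback.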

Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.

[194] Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity cs.CL | cs.AI | cs.HC

Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty

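TL;DR: This paper examines the limits of n-gram novelty as a metric for textual creativity, argues for evaluation that combines novelty with appropriateness, and uses expert annotations to compare the creativity of human-written and model-generated text.

Details

Motivation: N-gram novelty is widely used to evaluate language models' creativity, but it does not fully capture creativity's dual nature (novelty and appropriateness), so a more complete evaluation is needed.

Result: N-gram novelty is positively associated with expert-judged creativity, yet about 91% of top-quartile expressions by n-gram novelty are not judged creative; in open-source LLMs, higher n-gram novelty correlates with lower pragmaticality; frontier LLMs struggle to identify non-pragmatic expressions.

Insight: Creativity evaluation must weigh both novelty and appropriateness, and relying on n-gram novelty alone can mislead. Frontier LLMs still have room to improve at identifying non-pragmatic expressions.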

Abstract: N-gram novelty is widely used to evaluate language models’ ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity’s dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
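To make the metric under discussion concrete, one common way to compute n-gram novelty is the fraction of a text's n-grams that never occur in a reference corpus. The sketch below assumes whitespace tokenization, lowercasing, and trigrams; the paper's exact definition, tokenizer, and reference corpus may differ, and the function names are illustrative.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(text, corpus, n=3):
    """Fraction of the text's n-grams that do not appear in the corpus."""
    seen = set()
    for doc in corpus:
        seen.update(ngrams(doc.lower().split(), n))
    candidate = ngrams(text.lower().split(), n)
    if not candidate:
        return 0.0  # text shorter than n tokens: nothing to score
    return sum(g not in seen for g in candidate) / len(candidate)


corpus = ["the cat sat on the mat"]
print(ngram_novelty("the cat sat on a rug", corpus))  # 0.5: 2 of 4 trigrams unseen
```

A score like this rewards any unseen word sequence, including nonsensical ones, which is why the abstract cautions against treating it as a measure of creativity on its own.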

[195] WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning cs.CL | cs.AI

Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong

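TL;DR: WebGen-Agent is a website-generation agent built on multi-level visual feedback and step-level reinforcement learning; it substantially improves the quality and appearance of generated website code.

Details

Motivation: Current code agents rely only on simple code-execution feedback for website code generation, which fails to capture the actual quality of the generated code.

Result: On the WebGen-Bench dataset, WebGen-Agent raises the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9.

Insight: Combining multi-level visual feedback with step-level reinforcement learning can markedly improve code generation, especially where visual effects matter.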

Abstract: Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model’s website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.

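[196] VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing cs.CL | cs.AI | cs.CV | cs.HC | cs.SD

Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li

TL;DR: VoiceAssistant-Eval is a comprehensive benchmark for evaluating AI assistants across listening, speaking, and viewing. Results show that proprietary models do not universally outperform open-source models, and that multimodal input and role-play imitation remain challenging.

Details

Motivation: Existing benchmarks cannot fully evaluate the capabilities of emerging voice-first AI assistants, so a comprehensive framework spanning multiple task categories is needed.

Result: Mid-sized models can outperform much larger ones; multimodal input and role-play voice imitation are difficult for current models; and gaps remain in robustness and safety alignment.

Insight: Proprietary models do not universally outperform open-source ones, most models are weaker at audio understanding than at speaking, and well-designed smaller models can rival much larger ones.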

Abstract: The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems’ capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .

cs.GR [Back]

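[197] ControlHair: Physically-based Video Diffusion for Controllable Dynamic Hair Rendering cs.GR | cs.CV | I.3; I.2; I.4

Weikai Lin, Haoxiang Li, Yuhao Zhu

TL;DR: ControlHair is a hybrid framework that combines a physics simulator with a conditional video diffusion model to enable controllable dynamic hair rendering, supporting diverse physics parameters and precise control over dynamics.

Details

Motivation: Conventional video diffusion models lack fine-grained control over hair dynamics, and the complexity of hair rendering (strand dynamics, light-hair interactions) calls for a new approach.

Result: Trained on a curated 10K-video dataset, ControlHair outperforms text- and pose-conditioned baselines and delivers precisely controlled hair dynamics.

Insight: Decoupling physics reasoning from video generation lets the framework support flexible physics parameters while simplifying training of the video diffusion model.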

Abstract: Hair simulation and rendering are challenging due to complex strand dynamics, diverse material properties, and intricate light-hair interactions. Recent video diffusion models can generate high-quality videos, but they lack fine-grained control over hair dynamics. We present ControlHair, a hybrid framework that integrates a physics simulator with conditional video diffusion to enable controllable dynamic hair rendering. ControlHair adopts a three-stage pipeline: it first encodes physics parameters (e.g., hair stiffness, wind) into per-frame geometry using a simulator, then extracts per-frame control signals, and finally feeds control signals into a video diffusion model to generate videos with desired hair dynamics. This cascaded design decouples physics reasoning from video generation, supports diverse physics, and makes training the video diffusion model easy. Trained on a curated 10K video dataset, ControlHair outperforms text- and pose-conditioned baselines, delivering precisely controlled hair dynamics. We further demonstrate three use cases of ControlHair: dynamic hairstyle try-on, bullet-time effects, and cinemagraphic. ControlHair introduces the first physics-informed video diffusion framework for controllable dynamics. We provide a teaser video and experimental results on our website.

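[198] Rigidity-Aware 3D Gaussian Deformation from a Single Image cs.GR | cs.AI | cs.CV

Jinhyeok Kim, Jaehun Bang, Seunghyun Seo, Kyungdon Joo

TL;DR: The paper proposes DeformSplat, a framework that guides 3D Gaussian deformation from a single image. Its core contributions are Gaussian-to-Pixel Matching and Rigid Part Segmentation, and experiments show it outperforms existing methods.

Details

Motivation: Existing methods typically rely on multi-view video to recover deformation, which limits their applicability in constrained scenarios. The paper targets reconstructing object deformation from a single image.

Result: Experiments show that the method significantly outperforms existing methods and extends to applications such as frame interpolation and interactive object manipulation.

Insight: Explicitly identifying rigid regions and matching across the gap between 3D Gaussians and 2D pixels makes consistent deformation reconstruction from a single image possible.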

Abstract: Reconstructing object deformation from a single image remains a significant challenge in computer vision and graphics. Existing methods typically rely on multi-view video to recover deformation, limiting their applicability under constrained scenarios. To address this, we propose DeformSplat, a novel framework that effectively guides 3D Gaussian deformation from only a single image. Our method introduces two main technical contributions. First, we present Gaussian-to-Pixel Matching, which bridges the domain gap between 3D Gaussian representations and 2D pixel observations. This enables robust deformation guidance from sparse visual cues. Second, we propose Rigid Part Segmentation consisting of initialization and refinement. This segmentation explicitly identifies rigid regions, crucial for maintaining geometric coherence during deformation. By combining these two techniques, our approach can reconstruct consistent deformations from a single image. Extensive experiments demonstrate that our approach significantly outperforms existing methods and naturally extends to various applications, such as frame interpolation and interactive object manipulation.

cs.AI [Back]

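[199] Towards mitigating information leakage when evaluating safety monitors cs.AI | cs.CL | cs.LG

Gerard Boxo, Aman Neelappa, Shivam Raval

TL;DR: The paper presents a systematic framework for evaluating whether safety monitors detect genuine model behavior rather than superficial elicitation artifacts, and proposes three new strategies for mitigating information leakage.

Details

Motivation: White-box monitors offer advantages for detecting harmful behavior in large language models, but training and evaluating them requires response samples of elicited behavior. Information used for elicitation leaks into the monitor's inputs and inflates its measured effectiveness.

Result: Content filtering can reduce probe AUROC by 30%, score filtering reduces AUROC by 15%, and a fine-tuned model organism reduces monitor performance by up to 40%, even when the monitor is re-trained.

Insight: Leakage takes two main forms (elicitation leakage and reasoning leakage), and targeted mitigation strategies are needed to avoid inflated performance during evaluation.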

Abstract: White box monitors that analyze model internals offer promising advantages for detecting potentially harmful behaviors in large language models, including lower computational costs and integration into layered defense systems. However, training and evaluating these monitors requires response exemplars that exhibit the target behaviors, typically elicited through prompting or fine-tuning. This presents a challenge when the information used to elicit behaviors inevitably leaks into the data that monitors ingest, inflating their effectiveness. We present a systematic framework for evaluating a monitor’s performance in terms of its ability to detect genuine model behavior rather than superficial elicitation artifacts. Furthermore, we propose three novel strategies to evaluate the monitor: content filtering (removing deception-related text from inputs), score filtering (aggregating only over task-relevant tokens), and prompt distilled fine-tuned model organisms (models trained to exhibit deceptive behavior without explicit prompting). Using deception detection as a representative case study, we identify two forms of leakage that inflate monitor performance: elicitation leakage from prompts that explicitly request harmful behavior, and reasoning leakage from models that verbalize their deceptive actions. Through experiments on multiple deception benchmarks, we apply our proposed mitigation strategies and measure performance retention. Our evaluation of the monitors reveals three crucial findings: (1) Content filtering is a good mitigation strategy that allows for a smooth removal of elicitation signal and can decrease probe AUROC by 30%; (2) Score filtering was found to reduce AUROC by 15% but is not as straightforward to attribute; (3) A fine-tuned model organism improves monitor evaluations but reduces their performance by up to 40%, even when re-trained.
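Score filtering, described above as aggregating only over task-relevant tokens, can be illustrated with a small sketch. The function names, the mean aggregation, and the boolean mask format are assumptions for illustration; the abstract does not specify how the authors aggregate scores or select tokens. A pairwise AUROC helper is included because that is the metric the results are reported in.

```python
def filtered_score(token_scores, task_mask):
    """Average per-token probe scores over task-relevant tokens only.

    task_mask[i] is True when token i belongs to the task-relevant part of
    the response (hypothetical mask; how tokens are selected is not given).
    """
    kept = [s for s, keep in zip(token_scores, task_mask) if keep]
    if not kept:
        raise ValueError("mask removed every token")
    return sum(kept) / len(kept)


def auroc(positive_scores, negative_scores):
    """Probability that a positive example outscores a negative one (ties count 0.5)."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))


# Toy example: the first two tokens echo the eliciting prompt and score high.
scores = [0.9, 0.9, 0.2, 0.4]
mask = [False, False, True, True]
score = filtered_score(scores, mask)
```

In this toy case the unfiltered mean would be dominated by the prompt-echo tokens, which is the kind of elicitation signal the filtering is meant to remove.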

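[200] UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios cs.AI | cs.CL

Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin

TL;DR: UltraHorizon is a new benchmark for evaluating agent capabilities in ultra-long-horizon scenarios, filling a gap left by existing benchmarks, which mostly cover short-horizon tasks.

Details

Motivation: Many critical real-world tasks (such as large-scale software development, commercial investment, and scientific discovery) require long-horizon reasoning, planning, and tool use, but existing benchmarks focus on short-horizon tasks and cannot evaluate these capabilities systematically.

Result: LLM agents consistently underperform on long-horizon tasks while human participants score higher, exposing a gap in agents' long-horizon abilities. Simple scaling also fails on these tasks.

Insight: Trajectory analysis attributes agent failures to two primary causes: in-context locking and gaps in fundamental functional capabilities. This points to directions for improving future agents.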

Abstract: Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tools management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents’ long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.

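[201] RISK: A Framework for GUI Agents in E-commerce Risk Management cs.AI | cs.CL

Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng

TL;DR: RISK is a framework for GUI agents in e-commerce risk management. It integrates three components (RISK-Data, RISK-Bench, and RISK-R1) to handle dynamic, interactive content and outperforms existing baselines on both single-step and multi-step tasks.

Details

Motivation: Traditional scraping methods and existing GUI agents cannot handle the complex, multi-step, dynamic interactions that e-commerce risk assessment requires.

Result: RISK-R1 improves offline single-step performance by 6.8% and offline multi-step performance by 8.8%, and reaches a top task success rate of 70.5% in online evaluation.

Insight: With rewards defined at several levels and reinforcement over multi-step interactions, RISK tackles the difficulty of handling dynamic content and offers a scalable tool for automating e-commerce risk management.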

Abstract: E-commerce risk management requires aggregating diverse, deeply embedded web data through multi-step, stateful interactions, which traditional scraping methods and most existing Graphical User Interface (GUI) agents cannot handle. These agents are typically limited to single-step tasks and lack the ability to manage dynamic, interactive content critical for effective risk assessment. To address this challenge, we introduce RISK, a novel framework designed to build and deploy GUI agents for this domain. RISK integrates three components: (1) RISK-Data, a dataset of 8,492 single-step and 2,386 multi-step interaction trajectories, collected through a high-fidelity browser framework and a meticulous data curation process; (2) RISK-Bench, a benchmark with 802 single-step and 320 multi-step trajectories across three difficulty levels for standardized evaluation; and (3) RISK-R1, an R1-style reinforcement fine-tuning framework considering four aspects: (i) Output Format: Updated format reward to enhance output syntactic correctness and task comprehension, (ii) Single-step Level: Stepwise accuracy reward to provide granular feedback during early training stages, (iii) Multi-step Level: Process reweight to emphasize critical later steps in interaction sequences, and (iv) Task Level: Level reweight to focus on tasks of varying difficulty. Experiments show that RISK-R1 outperforms existing baselines, achieving a 6.8% improvement in offline single-step and an 8.8% improvement in offline multi-step. Moreover, it attains a top task success rate of 70.5% in online evaluation. RISK provides a scalable, domain-specific solution for automating complex web interactions, advancing the state of the art in e-commerce risk management.

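[202] The Thinking Spectrum: An Emperical Study of Tunable Reasoning in LLMs through Model Merging cs.AI | cs.CL

Xiaochong Lan, Yu Zheng, Shiteng Cao, Yong Li

TL;DR: The paper studies tunable reasoning in large language models (LLMs) through model merging, showing how to trade off reasoning depth against computational cost, and finds that merging can improve both accuracy and efficiency in some cases.

Details

Motivation: Growing real-world demand for LLMs with tunable reasoning capabilities creates a need for efficient ways to balance reasoning depth against computational cost. Model merging has drawn attention as a training-free solution.

Result: Model merging effectively calibrates the trade-off between reasoning accuracy and token efficiency; in some cases the merged model achieves both higher accuracy and lower token consumption than one of its parents.

Insight: The paper shows the potential of model merging for tunable reasoning and offers practical guidelines for building LLMs that meet diverse application demands.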

Abstract: The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.
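The abstract describes merging as arithmetically combining the weights of a general-purpose model and a reasoning model, with a merging strength that is varied to trace an accuracy-efficiency curve. The sketch below shows the simplest such scheme, linear interpolation, on toy weight dictionaries. The function name, the dictionary layout, and the choice of linear interpolation are assumptions; the paper evaluates a range of merging techniques that the abstract does not enumerate.

```python
def merge_weights(general, reasoning, strength):
    """Linearly interpolate two models' weights, parameter by parameter.

    strength = 0.0 returns the general-purpose weights and strength = 1.0
    the reasoning weights. Weights are plain lists of floats here so the
    sketch stays dependency-free.
    """
    if general.keys() != reasoning.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [
            (1.0 - strength) * g + strength * r
            for g, r in zip(general[name], reasoning[name])
        ]
        for name in general
    }


# Hypothetical two-parameter "models".
general = {"w": [1.0, -2.0]}
reasoning = {"w": [3.0, 2.0]}

# Sweeping the strength yields a spectrum of merged models.
spectrum = {s: merge_weights(general, reasoning, s) for s in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Each merged model in `spectrum` would then be evaluated for accuracy and token consumption to place one point on the curve.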

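[203] A2R: An Asymmetric Two-Stage Reasoning Framework for Parallel Reasoning cs.AI | cs.CL

Ziqi Wang, Boye Niu, Zhongli Li, Linghui Meng, Jing Liu

TL;DR: A2R is a two-stage parallel reasoning framework in which an "explorer" model and a "synthesizer" model divide the work, narrowing the gap between a model's potential and its realized performance while improving efficiency.

Details

Motivation: Current large reasoning models show a persistent gap between single-attempt performance and the latent potential revealed across multiple solution paths. A2R aims to close this gap with a parallel reasoning framework.

Result: A2R gives Qwen3-8B-distill a 75% performance improvement over its self-consistency baseline. The A2R-Efficient variant surpasses the average performance of Qwen3-32B at nearly 30% lower cost.

Insight: Asymmetric model allocation (a smaller explorer paired with a larger synthesizer) is an effective paradigm for efficient reasoning.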

Abstract: Recent Large Reasoning Models have achieved significant improvements in complex task-solving capabilities by allocating more computation at the inference stage with a “thinking longer” paradigm. Even as the foundational reasoning capabilities of models advance rapidly, the persistent gap between a model’s performance in a single attempt and its latent potential, often revealed only across multiple solution paths, starkly highlights the disparity between its realized and inherent capabilities. To address this, we present A2R, an Asymmetric Two-Stage Reasoning framework designed to explicitly bridge the gap between a model’s potential and its actual performance. In this framework, an “explorer” model first generates potential solutions in parallel through repeated sampling. Subsequently, a “synthesizer” model integrates these references for a more refined, second stage of reasoning. This two-stage process allows computation to be scaled orthogonally to existing sequential methods. Our work makes two key innovations: First, we present A2R as a plug-and-play parallel reasoning framework that explicitly enhances a model’s capabilities on complex questions. For example, using our framework, the Qwen3-8B-distill model achieves a 75% performance improvement compared to its self-consistency baseline. Second, through a systematic analysis of the explorer and synthesizer roles, we identify an effective asymmetric scaling paradigm. This insight leads to A2R-Efficient, a “small-to-big” variant that combines a Qwen3-4B explorer with a Qwen3-8B synthesizer. This configuration surpasses the average performance of a monolithic Qwen3-32B model at a nearly 30% lower cost. Collectively, these results show that A2R is not only a performance-boosting framework but also an efficient and practical solution for real-world applications.
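The two-stage flow described above (parallel sampling by an explorer, then a second pass by a synthesizer over the collected candidates) can be sketched with stand-in callables. Everything named here is hypothetical: `toy_explorer` and `toy_synthesizer` are placeholders for model calls, and the toy synthesizer simply takes a majority vote, whereas the paper's synthesizer is a model that reasons over the references.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def a2r_answer(question, explorer, synthesizer, num_samples=4):
    """Stage 1: sample candidates in parallel. Stage 2: synthesize one answer."""
    with ThreadPoolExecutor(max_workers=num_samples) as pool:
        # pool.map preserves input order, so candidates[i] came from seed i.
        candidates = list(pool.map(lambda seed: explorer(question, seed), range(num_samples)))
    return synthesizer(question, candidates)


def toy_explorer(question, seed):
    """Placeholder for a sampled model response; seed 0 yields a wrong answer."""
    return "42" if seed % 4 else "41"


def toy_synthesizer(question, candidates):
    """Placeholder for the second-stage model: majority vote over candidates."""
    return Counter(candidates).most_common(1)[0][0]


final = a2r_answer("What is 6 * 7?", toy_explorer, toy_synthesizer)
```

Because the two roles are separate callables, a smaller explorer can be paired with a larger synthesizer, which is the asymmetric setup the abstract calls A2R-Efficient.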

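[204] InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning cs.AI | cs.CL

Guanghao Zhu, Zhitian Hou, Zeyu Liu, Zhijie Sang, Congkai Xie

TL;DR: The paper introduces the InfiMed-Foundation family of multimodal medical models, which use compute-efficient pre-training and multi-stage fine-tuning to address domain knowledge, data quality, and computational efficiency in the medical domain.

Details

Motivation: General-purpose multimodal models fall short in medicine: they lack specialized knowledge and are costly to train further. The paper aims to build efficient, specialized medical multimodal models.

Result: InfiMed-Foundation-4B surpasses models such as HuatuoGPT-V-7B and MedGemma-27B-IT on medical visual question answering and diagnostic tasks.

Insight: Improving data quality and training efficiency can markedly improve medical multimodal models and lead to more reliable medical AI.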

Abstract: Multimodal large language models (MLLMs) have shown remarkable potential in various domains, yet their application in the medical field is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in radiology and pharmacology. Additionally, the computational cost of continual pretraining with large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combined high-quality general-purpose and medical multimodal data and proposed a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ low-to-high image resolution and multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. The InfiMed-Foundation-4B model is available at https://huggingface.co/InfiX-ai/InfiMed-Foundation-4B.

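[205] PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning cs.AI | cs.CL

Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang

TL;DR: PRIME is a multi-agent reasoning framework that dynamically integrates fast, intuitive thinking (System 1) and slow, deliberate thinking (System 2) from human cognition, using a quick response followed by a structured reasoning pipeline to improve LLMs on multi-hop and knowledge-intensive tasks.

Details

Motivation: Inspired by the dual-process theory of human cognition, PRIME targets the trade-off between efficiency and accuracy that existing LLMs face on complex reasoning tasks.

Result: Experiments with LLaMA 3 models show that PRIME performs competitively with closed-source models such as GPT-4 on multi-hop and knowledge-intensive tasks.

Insight: By mimicking human cognitive processes, PRIME shows the potential of multi-agent design for improving LLM reasoning.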

Abstract: Inspired by the dual-process theory of human cognition from Thinking, Fast and Slow, we introduce PRIME (Planning and Retrieval-Integrated Memory for Enhanced Reasoning), a multi-agent reasoning framework that dynamically integrates System 1 (fast, intuitive thinking) and System 2 (slow, deliberate thinking). PRIME first employs a Quick Thinking Agent (System 1) to generate a rapid answer; if uncertainty is detected, it then triggers a structured System 2 reasoning pipeline composed of specialized agents for planning, hypothesis generation, retrieval, information integration, and decision-making. This multi-agent design faithfully mimics human cognitive processes and enhances both efficiency and accuracy. Experimental results with LLaMA 3 models demonstrate that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning. This research establishes PRIME as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning.
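The control flow described above (answer quickly, escalate to a staged pipeline only when uncertain) might look like the following sketch. The confidence threshold, the `(answer, confidence)` return shape of the quick agent, and the dictionary passed between stages are all assumptions; the abstract does not say how PRIME detects uncertainty or how its agents exchange information.

```python
def prime_answer(question, quick_agent, system2_stages, confidence_threshold=0.7):
    """Return (answer, route). Escalate to System 2 only when System 1 is unsure."""
    draft, confidence = quick_agent(question)
    if confidence >= confidence_threshold:
        return draft, "system1"
    # System 2: run the specialized stages in order, threading a shared state.
    state = {"question": question, "draft": draft}
    for stage in system2_stages:
        state = stage(state)
    return state["answer"], "system2"


# Hypothetical stand-ins for the agents.
def unsure_agent(question):
    return "Paris?", 0.4


def sure_agent(question):
    return "Paris", 0.95


def retrieve(state):
    return {**state, "evidence": "Paris is the capital of France."}


def decide(state):
    return {**state, "answer": "Paris"}


fast = prime_answer("Capital of France?", sure_agent, [retrieve, decide])
slow = prime_answer("Capital of France?", unsure_agent, [retrieve, decide])
```

Only two stages are stubbed here; the abstract lists five (planning, hypothesis generation, retrieval, information integration, decision-making), which would slot into the same list.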

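[206] Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time cs.AI | cs.CL | cs.LG

Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang

TL;DR: The paper proposes Dynamic Experts Search (DES), which strengthens the reasoning of Mixture-of-Experts (MoE) LLMs by varying the number of activated experts at inference time, without additional cost.

Details

Motivation: Existing test-time scaling (TTS) methods rely mainly on output-level sampling and overlook the role of model architecture. In MoE LLMs, activating different numbers of experts yields complementary solution sets, and exploiting this diversity can improve reasoning.

Result: Across MoE architectures, verifiers, and reasoning benchmarks in multiple domains, DES reliably outperforms baseline TTS methods, improving accuracy and stability.

Insight: The structural flexibility of modern LLMs offers a new route to better reasoning; DES demonstrates that architecture-aware TTS is practical and scalable.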

Abstract: Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.
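To make "direct control of expert counts during inference" concrete, here is a minimal sketch of top-k expert routing where k is a runtime argument. Renormalizing with a softmax over only the selected experts is one common MoE convention and is an assumption here, as are the function name and the example logits; the abstract does not describe the routing details of the models DES was tested on.

```python
import math


def route_tokens_to_experts(router_logits, num_experts):
    """Pick the top `num_experts` experts for one token and weight them.

    Returns {expert_index: weight}, with weights summing to 1.
    """
    if not 1 <= num_experts <= len(router_logits):
        raise ValueError("num_experts must be between 1 and the number of experts")
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:num_experts]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical router logits for a four-expert layer.
logits = [0.1, 2.0, -1.0, 1.5]

# Varying the expert count across runs gives different computation paths.
paths = {k: route_tokens_to_experts(logits, k) for k in (1, 2, 3)}
```

Per the abstract's Expert Configuration Inheritance, a single reasoning path would keep one value of k throughout, while different runs use different values.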

cs.RO [Back]

Chih Yao Hu, Yang-Sen Lin, Yuna Lee, Chih-Hai Su, Jie-Ying Lee

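TL;DR: SPF is a training-free UAV navigation framework that uses a VLM to decompose free-form instructions into 2D waypoint predictions, which are then converted into 3D navigation commands. It performs strongly in both simulation and real-world tests.

Details

Motivation: Existing VLM-based methods treat action prediction as a text generation task; the authors argue that AVLN is better framed as a 2D spatial grounding task, which motivates SPF.

Result: SPF outperforms the previous best method on the simulation benchmark by an absolute margin of 63% and outperforms strong baselines by a large margin in real-world evaluations.

Insight: Framing navigation as 2D spatial grounding is more effective than text generation, and the training-free design generalizes across different VLMs.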

Abstract: We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instructions in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF also adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art in DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choice. Lastly, SPF shows remarkable generalization to different VLMs. Project page: https://spf-web.pages.dev
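The step from a predicted 2D waypoint plus travel distance to a 3D displacement vector can be illustrated with a pinhole camera model. This is a sketch under stated assumptions, not SPF's actual conversion: the intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-frame convention (x right, y down, z forward), and the function name are all hypothetical.

```python
import math


def waypoint_to_displacement(u, v, distance, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a unit ray and scale it by the travel distance.

    Returns (dx, dy, dz) in the camera frame: x right, y down, z forward.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (distance * x / norm, distance * y / norm, distance / norm)


# Hypothetical 640x480 camera with a 500-pixel focal length.
straight_ahead = waypoint_to_displacement(320, 240, 2.0, 500, 500, 320, 240)
up_and_right = waypoint_to_displacement(420, 140, 2.0, 500, 500, 320, 240)
```

A waypoint at the image center maps to pure forward motion, while one above and to the right of center adds rightward and upward (negative y) components. A real system would still need to rotate this vector from the camera frame into the vehicle's body or world frame.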

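[208] Language-in-the-Loop Culvert Inspection on the Erie Canal cs.RO | cs.CV

Yashom Dighe, Yash Turkar, Karthik Dantu

TL;DR: The paper presents VISION, a system that couples a vision-language model with constrained viewpoint planning for autonomous inspection of culverts on the Erie Canal. Language prompts yield region-of-interest (ROI) proposals, stereo depth is fused to recover scale, and targeted re-imaging improves inspection accuracy.

Details

Motivation: Human inspection of culverts is challenging due to age, complex geometry, poor illumination, and difficult access, so an automated solution is needed.

Result: In field testing, initial ROI proposals reached 61.4% agreement with subject-matter experts, rising to 80% after re-imaging.

Insight: Language-driven vision models show promise in complex environments such as culverts, and closed-loop planning matters for inspection accuracy.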

Abstract: Culverts on canals such as the Erie Canal, built originally in 1825, require frequent inspections to ensure safe operation. Human inspection of culverts is challenging due to age, geometry, poor illumination, weather, and lack of easy access. We introduce VISION, an end-to-end, language-in-the-loop autonomy system that couples a web-scale vision-language model (VLM) with constrained viewpoint planning for autonomous inspection of culverts. Brief prompts to the VLM solicit open-vocabulary ROI proposals with rationales and confidences, stereo depth is fused to recover scale, and a planner – aware of culvert constraints – commands repositioning moves to capture targeted close-ups. Deployed on a quadruped in a culvert under the Erie Canal, VISION closes the see, decide, move, re-image loop on-board and produces high-resolution images for detailed reporting without domain-specific fine-tuning. In an external evaluation by New York Canal Corporation personnel, initial ROI proposals achieved 61.4% agreement with subject-matter experts, and final post-re-imaging assessments reached 80%, indicating that VISION converts tentative hypotheses into grounded, expert-aligned findings.

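[209] RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation cs.RO | cs.CV

Enguang Liu, Siyuan Liang, Liming Lu, Xiyu Zeng, Xiaochun Cao

TL;DR: The paper proposes RoboView-Bias, a benchmark for systematically quantifying visual bias in robotic manipulation, and uses it to characterize the effects of visual bias and evaluate a mitigation strategy.

Details

Motivation: Existing benchmarks focus on generalization and robustness, while systematic quantification of visual bias remains scarce, limiting understanding of how perception affects decision-making stability.

Result: Experiments show that (i) all agents exhibit significant visual bias, with camera viewpoint the most critical factor; (ii) agents achieve their highest success rates on highly saturated colors; and (iii) visual biases show asymmetric coupling.

Insight: A mitigation strategy based on a semantic grounding layer reduces visual bias by approximately 54.5% on MOKA, underscoring that systematic analysis of visual bias is necessary for building safe and reliable robotic agents.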

Abstract: The safety and reliability of embodied agents rely on accurate and unbiased visual perception. However, existing benchmarks mainly emphasize generalization and robustness under perturbations, while systematic quantification of visual bias remains scarce. This gap limits a deeper understanding of how perception influences decision-making stability. To address this issue, we propose RoboView-Bias, the first benchmark specifically designed to systematically quantify visual bias in robotic manipulation, following a principle of factor isolation. Leveraging a structured variant-generation framework and a perceptual-fairness validation protocol, we create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Using this benchmark, we systematically evaluate three representative embodied agents across two prevailing paradigms and report three key findings: (i) all agents exhibit significant visual biases, with camera viewpoint being the most critical factor; (ii) agents achieve their highest success rates on highly saturated colors, indicating inherited visual preferences from underlying VLMs; and (iii) visual biases show strong, asymmetric coupling, with viewpoint strongly amplifying color-related bias. Finally, we demonstrate that a mitigation strategy based on a semantic grounding layer substantially reduces visual bias by approximately 54.5% on MOKA. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.

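[210] MINT-RVAE: Multi-Cues Intention Prediction of Human-Robot Interaction using Human Pose and Emotion Information from RGB-only Camera Data cs.RO | cs.CV

Farida Mohsen, Ali Safa

TL;DR: The paper proposes MINT-RVAE, an RGB-only approach to predicting human-robot interaction intent from human pose and emotion cues; it addresses class imbalance in the data and outperforms prior methods.

Details

Motivation: Existing intent-prediction methods for human-robot interaction mostly rely on multimodal input such as RGB-D, which limits where they can be deployed. The paper aims for a lightweight solution that needs only an RGB camera while also addressing class imbalance.

Result: The method reaches an AUROC of 0.95, outperforming prior work (0.90-0.912), and supports frame-level prediction.

Insight: RGB-only methods can deliver high-performance intent prediction, and synthetic sequence generation can mitigate class imbalance, which matters for future lightweight deployments.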

Abstract: Efficiently detecting human intent to interact with ubiquitous robots is crucial for effective human-robot interaction (HRI) and collaboration. Over the past decade, deep learning has gained traction in this field, with most existing approaches relying on multimodal inputs, such as RGB combined with depth (RGB-D), to classify time-sequence windows of sensory data as interactive or non-interactive. In contrast, we propose a novel RGB-only pipeline for predicting human interaction intent with frame-level precision, enabling faster robot responses and improved service quality. A key challenge in intent prediction is the class imbalance inherent in real-world HRI datasets, which can hinder the model’s training and generalization. To address this, we introduce MINT-RVAE, a synthetic sequence generation method, along with new loss functions and training strategies that enhance generalization on out-of-sample data. Our approach achieves state-of-the-art performance (AUROC: 0.95) outperforming prior works (AUROC: 0.90-0.912), while requiring only RGB input and supporting precise frame onset prediction. Finally, to support future research, we openly release our new dataset with frame-level labeling of human interaction intent.

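[211] WoW: Towards a World omniscient World model Through Embodied Interaction cs.RO | cs.CV | cs.MM

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi

TL;DR: The paper presents WoW, a generative world model trained on robot interaction trajectories. It argues that physical intuition develops through real-world interaction, and the model performs strongly on a physical-consistency benchmark.

Details

Motivation: Current video models such as Sora rely on passive observation and struggle to grasp physical causality. The authors hypothesize that authentic physical intuition must be grounded in extensive, causally rich interaction with the real world.

Result: WoW achieves state-of-the-art performance on WoWBench, demonstrating strong ability in physical causality, collision dynamics, and object permanence, and provides evidence that large-scale real-world interaction is necessary for developing physical intuition in AI.

Insight: Large-scale real-world interaction is key to physical intuition in AI. The model's understanding of physics is a probability distribution over plausible outcomes, and iterative constraints are needed to push it toward physical realism.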

Abstract: Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model’s understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.

cs.SD [Back]

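[212] MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark cs.SD | cs.AI | cs.CL | eess.AS

Hui Li, Changhao Jiang, Hongyu Wang, Ming Zhang, Jiajun Sun

TL;DR: MDAR is a benchmark for multi-scene dynamic audio reasoning, comprising 3,000 question-answer pairs linked to diverse audio clips for evaluating complex reasoning. Tests on 26 state-of-the-art audio language models show limited performance on complex tasks.

Details

Motivation: Existing audio benchmarks mainly target static or single-scene settings and do not fully capture scenarios in which multiple speakers, unfolding events, and heterogeneous audio sources interact. MDAR aims to fill this gap.

Result: Qwen2.5-Omni reaches 76.67% on single-choice questions, ahead of GPT-4o Audio (68.47%), while GPT-4o Audio is substantially stronger on multiple-choice and open-ended tasks. Across all three question types, no model reaches 80%.

Insight: MDAR exposes the distinctive challenges of complex audio reasoning and provides a useful benchmark for future research.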

Abstract: The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at https://github.com/luckyerr/MDAR.

cs.NI [Back]

[213] Evaluating Open-Source Large Language Models for Technical Telecom Question Answering cs.NI | cs.CL

Arina Caraus, Alessio Buscemi, Sumit Kumar, Ion Turcanu

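TL;DR: The paper evaluates two open-source large language models (Gemma 3 27B and DeepSeek R1 32B) on technical telecom question answering. Gemma does better on semantic fidelity and correctness, while DeepSeek is slightly ahead on lexical consistency.

Details

Motivation: Although large language models (LLMs) perform well across many fields, their capabilities in specialized domains such as telecommunications remain underexplored, so their reliability and suitability need to be evaluated.

Result: Gemma does better on semantic fidelity and LLM-rated correctness, while DeepSeek shows slightly higher lexical consistency. The evaluation also exposes the limitations of current models in telecom applications.

Insight: The telecom domain needs more domain-adapted models to support trustworthy AI assistants, and LLM hallucination and consistency in this domain still need improvement.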

Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various fields. However, their performance in technical domains such as telecommunications remains underexplored. This paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on factual and reasoning-based questions derived from advanced wireless communications material. We construct a benchmark of 105 question-answer pairs and assess performance using lexical metrics, semantic similarity, and LLM-as-a-judge scoring. We also analyze consistency, judgment reliability, and hallucination through source attribution and score variance. Results show that Gemma excels in semantic fidelity and LLM-rated correctness, while DeepSeek demonstrates slightly higher lexical consistency. Additional findings highlight current limitations in telecom applications and the need for domain-adapted models to support trustworthy Artificial Intelligence (AI) assistants in engineering.

stat.ML [Back]

[214] Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback stat.ML | cs.AI | cs.CL | cs.LG | math.ST | stat.TH

Gen Li, Yuling Yan

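TL;DR: The paper studies efficient exploration for online reinforcement learning with human feedback (RLHF). It proposes a new exploration scheme that reduces uncertainty in the reward differences most relevant to policy improvement, and proves regret bounds that improve on existing optimism-based exploration methods.

Details

Motivation: Existing optimism-based exploration methods for RLHF are sample-inefficient: they tend to collect uninformative comparisons, which can lead to linear regret. The paper sets out to fix this with a more efficient online exploration method.

Result: The theoretical analysis establishes a regret bound of order $T^{(\beta+1)/(\beta+2)}$ for the new method, which improves on existing methods.

Insight: Directing queries toward highly informative preference data can substantially improve the efficiency of online exploration in RLHF and avoids linear regret.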

Abstract: Reinforcement learning with human feedback (RLHF), which learns a reward model from human preference data and then optimizes a policy to favor preferred responses, has emerged as a central paradigm for aligning large language models (LLMs) with human preferences. In this paper, we investigate exploration principles for online RLHF, where one seeks to adaptively collect new preference data to refine both the reward model and the policy in a data-efficient manner. By examining existing optimism-based exploration algorithms, we identify a drawback in their sampling protocol: they tend to gather comparisons that fail to reduce the most informative uncertainties in reward differences, and we prove lower bounds showing that such methods can incur linear regret over exponentially long horizons. Motivated by this insight, we propose a new exploration scheme that directs preference queries toward reducing uncertainty in reward differences most relevant to policy improvement. Under a multi-armed bandit model of RLHF, we establish regret bounds of order $T^{(\beta+1)/(\beta+2)}$, where $\beta>0$ is a hyperparameter that balances reward maximization against mitigating distribution shift. To our knowledge, this is the first online RLHF algorithm with regret scaling polynomially in all model parameters.
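
A short worked note on why this rate is sublinear, using only the exponent stated in the abstract (the limiting cases are simple arithmetic, not claims taken from the paper):

$$
\frac{\beta+1}{\beta+2} \;=\; 1 - \frac{1}{\beta+2} \;\in\; \left(\tfrac{1}{2},\, 1\right) \quad \text{for } \beta > 0,
\qquad
\frac{T^{(\beta+1)/(\beta+2)}}{T} \;=\; T^{-1/(\beta+2)} \;\longrightarrow\; 0 \quad \text{as } T \to \infty .
$$

So the average regret per round vanishes for any fixed $\beta > 0$, in contrast to the linear regret that the abstract says optimism-based methods can incur. The exponent approaches $1/2$ as $\beta \to 0$ and approaches $1$ as $\beta \to \infty$.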

cs.IR [Back]

Jiahao Zhang, Wenzhe Yin, Shujian Yu

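TL;DR: The paper proposes a cross-modal retrieval method based on the Cauchy-Schwarz (CS) divergence, and uses a Generalized CS (GCS) divergence to align multiple modalities, improving retrieval performance and training stability.

Details

Motivation: Existing cross-modal retrieval methods mostly target bi-modal tasks and rely on distribution alignment techniques such as KL divergence and MMD. These suffer from numerical instability and hyperparameter sensitivity, and are hard to extend directly to alignment across more than two modalities.

Result: On six benchmark datasets, the CS/GCS method performs well on both bi-modal and tri-modal retrieval tasks.

Insight: The CS/GCS divergences capture distributional structure more effectively while avoiding the numerical instability and hyperparameter sensitivity of traditional methods, giving a general framework for multimodal alignment.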

Abstract: Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and their inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder’s inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at https://github.com/JiahaoZhang666/CSD.
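
To make the quantity concrete, here is a minimal Python sketch of the Cauchy-Schwarz divergence between two discrete distributions, $D_{CS}(p, q) = -\log\big((\sum_i p_i q_i)^2 / (\sum_i p_i^2 \sum_i q_i^2)\big)$. The function name `cs_divergence` and the discrete setting are assumptions for illustration; the abstract does not say how the paper estimates the divergence on learned embeddings, and the GCS extension is not shown. Some references define the divergence without the square, which halves the value.

```python
import math


def cs_divergence(p, q):
    """Cauchy-Schwarz divergence between two discrete distributions.

    Illustrative sketch only: p and q are sequences of non-negative
    weights over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    cross = sum(a * b for a, b in zip(p, q))   # sum_i p_i * q_i
    norm_p = sum(a * a for a in p)             # sum_i p_i^2
    norm_q = sum(b * b for b in q)             # sum_i q_i^2
    if norm_p == 0.0 or norm_q == 0.0:
        raise ValueError("p and q must each have some non-zero mass")
    if cross == 0.0:
        return math.inf                        # disjoint supports
    return -math.log(cross * cross / (norm_p * norm_q))


if __name__ == "__main__":
    p = [0.2, 0.3, 0.5]
    q = [0.5, 0.3, 0.2]
    print(cs_divergence(p, q))
```

By the Cauchy-Schwarz inequality the value is non-negative; it is symmetric in its arguments and equals zero when the two distributions coincide.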

eess.IV [Back]

[216] Comparative Analysis of GAN and Diffusion for MRI-to-CT translation eess.IV | cs.CV | cs.LG

Emily Honey, Anders Helbo, Jens Petersen

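TL;DR: This paper compares a conditional generative adversarial network (cGAN) and a conditional denoising diffusion probabilistic model (cDDPM) on MRI-to-CT translation, and finds that multi-channel conditional input and the cDDPM architecture both help.

Details

Motivation: Because CT scans can be difficult to obtain in some cases, researchers want methods that generate synthetic CT (sCT) from MRI, so it is worth establishing which strategies work best for this task.

Result: Experiments show that multi-channel conditional input and the cDDPM architecture perform better on MRI-to-CT translation.

Insight: Translating 2D slices can reduce computational cost; cDDPM outperforms cGAN in generation quality and slice continuity.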

Abstract: Computed tomography (CT) is essential for treatment and diagnostics. When CT scans are missing or otherwise difficult to obtain, methods for generating synthetic CT (sCT) images from magnetic resonance imaging (MRI) images are sought after. Therefore, it is valuable to establish a reference for what strategies are most effective for MRI-to-CT translation. In this paper, we compare the performance of two frequently used architectures for MRI-to-CT translation: a conditional generative adversarial network (cGAN) and a conditional denoising diffusion probabilistic model (cDDPM). We chose well-established implementations to represent each architecture: Pix2Pix for cGAN, and Palette for cDDPM. We separate the classical 3D translation problem into a sequence of 2D translations on the transverse plane, to investigate the viability of a strategy that reduces the computational cost. We also investigate the impact of conditioning the generative process on a single MRI image/slice and on multiple MRI slices. The performance is assessed using a thorough evaluation protocol, including a novel slice-wise metric Similarity Of Slices (SIMOS), which measures the continuity between transverse slices when compiling the sCTs into 3D format. Our comparative analysis revealed that MRI-to-CT generative models benefit from multi-channel conditional input and using cDDPM as an architecture.

[217] COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics eess.IV | cs.CV | cs.LG | stat.AP | stat.ML

Matt Y. Cheung, Ashok Veeraraghavan, Guha Balakrishnan

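TL;DR: COMPASS is a framework for uncertainty quantification of medical image segmentation metrics. It perturbs the model’s intermediate features to produce efficient conformal prediction intervals that are tighter than those of traditional methods.

Details

Motivation: In clinical applications, the utility of a segmentation model often depends on the accuracy of downstream metrics (such as organ size) rather than on pixel-level segmentation accuracy. Efficient uncertainty quantification for these metrics is therefore crucial.

Result: On four medical image segmentation tasks, COMPASS produces tighter intervals than traditional methods and maintains target coverage under covariate shift.

Insight: Using the model’s internal features, rather than relying only on an end-to-end black-box approach, can substantially improve the efficiency of uncertainty quantification.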

Abstract: In clinical applications, the utility of segmentation models is often based on the accuracy of derived downstream metrics such as organ size, rather than by the pixel-level accuracy of the segmentation masks themselves. Thus, uncertainty quantification for such metrics is crucial for decision-making. Conformal prediction (CP) is a popular framework to derive such principled uncertainty guarantees, but applying CP naively to the final scalar metric is inefficient because it treats the complex, non-linear segmentation-to-metric pipeline as a black box. We introduce COMPASS, a practical framework that generates efficient, metric-based CP intervals for image segmentation models by leveraging the inductive biases of their underlying deep neural networks. COMPASS performs calibration directly in the model’s representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric. We prove that COMPASS achieves valid marginal coverage under exchangeability and nestedness assumptions. Empirically, we demonstrate that COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation of skin lesions and anatomical structures. Furthermore, we show that leveraging learned internal features to estimate importance weights allows COMPASS to also recover target coverage under covariate shifts. COMPASS paves the way for practical, metric-based uncertainty quantification for medical image segmentation.
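
For reference, the naive baseline that the abstract contrasts COMPASS with, split conformal prediction applied directly to the final scalar metric, can be sketched in a few lines of Python. This is a generic textbook construction using absolute residuals as the nonconformity score, not the COMPASS procedure; the function name and the toy numbers are hypothetical.

```python
import math


def split_conformal_interval(cal_preds, cal_truths, test_pred, alpha=0.1):
    """Prediction interval for a scalar metric via split conformal prediction.

    cal_preds / cal_truths: predicted and true metric values (e.g. lesion
    area) on a held-out calibration set. Under exchangeability the returned
    interval covers the true test value with probability >= 1 - alpha.
    """
    scores = sorted(abs(p - y) for p, y in zip(cal_preds, cal_truths))
    n = len(scores)
    # Rank of the finite-sample-corrected (1 - alpha) quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # interval is valid.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (test_pred - q, test_pred + q)


if __name__ == "__main__":
    # Toy calibration set with residuals 1, 2, ..., 20 (made-up numbers).
    preds = [float(i) for i in range(1, 21)]
    truths = [0.0] * 20
    print(split_conformal_interval(preds, truths, test_pred=100.0))
```

The interval width here is the same for every test image, which is one reason treating the segmentation-to-metric pipeline as a black box tends to be inefficient.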

cs.LG [Back]

[218] LLMs for Bayesian Optimization in Scientific Domains: Are We There Yet? cs.LG | cs.CL

Rushil Gupta, Jason Hartford, Bang Liu

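TL;DR: The paper evaluates LLMs on experimental design and finds that they are insensitive to experimental feedback, with classical methods such as linear bandits and Gaussian process optimization performing better. It proposes LLMNN, a hybrid method that combines LLM prior knowledge with nearest-neighbor sampling and performs well.

Details

Motivation: To examine whether large language models (LLMs) can serve as general-purpose experimental design agents, particularly for Bayesian optimization tasks in scientific domains.

Result: LLMs are insensitive to experimental feedback and classical methods do better; LLMNN achieves competitive or superior performance across domains without requiring significant in-context adaptation.

Insight: Current LLMs do not perform in-context experimental design in practice; hybrid frameworks are needed that decouple prior-based reasoning from batch acquisition with updated posteriors.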

Abstract: Large language models (LLMs) have recently been proposed as general-purpose agents for experimental design, with claims that they can perform in-context experimental design. We evaluate this hypothesis using both open- and closed-source instruction-tuned LLMs applied to genetic perturbation and molecular property discovery tasks. We find that LLM-based agents show no sensitivity to experimental feedback: replacing true outcomes with randomly permuted labels has no impact on performance. Across benchmarks, classical methods such as linear bandits and Gaussian process optimization consistently outperform LLM agents. We further propose a simple hybrid method, LLM-guided Nearest Neighbour (LLMNN) sampling, that combines LLM prior knowledge with nearest-neighbor sampling to guide the design of experiments. LLMNN achieves competitive or superior performance across domains without requiring significant in-context adaptation. These results suggest that current open- and closed-source LLMs do not perform in-context experimental design in practice and highlight the need for hybrid frameworks that decouple prior-based reasoning from batch acquisition with updated posteriors.

[219] Leveraging Big Data Frameworks for Spam Detection in Amazon Reviews cs.LG | cs.CL

Mst Eshita Khatun, Halima Akter, Tasnimul Rehan, Toufiq Ahmed

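TL;DR: The study uses a big data framework and machine learning methods to detect and classify spam reviews among Amazon product reviews; the Logistic Regression model reaches 90.35% accuracy.

Details

Motivation: In online shopping, fake reviews mislead consumers and damage sellers’ reputations, so efficient methods are needed to detect spam reviews and improve review authenticity.

Result: Logistic Regression reaches 90.35% accuracy on spam review detection, supporting the effectiveness of the approach.

Insight: Combining the efficiency of big data frameworks with machine learning methods can improve the accuracy of spam review detection and may transfer to similar scenarios.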

Abstract: In this digital era, online shopping is common practice in our daily lives. Product reviews significantly influence consumer buying behavior and help establish buyer trust. However, the prevalence of fraudulent reviews undermines this trust by potentially misleading consumers and damaging the reputations of the sellers. This research addresses this pressing issue by employing advanced big data analytics and machine learning approaches on a substantial dataset of Amazon product reviews. The primary objective is to detect and classify spam reviews accurately so that it enhances the authenticity of the review. Using a scalable big data framework, we efficiently process and analyze a large scale of review data, extracting key features indicative of fraudulent behavior. Our study illustrates the utility of various machine learning classifiers in detecting spam reviews, with Logistic Regression achieving an accuracy of 90.35%, thus contributing to a more trustworthy and transparent online shopping environment.

[220] EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning cs.LG | cs.CL

Xu Wujiang, Wentian Zhao, Zhenting Wang, Li Yu-Jhe, Jin Can

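TL;DR: EPO is a new method for the training failures of large language model (LLM) agents in multi-turn, sparse-reward environments; it improves performance through entropy regularization and a dynamic balancing mechanism.

Details

Motivation: In multi-turn sparse-reward environments, LLM agents tend to fail through premature policy convergence early in training and policy collapse late in training, and conventional entropy regularization works poorly. EPO aims to solve this problem.

Result: Achieves performance improvements of up to 152% on ScienceWorld and up to 19.8% on ALFWorld.

Insight: Multi-turn sparse-reward environments require entropy control that differs from traditional reinforcement learning, and EPO offers a new approach.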

Abstract: Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
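
One way to picture mechanism (2) is as a penalty that is zero while the current policy entropy stays inside a band around its historical average and grows linearly outside it. The Python sketch below is a hypothetical illustration only: the band width `tolerance`, the linear penalty, and all function names are assumptions, since the abstract does not give EPO’s exact formulation.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_smoothing_penalty(current_entropy, entropy_history, tolerance=0.2):
    """Penalty for leaving a band around the historical mean entropy.

    Hypothetical form (not taken from the paper): the band is
    [(1 - tolerance) * mean, (1 + tolerance) * mean], and the penalty is
    the distance from the current entropy to the nearest band edge.
    """
    if not entropy_history:
        return 0.0  # nothing to compare against yet
    mean_entropy = sum(entropy_history) / len(entropy_history)
    lower = (1.0 - tolerance) * mean_entropy
    upper = (1.0 + tolerance) * mean_entropy
    if current_entropy < lower:
        return lower - current_entropy   # entropy dropped too fast
    if current_entropy > upper:
        return current_entropy - upper   # entropy spiked too fast
    return 0.0


if __name__ == "__main__":
    history = [policy_entropy([0.5, 0.3, 0.2]) for _ in range(3)]
    print(entropy_smoothing_penalty(policy_entropy([0.98, 0.01, 0.01]), history))
```

A term of this kind would be added to the training loss with some weight; how that weight changes across training phases (mechanism 3) is not modeled here.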

[221] Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning cs.LG | cs.AI | cs.CL | cs.CV | cs.MA

Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin

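TL;DR: SPEAR is a curriculum-based self-imitation learning method that balances exploration and exploitation in reinforcement learning while avoiding entropy collapse and divergence.

Details

Motivation: Reinforcement learning on multi-turn, long-horizon tasks faces an exploration-exploitation trade-off. Existing methods promote exploration by maximizing policy entropy, which easily destabilizes training. The paper aims for a progressive exploration-exploitation balance through self-imitation learning.

Result: SPEAR performs well on sparse-reward, long-horizon tasks, improving training stability and accelerating solution iteration.

Insight: A dynamic exploration-exploitation balance can be achieved through self-imitation learning and curriculum learning, avoiding the instability of conventional entropy maximization.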

Abstract: Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent’s own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy update, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation gets strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address the potential policy drift. Regularizations such as the clipping of tokens with high covariance between probability and advantage are introduced to the trajectory-level entropy control to curb over-confidence.

[222] IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning cs.LG | cs.AI | cs.CL

Aayush Mishra, Daniel Khashabi, Anqi Liu

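TL;DR: The paper proposes ICL Activation Alignment (IA2), a self-distillation technique that replicates ICL activation patterns in SFT models to improve their accuracy and calibration, and thereby the results of supervised fine-tuning.

Details

Motivation: ICL and SFT adapt through different mechanisms, and in data-scarce settings ICL generalizes better and is better calibrated. The authors want to use ICL’s internal computations to improve SFT.

Result: Experiments on 12 popular benchmarks and 2 model families show that IA2 significantly improves the accuracy and calibration of model outputs.

Insight: The finding is practically useful and also offers a window into the internal mechanics of model adaptation.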

Abstract: Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL’s internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL’s rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL’s activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.

[223] VISION: Prompting Ocean Vertical Velocity Reconstruction from Incomplete Observations cs.LG | cs.CV | physics.ao-ph

Yuan Gao, Hao Wu, Qingsong Wen, Kun Wang, Xian Wu

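TL;DR: The paper proposes VISION, which uses dynamic prompting to reconstruct ocean vertical velocity fields from incomplete surface observations, and releases KD48, a high-quality benchmark; reconstruction performance improves substantially.

Details

Motivation: Reconstructing vertical velocity fields from limited surface observations is a key challenge in ocean dynamics, and the lack of standardized, analysis-ready benchmarks has held back progress.

Result: On the KD48 benchmark, VISION substantially outperforms existing methods and generalizes well under extreme missing-data conditions.

Insight: Dynamic prompting offers a new approach to ocean science under incomplete observations, and a high-quality benchmark supports standardization in the field.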

Abstract: Reconstructing subsurface ocean dynamics, such as vertical velocity fields, from incomplete surface observations poses a critical challenge in Earth science, a field long hampered by the lack of standardized, analysis-ready benchmarks. To systematically address this issue and catalyze research, we first build and release KD48, a high-resolution ocean dynamics benchmark derived from petascale simulations and curated with expert-driven denoising. Building on this benchmark, we introduce VISION, a novel reconstruction paradigm based on Dynamic Prompting designed to tackle the core problem of missing data in real-world observations. The essence of VISION lies in its ability to generate a visual prompt on-the-fly from any available subset of observations, which encodes both data availability and the ocean’s physical state. More importantly, we design a State-conditioned Prompting module that efficiently injects this prompt into a universal backbone, endowed with geometry- and scale-aware operators, to guide its adaptive adjustment of computational strategies. This mechanism enables VISION to precisely handle the challenges posed by varying input combinations. Extensive experiments on the KD48 benchmark demonstrate that VISION not only substantially outperforms state-of-the-art models but also exhibits strong generalization under extreme data missing scenarios. By providing a high-quality benchmark and a robust model, our work establishes a solid infrastructure for ocean science research under data uncertainty. Our codes are available at: https://github.com/YuanGao-YG/VISION.

[224] DistillKac: Few-Step Image Generation via Damped Wave Equations cs.LG | cs.AI | cs.CV | math.PR | stat.ML

Weiqiao Han, Chenlin Meng, Christopher D. Manning, Stefano Ermon

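TL;DR: DistillKac is a fast image generation method based on the damped wave equation and its Kac representation. It uses finite-speed transport and globally bounded kinetic energy to address problems in diffusion models, and achieves efficient generation through endpoint distillation and classifier-free guidance in velocity space.

Details

Motivation: In diffusion models, reverse-time velocities can become stiff and implicitly allow unbounded propagation speed; Kac dynamics enforce finite-speed transport and globally bounded kinetic energy, which mitigates these problems.

Result: Experiments show that DistillKac produces high-quality samples with very few function evaluations while retaining numerical stability.

Insight: Finite-speed transport and globally bounded kinetic energy help the efficiency and stability of generative models, and endpoint distillation is an effective training strategy.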

Abstract: We present DistillKac, a fast image generator that uses the damped wave equation and its stochastic Kac representation to move probability mass at finite speed. In contrast to diffusion models whose reverse time velocities can become stiff and implicitly allow unbounded propagation speed, Kac dynamics enforce finite speed transport and yield globally bounded kinetic energy. Building on this structure, we introduce classifier-free guidance in velocity space that preserves square integrability under mild conditions. We then propose endpoint only distillation that trains a student to match a frozen teacher over long intervals. We prove a stability result that promotes supervision at the endpoints to closeness along the entire path. Experiments demonstrate DistillKac delivers high quality samples with very few function evaluations while retaining the numerical stability benefits of finite speed probability flows.
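
The finite-speed property can be seen in the classical one-dimensional Kac (telegraph) process, where a particle moves at constant speed $c$ and reverses direction at the jump times of a Poisson process with rate $a$; its density satisfies the damped wave equation $\partial_t^2 u + 2a\,\partial_t u = c^2\,\partial_x^2 u$. The Python sketch below simulates that textbook process only; it is not the DistillKac sampler, and the function name and parameter values are assumptions for illustration.

```python
import random


def simulate_kac_endpoint(c=1.0, rate=2.0, horizon=1.0, rng=None):
    """Position at time `horizon` of a 1-D Kac (telegraph) process from x = 0.

    The particle travels at speed c and flips direction after
    exponentially distributed waiting times with the given rate.
    """
    rng = rng or random.Random()
    t, x = 0.0, 0.0
    velocity = c if rng.random() < 0.5 else -c   # random initial direction
    while True:
        wait = rng.expovariate(rate)             # time until the next flip
        if t + wait >= horizon:
            # No further flip before the horizon: coast to the end.
            return x + velocity * (horizon - t)
        x += velocity * wait
        t += wait
        velocity = -velocity


if __name__ == "__main__":
    rng = random.Random(0)
    endpoints = [simulate_kac_endpoint(c=1.0, rate=2.0, horizon=1.0, rng=rng)
                 for _ in range(5)]
    print(endpoints)
```

Every simulated endpoint lies in $[-cT, cT]$ for horizon $T$, whereas a Brownian path started at the same point has no such bound.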

[225] TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning cs.LG | cs.CV

Hongyang He, Xinyuan Song, Yangfan He, Zeyu Zhang, Yanshu Li

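TL;DR: TRiCo is a triadic game-theoretic co-training framework that improves the robustness of semi-supervised learning through structured interaction among a teacher, two students, and an adversarial generator.

Details

Motivation: Existing semi-supervised learning methods suffer from static view interactions, unreliable pseudo-labels, and a lack of hard sample modeling. TRiCo aims to address these limitations through structured interaction.

Result: Achieves state-of-the-art performance in low-label regimes on CIFAR-10, SVHN, STL-10, and ImageNet.

Insight: Structured interaction and dynamic pseudo-label selection are key to improving the robustness of semi-supervised learning.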

Abstract: We introduce TRiCo, a novel triadic game-theoretic co-training framework that rethinks the structure of semi-supervised learning by incorporating a teacher, two students, and an adversarial generator into a unified training paradigm. Unlike existing co-training or teacher-student approaches, TRiCo formulates SSL as a structured interaction among three roles: (i) two student classifiers trained on frozen, complementary representations, (ii) a meta-learned teacher that adaptively regulates pseudo-label selection and loss balancing via validation-based feedback, and (iii) a non-parametric generator that perturbs embeddings to uncover decision boundary weaknesses. Pseudo-labels are selected based on mutual information rather than confidence, providing a more robust measure of epistemic uncertainty. This triadic interaction is formalized as a Stackelberg game, where the teacher leads strategy optimization and students follow under adversarial perturbations. By addressing key limitations in existing SSL frameworks, such as static view interactions, unreliable pseudo-labels, and lack of hard sample modeling, TRiCo provides a principled and generalizable solution. Extensive experiments on CIFAR-10, SVHN, STL-10, and ImageNet demonstrate that TRiCo consistently achieves state-of-the-art performance in low-label regimes, while remaining architecture-agnostic and compatible with frozen vision backbones.

[226] Adaptive Dual-Mode Distillation with Incentive Schemes for Scalable, Heterogeneous Federated Learning on Non-IID Data cs.LG | cs.CV

Zahid Iqbal

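TL;DR: The paper proposes an adaptive dual-mode distillation framework (DL-SH and DL-MH) and an incentive mechanism (I-DL-MH) to address statistical heterogeneity and model heterogeneity in federated learning, significantly improving model performance and reducing communication costs.

Details

Motivation: Federated learning (FL) faces three challenges: 1) model heterogeneity across clients (different model needs and resources); 2) performance degradation caused by non-independent and identically distributed (non-IID) data; 3) the need for an incentive mechanism to increase client participation.

Result: Across multiple datasets and data distributions (IID and non-IID), the proposed methods significantly outperform existing techniques while reducing communication costs: DL-SH improves global model accuracy by 153%, and I-DL-MH achieves a 225% improvement under non-IID conditions.

Insight: Combining distillation with an incentive mechanism addresses heterogeneity and participation in FL at the same time, offering a feasible path to real-world deployment.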

Abstract: Federated Learning (FL) has emerged as a promising decentralized learning (DL) approach that enables the use of distributed data without compromising user privacy. However, FL poses several key challenges. First, it is frequently assumed that every client can train the same machine learning models; however, not all clients are able to meet this assumption because of differences in their business needs and computational resources. Second, statistical heterogeneity (a.k.a. non-IID data) poses a major challenge in FL, which can lead to lower global model performance. Third, while addressing these challenges, there is a need for a cost-effective incentive mechanism to encourage clients to participate in FL training. In response to these challenges, we propose several methodologies: DL-SH, which facilitates efficient, privacy-preserving, and communication-efficient learning in the context of statistical heterogeneity; DL-MH, designed to manage fully heterogeneous models while tackling statistical disparities; and I-DL-MH, an incentive-based extension of DL-MH that promotes client engagement in federated learning training by providing incentives within this complex federated learning framework. Comprehensive experiments were carried out to assess the performance and scalability of the proposed approaches across a range of complex experimental settings. This involved utilizing various model architectures, in diverse data distributions, including IID and several non-IID scenarios, as well as multiple datasets. Experimental results demonstrate that the proposed approaches significantly enhance accuracy and decrease communication costs while effectively addressing statistical heterogeneity and model heterogeneity in comparison to existing state-of-the-art approaches and baselines, with DL-SH improving global model accuracy by 153%, and I-DL-MH achieving a 225% improvement under non-IID conditions.

[227] Activation Function Design Sustains Plasticity in Continual Learning cs.LG | cs.AI | cs.CV

Lute Lillo, Nick Cheney

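TL;DR: The paper examines the role of activation functions in continual learning and proposes mitigating loss of plasticity by designing activation function properties such as negative-branch shape and saturation behavior. The authors introduce two new nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them on supervised learning and reinforcement learning tasks.

Details

Motivation: In independent, identically distributed (i.i.d.) training, differences between activation functions can be narrowed by tuning model size and optimization, but the role of activation functions in continual learning and their effect on plasticity remain underexplored. The authors set out to study how activation design affects continual learning performance.

Result: Experiments show that the new activation functions significantly slow the loss of plasticity in continual learning tasks, without extra capacity or task-specific tuning.

Insight: Activation function design is a lightweight, domain-general way to sustain a model’s adaptability in continual learning.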

Abstract: In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.

cs.CR [Back]

[228] Guidance Watermarking for Diffusion Models cs.CR | cs.CV

Enoal Gesny, Eva Giboulot, Teddy Furon, Vivien Chappelier

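TL;DR: The paper proposes a novel watermarking method for diffusion models that guides the diffusion process with the gradient computed from any off-the-shelf watermark decoder, improving robustness to attacks the decoder was not originally robust against.

Details

Motivation: Existing watermarking methods for diffusion models typically embed the watermark after generation, which can affect generation quality or fail against specific attacks. The paper aims to embed the watermark through gradient guidance of the diffusion process, improving robustness without retraining.

Result: The method is validated on different diffusion models and detectors; the watermark guidance preserves the quality and diversity of generated images.

Insight: The diffusion process can embed watermark information naturally, and gradient guidance is a flexible mechanism that can be combined with existing watermarking techniques.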

Abstract: This paper introduces a novel watermarking method for diffusion models. It is based on guiding the diffusion process using the gradient computed from any off-the-shelf watermark decoder. The gradient computation encompasses different image augmentations, increasing robustness to attacks against which the decoder was not originally robust, without retraining or fine-tuning. Our method effectively converts any post-hoc watermarking scheme into an in-generation embedding along the diffusion process. We show that this approach is complementary to watermarking techniques modifying the variational autoencoder at the end of the diffusion process. We validate the methods on different diffusion models and detectors. The watermarking guidance does not significantly alter the generated image for a given seed and prompt, preserving both the diversity and quality of generation.

cs.MA [Back]

[229] Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow cs.MA | cs.CV

Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen

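TL;DR: The paper proposes ViF, which uses Visual Flow to mitigate hallucination snowballing in multi-agent systems and improve performance.

Details

Motivation: In multi-agent systems (MAS) powered by visual language models (VLMs), over-reliance on textual flow amplifies hallucinations, an effect called hallucination snowballing.

Result: Across multiple MAS structures and base models, ViF markedly reduces hallucination snowballing and improves performance.

Insight: Vision tokens in the middle layers preserve visual evidence best, but they gradually diminish over deeper agent turns, which is a key cause of hallucination snowballing.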

Abstract: Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure term, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code will be available at: https://github.com/YU-deep/ViF.git.

cs.MM [Back]

[230] Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization cs.MM | cs.CV | 68T07, 68T45 | I.2.6; I.2.7; I.2.10

Songjun Tu, Qichao Zhang, Jingbo Sun, Yuqian Fu, Linjing Li

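TL;DR: The paper proposes CapPO, a new reinforcement learning framework that strengthens the perceptual consistency of multimodal large language models through caption-based consistency regularization and KL-weighted advantage estimation, reducing perception errors during reasoning.

Details

Motivation: Multimodal large language models perform well on tasks that combine visual perception with symbolic reasoning, but perception errors propagate through the reasoning chain. Existing reinforcement learning fine-tuning methods do not address the misalignment between visual grounding and subsequent reasoning, so a new method is needed to improve perceptual consistency.

Result: On five math-focused and five general reasoning benchmarks, CapPO improves accuracy by 6.0% on math tasks and 2.4% on general tasks over the base Qwen2.5-VL-7B model, and significantly reduces perception-related errors.

Insight: By explicitly optimizing perceptual consistency, CapPO offers a simple and effective framework for multimodal reasoning and addresses the gap current methods leave in aligning visual grounding with reasoning.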

Abstract: While multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning, their performance is often undermined by a critical vulnerability: perception-induced errors that propagate through the reasoning chain. Current reinforcement learning (RL) fine-tuning methods, while enhancing reasoning abilities, largely fail to address the underlying misalignment between visual grounding and the subsequent reasoning process. To address this challenge, we propose Caption-Regularized Policy Optimization (CapPO), a novel RL framework that explicitly enforces perceptual consistency during policy optimization. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, thereby anchoring reasoning to semantically faithful visual content; and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories while suppressing spurious correlations. Extensive experiments on five math-focused and five general reasoning benchmarks demonstrate that CapPO achieves competitive performance, yielding gains of +6.0% accuracy on math-related tasks and +2.4% on general reasoning tasks over the base Qwen2.5-VL-7B model. Moreover, ablation studies further confirm the effectiveness of each component, while error analysis reveals that CapPO significantly reduces perception-related mistakes compared with baselines. Overall, CapPO provides a simple yet effective framework for improving multimodal reasoning.
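
One way to read the two mechanisms named in the abstract is sketched below in Python. The abstract does not give formulas, so the details are assumptions: the divergence is taken to be a KL divergence over toy answer distributions, advantages are normalized within a group of sampled responses, and the weight `exp(-beta * kl)` is a hypothetical choice that shrinks the learning signal of responses whose image-conditioned and caption-conditioned predictions disagree.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def group_advantages(rewards):
    """Normalize rewards within a group of sampled responses (assumed scheme)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]


def kl_weighted_advantages(rewards, kls, beta=1.0):
    """Scale each advantage by exp(-beta * KL): a hypothetical weighting rule."""
    return [
        adv * math.exp(-beta * kl)
        for adv, kl in zip(group_advantages(rewards), kls)
    ]


# Toy answer distributions over three options, conditioned on image vs. caption.
p_image = [0.7, 0.2, 0.1]
p_caption_consistent = [0.6, 0.3, 0.1]
p_caption_inconsistent = [0.1, 0.2, 0.7]

kl_consistent = kl_divergence(p_image, p_caption_consistent)
kl_inconsistent = kl_divergence(p_image, p_caption_inconsistent)

# Three sampled responses: two correct (reward 1), one wrong (reward 0).
rewards = [1.0, 1.0, 0.0]
kls = [kl_consistent, kl_inconsistent, kl_consistent]
weighted = kl_weighted_advantages(rewards, kls)
print(kl_consistent, kl_inconsistent, weighted)
```

Under these assumptions, the two correct responses start with the same advantage, but the one whose image-conditioned and caption-conditioned predictions agree keeps most of it, while the inconsistent one is damped. In training, the KL term would also be added to the loss as a regularizer. That part is omitted here.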

Table of Contents

cs.CV

[1] Random Direct Preference Optimization for Radiography Report Generation cs.CV | cs.AI | cs.CL

[2] Improving Autism Detection with Multimodal Behavioral Analysis cs.CV | cs.LG

[3] KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache cs.CV | cs.AI

[4] Phrase-grounded Fact-checking for Automatically Generated Chest X-ray Reports cs.CV | cs.AI

[5] MDF-MLLM: Deep Fusion Through Cross-Modal Feature Alignment for Contextually Aware Fundoscopic Image Classification cs.CV | cs.AI

[6] Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models cs.CV | cs.AI

[7] A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision–Revised cs.CV | cs.AI

[8] MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation cs.CV | cs.AI

[9] Safety Assessment of Scaffolding on Construction Site using AI cs.CV | cs.AI

[10] Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis cs.CV | cs.AI

[11] In silico Deep Learning Protocols for Label-Free Super-Resolution Microscopy: A Comparative Study of Network Architectures and SNR Dependence cs.CV | cs.AI

[12] Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation cs.CV | cs.AI

[13] Assessing the Alignment of Popular CNNs to the Brain for Valence Appraisal cs.CV

[14] Debugging Concept Bottleneck Models through Removal and Retraining cs.CV | cs.LG

[15] ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data cs.CV | cs.RO | eess.IV

[16] TUN3D: Towards Real-World Scene Understanding from Unposed Images cs.CV | eess.IV

[17] Large AI Model-Enabled Generative Semantic Communications for Image Transmission cs.CV | cs.AI | cs.IT | math.IT

[18] mmHSense: Multi-Modal and Distributed mmWave ISAC Datasets for Human Sensing cs.CV | cs.LG

[19] Skeleton Sparsification and Densification Scale-Spaces cs.CV | eess.IV

[20] JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation cs.CV

[21] QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models cs.CV

[22] DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation cs.CV | cs.AI | cs.LG

[23] VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding cs.CV | cs.CL

[24] Residual Vector Quantization For Communication-Efficient Multi-Agent Perception cs.CV | cs.RO

[25] Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Moderation cs.CV

[26] Learning GUI Grounding with Spatial Reasoning from Visual Feedback cs.CV | cs.CL

[27] X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning cs.CV

[28] Unsupervised Defect Detection for Surgical Instruments cs.CV

[29] No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models cs.CV | cs.AI | cs.LG

[30] X-Streamer: Unified Human World Modeling with Audiovisual Interaction cs.CV

[31] What Happens Next? Anticipating Future Motion by Generating Point Trajectories cs.CV | cs.AI | cs.LG

[32] Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis cs.CV | cs.AI

[33] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment cs.CV | cs.LG

[34] A Data-driven Typology of Vision Models from Integrated Representational Metrics cs.CV | cs.AI

[35] FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction cs.CV

[36] MORPH: Shape-agnostic PDE Foundation Models cs.CV | cs.AI | cs.LG | physics.comp-ph

[37] MS-YOLO: Infrared Object Detection for Edge Deployment via MobileNetV4 and SlideLoss cs.CV

[38] Motion-Aware Transformer for Multi-Object Tracking cs.CV

[39] DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining cs.CV

[40] On the Status of Foundation Models for SAR Imagery cs.CV | eess.IV

[41] UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments cs.CV | cs.AI | cs.CL | cs.HC | cs.LG

[42] LFA-Net: A Lightweight Network with LiteFusion Attention for Retinal Vessel Segmentation cs.CV | cs.AI

[43] Incorporating Scene Context and Semantic Labels for Enhanced Group-level Emotion Recognition cs.CV

[44] KG-SAM: Injecting Anatomical Knowledge into Segment Anything Models via Conditional Random Fields cs.CV

[45] UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models cs.CV

[46] CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones cs.CV | cs.LG

[47] Training-Free Multimodal Deepfake Detection via Graph Reasoning cs.CV | cs.CY

[48] Prompt-guided Representation Disentanglement for Action Recognition cs.CV

[49] DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images cs.CV | cs.CL

[50] MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning cs.CV

[51] LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE cs.CV

[52] MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation cs.CV

[53] DiTraj: training-free trajectory control for video diffusion transformer cs.CV | cs.AI

[54] A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG-Enhanced Design cs.CV

[55] Dynamic Novel View Synthesis in High Dynamic Range cs.CV

[56] Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization cs.CV | cs.AI

[57] StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing cs.CV | cs.MM

[58] Drag4D: Align Your Motion with Text-Driven 3D Scene Generation cs.CV

[59] Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers cs.CV

[60] LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation cs.CV

[61] Taming Flow-based I2V Models for Creative Video Editing cs.CV | cs.MM

[62] Multi-View Crowd Counting With Self-Supervised Learning cs.CV

[63] Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding cs.CV

[64] PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning cs.CV

[65] Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach cs.CV

[66] MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning cs.CV

[67] PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data cs.CV

[68] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning cs.CV | cs.AI

[69] Benchmarking and Mitigate Psychological Sycophancy in Medical Vision-Language Models cs.CV | cs.AI

[70] Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm cs.CV

[71] From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs cs.CV | cs.CL

[72] Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation cs.CV

[73] WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM cs.CV | cs.SD

[74] ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LG

[75] DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints cs.CV

[76] Rate-Distortion Optimized Communication for Collaborative Perception cs.CV

[77] Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors cs.CV

[78] CoFFT: Chain of Foresight-Focus Thought for Visual Language Models cs.CV