Table of Contents
- cs.CV [Total: 91]
- cs.CL [Total: 27]
- cs.DC [Total: 1]
- cs.CY [Total: 1]
- cs.LG [Total: 6]
- cs.RO [Total: 4]
- cs.AR [Total: 1]
- cs.SD [Total: 1]
- cs.CR [Total: 2]
- eess.IV [Total: 2]
- cs.IR [Total: 1]
- quant-ph [Total: 1]
- cs.ET [Total: 1]
- cs.AI [Total: 8]
cs.CV
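[1] Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs cs.CV | cs.AI
Hari Lee
TL;DR: This paper proposes TbVAD, a language-based video anomaly detection framework that uses language models to represent video content as text and reason over it interpretably, avoiding the reliance on visual features found in conventional methods.
Details
Motivation: Conventional weakly supervised video anomaly detection (WSVAD) relies on visual features and lacks interpretability. The paper aims for more transparent anomaly detection and explanation through a language-driven approach.
Result: TbVAD is evaluated on two public benchmarks, UCF-Crime and XD-Violence, demonstrating interpretability and reliability in real-world surveillance scenarios.
Insight: Language models can replace visual features to make video anomaly detection more interpretable, which has important application prospects in surveillance and similar domains.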
Abstract: We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a vision-language model, (2) constructing structured knowledge by organizing the captions into four semantic slots (action, object, context, environment), and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.
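[2] Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM cs.CV | math.NA | stat.ME
Joe Kileel, Oscar Mickelin, Amit Singer, Sheng Xu
TL;DR: The paper proposes the method of double moments (MoDM), a new data fusion framework that reconstructs 3D molecular structures from second-order moments obtained under two different orientation distributions, proves uniqueness, and presents a convex-relaxation-based algorithm.
Details
Motivation: Cryo-electron microscopy (cryo-EM) is a powerful imaging technique, but reconstruction relies on extremely noisy projection images. Conventional methods typically use a single dataset; the paper aims to improve reconstruction quality by exploiting information from multiple datasets collected under different experimental conditions.
Result: Experiments show that MoDM achieves accurate 3D reconstruction from second-order moments alone, indicating that fusing multiple datasets can substantially improve reconstruction quality.
Insight: Exploiting the differing statistical properties of multiple datasets (such as different orientation distributions) strengthens the uniqueness and accuracy of reconstruction and offers a new direction for computational imaging tasks.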
Abstract: Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.
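[3] Modulo Video Recovery via Selective Spatiotemporal Vision Transformer cs.CV | cs.AI | eess.IV
Tianyu Geng, Feng Ji, Wee Peng Tay
TL;DR: The paper proposes SSViT, a new Transformer architecture for modulo video recovery. Unlike conventional HDR methods, it recovers the original signal from folded samples and uses selective spatiotemporal attention to improve efficiency and quality.
Details
Motivation: Conventional sensors have limited dynamic range, so signals saturate in high-dynamic-range (HDR) scenes. Modulo cameras address this by folding the signal into a bounded range, but they need specialized unwrapping algorithms. Existing HDR methods are unsuitable for modulo recovery, whereas the ability of Transformers to capture global dependencies offers a way forward.
Result: Experiments show that SSViT produces high-quality reconstructions from 8-bit folded videos and achieves state-of-the-art performance in modulo video recovery.
Insight: Transformer architectures show promise for modulo recovery, but they need new attention mechanisms to improve efficiency and focus on critical regions.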
Abstract: Conventional image sensors have limited dynamic range, causing saturation in high-dynamic-range (HDR) scenes. Modulo cameras address this by folding incident irradiance into a bounded range, yet require specialized unwrapping algorithms to reconstruct the underlying signal. Unlike HDR recovery, which extends dynamic range from conventional sampling, modulo recovery restores actual values from folded samples. Despite being introduced over a decade ago, progress in modulo image recovery has been slow, especially in the use of modern deep learning techniques. In this work, we demonstrate that standard HDR methods are unsuitable for modulo recovery. Transformers, however, can capture global dependencies and spatial-temporal relationships crucial for resolving folded video frames. Still, adapting existing Transformer architectures for modulo recovery demands novel techniques. To this end, we present Selective Spatiotemporal Vision Transformer (SSViT), the first deep learning framework for modulo video reconstruction. SSViT employs a token selection strategy to improve efficiency and concentrate on the most critical regions. Experiments confirm that SSViT produces high-quality reconstructions from 8-bit folded videos and achieves state-of-the-art performance in modulo video recovery.
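To make the folding model concrete, here is a minimal Python sketch. It is not the SSViT method: it folds a 1D signal with a modulo operation and unwraps it with a naive difference-based rule. The function names `fold` and `unwrap_1d`, the sample values, and the assumptions that adjacent true samples differ by less than half the folding period and that the first sample is not folded are all illustrative, not details from the paper.

```python
def fold(irradiance, bit_depth=8):
    """Fold each value into [0, 2**bit_depth), as an idealized modulo sensor would."""
    period = 2 ** bit_depth
    return [value % period for value in irradiance]


def unwrap_1d(folded, bit_depth=8):
    """Naive unwrapping of a 1D folded signal.

    Assumes (hypothetically) that the first sample is unfolded and that
    adjacent true samples differ by less than half the folding period.
    """
    period = 2 ** bit_depth
    half = period // 2
    recovered = [folded[0]]
    for previous, current in zip(folded, folded[1:]):
        # Map the observed difference back into [-half, half).
        step = (current - previous + half) % period - half
        recovered.append(recovered[-1] + step)
    return recovered


# Hypothetical smooth signal that exceeds the 8-bit range.
signal = [100, 200, 300, 400, 350, 260, 150]
folded = fold(signal)
restored = unwrap_1d(folded)
print(folded)
print(restored)
```

The naive rule breaks as soon as neighboring values jump by half the period or more, which hints at why more capable recovery methods are needed for real video.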
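[4] Laplacian Score Sharpening for Mitigating Hallucination in Diffusion Models cs.CV | cs.AI | cs.LG | stat.ML
Barath Chandran. C, Srinivas Anumasa, Dianbo Liu
TL;DR: The paper proposes a post-hoc adjustment based on Laplacian score sharpening to reduce hallucinations in diffusion models, using an efficient Laplacian approximation to achieve significant improvements on high-dimensional data.
Details
Motivation: Diffusion models hallucinate during sampling, producing incoherent or unrealistic samples. Prior work attributes this to mode interpolation and score smoothing but offers no effective remedy.
Result: Experiments show that the method significantly reduces the rate of hallucinated samples on toy distributions and a high-dimensional image dataset.
Insight: The paper analyzes the relationship between the Laplacian of the score and score uncertainty, suggesting that a sharpened score captures the local structure of the data distribution more effectively.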
Abstract: Diffusion models, though successful, are known to suffer from hallucinations that create incoherent or unrealistic samples. Recent works have attributed this to the phenomenon of mode interpolation and score smoothening, but they lack a method to prevent their generation during sampling. In this paper, we propose a post-hoc adjustment to the score function during inference that leverages the Laplacian (or sharpness) of the score to reduce mode interpolation hallucination in unconditional diffusion models across 1D, 2D, and high-dimensional image data. We derive an efficient Laplacian approximation for higher dimensions using a finite-difference variant of the Hutchinson trace estimator. We show that this correction significantly reduces the rate of hallucinated samples across toy 1D/2D distributions and a high-dimensional image dataset. Furthermore, our analysis explores the relationship between the Laplacian and uncertainty in the score.
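The abstract does not spell out the estimator, so the following is only a generic sketch of a finite-difference Hutchinson trace estimator, under stated assumptions. It estimates the trace of the Jacobian of a vector field with Rademacher probe vectors and central differences. The names `hutchinson_divergence` and `gaussian_score`, the probe count, and the step size are hypothetical. Whether the paper applies the estimator to exactly this quantity is also an assumption.

```python
import random


def hutchinson_divergence(score, x, num_probes=8, eps=1e-3, seed=0):
    """Estimate trace(J_score(x)) as the mean of v^T J v over random probes v.

    J v is approximated by the central difference
    (score(x + eps*v) - score(x - eps*v)) / (2*eps), so no Jacobian is formed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        plus = score([xi + eps * vi for xi, vi in zip(x, v)])
        minus = score([xi - eps * vi for xi, vi in zip(x, v)])
        total += sum(vi * (p - m) for vi, p, m in zip(v, plus, minus)) / (2 * eps)
    return total / num_probes


def gaussian_score(x, sigma=2.0):
    """Score of an isotropic Gaussian N(0, sigma^2 I): grad log p(x) = -x / sigma^2."""
    return [-xi / sigma ** 2 for xi in x]


# For this Gaussian the exact trace is -d / sigma^2 = -3 / 4.
estimate = hutchinson_divergence(gaussian_score, [0.5, -1.0, 2.0])
print(estimate)
```

Each probe costs two score evaluations regardless of dimension, which is what makes this family of estimators attractive for high-dimensional data.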
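[5] Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance cs.CV | cs.AI
Kwanyoung Kim
TL;DR: The paper proposes Adversarial Sinkhorn Attention Guidance (ASAG), a method that adjusts attention in diffusion models through an adversarial use of the Sinkhorn algorithm to improve the quality and reliability of generated samples.
Details
Motivation: Current guidance methods for diffusion sampling (such as CFG) rely on heuristic perturbation functions and lack a theoretical basis. ASAG aims to improve the attention mechanism through optimal transport theory and reduce misleading alignments.
Result: ASAG performs well in text-to-image generation, improving both conditional and unconditional sample quality, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet.
Insight: Through theory-driven adversarial optimization, ASAG shows potential for more reliable diffusion sampling; it requires no retraining and is lightweight and plug-and-play.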
Abstract: Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
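For readers unfamiliar with the Sinkhorn algorithm mentioned above, here is a minimal sketch of standard entropy-regularized optimal transport with uniform marginals. It is not ASAG: the adversarial cost and its placement in self-attention are not reproduced. The function name `sinkhorn`, the regularization value, and the toy cost matrix are assumptions for illustration only.

```python
import math


def sinkhorn(cost, reg=1.0, num_iters=200):
    """Return a transport plan for `cost` with uniform row and column marginals.

    Uses the standard Sinkhorn iterations on the Gibbs kernel exp(-cost / reg).
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        # Rescale rows, then columns, so the plan matches each marginal in turn.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 3x3 cost, e.g. a dissimilarity between three queries and three keys.
toy_cost = [[0.0, 1.0, 2.0],
            [1.0, 0.0, 1.0],
            [2.0, 1.0, 0.0]]
plan = sinkhorn(toy_cost)
for row in plan:
    print([round(p, 4) for p in row])
```

In the optimal transport reading of attention, changing the cost matrix changes which query-key pairs receive mass; the sketch only shows how a plan is computed once a cost is fixed.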
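[6] LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration cs.CV
Tung Vu, Hai Nguyen, Cong Tran
TL;DR: LiveNeRF is an efficient real-time (33 FPS) face replacement framework that integrates neural radiance fields (NeRF) to improve visual quality and support practical applications such as live streaming and video conferencing, while emphasizing responsible use.
Details
Motivation: Existing face replacement techniques fall short in real-time performance and visual quality. LiveNeRF aims to address these issues and advance applications in entertainment, education, and accessible communication.
Result: LiveNeRF outperforms existing methods in real-time performance (33 FPS) and visual quality and suits practical settings such as live streaming and video conferencing.
Insight: The high-fidelity rendering potential of NeRF can extend to real-time applications, but ethical considerations are needed to keep the technology from being misused.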
Abstract: Face replacement technology enables significant advancements in entertainment, education, and communication applications, including dubbing, virtual avatars, and cross-cultural content adaptation. Our LiveNeRF framework addresses critical limitations of existing methods by achieving real-time performance (33 FPS) with superior visual quality, enabling practical deployment in live streaming, video conferencing, and interactive media. The technology particularly benefits content creators, educators, and individuals with speech impairments through accessible avatar communication. While acknowledging potential misuse in unauthorized deepfake creation, we advocate for responsible deployment with user consent verification and integration with detection systems to ensure positive societal impact while minimizing risks.
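[7] TrackStudio: An Integrated Toolkit for Markerless Tracking cs.CV | q-bio.QM
Hristo Dimitrov, Giulia Dominijanni, Viktorija Pavalkyte, Tamar R. Makin
TL;DR: TrackStudio is an integrated markerless tracking toolkit that gives non-experts an easy-to-use solution, combining several open-source tools and supporting 2D/3D tracking, calibration, visualisation, and more.
Details
Motivation: Current markerless motion tracking tools demand substantial technical expertise, which makes them hard for non-experts to use. TrackStudio aims to fill this gap with an integrated solution that requires no programming skills.
Result: In tests with 76 participants, inter-frame correlations exceeded 0.98 and triangulation errors were low (<13.6 mm for hand tracking), confirming the toolkit's stability and consistency.
Insight: TrackStudio shows that an integrated toolkit can lower the barrier to markerless tracking, making it suitable for a wide range of research and application settings.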
Abstract: Markerless motion tracking has advanced rapidly in the past 10 years and currently offers powerful opportunities for behavioural, clinical, and biomechanical research. While several specialised toolkits provide high performance for specific tasks, using existing tools still requires substantial technical expertise. There remains a gap in accessible, integrated solutions that deliver sufficient tracking for non-experts across diverse settings. TrackStudio was developed to address this gap by combining established open-source tools into a single, modular, GUI-based pipeline that works out of the box. It provides automatic 2D and 3D tracking, calibration, preprocessing, feature extraction, and visualisation without requiring any programming skills. We supply a user guide with practical advice for video acquisition, synchronisation, and setup, alongside documentation of common pitfalls and how to avoid them. To validate the toolkit, we tested its performance across three environments using either low-cost webcams or high-resolution cameras, including challenging conditions for body position, lighting, and space and obstructions. Across 76 participants, average inter-frame correlations exceeded 0.98 and average triangulation errors remained low (<13.6 mm for hand tracking), demonstrating stable and consistent tracking. We further show that the same pipeline can be extended beyond hand tracking to other body and face regions. TrackStudio provides a practical, accessible route into markerless tracking for researchers or laypeople who need reliable performance without specialist expertise.
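[8] FlowFeat: Pixel-Dense Embedding of Motion Profiles cs.CV
Nikita Araslanov, Anna Sonnweber, Daniel Cremers
TL;DR: FlowFeat is a high-resolution, multi-task feature representation that embeds motion distribution information through a novel distillation technique and significantly improves existing encoders on dense prediction tasks.
Details
Motivation: The low-resolution feature grids produced by existing networks (such as Transformers) are suboptimal for dense prediction tasks, so a high-resolution feature representation is needed.
Result: FlowFeat significantly improves five encoders on three dense tasks: video object segmentation, monocular depth estimation, and semantic segmentation.
Insight: FlowFeat shows the importance of motion information for high-resolution feature representations; it is inexpensive to train and robust to inaccurate optical flow estimates.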
Abstract: Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step forward towards reliable and versatile dense image representations.
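[9] Cross Modal Fine-grained Alignment via Granularity-aware and Region-uncertain Modeling cs.CV | cs.MM
Jiale Liu, Haoming Zhou, Yishu Zhu, Bingzhi Chen, Yuncheng Jiang
TL;DR: The paper proposes a cross-modal fine-grained alignment method based on granularity-aware and region-uncertainty modeling, significantly improving the performance and robustness of image-text alignment.
Details
Motivation: Existing cross-modal alignment methods fail to model the significance of textual and visual tokens effectively and ignore the uncertainty of region-word correspondences, which hurts performance in complex scenes.
Result: The method achieves state-of-the-art (SOTA) performance on Flickr30K and MS-COCO and improves robustness and interpretability.
Insight: Significance awareness and uncertainty modeling are key to better fine-grained alignment, and Gaussian mixture distributions capture complex correspondences effectively.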
Abstract: Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
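[10] UltraGS: Gaussian Splatting for Ultrasound Novel View Synthesis cs.CV | cs.AI
Yuezhe Yang, Wenjie Cai, Dexin Yang, Yufang Dong, Xingbo Dong
TL;DR: UltraGS is a Gaussian Splatting framework optimized for ultrasound imaging. With a depth-aware Gaussian splatting strategy and the lightweight rendering function SH-DARS, it significantly improves ultrasound novel view synthesis, and its advantage is validated on real clinical datasets.
Details
Motivation: Ultrasound imaging is important in clinical diagnosis, but its limited field of view makes novel view synthesis challenging. A method tailored to ultrasound imaging is needed to improve synthesis accuracy and efficiency.
Result: UltraGS achieves SOTA results in PSNR (up to 29.55), SSIM (up to 0.89), and MSE (as low as 0.002), with real-time synthesis at 64.69 fps.
Insight: Combining physical properties with a learnable Gaussian splatting strategy is an effective way to improve ultrasound novel view synthesis and provides an efficient tool for clinical use.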
Abstract: Ultrasound imaging is a cornerstone of non-invasive clinical diagnostics, yet its limited field of view complicates novel view synthesis. We propose UltraGS, a Gaussian Splatting framework optimized for ultrasound imaging. First, we introduce a depth-aware Gaussian splatting strategy, where each Gaussian is assigned a learnable field of view, enabling accurate depth prediction and precise structural representation. Second, we design SH-DARS, a lightweight rendering function combining low-order spherical harmonics with ultrasound-specific wave physics, including depth attenuation, reflection, and scattering, to model tissue intensity accurately. Third, we contribute the Clinical Ultrasound Examination Dataset, a benchmark capturing diverse anatomical scans under real-world clinical protocols. Extensive experiments on three datasets demonstrate UltraGS’s superiority, achieving state-of-the-art results in PSNR (up to 29.55), SSIM (up to 0.89), and MSE (as low as 0.002) while enabling real-time synthesis at 64.69 fps. The code and dataset are open-sourced at: https://github.com/Bean-Young/UltraGS.
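[11] VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics cs.CV
Daniel Cher, Brian Wei, Srikumar Sastry, Nathan Jacobs
TL;DR: VectorSynth is a diffusion-based satellite image synthesis framework that generates pixel-accurate images from polygonal geographic annotations with semantic attributes, supporting fine-grained spatial edits and interactive workflows.
Details
Motivation: Conventional text- or layout-conditioned models cannot precisely align imagery with semantic vector geometry, which limits fine-grained editing of satellite images. VectorSynth aims to solve this problem.
Result: VectorSynth significantly outperforms existing methods in semantic fidelity and structural realism and demonstrates fine-grained spatial grounding.
Insight: Pixel-level alignment through structured semantics is key to better satellite image synthesis, and interactive workflows add flexibility for practical use.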
Abstract: We introduce VectorSynth, a diffusion-based framework for pixel-accurate satellite image synthesis conditioned on polygonal geographic annotations with semantic attributes. Unlike prior text- or layout-conditioned models, VectorSynth learns dense cross-modal correspondences that align imagery and semantic vector geometry, enabling fine-grained, spatially grounded edits. A vision language alignment module produces pixel-level embeddings from polygon semantics; these embeddings guide a conditional image generation framework to respect both spatial extents and semantic cues. VectorSynth supports interactive workflows that mix language prompts with geometry-aware conditioning, allowing rapid what-if simulations, spatial edits, and map-informed content generation. For training and evaluation, we assemble a collection of satellite scenes paired with pixel-registered polygon annotations spanning diverse urban scenes with both built and natural features. We observe strong improvements over prior methods in semantic fidelity and structural realism, and show that our trained vision language model demonstrates fine-grained spatial grounding. The code and data are available at https://github.com/mvrl/VectorSynth.
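[12] Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs cs.CV | cs.AI
Yuezhe Yang, Yiyue Guo, Wenjie Cai, Qingqing Ruan, Siying Wang
TL;DR: The paper proposes Auto-US, an intelligent diagnosis agent that combines ultrasound video data with clinical diagnostic text, significantly improving ultrasound video classification accuracy (86.73%) and generating clinically meaningful diagnostic suggestions.
Details
Motivation: Current AI-assisted ultrasound video diagnosis is limited in dataset diversity, diagnostic performance, and clinical applicability, and more effective solutions are needed.
Result: CTU-Net reaches a classification accuracy of 86.73%, and the diagnostic suggestions score above 3 out of 5, as validated by clinicians.
Insight: Multimodal fusion (video + text) can improve the practicality and interpretability of AI in medical diagnosis.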
Abstract: AI-assisted ultrasound video diagnosis presents new opportunities to enhance the efficiency and accuracy of medical imaging analysis. However, existing research remains limited in terms of dataset diversity, diagnostic performance, and clinical applicability. In this study, we propose Auto-US, an intelligent diagnosis agent that integrates ultrasound video data with clinical diagnostic text. To support this, we constructed the CUV Dataset of 495 ultrasound videos spanning five categories and three organs, aggregated from multiple open-access sources. We developed CTU-Net, which achieves state-of-the-art performance in ultrasound video classification, reaching an accuracy of 86.73%. Furthermore, by incorporating large language models, Auto-US is capable of generating clinically meaningful diagnostic suggestions. The final diagnostic scores for each case exceeded 3 out of 5 and were validated by professional clinicians. These results demonstrate the effectiveness and clinical potential of Auto-US in real-world ultrasound applications. Code and data are available at: https://github.com/Bean-Young/Auto-US.
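[13] Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation cs.CV
Shengqian Zhu, Chengrong Yu, Qiang Wang, Ying Song, Guangjun Li
TL;DR: This paper proposes PGCD and DAPD, which use prototype-guided calibration and dual-aligned distillation to address knowledge forgetting in class incremental medical image segmentation, significantly improving performance.
Details
Motivation: Existing methods for class incremental medical image segmentation do not distinguish the importance of different spatial regions and feature channels, and they overlook aligning prototypes across old and new data, which leads to knowledge degradation.
Result: The method outperforms existing methods on two multi-organ segmentation benchmarks, confirming its robustness and generalization ability.
Insight: Prototype guidance and dual alignment effectively mitigate knowledge forgetting and work particularly well for medical image segmentation.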
Abstract: Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class labels. However, existing methods 1) either adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, 2) or focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate the above issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS in this paper. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading information from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms state-of-the-art methods, highlighting its robustness and generalization capabilities.
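[14] Filtered-ViT: A Robust Defense Against Multiple Adversarial Patch Attacks cs.CV | cs.AI
Aja Khanal, Ahmed Faid, Apurva Narayan
TL;DR: Filtered-ViT is a new vision Transformer architecture that integrates SMART-VMF for robustness, effectively defending against multiple adversarial patch attacks and natural artifacts.
Details
Motivation: Deep learning vision systems are increasingly deployed in safety-critical domains such as healthcare, but their vulnerability to small adversarial patches limits reliability. Most existing defenses assume a single patch and cannot handle multiple localized disruptions.
Result: On ImageNet, under four simultaneous 1% patches, Filtered-ViT achieves 79.8% clean accuracy and 46.3% robust accuracy, outperforming existing defenses. In a real-world medical imaging case study it effectively mitigates natural artifacts.
Insight: Filtered-ViT shows the potential of Transformers for handling multiple disruptions and points to a new direction for highly reliable vision systems.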
Abstract: Deep learning vision systems are increasingly deployed in safety-critical domains such as healthcare, yet they remain vulnerable to small adversarial patches that can trigger misclassifications. Most existing defenses assume a single patch and fail when multiple localized disruptions occur, the type of scenario adversaries and real-world artifacts often exploit. We propose Filtered-ViT, a new vision transformer architecture that integrates SMART Vector Median Filtering (SMART-VMF), a spatially adaptive, multi-scale, robustness-aware mechanism that enables selective suppression of corrupted regions while preserving semantic detail. On ImageNet with LaVAN multi-patch attacks, Filtered-ViT achieves 79.8% clean accuracy and 46.3% robust accuracy under four simultaneous 1% patches, outperforming existing defenses. Beyond synthetic benchmarks, a real-world case study on radiographic medical imagery shows that Filtered-ViT mitigates natural artifacts such as occlusions and scanner noise without degrading diagnostic content. This establishes Filtered-ViT as the first transformer to demonstrate unified robustness against both adversarial and naturally occurring patch-like disruptions, charting a path toward reliable vision systems in truly high-stakes environments.
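SMART-VMF is not specified in the abstract, so the sketch below shows only the classical vector median that the name builds on, under that assumption. It selects the member of a window of feature vectors with the smallest summed Euclidean distance to all other members, which makes an isolated outlier unlikely to be chosen. The function name `vector_median` and the toy RGB window are hypothetical. The spatially adaptive, multi-scale parts of the paper's mechanism are not reproduced.

```python
import math


def vector_median(vectors):
    """Return the vector in `vectors` minimizing the summed distance to all others.

    Unlike a per-channel median, the output is always one of the inputs,
    so no new (possibly implausible) vector is invented.
    """
    def total_distance(candidate):
        return sum(math.dist(candidate, other) for other in vectors)

    return min(vectors, key=total_distance)


# Hypothetical 5-pixel RGB window with one corrupted, patch-like pixel.
window = [(10, 10, 10), (12, 11, 9), (11, 10, 10), (255, 0, 255), (9, 10, 11)]
filtered = vector_median(window)
print(filtered)
```

Because the corrupted pixel lies far from every other member of the window, its summed distance is the largest and it is never selected here.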
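[15] Beyond Randomness: Understand the Order of the Noise in Diffusion cs.CV
Song Yan, Min Li, Bi Xinliang, Jian Yang, Yusen Zhang
TL;DR: This paper reveals analyzable patterns hidden behind the initial noise in diffusion models and proposes an information-theoretic two-step "Semantic Erasure-Injection" process that improves the consistency of generated content.
Details
Motivation: The initial noise in diffusion models is conventionally regarded as random, but the paper finds that it carries rich semantic information, offering a new perspective for optimizing generation.
Result: Experiments show the method is effective for T2C models built on both DiT and UNet architectures and improves generation consistency.
Insight: Noise is not only a source of diversity but can also serve as a tool for semantic control, offering a new approach to optimizing diffusion models.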
Abstract: In text-driven content generation (T2C) diffusion models, the semantics of generated content are mostly attributed to the interaction between text embeddings and the attention mechanism. The initial noise of the generation process is typically characterized as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that beneath the random surface of noise lie strong analyzable patterns. Specifically, this paper first conducts a comprehensive analysis of the impact of random noise on the model’s generation. We found that noise not only contains rich semantic information but also allows unwanted semantics to be erased from it in an extremely simple way based on information theory, and that the equivalence between the generation process of diffusion models and semantic injection can be used to inject semantics into the cleaned noise. Then, we mathematically decipher these observations and propose a simple but efficient training-free and universal two-step “Semantic Erasure-Injection” process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures and presents a novel perspective for optimizing the generation of diffusion models, providing a universal tool for consistent generation.
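[16] Revisiting MLLM Based Image Quality Assessment: Errors and Remedy cs.CV
Zhenchen Tang, Songlin Yang, Bo Peng, Zichuan Wang, Jing Dong
TL;DR: The paper analyzes the performance bottlenecks of multi-modal large language models (MLLMs) in image quality assessment (IQA) and proposes Q-Scorer, a simple and effective framework that resolves the mismatch between discrete tokens and continuous quality scores, significantly improving IQA performance.
Details
Motivation: MLLMs underperform on IQA mainly because of the mismatch between their discrete token outputs and continuous quality scores, along with semantic confusion. The paper analyzes the roots of these problems in depth.
Result: Q-Scorer achieves state-of-the-art performance on multiple IQA benchmarks and generalizes well to mixed datasets.
Insight: Simple, direct changes (such as a regression module and score tokens) can significantly improve MLLMs on IQA without harming their original multimodal capabilities.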
Abstract: The rapid progress of multi-modal large language models (MLLMs) has boosted the task of image quality assessment (IQA). However, a key challenge arises from the inherent mismatch between the discrete token outputs of MLLMs and the continuous nature of quality scores required by IQA tasks. This discrepancy significantly hinders the performance of MLLM-based IQA methods. Previous approaches that convert discrete token predictions into continuous scores often suffer from conversion errors. Moreover, the semantic confusion introduced by level tokens (e.g., "good") further constrains the performance of MLLMs on IQA tasks and degrades their original capabilities for related tasks. To tackle these problems, we provide a theoretical analysis of the errors inherent in previous approaches and, motivated by this analysis, propose a simple yet effective framework, Q-Scorer. This framework incorporates a lightweight regression module and IQA-specific score tokens into the MLLM pipeline. Extensive experiments demonstrate that Q-Scorer achieves state-of-the-art performance across multiple IQA benchmarks, generalizes well to mixed datasets, and further improves when combined with other methods.
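[17] Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views cs.CV | cs.AI
Haida Feng, Hao Wei, Zewen Xu, Haolin Wang, Chade Li
TL;DR: Sparse3DPR is a training-free framework for open-ended 3D scene understanding that leverages the reasoning capabilities of pre-trained large language models and needs only sparse RGB inputs. It improves reasoning efficiency and accuracy through a hierarchical plane-enhanced scene graph and task-adaptive subgraph extraction.
Details
Motivation: Current training-free 3D scene understanding methods struggle with accuracy and efficiency in deployment. The authors propose Sparse3DPR to address these problems.
Result: On Space3D-Bench, Sparse3DPR achieves a 28.7% EM@1 improvement and a 78.2% speedup over ConceptGraphs; on ScanQA it performs comparably to training-based methods.
Insight: Using planar structures as spatial anchors significantly improves the clarity and reliability of 3D scene reasoning, and task-adaptive subgraph extraction improves efficiency.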
Abstract: Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address the problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
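[18] Cancer-Net PCa-MultiSeg: Multimodal Enhancement of Prostate Cancer Lesion Segmentation Using Synthetic Correlated Diffusion Imaging cs.CV
Jarett Dewbury, Chi-en Amy Tai, Alexander Wong
TL;DR: The paper studies the effectiveness of synthetic correlated diffusion imaging (CDI$^s$) for prostate cancer lesion segmentation; combining CDI$^s$ with standard diffusion imaging protocols significantly improves segmentation performance.
Details
Motivation: Current deep learning methods for prostate cancer lesion segmentation perform poorly (Dice score ≤ 0.32). The authors aim to enhance existing diffusion imaging protocols with CDI$^s$ to improve segmentation.
Result: In 94% of evaluated configurations, CDI$^s$ improves or preserves segmentation performance; the CDI$^s$ + DWI combination yields significant improvements in half of the architectures with no instances of degradation.
Insight: CDI$^s$ integrates seamlessly into clinical workflows and offers a stable route to better prostate cancer lesion segmentation.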
Abstract: Current deep learning approaches for prostate cancer lesion segmentation achieve limited performance, with Dice scores of 0.32 or lower in large patient cohorts. To address this limitation, we investigate synthetic correlated diffusion imaging (CDI$^s$) as an enhancement to standard diffusion-based protocols. We conduct a comprehensive evaluation across six state-of-the-art segmentation architectures using 200 patients with co-registered CDI$^s$, diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) sequences. We demonstrate that CDI$^s$ integration reliably enhances or preserves segmentation performance in 94% of evaluated configurations, with individual architectures achieving up to 72.5% statistically significant relative improvement over baseline modalities. CDI$^s$ + DWI emerges as the safest enhancement pathway, achieving significant improvements in half of evaluated architectures with zero instances of degradation. Since CDI$^s$ derives from existing DWI acquisitions without requiring additional scan time or architectural modifications, it enables immediate deployment in clinical workflows. Our results establish validated integration pathways for CDI$^s$ as a practical drop-in enhancement for PCa lesion segmentation tasks across diverse deep learning architectures.
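Since the results above are reported as Dice scores and relative improvements, a short sketch of both metrics may help. It assumes flat binary masks and the common convention that two empty masks score 1.0. The function names and the numbers in the example are hypothetical and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks (0 or 1)."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_improvement(baseline, improved):
    """Relative change of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Hypothetical 4-pixel masks: one overlapping lesion pixel out of two each.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
print(dice_score(predicted_mask, reference_mask))

# Hypothetical Dice values for a baseline and an enhanced input combination.
print(relative_improvement(0.2, 0.3))
```

A relative improvement is sensitive to a low baseline: the same absolute gain in Dice looks much larger when the starting score is small.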
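[19] Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy cs.CV
Gong Jingyu, Tong Kunkun, Chen Zhuoran, Yuan Chuanhan, Chen Mingang
TL;DR: The paper proposes SSOMotion, a human motion synthesis framework that represents 3D scenes with a unified Scene Semantic Occupancy (SSO) and combines bi-directional tri-plane decomposition with CLIP encoding for more efficient semantic understanding and motion control.
Details
Motivation: Existing methods focus mainly on scene structure and neglect semantic understanding, which limits human motion synthesis in 3D scenes. The authors therefore propose a unified representation that incorporates scene semantics.
Result: Experiments on cluttered scenes built from ShapeNet furniture and on scanned scenes from PROX and Replica show that SSOMotion achieves leading performance and generalizes well.
Insight: Scene representations that incorporate semantic understanding can significantly improve human motion synthesis, and bi-directional tri-plane decomposition with CLIP encoding is an efficient feature extraction strategy.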
Abstract: Human motion synthesis in 3D scenes relies heavily on scene comprehension, while current methods focus mainly on scene structure but ignore semantic understanding. In this paper, we propose a human motion synthesis framework that takes a unified Scene Semantic Occupancy (SSO) for scene representation, termed SSOMotion. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and shared linear dimensionality reduction. Such a strategy can derive fine-grained scene semantic structures while significantly reducing redundant computations. We further take these scene hints and the movement direction derived from instructions for motion control via frame-wise scene query. Extensive experiments and ablation studies conducted on cluttered scenes using ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate its cutting-edge performance while validating its effectiveness and generalization ability. Code will be publicly available at https://github.com/jingyugong/SSOMotion.
[20] CloudMamba: Grouped Selective State Spaces for Point Cloud Analysis cs.CVPDF
Kanglin Qu, Pan Gao, Qun Dai, Zhanzhi Ye, Rui Ye
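TL;DR: CloudMamba uses sequence expanding and merging, a parallel bidirectional Mamba, and a grouped selective state space model (GS6) to address imperfect serialization, insufficient high-level geometric perception, and S6 overfitting in point cloud analysis, achieving high performance at low complexity.
Details
Motivation: Existing Mamba-based point cloud methods suffer from imperfect point cloud serialization, insufficient high-level geometric perception, and overfitting of the core selective state space model (S6), all of which limit performance.
Result: CloudMamba achieves state-of-the-art performance across multiple point cloud analysis tasks with significantly lower complexity.
Insight: Grouped parameter sharing and bidirectional processing help the model adapt to the unordered nature and geometric features of point clouds while reducing the computational burden.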
Abstract: Due to the long-range modeling ability and linear complexity property, Mamba has attracted considerable attention in point cloud analysis. Despite some interesting progress, related work still suffers from imperfect point cloud serialization, insufficient high-level geometric perception, and overfitting of the selective state space model (S6) at the core of Mamba. To this end, we resort to an SSM-based point cloud network termed CloudMamba to address the above challenges. Specifically, we propose sequence expanding and sequence merging, where the former serializes points along each axis separately and the latter serves to fuse the corresponding higher-order features causally inferred from different sequences, enabling unordered point sets to adapt more stably to the causal nature of Mamba without parameters. Meanwhile, we design chainedMamba that chains the forward and backward processes in the parallel bidirectional Mamba, capturing high-level geometric information during scanning. In addition, we propose a grouped selective state space model (GS6) via parameter sharing on S6, alleviating the overfitting problem caused by the computational mode in S6. Experiments on various point cloud tasks validate CloudMamba’s ability to achieve state-of-the-art results with significantly less complexity.
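The idea of serializing points along each axis separately can be pictured with a small sketch. This is an illustration under a strong assumption, namely that serialization amounts to sorting point indices by one coordinate at a time; the abstract does not specify the actual ordering rule, and `serialize_along_axes` is a hypothetical name.

```python
def serialize_along_axes(points):
    """Return one index ordering per coordinate axis (x, y, z).

    Assumption for illustration: each sequence is obtained by sorting the
    point indices by a single coordinate, which needs no learned parameters.
    """
    orders = []
    for axis in range(3):
        order = sorted(range(len(points)), key=lambda i: points[i][axis])
        orders.append(order)
    return orders


# Three toy points; each axis yields its own 1D sequence for a causal model.
toy_points = [(0.5, 2.0, 1.0), (0.1, 3.0, 0.0), (0.9, 1.0, 2.0)]
x_order, y_order, z_order = serialize_along_axes(toy_points)
```

Each ordering presents the same unordered set as a different sequence; how features from the sequences are merged afterwards is described only at a high level in the abstract and is not sketched here.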
[21] Visual Bridge: Universal Visual Perception Representations Generating cs.CVPDF
Yilin Gao, Shuguang Dou, Junzhou Li, Zhiheng Yu, Yin Li
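TL;DR: The paper proposes a universal visual perception framework based on flow matching that casts visual representation generation across multiple tasks as a single flow-matching problem, avoiding the limits of the traditional "single-task-single-model" paradigm.
Details
Motivation: Current diffusion models are constrained by the "single-task-single-model" paradigm in multi-task settings and lack generality and scalability. Inspired by the cross-domain generalization ability of large language models, the authors propose a universal visual perception framework.
Result: On classification, detection, segmentation, depth estimation, and image-text retrieval, the model performs strongly in both zero-shot and fine-tuned settings.
Insight: A unified flow-matching framework with a multi-task embedding mechanism can bridge the gap between heterogeneous tasks, laying groundwork for universal vision modeling.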
Abstract: Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
[22] Generating Sketches in a Hierarchical Auto-Regressive Process for Flexible Sketch Drawing Manipulation at Stroke-Level cs.CV | cs.AIPDF
Sicong Zang, Shuhui Gao, Zhijun Fang
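TL;DR: The paper proposes a hierarchical auto-regressive sketch generation method that supports flexible stroke-level manipulation of the drawing during generation.
Details
Motivation: Existing sketch generation methods require all stroke conditions to be fixed before generation starts and cannot be adjusted dynamically during generation. The goal is more flexible stroke-level control.
Result: The method enables dynamic stroke-level control of sketch generation, improving flexibility.
Insight: The hierarchical, auto-regressive design lets the model progressively refine the global sketch structure during generation while supporting flexible stroke adjustments.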
Abstract: Generating sketches with specific, expected patterns, i.e., manipulating sketches in a controllable way, is a popular task. Recent studies control sketch features at the stroke level by editing the values of stroke embeddings used as conditions. However, to give the generator a global view of the sketch to be drawn, all edited conditions must be collected and fed into the generator simultaneously before generation starts, i.e., no further manipulation is allowed during the sketch generation process. To enable more flexible manipulation of sketch drawing, we propose a hierarchical auto-regressive sketch generation process. Instead of generating an entire sketch at once, each stroke in a sketch is generated in a three-stage hierarchy: 1) predicting a stroke embedding that represents which stroke is going to be drawn, 2) anchoring the predicted stroke on the canvas, and 3) translating the embedding into a sequence of drawing actions to form the full sketch. Moreover, stroke prediction, anchoring, and translation proceed auto-regressively, i.e., both the recently generated strokes and their positions are considered when predicting the current one, guiding the model to produce an appropriate stroke at a suitable position to benefit the full sketch generation. Stroke-level sketch drawing can be manipulated flexibly at any time during generation by adjusting the exposed editable stroke embeddings.
[23] Theoretical Analysis of Power-law Transformation on Images for Text Polarity Detection cs.CVPDF
Narendra Singh Yadav, Pavan Kumar Perepu
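TL;DR: The paper gives a theoretical analysis for text polarity detection in images, examining how the power-law transformation affects the maximum between-class variance between the text and background classes.
Details
Motivation: Text polarity detection is important in many computer vision applications, yet the existing power-law transformation approach only demonstrated its effect on between-class variance through empirical observation, without theoretical support. This study aims to fill that gap.
Result: The theoretical analysis confirms that the power-law transformation increases the between-class variance for dark text and decreases it for bright text, supporting what had previously been only an empirical observation.
Insight: The role of the power-law transformation in text polarity detection can be verified by theoretical derivation, which gives later preprocessing tasks a more reliable basis.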
Abstract: In several computer vision applications, such as vehicle license plate recognition, CAPTCHA recognition, and printed or handwritten character recognition from images, text polarity detection and binarization are important preprocessing tasks. To analyze any image, it has to be converted to a simple binary image. This binarization process requires knowledge of the polarity of the text in the image. Text polarity is defined as the contrast of text with respect to the background. That is, the text is either darker than the background (dark text on bright background) or vice versa. The binarization process uses this polarity information to convert the original colour or grayscale image into a binary image. In the literature, there is an intuitive approach based on a power-law transformation of the original images. In this approach, the authors illustrated an interesting phenomenon from the histogram statistics of the transformed images. Considering text and background as two classes, they observed that the maximum between-class variance between the two classes increases (decreases) for dark (bright) text on a bright (dark) background. The corresponding empirical results have been presented. In this paper, we present a theoretical analysis of the above phenomenon.
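The two quantities in this abstract can be computed directly. The sketch below assumes the usual gamma form of the power-law transformation, s = r^gamma on intensities normalized to [0, 1], and the Otsu-style between-class variance maximized over all thresholds; the function names and the toy image are assumptions, and no trend from the paper is reproduced here.

```python
def power_law(pixels, gamma):
    """Apply s = r**gamma to 8-bit intensities normalized to [0, 1]."""
    return [min(255, round(255 * (p / 255) ** gamma)) for p in pixels]


def max_between_class_variance(pixels, levels=256):
    """Largest between-class variance w0 * w1 * (mu0 - mu1)**2 over all thresholds."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    n = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best = 0.0
    count0 = 0  # number of pixels at or below the threshold
    sum0 = 0    # intensity sum of those pixels
    for t in range(levels):
        count0 += hist[t]
        sum0 += t * hist[t]
        count1 = n - count0
        if count0 == 0 or count1 == 0:
            continue  # one class is empty, so the variance is undefined
        mu0 = sum0 / count0
        mu1 = (total_sum - sum0) / count1
        variance = (count0 / n) * (count1 / n) * (mu0 - mu1) ** 2
        best = max(best, variance)
    return best


# Toy "image": 20% dark text pixels (50) on a bright background (200).
toy_image = [50] * 20 + [200] * 80
variance_before = max_between_class_variance(toy_image)
variance_after = max_between_class_variance(power_law(toy_image, 2.0))
```

Comparing `variance_before` with `variance_after` for several values of `gamma` is one way to observe the phenomenon empirically on a given image.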
[24] Exploring the Underwater World Segmentation without Extra Training cs.CV | cs.AIPDF
Bingyu Li, Tao Huo, Da Zhang, Zhiyuan Zhao, Junyu Gao
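TL;DR: The paper introduces AquaOV255, the first large-scale underwater open-vocabulary segmentation dataset, and the UOVSBench benchmark, and proposes Earth2Ocean, a training-free framework that segments underwater scenes accurately through geometric guidance and semantic alignment.
Details
Motivation: Segmenting marine organisms is vital for ecological monitoring, but existing datasets and models concentrate on terrestrial scenes, and the underwater domain lacks high-quality data and efficient methods.
Result: Experiments on UOVSBench show that Earth2Ocean significantly improves underwater segmentation performance while keeping inference efficient.
Insight: Combining geometric structure with semantic alignment allows terrestrial pretrained models to be transferred to underwater scenes without additional training.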
Abstract: Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce AquaOV255, the first large-scale and fine-grained underwater segmentation dataset containing 255 categories and over 20K images, covering diverse categories for open-vocabulary (OV) evaluation. Furthermore, we establish the first underwater OV segmentation benchmark, UOVSBench, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive evaluation. Alongside, we present Earth2Ocean, a training-free OV segmentation framework that transfers terrestrial vision–language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (GMG) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment (CSA) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves significant performance improvement on average while maintaining efficient inference.
[25] An Image-Based Path Planning Algorithm Using a UAV Equipped with Stereo Vision cs.CV | cs.ROPDF
Selim Ahmet Iz, Mustafa Unel
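TL;DR: The paper proposes an image-based path planning algorithm that uses a stereo-vision-equipped UAV to generate a depth map and applies computer vision methods (such as edge detection and ArUco markers) to define trajectory points; it outperforms the classical A* and PRM algorithms.
Details
Motivation: A conventional two-dimensional image cannot distinguish terrain depth (such as craters and hills), which affects the safety of the planned path. Stereo vision is therefore needed to address this.
Result: In simulation and physical tests, the algorithm outperforms A* and PRM, demonstrating its efficiency and safety.
Insight: Combining stereo vision with classical path planning algorithms can markedly improve path planning over complex terrain.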
Abstract: This paper presents a novel image-based path planning algorithm that was developed using computer vision techniques, as well as its comparative analysis with well-known deterministic and probabilistic algorithms, namely A* and Probabilistic Road Map algorithm (PRM). The terrain depth has a significant impact on the calculated path safety. The craters and hills on the surface cannot be distinguished in a two-dimensional image. The proposed method uses a disparity map of the terrain that is generated by using a UAV. Several computer vision techniques, including edge, line and corner detection methods, as well as the stereo depth reconstruction technique, are applied to the captured images and the found disparity map is used to define candidate way-points of the trajectory. The initial and desired points are detected automatically using ArUco marker pose estimation and circle detection techniques. After presenting the mathematical model and vision techniques, the developed algorithm is compared with well-known algorithms on different virtual scenes created in the V-REP simulation program and a physical setup created in a laboratory environment. Results are promising and demonstrate effectiveness of the proposed algorithm.
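A disparity map encodes depth through the standard rectified-stereo relation Z = f * B / d. The sketch below assumes an ideal rectified pinhole stereo pair; the focal length, baseline, and disparity values are invented for illustration and are not taken from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres from disparity in pixels for a rectified stereo pair."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 700 px focal length, 10 cm baseline.
near_depth = depth_from_disparity(35.0, focal_px=700.0, baseline_m=0.1)
far_depth = depth_from_disparity(7.0, focal_px=700.0, baseline_m=0.1)
```

Larger disparity means a closer surface, which is what lets hills and craters be told apart even though they look similar in a single 2D image.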
[26] Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification cs.CVPDF
Yihang Wu, Ahmad Chaddad
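TL;DR: The paper proposes FedMedCLIP, a CLIP-based federated learning method for medical image classification that addresses data heterogeneity and resource costs; with a masked feature adaptation module and KL-divergence regularization, it significantly improves performance and lowers computational cost.
Details
Motivation: In medical image classification, conventional deep learning requires centralized training data, but privacy concerns and data heterogeneity limit its use. Federated learning (FL) offers a distributed solution, yet resource costs and data heterogeneity remain unresolved.
Result: Validated on four public medical datasets, the method outperforms baselines (for example, an 8% gain on ISIC2019) and is 120 times faster than FedAVG.
Insight: 1. The masked module design effectively reduces the computation and communication overhead of FL; 2. KL regularization improves model robustness; 3. Combining model compression with ensemble prediction is an efficient strategy.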
Abstract: Despite the remarkable performance of deep models in medical imaging, they still require source data for training, which limits their potential in light of privacy concerns. Federated learning (FL), as a decentralized learning framework that trains a shared model with multiple hospitals (a.k.a., FL clients), provides a feasible solution. However, data heterogeneity and resource costs hinder the deployment of FL models, especially when using vision language models (VLM). To address these challenges, we propose a novel contrastive language-image pre-training (CLIP) based FL approach for medical image classification (FedMedCLIP). Specifically, we introduce a masked feature adaptation module (FAM) as a communication module to reduce the communication load while freezing the CLIP encoders to reduce the computational overhead. Furthermore, we propose a masked multi-layer perceptron (MLP) as a private local classifier to adapt to the client tasks. Moreover, we design an adaptive Kullback-Leibler (KL) divergence-based distillation regularization method to enable mutual learning between FAM and MLP. Finally, we incorporate model compression to transmit the FAM parameters while using ensemble predictions for classification. Extensive experiments on four publicly available medical datasets demonstrate that our model provides feasible performance (e.g., 8% higher compared to second best baseline on ISIC2019) with reasonable resource cost (e.g., 120$\times$ faster than FedAVG).
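The KL-divergence-based distillation term between two prediction heads can be sketched generically as follows. This shows only the plain KL divergence between two softmax distributions; the adaptive weighting in the paper is not specified in the abstract and is not reproduced, and the logits below are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional distillation temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits from two heads (for example, an adapter and a local classifier).
head_a = softmax([2.0, 0.5, -1.0])
head_b = softmax([1.5, 0.7, -0.8])
mutual_loss = kl_divergence(head_a, head_b) + kl_divergence(head_b, head_a)
```

Summing both directions is one common way to express mutual learning between two heads; whether the paper uses a symmetric form is an assumption here.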
[27] Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers cs.CVPDF
Sida Huang, Siqi Huang, Ping Luo, Hongyuan Zhang
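TL;DR: The paper proposes Laytrol, a method that preserves pretrained knowledge in layout control, addressing the low visual quality and stylistic inconsistency of existing layout-to-image generation methods.
Details
Motivation: Improving the spatial controllability of diffusion models in text-to-image generation is an important challenge. Existing methods often lose pretrained knowledge when introducing layout conditions, which degrades the quality and style of generated images.
Result: Qualitative and quantitative experiments validate Laytrol, whose generated images show better visual quality and style consistency.
Insight: Data synthesis and parameter inheritance can substantially reduce the loss of pretrained knowledge, improving the model's spatial controllability and generation quality.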
Abstract: With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.
[28] DiffRegCD: Integrated Registration and Change Detection with Diffusion Features cs.CV | cs.AIPDF
Seyedehnanita Madani, Rama Chellappa, Vishal M. Patel
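TL;DR: DiffRegCD is a unified framework that integrates dense registration and change detection in a single model, using frozen multi-scale features from a diffusion model and a Gaussian-smoothed classification task to reach sub-pixel accuracy.
Details
Motivation: Traditional change detection methods assume the input images are already registered, but in practice parallax, viewpoint changes, and temporal gaps often cause severe misalignment. Existing methods (such as BiFA and ChangeRD) perform poorly under large displacements.
Result: It outperforms existing baselines on multiple datasets (LEVIR-CD, DSIFN-CD, and others) and remains reliable under wide temporal and geometric variation.
Insight: A unified framework built on diffusion features and classification-based correspondence offers a new direction for handling registration in change detection.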
Abstract: Change detection (CD) is fundamental to computer vision and remote sensing, supporting applications in environmental monitoring, disaster response, and urban development. Most CD models assume co-registered inputs, yet real-world imagery often exhibits parallax, viewpoint shifts, and long temporal gaps that cause severe misalignment. Traditional two stage methods that first register and then detect, as well as recent joint frameworks (e.g., BiFA, ChangeRD), still struggle under large displacements, relying on regression only flow, global homographies, or synthetic perturbations. We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. DiffRegCD reformulates correspondence estimation as a Gaussian smoothed classification task, achieving sub-pixel accuracy and stable training. It leverages frozen multi-scale features from a pretrained denoising diffusion model, ensuring robustness to illumination and viewpoint variation. Supervision is provided through controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo labels. Extensive experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation, establishing diffusion features and classification based correspondence as a strong foundation for unified change detection.
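One way to picture correspondence estimation as Gaussian-smoothed classification is in one dimension: the target displacement becomes a soft label over discrete bins, and a sub-pixel estimate is read back as the expected bin index. This is a generic sketch under that assumption; the bin count, sigma, and function names are hypothetical and not taken from the paper.

```python
import math


def gaussian_soft_label(center, num_bins, sigma=1.0):
    """Normalized Gaussian weights over integer bins, centred on a real-valued target."""
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_position(probs):
    """Sub-pixel estimate: probability-weighted mean of the bin indices."""
    return sum(i * p for i, p in enumerate(probs))


# A target at 10.3 px lies between bins, yet the soft label still encodes it.
label = gaussian_soft_label(10.3, num_bins=21, sigma=1.0)
recovered = expected_position(label)
```

A one-hot label would round the target to the nearest bin, whereas the smoothed label keeps the fractional part, which is the intuition behind sub-pixel accuracy from a classification output.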
[29] Is It Truly Necessary to Process and Fit Minutes-Long Reference Videos for Personalized Talking Face Generation? cs.CVPDF
Rui-Qing Sun, Ang Li, Zhijing Wu, Tian Lan, Qianyu Lu
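TL;DR: The paper asks whether personalized talking face generation really requires processing minutes-long reference videos and proposes ISExplore, an efficient segment selection strategy with which a high-quality 5-second video segment matches or surpasses the full-length video.
Details
Motivation: Current NeRF- or 3DGS-based talking face generation methods must process long reference videos, which is computationally heavy and limits practical use. The authors question this necessity and suggest that a high-quality segment may suffice.
Result: Experiments show that ISExplore speeds up data processing and training by more than 5x for NeRF and 3DGS methods while maintaining high-quality output.
Insight: High-quality video segments can significantly improve efficiency and reduce redundant computation, making practical application feasible.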
Abstract: Talking Face Generation (TFG) aims to produce realistic and dynamic talking portraits, with broad applications in fields such as digital education, film and television production, and e-commerce live streaming. Currently, TFG methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) have received widespread attention. They learn and store personalized features from reference videos of each target individual to generate realistic speaking videos. To ensure that models capture sufficient 3D information and successfully learn the lip-audio mapping, previous studies usually require meticulous processing and fitting of several minutes of reference video, which always takes hours. The computational burden of processing and fitting long reference videos severely limits the practical application value of these methods. However, is it really necessary to fit such minutes of reference video? Our exploratory case studies show that using some informative reference video segments of just a few seconds can achieve performance comparable to or even better than the full reference video. This indicates that the informative quality of a video is much more important than its length. Inspired by this observation, we propose ISExplore (short for Informative Segment Explore), a simple yet effective segment selection strategy that automatically identifies the informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments demonstrate that our approach increases data processing and training speed by more than 5x for NeRF and 3DGS methods, while maintaining high-fidelity output. Project resources are available at xx.
[30] Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification cs.CV | cs.AIPDF
Zhenfeng Zhuang, Fangyu Zhou, Liansheng Wang
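TL;DR: The paper proposes Libra-MIL, a multimodal prototype learning method that incorporates task-specific language priors and improves few-shot whole slide image classification through bidirectional interaction and an optimal transport algorithm.
Details
Motivation: Whole slide images (WSIs) are computationally expensive, existing methods typically rely on unidirectional multi-instance learning (MIL), and instance-level descriptions generated by language models are biased. An efficient bidirectional interaction method is therefore needed to improve classification performance and interpretability.
Result: In few-shot classification and explainability experiments on three different cancer datasets, the method shows significant performance gains and strong generalization.
Insight: Task-specific language prototypes and bidirectional multimodal interaction are crucial for WSI classification, and the Stereoscopic Optimal Transport algorithm effectively strengthens cross-modal semantic alignment.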
Abstract: While Large Language Models (LLMs) are emerging as a promising direction in computational pathology, the substantial computational cost of giga-pixel Whole Slide Images (WSIs) necessitates the use of Multi-Instance Learning (MIL) to enable effective modeling. A key challenge is that pathological tasks typically provide only bag-level labels, while instance-level descriptions generated by LLMs often suffer from bias due to a lack of fine-grained medical knowledge. To address this, we propose that constructing task-specific pathological entity prototypes is crucial for learning generalizable features and enhancing model interpretability. Furthermore, existing vision-language MIL methods often employ unidirectional guidance, limiting cross-modal synergy. In this paper, we introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction through a balanced information compression scheme. Specifically, we leverage a frozen LLM to generate task-specific pathological entity descriptions, which are learned as text prototypes. Concurrently, the vision branch learns instance-level prototypes to mitigate the model’s reliance on redundant data. For the fusion stage, we employ the Stereoscopic Optimal Transport (SOT) algorithm, which is based on a similarity metric, thereby facilitating broader semantic alignment in a higher-dimensional space. We conduct few-shot classification and explainability experiments on three distinct cancer datasets, and the results demonstrate the superior generalization capabilities of our proposed method.
[31] Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection cs.CVPDF
Shenao Zhao, Pengpeng Liang, Zhoufan Yang
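TL;DR: The paper proposes MMAssist, which uses multi-modal assistance (image and text features) to improve unsupervised domain adaptation (UDA) for point cloud 3D object detection.
Details
Motivation: Although point clouds and images are usually collected together, existing 3D UDA methods rarely use the image data. The paper aims to use multi-modal information (images and text) to improve domain adaptation for 3D object detection.
Result: Experiments on three popular 3D object detection datasets show that MMAssist performs well compared with existing methods.
Insight: Multi-modal information, especially images and text, can markedly improve domain adaptation for 3D object detection, and the pseudo-label enhancement strategy further improves the model.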
Abstract: Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is quite popular to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground truth labels or pseudo labels to the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box’s text description, and a pre-trained text encoder is used to obtain its text feature. During the training of the model in the source domain and the student model in the target domain, we align the 3D features of the predicted boxes with their corresponding image and text features, and the 3D features and the aligned features are fused with learned weights for the final prediction. The features between the student branch and the teacher branch in the target domain are aligned as well. To enhance the pseudo labels, we use an off-the-shelf 2D object detector to generate 2D bounding boxes from images and estimate their corresponding 3D boxes with the aid of point cloud, and these 3D boxes are combined with the pseudo labels generated by the teacher model. Experimental results show that our approach achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets. The code is available at https://github.com/liangp/MMAssist.
[32] DANCE: Density-agnostic and Class-aware Network for Point Cloud Completion cs.CVPDF
Da-Yeong Kim, Yeong-Jun Cho
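TL;DR: DANCE is a new point cloud completion framework that generates candidate points through ray-based sampling and multi-view fusion and uses semantic information for accurate completion, while remaining robust to the density and noise of the input point cloud.
Details
Motivation: Existing methods usually assume fixed input/output densities or rely on image-based representations, so they adapt poorly to the variable sparsity and limited supervision of real-world scenes. DANCE aims to solve this problem.
Result: On the PCN and MVP benchmarks, DANCE outperforms existing methods in accuracy and structural consistency and is robust to input density and noise.
Insight: DANCE shows how semantic guidance can deliver high-quality point cloud completion without external image supervision while adapting to the complexity of real-world scenes.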
Abstract: Point cloud completion aims to recover missing geometric structures from incomplete 3D scans, which often suffer from occlusions or limited sensor viewpoints. Existing methods typically assume fixed input/output densities or rely on image-based representations, making them less suitable for real-world scenarios with variable sparsity and limited supervision. In this paper, we introduce Density-agnostic and Class-aware Network (DANCE), a novel framework that completes only the missing regions while preserving the observed geometry. DANCE generates candidate points via ray-based sampling from multiple viewpoints. A transformer decoder then refines their positions and predicts opacity scores, which determine the validity of each point for inclusion in the final surface. To incorporate semantic guidance, a lightweight classification head is trained directly on geometric features, enabling category-consistent completion without external image supervision. Extensive experiments on the PCN and MVP benchmarks show that DANCE outperforms state-of-the-art methods in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.
[33] ChexFract: From General to Specialized - Enhancing Fracture Description Generation cs.CVPDF
Nikolay Nechaev, Evgeniia Przhezdzetskaia, Dmitry Umerenkov, Dmitry V. Dylov
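TL;DR: The paper presents specialized vision-language models for fracture detection and description that significantly improve description accuracy for this rare but clinically important pathology, and it releases the best model to support future research.
Details
Motivation: In medical AI, generating accurate and clinically meaningful radiology reports from chest X-rays remains challenging, and general-purpose models perform poorly at describing rare but important pathologies such as fractures.
Result: The specialized models significantly outperform general-purpose models at generating accurate fracture descriptions, and an analysis by fracture type, location, and age reveals the models' strengths and weaknesses.
Insight: Current vision-language model architectures still need improvement on rare pathology description, and specialized models are an effective way to address this.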
Abstract: Generating accurate and clinically meaningful radiology reports from chest X-ray images remains a significant challenge in medical AI. While recent vision-language models achieve strong results in general radiology report generation, they often fail to adequately describe rare but clinically important pathologies like fractures. This work addresses this gap by developing specialized models for fracture pathology detection and description. We train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, demonstrating significant improvements over general-purpose models in generating accurate fracture descriptions. Analysis of model outputs by fracture type, location, and age reveals distinct strengths and limitations of current vision-language model architectures. We publicly release our best-performing fracture-reporting model, facilitating future research in accurate reporting of rare pathologies.
[34] CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting cs.CVPDF
Chae-Yeon Heo, Yeong-Jun Cho
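TL;DR: The paper proposes CSF-Net, a semantic-guided transformer framework for large-mask image inpainting that fuses structure-aware candidates generated by a pretrained Amodal Completion (AC) model with contextual features to improve inpainting quality.
Details
Motivation: In large-mask image inpainting, missing visual content and limited context lead to poor results. Existing methods struggle to ensure both structural accuracy and semantic consistency.
Result: Experiments on the Places365 and COCOA datasets show that CSF-Net reduces object hallucination and improves visual realism and semantic alignment.
Insight: Combining semantic priors with contextual features can markedly improve large-mask inpainting, and transformer-based fusion performs well on this task.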
Abstract: In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce Context-Semantic Fusion Network (CSF-Net), a transformer-based fusion framework that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment. The code for CSF-Net is available at https://github.com/chaeyeonheo/CSF-Net.
[35] Hardware-Aware YOLO Compression for Low-Power Edge AI on STM32U5 for Weeds Detection in Digital Agriculture cs.CV | cs.AIPDF
Charalampos S. Kouzinopoulos, Yuri Manna
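TL;DR: The paper presents a low-power edge AI system based on YOLOv8n for weed detection in digital agriculture, optimizing the model with structured pruning, integer quantization, and input resolution scaling to fit the hardware constraints of an STM32U5 microcontroller.
Details
Motivation: Traditional weed management relies on chemical herbicides, which risk environmental contamination and herbicide-resistant weeds. Precision weeding with computer vision and machine learning is promising but usually depends on high-power computing platforms.
Result: On the CropAndWeed dataset the system balances detection accuracy and efficiency, with energy consumption as low as 51.8 mJ per inference.
Insight: The system shows the potential of deploying efficient AI models on resource-constrained edge devices and offers a feasible route to sustainable weeding in digital agriculture.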
Abstract: Weeds significantly reduce crop yields worldwide and pose major challenges to sustainable agriculture. Traditional weed management methods, primarily relying on chemical herbicides, risk environmental contamination and lead to the emergence of herbicide-resistant species. Precision weeding, leveraging computer vision and machine learning methods, offers a promising eco-friendly alternative but is often limited by reliance on high-power computational platforms. This work presents an optimized, low-power edge AI system for weeds detection based on the YOLOv8n object detector deployed on the STM32U575ZI microcontroller. Several compression techniques are applied to the detection model, including structured pruning, integer quantization and input image resolution scaling in order to meet strict hardware constraints. The model is trained and evaluated on the CropAndWeed dataset with 74 plant species, achieving a balanced trade-off between detection accuracy and efficiency. Our system supports real-time, in-situ weeds detection with a minimal energy consumption of 51.8mJ per inference, enabling scalable deployment in power-constrained agricultural environments.
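Integer quantization, one of the compression steps listed above, can be illustrated with a symmetric per-tensor int8 scheme. This is a generic sketch, not the toolchain or scheme used in the paper; the weights are made up and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Toy weights; each value is stored in one byte instead of four.
weights = [-2.0, 0.0, 0.5, 2.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is the accuracy cost traded for the smaller memory footprint and integer arithmetic on a microcontroller.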
[36] Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning cs.CV | cs.AIPDF
Jialong Qin, Xin Zou, Di Lu, Yibo Yan, Xuming Hu
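TL;DR: SharpV is an efficient adaptive pruning method for visual tokens and the KV cache that dynamically adjusts pruning ratios based on spatial-temporal information, reducing the computational complexity of VideoLLMs while occasionally improving performance.
Details
Motivation: Existing VideoLLMs process redundant visual tokens, which causes high computational complexity and KV cache growth, so an efficient solution is needed.
Result: It performs strongly on multiple public benchmarks and exceeds the dense model on some tasks.
Insight: Hierarchical cache pruning from an information bottleneck perspective offers a new view of information flow in VideoLLMs.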
Abstract: Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value cache scaling, due to their reliance on processing excessive redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and KV cache. Different from most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatial-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features via a self-calibration manner, guided by similarity to original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of information bottleneck, offering a new insight into VideoLLMs’ information flow. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is notably the first two-stage pruning framework that operates without requiring access to exposed attention scores, ensuring full compatibility with hardware acceleration techniques like Flash Attention.
[37] EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision cs.CVPDF
Yifei Cao, Yu Liu, Guolong Wang, Zhu Liu, Kai Wang
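TL;DR: EAGLE is a novel framework that integrates explicit and implicit geometric feature memory to unify localization for 2D-3D visual query tasks, significantly improving retrieval accuracy under viewpoint and appearance changes.
Details
Motivation: The work addresses the challenges that camera motion, viewpoint changes, and appearance variations pose for egocentric visual query localization.
Result: It achieves state-of-the-art performance on the Ego4D-VQ benchmark.
Insight: Bio-inspired memory mechanisms can handle visual query tasks in dynamic scenes effectively, and joint modeling of explicit and implicit geometric features is key.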
Abstract: Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM) with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
[38] High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection cs.CVPDF
Zhiyuan Chen, Yuelin Guo, Zitong Huang, Haoyu He, Renhao Lu
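TL;DR: The paper proposes Cascade HQP-DETR, which combines high-quality data generation, image-prior-guided query encoding, and a cascade denoising algorithm to significantly improve imaginary supervised object detection and its generalization to real data.
Details
Motivation: Conventional object detectors depend on large annotated datasets that are costly to build; existing imaginary supervised methods are limited in data quality, detector convergence speed, and robustness to noise.
Result: With only 12 training epochs, it reaches 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming existing methods.
Insight: High-quality synthetic data and a dynamic training strategy are key to improving imaginary supervised object detection.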
Abstract: Object detection models demand large-scale annotated datasets, which are costly and labor-intensive to create. This motivated Imaginary Supervised Object Detection (ISOD), where models train on synthetic images and test on real images. However, existing methods face three limitations: (1) synthetic datasets suffer from simplistic prompts, poor image quality, and weak supervision; (2) DETR-based detectors, due to their random query initialization, struggle with slow convergence and overfitting to synthetic patterns, hindering real-world generalization; (3) uniform denoising pressure promotes model overfitting to pseudo-label noise. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets, advancing ISOD from weak to full supervision. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals and RoI-pooled features, accelerating convergence while steering the model to learn transferable features instead of overfitting to synthetic patterns. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers, guiding the model to learn robust boundaries from reliable visual cues rather than overfitting to noisy labels. Trained for just 12 epochs solely on FluxVOC, Cascade HQP-DETR achieves a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming strong baselines, with its competitive real-data performance confirming the architecture’s universal applicability.
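Note: the cascade denoising idea in the abstract above (stricter IoU thresholds at deeper decoder layers) can be illustrated with a minimal sketch. This is not the authors' implementation. The threshold values (0.5, 0.6, 0.7), the binary weighting rule, and the names box_iou and cascade_weights are assumptions chosen for illustration only.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cascade_weights(pred_box, label_box, thresholds=(0.5, 0.6, 0.7)):
    """Per-decoder-layer training weight for one (prediction, pseudo-label) pair.

    Assumption: layer i trusts the label only if the IoU reaches thresholds[i],
    so deeper layers (higher thresholds) learn from fewer, more reliable labels.
    """
    iou = box_iou(pred_box, label_box)
    return [1.0 if iou >= t else 0.0 for t in thresholds]


# A prediction that overlaps its pseudo-label with IoU 0.65 contributes to the
# first two layers but is ignored by the strictest one.
weights = cascade_weights((0, 0, 10, 10), (0, 0, 10, 6.5))
```

Under this reading, a noisy pseudo-label still guides the early layers, while only well-localized pairs shape the final predictions.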
[39] Multi-modal Deepfake Detection and Localization with FPN-Transformer cs.CV | cs.AIPDF
Chende Zheng, Ruiqi Suo, Zhoulin Ji, Jingyi Deng, Fangbin Yi
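TL;DR: This paper proposes a multi-modal deepfake detection and localization method based on an FPN-Transformer, using cross-modal correlations and temporal boundary regression to address the limitations of existing detection methods.
Details
Motivation: Current unimodal detectors cannot exploit cross-modal correlations and struggle to localize forged segments precisely, which limits their effectiveness against sophisticated, fine-grained forgeries.
Result: Scores 0.7535 on the IJCAI’25 DDL-AV test set, showing strong cross-modal detection and localization performance.
Insight: Joint multi-modal feature analysis and a frame-level localization framework offer a new approach to generalized deepfake detection.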
Abstract: The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. The dual-branch prediction head simultaneously predicts forgery probabilities and refines temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI’25 DDL-AV benchmark, achieving a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and offer a new approach to generalized deepfake detection. Our code is available at https://github.com/Zig-HS/MM-DDL
[40] Perceptual Quality Assessment of 3D Gaussian Splatting: A Subjective Dataset and Prediction Metric cs.CVPDF
Zhaolin Wan, Yining Diao, Jingqi Xu, Hao Wang, Zhiyang Li
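TL;DR: The paper presents 3DGS-QA, the first subjective quality assessment dataset for 3D Gaussian Splatting (3DGS) content, and designs a no-reference quality prediction model that operates on native Gaussian primitives to quantify perceptual quality under different distortions.
Details
Motivation: 3DGS is a leading technique for real-time, high-fidelity rendering, but its perceptual quality under varying reconstruction conditions has not been studied systematically. The paper aims to fill this gap by quantifying how common distortion factors affect visual quality.
Result: Experiments show that the proposed model outperforms traditional and learning-based methods on 3DGS content assessment.
Insight: Using native Gaussian features directly offers a clear advantage for 3DGS quality assessment, and future work can explore this direction further.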
Abstract: With the rapid advancement of 3D visualization, 3D Gaussian Splatting (3DGS) has emerged as a leading technique for real-time, high-fidelity rendering. While prior research has emphasized algorithmic performance and visual fidelity, the perceptual quality of 3DGS-rendered content, especially under varying reconstruction conditions, remains largely underexplored. In practice, factors such as viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions can significantly degrade visual quality, yet their perceptual impact has not been systematically studied. To bridge this gap, we present 3DGS-QA, the first subjective quality assessment dataset for 3DGS. It comprises 225 degraded reconstructions across 15 object types, enabling a controlled investigation of common distortion factors. Based on this dataset, we introduce a no-reference quality prediction model that directly operates on native 3D Gaussian primitives, without requiring rendered images or ground-truth references. Our model extracts spatial and photometric cues from the Gaussian representation to estimate perceived quality in a structure-aware manner. We further benchmark existing quality assessment methods, spanning both traditional and learning-based approaches. Experimental results show that our method consistently achieves superior performance, highlighting its robustness and effectiveness for 3DGS content evaluation. The dataset and code are made publicly available at https://github.com/diaoyn/3DGSQA to facilitate future research in 3DGS quality assessment.
[41] WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation cs.CVPDF
Gongshu Wang, Zhirui Wang, Kan Yang
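TL;DR: WEDepth proposes an efficient way to use pre-trained vision foundation models (VFMs) for monocular depth estimation (MDE) without modifying their structure or weights. It injects prior knowledge through multi-level feature enhancement, achieves SOTA performance on NYU-Depth v2 and KITTI, and shows strong zero-shot transfer.
Details
Motivation: MDE is highly challenging because reconstructing a 3D scene from a single 2D image is inherently ill-posed. VFMs pre-trained on large-scale, diverse data show strong world understanding that can help MDE.
Result: On NYU-Depth v2 and KITTI, WEDepth achieves SOTA performance, with results comparable to diffusion-based methods (which require multiple forward passes) and methods pre-trained on relative depth.
Insight: WEDepth shows how multi-level feature enhancement can exploit a pre-trained model's priors without changing the model, offering a template for efficient transfer to other vision tasks.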
Abstract: Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate that our method exhibits strong zero-shot transfer capability across diverse scenarios.
[42] ProSona: Prompt-Guided Personalization for Multi-Expert Medical Image Segmentation cs.CV | cs.AIPDF
Aya Elgebaly, Nikolaos Delopoulos, Juliane Hörner-Rieber, Carolin Rippke, Sebastian Klüter
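TL;DR: ProSona is a two-stage framework that learns a continuous latent space of annotation styles and uses natural language prompts for controllable, personalized medical image segmentation, improving segmentation results over DPersona.
Details
Motivation: Expert annotations in medical image segmentation differ substantially; existing methods either ignore this variability or use a separate branch per annotator. ProSona aims to control personalized segmentation flexibly through natural language prompts.
Result: On the LIDC-IDRI lung nodule and multi-institutional prostate MRI datasets, ProSona reduces the Generalized Energy Distance by 17% and improves mean Dice by more than one point.
Insight: Natural language prompts can control personalized medical image segmentation flexibly, accurately, and interpretably, offering a new way to handle annotation variability.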
Abstract: Automated medical image segmentation suffers from high inter-observer variability, particularly in tasks such as lung nodule delineation, where experts often disagree. Existing approaches either collapse this variability into a consensus mask or rely on separate model branches for each annotator. We introduce ProSona, a two-stage framework that learns a continuous latent space of annotation styles, enabling controllable personalization via natural language prompts. A probabilistic U-Net backbone captures diverse expert hypotheses, while a prompt-guided projection mechanism navigates this latent space to generate personalized segmentations. A multi-level contrastive objective aligns textual and visual representations, promoting disentangled and interpretable expert styles. Across the LIDC-IDRI lung nodule and multi-institutional prostate MRI datasets, ProSona reduces the Generalized Energy Distance by 17% and improves mean Dice by more than one point compared with DPersona. These results demonstrate that natural-language prompts can provide flexible, accurate, and interpretable control over personalized medical image segmentation. Our implementation is available online.
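Note: the two metrics reported above can be made concrete with a small sketch. It uses the common definitions: Dice as 2|A∩B| / (|A| + |B|), and the squared Generalized Energy Distance as 2E[d(S,Y)] - E[d(S,S')] - E[d(Y,Y')] with d = 1 - IoU. Whether ProSona uses exactly this distance d is an assumption, and the flat 0/1 list encoding of masks and the function names are illustrative only.

```python
from itertools import product


def dice(a, b):
    """Dice coefficient of two binary masks given as flat lists of 0/1."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2.0 * inter / total if total else 1.0


def iou_distance(a, b):
    """Distance d = 1 - IoU between two binary masks (0 if both are empty)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - inter / union if union else 0.0


def ged_squared(samples, annotations):
    """Squared GED between model samples and expert annotations."""
    def mean_dist(xs, ys):
        pairs = list(product(xs, ys))
        return sum(iou_distance(x, y) for x, y in pairs) / len(pairs)

    return (2 * mean_dist(samples, annotations)
            - mean_dist(samples, samples)
            - mean_dist(annotations, annotations))


model_samples = [[1, 1, 0, 0]]
expert_masks = [[1, 0, 0, 0]]
overlap = dice(model_samples[0], expert_masks[0])
distance = ged_squared(model_samples, expert_masks)
```

A lower GED means the set of predicted masks matches both the content and the diversity of the expert annotations more closely, which is why it complements a per-mask overlap score such as Dice.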
[43] Generalized-Scale Object Counting with Gradual Query Aggregation cs.CVPDF
Jer Pelhan, Alan Lukezic, Matej Kristan
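TL;DR: GECO2 is an end-to-end few-shot counting and detection method that uses gradual query aggregation to handle diverse object scales and dense small objects, significantly improving counting and detection accuracy.
Details
Motivation: Existing few-shot counters perform poorly on multi-scale objects and dense small objects and rely mainly on ad hoc workarounds such as upsampling and tiling. GECO2 aims to address these problems directly.
Result: GECO2 surpasses existing methods by 10% in counting and detection accuracy while running 3x faster with a smaller GPU memory footprint.
Insight: Gradual query aggregation is an effective approach to multi-scale object detection and dense small-object counting.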
Abstract: Few-shot detection-based counters estimate the number of instances in the image specified only by a few test-time exemplars. A common approach to localize objects across multiple sizes is to merge backbone features of different resolutions. Furthermore, to enable small object detection in densely populated regions, the input image is commonly upsampled and tiling is applied to cope with the increased computational and memory requirements. Because of these ad-hoc solutions, existing counters struggle with images containing diverse-sized objects and densely populated regions of small objects. We propose GECO2, an end-to-end few-shot counting and detection method that explicitly addresses the object scale issues. A new dense query representation gradually aggregates exemplar-specific feature information across scales, yielding high-resolution dense queries that enable detection of large as well as small objects. GECO2 surpasses state-of-the-art few-shot counters in counting as well as detection accuracy by 10% while running 3x faster with a smaller GPU memory footprint.
[44] Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching cs.CV | cs.AIPDF
Aditi Singhania, Arushi Jain, Krutik Malani, Riddhi Dhawan, Souymodip Chakraborty
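TL;DR: The paper proposes a LoRA fine-tuned diffusion model that uses latent concatenation and a masked Conditional Flow Matching (CFM) objective to address the trade-off between identity consistency and prompt diversity in subject-driven image generation. It also introduces a two-stage distilled data curation framework and CHARIS, a large-scale evaluation tool.
Details
Motivation: Subject-driven image generation must preserve a subject's core identity features across diverse contexts, but existing methods show a clear trade-off between identity consistency and prompt diversity and cannot deliver both high-quality generation and large-scale training.
Result: The method strikes a better balance between identity consistency and prompt diversity while supporting large-scale training and efficient fine-tuning. The CHARIS framework provides comprehensive assessment of generation quality.
Insight: 1. Combining latent concatenation with masked CFM is an effective way to address identity consistency; 2. An automated data curation pipeline can markedly improve training efficiency and model generalization; 3. Multi-dimensional evaluation tools help in understanding generative model performance in depth.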
Abstract: Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving both strong identity consistency and high prompt diversity presents a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage utilizes these curated examples for parameter-efficient fine-tuning, thus scaling the generation capability across various subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
[45] I2E: Real-Time Image-to-Event Conversion for High-Performance Spiking Neural Networks cs.CVPDF
Ruichen Ma, Liwei Meng, Guanchao Qiao, Ning Ning, Yang Liu
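TL;DR: I2E is a real-time framework for converting images into event streams that addresses the scarcity of event-stream data for training spiking neural networks (SNNs); it far outperforms existing methods and is validated on multiple datasets.
Details
Motivation: SNNs attract attention for their potential for energy-efficient computing, but a lack of high-quality event-stream data limits their development in practice. I2E aims to fill this data gap by simulating microsaccadic eye movements to convert static images into high-fidelity event streams.
Result: An SNN trained on the generated I2E-ImageNet dataset reaches 60.50% accuracy (SOTA); with a pre-training and fine-tuning strategy, it achieves 92.5% accuracy on CIFAR10-DVS.
Insight: Synthetic event data can stand in for real sensor data, giving neuromorphic engineering a viable high-performance solution to its long-standing data scarcity.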
Abstract: Spiking neural networks (SNNs) promise highly energy-efficient computing, but their adoption is hindered by a critical scarcity of event-stream data. This work introduces I2E, an algorithmic framework that resolves this bottleneck by converting static images into high-fidelity event streams. By simulating microsaccadic eye movements with a highly parallelized convolution, I2E achieves a conversion speed over 300x faster than prior methods, uniquely enabling on-the-fly data augmentation for SNN training. The framework’s effectiveness is demonstrated on large-scale benchmarks. An SNN trained on the generated I2E-ImageNet dataset achieves a state-of-the-art accuracy of 60.50%. Critically, this work establishes a powerful sim-to-real paradigm where pre-training on synthetic I2E data and fine-tuning on the real-world CIFAR10-DVS dataset yields an unprecedented accuracy of 92.5%. This result validates that synthetic event data can serve as a high-fidelity proxy for real sensor data, bridging a long-standing gap in neuromorphic engineering. By providing a scalable solution to the data problem, I2E offers a foundational toolkit for developing high-performance neuromorphic systems. The open-source algorithm and all generated datasets are provided to accelerate research in the field.
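Note: the general idea of turning a static image into events by simulating small eye movements can be sketched as follows. This is a toy illustration, not the I2E algorithm: the abstract describes a highly parallelized convolution, whereas this sketch uses plain loops, and the shift pattern, the contrast threshold, and the names shift and image_to_events are all assumptions.

```python
def shift(image, dx, dy):
    """Shift a 2D list image by (dx, dy) pixels, replicating edge pixels."""
    h, w = len(image), len(image[0])
    return [[image[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)]
             for x in range(w)]
            for y in range(h)]


def image_to_events(image,
                    offsets=((1, 0), (0, 1), (-1, 0), (0, -1)),
                    threshold=0.1):
    """Emit (t, x, y, polarity) events from brightness changes between views.

    Each offset stands in for one small simulated eye movement. A pixel fires
    an ON event (+1) if it brightened by at least `threshold` relative to the
    previous view, and an OFF event (-1) if it darkened by at least that much.
    """
    events = []
    previous = image
    for t, (dx, dy) in enumerate(offsets):
        current = shift(image, dx, dy)
        for y, row in enumerate(current):
            for x, value in enumerate(row):
                diff = value - previous[y][x]
                if diff >= threshold:
                    events.append((t, x, y, 1))
                elif diff <= -threshold:
                    events.append((t, x, y, -1))
        previous = current
    return events


# A single bright pixel moving one step to the right produces one OFF event
# where it was and one ON event where it arrives.
dot = [[0, 0, 0],
       [0, 1, 0],
       [0, 0, 0]]
events = image_to_events(dot)
```

Because events come only from intensity changes, uniform regions stay silent and edges dominate the stream, which matches how event cameras respond to motion.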
[46] Radar-APLANC: Unsupervised Radar-based Heartbeat Sensing via Augmented Pseudo-Label and Noise Contrast cs.CV | cs.AI | cs.HC | eess.SPPDF
Ying Wang, Zhaodong Sun, Xu Cheng, Zuxian He, Xiaobai Li
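TL;DR: The paper proposes Radar-APLANC, an unsupervised radar-based heartbeat sensing method that improves performance through augmented pseudo-labels and noise contrast, avoiding the dependence on labeled data of earlier methods.
Details
Motivation: Traditional radar heartbeat sensing degrades under noise, while learning-based methods need costly labeled data. The authors propose an unsupervised framework to address both problems.
Result: Experiments on the Equipleth dataset and a self-collected dataset show that Radar-APLANC performs close to state-of-the-art supervised methods.
Insight: With noise contrast and pseudo-label augmentation, an unsupervised method can improve markedly without labeled data, offering a new approach to unsupervised physiological signal sensing.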
Abstract: Frequency Modulated Continuous Wave (FMCW) radars can measure subtle chest wall oscillations to enable non-contact heartbeat sensing. However, traditional radar-based heartbeat sensing methods face performance degradation due to noise. Learning-based radar methods achieve better noise robustness but require costly labeled signals for supervised training. To overcome these limitations, we propose the first unsupervised framework for radar-based heartbeat sensing via Augmented Pseudo-Label and Noise Contrast (Radar-APLANC). We propose to use both the heartbeat range and noise range within the radar range matrix to construct the positive and negative samples, respectively, for improved noise robustness. Our Noise-Contrastive Triplet (NCT) loss only utilizes positive samples, negative samples, and pseudo-label signals generated by the traditional radar method, thereby avoiding dependence on expensive ground-truth physiological signals. We further design a pseudo-label augmentation approach featuring adaptive noise-aware label selection to improve pseudo-label signal quality. Extensive experiments on the Equipleth dataset and our collected radar dataset demonstrate that our unsupervised method achieves performance comparable to state-of-the-art supervised methods. Our code, dataset, and supplementary materials can be accessed from https://github.com/RadarHRSensing/Radar-APLANC.
[47] CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion cs.CV | cs.AIPDF
Cameron Braunstein, Mariya Toneva, Eddy Ilg
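TL;DR: The paper finds that human-like semantic understanding in Stable Diffusion comes mainly from the CLIP model rather than the reverse diffusion process. The semantic information encoded by CLIP is closer to human understanding than that of the diffusion process.
Details
Motivation: The study explores whether latent diffusion models such as Stable Diffusion hold human-interpretable semantic representations during text-to-image generation.
Result: Experiments show that CLIP's text encoding is the main carrier of semantic information, and that the ability to distinguish semantic attributes declines gradually during the diffusion process.
Insight: CLIP, a separately trained vision-language model, is the key to semantic understanding, while the diffusion process acts only as a visual decoder.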
Abstract: Latent diffusion models such as Stable Diffusion achieve state-of-the-art results on text-to-image generation tasks. However, the extent to which these models have a semantic understanding of the images they generate is not well understood. In this work, we investigate whether the internal representations used by these models during text-to-image generation contain semantic information that is meaningful to humans. To do so, we perform probing on Stable Diffusion with simple regression layers that predict semantic attributes for objects and evaluate these predictions against human annotations. Surprisingly, we find that this success can actually be attributed to the text encoding occurring in CLIP rather than the reverse diffusion process. We demonstrate that groups of specific semantic attributes have markedly different decoding accuracy than the average, and are thus represented to different degrees. Finally, we show that attributes become more difficult to disambiguate from one another during the inverse diffusion process, further demonstrating the strongest semantic representation of object attributes in CLIP. We conclude that the separately trained CLIP vision-language model is what determines the human-like semantic representation, and that the diffusion process instead takes the role of a visual decoder.
[48] Beyond the Pixels: VLM-based Evaluation of Identity Preservation in Reference-Guided Synthesis cs.CV | cs.AIPDF
Aditi Singhania, Krutik Malani, Riddhi Dhawan, Arushi Jain, Garv Tandon
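TL;DR: The paper proposes "Beyond the Pixels", a hierarchical evaluation framework for assessing identity preservation in generative models. It uses feature-level decomposition and structured prompting to drive fine-grained analysis by vision-language models (VLMs), significantly improving evaluation accuracy and consistency.
Details
Motivation: Evaluating identity preservation in generative models is a critical but unresolved challenge. Existing methods usually rely on global embeddings or coarse VLM prompting, which miss fine-grained identity changes and offer little diagnostic insight.
Result: Validation on four state-of-the-art generative models shows strong agreement with human judgments. The new benchmark contains 1,078 image-prompt pairs covering diverse subjects and transformation axes.
Insight: Fine-grained decomposition and structured VLM analysis can significantly improve the accuracy and interpretability of identity preservation evaluation.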
Abstract: Evaluating identity preservation in generative models remains a critical yet unresolved challenge. Existing metrics rely on global embeddings or coarse VLM prompting, failing to capture fine-grained identity changes and providing limited diagnostic insight. We introduce Beyond the Pixels, a hierarchical evaluation framework that decomposes identity assessment into feature-level transformations. Our approach guides VLMs through structured reasoning by (1) hierarchically decomposing subjects into (type, style) -> attribute -> feature decision tree, and (2) prompting for concrete transformations rather than abstract similarity scores. This decomposition grounds VLM analysis in verifiable visual evidence, reducing hallucinations and improving consistency. We validate our framework across four state-of-the-art generative models, demonstrating strong alignment with human judgments in measuring identity consistency. Additionally, we introduce a new benchmark specifically designed to stress-test generative models. It comprises 1,078 image-prompt pairs spanning diverse subject types, including underrepresented categories such as anthropomorphic and animated characters, and captures an average of six to seven transformation axes per prompt.
[49] StableMorph: High-Quality Face Morph Generation with Stable Diffusion cs.CV | cs.AIPDF
Wassim Kabbani, Kiran Raja, Raghavendra Ramachandra, Christoph Busch
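TL;DR: StableMorph uses Stable Diffusion to generate high-quality, artifact-free morphed face images, overcoming the blur and artifacts of existing methods and giving biometric security more realistic attack samples and evaluation standards.
Details
Motivation: Existing face morphing methods produce low-quality images that are easy to detect and cannot simulate the advanced attacks seen in real scenarios. High-quality, realistic morphs are needed to develop and evaluate the security of biometric systems.
Result: The generated morphs rival or exceed genuine face images in quality, fool face recognition systems effectively, and provide more challenging test data for MAD systems.
Insight: Diffusion models perform well on face morphing, generating high-quality, hard-to-detect images and making biometric security evaluation more realistic.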
Abstract: Face morphing attacks threaten the integrity of biometric identity systems by enabling multiple individuals to share a single identity. To develop and evaluate effective morphing attack detection (MAD) systems, we need access to high-quality, realistic morphed images that reflect the challenges posed in real-world scenarios. However, existing morph generation methods often produce images that are blurry, riddled with artifacts, or poorly constructed, making them easy to detect and not representative of the most dangerous attacks. In this work, we introduce StableMorph, a novel approach that generates highly realistic, artifact-free morphed face images using modern diffusion-based image synthesis. Unlike prior methods, StableMorph produces full-head images with sharp details, avoids common visual flaws, and offers unmatched control over visual attributes. Through extensive evaluation, we show that StableMorph images not only rival or exceed the quality of genuine face images but also maintain a strong ability to fool face recognition systems, posing a greater challenge to existing MAD solutions and setting a new standard for morph quality in research and operational testing. StableMorph improves the evaluation of biometric security by creating more realistic and effective attacks and supports the development of more robust detection systems.
[50] Introducing Nylon Face Mask Attacks: A Dataset for Evaluating Generalised Face Presentation Attack Detection cs.CV | cs.ETPDF
Manasa, Sushrut Patwardhan, Narayan Vetrekar, Pavan Kumar, R. S. Gad
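TL;DR: This paper introduces a new dataset, Nylon Face Masks Attacks, for evaluating the generalization of face presentation attack detection (PAD). The dataset simulates advanced 3D spoofing scenarios and reveals how differently existing PAD methods perform against a novel attack.
Details
Motivation: Face recognition systems are widely deployed across many applications but remain vulnerable to presentation attacks (PAs). Nylon Face Masks (NFMs) are a novel attack instrument whose elasticity and realistic appearance let them closely mimic a victim's facial geometry, seriously challenging system reliability.
Result: Experiments show significant performance variability across existing PAD methods under NFM attacks, underscoring the importance of PAD techniques that cope with emerging spoofing threats.
Insight: NFM attacks pose new challenges for PAD, and the generalization of existing methods across diverse attack scenarios still needs improvement.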
Abstract: Face recognition systems are increasingly deployed across a wide range of applications, including smartphone authentication, access control, and border security. However, these systems remain vulnerable to presentation attacks (PAs), which can significantly compromise their reliability. In this work, we introduce a new dataset focused on a novel and realistic presentation attack instrument called Nylon Face Masks (NFMs), designed to simulate advanced 3D spoofing scenarios. NFMs are particularly concerning due to their elastic structure and photorealistic appearance, which enable them to closely mimic the victim’s facial geometry when worn by an attacker. To reflect real-world smartphone-based usage conditions, we collected the dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four distinct presentation scenarios involving both humans and mannequins. We benchmark the dataset using five state-of-the-art PAD methods to evaluate their robustness under unseen attack conditions. The results demonstrate significant performance variability across methods, highlighting the challenges posed by NFMs and underscoring the importance of developing PAD techniques that generalise effectively to emerging spoofing threats.
[51] LatentPrintFormer: A Hybrid CNN-Transformer with Spatial Attention for Latent Fingerprint identification cs.CVPDF
Arnab Maity, Manasa, Pavan Kumar C, Raghavendra Ramachandra
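TL;DR: LatentPrintFormer is a hybrid CNN-Transformer model for latent fingerprint identification that uses a spatial attention module to improve performance, significantly outperforming existing methods.
Details
Motivation: Latent fingerprint identification is challenging because of low image quality, background noise, and partial impressions.
Result: On two public datasets, LatentPrintFormer outperforms three SOTA methods in Rank-10 identification rate.
Insight: Combining a hybrid CNN-Transformer architecture with a spatial attention module effectively improves latent fingerprint identification.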
Abstract: Latent fingerprint identification remains a challenging task due to low image quality, background noise, and partial impressions. In this work, we propose a novel identification approach called LatentPrintFormer. The proposed model integrates a CNN backbone (EfficientNet-B0) and a Transformer backbone (Swin Tiny) to extract both local and global features from latent fingerprints. A spatial attention module is employed to emphasize high-quality ridge regions while suppressing background noise. The extracted features are fused and projected into a unified 512-dimensional embedding, and matching is performed using cosine similarity in a closed-set identification setting. Extensive experiments on two publicly available datasets demonstrate that LatentPrintFormer consistently outperforms three state-of-the-art latent fingerprint recognition techniques, achieving higher identification rates across Rank-10.
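Note: the matching and evaluation step described above (cosine similarity between embeddings in a closed-set setting, scored by rank-k identification rate) can be sketched in a few lines. This is an illustrative sketch only: the 2-dimensional toy embeddings stand in for the paper's 512-dimensional ones, and the names cosine_similarity and rank_k_rate are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_k_rate(probes, gallery, k):
    """Fraction of probes whose true identity is among the top-k gallery matches.

    probes:  list of (true_identity, embedding) pairs
    gallery: dict mapping identity -> enrolled embedding (closed set, so every
             probe identity is assumed to be enrolled)
    """
    hits = 0
    for true_id, embedding in probes:
        ranked = sorted(
            gallery,
            key=lambda gid: cosine_similarity(embedding, gallery[gid]),
            reverse=True,
        )
        if true_id in ranked[:k]:
            hits += 1
    return hits / len(probes)


gallery = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
probes = [("a", [0.9, 0.1]), ("b", [0.6, 0.4])]
rank1 = rank_k_rate(probes, gallery, k=1)
rank3 = rank_k_rate(probes, gallery, k=3)
```

Reporting the rate at several values of k shows how often the correct identity appears in a short candidate list, which is the usual way latent fingerprint systems are compared.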
[52] OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition cs.CV | cs.AIPDF
Lixu Sun, Nurmemet Yolwas, Wushour Silamu
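TL;DR: OTSNet introduces a three-stage Observation-Thinking-Spelling pipeline inspired by human neurocognition; with a dual attention mechanism, a position-aware module, and a multi-modal verifier, it significantly improves scene text recognition accuracy.
Details
Motivation: Existing scene text recognition (STR) methods optimize the visual and linguistic modalities separately, which introduces cross-modal misalignment and propagates errors, especially on irregular text.
Result: Reaches average accuracies of 83.5% on Union14M-L and 79.1% on OST, setting new records in 9 of 14 evaluation scenarios.
Insight: Unifying the visual and linguistic modalities by mimicking human cognitive processes effectively addresses cross-modal alignment in irregular text recognition.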
Abstract: Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.
[53] PEOD: A Pixel-Aligned Event-RGB Benchmark for Object Detection under Challenging Conditions cs.CVPDF
Luoping Cui, Hanqing Liu, Mingjie Liu, Endian Lin, Donghong Jiang
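TL;DR: PEOD is the first large-scale, pixel-aligned, high-resolution (1280 x 720) Event-RGB dataset for object detection under challenging conditions, filling gaps left by existing datasets.
Details
Motivation: Existing Event-RGB datasets offer limited coverage of extreme conditions and low spatial resolution (≤640 x 480), which prevents comprehensive evaluation of object detectors in challenging scenarios.
Result: On the normal subset and the full test set, fusion methods perform best. On the illumination challenge subset, however, the top event-based model outperforms all fusion models, while fusion models still beat RGB-based methods, revealing the limits of existing fusion methods when the frame modality is severely degraded.
Insight: PEOD provides a high-quality benchmark for multimodal perception and reveals both the advantage of event data under extreme conditions and the limits of fusion methods under modality degradation.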
Abstract: Robust object detection for challenging scenarios increasingly relies on event cameras, yet existing Event-RGB datasets remain constrained by sparse coverage of extreme conditions and low spatial resolution (<= 640 x 480), which prevents comprehensive evaluation of detectors under challenging scenarios. To address these limitations, we propose PEOD, the first large-scale, pixel-aligned and high-resolution (1280 x 720) Event-RGB dataset for object detection under challenging conditions. PEOD contains 130+ spatiotemporal-aligned sequences and 340k manual bounding boxes, with 57% of data captured under low-light, overexposure, and high-speed motion. Furthermore, we benchmark 14 methods across three input configurations (Event-based, RGB-based, and Event-RGB fusion) on PEOD. On the full test set and normal subset, fusion-based models achieve the best performance. However, on the illumination challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, indicating the limits of existing fusion methods when the frame modality is severely degraded. PEOD establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research.
[54] Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation cs.CV | cs.LGPDF
Jun Sun, Xinxin Zhang, Simin Hong, Jian Zhu, Xiang Gao
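TL;DR: The paper proposes Boomda, a multimodal domain adaptation method that uses balanced multi-objective optimization to handle domain shifts that differ across modalities; it employs the information bottleneck and correlation alignment and performs well in experiments.
Details
Motivation: Multimodal learning faces a scarcity of annotated data, and domain adaptation remains less explored in multimodal settings, so a method is needed that balances the domain shifts of different modalities.
Result: Experiments show that Boomda outperforms competing methods, demonstrating its effectiveness.
Insight: Balanced multi-objective optimization with a simplified solution can effectively address the complex challenges of multimodal domain adaptation.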
Abstract: Multimodal learning, while contributing to numerous success stories across various fields, faces the challenge of prohibitively expensive manual annotation. To address the scarcity of annotated data, a popular solution is unsupervised domain adaptation, which has been extensively studied in unimodal settings yet remains less explored in multimodal settings. In this paper, we investigate heterogeneous multimodal domain adaptation, where the primary challenge is the varying domain shifts of different modalities from the source to the target domain. We first introduce the information bottleneck method to learn representations for each modality independently, and then match the source and target domains in the representation space with correlation alignment. To balance the domain alignment of all modalities, we formulate the problem as a multi-objective task, aiming for a Pareto optimal solution. By exploiting the properties specific to our model, the problem can be simplified to a quadratic programming problem. Further approximation yields a closed-form solution, leading to an efficient modality-balanced multimodal domain adaptation algorithm. The proposed method features Balanced multi-objective optimization for multimodal domain adaptation, termed Boomda. Extensive empirical results showcase the effectiveness of the proposed approach and demonstrate that Boomda outperforms the competing schemes. The code is available at: https://github.com/sunjunaimer/Boomda.git.
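Note: the correlation alignment step mentioned above matches second-order statistics of source and target features. A minimal sketch of the widely used CORAL loss, the squared Frobenius norm of the covariance difference scaled by 1/(4d^2), is given below. It shows the loss for one modality only and is not the authors' code; whether Boomda uses this exact scaling is an assumption, as are the function names.

```python
def covariance(features):
    """Sample covariance matrix of a list of d-dimensional feature vectors."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in features)
             / (n - 1)
             for j in range(d)]
            for i in range(d)]


def coral_loss(source, target):
    """CORAL loss: ||C_s - C_t||_F^2 / (4 d^2) for two feature batches."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    squared_frobenius = sum((cs[i][j] - ct[i][j]) ** 2
                            for i in range(d) for j in range(d))
    return squared_frobenius / (4 * d * d)


# Source features vary along the first axis, target features along the second,
# so their covariances differ and the loss is positive.
source = [[0.0, 0.0], [2.0, 0.0]]
target = [[0.0, 0.0], [0.0, 2.0]]
loss = coral_loss(source, target)
```

Because the loss depends only on covariances, shifting every target feature by a constant leaves it unchanged. In a multimodal setting there is one such loss per modality, which is what makes balancing them a multi-objective problem.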
[55] Non-Aligned Reference Image Quality Assessment for Novel View Synthesis cs.CVPDF
Abhijay Ghildyal, Rajesh Sureddi, Nabajeet Barman, Saman Zadtootaghaj, Alan Bovik
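TL;DR: The paper proposes a Non-Aligned Reference image quality assessment (NAR-IQA) framework tailored to novel view synthesis (NVS); the model is trained with a contrastive learning framework on synthetic distortions and significantly improves assessment performance.
Details
Motivation: Existing full-reference (FR-IQA) and no-reference (NR-IQA) methods cannot handle misaligned reference images well in NVS, so a new assessment method is needed.
Result: The model outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods and correlates strongly with subjective ratings.
Insight: Training on synthetic data with a contrastive learning framework effectively improves assessment with non-aligned references, opening a new direction for NVS quality assessment.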
Abstract: Evaluating the perceptual quality of Novel View Synthesis (NVS) images remains a key challenge, particularly in the absence of pixel-aligned ground truth references. Full-Reference Image Quality Assessment (FR-IQA) methods fail under misalignment, while No-Reference (NR-IQA) methods struggle with generalization. In this work, we introduce a Non-Aligned Reference (NAR-IQA) framework tailored for NVS, where it is assumed that the reference view shares partial scene content but lacks pixel-level alignment. We constructed a large-scale image dataset containing synthetic distortions targeting Temporal Regions of Interest (TROI) to train our NAR-IQA model. Our model is built on a contrastive learning framework that incorporates LoRA-enhanced DINOv2 embeddings and is guided by supervision from existing IQA methods. We train exclusively on synthetically generated distortions, deliberately avoiding overfitting to specific real NVS samples and thereby enhancing the model’s generalization capability. Our model outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods, achieving robust performance on both aligned and non-aligned references. We also conducted a novel user study to gather data on human preferences when viewing non-aligned references in NVS. We find strong correlation between our proposed quality prediction model and the collected subjective ratings. For dataset and code, please visit our project page: https://stootaghaj.github.io/nova-project/
[56] LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping cs.CVPDF
Chenying Liu, Wei Huang, Xiao Xiang Zhu
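TL;DR: LandSegmenter is a flexible foundation model framework for land use and land cover (LULC) mapping that addresses scarce labeled data and cross-modal adaptability through a multi-modal dataset and an adapter design.
Details
Motivation: Current LULC models are typically designed for a specific modality and a fixed class taxonomy, which limits their generalization and applicability. Foundation models are promising but require fine-tuning or large amounts of labeled data, which are costly to obtain in remote sensing.
Result: On six LULC datasets, LandSegmenter performs strongly in transfer learning and zero-shot tasks, and it significantly outperforms baseline methods in zero-shot settings.
Insight: Weak supervision and a multi-modal adapter design are an efficient route to task-specific foundation models for LULC and significantly reduce the dependence on labeled data.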
Abstract: Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation (EO). However, current LULC models are typically developed for a specific modality and a fixed class taxonomy, limiting their generalizability and broader applicability. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. Yet, task-agnostic FMs often require fine-tuning for downstream applications, whereas task-specific FMs rely on massive amounts of labeled data for training, which is costly and impractical in the remote sensing (RS) domain. To address these challenges, we propose LandSegmenter, an LULC FM framework that resolves three-stage challenges at the input, model, and output levels. From the input side, to alleviate the heavy demand on labeled data for FM training, we introduce LAnd Segment (LAS), a large-scale, multi-modal, multi-source dataset built primarily with globally sampled weak labels from existing LULC products. LAS provides a scalable, cost-effective alternative to manual annotation, enabling large-scale FM training across diverse LULC domains. For model architecture, LandSegmenter integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement. At the output stage, we introduce a class-wise confidence-guided fusion strategy to mitigate semantic omissions and further improve LandSegmenter’s zero-shot performance. We evaluate LandSegmenter on six precisely annotated LULC datasets spanning diverse modalities and class taxonomies. Extensive transfer learning and zero-shot experiments demonstrate that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets. These results highlight the efficacy of our proposed framework and the utility of weak supervision for building task-specific FMs.
[57] Multi-Granularity Mutual Refinement Network for Zero-Shot Learning cs.CVPDF
Ning Wang, Long Yu, Cong Hua, Guangming Zhu, Lin Mei
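TL;DR: This paper proposes a Multi-Granularity Mutual Refinement Network (Mg-MRN) that improves the discriminability and transferability of visual features in zero-shot learning by learning decoupled multi-granularity features and cross-granularity feature interactions.
Details
Motivation: Existing zero-shot learning methods often overlook the intrinsic interactions between local region features, although these interactions can further improve the transferability and discriminability of visual features.
Result: Experiments on three popular zero-shot learning benchmark datasets show that Mg-MRN is superior to or competitive with existing methods.
Insight: Cross-granularity feature interaction significantly strengthens feature representations and thereby yields better zero-shot learning performance.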
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes with zero samples by transferring semantic knowledge from seen classes. Current approaches typically correlate global visual features with semantic information (i.e., attributes) or align local visual region features with corresponding attributes to enhance visual-semantic interactions. Although effective, these methods often overlook the intrinsic interactions between local region features, which can further improve the acquisition of transferable and explicit visual features. In this paper, we propose a network named Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features by learning decoupled multi-granularity features and cross-granularity feature interactions. Specifically, we design a multi-granularity feature extraction module to learn region-level discriminative features through decoupled region feature mining. Then, a cross-granularity feature fusion module strengthens the inherent interactions between region features of varying granularities. This module enhances the discriminability of representations at each granularity level by integrating region representations from adjacent hierarchies, further improving ZSL recognition performance. Extensive experiments on three popular ZSL benchmark datasets demonstrate the superiority and competitiveness of our proposed Mg-MRN method. Our code is available at https://github.com/NingWang2049/Mg-MRN.
[58] KPLM-STA: Physically-Accurate Shadow Synthesis for Human Relighting via Keypoint-Based Light Modeling cs.CVPDF
Xinhui Yin, Qifei Li, Yilin Guo, Hongxia Xie, Xiaoli Zhang
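TL;DR: The paper proposes a new framework based on a Keypoints Linear Model (KPLM) and a Shadow Triangle Algorithm (STA) for synthesizing highly realistic and geometrically accurate shadows, addressing the difficulty of shadow generation in image composition.
Details
Motivation: Existing methods such as IC-Light struggle to achieve both appearance realism and geometric precision when generating shadows in composite images, and they perform poorly under complex human poses.
Result: Experiments show that the method reaches state-of-the-art performance on shadow realism benchmarks, with particularly strong results under complex human poses and in multi-directional relighting scenarios.
Insight: Explicit geometric modeling and physical constraints can effectively improve the realism and geometric accuracy of shadow synthesis and suit complex dynamic scenes.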
Abstract: Image composition aims to seamlessly integrate a foreground object into a background, where generating realistic and geometrically accurate shadows remains a persistent challenge. While recent diffusion-based methods have outperformed GAN-based approaches, existing techniques, such as the diffusion-based relighting framework IC-Light, still fall short in producing shadows with both high appearance realism and geometric precision, especially in composite images. To address these limitations, we propose a novel shadow generation framework based on a Keypoints Linear Model (KPLM) and a Shadow Triangle Algorithm (STA). KPLM models articulated human bodies using nine keypoints and one bounding block, enabling physically plausible shadow projection and dynamic shading across joints, thereby enhancing visual realism. STA further improves geometric accuracy by computing shadow angles, lengths, and spatial positions through explicit geometric formulations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on shadow realism benchmarks, particularly under complex human poses, and generalizes effectively to multi-directional relighting scenarios such as those supported by IC-Light.
[59] Distributed Zero-Shot Learning for Visual Recognition cs.CVPDF
Zhi Chen, Yadan Luo, Zi Huang, Jingjing Li, Sen Wang
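TL;DR: This paper proposes DistZSL, a distributed zero-shot learning framework that addresses data heterogeneity through a cross-node attribute regularizer and a global attribute-to-visual consensus, improving recognition of unseen classes.
Details
Motivation: Data heterogeneity in distributed environments can degrade zero-shot learning performance, so a method is needed to coordinate learning across nodes.
Result: Experiments show that DistZSL outperforms existing methods at zero-shot learning on distributed data.
Insight: The key to distributed zero-shot learning is reconciling heterogeneity across nodes; a unified attribute feature space and a consistent bilateral mapping are effective means of improving performance.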
Abstract: In this paper, we propose a Distributed Zero-Shot Learning (DistZSL) framework that can fully exploit decentralized data to learn an effective model for unseen classes. Considering the data heterogeneity issues across distributed nodes, we introduce two key components to ensure the effective learning of DistZSL: a cross-node attribute regularizer and a global attribute-to-visual consensus. Our proposed cross-node attribute regularizer enforces the distances between attribute features to be similar across different nodes. In this manner, the overall attribute feature space would be stable during learning, and thus facilitate the establishment of visual-to-attribute (V2A) relationships. Then, we introduce the global attribute-to-visual consensus to mitigate biased V2A mappings learned from individual nodes. Specifically, we enforce the bilateral mapping between the attribute and visual feature distributions to be consistent across different nodes. Thus, the learned consistent V2A mapping can significantly enhance zero-shot learning across different nodes. Extensive experiments demonstrate that DistZSL achieves superior performance to the state-of-the-art in learning from distributed data.
[60] VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion cs.CVPDF
Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada
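TL;DR: VLMDiff is a new unsupervised multi-class visual anomaly detection framework that combines a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM). Image descriptions extracted by the VLM serve as conditioning for LDM training, enabling effective anomaly localization and detection.
Details
Motivation: Conventional diffusion-based methods rely on synthetic noise generation, which limits generalization, and they require a separate model per class, which hinders scalability. VLMDiff addresses these problems by using a VLM to obtain descriptions of normal images automatically, avoiding manual annotation and additional training.
Result: On the Real-IAD and COCO-AD datasets, VLMDiff clearly outperforms existing diffusion-based methods, improving the PRO metric by up to 25 and 8 points, respectively.
Insight: The semantic information provided by a VLM can significantly strengthen diffusion models in multi-class anomaly detection while reducing the dependence on manual annotation, which makes the approach more practical.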
Abstract: Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.
[61] WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting cs.CVPDF
Kaitao Huang, Yan Yan, Jing-Hao Xue, Hanzi Wang
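TL;DR: WarpGAN proposes a warping-and-inpainting strategy that incorporates image inpainting into 3D GAN inversion, using symmetry and multi-view consistency to improve the quality of generated occluded regions.
Details
Motivation: Existing 3D GAN inversion methods focus on reconstructing visible regions, while occluded regions are generated from the 3D GAN's prior alone, which results in poor quality.
Result: Quantitative and qualitative experiments show that WarpGAN outperforms existing methods in occluded-region quality and multi-view consistency.
Insight: Incorporating image inpainting into 3D GAN inversion can significantly improve the quality of occluded regions; symmetry and multi-view consistency are the key factors.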
Abstract: 3D GAN inversion projects a single image into the latent space of a pre-trained 3D GAN to achieve single-shot novel view synthesis, which requires visible regions with high fidelity and occluded regions with realism and multi-view consistency. However, existing methods focus on the reconstruction of visible regions, while the generation of occluded regions relies only on the generative prior of 3D GAN. As a result, the generated occluded regions often exhibit poor quality due to the information loss caused by the low bit-rate latent code. To address this, we introduce the warping-and-inpainting strategy to incorporate image inpainting into 3D GAN inversion and propose a novel 3D GAN inversion method, WarpGAN. Specifically, we first employ a 3D GAN inversion encoder to project the single-view image into a latent code that serves as the input to 3D GAN. Then, we perform warping to a novel view using the depth map generated by 3D GAN. Finally, we develop a novel SVINet, which leverages the symmetry prior and multi-view image correspondence w.r.t. the same latent code to perform inpainting of occluded regions in the warped image. Quantitative and qualitative experiments demonstrate that our method consistently outperforms several state-of-the-art methods.
[62] UI2Code$^\text{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation cs.CVPDF
Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang
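TL;DR: UI2Code$^\text{N}$ proposes an interactive UI-to-code generation paradigm. Staged training (pretraining, fine-tuning, and reinforcement learning) improves the visual language model's multimodal coding ability, and multi-turn feedback further improves results.
Details
Motivation: UI programming is complex, and visual language models face two limitations in automatic UI coding: multimodal coding capabilities are underdeveloped, and single-turn paradigms fail to exploit iterative visual feedback.
Result: On UI-to-code and UI polishing benchmarks, UI2Code$^\text{N}$ sets a new state of the art among open-source models, with performance close to Claude-4-Sonnet and GPT-5.
Insight: Interactive multi-turn feedback significantly improves model performance, and unifying multimodal capabilities is the key breakthrough.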
Abstract: User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.
[63] UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets cs.CVPDF
Arnav Aditya, Nitin Kumar, Saurabh Shigwan
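TL;DR: This study proposes UCDSC, an open-set uncertainty-aware deep simplex classifier for medical image datasets. Its loss function is designed to reject samples from unknown classes effectively, and it achieves significant performance gains on several medical datasets.
Details
Motivation: The medical domain faces data scarcity and high annotation costs, especially for emerging or rare diseases. Open-set recognition is crucial in this setting, but existing methods still fall short when handling unknown classes.
Result: The method outperforms the state of the art on four MedMNIST datasets and one skin dataset.
Insight: In medical image classification, combining open-set recognition with uncertainty awareness handles unknown samples effectively and improves the robustness and practicality of the model.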
Abstract: Driven by advancements in deep learning, computer-aided diagnoses have made remarkable progress. However, outside controlled laboratory settings, algorithms may encounter several challenges. In the medical domain, these difficulties often stem from limited data availability due to ethical and legal restrictions, as well as the high cost and time required for expert annotations, especially in the face of emerging or rare diseases. In this context, open-set recognition plays a vital role by identifying whether a sample belongs to one of the known classes seen during training or should be rejected as an unknown. Recent studies have shown that features learned in the later stages of deep neural networks cluster around their class means, which themselves are arranged as individual vertices of a regular simplex [32]. The proposed method introduces a loss function designed to reject samples of unknown classes effectively by penalizing open space regions using auxiliary datasets. This approach achieves significant performance gains across four MedMNIST datasets (BloodMNIST, OCTMNIST, DermaMNIST, and TissueMNIST) and a publicly available skin dataset [29], outperforming state-of-the-art techniques.
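The simplex geometry mentioned in the abstract can be made concrete with a small sketch. It builds the unit-norm vertices of a regular simplex for k classes and rejects a feature whose nearest vertex is too far away. The distance threshold `max_dist` and both function names are hypothetical stand-ins for illustration; the paper's loss and its use of auxiliary datasets are not reproduced here.

```python
import math


def simplex_vertices(k):
    """Unit-norm vertices of a regular simplex for k classes, embedded in R^k."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def classify_or_reject(feature, vertices, max_dist):
    """Return the index of the nearest vertex, or None if it is farther than max_dist."""
    dists = [math.dist(feature, v) for v in vertices]
    best = min(range(len(vertices)), key=dists.__getitem__)
    return best if dists[best] <= max_dist else None
```

Every pair of vertices built this way has the same inner product, -1/(k-1), so the class centers are maximally and equally separated; a feature near the origin is far from all of them and is rejected in this sketch.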
[64] Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone cs.CV | cs.LGPDF
Rizal Khoirul Anam
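TL;DR: This paper evaluates a food image recognition and nutrition/recipe generation system that combines an EfficientNet-B4 visual backbone with the Gemini LLM, analyzes how visual classification accuracy affects the downstream generated output, and introduces the concept of "Semantic Error Propagation".
Details
Motivation: With the spread of digital food applications, efficient automated methods for nutritional analysis and culinary guidance are needed. The paper explores the trade-offs in combining visual classification with generative models.
Result: EfficientNet-B4 gives the best balance of accuracy and efficiency (89.0% Top-1 accuracy), and the Gemini LLM gives the best generative quality (factual accuracy 9.2/10), but overall system performance is limited by the accuracy of the visual module.
Insight: Visual classification accuracy is critical to the downstream generation task; high semantic similarity between classes is the main failure mode.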
Abstract: The proliferation of digital food applications necessitates robust methods for automated nutritional analysis and culinary guidance. This paper presents a comprehensive comparative evaluation of a decoupled, multimodal pipeline for food recognition. We evaluate a system integrating a specialized visual backbone (EfficientNet-B4) with a powerful generative large language model (Google’s Gemini LLM). The core objective is to evaluate the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output (nutritional data and recipes). We benchmark this pipeline against alternative vision backbones (VGG-16, ResNet-50, YOLOv8) and a lightweight LLM (Gemma). We introduce a formalization for “Semantic Error Propagation” (SEP) to analyze how classification inaccuracies from the visual module cascade into the generative output. Our analysis is grounded in a new Custom Chinese Food Dataset (CCFD) developed to address cultural bias in public datasets. Experimental results demonstrate that while EfficientNet-B4 (89.0% Top-1 Acc.) provides the best balance of accuracy and efficiency, and Gemini (9.2/10 Factual Accuracy) provides superior generative quality, the system’s overall utility is fundamentally bottlenecked by the visual front-end’s perceptive accuracy. We conduct a detailed per-class analysis, identifying high semantic similarity as the most critical failure mode.
[65] 2D Representation for Unguided Single-View 3D Super-Resolution in Real-Time cs.CV | cs.AIPDF
Ignasi Mas, Ivan Huerta, Ramon Morros, Javier Ruiz-Hidalgo
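TL;DR: 2Dto3D-SR is a real-time single-view 3D super-resolution framework that needs no high-resolution RGB guidance. It encodes 3D data as a structured 2D representation so that existing 2D image super-resolution architectures can be applied directly.
Details
Motivation: Existing 3D super-resolution methods usually depend on high-resolution RGB guidance or complex 3D point cloud processing, which limits practical use. 2Dto3D-SR simplifies the process with a 2D representation, improving efficiency and adaptability.
Result: The Swin Transformer variant reaches state-of-the-art accuracy on standard benchmarks, while the Vision Mamba variant runs at real-time speeds.
Insight: A 2D representation can effectively simplify 3D super-resolution and offers a new approach to efficient deployment in real-world scenarios.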
Abstract: We introduce 2Dto3D-SR, a versatile framework for real-time single-view 3D super-resolution that eliminates the need for high-resolution RGB guidance. Our framework encodes 3D data from a single viewpoint into a structured 2D representation, enabling the direct application of existing 2D image super-resolution architectures. We utilize the Projected Normalized Coordinate Code (PNCC) to represent 3D geometry from a visible surface as a regular image, thereby circumventing the complexities of 3D point-based or RGB-guided methods. This design supports lightweight and fast models adaptable to various deployment environments. We evaluate 2Dto3D-SR with two implementations: one using Swin Transformers for high accuracy, and another using Vision Mamba for high efficiency. Experiments show the Swin Transformer model achieves state-of-the-art accuracy on standard benchmarks, while the Vision Mamba model delivers competitive results at real-time speeds. This establishes our geometry-guided pipeline as a surprisingly simple yet viable and practical solution for real-world scenarios, especially where high-resolution RGB data is inaccessible.
[66] Accurate and Efficient Surface Reconstruction from Point Clouds via Geometry-Aware Local Adaptation cs.CVPDF
Eito Ogawa, Taiga Hayami, Hiroshi Watanabe
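TL;DR: The paper proposes a method that adaptively adjusts the size and spacing of local regions based on the curvature of the input point cloud to improve the accuracy and efficiency of surface reconstruction.
Details
Motivation: Current point cloud surface reconstruction methods adapt poorly to variations in geometric complexity, which limits reconstruction accuracy and efficiency.
Result: Compared with conventional methods, the method significantly improves both reconstruction accuracy and efficiency.
Insight: A localization strategy that is aware of geometric complexity is key to improving point cloud reconstruction performance.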
Abstract: Point cloud surface reconstruction has improved in accuracy with advances in deep learning, enabling applications such as infrastructure inspection. Recent approaches that reconstruct from small local regions rather than entire point clouds have attracted attention for their strong generalization capability. However, prior work typically places local regions uniformly and keeps their size fixed, limiting adaptability to variations in geometric complexity. In this study, we propose a method that improves reconstruction accuracy and efficiency by adaptively modulating the spacing and size of local regions based on the curvature of the input point cloud.
[67] Remodeling Semantic Relationships in Vision-Language Fine-Tuning cs.CV | cs.AIPDF
Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi
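TL;DR: This paper proposes an improved vision-language fine-tuning method that models semantic relationships to improve multimodal alignment and fusion.
Details
Motivation: Existing vision-language fine-tuning methods usually ignore the semantic relationship information in the textual context, which leads to suboptimal performance.
Result: Evaluation on eight foundation models and two downstream tasks shows that the method outperforms existing methods.
Insight: Modeling semantic relationships is essential for multimodal alignment and fusion and can effectively improve downstream task performance.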
Abstract: Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that can improve multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different vision encoders to capture more visual cues of the relationships. Then, we learn to project the vision features so as to group related semantics, which are more likely to have relationships. Finally, we fuse the visual features with the textual features using inheritable cross-attention, where we globally remove the redundant visual relationships by discarding visual-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
[68] Hierarchical Direction Perception via Atomic Dot-Product Operators for Rotation-Invariant Point Clouds Learning cs.CV | cs.AIPDF
Chenyu Hu, Xiaotong Li, Hao Zhu, Biao Hou
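TL;DR: DiPVNet achieves hierarchical direction perception through atomic dot-product operators, addressing the challenge of rotation invariance in point cloud learning, and it performs well under noise and large-angle rotations.
Details
Motivation: Point clouds are essential to 3D vision tasks, but rotations disrupt directional information, and existing methods do not fully exploit multiscale directionality to enhance feature representations.
Result: In scenarios with noise and large-angle rotations, DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation.
Insight: Through hierarchical direction perception backed by rigorous proofs, DiPVNet effectively combines rotational symmetry with directional perception.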
Abstract: Point cloud processing has become a cornerstone technology in many 3D vision tasks. However, arbitrary rotations introduce variations in point cloud orientations, posing a long-standing challenge for effective representation learning. The core of this issue is the disruption of the point cloud’s intrinsic directional characteristics caused by rotational perturbations. Recent methods attempt to implicitly model rotational equivariance and invariance, preserving directional information and propagating it into deep semantic spaces. Yet, they often fall short of fully exploiting the multiscale directional nature of point clouds to enhance feature representations. To address this, we propose the Direction-Perceptive Vector Network (DiPVNet). At its core is an atomic dot-product operator that simultaneously encodes directional selectivity and rotation invariance–endowing the network with both rotational symmetry modeling and adaptive directional perception. At the local level, we introduce a Learnable Local Dot-Product (L2DP) Operator, which enables interactions between a center point and its neighbors to adaptively capture the non-uniform local structures of point clouds. At the global level, we leverage generalized harmonic analysis to prove that the dot-product between point clouds and spherical sampling vectors is equivalent to a direction-aware spherical Fourier transform (DASFT). This leads to the construction of a global directional response spectrum for modeling holistic directional structures. We rigorously prove the rotation invariance of both operators. Extensive experiments on challenging scenarios involving noise and large-angle rotations demonstrate that DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation tasks. Our code is available at https://github.com/wxszreal0/DiPVNet.
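The property that makes dot products attractive here can be checked numerically. The sketch below, with hypothetical function names, computes dot products between center-to-neighbor offset vectors and shows that they do not change when the whole neighborhood is rotated. It is a plain, unweighted illustration of that invariance; the learnable weighting of the L2DP operator and the spherical Fourier construction are not modeled.

```python
import math


def rotate(p, yaw, pitch):
    """Rotate a 3D point about the z-axis by yaw, then about the x-axis by pitch."""
    x, y, z = p
    x, y = (x * math.cos(yaw) - y * math.sin(yaw),
            x * math.sin(yaw) + y * math.cos(yaw))
    y, z = (y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))
    return (x, y, z)


def local_dot_products(center, neighbors):
    """Dot products between all pairs of center-to-neighbor offset vectors."""
    offsets = [tuple(n - c for n, c in zip(nb, center)) for nb in neighbors]
    return [sum(a * b for a, b in zip(u, v)) for u in offsets for v in offsets]
```

Because a rotation R is linear and orthogonal, R(n) - R(c) = R(n - c) and (Ru)·(Rv) = u·v, so the returned values are identical, up to floating-point error, before and after rotating the center and its neighbors together.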
[69] NERVE: Neighbourhood & Entropy-guided Random-walk for training free open-Vocabulary sEgmentation cs.CV | cs.AIPDF
Kunal Mahatha, Jose Dolz, Christian Desrosiers
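TL;DR: NERVE proposes a training-free open-vocabulary semantic segmentation method that combines global and local information, using the attention of a diffusion model and a random walk to refine segmentation while avoiding conventional post-processing steps.
Details
Motivation: Existing training-free open-vocabulary semantic segmentation methods rely on computationally expensive affinity refinement, weight attention maps equally, or depend on fixed-size Gaussian kernels, which makes it hard to handle objects with complex shapes.
Result: The method achieves state-of-the-art overall zero-shot segmentation performance across 7 benchmark datasets.
Insight: A random walk and entropy-guided map selection handle complex shapes effectively without additional training or post-processing.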
Abstract: Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: use of computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps due to equal weighting or reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, enforcing isotropic neighborhoods. We propose a strong baseline for training-free OVSS termed as NERVE (Neighbourhood & Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information, exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. We also introduce a stochastic random walk for refining the affinity rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing techniques like Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments are performed on 7 popular semantic segmentation benchmarks, yielding an overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
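Two ideas from the abstract, entropy-based selection of attention maps and random-walk refinement over an affinity matrix, can be sketched in a few lines. The version below is a deterministic, expected-value form of a random walk with restart on a tiny affinity matrix; the function names, the restart weight `alpha`, and the step count are assumptions for illustration and are not taken from NERVE.

```python
import math


def entropy(attn_map):
    """Shannon entropy of a non-negative attention map, normalized to sum to 1."""
    total = sum(attn_map)
    return -sum((v / total) * math.log(v / total) for v in attn_map if v > 0)


def select_low_entropy(maps, k):
    """Keep the k most confident (lowest-entropy) attention maps."""
    return sorted(maps, key=entropy)[:k]


def random_walk_refine(affinity, scores, steps=3, alpha=0.5):
    """Propagate per-pixel scores along a row-normalized affinity matrix.

    Assumes every row of `affinity` has a positive sum.
    """
    transition = [[v / sum(row) for v in row] for row in affinity]
    x = list(scores)
    for _ in range(steps):
        propagated = [sum(p * xj for p, xj in zip(row, x)) for row in transition]
        # Blend the propagated scores with the initial scores (restart term).
        x = [alpha * p + (1 - alpha) * s for p, s in zip(propagated, scores)]
    return x
```

Because scores only flow along non-zero affinities, they spread within a connected region but not across disconnected ones, which is the behavior that lets this kind of propagation follow arbitrary object shapes rather than a fixed isotropic kernel.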
[70] LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning cs.CVPDF
Fengyi Fu, Mengqi Huang, Lei Zhang, Zhendong Mao
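TL;DR: LayerEdit proposes a new multi-layer disentangled editing framework that achieves conflict-free multi-object editing through layered decomposition, editing, and fusion, resolving the attention entanglement found in existing methods.
Details
Motivation: Existing multi-object image editing methods neglect inter-object interactions, which leads to editing leakage or editing constraints. LayerEdit aims for more precise multi-object editing through conflict-aware layered processing.
Result: Experiments show that LayerEdit outperforms existing methods in multi-object scenarios, with better controllability and coherence.
Insight: Attention entanglement between objects is the key challenge in multi-object editing, and layered processing can effectively disentangle the conflicts.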
Abstract: Text-driven multi-object image editing, which aims to precisely modify multiple objects within an image based on text descriptions, has recently attracted considerable interest. Existing works primarily follow the localize-editing paradigm, focusing on independent object localization and editing while neglecting critical inter-object interactions. However, this work points out that the neglected attention entanglements in inter-object conflict regions inherently hinder disentangled multi-object editing, leading to either inter-object editing leakage or intra-object editing constraints. We thereby propose a novel multi-layer disentangled editing framework, LayerEdit, a training-free method which, for the first time, through precise object-layered decomposition and coherent fusion, enables conflict-free object-layered editing. Specifically, LayerEdit introduces a novel “decompose-editing-fusion” framework, consisting of: (1) a Conflict-aware Layer Decomposition module, which utilizes an attention-aware IoU scheme and time-dependent region removing to enhance conflict awareness and suppression for layer decomposition; (2) an Object-layered Editing module, which establishes coordinated intra-layer text guidance and cross-layer geometric mapping, achieving disentangled semantic and structural modifications; and (3) a Transparency-guided Layer Fusion module, which facilitates structure-coherent inter-object layer fusion through precise transparency guidance learning. Extensive experiments verify the superiority of LayerEdit over existing methods, showing unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios. Codes are available at: https://github.com/fufy1024/LayerEdit.
[71] ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation cs.CV | cs.AIPDF
Yue Min, Shaobo Wang, Jiaze Li, Tianle Niu, Junxin Fan
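TL;DR: ImageBindDC proposes a new data condensation framework based on ImageBind's unified feature space, using a characteristic function loss to condense multimodal data for efficient training.
Details
Motivation: Conventional data condensation techniques work well in unimodal settings but struggle to preserve the complex inter-modal dependencies in multimodal settings, so a new method is needed.
Result: On the NYU-v2 dataset, a model trained on only 5 condensed datapoints per class performs comparably to one trained on the full dataset, improves on the previous best method by 8.2% (absolute), and needs more than 4 times less condensation time.
Insight: In multimodal data condensation, a characteristic function loss in the Fourier domain captures inter-modal dependencies more precisely, enabling efficient condensation of training data.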
Abstract: Data condensation techniques aim to synthesize a compact dataset from a larger one to enable efficient model training, yet while successful in unimodal settings, they often fail in multimodal scenarios where preserving intricate inter-modal dependencies is crucial. To address this, we introduce ImageBindDC, a novel data condensation framework operating within the unified feature space of ImageBind. Our approach moves beyond conventional distribution-matching by employing a powerful Characteristic Function (CF) loss, which operates in the Fourier domain to facilitate a more precise statistical alignment via exact infinite moment matching. We design our objective to enforce three critical levels of distributional consistency: (i) uni-modal alignment, which matches the statistical properties of synthetic and real data within each modality; (ii) cross-modal alignment, which preserves pairwise semantics by matching the distributions of hybrid real-synthetic data pairs; and (iii) joint-modal alignment, which captures the complete multivariate data structure by aligning the joint distribution of real data pairs with their synthetic counterparts. Extensive experiments highlight the effectiveness of ImageBindDC: on the NYU-v2 dataset, a model trained on just 5 condensed datapoints per class achieves lossless performance comparable to one trained on the full dataset, achieving a new state-of-the-art with an 8.2% absolute improvement over the previous best method and more than 4$\times$ less condensation time.
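A characteristic function loss of the kind the abstract describes can be sketched as follows. The empirical characteristic function of a sample set at frequency t is the mean of exp(i t·x); the loss below averages the squared modulus of the difference between the real and synthetic characteristic functions over a set of frequencies. The function names and the choice of frequencies are assumptions for illustration, and this is not the ImageBindDC implementation.

```python
import cmath


def empirical_cf(samples, t):
    """Empirical characteristic function: mean of exp(i * <t, x>) over the samples."""
    return sum(cmath.exp(1j * sum(ti * xi for ti, xi in zip(t, x)))
               for x in samples) / len(samples)


def cf_loss(real, synthetic, freqs):
    """Mean squared modulus of the CF difference over the given frequencies."""
    return sum(abs(empirical_cf(real, t) - empirical_cf(synthetic, t)) ** 2
               for t in freqs) / len(freqs)
```

Since a distribution is fully determined by its characteristic function, driving this discrepancy to zero over a sufficiently rich set of frequencies matches the two distributions rather than only a few of their moments. In practice the frequencies would typically be sampled at random, which is an assumption of this sketch.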
[72] Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation cs.CVPDF
Nan Bao, Yifan Zhao, Lin Zhu, Jia Li
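TL;DR: This paper proposes an Edge-awareness Semantic Concordance framework that unifies heterogeneous event-RGB features through edge cues, addressing semantic segmentation under extreme conditions and outperforming existing methods.
Details
Motivation: Existing semantic segmentation methods perform poorly under extreme conditions (such as insufficient light or fierce camera motion) because of severe RGB information loss. The event modality can complement RGB, but the heterogeneous features lead to poor fusion.
Result: mIoU improves by 2.55% on the proposed DERS-XS dataset, and the method is robust under spatial occlusion.
Insight: Edge features are key to unifying heterogeneous modalities, and re-coding effectively resolves feature mismatch.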
Abstract: Semantic segmentation has achieved great success in ideal conditions. However, when facing extreme conditions (e.g., insufficient light, fierce camera motion), most existing methods suffer from significant information loss of RGB, severely damaging segmentation results. Several studies exploit the high-speed and high-dynamic event modality as a complement, but event and RGB are naturally heterogeneous, which leads to feature-level mismatch and inferior optimization of existing multi-modality methods. Unlike these studies, we delve into the edge secret of both modalities for resilient fusion and propose a novel Edge-awareness Semantic Concordance framework to unify the multi-modality heterogeneous features with latent edge cues. In this framework, we first propose Edge-awareness Latent Re-coding, which obtains uncertainty indicators while realigning event-RGB features into a unified semantic space guided by the re-coded distribution, and transfers event-RGB distributions into re-coded features by utilizing a pre-established edge dictionary as clues. We then propose Re-coded Consolidation and Uncertainty Optimization, which utilize re-coded edge features and uncertainty indicators to solve the heterogeneous event-RGB fusion issues under extreme conditions. We establish two synthetic and one real-world event-RGB semantic segmentation datasets for extreme scenario comparisons. Experimental results show that our method outperforms the state-of-the-art by 2.55% mIoU on our proposed DERS-XS, and possesses superior resilience under spatial occlusion. Our code and datasets are publicly available at https://github.com/iCVTEAM/ESC.
[73] SWAN - Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces cs.CV | cs.SEPDF
Sweta Banerjee, Timo Gosch, Sara Hester, Viktoria Weiss, Thomas Conrad
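TL;DR: SWAN is a fast, mobile-friendly histopathology image annotation tool built on a swipeable interface. Intuitive gestures speed up annotation while preserving annotation quality, making it suitable for large-scale annotation tasks.
Details
Motivation: Conventional folder-based annotation workflows are slow, fatiguing, and hard to scale. SWAN aims to solve these problems with a fast, intuitive annotation tool.
Result: In the pilot study, SWAN's inter-annotator agreement (Cohen's Kappa = 0.61-0.80) was comparable to that of the folder-based method (Cohen's Kappa = 0.63-0.75), and users rated the tool highly.
Insight: Swipe interaction significantly improves annotation efficiency and user experience and suits large-scale histopathology image annotation.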
Abstract: The annotation of large scale histopathology image datasets remains a major bottleneck in developing robust deep learning models for clinically relevant tasks, such as mitotic figure classification. Folder-based annotation workflows are usually slow, fatiguing, and difficult to scale. To address these challenges, we introduce SWipeable ANnotations (SWAN), an open-source, MIT-licensed web application that enables intuitive image patch classification using a swiping gesture. SWAN supports both desktop and mobile platforms, offers real-time metadata capture, and allows flexible mapping of swipe gestures to class labels. In a pilot study with four pathologists annotating 600 mitotic figure image patches, we compared SWAN against a traditional folder-sorting workflow. SWAN enabled rapid annotations with pairwise percent agreement ranging from 86.52% to 93.68% (Cohen’s Kappa = 0.61-0.80), while for the folder-based method, the pairwise percent agreement ranged from 86.98% to 91.32% (Cohen’s Kappa = 0.63-0.75) for the task of classifying atypical versus normal mitotic figures, demonstrating high consistency between annotators and comparable performance. Participants rated the tool as highly usable and appreciated the ability to annotate on mobile devices. These results suggest that SWAN can accelerate image annotation while maintaining annotation quality, offering a scalable and user-friendly alternative to conventional workflows.
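The two agreement measures reported in the abstract are related but not interchangeable, which a short sketch makes visible. It computes pairwise percent agreement and Cohen's Kappa, which corrects the observed agreement for the agreement expected by chance given each annotator's label frequencies. The function names and the toy labels are illustrative assumptions, not data from the study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy example: 10 patches, "A" = atypical, "N" = normal.
rater_1 = ["A", "A", "A", "N", "N", "N", "N", "N", "N", "N"]
rater_2 = ["A", "A", "N", "N", "N", "N", "N", "N", "N", "A"]
```

For these toy labels the annotators agree on 8 of 10 patches, yet Kappa is only about 0.52, because with imbalanced classes a large share of the agreement is expected by chance. The same effect explains how percent agreement near 90% can coexist with Kappa values between 0.61 and 0.80.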
[74] MAUGIF: Mechanism-Aware Unsupervised General Image Fusion via Dual Cross-Image Autoencoders cs.CVPDF
Kunjing Yang, Zhiwei Wang, Minru Bai
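TL;DR: MAUGIF is a mechanism-aware, unsupervised general image fusion method built on dual cross-image autoencoders; by distinguishing additive from multiplicative fusion mechanisms, it improves both fusion performance and interpretability.
Details
Motivation: Existing image fusion methods are either highly task-specific or ignore the distinct fusion mechanisms of different tasks, which limits performance. MAUGIF aims to solve this with a unified, mechanism-aware framework.
Result: The method's effectiveness and generalization ability are validated on a variety of fusion tasks.
Insight: The underlying mechanisms of fusion tasks differ substantially, and adapting the fusion strategy to the mechanism can markedly improve performance and interpretability.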
Abstract: Image fusion aims to integrate structural and complementary information from multi-source images. However, existing fusion methods are often either highly task-specific, or general frameworks that apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms. To address this issue, we propose a mechanism-aware unsupervised general image fusion (MAUGIF) method based on dual cross-image autoencoders. Initially, we introduce a classification of additive and multiplicative fusion according to the inherent mechanisms of different fusion tasks. Then, dual encoders map source images into a shared latent space, capturing common content while isolating modality-specific details. During the decoding phase, dual decoders act as feature injectors, selectively reintegrating the unique characteristics of each modality into the shared content for reconstruction. The modality-specific features are injected into the source image in the fusion process, generating the fused image that integrates information from both modalities. The architecture of decoders varies according to their fusion mechanisms, enhancing both performance and interpretability. Extensive experiments are conducted on diverse fusion tasks to validate the effectiveness and generalization ability of our method. The code is available at https://anonymous.4open.science/r/MAUGIF.
[75] SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering cs.CVPDF
Laura Bragagnolo, Leonardo Barcellona, Stefano Ghidoni
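TL;DR: SkelSplat is a multi-view 3D human pose estimation method based on differentiable Gaussian rendering; it optimizes pose through a skeleton of 3D Gaussians, needs no 3D annotations, and improves generalization and robustness to occlusion.
Details
Motivation: Existing multi-view 3D human pose estimation methods are trained on large annotated datasets and generalize poorly, especially when the test scenario differs substantially from the training one. SkelSplat aims to overcome this limitation with differentiable rendering.
Result: On Human3.6M and CMU, it outperforms methods that do not rely on 3D ground truth and reduces cross-dataset error by up to 47.8%; it is also robust in occluded scenes.
Insight: 1. Differentiable Gaussian rendering offers a new optimization paradigm for multi-view 3D pose estimation; 2. Adaptations for the sparse setting, such as the one-hot encoding scheme, are key to the method's success.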
Abstract: Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth in Human3.6M and CMU, while reducing the cross-dataset error up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.
[76] NeuSpring: Neural Spring Fields for Reconstruction and Simulation of Deformable Objects from Videos cs.CVPDF
Qingshan Xu, Jiao Liu, Shangshu Yu, Yuxuan Wang, Yuan Zhou
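TL;DR: The paper presents NeuSpring, a neural spring field framework for reconstructing and simulating deformable objects from videos; a piecewise topology solution and a neural spring field markedly improve both current-state modeling and future prediction.
Details
Motivation: Most existing methods focus on physical learning for the current state and ignore the intrinsic physical properties of deformable objects, so they generalize poorly to future prediction.
Result: Experiments show that NeuSpring improves Chamfer distance by 20% for current-state modeling and by 25% for future prediction.
Insight: Modeling material heterogeneity and the spatial associativity of springs can markedly improve reconstruction and simulation of deformable objects.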
Abstract: In this paper, we aim to create physical digital twins of deformable objects under interaction. Existing methods focus on physical learning for current-state modeling but generalize poorly to future prediction. This is because they ignore the intrinsic physical properties of deformable objects, which limits physical learning even in current-state modeling. To address this, we present NeuSpring, a neural spring field for the reconstruction and simulation of deformable objects from videos. Built upon spring-mass models for realistic physical simulation, our method consists of two major innovations: 1) a piecewise topology solution that efficiently models multi-region spring connection topologies using zero-order optimization, which accounts for the material heterogeneity of real-world objects; and 2) a neural spring field that represents spring physical properties across different frames using a canonical coordinate-based neural network, which effectively leverages the spatial associativity of springs for physical learning. Experiments on real-world datasets demonstrate that NeuSpring achieves superior reconstruction and simulation performance for current-state modeling and future prediction, with Chamfer distance improved by 20% and 25%, respectively.
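For readers unfamiliar with the metric quoted above, the following is a minimal sketch of a Chamfer distance between two point sets. It assumes one common convention (the sum of the two directional means of nearest-neighbour Euclidean distances); the paper may use a squared or differently normalized variant, and the function name is hypothetical.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbour distance from A to B
    plus mean nearest-neighbour distance from B to A.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # For every source point, take the distance to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Hypothetical reconstructed and reference point clouds.
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
print(chamfer_distance(reconstructed, reference))
```

This brute-force version is quadratic in the number of points, which is fine for illustration but too slow for dense clouds.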
[77] The Impact of Longitudinal Mammogram Alignment on Breast Cancer Risk Assessment cs.CVPDF
Solveig Thrun, Stine Hansen, Zijun Sun, Nele Blum, Suaiba A. Salahuddin
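TL;DR: The study compares longitudinal mammogram alignment approaches (image-based registration, feature-level alignment, and implicit alignment) for their effect on breast cancer risk assessment and finds that image-based registration performs best across all metrics.
Details
Motivation: Longitudinal alignment of mammograms is critical for cancer risk assessment, but inaccurate alignment in existing methods degrades model performance. The study aims to compare the effectiveness of different alignment approaches.
Result: Image-based registration performs best on all metrics, especially when its deformation fields are applied within the feature space, improving prediction accuracy and deformation field quality. Regularized feature-level alignment reduces prediction performance.
Insight: Image-based deformable alignment has a clear advantage in longitudinal risk assessment and provides a more reliable basis for personalized screening. The open-source code should help advance related research.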
Abstract: Regular mammography screening is crucial for early breast cancer detection. By leveraging deep learning-based risk models, screening intervals can be personalized, especially for high-risk individuals. While recent methods increasingly incorporate longitudinal information from prior mammograms, accurate spatial alignment across time points remains a key challenge. Misalignment can obscure meaningful tissue changes and degrade model performance. In this study, we provide insights into various alignment strategies, image-based registration, feature-level (representation space) alignment with and without regularization, and implicit alignment methods, for their effectiveness in longitudinal deep learning-based risk modeling. Using two large-scale mammography datasets, we assess each method across key metrics, including predictive accuracy, precision, recall, and deformation field quality. Our results show that image-based registration consistently outperforms the more recently favored feature-based and implicit approaches across all metrics, enabling more accurate, temporally consistent predictions and generating smooth, anatomically plausible deformation fields. Although regularizing the deformation field improves deformation quality, it reduces the risk prediction performance of feature-level alignment. Applying image-based deformation fields within the feature space yields the best risk prediction performance. These findings underscore the importance of image-based deformation fields for spatial alignment in longitudinal risk modeling, offering improved prediction accuracy and robustness. This approach has strong potential to enhance personalized screening and enable earlier interventions for high-risk individuals. The code is available at https://github.com/sot176/Mammogram_Alignment_Study_Risk_Prediction.git, allowing full reproducibility of the results.
[78] Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter cs.CVPDF
Zhiyang Chen, Chen Zhang, Hao Fang, Runmin Cong
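TL;DR: The paper proposes DiveSeg, a framework that strengthens DINO for underwater instance segmentation through an AquaStyle Aligner and an ObjectPrior Prompter, reaching state-of-the-art results on the UIIS and USIS10K datasets.
Details
Motivation: Underwater instance segmentation (UIS) matters for marine resource exploration and ecological protection, but existing methods struggle to fully exploit large-scale pretrained visual foundation models.
Result: DiveSeg achieves state-of-the-art performance on the UIIS and USIS10K datasets.
Insight: Combining style adaptation with object-prior prompting is an effective strategy for improving underwater instance segmentation.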
Abstract: Underwater instance segmentation (UIS), integrating pixel-level understanding and instance-level discrimination, is a pivotal technology in marine resource exploration and ecological protection. In recent years, large-scale pretrained visual foundation models, exemplified by DINO, have advanced rapidly and demonstrated remarkable performance on complex downstream tasks. In this paper, we demonstrate that DINO can serve as an effective feature learner for UIS, and we introduce DiveSeg, a novel framework built upon two insightful components: (1) The AquaStyle Aligner, designed to embed underwater color style features into the DINO fine-tuning process, facilitating better adaptation to the underwater domain. (2) The ObjectPrior Prompter, which incorporates binary segmentation-based prompts to deliver object-level priors, provides essential guidance for instance segmentation task that requires both object- and instance-level reasoning. We conduct thorough experiments on the popular UIIS and USIS10K datasets, and the results show that DiveSeg achieves the state-of-the-art performance. Code: https://github.com/ettof/Diveseg.
[79] VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation cs.CVPDF
Arpan Phukan, Anupam Pandey, Deepjyoti Bodo, Asif Ekbal
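TL;DR: VideoChain is a Transformer-based framework for multi-hop video question generation, producing questions that require reasoning across multiple video segments.
Details
Motivation: Multi-hop question generation is currently confined to text, while video question generation is limited to zero-hop questions over single segments. The authors propose VideoChain to close this gap.
Result: It performs strongly on ROUGE-L, BLEU-1, and other metrics, indicating that the generated questions are coherent and contextually grounded.
Insight: VideoChain's success points to the importance of multimodal reasoning in video question generation.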
Abstract: Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain’s strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model’s ability to generate coherent, contextually grounded, and reasoning-intensive questions.
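Of the metrics listed above, ROUGE-L is the one based on the longest common subsequence (LCS) between a generated question and a reference. The sketch below assumes whitespace tokenization, lowercasing, and an equally weighted F-score; the paper's exact scoring setup is not stated in the abstract, and the function names and example sentences are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated and reference questions.
generated = "who opened the door after the call"
reference = "who opened the door before the call ended"
print(rouge_l_f1(generated, reference))
```

Because the LCS respects word order without requiring adjacency, ROUGE-L rewards questions that keep the reference's structure even when a few words differ.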
[80] Retrospective motion correction in MRI using disentangled embeddings cs.CVPDF
Qi Wang, Veronika Ecker, Marcel Früh, Sergios Gatidis, Thomas Küstner
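TL;DR: The paper proposes a hierarchical vector-quantized variational autoencoder for correcting motion artifacts in MRI; by learning disentangled embeddings of motion and clean-image features, it achieves robust correction across motion types and severities.
Details
Motivation: Physiological motion degrades MRI image quality, yet existing retrospective motion correction methods often fail to generalize across motion types and body regions; machine-learning-based methods in particular are usually tailored to specific applications and datasets.
Result: Experiments on simulated whole-body motion artifacts show that the method is robust to varying motion severity and that it effectively disentangles motion features, improving correction.
Insight: By disentangling motion features, the method shows potential for application across anatomical regions and motion types, offering a more general approach to MRI motion correction.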
Abstract: Physiological motion can affect the diagnostic quality of magnetic resonance imaging (MRI). While various retrospective motion correction methods exist, many struggle to generalize across different motion types and body regions. In particular, machine learning (ML)-based corrections are often tailored to specific applications and datasets. We hypothesize that motion artifacts, though diverse, share underlying patterns that can be disentangled and exploited. To address this, we propose a hierarchical vector-quantized (VQ) variational auto-encoder that learns a disentangled embedding of motion-to-clean image features. A codebook is deployed to capture a finite collection of motion patterns at multiple resolutions, enabling coarse-to-fine correction. An auto-regressive model is trained to learn the prior distribution of motion-free images and is used at inference to guide the correction process. Unlike conventional approaches, our method does not require artifact-specific training and can generalize to unseen motion patterns. We demonstrate the approach on simulated whole-body motion artifacts and observe robust correction across varying motion severity. Our results suggest that the model effectively disentangles the physical motion in the simulated motion-affected scans, thereby improving the generalizability of ML-based MRI motion correction. Our work on disentangling motion features sheds light on its potential application across anatomical regions and motion types.
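The codebook mentioned in the abstract rests on a generic vector quantization step: each encoder feature is replaced by its nearest codebook entry. The sketch below shows only that lookup, under the assumption of Euclidean nearest-neighbour assignment; it is not the authors' hierarchical model, and the codebook values and function names are hypothetical.

```python
import math


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`."""
    index = min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))
    return index, codebook[index]


def quantize_all(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [quantize(v, codebook)[0] for v in vectors]


# Hypothetical 2-D codebook with three entries and two encoder features.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
features = [(0.9, 1.2), (3.0, 0.2)]
print(quantize_all(features, codebook))
```

A hierarchical variant would repeat this lookup with separate codebooks at several feature resolutions, which is what makes coarse-to-fine correction possible.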
[81] A Circular Argument : Does RoPE need to be Equivariant for Vision? cs.CV | cs.AIPDF
Chase van de Geijn, Timo Lüddecke, Polina Turishcheva, Alexander S. Ecker
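TL;DR: The paper examines whether RoPE (Rotary Positional Encodings) must be positionally equivariant in computer vision, proposes Spherical RoPE, and questions the importance of relative positional encoding.
Details
Motivation: RoPE performs well on one-dimensional sequences (e.g., in NLP), which is commonly attributed to its positional equivariance, but whether its extension to higher-dimensional data such as images must rely on equivariance is unclear; the authors set out to test this.
Result: Spherical RoPE shows learning behavior equivalent to or better than its equivariant counterparts, suggesting that strict positional equivariance may not be critical in computer vision.
Insight: The importance of relative positional encoding may be overestimated; positional encoding design for vision tasks need not be bound by equivariance, which could improve speed and generalization.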
Abstract: Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing spurring recent progress towards generalizing RoPE to higher-dimensional data such as images and videos. The success of RoPE has been thought to be due to its positional equivariance, i.e. its status as a relative positional encoding. In this paper, we mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. Moreover, we show Mixed RoPE to be the analogously general solution for M-dimensional data, if we require commutative generators – a property necessary for RoPE’s equivariance. However, we question whether strict equivariance plays a large role in RoPE’s performance. We propose Spherical RoPE, a method analogous to Mixed RoPE, but assumes non-commutative generators. Empirically, we find Spherical RoPE to have the equivalent or better learning behavior compared to its equivariant analogues. This suggests that relative positional embeddings are not as important as is commonly believed, at least within computer vision. We expect this discovery to facilitate future work in positional encodings for vision that can be faster and generalize better by removing the preconception that they must be relative.
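The equivariance property discussed above can be checked numerically for standard one-dimensional RoPE: after rotating a query and a key by position-dependent angles, their dot product depends only on the position offset. This is a minimal sketch of ordinary 1-D RoPE with the usual frequency schedule, not of the paper's Mixed or Spherical variants; the function names and example vectors are hypothetical.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Pair k = i / 2 uses frequency base ** (-2k / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]

# Same offset (2) at different absolute positions gives the same score.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(score_near, score_far)
```

The invariance holds because rotations within each 2-D plane commute, which is the commutativity requirement the abstract refers to.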
[82] RAPTR: Radar-based 3D Pose Estimation using Transformer cs.CV | cs.AI | eess.SPPDF
Sorachi Kato, Ryoma Yataka, Pu Perry Wang, Pedro Miraldo, Takuya Fujihashi
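TL;DR: RAPTR is a weakly supervised, radar-based 3D human pose estimation method that needs only 3D bounding box and 2D keypoint labels, substantially reducing annotation cost, and improves performance with a two-stage decoder and a pseudo-3D deformable attention module.
Details
Motivation: Conventional radar-based indoor 3D pose estimation relies on costly fine-grained 3D keypoint annotations; RAPTR aims to address this through weak supervision (only 3D BBox and 2D keypoint labels).
Result: RAPTR reduces joint position error by 34.3% on HIBER and 76.9% on MMVR, clearly outperforming existing methods.
Insight: Through its weakly supervised design, RAPTR shows that high-performing 3D pose estimation is possible with low-cost annotations, offering a scalable solution for practical applications.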
Abstract: Radar-based indoor 3D human pose estimation typically relies on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose RAPTR (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels, which are considerably easier and more scalable to collect. Our RAPTR is characterized by a two-stage pose decoder architecture with a pseudo-3D deformable attention to enhance (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities; and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by 34.3% on HIBER and 76.9% on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.
[83] Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation cs.CV | cs.AI | cs.LGPDF
Difei Gu, Yunhe Gao, Mu Zhou, Dimitris Metaxas
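TL;DR: Anatomy-VLM is a fine-grained vision-language model for medical image interpretation; by localizing key anatomical features and combining them with structured knowledge, it delivers clinical-level disease prediction and anatomy-related knowledge alignment.
Details
Motivation: Existing vision-language models (VLMs) usually treat images holistically and overlook the fine-grained image details that are critical to medical diagnosis. To address this, Anatomy-VLM draws on clinicians' workflow and fuses multi-scale information.
Result: Anatomy-VLM performs well on both in-distribution and out-of-distribution datasets, and downstream image segmentation tasks confirm that it captures anatomical and pathology-related knowledge.
Insight: A multi-scale information fusion strategy modeled on the clinical workflow, combined with structured knowledge, can markedly improve vision-language models for medical image interpretation.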
Abstract: Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by drawing on their prior medical knowledge and identifying anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically interpretable disease predictions. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, Anatomy-VLM’s encoder facilitates zero-shot anatomy-wise interpretation, demonstrating strong expert-level clinical interpretation capabilities.
[84] Cross-pyramid consistency regularization for semi-supervised medical image segmentation cs.CVPDF
Matus Bojko, Maros Kollar, Marek Jakab, Wanda Benesova
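TL;DR: The paper proposes Cross-Pyramid Consistency Regularization (CPCR) for semi-supervised medical image segmentation; combining a Dual Branch Pyramid Network (DBPNet) with a new regularization strategy markedly improves model performance.
Details
Motivation: Annotating medical images is costly and time-consuming. Semi-supervised learning can train models from a small amount of labeled data and a large amount of unlabeled data, but its performance still has room to improve.
Result: Experiments show that DBPNet with CPCR outperforms five state-of-the-art semi-supervised learning methods on a public benchmark dataset and performs comparably to recent methods.
Insight: Cross-pyramid consistency regularization effectively exploits consistency across multi-level features, improving semi-supervised medical image segmentation.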
Abstract: Semi-supervised learning (SSL) enables training of powerful models from limited, carefully labelled data, with a large amount of unlabeled data supporting the learning. In this paper, we propose a hybrid consistency learning approach to effectively exploit unlabeled data for semi-supervised medical image segmentation by leveraging Cross-Pyramid Consistency Regularization (CPCR) between two decoders. First, we design a hybrid Dual Branch Pyramid Network (DBPNet), consisting of an encoder and two decoders that differ slightly, each producing a pyramid of perturbed auxiliary predictions across multiple resolution scales. Second, we present a learning strategy for this network named CPCR that combines existing consistency learning and uncertainty minimization approaches on the main output predictions of the decoders with our novel regularization term. More specifically, in this term, we extend the soft-labeling setting to pyramid predictions across decoders to support knowledge distillation in deep hierarchical features. Experimental results show that DBPNet with CPCR outperforms five state-of-the-art semi-supervised learning methods and has comparable performance with recent ones on a public benchmark dataset.
[85] Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification cs.CV | cs.AIPDF
Anh Mai Vu, Tuan L. Vo, Ngoc Lam Quang Bui, Nam Nguyen Le Binh, Akash Awasthi
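TL;DR: The paper proposes Contrastive Integrated Gradients (CIG), a new feature attribution method for improving the interpretability of whole slide image (WSI) classification. CIG contrasts class-discriminative regions to separate tumor from non-tumor areas more clearly, and it satisfies the axioms of integrated attribution. The paper also proposes two new attribution quality metrics, MIL-AIC and MIL-SIC.
Details
Motivation: In whole slide image analysis, model interpretability is essential for building trust in AI-assisted diagnosis. Existing integrated-gradients methods fall short on high-resolution WSIs: they may overlook class-discriminative signals and fail to distinguish tumor subtypes clearly.
Result: On CAMELYON16, TCGA-RCC, and TCGA-Lung, CIG outperforms baselines both quantitatively (MIL-AIC/MIL-SIC) and qualitatively (visualizations), with attributions that align closely with ground-truth tumor regions.
Insight: Contrastive analysis strengthens model interpretability, particularly for distinguishing tumor subtypes, and helps make AI diagnosis more trustworthy.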
Abstract: Interpretability is essential in Whole Slide Image (WSI) analysis for computational pathology, where understanding model predictions helps build trust in AI-assisted diagnostics. While Integrated Gradients (IG) and related attribution methods have shown promise, applying them directly to WSIs introduces challenges due to their high-resolution nature. These methods capture model decision patterns but may overlook class-discriminative signals that are crucial for distinguishing between tumor subtypes. In this work, we introduce Contrastive Integrated Gradients (CIG), a novel attribution method that enhances interpretability by computing contrastive gradients in logit space. First, CIG highlights class-discriminative regions by comparing feature importance relative to a reference class, offering sharper differentiation between tumor and non-tumor areas. Second, CIG satisfies the axioms of integrated attribution, ensuring consistency and theoretical soundness. Third, we propose two attribution quality metrics, MIL-AIC and MIL-SIC, which measure how predictive information and model confidence evolve with access to salient regions, particularly under weak supervision. We validate CIG across three datasets spanning distinct cancer types: CAMELYON16 (breast cancer metastasis in lymph nodes), TCGA-RCC (renal cell carcinoma), and TCGA-Lung (lung cancer). Experimental results demonstrate that CIG yields more informative attributions both quantitatively, using MIL-AIC and MIL-SIC, and qualitatively, through visualizations that align closely with ground-truth tumor regions, underscoring its potential for interpretable and trustworthy WSI-based diagnostics.
[86] Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding cs.CV | cs.IRPDF
Da Li, Yuxiao Luo, Keping Bi, Jiafeng Guo, Wei Yuan
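TL;DR: The paper proposes CoMa, an efficient pre-training paradigm that decouples a vision-language model's (VLM's) comprehensive understanding from the contrastive learning objective: a compression pre-training phase serves as a warm-up, after which a small amount of data suffices to turn the VLM into a competitive embedding model.
Details
Motivation: VLMs are conventionally adapted into embedding models through large-scale contrastive learning, but this coupled objective can be inefficient. The authors argue that decoupling comprehensive understanding of the input from the contrastive objective yields semantic embeddings more efficiently.
Result: CoMa achieves the best performance among VLMs of comparable size on MMEB while being both efficient and effective.
Insight: Decoupling the pre-training objective from the contrastive objective may be an effective route to better multimodal embeddings; the compression warm-up gives the subsequent stage a better initialization.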
Abstract: Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that VLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input facilitates the embedding model in achieving superior performance in downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform a VLM into a competitive embedding model. CoMa achieves new state-of-the-art results among VLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness.
[87] CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing cs.CV | cs.LGPDF
Leonie Bossemeyer, Samuel Heinrich, Grant Van Horn, Oisin Mac Aodha
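TL;DR: CleverBirds is a large-scale knowledge tracing benchmark for fine-grained bird species recognition, sourced from the citizen-science platform eBird. It contains over 17 million quiz answers from more than 40,000 participants, covering over 10,000 bird species, and is intended to support research on and evaluation of visual knowledge tracing methods.
Details
Motivation: Fine-grained visual recognition is essential in many expert domains, but modeling how humans progress toward such expertise remains challenging. Accurately inferring a human learner's knowledge state is a key step toward understanding visual learning.
Result: The results show that tracking learners' knowledge state is challenging and that different forms of contextual information offer varying degrees of predictive benefit.
Insight: CleverBirds is a rich resource for developing and evaluating new methods, well suited to studying how human expertise in fine-grained visual recognition develops.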
Abstract: Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertise in humans remains challenging, and accurately inferring a human learner’s knowledge state is a key step toward understanding visual learning. We introduce CleverBirds, a large-scale knowledge tracing benchmark for fine-grained bird species recognition. Collected by the citizen-science platform eBird, it offers insight into how individuals acquire expertise in complex fine-grained classification. More than 40,000 participants have engaged in the quiz, answering over 17 million multiple-choice questions spanning over 10,000 bird species, with long-range learning patterns across an average of 400 questions per participant. We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. We show that tracking learners’ knowledge is challenging, especially across participant subgroups and question types, with different forms of contextual information offering varying degrees of predictive benefit. CleverBirds is among the largest benchmarks of its kind, offering a substantially higher number of learnable concepts. With it, we hope to enable new avenues for studying the development of visual expertise over time and across individuals.
[88] UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist cs.CVPDF
Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li
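TL;DR: UniVA is an open-source multi-agent framework that unifies video understanding, segmentation, editing, and generation into cohesive workflows. With a Plan-and-Act dual-agent architecture and a multi-level memory, it delivers automated, context-coherent video workflows; the authors also introduce UniVA-Bench as an evaluation benchmark.
Details
Motivation: Real-world video applications need complex workflows that combine several video tasks (e.g., generation and understanding), which existing specialized models cannot provide. UniVA aims to fill this gap with a unified open-source solution.
Result: UniVA supports iterative, any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis), going beyond the limits of single-purpose models and monolithic video-language models.
Insight: Modular, multi-agent design is key to handling complex video tasks, and a unified open-source framework can advance research on next-generation multimodal AI systems.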
Abstract: While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
[89] Large Sign Language Models: Toward 3D American Sign Language Translation cs.CV | cs.AIPDF
Sen Zhang, Xiaoxiao He, Di Liu, Zhaoyang Xia, Mingyu Zhao
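TL;DR: The paper proposes LSLM, a new framework that uses large language models (LLMs) as the backbone to translate 3D American Sign Language (ASL), offering a more accurate and robust solution for hearing-impaired people's virtual communication.
Details
Motivation: Existing sign language recognition methods rely mainly on 2D video and cannot fully capture the spatial, gestural, and depth information of signing, which limits translation accuracy and robustness. The authors aim to overcome this by using 3D data directly.
Result: The LSLM framework translates 3D sign language more accurately and improves the robustness and flexibility of sign language translation, better supporting virtual communication for hearing-impaired people.
Insight: Bringing 3D information into sign language translation not only improves translation quality but also opens a research direction for extending LLMs' multimodal capabilities, advancing inclusive intelligent systems.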
Abstract: We present Large Sign Language Models (LSLM), a novel framework for translating 3D American Sign Language (ASL) by leveraging Large Language Models (LLMs) as the backbone, which can benefit hearing-impaired individuals’ virtual communication. Unlike existing sign language recognition methods that rely on 2D video, our approach directly utilizes 3D sign language data to capture rich spatial, gestural, and depth information in 3D scenes. This enables more accurate and resilient translation, enhancing digital communication accessibility for the hearing-impaired community. Beyond the task of ASL translation, our work explores the integration of complex, embodied multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based inputs to broaden their understanding of human communication. We investigate both direct translation from 3D gesture features to text and an instruction-guided setting where translations can be modulated by external prompts, offering greater flexibility. This work provides a foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language.
[90] 3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation cs.CVPDF
Yunhong He, Zhengqing Yuan, Zhengzhong Tu, Yanfang Ye, Lichao Sun
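TL;DR: 3D4D is an interactive 4D visualization framework that combines WebGL with Supersplat rendering to turn static images and text into coherent 4D scenes and supports real-time multi-modal interaction.
Details
Motivation: It targets the lack of interactivity and real-time performance in 4D environment visualization, letting users explore complex scenes adaptively.
Result: It enables user-driven, interactive exploration of 4D environments with efficient real-time multi-modal operation.
Insight: Combining WebGL with Supersplat offers a new interaction paradigm for 4D visualization, and the foveated rendering strategy markedly improves real-time performance.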
Abstract: We introduce 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering. It transforms static images and text into coherent 4D scenes through four core modules and employs a foveated rendering strategy for efficient, real-time multi-modal interaction. This framework enables adaptive, user-driven exploration of complex 4D environments. The project page and code are available at https://yunhonghe1021.github.io/NOVA/.
[91] Vision Transformer Based User Equipment Positioning cs.CV | cs.NIPDF
Parshwa Shah, Dhaval K. Patel, Brijesh Soni, Miguel López-Benítez, Siddhartan Govindasamy
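TL;DR: The paper proposes a Vision Transformer (ViT)-based method for user equipment (UE) positioning; its attention mechanism focuses on the Angle Delay Profile (ADP) of the CSI matrix, and it clearly outperforms existing methods in indoor and outdoor environments.
Details
Motivation: Existing deep learning methods for UE positioning have two problems: 1) they give equal attention to every part of the input; 2) they are poorly suited to non-sequential data (e.g., when only instantaneous CSI is available). A more effective attention mechanism for CSI data is therefore needed.
Result: RMSE is 0.55 m indoors and 13.59 m outdoors on DeepMIMO, and 3.45 m in ViWi's outdoor blockage scenario; the method outperforms existing methods by about 38%.
Insight: ViT's attention mechanism handles non-sequential CSI data well, and its dynamic weighting markedly improves positioning accuracy, offering a new approach to UE positioning.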
Abstract: Recently, Deep Learning (DL) techniques have been used for User Equipment (UE) positioning. However, the key shortcomings of such models are that: i) they give the same attention to the entire input; ii) they are not well suited to non-sequential data, e.g., when only instantaneous Channel State Information (CSI) is available. In this context, we propose an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) from the CSI matrix. Our approach, validated on the DeepMIMO and ViWi ray-tracing datasets, achieves a Root Mean Squared Error (RMSE) of 0.55 m indoors, 13.59 m outdoors in DeepMIMO, and 3.45 m in ViWi’s outdoor blockage scenario. The proposed scheme outperforms state-of-the-art schemes by about 38%. It also performs substantially better than other approaches that we have considered in terms of the distribution of error distance.
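The RMSE figures in this abstract measure positioning error in metres. A minimal sketch of the computation follows, assuming the error of each sample is the Euclidean distance between predicted and true coordinates; the function name and the sample coordinates are hypothetical.

```python
import math


def positioning_rmse(predicted, actual):
    """Root mean squared Euclidean error between predicted and true positions."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("position lists must be non-empty and equally long")
    squared_errors = [math.dist(p, a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (x, y) positions in metres.
predicted_positions = [(0.0, 0.0), (3.0, 4.0)]
true_positions = [(0.0, 0.0), (0.0, 0.0)]
print(positioning_rmse(predicted_positions, true_positions))
```

Because errors are squared before averaging, a few large misses dominate the score, which is one reason the abstract also discusses the distribution of error distance.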
cs.CL [Back]
[92] Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey cs.CLPDF
Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
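TL;DR: This survey examines the use of large language models (LLMs) for scientific idea generation, groups existing methods into five families, and analyzes their contributions and future directions through creativity frameworks.
Details
Motivation: Scientific idea generation is central to scientific discovery, but its multi-objective, open-ended nature makes it hard. LLMs show promise for idea generation, yet their creative capacity is not well understood or fully exploited.
Result: Through this categorization and framework analysis, the survey clarifies the current state of the field and outlines directions for the systematic application of LLMs in scientific discovery.
Insight: LLMs show great potential for scientific idea generation, but their creativity needs external knowledge and collaboration mechanisms to be fully realized.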
Abstract: Scientific idea generation lies at the heart of scientific discovery and has driven human progress, whether by solving unsolved problems or proposing novel hypotheses to explain unknown phenomena. Unlike standard scientific reasoning or general creative generation, idea generation in science is a multi-objective and open-ended task, where the novelty of a contribution is as essential as its empirical soundness. Large language models (LLMs) have recently emerged as promising generators of scientific ideas, capable of producing coherent and factual outputs with surprising intuition and acceptable reasoning, yet their creative capacity remains inconsistent and poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, examining how different approaches balance creativity with scientific soundness. We categorize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we employ two complementary frameworks: Boden’s taxonomy of Combinatorial, Exploratory and Transformational creativity to characterize the level of ideas each family is expected to generate, and Rhodes’ 4Ps framework (Person, Process, Press, and Product) to locate the aspect or source of creativity that each method emphasizes. By aligning methodological advances with creativity frameworks, this survey clarifies the state of the field and outlines key directions toward reliable, systematic, and transformative applications of LLMs in scientific discovery.
[93] GRIP: In-Parameter Graph Reasoning through Fine-Tuning Large Language Models cs.CL | cs.AIPDF
Jiarui Feng, Donghong Cai, Yixin Chen, Muhan Zhang
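TL;DR: GRIP is a framework that fine-tunes LLMs to internalize the relational information in graph data, avoiding complex graph-to-text conversion or additional modules and improving efficiency and effectiveness on graph-related tasks.
Details
Motivation: Existing methods for structured graph data require complex conversion or extra modules, which is inefficient and often yields poor results, so a more efficient way to adapt LLMs to graph data is needed.
Result: Experiments on multiple benchmarks validate GRIP's efficiency and effectiveness.
Insight: Internalizing knowledge in parameters can efficiently extend LLM capabilities, especially for structured-data tasks.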
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modeling sequential textual data and generalizing across diverse tasks. However, adapting LLMs to effectively handle structural data, such as knowledge graphs or web data, remains a challenging problem. Some approaches adopt complex strategies to convert graphs into text sequences, resulting in significant token overhead and rendering them impractical for large-scale graphs. Others introduce additional modules to encode graphs into fixed-size token representations for LLMs. However, these methods typically require large-scale post-training on graph-text corpus and complex alignment procedures, yet often yield sub-optimal results due to poor modality alignment. Inspired by in-parameter knowledge injection for test-time adaptation of LLMs, we propose GRIP, a novel framework that equips LLMs with the ability to internalize complex relational information from graphs through carefully designed fine-tuning tasks. This knowledge is efficiently stored within lightweight LoRA parameters, enabling the fine-tuned LLM to perform a wide range of graph-related tasks without requiring access to the original graph at inference time. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach.
[94] Stress Testing Factual Consistency Metrics for Long-Document Summarization cs.CL | cs.AI | cs.LGPDF
Zain Muhammad Mujahid, Dustin Wright, Isabelle Augenstein
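TL;DR: The paper systematically evaluates how reliable six reference-free factual consistency metrics, originally designed for short-form summarization, are on long-document summarization, revealing inconsistent scores for semantically equivalent summaries and proposing directions for improvement.
Details
Motivation: Evaluating factual consistency in long-document summarization is challenging; existing metrics perform poorly because of input length limits and long-range dependencies, so systematic evaluation and improvement are needed.
Result: Existing metrics score semantically equivalent summaries inconsistently, reliability drops for information-dense content, and expanding the retrieval context only partially improves stability.
Insight: Improving long-document factuality evaluation calls for multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to strengthen robustness.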
Abstract: Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.
[95] CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences cs.CL | cs.AIPDF
Rhitabrat Pokharel, Yufei Tao, Ameeta Agrawal
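TL;DR: CAPO introduces a dynamic loss scaling mechanism that adjusts the learning signal according to a confidence derived from the relative reward, significantly improving the robustness and performance of multilingual preference optimization.
Details
Motivation: Existing methods such as DPO work well in English but generalize poorly to multilingual settings, especially with noisy or low-margin comparisons. CAPO aims to address this.
Result: CAPO significantly outperforms existing baselines in multilingual settings, improving reward accuracy by at least 16%, and improves alignment by widening the gap between preferred and dispreferred responses.
Insight: Dynamically adjusting the learning signal is promising for multilingual preference optimization and is more robust when handling noisy data.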
Abstract: Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have proven effective in English, they often fail to generalize robustly to multilingual settings. We propose a simple yet effective alternative, Confidence-Aware Preference Optimization (CAPO), which replaces DPO’s fixed treatment of preference pairs with a dynamic loss scaling mechanism based on a relative reward. By modulating the learning signal according to the confidence in each preference pair, CAPO enhances robustness to noisy or low-margin comparisons, typically encountered in multilingual text. Empirically, CAPO outperforms existing preference optimization baselines by at least 16% in reward accuracy, and improves alignment by widening the gap between preferred and dispreferred responses across languages.
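The abstract does not give CAPO's exact formula, so the following is only an illustrative sketch of the general idea: the standard DPO loss on one preference pair, multiplied by a per-pair confidence weight. The weight function `confidence_weight` (a `tanh` of the absolute margin) is a hypothetical choice for illustration, not the authors' definition.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def reward_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO implicit reward margin, computed from sequence log-probabilities."""
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))

def dpo_loss(margin):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    return softplus(-margin)

def confidence_weight(margin):
    """Hypothetical confidence in [0, 1): near zero for low-margin pairs."""
    return math.tanh(abs(margin))

def confidence_scaled_loss(margin):
    """DPO loss scaled by the (assumed) confidence weight."""
    return confidence_weight(margin) * dpo_loss(margin)

# Hypothetical log-probabilities for one chosen/rejected pair.
margin = reward_margin(-1.0, -3.0, -2.0, -2.0, beta=0.5)
scaled = confidence_scaled_loss(margin)
```

Under this assumed weighting, a pair whose margin is close to zero contributes almost nothing to the loss, which is one simple way to reduce the influence of low-margin comparisons.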
[96] Design, Results and Industry Implications of the World’s First Insurance Large Language Model Evaluation Benchmark cs.CLPDF
Hua Zhou, Bing Ma, Yufei Zhang, Yi Zhao
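TL;DR: This paper presents the construction methodology, multi-dimensional evaluation system, and design philosophy of CUFEInse v1.0, a benchmark for insurance large language models, and uses it to evaluate 11 mainstream models, revealing bottlenecks in both general-purpose and domain-specific models while filling the gap in professional evaluation tools for insurance.
Details
Motivation: To fill the gap in professional evaluation tools for the insurance domain and to provide a reference for evaluation paradigms of vertical-domain large models.
Result: The evaluation finds that general-purpose models have bottlenecks in actuarial capability and compliance adaptation; high-quality domain-specific training has advantages in vertical scenarios but still falls short in business adaptation and compliance.
Insight: The benchmark's construction concept and methodology can serve as a reference for evaluating vertical-domain large models; the future direction for insurance large models is “domain adaptation + reasoning enhancement”.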
Abstract: This paper comprehensively elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of “quantitative-oriented, expert-driven, and multi-validation,” the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general-purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. High-quality domain-specific training demonstrates significant advantages in insurance vertical scenarios but exhibits shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse not only fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool, but also its construction concept and methodology offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper looks forward to the future iteration direction of the evaluation benchmark and the core development direction of “domain adaptation + reasoning enhancement” for insurance large models.
[97] From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory cs.CLPDF
Siyu Xia, Zekun Xu, Jiajun Chai, Wentian Fan, Yan Song
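TL;DR: This paper proposes a graph memory framework whose memory weights are optimized with reinforcement learning, improving the strategic reasoning and generalization of LLM agents.
Details
Motivation: Existing LLM agents are limited in how they use experience: implicit memory suffers from catastrophic forgetting and poor interpretability, while explicit memory lacks adaptability.
Result: Experiments show that the method performs well on reinforcement learning and strategic reasoning tasks, significantly improving generalization and performance.
Insight: With structured memory and dynamic weight optimization, LLM agents can use past experience more effectively, improving task-solving ability and adaptability.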
Abstract: Large Language Models (LLMs) based agents have demonstrated remarkable potential in autonomous task-solving across complex, open-ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high-level, human-interpretable strategic meta-cognition. In order to make memory adaptable, we propose a reinforcement-based weight optimization procedure that estimates the empirical utility of each meta-cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent’s training loop through meta-cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents’ strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.
[98] Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning cs.CLPDF
Songze Li, Zhiqiang Liu, Zhaoyan Gong, Xiaoke Guo, Zhengke Gui
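TL;DR: The Logits-to-Logic framework strengthens and filters LLM output logits to address insufficient logic consistency in structured knowledge reasoning, significantly improving performance on knowledge graph question answering.
Details
Motivation: Large language models (LLMs) excel at reasoning over unstructured text but perform poorly on structured knowledge reasoning (such as knowledge graph question answering) because of Logic Drift. Existing methods guide reasoning only through input-level prompts and cannot fundamentally solve the problem.
Result: Experiments show that the method significantly improves LLMs' logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on KGQA tasks.
Insight: Directly manipulating the logits an LLM generates can effectively correct logical inconsistencies, offering a flexible, task-agnostic solution for structured knowledge reasoning.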
Abstract: Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text, enabling them to understand the logic in natural language and generate logic-consistent responses. However, the representational differences between unstructured and structured knowledge make LLMs inherently struggle to maintain logic consistency, leading to Logic Drift challenges in structured knowledge reasoning tasks such as Knowledge Graph Question Answering (KGQA). Existing methods address this limitation by designing complex workflows embedded in prompts to guide LLM reasoning. Nevertheless, these approaches only provide input-level guidance and fail to fundamentally address the Logic Drift in LLM outputs. Additionally, their inflexible reasoning workflows cannot adapt to different tasks and knowledge graphs. To enhance LLMs’ logic consistency in structured knowledge reasoning, we specifically target the logits output from the autoregressive generation process. We propose the Logits-to-Logic framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs. Extensive experiments show that our approach significantly improves LLMs’ logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.
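The abstract names logits strengthening and logits filtering but does not specify them. A generic sketch of what such operations can look like at one decoding step follows; the additive bonus, the mask, the toy vocabulary, and all function names are assumptions for illustration, not the paper's modules.

```python
import math

def strengthen_logits(logits, preferred_ids, bonus=1.0):
    """Add a fixed bonus to candidates judged consistent (assumed rule)."""
    return [x + bonus if i in preferred_ids else x for i, x in enumerate(logits)]

def filter_logits(logits, allowed_ids):
    """Mask every candidate that is not valid, e.g. not a relation in the graph."""
    if not allowed_ids:
        raise ValueError("at least one candidate must remain allowed")
    return [x if i in allowed_ids else float("-inf") for i, x in enumerate(logits)]

def softmax(logits):
    """Stable softmax; masked (-inf) entries receive probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token candidates and raw logits.
vocab = ["born_in", "capital_of", "banana", "spouse_of"]
logits = [1.0, 2.0, 3.0, 0.5]
valid_relations = {0, 1, 3}  # "banana" is not a relation in the toy graph

probs = softmax(filter_logits(strengthen_logits(logits, {0}, bonus=2.0), valid_relations))
best = vocab[probs.index(max(probs))]
```

In this toy step the unconstrained argmax would be "banana"; after masking, it can no longer be selected, and the boosted relation "born_in" becomes the top candidate.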
[99] Social Media for Mental Health: Data, Methods, and Findings cs.CLPDF
Nur Shazwani Kamarudin, Ghazaleh Beigi, Lydia Manikonda, Huan Liu
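TL;DR: This chapter surveys recent methods, data, and findings in using social media data to study mental health, focusing on linguistic, visual, and emotional indicators, and discusses potential implications for medical practice and policymaking.
Details
Motivation: The spread of social media provides a unique data source for studying mental health, especially highly stigmatized conditions such as depression, anxiety, and suicidal thoughts. Anonymity supports free expression by users, making such research more feasible.
Result: Social media data can effectively reveal trends and patterns in mental health issues, providing new support tools for medical practice and policymaking.
Insight: Social media data offer unprecedented scale and diversity for mental health research, but also raise challenges around data privacy and the ethics of data handling.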
Abstract: There is an increasing number of virtual communities and forums available on the web. With social media, people can freely communicate and share their thoughts, ask personal questions, and seek peer-support, especially those with conditions that are highly stigmatized, without revealing personal identity. We study the state-of-the-art research methodologies and findings on mental health challenges like depression, anxiety, and suicidal thoughts, drawn from the pervasive use of social media data. We also discuss how these novel ideas and approaches can help to raise awareness of mental health issues in an unprecedented way. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. The main goal of this chapter is to show how this new source of data can be tapped to improve medical practice, provide timely support, and influence government or policymakers. In the context of social media for mental health issues, this chapter categorizes the social media data used, introduces the different machine learning, feature engineering, natural language processing, and survey methods that have been deployed, and outlines directions for future research.
[100] Distinct Theta Synchrony across Speech Modes: Perceived, Spoken, Whispered, and Imagined cs.CLPDF
Jung-Sun Lee, Ha-Na Jo, Eunyeong Ko
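TL;DR: The study analyzes differences in theta-band neural synchrony across speech modes (perceived, overt, whispered, and imagined), revealing mode-specific neural mechanisms and their roles in language processing and attentional control.
Details
Motivation: The modes of speech (such as overt, whispered, and imagined) differ in their neural mechanisms, but prior studies have mostly focused on a single mode and lack a comprehensive comparison of theta synchrony. This study aims to fill that gap.
Result: Overt and whispered speech show broad, strong frontotemporal synchrony; perceived speech is dominated by posterior and temporal synchrony; imagined speech shows an internally coherent synchrony pattern confined mainly to frontal regions.
Insight: The extent and spatial distribution of theta synchrony differ markedly across speech modes, indicating that each mode relies on distinct neural mechanisms and reflecting the diversity of language processing and attentional control.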
Abstract: Human speech production encompasses multiple modes such as perceived, overt, whispered, and imagined, each reflecting distinct neural mechanisms. Among these, theta-band synchrony has been closely associated with language processing, attentional control, and inner speech. However, previous studies have largely focused on a single mode, such as overt speech, and have rarely conducted an integrated comparison of theta synchrony across different speech modes. In this study, we analyzed differences in theta-band neural synchrony across speech modes based on connectivity metrics, focusing on region-wise variations. The results revealed that overt and whispered speech exhibited broader and stronger frontotemporal synchrony, reflecting active motor-phonological coupling during overt articulation, whereas perceived speech showed dominant posterior and temporal synchrony patterns, consistent with auditory perception and comprehension processes. In contrast, imagined speech demonstrated a more spatially confined but internally coherent synchronization pattern, primarily involving frontal and supplementary motor regions. These findings indicate that the extent and spatial distribution of theta synchrony differ substantially across modes, with overt articulation engaging widespread cortical interactions, whispered speech showing intermediate engagement, and perception relying predominantly on temporoparietal networks. Therefore, this study aims to elucidate the differences in theta-band neural synchrony across various speech modes, thereby uncovering both the shared and distinct neural dynamics underlying language perception and imagined speech.
[101] Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker cs.CLPDF
Matthias De Lange, Jens-Joris Decorte, Jeroen Van Hautte
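TL;DR: The paper introduces the WorkBench evaluation suite and the Unified Work Embeddings (UWE) model, using contrastive learning and a multi-task ranker to address the complexity and data scarcity of NLP tasks in the work domain.
Details
Motivation: NLP tasks in the work domain involve long-tailed distributions, multi-label target spaces, and scarce data, and existing generalist embedding models perform poorly on them.
Result: UWE achieves zero-shot ranking on unseen target spaces and significantly improves macro-averaged MAP and RP@10.
Insight: Cross-task transfer significantly improves performance on work-domain NLP tasks, and a task-agnostic embedding architecture can handle complex scenarios effectively.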
Abstract: Workforce transformation across diverse industries has driven an increased demand for specialized natural language processing capabilities. Nevertheless, tasks derived from work-related contexts inherently reflect real-world complexities, characterized by long-tailed distributions, extreme multi-label target spaces, and scarce data availability. The rise of generalist embedding models prompts the question of their performance in the work domain, especially as progress in the field has focused mainly on individual tasks. To this end, we introduce WorkBench, the first unified evaluation suite spanning six work-related tasks formulated explicitly as ranking problems, establishing a common ground for multi-task progress. Based on this benchmark, we find significant positive cross-task transfer, and use this insight to compose task-specific bipartite graphs from real-world data, synthetically enriched through grounding. This leads to Unified Work Embeddings (UWE), a task-agnostic bi-encoder that exploits our training-data structure with a many-to-many InfoNCE objective, and leverages token-level embeddings with task-agnostic soft late interaction. UWE demonstrates zero-shot ranking performance on unseen target spaces in the work domain, enables low-latency inference by caching the task target space embeddings, and shows significant gains in macro-averaged MAP and RP@10 over generalist embedding models.
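The abstract mentions a many-to-many InfoNCE objective without spelling it out. One common multi-positive formulation, shown here as a sketch under that assumption, averages the log-softmax of every positive target for a query over the full candidate set; the function name, similarity scores, and temperature are hypothetical.

```python
import math

def multi_positive_info_nce(scores, positive_ids, temperature=0.07):
    """InfoNCE for one query with several positives among the candidates.

    scores: similarity of the query to each candidate target.
    positive_ids: indices of the candidates that are correct for this query.
    """
    if not positive_ids:
        raise ValueError("each query needs at least one positive")
    scaled = [s / temperature for s in scores]
    # log-sum-exp over all candidates, computed stably.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    # Average negative log-probability of the positives.
    return -sum(scaled[i] - log_denominator for i in positive_ids) / len(positive_ids)

# Uninformative scores: every candidate is equally likely.
uniform = multi_positive_info_nce([0.0, 0.0, 0.0, 0.0], {0, 1}, temperature=1.0)
# Scores that rank both positives above the negatives.
separated = multi_positive_info_nce([5.0, 5.0, 0.0, 0.0], {0, 1}, temperature=1.0)
```

In a many-to-many setting the same loss would also be applied in the reverse direction (targets as queries); that symmetric step is omitted here for brevity.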
[102] NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation cs.CL | cs.AIPDF
Maoqi Liu, Quan Fang, Yuhao Wu, Can Zhao, Yang Yang
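TL;DR: NOTAM-Evolve is an LLM-based self-evolving framework that addresses dynamic knowledge grounding and schema-based inference in NOTAM interpretation, significantly improving accuracy.
Details
Motivation: Interpreting NOTAMs, which is critical to aviation safety, is challenging because of their condensed and cryptic language, and existing automated systems struggle to extract actionable information.
Result: NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, setting a new state of the art.
Insight: Knowledge-guided self-evolution can inform complex language interpretation tasks in other domains.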
Abstract: Accurate interpretation of Notices to Airmen (NOTAMs) is critical for aviation safety, yet their condensed and cryptic language poses significant challenges to both manual and automated processing. Existing automated systems are typically limited to shallow parsing, failing to extract the actionable intelligence needed for operational decisions. We formalize the complete interpretation task as deep parsing, a dual-reasoning challenge requiring both dynamic knowledge grounding (linking the NOTAM to evolving real-world aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). To tackle this challenge, we propose NOTAM-Evolve, a self-evolving framework that enables a large language model (LLM) to autonomously master complex NOTAM interpretation. Leveraging a knowledge graph-enhanced retrieval module for data grounding, the framework introduces a closed-loop learning process where the LLM progressively improves from its own outputs, minimizing the need for extensive human-annotated reasoning traces. In conjunction with this framework, we introduce a new benchmark dataset of 10,000 expert-annotated NOTAMs. Our experiments demonstrate that NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, establishing a new state of the art on the task of structured NOTAM interpretation.
[103] State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting? cs.CL | cs.AIPDF
Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić
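TL;DR: The paper compares fine-tuned BERT-like models with instruction-tuned large language models (LLMs) on text classification in South Slavic languages, finding that LLMs perform strongly in the zero-shot setting but have less predictable outputs, slower inference, and higher computational costs.
Details
Motivation: To investigate the zero-shot and few-shot performance of LLMs on text classification in less-resourced languages (such as South Slavic languages) and compare it with conventional fine-tuned models.
Result: In the zero-shot setting, LLMs match or surpass fine-tuned models, but they demand more compute and inference time.
Insight: LLMs are promising for low-resource languages, but fine-tuned models remain the more practical choice for large-scale text annotation.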
Abstract: Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.
[104] Multimodal LLMs Do Not Compose Skills Optimally Across Modalities cs.CLPDF
Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune
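TL;DR: The paper studies how well multimodal large language models (MLLMs) compose skills across modalities, finds a significant composition gap in existing models, and shows through prompting and fine-tuning experiments both the potential and the limits of improvement.
Details
Motivation: As neural networks learn increasingly complex skills during pretraining, it is unclear how successfully they can compose those skills to solve new tasks. The paper focuses on MLLMs and examines their cross-modal skill composition.
Result: All evaluated MLLMs perform poorly at cross-modal skill composition; chain-of-thought prompting and fine-tuning improve performance, but a significant gap remains.
Insight: Cross-modal skill composition is a weak point of MLLMs, existing methods help only to a limited extent, and further research is needed.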
Abstract: Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLM), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate the aforementioned gap, we explore two alternatives: i) use chain-of-thought prompting to explicitly instruct MLLMs for skill composition and ii) a specific fine-tuning recipe to promote skill composition. Although those strategies improve model performance, they still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.
[105] Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition cs.CLPDF
Raquel Montero, Natalia Moskvina, Paolo Morosi, Tamara Serrano, Elena Pagliarini
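TL;DR: The study finds that multimodal large language models (MLLMs) differ clearly from human linguistic cognition on quantification and object perception tasks, particularly with respect to quantifier scales, ranges of use and prototypicality, and the biases of the human approximate number system.
Details
Motivation: To study how MLLM behavior on quantification tasks differs from human cognition, and to explore how quantification is encoded in model architectures and how stable it is across languages.
Result: MLLMs differ clearly from humans on quantification tasks, especially in cross-linguistic tasks.
Insight: MLLMs may have architectural limitations in handling quantification, and future work should explore how to improve their understanding of complex linguistic phenomena.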
Abstract: Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logical, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have so far remained unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models’ architecture, how they may differ from humans, and whether the results are affected by the type of model and language under investigation. We find that there are clear differences between humans and MLLMs with respect to these features across various tasks that tap into the representation of quantification in vivo vs. in silico. This work thus paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.
[106] Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task? cs.CLPDF
Kunal Kingkar Das, Manoj Balaji Jagadeeshan, Nallani Chakravartula Sahith, Jivnesh Sandhan, Pawan Goyal
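TL;DR: The paper asks whether large language models (LLMs) outperform smaller task-specific sequence-to-sequence (Seq2Seq) models on Sanskrit poetry-to-prose conversion, and finds that the task-specific ByT5-Sanskrit model significantly outperforms LLMs.
Details
Motivation: To test whether LLMs can serve as general-purpose solutions for low-resource, morphologically rich languages such as Sanskrit, particularly on a complex task like poetry-to-prose conversion.
Result: The task-specific ByT5-Sanskrit model significantly outperforms LLMs on all metrics, and human evaluation confirms this.
Insight: Despite the generality of LLMs, task-specific models may have the advantage for specific tasks and low-resource languages; prompting strategies offer an alternative when specialized corpora are unavailable.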
Abstract: Large Language Models (LLMs) are increasingly treated as universal, general-purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low-resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction-tuned and in-context-prompted LLMs with smaller task-specific encoder-decoder models on the Sanskrit poetry-to-prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi-step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine-tuning on general-purpose models and design in-context learning templates grounded in Paninian grammar and classical commentary heuristics. For task-specific modelling, we fully fine-tune a ByT5-Sanskrit Seq2Seq model. Our experiments show that domain-specific fine-tuning of ByT5-Sanskrit significantly outperforms all instruction-driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall’s Tau scores. Additionally, our prompting strategies provide an alternative to fine-tuning when domain-specific verse corpora are unavailable, and the task-specific Seq2Seq model demonstrates robust generalisation on out-of-domain evaluations.
[107] Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models? cs.CLPDF
Arzu Burcu Güven, Anna Rogers, Rob van der Goot
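TL;DR: The paper studies whether syntactic categories help developmentally motivated curriculum learning for language models. CHILDES shows no strong syntactic differentiation by age group, but syntactic knowledge helps interpret model performance. In curriculum learning experiments, some curricula help with reading tasks, but the main gains come from the subset of syntactically categorizable data.
Details
Motivation: To explore the role of syntactic categories in developmental curriculum learning and test how cognitively inspired curriculum design affects language model performance.
Result: The CHILDES corpus lacks significant syntactic differentiation by age group. In curriculum learning, the subset of syntactically categorizable data improves model performance more than the full noisy corpus.
Insight: Syntactic category information is useful for developmental curriculum learning, but it should be combined with data filtering rather than relying only on age grouping or the full corpus.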
Abstract: We examine the syntactic properties of the BabyLM corpus and of age groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data, rather than the full noisy corpus.
[108] VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context cs.CLPDF
Heyang Liu, Ziyang Cheng, Yuhao Wang, Hongcheng Liu, Yiqi Li
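TL;DR: VocalBench-zh is a speech conversation benchmark designed for Mandarin contexts, with 10 subsets and more than 10K high-quality instances covering 12 user-oriented characteristics; it evaluates 14 mainstream models and reveals challenges common to current approaches.
Details
Motivation: Mandarin is one of the most widely spoken languages in the world, but the lack of comprehensive speech-to-speech (S2S) benchmarks hinders systematic evaluation by developers and fair model comparison by users.
Result: The evaluation reveals challenges common to current speech interaction approaches and offers new insights for next-generation systems.
Insight: The work highlights the importance of evaluating Mandarin speech interaction and suggests directions for future research.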
Abstract: The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes, and highlights the need for new insights into next-generation speech interactive systems. The evaluation codes and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
[109] Hierarchical structure understanding in complex tables with VLLMs: a benchmark and experiments cs.CLPDF
Luca Bindini, Simone Giovannini, Simone Marinai, Valeria Nardoni, Kimiya Noor Ali
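TL;DR: The study examines whether Vision Large Language Models (VLLMs) can understand the hierarchical structure of tables in scientific articles, introduces Complex Hierarchical Tables (CHiTab) as a benchmark, and evaluates VLLMs through prompt engineering and fine-tuning experiments.
Details
Motivation: Tables in scientific articles often have complex hierarchical structure, and it is unclear whether generic VLLMs can understand it. This study aims to fill that gap.
Result: Experiments show that generic VLLMs can understand table hierarchy to some extent but still lag behind humans; fine-tuning significantly improves model performance.
Insight: VLLMs show potential for structured data understanding but still need targeted optimization and training, especially on complex tasks.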
Abstract: This work investigates the ability of Vision Large Language Models (VLLMs) to understand and interpret the structure of tables in scientific articles. Specifically, we explore whether VLLMs can infer the hierarchical structure of tables without additional processing. As a basis for our experiments we use the PubTables-1M dataset, a large-scale corpus of scientific tables. From this dataset, we extract a subset of tables that we introduce as Complex Hierarchical Tables (CHiTab): a benchmark collection of complex tables containing hierarchical headings. We adopt a series of prompt engineering strategies to probe the models’ comprehension capabilities, experimenting with various prompt formats and writing styles. Multiple state-of-the-art open-weights VLLMs are evaluated on the benchmark, first using their off-the-shelf versions and then after fine-tuning some models on our task. We also measure human performance on a small set of tables and compare it with that of the evaluated VLLMs. The experiments support our intuition that generic VLLMs, not explicitly designed for understanding the structure of tables, can perform this task. This study provides insights into the potential and limitations of VLLMs to process complex tables and offers guidance for future work on integrating structured data understanding into general-purpose VLLMs.
[110] Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates cs.CLPDF
Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xian Wei
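TL;DR: The paper proposes ReViewGraph, a framework that builds a heterogeneous graph from LLM-simulated multi-round reviewer-author debates and applies GNN reasoning to improve review decisions, outperforming baseline methods.
Details
Motivation: Existing review methods rely on superficial features or directly on LLMs, which are prone to hallucination and bias, and they fail to capture the dynamics of reviewer-author debates.
Result: An average relative improvement of 15.73% across three datasets supports the value of modeling debate structure.
Insight: Modeling the review process effectively requires capturing dynamic debate relations; combining LLMs with GNNs can improve review accuracy.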
Abstract: Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer-author debate structures.
[111] Adaptive Multi-Agent Response Refinement in Conversational Systems cs.CL | cs.AI | cs.MAPDF
Soyeong Jeong, Aparna Elangovan, Emine Yilmaz, Oleg Rokhlenko
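TL;DR: The paper proposes a multi-agent framework that dynamically coordinates agents with distinct roles (factuality, personalization, and coherence) to refine responses generated by large language models (LLMs), improving conversational quality.
Details
Motivation: LLMs may neglect personalization or specific knowledge when generating conversational responses, and existing approaches typically refine responses within a single model. The authors propose a multi-agent collaboration framework to address these issues more comprehensively.
Result: On challenging conversational datasets, the method significantly outperforms baselines, particularly on tasks involving knowledge or user personas.
Insight: Dividing labor among multiple agents addresses complex problems in conversational systems more effectively, and the dynamic communication strategy improves flexibility and adaptability.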
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that ours significantly outperforms relevant baselines, particularly in tasks involving knowledge or user’s persona, or both.
[112] AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress cs.CL | cs.IR | cs.LGPDF
Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen
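TL;DR: The paper proposes AgentPRM, a process reward model designed for LLM agents that evaluates each decision by its contribution and its proximity to the goal, improving efficiency on multi-turn decision-making tasks.
Details
Motivation: LLMs struggle with multi-turn decision-making tasks such as web shopping and browser navigation, and existing methods rely on elaborate prompt engineering or fine-tuning with expert trajectories. The paper explores using process reward models (PRMs) to evaluate decisions and guide agent behavior.
Result: Experiments show that AgentPRM is over 8x more compute-efficient than baselines and improves robustly when test-time compute is scaled up.
Insight: AgentPRM is useful not only for guiding decisions at inference time but also for reinforcement learning of LLM agents, offering a new perspective on multi-turn decision-making.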
Abstract: Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent’s decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over 8x more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
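The abstract mentions TD-based estimation combined with Generalized Advantage Estimation (GAE). As a minimal sketch of the standard GAE recursion only (not the authors' labeling pipeline), the function below computes per-step advantages from rewards and value estimates. The function name, the default `gamma` and `lam` values, and the toy trajectory are assumptions made for illustration.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard GAE recursion (illustrative; not taken from the paper).

    `values` must hold len(rewards) + 1 entries: one value estimate per
    state visited, plus a bootstrap value for the final state.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy three-step trajectory with a reward only at the final step.
toy = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

With `lam=1.0` and zero value estimates, each advantage reduces to the discounted return from that step, which makes the toy case easy to check by hand.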
[113] DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering cs.CL | cs.AIPDF
Xinyi Wang, Yiping Song, Zhiliang Tian, Bo Liu, Tingjin Luo
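TL;DR: The paper proposes DPRM (Dual Implicit Process Reward Model) to address the consistency of CoT and KG reasoning paths in multi-hop question answering. Two implicit PRMs parameterize rewards without explicit annotations, and experiments show that DPRM outperforms 13 baselines on multiple datasets.
Details
Motivation: In multi-hop question answering (MHQA), CoT improves generation quality through multi-step reasoning, and KGs reduce hallucinations through semantic matching. Traditional ORMs evaluate only the final answer, while PRMs require costly human annotation or rollout generation. Implicit PRMs need only outcome signals, but existing methods do not account for the graph-structure constraints of KGs or the inconsistency between CoT and KG paths.
Result: DPRM outperforms 13 baselines on multiple datasets, with up to a 16.6% improvement on Hit@1.
Insight: The dual-PRM design captures both the structural constraints of KGs and the consistency of CoT reasoning paths without additional annotations, providing an efficient way to evaluate multi-step reasoning in MHQA.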
Abstract: In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after generating the final answers but fail to evaluate the process for multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. Implicit PRMs, in contrast, are trained only with outcome signals and derive step rewards through reward parameterization without explicit annotations, which makes them more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRMs have only been explored for plain-text scenarios. When adapted to MHQA tasks, they cannot handle the graph structure constraints in KGs or capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Among them, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, making the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.
[114] PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints cs.CLPDF
Tangrui Li, Pei Wang, Hongzheng Wang, Christian Hahm, Matteo Spatola, Justin Shi
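TL;DR: The paper proposes PCRLLM, a framework that constrains reasoning to single-step inferences expressed in natural language, improving the logical coherence of large language models (LLMs) and supporting verification and multi-model collaboration.
Details
Motivation: LLMs are often logically inconsistent, mapping premises directly to conclusions without following explicit inference rules, which undermines their trustworthiness.
Result: The framework improves the logical coherence of LLMs, enables verifiable reasoning, and performs well in multi-model collaboration.
Insight: Combining formal verification with natural-language reasoning can improve the logical rigor of LLMs while preserving the flexibility of language models.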
Abstract: Large Language Models (LLMs) often exhibit limited logical coherence, mapping premises to conclusions without adherence to explicit inference rules. We propose Proof-Carrying Reasoning with LLMs (PCRLLM), a framework that constrains reasoning to single-step inferences while preserving natural language formulations. Each output explicitly specifies premises, rules, and conclusions, thereby enabling verification against a target logic. This mechanism mitigates trustworthiness concerns by supporting chain-level validation even in black-box settings. Moreover, PCRLLM facilitates systematic multi-LLM collaboration, allowing intermediate steps to be compared and integrated under formal rules. Finally, we introduce a benchmark schema for generating large-scale step-level reasoning data, combining natural language expressiveness with formal rigor.
[115] SPEAR-MM: Selective Parameter Evaluation and Restoration via Model Merging for Efficient Financial LLM Adaptation cs.CL | cs.AI | cs.LG | math.SPPDF
Berkcan Kapusuzoglu, Supriyo Chakraborty, Renkun Ni, Stephen Rawls, Sambit Sahu
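TL;DR: SPEAR-MM is a model-merging framework based on selective parameter evaluation and restoration. It addresses the forgetting of general reasoning capabilities when adapting LLMs to the financial domain while remaining computationally efficient.
Details
Motivation: LLMs adapted to the financial domain often lose general reasoning capabilities through catastrophic forgetting, which harms customer interactions and complex analysis.
Result: On LLaMA-3.1-8B, SPEAR-MM retains 91.2% of general capabilities (versus 69.7% for standard continual pretraining) while maintaining 94% of domain adaptation gains and reducing computational costs by 90%.
Insight: SPEAR-MM provides interpretable trade-off control, suits resource-constrained financial institutions, and offers a new way to balance domain adaptation against general capability.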
Abstract: Large language models (LLMs) adapted to financial domains often suffer from catastrophic forgetting of general reasoning capabilities essential for customer interactions and complex financial analysis. We introduce Selective Parameter Evaluation and Restoration via Model Merging (SPEAR-MM), a practical framework that preserves critical capabilities while enabling domain adaptation. Our method approximates layer-wise impact on external benchmarks through post-hoc analysis, then selectively freezes or restores transformer layers via spherical interpolation merging. Applied to LLaMA-3.1-8B for financial tasks, SPEAR-MM achieves 91.2% retention of general capabilities versus 69.7% for standard continual pretraining, while maintaining 94% of domain adaptation gains. The approach provides interpretable trade-off control and reduces computational costs by 90%, which is crucial for resource-constrained financial institutions.
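The abstract refers to restoring layers via spherical interpolation merging. The sketch below shows generic spherical linear interpolation (slerp) between two flattened weight vectors, assuming one vector comes from the base model and one from the adapted model. The function name, the fallback to linear interpolation for near-parallel vectors, and the per-layer usage are illustrative assumptions; the abstract does not specify how SPEAR-MM chooses interpolation weights.

```python
import math
from typing import List


def slerp(w_base: List[float], w_adapted: List[float], t: float,
          eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two weight vectors.

    t = 0 returns w_base, t = 1 returns w_adapted. Falls back to plain
    linear interpolation when the angle between the vectors is tiny or
    either vector is (near) zero. Illustrative sketch only.
    """
    if len(w_base) != len(w_adapted):
        raise ValueError("weight vectors must have the same length")
    norm_a = math.sqrt(sum(a * a for a in w_base))
    norm_b = math.sqrt(sum(b * b for b in w_adapted))
    lerp = [(1.0 - t) * a + t * b for a, b in zip(w_base, w_adapted)]
    if norm_a < eps or norm_b < eps:
        return lerp
    dot = sum(a * b for a, b in zip(w_base, w_adapted))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp
    coef_a = math.sin((1.0 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_base, w_adapted)]


# Halfway between two orthogonal unit vectors.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

In a layer-wise merge, a function like this would be applied to each selected layer's flattened parameters, with `t` controlling how far that layer moves back toward the base model.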
[116] Structured RAG for Answering Aggregative Questions cs.CL | cs.LGPDF
Omri Koshorek, Niv Granot, Aviv Alloni, Shahar Admati, Roee Hendel
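TL;DR: The paper proposes S-RAG, an approach designed for queries that require aggregating information from a large number of documents. S-RAG builds a structured representation at ingestion time and translates natural-language queries into formal queries at inference time. Experiments show that S-RAG substantially outperforms common RAG systems and long-context LLMs on both the new datasets and a public benchmark.
Details
Motivation: Current RAG methods mainly target queries for which only a few passages are relevant and offer little support for queries that aggregate information across many documents. The paper aims to fill this gap and address the challenges of aggregative queries.
Result: Experiments show that S-RAG substantially outperforms common RAG systems and long-context LLMs on the new datasets and a public benchmark.
Insight: Structured representations and translation into formal queries are an effective route to answering aggregative queries, and the approach could be extended to more complex multimodal and multi-document settings.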
Abstract: Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.
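To illustrate the ingestion-time and inference-time split described above, here is a minimal sketch using an in-memory SQLite table. The schema, the sample records, and the hard-coded SQL are hypothetical; in S-RAG the structured representation and the translation from natural language to a formal query are produced by the system, and the abstract does not say which query language is used.

```python
import sqlite3
from typing import Iterable, Optional, Tuple

Record = Tuple[str, str, float]  # (hotel name, city, nightly price)


def ingest(records: Iterable[Record]) -> sqlite3.Connection:
    """Ingestion time: store extracted attributes in a structured table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotels (name TEXT, city TEXT, price REAL)")
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def average_price(conn: sqlite3.Connection, city: str) -> Optional[float]:
    """Inference time: a formal query standing in for the translated
    question "What is the average hotel price in <city>?"."""
    row = conn.execute(
        "SELECT AVG(price) FROM hotels WHERE city = ?", (city,)
    ).fetchone()
    return row[0]


# Hypothetical records, as if extracted from three separate documents.
store = ingest([
    ("Hotel A", "Paris", 100.0),
    ("Hotel B", "Paris", 200.0),
    ("Hotel C", "Rome", 90.0),
])
paris_average = average_price(store, "Paris")
```

The aggregate is computed over every matching row, which is the property a top-k passage retriever cannot guarantee when the relevant documents number in the hundreds.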
[117] Investigating CoT Monitorability in Large Reasoning Models cs.CLPDF
Shu Yang, Junchao Wu, Xilin Gou, Xuansheng Wu, Derek Wong
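TL;DR: The paper presents the first systematic study of chain-of-thought (CoT) monitorability in large reasoning models (LRMs). It examines whether models faithfully reflect their internal decision-making and how reliable monitors are, and it proposes a new monitoring approach, MoME.
Details
Motivation: LRMs improve performance through detailed reasoning, but it is unclear whether the reasoning faithfully reflects their decisions and how well potential misbehavior can be monitored, so a systematic study is needed.
Result: Empirical analysis shows that CoT interventions can affect monitoring effectiveness, while MoME can detect model misbehavior and provide structured evidence.
Insight: A model's reasoning may not be fully faithful, and monitor design must balance sensitivity and robustness; MoME offers a new monitoring paradigm for AI safety.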
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models’ long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by these two challenges, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. We then investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models’ misbehavior through their CoT and provide structured judgments along with supporting evidence.
[118] Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models cs.CL | cs.AI | cs.LG | cs.PFPDF
Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang
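TL;DR: The paper proposes Think-at-Hard (TaH), a dynamic iteration method that improves LLM reasoning by triggering extra latent iterations only at hard tokens, raising reasoning performance in a parameter-efficient way.
Details
Motivation: With a fixed number of iterations, redundant iterations can turn correct predictions for easy tokens into errors (latent overthinking), hurting both reasoning quality and efficiency. TaH aims to select hard tokens dynamically for iterative refinement.
Result: Across five benchmarks, TaH gains 8.1-11.3% accuracy over baselines that iterate twice on every token while exempting 94% of tokens from the second iteration; with less than 3% additional parameters, the gains rise to 8.5-12.6%.
Insight: Dynamic iteration outperforms fixed iteration, and focusing on hard tokens improves reasoning efficiently; the duo-causal attention mechanism is key to keeping computation parallel across the iteration dimension.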
Abstract: Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
cs.DC [Back]
[119] Intelligence per Watt: Measuring Intelligence Efficiency of Local AI cs.DC | cs.AI | cs.CL | cs.LGPDF
Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya
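TL;DR: The paper proposes intelligence per watt (IPW), a metric for the intelligence efficiency of local AI, and shows through a large-scale empirical study that small language models (LMs) on local accelerators can take over a meaningful share of queries from centralized infrastructure.
Details
Motivation: Rapidly growing demand for large language models (LLMs) is straining centralized cloud infrastructure. The paper asks whether local inference can redistribute that demand and ease the pressure.
Result: 1. Local LMs accurately answer 88.7% of single-turn chat and reasoning queries; 2. From 2023 to 2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%; 3. Local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models.
Insight: Local inference can meaningfully offload centralized infrastructure, IPW is the key metric for tracking this shift, and substantial headroom for optimization remains.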
Abstract: Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (i.e., laptops). We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals three findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain. Second, from 2023 to 2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.
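The IPW definition above (task accuracy divided by power) can be written down directly. The sketch below is one plausible reading, assuming accuracy is the fraction of correct answers and power is the mean draw in watts during inference; the function name and the sample numbers are made up for illustration and are not measurements from the paper.

```python
def intelligence_per_watt(correct: int, total: int,
                          mean_power_watts: float) -> float:
    """Accuracy (fraction of correct answers) divided by mean power draw.

    Illustrative reading of the IPW definition; the paper's profiling
    harness may aggregate accuracy and power differently.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if mean_power_watts <= 0:
        raise ValueError("mean_power_watts must be positive")
    return (correct / total) / mean_power_watts


# Hypothetical model-accelerator pairs answering the same 100 queries.
ipw_pair_a = intelligence_per_watt(correct=80, total=100, mean_power_watts=40.0)
ipw_pair_b = intelligence_per_watt(correct=80, total=100, mean_power_watts=56.0)
ratio = ipw_pair_a / ipw_pair_b
```

With equal accuracy, the IPW ratio between two pairs reduces to the inverse ratio of their power draw, which is what a statement such as "1.4x lower IPW on identical models" compares.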
cs.CY [Back]
[120] The Polite Liar: Epistemic Pathology in Language Models cs.CY | cs.AI | cs.CLPDF
Bentley DeVilling
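TL;DR: The paper discusses the "polite liar" phenomenon in large language models, in which a model sounds confident even when it lacks knowledge, and attributes it to a structural flaw in reinforcement learning from human feedback (RLHF).
Details
Motivation: Language models display false confidence in conversation because RLHF optimizes for user satisfaction rather than truthfulness, so models sound confident even without evidence.
Result: The analysis indicates that current RLHF methods tend to optimize for surface fluency and politeness rather than truthfulness and epistemic justification.
Insight: Future alignment work should reward justified confidence over perceived fluency to resolve the tension between linguistic cooperation and epistemic integrity.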
Abstract: Large language models exhibit a peculiar epistemic pathology: they speak as if they know, even when they do not. This paper argues that such confident fabrication, what I call the polite liar, is a structural consequence of reinforcement learning from human feedback (RLHF). Building on Frankfurt’s analysis of bullshit as communicative indifference to truth, I show that this pathology is not deception but structural indifference: a reward architecture that optimizes for perceived sincerity over evidential accuracy. Current alignment methods reward models for being helpful, harmless, and polite, but not for being epistemically grounded. As a result, systems learn to maximize user satisfaction rather than truth, performing conversational fluency as a virtue. I analyze this behavior through the lenses of epistemic virtue theory, speech-act philosophy, and cognitive alignment, showing that RLHF produces agents trained to mimic epistemic confidence without access to epistemic justification. The polite liar thus reveals a deeper alignment tension between linguistic cooperation and epistemic integrity. The paper concludes with an “epistemic alignment” principle: reward justified confidence over perceived fluency.
cs.LG [Back]
[121] LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows cs.LG | cs.AI | cs.CL | stat.MLPDF
Raffi Khatchadourian, Rolando Franco
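TL;DR: The paper quantifies the nondeterminism of LLM outputs (output drift) on financial tasks, challenges the assumption that larger models are universally superior for production deployment, and proposes a finance-calibrated test harness and audit system.
Details
Motivation: Finance demands highly deterministic and trustworthy LLM outputs, but output drift undermines auditability and trust, so a solution is needed.
Result: Small models (7B-8B) produce 100% consistent outputs at T=0.0, while the large model (120B) reaches only 12.5% consistency; structured tasks such as SQL are more stable.
Insight: Sensitivity to output drift depends strongly on task type, and cross-provider validation shows that deterministic behavior transfers between deployments, offering a path to compliance-ready AI deployment.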
Abstract: Financial institutions deploy Large Language Models (LLMs) for reconciliations, regulatory reporting, and client communications, but nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures (7B-120B parameters) on regulated financial tasks, revealing a stark inverse relationship: smaller models (Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at T=0.0, while GPT-OSS-120B exhibits only 12.5% consistency (95% CI: 3.5-36.0%) regardless of configuration (p<0.0001, Fisher’s exact test). This finding challenges conventional assumptions that larger models are universally superior for production deployment. Our contributions include: (i) a finance-calibrated deterministic test harness combining greedy decoding (T=0.0), fixed seeds, and SEC 10-K structure-aware retrieval ordering; (ii) task-specific invariant checking for RAG, JSON, and SQL outputs using finance-calibrated materiality thresholds (plus or minus 5%) and SEC citation validation; (iii) a three-tier model classification system enabling risk-appropriate deployment decisions; and (iv) an audit-ready attestation system with dual-provider validation. We evaluated five models (Qwen2.5-7B via Ollama, Granite-3-8B via IBM watsonx.ai, Llama-3.3-70B, Mistral-Medium-2505, and GPT-OSS-120B) across three regulated financial tasks. Across 480 runs (n=16 per condition), structured tasks (SQL) remain stable even at T=0.2, while RAG tasks show drift (25-75%), revealing task-dependent sensitivity. Cross-provider validation confirms deterministic behavior transfers between local and cloud deployments. We map our framework to Financial Stability Board (FSB), Bank for International Settlements (BIS), and Commodity Futures Trading Commission (CFTC) requirements, demonstrating practical pathways for compliance-ready AI deployments.
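The abstract reports 12.5% consistency with a 95% CI of 3.5-36.0% and n=16 per condition, but does not name the interval method. As a hedged worked example, assuming 2 consistent runs out of 16, a Wilson score interval gives roughly 3.5% to 36.0%, which matches the reported range. The function name and the assumption of 2 successes are mine, not the paper's.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


# Assumed counts: 2 of 16 runs produced identical output (12.5%).
low, high = wilson_interval(2, 16)  # approximately (0.035, 0.360)
```

The interval is wide because 16 runs is a small sample, so the point estimate of 12.5% should be read together with its upper bound of about 36%.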
[122] DynaAct: Large Language Model Reasoning with Dynamic Action Spaces cs.LG | cs.CLPDF
Xueliang Zhao, Wei Wu, Jian Guan, Qintong Li, Lingpeng Kong
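TL;DR: DynaAct dynamically constructs a compact action space: it uses large language models to extract general sketches, then selects actions with a submodular function and a greedy algorithm, substantially improving efficiency and performance on complex reasoning problems.
Details
Motivation: Constructing the action space is critical in modern sequential decision-making systems, but existing approaches rely on manually defined or unstructured spaces and lack scalability or computational efficiency. DynaAct aims to solve this problem.
Result: On six standard benchmarks, DynaAct significantly improves performance while maintaining efficient inference without introducing substantial latency.
Insight: By constructing the action space dynamically, DynaAct shows the potential of combining large language models with optimization algorithms for complex reasoning and offers a new approach to sequential decision-making.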
Abstract: In modern sequential decision-making systems, the construction of an optimal candidate action space is critical to efficient inference. However, existing approaches either rely on manually defined action spaces that lack scalability or utilize unstructured spaces that render exhaustive search computationally prohibitive. In this paper, we propose a novel framework named DynaAct for automatically constructing a compact action space to enhance sequential reasoning in complex problem-solving scenarios. Our method first estimates a proxy for the complete action space by extracting general sketches observed in a corpus covering diverse complex reasoning problems using large language models. We then formulate a submodular function that jointly evaluates candidate actions based on their utility to the current state and their diversity, and employ a greedy algorithm to select an optimal candidate set. Extensive experiments on six diverse standard benchmarks demonstrate that our approach significantly improves overall performance, while maintaining efficient inference without introducing substantial latency. The implementation is available at https://github.com/zhaoxlpku/DynaAct.
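The selection step described above (a submodular function over utility and diversity, maximized greedily) can be sketched generically. The objective below, a sum of utilities plus a weighted count of distinct features covered, is an illustrative monotone submodular function chosen for simplicity; it is an assumption, not DynaAct's actual objective, and all names and sample data are hypothetical.

```python
from typing import Dict, List, Set


def greedy_select(candidates: List[str], utility: Dict[str, float],
                  features: Dict[str, Set[str]], k: int,
                  diversity_weight: float = 1.0) -> List[str]:
    """Greedily pick up to k actions by marginal gain.

    Marginal gain = utility of the action + diversity_weight times the
    number of features it covers that no selected action covers yet.
    """
    selected: List[str] = []
    covered: Set[str] = set()
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        # Recompute marginal gains against the current covered set.
        best = max(
            remaining,
            key=lambda a: utility[a] + diversity_weight * len(features[a] - covered),
        )
        selected.append(best)
        covered |= features[best]
        remaining.remove(best)
    return selected


# Hypothetical candidate actions tagged with the reasoning skills they use.
picked = greedy_select(
    candidates=["decompose", "split_cases", "verify"],
    utility={"decompose": 1.0, "split_cases": 0.9, "verify": 0.5},
    features={"decompose": {"planning"}, "split_cases": {"planning"},
              "verify": {"checking"}},
    k=2,
)
```

Here the second pick is "verify" rather than the higher-utility "split_cases", because "split_cases" adds no new coverage once "decompose" is selected; this diminishing-returns behavior is what makes greedy selection a reasonable fit for submodular objectives.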
[123] Towards Personalized Quantum Federated Learning for Anomaly Detection cs.LG | cs.CV | quant-phPDF
Ratun Rahman, Sina Shaham, Dinh C. Nguyen
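TL;DR: The paper proposes personalized quantum federated learning (PQFL), a framework for anomaly detection that addresses heterogeneity in hardware, noise, and data distribution across clients in quantum networks and substantially improves detection accuracy.
Details
Motivation: Conventional quantum federated learning handles heterogeneous quantum clients (differing in hardware capabilities, noise levels, and data encoding) poorly, and training a single global model is especially ineffective on non-IID data.
Result: Experiments show that PQFL substantially improves anomaly detection, reducing false errors by up to 23% and achieving gains of 24.2% in AUROC and 20.5% in AUPR.
Insight: Heterogeneity in quantum federated learning calls for personalization, and PQFL's quantum-centric design offers a scalable solution for practical quantum networks.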
Abstract: Anomaly detection has a significant impact on applications such as video surveillance, medical diagnostics, and industrial monitoring, where anomalies frequently depend on context and anomaly-labeled data are limited. Quantum federated learning (QFL) overcomes these concerns by distributing model training among several quantum clients, consequently eliminating the requirement for centralized quantum storage and processing. However, in real-life quantum networks, clients frequently differ in terms of hardware capabilities, circuit designs, noise levels, and how classical data is encoded or preprocessed into quantum states. These differences create inherent heterogeneity across clients - not just in their data distributions, but also in their quantum processing behaviors. As a result, training a single global model becomes ineffective, especially when clients handle imbalanced or non-identically distributed (non-IID) data. To address this, we propose a new framework called personalized quantum federated learning (PQFL) for anomaly detection. PQFL enhances local model training at quantum clients using parameterized quantum circuits and classical optimizers, while introducing a quantum-centric personalization strategy that adapts each client’s model to its own hardware characteristics and data representation. Extensive experiments show that PQFL significantly improves anomaly detection accuracy under diverse and realistic conditions. Compared to state-of-the-art methods, PQFL reduces false errors by up to 23%, and achieves gains of 24.2% in AUROC and 20.5% in AUPR, highlighting its effectiveness and scalability in practical quantum federated settings.
[124] From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training cs.LG | cs.CVPDF
Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen
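TL;DR: The paper proposes a novel two-stage, token-level entropy optimization method for RLVR that guides training from exploration to exploitation, improving the robustness of multimodal large language models (MLLMs) to noisy annotations.
Details
Motivation: Existing unsupervised RLVR methods based on entropy minimization can overfit to noisy labels and fail to provide a reliable reward ranking signal for Group-Relative Policy Optimization (GRPO), so a more robust training strategy is needed.
Result: Across several MLLM backbones (Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B) and noise settings, the method consistently outperforms prior approaches.
Insight: The phased entropy strategy improves noise tolerance and also unifies external, internal, and entropy-based methods, offering a new approach to RLVR training.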
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones - Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B - spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
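The two phases described above differ in the sign of the token-level entropy term: maximized during exploration, minimized during exploitation. The sketch below computes mean token entropy and flips the sign at a fixed step. The hard switch at `switch_step`, the `weight` value, and the function names are assumptions; the abstract does not say how the transition is scheduled or how the term is combined with the GRPO objective.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_term(step: int, switch_step: int,
                 token_distributions: List[Sequence[float]],
                 weight: float = 0.01) -> float:
    """Entropy term to ADD to an objective that is being maximized.

    Before switch_step the term rewards high entropy (exploration);
    from switch_step onward it penalizes entropy (exploitation).
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    mean_entropy = sum(token_entropy(p) for p in token_distributions) / len(
        token_distributions
    )
    sign = 1.0 if step < switch_step else -1.0
    return sign * weight * mean_entropy


# A single maximally uncertain binary token, before and after the switch.
explore = entropy_term(0, 10, [[0.5, 0.5]], weight=1.0)
exploit = entropy_term(10, 10, [[0.5, 0.5]], weight=1.0)
```

A smooth schedule that anneals the coefficient from positive to negative would be an equally plausible reading of "dynamically guides the model from exploration to exploitation".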
[125] Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment cs.LG | cs.CVPDF
Hua Ye, Hang Ding, Siyuan Chen, Yiyang Jiang, Changyuan Zhang
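TL;DR: BACL introduces boundary-aware curriculum learning, combining progressively harder negative sampling with a contrastive local attention loss to improve multimodal alignment, with substantial gains and no extra labels.
Details
Motivation: Existing methods treat all negatives alike and overlook the ambiguous negatives that differ from the positive only in a small detail.
Result: Experiments show substantial gains on four benchmarks, with up to +32% Recall@1, and the theory predicts an O(1/n) error rate.
Insight: Borderline cases can serve as an effective curriculum signal, and local attention can pinpoint where a multimodal mismatch occurs.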
Abstract: Most multimodal models treat every negative pair alike, ignoring the ambiguous negatives that differ from the positive by only a small detail. We propose Boundary-Aware Curriculum with Local Attention (BACL), a lightweight add-on that turns these borderline cases into a curriculum signal. A Boundary-aware Negative Sampler gradually raises difficulty, while a Contrastive Local Attention loss highlights where the mismatch occurs. The two modules are fully differentiable and work with any off-the-shelf dual encoder. Theory predicts a fast O(1/n) error rate; practice shows up to +32% R@1 over CLIP and new SOTA on four large-scale benchmarks, all without extra labels.
[126] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics cs.LG | cs.AI | cs.CV | stat.MLPDF
Randall Balestriero, Yann LeCun
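TL;DR: LeJEPA is a theoretically grounded, scalable self-supervised learning framework. Its SIGReg regularization objective constrains embeddings to follow an isotropic Gaussian distribution, simplifying training and improving stability.
Details
Motivation: Learning manipulable representations of the world is central to AI, but existing JEPAs lack theory and practical guidance, which has left research dependent on heuristics.
Result: Validated on 10+ datasets and 60+ architectures; a ViT-H/14 reaches 79% accuracy in ImageNet-1k linear evaluation.
Insight: LeJEPA's combination of theoretical grounding and practical simplicity could help re-establish self-supervised learning as a core pillar of AI research.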
Abstract: Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs’ embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only approximately 50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo: git@github.com:rbalestr-lab/lejepa.git).
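cs.RO
[127] ViPRA: Video Prediction for Robot Actions cs.RO | cs.AI | cs.CL | cs.CV | cs.LG
Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, Deepak Pathak
TL;DR: ViPRA is a framework that learns continuous robot control from videos without action labels. It predicts future visual observations together with motion-centric latent actions, and trains those latent actions with perceptual losses and optical flow consistency. For downstream control, a chunked flow matching decoder maps latent actions to robot-specific continuous action sequences. The method avoids expensive action annotation, supports generalization across embodiments, and outperforms strong baselines on real-world tasks.
Details
Motivation: Most videos lack action labels, which limits their use in robot learning. ViPRA aims to learn continuous control from such actionless videos and reduce the dependence on action annotation.
Result: ViPRA gains 16% on the SIMPLER benchmark and 13% across real-world manipulation tasks, and supports high-frequency control at up to 22 Hz.
Insight: Explicitly modeling latent actions captures both what changes in a scene and how it changes, avoiding the limitations of treating pretraining as autoregressive policy learning.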
Abstract: Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We will release models and code at https://vipra-project.github.io
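[128] PerspAct: Enhancing LLM Situated Collaboration Skills through Perspective Taking and Active Vision cs.RO | cs.AI | cs.CL | cs.HC
Sabrina Patania, Luca Annese, Anita Pellegrini, Silvia Serino, Anna Lambiase
TL;DR: The paper studies whether perspective taking and active vision can improve the collaboration skills of large language models (LLMs). It finds that combining the ReAct framework with active exploration strategies significantly improves the model's interpretative accuracy and collaborative effectiveness.
Details
Motivation: Current LLMs lack the ability to reason about differing perspectives in multi-agent interaction, which hurts their performance on collaborative tasks. The paper addresses this by integrating perspective taking with active vision.
Result: Experiments show that explicit perspective cues combined with active exploration strategies significantly improve model performance, particularly in resolving referential ambiguity and in collaborative tasks.
Insight: The work offers a new direction for applying LLMs in multi-agent systems and robotics, and underlines the value of combining active perception with perspective-taking mechanisms.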
Abstract: Recent advances in Large Language Models (LLMs) and multimodal foundation models have significantly broadened their application in robotics and collaborative systems. However, effective multi-agent interaction necessitates robust perspective-taking capabilities, enabling models to interpret both physical and epistemic viewpoints. Current training paradigms often neglect these interactive contexts, resulting in challenges when models must reason about the subjectivity of individual perspectives or navigate environments with multiple observers. This study evaluates whether explicitly incorporating diverse points of view using the ReAct framework, an approach that integrates reasoning and acting, can enhance an LLM’s ability to understand and ground the demands of other agents. We extend the classic Director task by introducing active visual exploration across a suite of seven scenarios of increasing perspective-taking complexity. These scenarios are designed to challenge the agent’s capacity to resolve referential ambiguity based on visual access and interaction, under varying state representations and prompting strategies, including ReAct-style reasoning. Our results demonstrate that explicit perspective cues, combined with active exploration strategies, significantly improve the model’s interpretative accuracy and collaborative effectiveness. These findings highlight the potential of integrating active perception with perspective-taking mechanisms in advancing LLMs’ application in robotics and multi-agent systems, setting a foundation for future research into adaptive and context-aware AI systems.
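[129] RoboTAG: End-to-end Robot Configuration Estimation via Topological Alignment Graph cs.RO | cs.CV
Yifan Liu, Fangneng Zhan, Wanhua Li, Haowen Sun, Katerina Fragkiadaki
TL;DR: The paper proposes RoboTAG, which estimates robot pose from a monocular RGB image using a topological alignment graph, injecting 3D priors and reducing the reliance on labeled data.
Details
Motivation: Existing methods build on 2D visual backbones and depend heavily on labeled data, neglect 3D priors, and struggle with the sim-to-real gap.
Result: Experiments show that the method is effective across robot types and has the potential to alleviate the data bottleneck.
Insight: The topological alignment graph with consistency supervision combines 2D and 3D information, reducing reliance on labeled data and improving the robustness of pose estimation.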
Abstract: Estimating robot pose from a monocular RGB image is a challenge in robotics and computer vision. Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training, which is often scarce in real-world scenarios, causing a sim-to-real gap. Moreover, these approaches reduce the 3D-based problem to the 2D domain, neglecting the 3D priors. To address these issues, we propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations, alleviating the reliance on labels. Specifically, RoboTAG consists of a 3D branch and a 2D branch, where nodes represent the states of the camera and robot system, and edges capture the dependencies between these variables or denote alignments between them. Closed loops are then defined in the graph, on which a consistency supervision across branches can be applied. This design allows us to utilize in-the-wild images as training data without annotations. Experimental results demonstrate that our method is effective across robot types, highlighting its potential to alleviate the data bottleneck in robotics.
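[130] SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control cs.RO | cs.AI | cs.CV | cs.GR | eess.SY
Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen
TL;DR: The paper presents SONIC, which builds a generalist humanoid controller by scaling up model capacity, data, and compute, using motion tracking as the core task to achieve natural and robust whole-body control.
Details
Motivation: Large-scale foundation models have succeeded in many domains, yet humanoid controllers remain limited to small models and a narrow set of behaviors. The authors aim to improve the generality and robustness of humanoid control through comparable large-scale training.
Result: Performance improves steadily with more compute and greater data diversity, and the learned representations generalize to unseen motions. The model also supports practical use through real-time motion planning and multiple motion input interfaces.
Insight: Motion tracking shows favorable properties under scaling and can serve as a practical foundation task for humanoid control, pointing to a new direction for general-purpose humanoid control.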
Abstract: Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
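cs.AR
[131] Re$^{\text{2}}$MaP: Macro Placement by Recursively Prototyping and Packing Tree-based Relocating cs.AR | cs.CV | eess.SY
Yunqi Shi, Xi Lin, Zhiang Wang, Siyuan Xu, Shixiong Kai
TL;DR: Re$^{\text{2}}$MaP produces high-quality macro placements through recursive prototyping and packing tree-based relocating, substantially improving timing metrics and design rule check results in chip design.
Details
Motivation: Existing macro placement methods handle complex connectivity and design constraints poorly. Re$^{\text{2}}$MaP aims to improve placement quality and efficiency through recursion and angle-based optimization.
Result: WNS and TNS improve by 10.26% and 33.97% on average, respectively, compared to Hier-RTLMP; the method also ranks higher than ReMaP.
Insight: Recursive iteration and angle-based optimization balance uniform macro distribution against timing optimization, and the expertise-inspired cost function performs well under multiple constraints.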
Abstract: This work introduces the Re$^{\text{2}}$MaP method, which generates expert-quality macro placements through recursively prototyping and packing tree-based relocating. We first perform multi-level macro grouping and PPA-aware cell clustering to produce a unified connection matrix that captures both wirelength and dataflow among macros and clusters. Next, we use DREAMPlace to build a mixed-size placement prototype and obtain reference positions for each macro and cluster. Based on this prototype, we introduce ABPlace, an angle-based analytical method that optimizes macro positions on an ellipse to distribute macros uniformly near the chip periphery, while optimizing wirelength and dataflow. A packing tree-based relocating procedure is then designed to jointly adjust the locations of macro groups and the macros within each group, by optimizing an expertise-inspired cost function that captures various design constraints through evolutionary search. Re$^{\text{2}}$MaP repeats the above process: only a subset of macro groups is positioned in each iteration, and the remaining macros are deferred to the next iteration to improve the prototype’s accuracy. Using a well-established backend flow with sufficient timing optimizations, Re$^{\text{2}}$MaP achieves up to 22.22% (average 10.26%) improvement in worst negative slack (WNS) and up to 97.91% (average 33.97%) improvement in total negative slack (TNS) compared to the state-of-the-art academic placer Hier-RTLMP. It also ranks higher on WNS, TNS, power, design rule check (DRC) violations, and runtime than the conference version ReMaP, across seven tested cases. Our code is available at https://github.com/lamda-bbo/Re2MaP.
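For readers unfamiliar with the two timing metrics reported above, a minimal sketch of how WNS and TNS are commonly computed from per-endpoint slacks. The function name and the convention of reporting a WNS of 0 when no endpoint violates timing are assumptions of this sketch; timing tools differ on that convention.

```python
def wns_tns(endpoint_slacks):
    """Return (worst negative slack, total negative slack).

    endpoint_slacks: slack per timing endpoint, negative when the
    endpoint misses its timing constraint.
    """
    negative = [s for s in endpoint_slacks if s < 0]
    # WNS is the single most negative slack (0.0 here if timing is met).
    wns = min(negative, default=0.0)
    # TNS sums every violating endpoint's slack.
    tns = sum(negative)
    return wns, tns
```

For slacks `[0.2, -0.1, -0.3, 0.05]`, the worst violation is -0.3 and the violations sum to about -0.4, so a placement can improve TNS considerably while leaving WNS unchanged.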
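cs.SD
[132] SpeechJudge: Towards Human-Level Judgment for Speech Naturalness cs.SD | cs.AI | cs.CL
Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang
TL;DR: The paper introduces SpeechJudge, a suite comprising a dataset, a benchmark, and a reward model for evaluating and improving the naturalness of synthesized speech, substantially narrowing the gap between existing models and human judgment.
Details
Motivation: Speech synthesis lacks a large-scale human preference dataset, which limits how well models can be aligned with human perception. SpeechJudge aims to fill this gap and improve naturalness evaluation for speech synthesis.
Result: SpeechJudge-GRM reaches 77.2% accuracy on the benchmark (79.4% with inference-time scaling), compared to 72.7% for a Bradley-Terry reward model.
Insight: Combining human feedback with reinforcement learning effectively improves naturalness alignment for speech synthesis models; AudioLLMs still have significant room for improvement on naturalness judgment.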
Abstract: Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
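The Bradley-Terry baseline mentioned above can be illustrated with a small sketch. It shows the standard Bradley-Terry preference probability, the pairwise training loss, and a pairwise accuracy metric of the kind a benchmark over speech pairs might report. The function names and the scalar-reward interface are assumptions for illustration, not the paper's code.

```python
import math


def preference_probability(reward_a, reward_b):
    """Bradley-Terry probability that sample A is preferred over B."""
    d = reward_a - reward_b
    # Numerically stable sigmoid(d).
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human-preferred sample winning."""
    d = reward_chosen - reward_rejected
    # -log(sigmoid(d)) = softplus(-d), written to avoid overflow.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def pairwise_accuracy(pairs):
    """pairs: (reward of human-preferred sample, reward of the other)."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)
```

A reward model that scores both samples equally yields a preference probability of 0.5 and a loss of log 2; accuracy counts how often the preferred sample receives the strictly higher reward.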
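cs.CR
[133] SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought cs.CR | cs.AI | cs.CL | cs.LG
Shourya Batra, Pierce Tillman, Samarth Gaggar, Shashank Kesineni, Kevin Zhu
TL;DR: SALT is a lightweight test-time intervention that injects targeted steering vectors into hidden states to reduce privacy leakage in the chain of thought (CoT) of large language models (LLMs), while preserving task performance and utility.
Details
Motivation: As LLMs become personal assistants with access to sensitive user data, privacy leakage through their internal reasoning is an increasingly pressing problem, calling for a method that protects privacy without harming reasoning ability.
Result: Across several LLMs, SALT reduces contextual privacy leakage by 18.2% on QwQ-32B, 17.9% on Llama-3.1-8B, and 31.2% on Deepseek, while maintaining comparable task performance.
Insight: SALT shows that test-time privacy protection for LLMs is feasible, offering a path toward safer deployment of personal assistants.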
Abstract: As Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data, they face a critical privacy challenge: while prior work has addressed output-level privacy, recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. These leaky thoughts occur when models inadvertently expose sensitive details in their reasoning traces, even when final outputs appear safe. The challenge lies in preventing such leakage without compromising the model’s reasoning capabilities, requiring a delicate balance between privacy and utility. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in a model’s Chain of Thought (CoT) by injecting targeted steering vectors into hidden states. We identify the high-leakage layers responsible for this behavior. Through experiments across multiple LLMs, we demonstrate that SALT reduces contextual privacy leakage (CPL) by 18.2% on QwQ-32B, 17.9% on Llama-3.1-8B, and 31.2% on Deepseek on the contextual privacy leakage dataset AirGapAgent-R, while maintaining comparable task performance and utility. Our work establishes SALT as a practical approach for test-time privacy protection in reasoning-capable language models, offering a path toward safer deployment of LLM-based personal agents.
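A minimal sketch of activation steering in general, to make "injecting steering vectors into hidden states" concrete. The abstract does not say how SALT derives its vectors; the difference-of-means construction, the function names, and the `strength` parameter below are assumptions borrowed from common activation-steering practice.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(safe_states, leaky_states):
    """Hypothetical steering direction: mean safe state minus mean leaky state."""
    safe_mean = mean_vector(safe_states)
    leaky_mean = mean_vector(leaky_states)
    return [s - l for s, l in zip(safe_mean, leaky_mean)]


def steer(hidden_state, vector, strength=1.0):
    """Shift one hidden state along the steering direction at test time."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]
```

In a real model the shift would be applied at selected layers during generation; here plain lists stand in for hidden states so the arithmetic is easy to follow.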
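[134] Class-feature Watermark: A Resilient Black-box Watermark Against Model Extraction Attacks cs.CR | cs.CV | cs.LG
Yaxin Xiao, Qingqing Ye, Zi Liang, Haoyang Li, RongHua Li
TL;DR: The paper proposes Class-Feature Watermarks (CFW), a black-box watermarking method designed to withstand model extraction attacks (MEA) and the Watermark Removal attacK (WRK). CFW improves robustness by using class-level watermark artifacts and clearly outperforms existing methods.
Details
Motivation: Current black-box watermarks resist model extraction attacks through representation entanglement, but their resilience to sequential MEAs and removal attacks is insufficiently explored. The authors find that existing removal methods are weakened by the constraints entanglement imposes, and propose both a stronger attack and a more resilient watermark.
Result: WRK reduces the watermark success rate of existing methods by at least 88.79%, whereas CFW maintains a watermark success rate of at least 70.15% in extracted models even under combined MEA and WRK distortion.
Insight: Class-level watermark artifacts are more robust than sample-level ones, and using out-of-domain samples resists attacks effectively while preserving model utility.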
Abstract: Machine learning models constitute valuable intellectual property, yet remain vulnerable to model extraction attacks (MEA), where adversaries replicate their functionality through black-box queries. Model watermarking counters MEAs by embedding forensic markers for ownership verification. Current black-box watermarks prioritize MEA survival through representation entanglement, yet inadequately explore resilience against sequential MEAs and removal attacks. Our study reveals that this risk is underestimated because existing removal methods are weakened by entanglement. To address this gap, we propose Watermark Removal attacK (WRK), which circumvents entanglement constraints by exploiting decision boundaries shaped by prevailing sample-level watermark artifacts. WRK effectively reduces watermark success rates by at least 88.79% across existing watermarking benchmarks. For robust protection, we propose Class-Feature Watermarks (CFW), which improve resilience by leveraging class-level artifacts. CFW constructs a synthetic class using out-of-domain samples, eliminating vulnerable decision boundaries between original domain samples and their artifact-modified counterparts (watermark samples). CFW concurrently optimizes both MEA transferability and post-MEA stability. Experiments across multiple domains show that CFW consistently outperforms prior methods in resilience, maintaining a watermark success rate of at least 70.15% in extracted models even under the combined MEA and WRK distortion, while preserving the utility of protected models.
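eess.IV
[135] EvoPS: Evolutionary Patch Selection for Whole Slide Image Analysis in Computational Pathology eess.IV | cs.CV | cs.LG
Saya Hashemian, Azam Asilian Bidgoli
TL;DR: EvoPS is an evolutionary search method based on multi-objective optimization for selecting key image patches in computational pathology, sharply reducing computational cost while maintaining or improving diagnostic performance.
Details
Motivation: The high resolution of whole slide images (WSIs) means a large number of patches must be analyzed, and existing methods struggle to balance computational cost against diagnostic accuracy.
Result: EvoPS reduces the number of training patch embeddings by over 90% while maintaining or improving the classification F1-score.
Insight: EvoPS offers an efficient, accurate, and interpretable patch selection method for WSI analysis that balances computational cost and diagnostic performance.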
Abstract: In computational pathology, the gigapixel scale of Whole-Slide Images (WSIs) necessitates their division into thousands of smaller patches. Analyzing these high-dimensional patch embeddings is computationally expensive and risks diluting key diagnostic signals with many uninformative patches. Existing patch selection methods often rely on random sampling or simple clustering heuristics and typically fail to explicitly manage the crucial trade-off between the number of selected patches and the accuracy of the resulting slide representation. To address this gap, we propose EvoPS (Evolutionary Patch Selection), a novel framework that formulates patch selection as a multi-objective optimization problem and leverages an evolutionary search to simultaneously minimize the number of selected patch embeddings and maximize the performance of a downstream similarity search task, generating a Pareto front of optimal trade-off solutions. We validated our framework across four major cancer cohorts from The Cancer Genome Atlas (TCGA) using five pretrained deep learning models to generate patch embeddings, including both supervised CNNs and large self-supervised foundation models. The results demonstrate that EvoPS can reduce the required number of training patch embeddings by over 90% while consistently maintaining or even improving the final classification F1-score compared to a baseline that uses all available patches’ embeddings selected through a standard extraction pipeline. The EvoPS framework provides a robust and principled method for creating efficient, accurate, and interpretable WSI representations, empowering users to select an optimal balance between computational cost and diagnostic performance.
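To make the "Pareto front of optimal trade-off solutions" concrete, a small sketch that filters candidate patch subsets down to the non-dominated ones, minimizing the number of patches and maximizing a performance score. It covers only the final selection step, not the evolutionary search itself; the `(num_patches, score)` representation is an assumption for illustration.

```python
def pareto_front(candidates):
    """Return non-dominated (num_patches, score) pairs, sorted by size.

    A candidate is dominated if another one uses no more patches and
    scores no worse, and is strictly better on at least one objective.
    """
    front = []
    for n, s in candidates:
        dominated = any(
            (n2 <= n and s2 >= s) and (n2 < n or s2 > s)
            for n2, s2 in candidates
        )
        if not dominated:
            front.append((n, s))
    # Remove exact duplicates and order from fewest to most patches.
    return sorted(set(front))
```

Each point on the returned front is a defensible choice: moving along it trades additional patches for additional score, and the user picks the balance that suits their compute budget.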
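[136] Deep Learning Analysis of Prenatal Ultrasound for Identification of Ventriculomegaly eess.IV | cs.CV
Youssef Megahed, Inok Lee, Robin Ducharme, Aylin Erman, Olivier X. Miguel
TL;DR: The paper fine-tunes USF-MAE, a deep learning model, to detect ventriculomegaly in prenatal ultrasound images. It outperforms the baseline models (VGG-19, ResNet-50, and ViT-B/16) and offers good clinical interpretability.
Details
Motivation: Ventriculomegaly is an abnormality of the fetal brain, and early diagnosis is important for assessing the risk of genetic syndromes. Existing models fall short in performance and interpretability, so improvement is needed.
Result: The model reaches an F1-score of 91.76% (cross-validation) and 91.78% (test set), with an accuracy of 97.24%. Eigen-CAM heatmaps show that the model attends to the ventricle area, supporting its clinical plausibility.
Insight: Self-supervised pretraining followed by fine-tuning substantially improves model performance; Eigen-CAM improves interpretability, which matters for clinical use.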
Abstract: The proposed study aimed to develop a deep learning model capable of detecting ventriculomegaly on prenatal ultrasound images. Ventriculomegaly is a prenatal condition characterized by dilated cerebral ventricles of the fetal brain and is important to diagnose early, as it can be associated with an increased risk for fetal aneuploidies and/or underlying genetic syndromes. An Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), recently developed by our group, was fine-tuned for a binary classification task to distinguish fetal brain ultrasound images as either normal or showing ventriculomegaly. The USF-MAE incorporates a Vision Transformer encoder pretrained on more than 370,000 ultrasound images from the OpenUS-46 corpus. For this study, the pretrained encoder was adapted and fine-tuned on a curated dataset of fetal brain ultrasound images to optimize its performance for ventriculomegaly detection. Model evaluation was conducted using 5-fold cross-validation and an independent test cohort, and performance was quantified using accuracy, precision, recall, specificity, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed USF-MAE model reached an F1-score of 91.76% on the 5-fold cross-validation and 91.78% on the independent test set. These scores exceed those of the baseline models by 19.37% and 16.15% (VGG-19), 2.31% and 2.56% (ResNet-50), and 5.03% and 11.93% (ViT-B/16), respectively. The model also showed a high mean test precision of 94.47% and an accuracy of 97.24%. The Eigen-CAM (Eigen Class Activation Map) heatmaps showed that the model focused on the ventricle area when diagnosing ventriculomegaly, which supports its explainability and clinical plausibility.
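The threshold-based metrics listed in the abstract all follow from the four cells of a binary confusion matrix. The sketch below computes them with their standard definitions; the function name and the example counts are illustrative only and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity, and F1 from confusion counts.

    tp/fn: ventriculomegaly images classified correctly/incorrectly.
    tn/fp: normal images classified correctly/incorrectly.
    """
    def ratio(num, den):
        # Return 0.0 when a metric is undefined (empty denominator).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }
```

Because F1 ignores true negatives, a classifier on an imbalanced cohort can have high accuracy and a noticeably lower F1, which is one reason both are usually reported.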
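cs.IR
[137] A Hybrid Multimodal Deep Learning Framework for Intelligent Fashion Recommendation cs.IR | cs.CV
Kamand Kalashi, Babak Teimourpour
TL;DR: The paper proposes a hybrid multimodal deep learning framework that combines visual and textual information for fashion recommendation, covering outfit compatibility prediction and complementary item retrieval, with strong results on both tasks.
Details
Motivation: The rapid growth of online fashion platforms calls for intelligent recommender systems that understand both visual and textual information, in order to improve recommendation accuracy and diversity.
Result: On the Polyvore dataset, outfit compatibility prediction reaches an AUC of 0.95, and complementary item retrieval reaches 69.24% accuracy under the FITB metric.
Insight: Multimodal learning performs strongly in fashion recommendation; combining visual and textual information gives a fuller picture of user needs and improves recommendations.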
Abstract: The rapid expansion of online fashion platforms has created an increasing demand for intelligent recommender systems capable of understanding both visual and textual cues. This paper proposes a hybrid multimodal deep learning framework for fashion recommendation that jointly addresses two key tasks: outfit compatibility prediction and complementary item retrieval. The model leverages the visual and textual encoders of the CLIP architecture to obtain joint latent representations of fashion items, which are then integrated into a unified feature vector and processed by a transformer encoder. For compatibility prediction, an “outfit token” is introduced to model the holistic relationships among items, achieving an AUC of 0.95 on the Polyvore dataset. For complementary item retrieval, a “target item token” representing the desired item description is used to retrieve compatible items, reaching an accuracy of 69.24% under the Fill-in-the-Blank (FITB) metric. The proposed approach demonstrates strong performance across both tasks, highlighting the effectiveness of multimodal learning for fashion recommendation.
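The Fill-in-the-Blank (FITB) metric can be sketched as follows: each question gives the remaining items of an outfit plus several candidates, and the model is correct when its top-scoring candidate is the held-out item. The scoring rule below (mean cosine similarity to the context items) is a simple stand-in chosen for illustration; the paper scores candidates with a transformer and a target item token. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_fitb(context, candidates):
    """Index of the candidate most similar, on average, to the outfit items."""
    scores = [
        sum(cosine(candidate, item) for item in context) / len(context)
        for candidate in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)


def fitb_accuracy(questions):
    """questions: (context embeddings, candidate embeddings, correct index)."""
    correct = sum(
        1 for context, candidates, answer in questions
        if answer_fitb(context, candidates) == answer
    )
    return correct / len(questions)
```

With four candidates per question, random guessing would score 25%, which gives a reference point for reading FITB accuracies.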
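quant-ph
[138] Hybrid Quantum-Classical Selective State Space Artificial Intelligence quant-ph | cs.AI | cs.CL
Amin Ebrahimi, Farzan Haddadi
TL;DR: The paper proposes a hybrid quantum-classical (HQC) selective state space AI algorithm for temporal sequence classification. Variational quantum circuits (VQCs) enhance feature extraction and the suppression of irrelevant information, improving model performance and efficiency.
Details
Motivation: The computational bottlenecks of current deep learning architectures, especially in natural language processing, stem from massive matrix multiplications and high-dimensional optimization, while quantum computing can offer more efficient representation learning. The paper aims to exploit quantum resources to address these bottlenecks.
Result: On the MNIST dataset, the hybrid model with a quantum layer reaches 24.6% accuracy within only four epochs, versus 21.6% for the purely classical approach.
Insight: Quantum-enhanced gating mechanisms may be a viable path toward scalable, resource-efficient NLP models, showing promise particularly under limited compute.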
Abstract: Hybrid Quantum Classical (HQC) algorithms constitute one of the most effective paradigms for exploiting the computational advantages of quantum systems in large-scale numerical tasks. By operating in high-dimensional Hilbert spaces, quantum circuits enable exponential speed-ups and provide access to richer representations of cost landscapes compared to purely classical methods. These capabilities are particularly relevant for machine learning, where state-of-the-art models, especially in Natural Language Processing (NLP), suffer from prohibitive time complexity due to massive matrix multiplications and high-dimensional optimization. In this manuscript, we propose a Hybrid Quantum Classical selection mechanism for the Mamba architecture, designed specifically for temporal sequence classification problems. Our approach leverages Variational Quantum Circuits (VQCs) as quantum gating modules that both enhance feature extraction and improve suppression of irrelevant information. This integration directly addresses the computational bottlenecks of deep learning architectures by exploiting quantum resources for more efficient representation learning. We analyze how introducing quantum subroutines into large language models (LLMs) impacts their generalization capability, expressivity, and parameter efficiency. The results highlight the potential of quantum-enhanced gating mechanisms as a path toward scalable, resource-efficient NLP models, in a limited simulation step. Within the first four epochs on a reshaped MNIST dataset with input format (batch, 784, d_model), our hybrid model, using one quantum layer, achieved 24.6% accuracy and higher expressivity, compared to 21.6% obtained by a purely classical selection mechanism. We state no funding.
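cs.ET
[139] CNN-Based Automated Parameter Extraction Framework for Modeling Memristive Devices cs.ET | cs.AI | cs.CV | cs.LG
Akif Hamid, Orchi Hassan
TL;DR: The paper proposes a CNN-based automated parameter extraction framework for modeling memristive devices, addressing the time cost and poor adaptability of conventional manual parameter extraction.
Details
Motivation: Conventional RRAM compact models rely on extensive manual parameter tuning, which is inefficient and adapts poorly across devices, so an automated solution is needed.
Result: Evaluated on four key NVM metrics, the framework achieves low error and robustness across diverse device characteristics.
Insight: Combining a CNN with heuristic optimization yields an efficient, reliable automated tool for memristor modeling, supporting the development of RRAM technology.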
Abstract: Resistive random access memory (RRAM) is a promising candidate for next-generation nonvolatile memory (NVM) and in-memory computing applications. Compact models are essential for analyzing the circuit and system-level performance of experimental RRAM devices. However, most existing RRAM compact models rely on multiple fitting parameters to reproduce the device I-V characteristics, and in most cases, as the parameters are not directly related to measurable quantities, their extraction requires extensive manual tuning, making the process time-consuming and limiting adaptability across different devices. This work presents an automated framework for extracting the fitting parameters of the widely used Stanford RRAM model directly from the device I-V characteristics. The framework employs a convolutional neural network (CNN) trained on a synthetic dataset to generate initial parameter estimates, which are then refined through three heuristic optimization blocks that minimize errors via adaptive binary search in the parameter space. We evaluated the framework using four key NVM metrics: set voltage, reset voltage, hysteresis loop area, and low resistance state (LRS) slope. Benchmarking against RRAM device characteristics derived from previously reported Stanford model fits, other analytical models, and experimental data shows that the framework achieves low error across diverse device characteristics, offering a fast, reliable, and robust solution for RRAM modeling.
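The refinement stage can be pictured with a plain bisection over a single model parameter: starting from a bracket around the CNN's initial estimate, halve the interval until a simulated metric (for example, set voltage) matches its target. This is a simplified stand-in for the paper's adaptive binary search, which the abstract does not specify; the monotonicity assumption, the function name, and the toy `simulate_metric` callable are all hypothetical.

```python
def refine_parameter(simulate_metric, target, low, high,
                     tolerance=1e-6, max_iters=100):
    """Bisection on one parameter so that simulate_metric(p) matches target.

    Assumes simulate_metric is increasing in the parameter on [low, high]
    and that the target is reachable inside the bracket.
    """
    for _ in range(max_iters):
        mid = (low + high) / 2.0
        value = simulate_metric(mid)
        if abs(value - target) <= tolerance:
            return mid
        if value < target:
            low = mid    # metric too small: search the upper half
        else:
            high = mid   # metric too large: search the lower half
    return (low + high) / 2.0
```

A good initial estimate from the CNN narrows the bracket `[low, high]`, which is what lets a search like this converge in few model evaluations.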
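cs.AI
[140] Think Before You Retrieve: Learning Test-Time Adaptive Search with Small Language Models cs.AI | cs.CL | cs.IR
Supriti Vijay, Aman Priyanshu, Anu Vellore, Baturay Saglam, Amin Karbasi
TL;DR: The paper proposes Orion, a framework that trains small language models (SLMs) to perform iterative retrieval. It combines synthetic trajectory generation, supervised fine-tuning, and reinforcement learning, improving retrieval performance and outperforming much larger models.
Details
Motivation: Existing retrieval methods either lack reasoning ability or cost too much, and cannot meet the dynamic demands of complex queries.
Result: The 1.2B-parameter model outperforms retrievers up to 200-400x larger on five of six benchmarks.
Insight: Retrieval performance can come from learned strategies rather than model scale alone; dynamic search and reflection are key.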
Abstract: Effective information retrieval requires reasoning over partial evidence and refining strategies as information emerges. Yet current approaches fall short: neural retrievers lack reasoning capabilities, large language models (LLMs) provide semantic depth but at prohibitive cost, and query rewriting or decomposition limits improvement to static transformations. As a result, existing methods fail to capture the iterative dynamics of exploration, feedback, and revision that complex user queries demand. We introduce Orion, a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through learned search strategies. Orion combines: (1) synthetic trajectory generation and supervised fine-tuning to encourage diverse exploration patterns in models, (2) reinforcement learning (RL) that rewards effective query refinement and backtracking behaviors, and (3) inference-time beam search algorithms that exploit the self-reflection capabilities learned during RL. Despite using only 3% of the training data available, our 1.2B model achieves 77.6% success on SciFact (vs. 72.6% for prior retrievers), 25.2% on BRIGHT (vs. 22.1%), 63.2% on NFCorpus (vs. 57.8%), and remains competitive on FEVER, HotpotQA, and MSMarco. It outperforms retrievers up to 200-400x larger on five of six benchmarks. These findings suggest that retrieval performance can emerge from learned strategies, not just model scale, when models are trained to search, reflect, and revise.
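[141] Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces cs.AI | cs.CL
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury
TL;DR: The paper proposes the Generative Semantic Workspace (GSW), a neuro-inspired generative memory framework that addresses the challenges large language models (LLMs) face in long-context reasoning, using structured, interpretable representations to support reasoning about evolving events.
Details
Motivation: Current retrieval-augmented generation (RAG) methods are tailored to fact-based retrieval and lack the space-time-anchored narrative representations needed to track entities through episodic events.
Result: On the EpBench benchmark, GSW outperforms existing RAG baselines by up to 20% and reduces query-time context tokens by 51% compared to the next most token-efficient baseline.
Insight: GSW gives LLMs human-like episodic memory, opening the way to reasoning tasks over longer horizons.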
Abstract: Large Language Models (LLMs) face fundamental challenges in long-context reasoning: many documents exceed their finite context windows, while performance on texts that do fit degrades with sequence length, necessitating their augmentation with external memory frameworks. Current solutions, which have evolved from retrieval using semantic embeddings to more sophisticated structured knowledge graph representations for improved sense-making and associativity, are tailored for fact-based retrieval and fail to build the space-time-anchored narrative representations required for tracking entities through episodic events. To bridge this gap, we propose the Generative Semantic Workspace (GSW), a neuro-inspired generative memory framework that builds structured, interpretable representations of evolving situations, enabling LLMs to reason over evolving roles, actions, and spatiotemporal contexts. Our framework comprises an Operator, which maps incoming observations to intermediate semantic structures, and a Reconciler, which integrates these into a persistent workspace that enforces temporal, spatial, and logical coherence. On the Episodic Memory Benchmark (EpBench) \cite{huet_episodic_2025} comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%. Furthermore, GSW is highly efficient, reducing query-time context tokens by 51% compared to the next most token-efficient baseline, reducing inference-time costs considerably. More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons.
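[142] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents cs.AI | cs.CL | cs.LG
Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich
TL;DR: The paper introduces ResearchRubrics, a standardized benchmark for evaluating Deep Research (DR) agents that pairs diverse prompts with expert-written, fine-grained rubrics to address the difficulty of evaluating DR.
Details
Motivation: DR agents must integrate capabilities such as multi-step reasoning and cross-document synthesis, but current evaluation methods cope poorly with their long and diverse responses. A standardized evaluation framework is therefore needed.
Result: Even leading DR agents, such as Gemini's DR and OpenAI's DR, achieve under 68% average rubric compliance, mainly because of missed implicit context and inadequate reasoning about retrieved information.
Insight: The study exposes the limitations of current DR agents, underlines the importance of standardized evaluation, and releases the benchmark to support the development of more reliable DR systems.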
Abstract: Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini’s DR and OpenAI’s DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
[143] Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction cs.AI | cs.CL
Jun Xu, Xinkai Du, Yu Ao, Peilong Zhao, Yang Li
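TL;DR: Thinker proposes a hierarchical thinking model that performs deep search through multi-turn interaction, making the reasoning process supervisable and verifiable and addressing the lack of reasoning-process supervision in earlier end-to-end reinforcement learning methods.
Details
Motivation: Existing methods mainly train LLMs with end-to-end reinforcement learning to use external retrievers for complex problems, but they do not supervise the reasoning process, so logical coherence and rigor are hard to guarantee.
Result: Experiments show that Thinker is competitive with existing baselines using only a few hundred training samples and significantly outperforms them when trained on the full training set.
Insight: Hierarchical thinking and the dual-representation design markedly improve the logical rigor of the reasoning process, while knowledge boundary determination improves search efficiency.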
Abstract: Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM’s intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes. The source code is available at https://github.com/OpenSPG/KAG-Thinker.
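The decomposition described above can be illustrated with a small sketch. It is not the KAG-Thinker implementation: the `SubProblem` class, the `Retrieve(...)` logical-form syntax, and the dictionary-based stand-ins for the LLM's internal knowledge and the external searcher are all assumptions made for illustration. The sketch shows the three ideas from the abstract: dual representation of each sub-problem, dependencies passed as parameters, and a knowledge boundary check before searching.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SubProblem:
    name: str
    question: str       # natural-language form; may contain {dependency} slots
    logical_form: str   # hypothetical logical-function form of the same step
    depends_on: List[str] = field(default_factory=list)


def solve(sub_problems: List[SubProblem],
          known_facts: Dict[str, str],
          search: Callable[[str], str]) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Solve sub-problems in order, passing earlier answers as parameters."""
    answers: Dict[str, str] = {}
    trace: List[Tuple[str, str]] = []
    for sp in sub_problems:  # assumed to be listed in dependency order
        params = {dep: answers[dep] for dep in sp.depends_on}
        question = sp.question.format(**params)
        if question in known_facts:
            # Knowledge boundary check: answer directly, skip the search.
            answers[sp.name] = known_facts[question]
            trace.append((sp.name, "internal"))
        else:
            answers[sp.name] = search(sp.logical_form.format(**params))
            trace.append((sp.name, "search"))
    return answers, trace


# Toy stand-ins for the model's intrinsic knowledge and an external retriever.
toy_known = {"Who directed Inception?": "Christopher Nolan"}
toy_index = {"Retrieve(s=Christopher Nolan, p=birthplace)": "London"}

plan = [
    SubProblem("director", "Who directed Inception?",
               "Retrieve(s=Inception, p=director)"),
    SubProblem("birthplace", "Where was {director} born?",
               "Retrieve(s={director}, p=birthplace)", depends_on=["director"]),
]
answers, trace = solve(plan, toy_known, toy_index.__getitem__)
```

Here the first step is answered from internal knowledge and only the second triggers a search, with the director's name substituted into the logical form as a parameter.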
[144] Dual-Process Scaffold Reasoning for Enhancing LLM Code Debugging cs.AI | cs.CL | cs.SE
Po-Chung Hsieh, Chin-Po Chen, Jeng-Lin Li, Ming-Ching Chang
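TL;DR: The paper proposes a psychology-inspired Scaffold Reasoning framework for improving LLM performance on code debugging; by combining a Scaffold Stream, an Analytic Stream, and an Integration Stream, it markedly improves reasoning accuracy and efficiency.
Details
Motivation: Although current LLMs show advanced reasoning on many benchmarks, how to find reasoning steps that balance complexity and computational efficiency in code debugging remains unsolved. The paper draws on psychological theory, in particular applying the System 1 and System 2 cognitive processes to LLMs, to optimize reasoning pathways.
Result: The framework performs well on DebugBench (88.91% pass rate, 5.36 seconds average inference time per problem) and outperforms other LLM reasoning approaches. The experiments also indicate that the framework aligns with human cognitive processes.
Insight: 1. Psychological theory can guide the optimization of LLM reasoning pathways; 2. Dual-process reasoning can effectively balance complexity and efficiency; 3. Different cognitive pathways perform differently across scenarios and call for targeted optimization.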
Abstract: Recent LLMs have demonstrated sophisticated problem-solving capabilities on various benchmarks through advanced reasoning algorithms. However, the key research question of identifying reasoning steps that balance complexity and computational efficiency remains unsolved. Recent research has increasingly drawn upon psychological theories to explore strategies for optimizing cognitive pathways. The LLM’s final outputs and intermediate steps are regarded as System 1 and System 2, respectively. However, an in-depth exploration of the System 2 reasoning is still lacking. Therefore, we propose a novel psychologically backed Scaffold Reasoning framework for code debugging, which encompasses the Scaffold Stream, Analytic Stream, and Integration Stream. The construction of reference code within the Scaffold Stream is integrated with the buggy code analysis results produced by the Analytic Stream through the Integration Stream. Our framework achieves an 88.91% pass rate and an average inference time of 5.36 seconds per problem on DebugBench, outperforming other reasoning approaches across various LLMs in both reasoning accuracy and efficiency. Further analyses elucidate the advantages and limitations of various cognitive pathways across varying problem difficulties and bug types. Our findings also corroborate the alignment of the proposed Scaffold Reasoning framework with human cognitive processes.
[145] SciAgent: A Unified Multi-Agent System for Generalistic Scientific Reasoning cs.AI | cs.CL | cs.MA
Xuchen Li, Ruitao Wu, Xuanbo Liu, Xukai Wang, Jinbo Hu
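TL;DR: SciAgent is a unified multi-agent system aimed at cross-disciplinary scientific reasoning; it dynamically coordinates specialized subsystems for tasks such as symbolic deduction and conceptual modeling, and matches or exceeds human gold-medalist level in several international competitions.
Details
Motivation: Existing AI systems perform well in specific domains but lack general, cross-disciplinary scientific reasoning ability. SciAgent fills this gap with flexible reasoning via multi-agent collaboration.
Result: It performs strongly on mathematics and physics Olympiads (IMO, IPhO, and others), matching or exceeding human gold-medalist level, and shows cross-disciplinary generalization on the chemistry Olympiad and the HLE benchmark.
Insight: SciAgent demonstrates the potential of multi-agent collaboration for general scientific reasoning and offers a concrete example of applying general AI in science.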
Abstract: Recent advances in large language models have enabled AI systems to achieve expert-level performance on domain-specific scientific tasks, yet these systems remain narrow and handcrafted. We introduce SciAgent, a unified multi-agent system designed for generalistic scientific reasoning: the ability to adapt reasoning strategies across disciplines and difficulty levels. SciAgent organizes problem solving as a hierarchical process: a Coordinator Agent interprets each problem’s domain and complexity, dynamically orchestrating specialized Worker Systems, each composed of interacting reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification. These agents collaboratively assemble and refine reasoning pipelines tailored to each task. Across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), SciAgent consistently attains or surpasses human gold-medalist performance, demonstrating both domain generality and reasoning adaptability. Additionally, SciAgent has been tested on the International Chemistry Olympiad (IChO) and selected problems from the Humanity’s Last Exam (HLE) benchmark, further confirming the system’s ability to generalize across diverse scientific domains. This work establishes SciAgent as a concrete step toward generalistic scientific intelligence: AI systems capable of coherent, cross-disciplinary reasoning at expert levels.
[146] Multi-Agent GraphRAG: A Text-to-Cypher Framework for Labeled Property Graphs cs.AI | cs.CL
Anton Gusarov, Anastasia Volkova, Valentin Khrulkov, Andrey Kuznetsov, Evgenii Maslov
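TL;DR: The paper proposes a multi-agent GraphRAG framework that generates Cypher queries from text, serving as a natural language interface to LPG-based graph data.
Details
Motivation: Existing GraphRAG methods rely mainly on RDF knowledge graphs and SPARQL queries, while the potential of Cypher and LPG databases in GraphRAG has not been fully explored.
Result: The system is evaluated on the CypherBench dataset and on a property graph derived from IFC data, showing its potential for industrial digital automation.
Insight: LPG databases combined with an LLM-driven GraphRAG framework can effectively support complex reasoning tasks and are suited to large-scale real-world applications.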
Abstract: While Retrieval-Augmented Generation (RAG) methods commonly draw information from unstructured documents, the emerging paradigm of GraphRAG aims to leverage structured data such as knowledge graphs. Most existing GraphRAG efforts focus on Resource Description Framework (RDF) knowledge graphs, relying on triple representations and SPARQL queries. However, the potential of Cypher and Labeled Property Graph (LPG) databases to serve as scalable and effective reasoning engines within GraphRAG pipelines remains underexplored in current research literature. To fill this gap, we propose Multi-Agent GraphRAG, a modular LLM agentic system for text-to-Cypher query generation serving as a natural language interface to LPG-based graph data. Our proof-of-concept system features an LLM-based workflow for automated Cypher query generation and execution, using Memgraph as the graph database backend. Iterative content-aware correction and normalization, reinforced by an aggregated feedback loop, ensure both semantic and syntactic refinement of generated queries. We evaluate our system on the CypherBench graph dataset covering several general domains with diverse types of queries. In addition, we demonstrate the performance of the proposed workflow on a property graph derived from the IFC (Industry Foundation Classes) data, representing a digital twin of a building. This highlights how such an approach can bridge AI with real-world applications at scale, enabling industrial digital automation use cases.
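The iterative correction loop with aggregated feedback can be sketched as follows. This is a minimal illustration, not the paper's system: `generate` stands in for an LLM query generator and `check` for a validator or database execution step, both passed in as plain callables so that no real LLM or Memgraph API is assumed. The toy generator, checker, and the `Wall`/`height` schema are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def refine_query(question: str,
                 generate: Callable[[str, List[str]], str],
                 check: Callable[[str], Optional[str]],
                 max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Regenerate a query until `check` reports no error or rounds run out.

    Returns (accepted query or None, number of rounds used).
    """
    feedback: List[str] = []  # errors from all earlier rounds, aggregated
    for round_no in range(1, max_rounds + 1):
        query = generate(question, feedback)
        error = check(query)
        if error is None:
            return query, round_no
        feedback.append(error)
    return None, max_rounds


def toy_generate(question: str, feedback: List[str]) -> str:
    # Hypothetical generator: misspells a property until it sees feedback.
    if not feedback:
        return "MATCH (w:Wall) RETURN w.hight"
    return "MATCH (w:Wall) RETURN w.height"


def toy_check(query: str) -> Optional[str]:
    # Hypothetical schema check standing in for executing the query.
    if "w.hight" in query:
        return "unknown property 'hight' on label Wall"
    return None


result = refine_query("How tall are the walls?", toy_generate, toy_check)
```

Passing the whole feedback list to the generator, rather than only the latest error, mirrors the aggregated feedback idea: later rounds can avoid repeating mistakes already reported.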
[147] Simulating the Visual World with Artificial Intelligence: A Roadmap cs.AI | cs.CV
Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan
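TL;DR: This paper surveys the evolution of video generation and proposes a conceptual framework that views modern video foundation models as the combination of an implicit world model and a video renderer, with the goal of simulating virtual environments that are physically plausible and interactive.
Details
Motivation: Video generation is shifting from pure visual appeal toward building virtual environments that support interaction and physical plausibility. This evolution needs systematic study, along with an exploration of how video generation models can develop into implicit world models with simulation capabilities.
Result: The survey shows how video generation models develop into implicit world models that support complex interaction and multi-scale planning, with broad applications in robotics, autonomous driving, and gaming.
Insight: Future world models need to incorporate agent planning capabilities, and their evaluation should emphasize interactivity and physical plausibility.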
Abstract: The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a “window” into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.