cs.CV [Total: 78]
cs.CL [Total: 40]
physics.chem-ph [Total: 1]
cs.AI [Total: 6]
cs.CR [Total: 1]
eess.AS [Total: 1]
cs.MA [Total: 1]
cs.SI [Total: 1]
cs.LG [Total: 6]

cs.CV

[1] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation cs.CV

Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian

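TL;DR: MMaDA-Parallel is a multimodal diffusion framework that improves cross-modal consistency through parallel, bidirectional interaction between text and images, substantially improving thinking-aware image generation.

Details

Motivation: Existing autoregressive approaches can lose performance on complex tasks because of error propagation, and insufficient cross-modal alignment is the main cause. This work aims to resolve that problem.

Result: Experiments show a 6.9% improvement in Output Alignment on ParaBench, outperforming Bagel, the current state-of-the-art model.

Insight: Parallel interaction and multimodal reinforcement learning are effective ways to improve cross-modal generation.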

Abstract: While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

[2] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild cs.CV | cs.LG

Felix B. Mueller, Jan F. Meier, Timo Lueddecke, Richard Vogg, Roger L. Freixanet

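TL;DR: PriVi is a large-scale, primate-centric video pretraining dataset that takes a data-centric approach to improving generalization in primate behavior analysis. V-JEPA pretrained on PriVi performs strongly on several benchmark datasets and outperforms fully finetuned baselines.

Details

Motivation: Existing methods rely on human-centric pretrained models and focus on single datasets, which limits generalization. PriVi addresses this gap with a primate-specific, data-centric approach.

Result: The approach outperforms prior work, including fully finetuned baselines, on four benchmark datasets such as ChimpACT, and performs well in low-label settings.

Insight: Primate-specific pretraining substantially improves data efficiency and generalization, which points to the potential of data-centric methods for domain-specific tasks.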

Abstract: Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We pretrain V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets, ChimpACT, BaboonLand, PanAf500, and ChimpBehave, our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.

[3] Classifying Phonotrauma Severity from Vocal Fold Images with Soft Ordinal Regression cs.CV | cs.LG

Katie Matton, Purvaja Balaji, Hamzeh Ghasemzadeh, Jameson C. Cooper, Daryush D. Mehta

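TL;DR: The paper presents a method for automatically classifying phonotrauma severity from vocal fold images, using a soft ordinal regression framework to handle both the ordinal nature and the uncertainty of the labels.

Details

Motivation: Assessing phonotrauma severity depends on the subjective judgment of clinical experts, which is costly and varies in reliability. An automated tool is needed to support large-scale studies and clinical decisions.

Result: Predictive performance approaches that of clinical experts, and the model produces well-calibrated uncertainty estimates.

Insight: Soft ordinal regression can be applied to other medical image classification tasks that have ordinal labels and label uncertainty.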

Abstract: Phonotrauma refers to vocal fold tissue damage resulting from exposure to forces during voicing. It occurs on a continuum from mild to severe, and treatment options can vary based on severity. Assessment of severity involves a clinician’s expert judgment, which is costly and can vary widely in reliability. In this work, we present the first method for automatically classifying phonotrauma severity from vocal fold images. To account for the ordinal nature of the labels, we adopt a widely used ordinal regression framework. To account for label uncertainty, we propose a novel modification to ordinal regression loss functions that enables them to operate on soft labels reflecting annotator rating distributions. Our proposed soft ordinal regression method achieves predictive performance approaching that of clinical experts, while producing well-calibrated uncertainty estimates. By providing an automated tool for phonotrauma severity assessment, our work can enable large-scale studies of phonotrauma, ultimately leading to improved clinical understanding and patient care.
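A minimal sketch of how soft labels could feed an ordinal loss, offered as an illustration rather than the authors' implementation. The abstract does not say which ordinal regression framework is used; this sketch assumes a cumulative formulation with K-1 binary tasks of the form "severity > k", and every function name here is hypothetical.

```python
import math

def soft_label(ratings, num_classes):
    """Turn annotator ratings (integers 0..num_classes-1) into a distribution."""
    counts = [0] * num_classes
    for r in ratings:
        counts[r] += 1
    return [c / len(ratings) for c in counts]

def cumulative_targets(dist):
    """Soft targets P(severity > k) for k = 0..K-2 (assumed cumulative encoding)."""
    targets = []
    tail = 1.0
    for p in dist[:-1]:
        tail -= p
        targets.append(tail)
    return targets

def soft_ordinal_loss(logits, dist):
    """Mean binary cross-entropy between sigmoid(logit_k) and P(severity > k)."""
    targets = cumulative_targets(dist)
    loss = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss / len(targets)

# Four annotators rate one image on a 3-level scale.
dist = soft_label([0, 1, 1, 2], num_classes=3)
print(dist, cumulative_targets(dist))
```

With hard labels the targets would all be 0 or 1; the soft version lets annotator disagreement pull each target toward an intermediate value.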

[4] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control cs.CV

Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan Rossi

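TL;DR: SliderEdit is a framework for continuous image editing through fine-grained, interpretable slider controls, overcoming the fixed instruction strength of existing methods.

Details

Motivation: Existing instruction-based image editing models cannot continuously adjust the strength of an individual edit instruction, which limits precise user control. SliderEdit aims to fill this gap.

Result: Applying SliderEdit to models such as FLUX-Kontext and Qwen-Image-Edit substantially improves edit controllability, visual consistency, and user steerability.

Insight: SliderEdit shows a technical path to continuous, fine-grained control in instruction-based image editing and opens a new direction for interactive image manipulation.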

Abstract: Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user’s ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.

[5] Density Estimation and Crowd Counting cs.CV

Balachandra Devarangadi Sunil, Rakshith Venkatesh, Shantanu Todmal

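TL;DR: The study extends a crowd density estimation algorithm originally designed for images to video. It combines a denoising probabilistic model with diffusion processes to generate high-quality density maps, and adds a regression branch and a consolidation mechanism to improve accuracy. An event-driven sampling technique reduces the computational load, and experiments validate the method.

Details

Motivation: Crowd density estimation in video faces challenges from temporal dynamics and computational cost. Most existing methods target static images and do not transfer directly to video, so an efficient method that adapts to dynamic scenes is needed to support real-time monitoring.

Result: Experiments show that the model captures crowd dynamics in both sparse and dense scenes, that the sampling method reduces the frame count while retaining key events, and that MAE confirms its accuracy.

Insight: Combining a diffusion-based probabilistic model with event-driven sampling gives an efficient solution for video crowd analysis, particularly for real-time monitoring.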

Abstract: This study enhances a crowd density estimation algorithm originally designed for image-based analysis by adapting it for video-based scenarios. The proposed method integrates a denoising probabilistic model that utilizes diffusion processes to generate high-quality crowd density maps. To improve accuracy, narrow Gaussian kernels are employed, and multiple density map outputs are generated. A regression branch is incorporated into the model for precise feature extraction, while a consolidation mechanism combines these maps based on similarity scores to produce a robust final result. An event-driven sampling technique, utilizing the Farneback optical flow algorithm, is introduced to selectively capture frames showing significant crowd movements, reducing computational load and storage by focusing on critical crowd dynamics. Through qualitative and quantitative evaluations, including overlay plots and Mean Absolute Error (MAE), the model demonstrates its ability to effectively capture crowd dynamics in both dense and sparse settings. The efficiency of the sampling method is further assessed, showcasing its capability to decrease frame counts while maintaining essential crowd events. By addressing the temporal challenges unique to video analysis, this work offers a scalable and efficient framework for real-time crowd monitoring in applications such as public safety, disaster response, and event management.
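The event-driven sampling and MAE evaluation described above can be sketched roughly as follows. This is an illustrative simplification, not the paper's code: it assumes a per-frame motion score has already been computed (for example, the mean optical-flow magnitude between consecutive frames from a Farneback implementation), and the threshold and function names are hypothetical.

```python
def select_event_frames(motion_scores, threshold):
    """Keep the first frame plus every frame whose motion score exceeds threshold.

    motion_scores[i] is assumed to measure movement between frame i-1 and
    frame i; motion_scores[0] is ignored because frame 0 has no predecessor.
    """
    kept = [0] if motion_scores else []
    for i, score in enumerate(motion_scores[1:], start=1):
        if score > threshold:
            kept.append(i)
    return kept

def mean_absolute_error(predicted_counts, true_counts):
    """MAE between predicted and ground-truth crowd counts."""
    errors = [abs(p - t) for p, t in zip(predicted_counts, true_counts)]
    return sum(errors) / len(errors)

scores = [0.0, 0.1, 2.5, 0.2, 3.0]
print(select_event_frames(scores, threshold=1.0))
print(mean_absolute_error([10, 20], [12, 17]))
```

Only the selected frames would then be passed to the density estimator, which is where the savings in computation and storage come from.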

[6] PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model cs.CV | cs.AI | cs.RO

Yunqian Cheng, Benjamin Princen, Roberto Manduchi

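TL;DR: PALMS+ is a modular, image-based indoor localization system that reconstructs 3D point clouds from RGB images with a depth estimation model and achieves accurate localization through geometric layout matching.

Details

Motivation: GPS signals are unavailable indoors, and existing vision-based localization methods such as PALMS are limited by the short range of smartphone LiDAR and by ambiguity in indoor layouts.

Result: On two datasets (Structured3D and a custom campus dataset), PALMS+ outperforms PALMS and F3Loc in stationary localization accuracy, and it achieves lower sequential localization error on 33 real-world trajectories.

Insight: PALMS+ reaches high localization accuracy without any training, showing its potential for camera-free tracking in infrastructure-free applications.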

Abstract: Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS$+$, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a foundation monocular depth estimation model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS$+$ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on the Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS$+$ outperforms PALMS and F3Loc in stationary localization accuracy – without requiring any training. Furthermore, when integrated with a particle filter for sequential localization on 33 real-world trajectories, PALMS$+$ achieved lower localization errors compared to other methods, demonstrating robustness for camera-free tracking and its potential for infrastructure-free applications. Code and data are available at https://github.com/Head-inthe-Cloud/PALMS-Plane-based-Accessible-Indoor-Localization-Using-Mobile-Smartphones
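To make the idea of layout matching by convolution with the floor plan more concrete, here is a small sketch under strong simplifying assumptions. It treats the floor plan and the observation as binary 2D wall grids, searches over translation only (the paper's posterior also covers orientation), and turns match scores into a posterior with a softmax. The grid encoding, the temperature, and the function names are all assumptions for illustration.

```python
import math

def match_scores(plan, patch):
    """Slide the observed wall patch over the plan and count overlapping wall cells."""
    plan_h, plan_w = len(plan), len(plan[0])
    patch_h, patch_w = len(patch), len(patch[0])
    scores = {}
    for r in range(plan_h - patch_h + 1):
        for c in range(plan_w - patch_w + 1):
            scores[(r, c)] = sum(
                plan[r + i][c + j] * patch[i][j]
                for i in range(patch_h)
                for j in range(patch_w)
            )
    return scores

def posterior(scores, temperature=1.0):
    """Softmax over placements, giving a distribution over candidate locations."""
    top = max(scores.values())
    weights = {k: math.exp((v - top) / temperature) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

plan = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
patch = [
    [1, 1],
    [1, 0],
]
post = posterior(match_scores(plan, patch))
print(max(post, key=post.get))
```

A posterior of this kind could serve directly as the measurement likelihood in a particle filter for sequential localization, which is how the abstract describes the sequential setting.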

Ahmed Alia, Mohcine Chraibi, Armin Seyfried

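TL;DR: The paper proposes an improved Social LSTM model that uses a Dynamic Occupied Space loss function to reduce collision rates and improve displacement accuracy in pedestrian trajectory prediction.

Details

Motivation: In dynamic, crowded environments, pedestrian trajectory prediction is difficult because of complex human motion and mutual influence among individuals. Existing methods usually treat pedestrians as point entities and ignore the physical space each person occupies.

Result: Experiments show up to a 31% reduction in collision rate, with the average displacement error and final displacement error reduced by 5% and 6% on average, respectively.

Insight: Accounting for the space pedestrians actually occupy and for scene density significantly improves trajectory prediction.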

Abstract: In dynamic and crowded environments, realistic pedestrian trajectory prediction remains a challenging task due to the complex nature of human motion and the mutual influences among individuals. Deep learning models have recently achieved promising results by implicitly learning such patterns from 2D trajectory data. However, most approaches treat pedestrians as point entities, ignoring the physical space that each person occupies. To address these limitations, this paper proposes a novel deep learning model that enhances the Social LSTM with a new Dynamic Occupied Space loss function. This loss function guides Social LSTM in learning to avoid realistic collisions without increasing displacement error across different crowd densities, ranging from low to high, in both homogeneous and heterogeneous density settings. Such a function achieves this by combining the average displacement error with a new collision penalty that is sensitive to scene density and individual spatial occupancy. For efficient training and evaluation, five datasets were generated from real pedestrian trajectories recorded during the Festival of Lights in Lyon 2022. Four datasets represent homogeneous crowd conditions – low, medium, high, and very high density – while the fifth corresponds to a heterogeneous density distribution. The experimental findings indicate that the proposed model not only lowers collision rates but also enhances displacement prediction accuracy in each dataset. Specifically, the model achieves up to a 31% reduction in the collision rate and reduces the average displacement error and the final displacement error by 5% and 6%, respectively, on average across all datasets compared to the baseline. Moreover, the proposed model consistently outperforms several state-of-the-art deep learning models across most test sets.
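The abstract describes the loss as the average displacement error combined with a collision penalty that depends on scene density and individual occupancy, but it does not give the formula. The sketch below is therefore only one plausible shape for such a loss: it models each pedestrian as a disc of fixed radius, penalizes overlap depth between predicted discs, and scales the penalty linearly by a density factor. The radius, the density scaling, the weight, and all names are assumptions.

```python
import math

def average_displacement_error(pred, truth):
    """Mean Euclidean distance between predicted and true positions."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

def collision_penalty(positions, radius, density):
    """Sum of overlap depths between pedestrian discs, scaled by scene density."""
    penalty = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            overlap = 2 * radius - math.dist(positions[i], positions[j])
            if overlap > 0:
                penalty += overlap
    return density * penalty

def occupied_space_loss(pred, truth, radius=0.2, density=1.0, weight=1.0):
    """ADE plus a weighted collision penalty on the predicted positions."""
    ade = average_displacement_error(pred, truth)
    return ade + weight * collision_penalty(pred, radius, density)

truth = [(0.0, 0.0), (1.0, 0.0)]
pred = [(0.0, 0.0), (0.3, 0.0)]  # second pedestrian predicted too close to the first
print(occupied_space_loss(pred, truth))
```

Here `pred` and `truth` hold the positions of all pedestrians at a single time step; a training loss would average this over the prediction horizon.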

[8] Soiling detection for Advanced Driver Assistance Systems cs.CV | cs.AI

Filip Beránek, Václav Diviš, Ivan Gruber

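TL;DR: The paper studies soiling detection for automotive cameras, treats it as a semantic segmentation task, and compares several segmentation methods, which outperform tile-level classification. It also shows that the Woodscape dataset suffers from data leakage and imprecise annotations, and proposes a smaller subset that reaches comparable results in less time.

Details

Motivation: Soiling detection for automotive cameras is essential to advanced driver assistance systems (ADAS), but current datasets may have quality problems, such as data leakage and imprecise annotations, that affect model performance. The paper aims to provide a more efficient and accurate approach.

Result: Semantic segmentation methods clearly outperform tile-level classification. Even with a smaller data subset, comparable performance is reached in a shorter time.

Insight: Data quality, including annotation precision and data leakage, has a significant effect on model performance, and a better-curated dataset can improve training efficiency and model robustness.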

Abstract: Soiling detection for automotive cameras is a crucial part of advanced driver assistance systems to make them more robust to external conditions like weather, dust, etc. In this paper, we treat soiling detection as a semantic segmentation problem. We provide a comprehensive comparison of popular segmentation methods and show that they outperform tile-level classification approaches. Moreover, we present an extensive analysis of the Woodscape dataset showing that the original dataset contains data leakage and imprecise annotations. To address these problems, we create a new data subset, which, despite being much smaller, provides enough information for the segmentation method to reach comparable results in a much shorter time. All our code and dataset splits are available at https://github.com/filipberanek/woodscape_revision.

[9] Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation cs.CV | cs.AI

Frank Li, Theo Dapamede, Mohammadreza Chavoshi, Young Seok Jeon, Bardia Khosravi

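TL;DR: The paper evaluates eight medical and general-domain foundation models (FMs) for chest X-ray analysis, comparing their performance on classification and segmentation. Medical-domain pre-training clearly outperforms general-domain models, but feature utility is highly task-dependent. Text-image alignment is not required, and a supervised baseline performs very well on segmentation.

Details

Motivation: Foundation models hold great promise for medical imaging, but it is unclear how the pre-training domain (medical vs. general), the paradigm (e.g., text-guided), and the architecture affect feature quality. The paper evaluates how these factors influence performance on radiology tasks to help select the most suitable encoder.

Result: 1. Medical pre-trained models outperform general-domain models in linear probing. 2. The features work well for global classification and for segmenting salient anatomy, but poorly for segmenting complex pathologies such as pneumothorax. 3. A supervised baseline matches or exceeds the best FMs on segmentation.

Insight: Medical pre-training helps, but architectural choices (e.g., multi-scale) are critical. Pre-trained features are not universally effective, and supervised models keep an advantage on complex localization tasks. The finding that text-image alignment is unnecessary also creates opportunities for non-aligned methods.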

Abstract: Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.

[10] STORM: Segment, Track, and Object Re-Localization from a Single 3D Model cs.CV

Yu Deng, Teng Cao, Hikaru Shindo, Jiahong Xue, Quentin Delfosse

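TL;DR: STORM is an annotation-free, real-time 6D pose estimation system that combines vision-language understanding with self-supervised feature matching to achieve accurate object segmentation, tracking, and re-localization.

Details

Motivation: Existing methods rely on a manually annotated segmentation mask in the first frame, which is time-consuming and performs poorly under occlusion and rapid motion. STORM aims to solve these problems.

Result: STORM reaches state-of-the-art accuracy on industrial datasets with occlusion, high-speed motion, and varying illumination, while running in real time.

Insight: By removing annotation and automating recovery, STORM substantially lowers deployment cost and offers a practical solution for applications such as manufacturing and quality control.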

Abstract: Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.

[11] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models cs.CV | cs.AI | cs.LG

Konstantinos M. Dafnis, Dimitris N. Metaxas

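TL;DR: STS is a lightweight test-time adaptation framework that extracts a spectral subspace and steers latent representations, substantially improving the zero-shot generalization of vision-language models without modifying model weights.

Details

Motivation: Vision-language models excel at zero-shot inference but degrade under test-time domain shift. Existing methods usually require backpropagation or changes to model components, which is computationally expensive.

Result: STS outperforms existing methods while running up to 8x faster with a 12x smaller memory footprint.

Insight: STS shows the potential of lightweight steering in the latent space, giving test-time adaptation an efficient solution that requires no model modification.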

Abstract: Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.

[12] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance cs.CV | cs.AI

Jeongho Min, Dongyoung Kim, Jaehyup Lee

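TL;DR: The paper proposes a training-free cross-view image retrieval framework that uses a pretrained vision encoder and a large language model (LLM), retrieving satellite images from street-view images through location semantics and LLM guidance.

Details

Motivation: Existing cross-view image retrieval methods usually require supervised training or depend on specific datasets, which limits real-world use. A method that needs no additional training and adapts well is required.

Result: Under zero-shot settings, the method outperforms existing learning-based approaches on the benchmark, and it also supports efficient automatic dataset construction.

Insight: Combining pretrained models with an LLM enables efficient cross-modal processing without supervision, offering flexibility and scalability for practical applications.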

Abstract: Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.

[13] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting cs.CV

Aymen Mir, Jian Wang, Riza Alp Guler, Chuan Guo, Gerard Pons-Moll

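TL;DR: The paper proposes a new framework based on 3D Gaussian Splatting (3DGS) for animating humans in 3D scenes. Representing both humans and scenes as Gaussians enables geometry-consistent free-viewpoint rendering.

Details

Motivation: Existing animation pipelines usually use meshes or point clouds as the 3D representation, which limits free-viewpoint rendering of humans interacting with scenes. 3DGS, a recent scene representation, remains under-explored for human animation.

Result: The approach is evaluated on scenes from Scannet++ and the SuperSplat library, with avatars reconstructed from sparse and dense multi-view human capture, and it supports editing monocular RGB videos with new animated humans.

Insight: 3DGS has potential advantages for human animation, especially geometry-consistent free-viewpoint rendering, and opens new applications for monocular video.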

Abstract: We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation to the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.

[14] CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage cs.CV | cs.AI

Xuntao Lyu, Ching-Chi Lin, Abdullah Al Arafat, Georg von der Brüggen, Jian-Jia Chen

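TL;DR: CertMask is a certifiable defense against adversarial patch attacks based on theoretically optimal mask coverage. Compared with the existing method PatchCleanser, it achieves an efficient and robust defense in a single round of masking and substantially improves certified robust accuracy.

Details

Motivation: Adversarial patch attacks mislead deep learning models with localized perturbations and pose high risk in real-world applications. Existing defenses such as PatchCleanser are inefficient and computationally expensive (O(n^2)). CertMask aims to provide an efficient defense with theoretical certification.

Result: Experiments on ImageNet and other datasets show that CertMask improves certified robust accuracy by up to +13.4% over PatchCleanser while keeping clean accuracy nearly identical to the base model.

Insight: CertMask shows the importance of theoretically optimal coverage for efficient defense and offers a scalable, certifiable solution against adversarial patches.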

Abstract: Adversarial patch attacks inject localized perturbations into images to mislead deep vision models. These attacks can be physically deployed, posing serious risks to real-world applications. In this paper, we propose CertMask, a certifiably robust defense that constructs a provably sufficient set of binary masks to neutralize patch effects with strong theoretical guarantees. While the state-of-the-art approach (PatchCleanser) requires two rounds of masking and incurs $O(n^2)$ inference cost, CertMask performs only a single round of masking with $O(n)$ time complexity, where $n$ is the cardinality of the mask set to cover an input image. Our proposed mask set is computed using a mathematically rigorous coverage strategy that ensures each possible patch location is covered at least $k$ times, providing both efficiency and robustness. We offer a theoretical analysis of the coverage condition and prove its sufficiency for certification. Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to +13.4% over PatchCleanser, while maintaining clean accuracy nearly identical to the vanilla model.
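The coverage condition (each possible patch location covered at least $k$ times) can be illustrated with a one-dimensional toy version. This is not the paper's construction: it assumes a 1D image, a fixed mask width, masks allowed to overhang the image border, and a simple uniform stride chosen so that every patch position is fully covered by at least `k` masks. All names and parameters are hypothetical.

```python
def mask_starts(length, mask, patch, k):
    """Start offsets of 1D masks such that every patch is covered >= k times."""
    window = mask - patch + 1  # number of mask offsets that fully cover one patch
    if window < k:
        raise ValueError("mask too small to cover each patch k times")
    stride = window // k
    first = patch - mask       # leftmost mask that still covers a patch at position 0
    last = length - patch      # rightmost start needed to cover the final patch
    return list(range(first, last + 1, stride))

def min_coverage(length, mask, patch, starts):
    """Brute-force check: fewest masks that fully cover any patch position."""
    return min(
        sum(1 for a in starts if a <= x and x + patch <= a + mask)
        for x in range(length - patch + 1)
    )

starts = mask_starts(length=32, mask=12, patch=5, k=2)
print(len(starts), min_coverage(32, 12, 5, starts))
```

With `n = len(starts)` masks, a single masking round needs `n` forward passes, whereas evaluating every pair of masks in two rounds would need on the order of `n * n`, which is the cost gap the abstract refers to.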

[15] CORONA-Fields: Leveraging Foundation Models for Classification of Solar Wind Phenomena cs.CV | astro-ph.IM | astro-ph.SR

Daniela Martin, Jinsu Hong, Connor O’Brien, Valmir P Moraes Filho, Jasmine R. Kobayashi

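TL;DR: The paper proposes a foundation-model-based deep learning architecture for automatically classifying solar wind phenomena, combining remote sensing with in situ measurements as an initial proof of feasibility for space weather prediction.

Details

Motivation: Solar activity and solar wind phenomena pose major risks to space weather at Earth and to technological infrastructure such as satellites, but automatically classifying these structures remains challenging.

Result: Classification performance is modest, limited mainly by coarse labels, class imbalance, and limited transferability of the pretrained model, but the study demonstrates that foundation model embeddings are feasible for solar wind tasks.

Insight: As a first proof-of-concept study, it lays the groundwork for future improvements to space weather prediction models and underlines the value of combining multimodal data.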

Abstract: Space weather at Earth, driven by the solar activity, poses growing risks to satellites around our planet as well as to critical ground-based technological infrastructure. Major space weather contributors are the solar wind and coronal mass ejections whose variable density, speed, temperature, and magnetic field make the automated classification of those structures challenging. In this work, we adapt a foundation model for solar physics, originally trained on Solar Dynamics Observatory imagery, to create embeddings suitable for solar wind structure analysis. These embeddings are concatenated with the spacecraft position and solar magnetic connectivity encoded using Fourier features which generates a neural field-based model. The full deep learning architecture is fine-tuned bridging the gap between remote sensing and in situ observations. Labels are derived from Parker Solar Probe measurements, forming a downstream classification task that maps plasma properties to solar wind structures. Although overall classification performance is modest, likely due to coarse labeling, class imbalance, and limited transferability of the pretrained model, this study demonstrates the feasibility of leveraging foundation model embeddings for in situ solar wind tasks. As a first proof-of-concept, it lays the groundwork for future improvements toward more reliable space weather predictions. The code and configuration files used in this study are publicly available to support reproducibility.

[16] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies cs.CV

Peng Gao, Yujian Lee, Xiaofeng Zhang, Zailong Chen, Hui Zhang

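TL;DR: The paper proposes T-DRS, a set of three-step, inference-only decay resilience strategies that mitigate the long-range dependency problem caused by ROPE in LVLMs. By combining semantic-driven amplification, distance-aware control, and re-reinforcement of distant dependencies, it substantially improves the model's ability to remember global context.

Details

Motivation: Large Vision-Language Models (LVLMs) that use ROPE (Rotary Positional Encoding) are weak at modeling long-range dependencies, in particular because of attention decay, which makes it hard for the model to remember global context. The paper aims to address this problem.

Result: On Visual Question Answering (VQA) benchmarks, T-DRS significantly improves model performance without additional training.

Insight: The work shows that inference-time strategies can compensate for ROPE's weakness in long-range dependency modeling without harming local inductive biases, and that this inference-only approach suggests new directions for optimizing other models.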

Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies when using Rotary Positional Encoding (ROPE). Although ROPE facilitates precise modeling of token positions, it induces progressive attention decay as token distance increases, especially over distant token pairs, which severely impairs the model’s ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), which amplifies semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which purifies attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), which consolidates the remaining informative remote dependencies to maintain global coherence. Together, the T-DRS strategies recover suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Visual Question Answering (VQA) benchmarks demonstrate that T-DRS can consistently improve performance in a training-free manner. The code can be accessed at https://github.com/labixiaoq-qq/Remember-me

[17] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection cs.CV

Jia Lin, Xiaofei Zhou, Jiyuan Liu, Runmin Cong, Guodao Zhang

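TL;DR: SAM-DAQ combines the Segment Anything Model (SAM) with depth-guided adaptive queries for RGB-D video salient object detection, addressing the dependence on manual prompts, high memory consumption, and computational burden.

Details

Motivation: Applying existing SAM models to RGB-D video salient object detection runs into three challenges: dependence on manual prompts, high memory consumption, and computational burden.

Result: Experiments on three RGB-D VSOD datasets show that SAM-DAQ outperforms existing methods on all evaluation metrics.

Insight: Effective fusion of depth information and an adaptive query mechanism significantly improve video salient object detection.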

Abstract: Recently, the Segment Anything Model (SAM) has attracted widespread attention, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges, including the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection way. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.

[18] RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion cs.CVPDF

Wenzhe He, Xiaojun Chen, Wentang Chen, Hongyu Wang, Ying Liu

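TL;DR: The paper proposes RWKV-PCSSC, a lightweight point cloud semantic scene completion network that uses the RWKV mechanism to reduce model complexity and outperforms existing methods in both accuracy and resource efficiency.

Details

Motivation: Existing semantic scene completion methods typically use dense network architectures with large parameter counts and high resource demands, which limits practical use. The authors therefore propose a lightweight alternative.

Result: Experiments show that, relative to PointSSC, RWKV-PCSSC reduces the parameter count by 4.18× and improves memory efficiency by 1.37×, while reaching SOTA performance on multiple datasets.

Insight: The RWKV mechanism performs well on point cloud tasks; a lightweight design can cut compute requirements while preserving high performance.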

Abstract: Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce an RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise features of the point cloud are progressively restored through multiple stages of RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18× and improves memory efficiency by 1.37× compared to the state-of-the-art method PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).

[19] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models cs.CVPDF

Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu

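TL;DR: HCC-3D proposes a hierarchical compensatory compression method that substantially reduces the computational overhead of 3D vision-language models while preserving essential information.

Details

Motivation: Current 3D-VLMs embed 3D point clouds directly as 3D tokens, which is computationally expensive and limits their application. The goal is to reduce the overhead introduced by 3D tokens without losing key information.

Result: HCC-3D reaches a compression ratio of approximately 98% while outperforming existing methods, improving both efficiency and performance.

Insight: Hierarchical compression combined with detail compensation balances computational efficiency against information integrity, offering a new direction for 3D multimodal modeling.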

Abstract: 3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework has a great computational cost limiting its application, where we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing the great improvements on both efficiency and performance.

[20] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding cs.CVPDF

Ketong Chen, Yuhao Chen, Yang Xue

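TL;DR: The paper presents MosaicDoc, a large-scale bilingual (Chinese and English) benchmark for Visually Rich Document Understanding (VRDU) that addresses the limited language diversity, layout complexity, and task variety of existing benchmarks.

Details

Motivation: Existing benchmarks for Vision-Language Models (VLMs) are mostly English-only, use simple layouts, and cover few tasks, so they cannot assess how models handle VRDU tasks with complex layouts and dense text.

Result: With 72K images and over 600K QA pairs, MosaicDoc is positioned as a definitive VRDU benchmark; evaluation on it shows that current state-of-the-art models still have clear limitations on complex documents.

Insight: Complex document layouts and multilingual support are key challenges in VRDU, and future work needs to improve model performance on such tasks.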

Abstract: Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.

[21] Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers cs.CV | cs.AI | cs.CLPDF

Xuan Rao, Simian Xu, Zheng Li, Bo Zhao, Derong Liu

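TL;DR: The paper proposes SLDC, which uses a latent space transition operator together with knowledge distillation to address distribution drift when pre-trained ViTs are used for class-incremental learning, substantially improving SeqFT.

Details

Motivation: In class-incremental learning, sequential fine-tuning (SeqFT) of a pre-trained ViT causes feature distribution drift, which degrades classifier performance. Existing methods do not address this effectively.

Result: Experiments show that SLDC significantly improves SeqFT; combined with knowledge distillation, performance is comparable to joint training.

Insight: Distribution drift is a key obstacle in class-incremental learning, and SLDC addresses it through feature alignment and knowledge distillation.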

Abstract: Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updater model, ultimately degrading the effectiveness of classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
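
As a reading aid for the linear SLDC variant, a regularized least-squares fit of a linear map between features before and after fine-tuning can be written in the standard ridge form. The notation below is assumed here and is not taken from the paper: $F_{\text{old}}, F_{\text{new}} \in \mathbb{R}^{n \times d}$ stack the features of the same $n$ samples before and after fine-tuning, $A$ is the transition operator, and $\lambda > 0$ is the regularization weight. The mapping direction (old to new) and the plain Frobenius penalty are also assumptions; the paper's exact objective may differ.

```latex
A^{\star}
  = \arg\min_{A \in \mathbb{R}^{d \times d}}
    \left\lVert F_{\text{old}} A - F_{\text{new}} \right\rVert_F^{2}
    + \lambda \left\lVert A \right\rVert_F^{2}
  = \left( F_{\text{old}}^{\top} F_{\text{old}} + \lambda I_d \right)^{-1}
    F_{\text{old}}^{\top} F_{\text{new}}
```

Under these assumptions, stored statistics of earlier classes could be moved into the updated feature space by applying $A^{\star}$, which is the sense in which such an operator compensates drift.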

[22] AdaptViG: Adaptive Vision GNN with Exponential Decay Gating cs.CV | cs.AI | cs.LGPDF

Mustafa Munir, Md Mostafijur Rahman, Radu Marculescu

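TL;DR: AdaptViG is an efficient Vision GNN (ViG) that uses Adaptive Graph Convolution and an Exponential Decay Gating mechanism to address the computational inefficiency of conventional ViGs, achieving a state-of-the-art trade-off between accuracy and efficiency among Vision GNNs.

Details

Motivation: Conventional ViGs are bottlenecked by computational cost, especially in the graph construction phase, which limits practical use. AdaptViG addresses this with dynamic gating and a hybrid strategy.

Result: AdaptViG-M reaches 82.6% top-1 accuracy on ImageNet with 80% fewer parameters and 84% fewer GMACs than ViG-B; on downstream tasks it also surpasses the much larger EfficientFormer-L7.

Insight: Dynamic gating and a hybrid design can improve ViG efficiency while keeping accuracy high, offering a new direction for graph neural networks in vision.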

Abstract: Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
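
The gating idea can be illustrated with a minimal sketch. The abstract only states that the gate weighs long-range connections by feature similarity; the specific form used here (an exponential of the negative L1 feature distance, scaled by a `temperature`) and the name `exp_decay_gate` are assumptions for illustration, not the paper's definition.

```python
import math
from typing import Sequence


def exp_decay_gate(x_i: Sequence[float], x_j: Sequence[float],
                   temperature: float = 1.0) -> float:
    """Hypothetical gate in (0, 1]: similar features give a weight near 1,
    dissimilar features decay exponentially toward 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # L1 distance between the two feature vectors (assumed similarity measure).
    l1 = sum(abs(a - b) for a, b in zip(x_i, x_j))
    return math.exp(-l1 / temperature)


def gated_message(x_i: Sequence[float], x_j: Sequence[float],
                  temperature: float = 1.0) -> list:
    """Scale the neighbor feature x_j by its gate before aggregation."""
    g = exp_decay_gate(x_i, x_j, temperature)
    return [g * v for v in x_j]
```

With this form, a larger `temperature` lets more dissimilar long-range neighbors contribute, while a smaller one restricts aggregation to near-duplicates.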

[23] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting cs.CVPDF

Zhiyuan Xu, Nan Min, Yuhang Guo, Tong Wei

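TL;DR: The paper proposes TSPE-GS, which improves semi-transparent surface reconstruction in 3D Gaussian Splatting through probabilistic depth extraction, resolving the cross-surface depth ambiguity caused by the usual single-depth-per-pixel assumption.

Details

Motivation: Conventional 3D Gaussian Splatting methods perform poorly on semi-transparent surfaces because they assume one depth per pixel, whereas several surfaces can be visible at once in semi-transparent scenes.

Result: On public and self-collected semi-transparent and opaque datasets, TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.

Insight: Modeling a pixel-wise multi-modal distribution is key to semi-transparent surface reconstruction, and the method generalizes to other Gaussian-based reconstruction pipelines.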

Abstract: 3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.

[24] Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment cs.CV | cs.AIPDF

Zhicheng Liao, Dongxu Wu, Zhenshan Shi, Sijie Mai, Hanwei Zhu

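TL;DR: This paper proposes a no-reference image quality assessment method that combines CLIP cosine similarity with a feature-magnitude cue, using an adaptive fusion framework and statistical normalization to improve performance.

Details

Motivation: Existing CLIP-based NR-IQA methods rely only on cosine-similarity semantic matching and overlook the strong correlation between the magnitude of CLIP image features and perceptual quality.

Result: On multiple benchmark datasets, the method outperforms standard CLIP-based IQA and SOTA baselines without any task-specific training.

Insight: The magnitude of CLIP features is an effective complement to semantic matching; statistical normalization and adaptive fusion are key to improving NR-IQA.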

Abstract: Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as “a good photo” or “a bad photo.” However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
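
The magnitude cue described above can be sketched as follows. The Box-Cox transform itself is standard; everything else is an assumption made for illustration: the function names, the choice of `lam`, the small `eps` that keeps inputs strictly positive, and the use of a plain mean as the scalar summary. The confidence-guided fusion with the cosine-similarity score is not reproduced here.

```python
import math
from typing import Sequence


def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform for x > 0: log(x) if lam == 0, else (x**lam - 1) / lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def magnitude_cue(features: Sequence[float], lam: float = 0.5,
                  eps: float = 1e-6) -> float:
    """Hypothetical scalar summary: mean of Box-Cox-transformed absolute
    feature values of one (unnormalized) image embedding."""
    transformed = [box_cox(abs(v) + eps, lam) for v in features]
    return sum(transformed) / len(transformed)
```

Note that the cue must be computed on the embedding before L2 normalization, since normalizing discards exactly the magnitude information the method relies on.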

[25] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching cs.CVPDF

Uday Bhaskar, Rishabh Bhattacharya, Avinash Patel, Sarthak Khoche, Praveen Anil Kulkarni

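TL;DR: The paper proposes a pipeline that trains efficient real-time object detectors from pseudo-labels generated by vision-language models (VLMs), using a per-object co-teaching strategy to reduce noise in the VLM-generated labels and substantially improve detection performance.

Details

Motivation: In domains such as autonomous driving, manual annotation is expensive. VLMs offer zero-shot object detection, but their detection latency and hallucinated predictions prevent direct deployment. An efficient and robust way to train real-time detectors from VLM pseudo-labels is therefore needed.

Result: On the KITTI dataset, mAP@0.5 rises from 31.12% to 46.61%, and adding a small fraction (10%) of ground-truth labels raises it further to 57.97%. Similar improvements are observed on ACDC and BDD100k.

Insight: VLM-generated pseudo-labels can be used to train efficient detectors when co-teaching is used to limit the effect of label noise. A small amount of ground-truth labels adds a large further gain, giving a cost-effective annotation strategy for practical use.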

Abstract: Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers’ per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost (31.12% to 46.61%) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels (10%) leads to further performance gains, reaching 57.97% mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
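
The per-object selection step can be sketched as below, assuming each model has already produced one loss value per pseudo-labelled box in a mini-batch. The function names, the fixed `keep_ratio`, and the keep-the-lowest-loss rule are illustrative assumptions in the spirit of standard co-teaching, not the paper's exact procedure.

```python
from typing import List, Sequence, Tuple


def keep_low_loss_boxes(peer_losses: Sequence[float],
                        keep_ratio: float) -> List[int]:
    """Indices of the boxes the peer model finds most reliable (lowest loss)."""
    if not peer_losses:
        return []
    n_keep = max(1, int(keep_ratio * len(peer_losses)))
    order = sorted(range(len(peer_losses)), key=lambda i: peer_losses[i])
    return sorted(order[:n_keep])


def co_teaching_selection(losses_a: Sequence[float],
                          losses_b: Sequence[float],
                          keep_ratio: float = 0.7
                          ) -> Tuple[List[int], List[int]]:
    """Each model trains only on the boxes selected by its peer's losses,
    so individual noisy boxes are dropped rather than whole images."""
    train_idx_for_a = keep_low_loss_boxes(losses_b, keep_ratio)
    train_idx_for_b = keep_low_loss_boxes(losses_a, keep_ratio)
    return train_idx_for_a, train_idx_for_b
```

For example, with per-box losses `[0.2, 3.1, 0.4, 0.3]` from model A and `[0.1, 0.5, 2.9, 0.2]` from model B and `keep_ratio=0.75`, model A would train on boxes 0, 1, and 3, and model B on boxes 0, 2, and 3.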

[26] Equivariant Sampling for Improving Diffusion Model-based Image Restoration cs.CVPDF

Chenxu Wu, Qingpeng Kong, Peiang Zhao, Wendi Yang, Wenxin Ma

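TL;DR: The paper proposes EquS, which improves diffusion model-based image restoration (DMIR) by imposing equivariant information through dual sampling trajectories. It also proposes a Timestep-Aware Schedule (TAS) for further gains.

Details

Motivation: Existing problem-agnostic DMIR methods do not fully exploit diffusion priors, which leads to suboptimal performance. The paper analyzes their sampling process and proposes solutions to these limitations.

Result: Experiments show that EquS is compatible with existing DMIR methods and significantly improves their performance without increasing computational cost.

Insight: Introducing equivariant information and a timestep-aware schedule can significantly improve diffusion models on image restoration tasks.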

Abstract: Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS$^+$. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at https://github.com/FouierL/EquS.

[27] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models cs.CV | cs.AIPDF

Satoshi Suzuki, Shin’ya Yamaguchi, Shoichiro Takeda, Taiga Yamane, Naoki Makishima

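TL;DR: The paper proposes DiVE, a method that preserves the geometric structure of embeddings when fine-tuning vision-language models such as CLIP, improving in-distribution (ID) performance without harming generalization in out-of-distribution (OOD) and zero-shot settings.

Details

Motivation: Existing robust fine-tuning methods reuse contrastive learning during fine-tuning, but this distorts the geometric structure of the embeddings and limits OOD and zero-shot performance. The paper aims to fix this.

Result: Experiments show that DiVE preserves the geometric structure while achieving strong results on ID, OOD, and zero-shot metrics.

Insight: Preserving the geometric structure of embeddings is crucial to the generalization of vision-language models, and DiVE achieves this by constraining difference vectors.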

Abstract: Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
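
The average vector loss idea can be sketched as follows. This is a minimal illustration under stated assumptions: difference vectors are taken as fine-tuned minus pre-trained embeddings, the weighted average defaults to uniform weights, and the penalty is a mean squared L2 distance. The paper's actual weighting and loss form may differ, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def difference_vectors(pre: Sequence[Vector], ft: Sequence[Vector]) -> List[List[float]]:
    """Per-sample difference between fine-tuned and pre-trained embeddings."""
    return [[f - p for p, f in zip(p_vec, f_vec)] for p_vec, f_vec in zip(pre, ft)]


def average_vector_loss(pre: Sequence[Vector], ft: Sequence[Vector],
                        weights: Optional[Sequence[float]] = None) -> float:
    """Mean squared distance of each difference vector to their weighted average.
    The loss is zero exactly when every sample is shifted by the same vector."""
    diffs = difference_vectors(pre, ft)
    n, dim = len(diffs), len(diffs[0])
    if weights is None:
        weights = [1.0 / n] * n  # assumed uniform weighting
    avg = [sum(w * d[k] for w, d in zip(weights, diffs)) for k in range(dim)]
    return sum(sum((d[k] - avg[k]) ** 2 for k in range(dim)) for d in diffs) / n
```

The design intuition is that if all embeddings move by one shared vector, every pairwise difference between embeddings is unchanged, so the relative geometry of the pre-trained space is kept intact.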

[28] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation cs.CVPDF

Xuexun Liu, Xiaoxu Xu, Qiudan Zhang, Lin Ma, Xu Wang

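TL;DR: DBGroup proposes a dual-branch point grouping method that uses scene-level annotations as an efficient form of weak supervision for 3D instance segmentation, improving performance through pseudo-label generation and self-training.

Details

Motivation: Current weakly supervised 3D instance segmentation methods still depend on costly manual annotation, and the process is complex and inefficient. DBGroup uses scene-level annotations to reduce labeling cost and improve scalability.

Result: DBGroup is competitive with sparse-point-level supervised 3D instance segmentation methods and surpasses state-of-the-art scene-level supervised 3D semantic segmentation approaches.

Insight: Scene-level annotation is a low-cost, efficient form of weak supervision; careful pseudo-label refinement and multi-round self-training markedly improve 3D instance segmentation.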

Abstract: Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.

[29] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers cs.CVPDF

Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon

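TL;DR: LampQ is a layer-wise mixed precision quantization method for Vision Transformers that addresses the coarse granularity, metric scale mismatch, and quantization-unaware bit allocation of existing methods.

Details

Motivation: Most existing quantization methods use uniform precision and ignore the differing sensitivity of ViT components to quantization.

Result: LampQ achieves SOTA performance in quantizing ViTs across several tasks, including image classification, object detection, and zero-shot quantization.

Insight: Differences in quantization sensitivity among ViT components can be captured accurately with a type-aware metric and optimized bit allocation, which improves quantization quality.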

Abstract: How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides the state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
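
The bit allocation step can be illustrated as a small integer program: pick one bit-width per layer to minimize total sensitivity under a model-size budget. In this sketch an exhaustive search stands in for an integer linear programming solver, which only works for a handful of layers. The sensitivity table, the size model (parameters times bits), and all names are assumptions for illustration, and LampQ's iterative update is not shown.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple


def allocate_bits(sensitivity: Sequence[Dict[int, float]],
                  num_params: Sequence[int],
                  bit_choices: Sequence[int],
                  budget_bits: int
                  ) -> Tuple[Optional[Tuple[int, ...]], Optional[float]]:
    """Return the per-layer bit-widths with the lowest total sensitivity
    whose total size (sum of params * bits) fits within budget_bits."""
    best_cost: Optional[float] = None
    best_assign: Optional[Tuple[int, ...]] = None
    for assign in product(bit_choices, repeat=len(num_params)):
        size = sum(n * b for n, b in zip(num_params, assign))
        if size > budget_bits:
            continue  # violates the size constraint
        cost = sum(sensitivity[layer][b] for layer, b in enumerate(assign))
        if best_cost is None or cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost
```

In the toy case where the first layer is far more sensitive to 4-bit quantization than the second, a budget that allows only one layer at 8 bits assigns the higher precision to the first layer.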

[30] MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging cs.CV | cs.AIPDF

Shufeng Kong, Zijie Wang, Nuan Cui, Hao Tang, Yihan Meng

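TL;DR: MIRNet combines self-supervised pre-training with constrained graph-based reasoning for medical image diagnosis, with a focus on tongue diagnosis. It learns visual representations with a masked autoencoder (MAE), models label relationships with graph attention networks (GAT), and introduces the TongueAtlas-4K dataset to address annotation scarcity.

Details

Motivation: Medical image diagnosis must cope with annotation scarcity, label imbalance, and clinical plausibility constraints, and tongue diagnosis in particular demands fine-grained visual and semantic understanding.

Result: MIRNet achieves state-of-the-art performance on tongue diagnosis and can generalize to other diagnostic medical imaging tasks.

Insight: Combining self-supervised learning with graph reasoning driven by expert knowledge can improve the robustness and generalization of medical image diagnosis.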

Abstract: Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels–representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.

[31] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models cs.CVPDF

Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li

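TL;DR: The paper proposes AffordBot, a framework for fine-grained 3D embodied reasoning that uses Multimodal Large Language Models and a tailored chain-of-thought reasoning paradigm to predict the spatial location, motion type, and motion axis of affordance elements in a 3D scene.

Details

Motivation: Existing methods usually operate at the object level or handle fine-grained affordance reasoning in a disjointed way, lacking coherent, instruction-driven grounding and reasoning. The paper addresses this gap to enable more effective human-agent collaboration.

Result: On the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, showing strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.

Insight: 1) Surround-view rendering and projection address the problem of representing 3D scenes visually for 2D-compatible MLLMs; 2) the tailored chain-of-thought reasoning significantly improves task performance; 3) the framework applies to a broader range of embodied reasoning tasks.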

Abstract: Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.

[32] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation cs.CV | cs.AIPDF

Yuxin Jiang, Wei Luo, Hui Zhang, Qiyu Chen, Haiming Yao

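TL;DR: Anomagic is a zero-shot anomaly generation method that combines visual and textual information through crossmodal prompt encoding, producing semantically coherent anomalies without real anomaly samples and using a contrastive refinement strategy to improve downstream anomaly detection.

Details

Motivation: Existing anomaly generation methods usually depend on real anomaly samples, which limits generalization. Anomagic removes this limitation by generating anomalies zero-shot from crossmodal prompts.

Result: Anomagic generates more realistic and varied anomalies than prior methods, yields larger improvements in downstream anomaly detection, and can generate anomalies for any normal-category image from user-defined prompts.

Insight: Combining crossmodal prompts with contrastive refinement gives a general and efficient solution for anomaly generation.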

Abstract: We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.

Feiyang Jia, Caiyan Jia, Ailin Liu, Shaoqing Xu, Qiming Xia

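TL;DR: DGFusion proposes a dual-guided fusion method that uses difficulty-aware instance matching and dual-guided modules to improve multi-modal 3D object detection on hard instances (distant, small, or occluded objects).

Details

Motivation: Existing multi-modal 3D object detection methods usually follow a single-guided paradigm and ignore differences in information density between modalities, which makes them poorly suited to hard instances and compromises the safety of autonomous driving systems.

Result: Compared with baseline methods, DGFusion improves mAP, NDS, and average recall on nuScenes by 1.0%, 0.8%, and 1.3% respectively, and remains robust across a range of hard scenarios.

Insight: The dual-guided paradigm exploits the strengths of each modality and significantly improves detection of hard instances, giving autonomous driving perception a more reliable solution.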

Abstract: As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.

[34] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection cs.CVPDF

Wencong Wu, Xiuwei Zhang, Hanlin Yin, Shun Dai, Hongxi Zhang

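TL;DR: FreDFT proposes a frequency-domain multimodal fusion Transformer for visible-infrared object detection, improving performance through frequency-domain attention and a mixed-scale feature fusion strategy.

Details

Motivation: Most existing methods fuse multimodal information with Transformers in the spatial domain and overlook the complementary information available in the frequency domain. In complex scenes, the resulting imbalance of multimodal information degrades detection performance.

Result: FreDFT outperforms SOTA methods on multiple public datasets.

Insight: Frequency-domain Transformers offer a clear advantage for multimodal fusion because they mine complementary information more effectively.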

Abstract: Visible-infrared object detection has gained considerable attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) built on a mixed-scale frequency feature fusion strategy to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is available at https://github.com/WenCongWu/FreDFT.

[35] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples cs.CVPDF

Xurui Li, Feng Xue, Yu Zhou

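TL;DR: MuSc-V2 is a zero-shot multimodal framework for industrial anomaly classification and segmentation that achieves accurate anomaly detection through mutual scoring of unlabeled samples, with large performance gains.

Details

Motivation: Existing zero-shot anomaly detection methods overlook the fact that normal image patches find many similar patches in both 2D appearance and 3D shape, whereas anomalous patches are diverse and isolated. MuSc-V2 exploits this property to improve detection.

Result: MuSc-V2 achieves AP gains of 23.7% on MVTec 3D-AD and 19.3% on Eyecandies, surpassing previous zero-shot benchmarks and even most few-shot methods.

Insight: Normal samples are highly similar to one another across modalities while anomalies are isolated, and cross-modal mutual scoring can compensate for the limitations of a single modality.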

Abstract: Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. With this framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
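
The core intuition of mutual scoring can be sketched in a few lines: a patch is scored by how far it is from its nearest patch in each of the other unlabeled images, so isolated patches score high. This is a deliberately simplified illustration. The single-modality setup, the Euclidean distance, the plain mean over reference images, and the function name are all assumptions, and none of IPG, SNAMD, CAE, or RsCon is modelled.

```python
import math
from typing import List, Sequence

Patch = Sequence[float]


def mutual_scores(images: Sequence[Sequence[Patch]]) -> List[List[float]]:
    """For every patch, average its nearest-patch distance to each other image.
    Requires at least two images, each with at least one patch."""
    scores: List[List[float]] = []
    for i, patches in enumerate(images):
        others = [img for j, img in enumerate(images) if j != i]
        img_scores = []
        for p in patches:
            # Distance to the closest patch in each other image.
            nearest = [min(math.dist(p, q) for q in other) for other in others]
            img_scores.append(sum(nearest) / len(nearest))
        scores.append(img_scores)
    return scores
```

In a toy set of three images where only one patch lies far from everything else, that patch receives the highest score while patches with close counterparts in other images score near zero.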

[36] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance cs.CVPDF

Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

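TL;DR: The paper proposes HCM-GRPO, which combines a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward to substantially improve image aesthetic reasoning. Using a dataset of over 128k samples, a compact model trained this way outperforms large open-source and closed-source models.

Details

Motivation: Existing Multimodal Large Language Models (MLLMs) perform poorly at image aesthetic reasoning, and relevant datasets are scarce, which limits progress on image screening.

Result: Experiments show that HCM-GRPO performs well with a small model, surpassing closed-source large models such as GPT4o and Qwen-VL-Max.

Insight: Dataset quality and methodological improvements (such as HCM and DPA) are key to better image aesthetic reasoning; with the right optimization, a small model can match or even exceed large models.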

Abstract: The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
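For readers unfamiliar with GRPO, the group-relative part can be sketched as follows. This shows only the standard normalization of rewards within a group of sampled responses; the abstract does not specify how the DPA reward or the HCM strategy are computed, so the reward values below are placeholders and the function name is an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Population
    standard deviation is used here; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Placeholder rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
print(group_relative_advantages(rewards))
```

Note that when every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.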

[37] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? cs.CVPDF

Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu

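TL;DR: The paper introduces a new benchmark, AV-ConfuseBench, to test how Multimodal Large Language Models (MLLMs) behave in “audio-visual confusion” scenes, and finds that visually dominated reasoning makes it hard for models to recognize that audio is absent. To address this, the authors propose RL-CoMM, which uses reinforcement learning together with a Large Audio Language Model (LALM) to improve multimodal reasoning.

Details

Motivation: To study how MLLMs perform when audio and video disagree, and whether they can correctly identify missing audio despite visually dominated reasoning.

Result: With limited training data, RL-CoMM improves accuracy on audio-visual question answering and audio-visual hallucination tasks by 10~30% over the baseline.

Insight: MLLMs are easily swayed by the visual modality when audio and video disagree; combining audio-only reasoning with reinforcement learning makes multimodal reasoning more robust.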

Abstract: Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an “Audio-Visual Confusion” scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs “Is there a/an muted-object sound”. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10~30% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.

[38] Multivariate Gaussian Representation Learning for Medical Action Evaluation cs.CV | cs.AIPDF

Luming Yang, Haoxian Liu, Siqing Li, Alper Yilmaz

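TL;DR: The paper proposes GaussMedAct, a medical action evaluation method based on multivariate Gaussian representations, and introduces the CPREval-6k dataset; adaptive spatiotemporal representation learning improves medical action analysis.

Details

Motivation: Fine-grained action evaluation in medical vision suffers from a lack of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal modeling of very rapid actions, so new methods and datasets are needed.

Result: Achieves 92.1% Top-1 accuracy on the benchmark, 5.9% higher than the ST-GCN baseline, with only 10% of the FLOPs. Cross-dataset experiments confirm its robustness.

Insight: Multivariate Gaussian representations capture motion semantics effectively while remaining robust to spatiotemporal noise, offering a new direction for medical action analysis.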

Abstract: Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.

[39] Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification cs.CVPDF

Muzhou Yang, Wuzhou Quan, Mingqiang Wei

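TL;DR: The paper proposes CABIN, a framework that tackles misleading confidence in hyperspectral image classification through a closed-loop learning process of perception, action, and correction.

Details

Motivation: In hyperspectral image classification, relying on confidence alone easily leads to errors; especially under sparse annotations or class imbalance, models overfit confident but wrong predictions.

Result: Experiments show that integrating CABIN improves the labeling efficiency and performance of a wide range of existing methods.

Insight: Through closed-loop learning and dynamic strategies, CABIN handles uncertainty more reliably and offers a new direction for semi-supervised learning.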

Abstract: Confidence alone is often misleading in hyperspectral image classification, as models tend to mistake high predictive scores for correctness while lacking awareness of uncertainty. This leads to confirmation bias, especially under sparse annotations or class imbalance, where models overfit confident errors and fail to generalize. We propose CABIN (Cognitive-Aware Behavior-Informed learNing), a semi-supervised framework that addresses this limitation through a closed-loop learning process of perception, action, and correction. CABIN first develops perceptual awareness by estimating epistemic uncertainty, identifying ambiguous regions where errors are likely to occur. It then acts by adopting an Uncertainty-Guided Dual Sampling Strategy, selecting uncertain samples for exploration while anchoring confident ones as stable pseudo-labels to reduce bias. To correct noisy supervision, CABIN introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data into reliable, ambiguous, and noisy subsets, applying tailored losses to enhance generalization. Experimental results show that a wide range of state-of-the-art methods benefit from the integration of CABIN, with improved labeling efficiency and performance.

[40] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System cs.CV | eess.SYPDF

Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang

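TL;DR: VLF-MSC is a unified multimodal semantic communication system based on vision-language features: a single compact vision-language representation supports both image and text generation at the receiver, improving spectral efficiency and adaptability.

Details

Motivation: Existing semantic communication techniques usually process each modality separately, which leads to poor spectral efficiency and limited adaptability. VLF-MSC addresses this with a unified vision-language feature representation.

Result: Experiments show that VLF-MSC outperforms text-only and image-only baselines under low SNR while significantly reducing bandwidth.

Insight: A unified vision-language feature representation can serve several generation tasks at once, and using pre-trained foundation models improves robustness to channel noise and semantic fidelity.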

Abstract: We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.

[41] Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints cs.CVPDF

Xiangyue Zhang, Jianfang Li, Jianqiang Ren, Jiaxu Zhang

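TL;DR: GlobalDiff is a diffusion-based framework that, for the first time, operates directly in the space of global joint rotations; a multi-level constraint scheme alleviates hierarchical error accumulation and markedly improves the accuracy and smoothness of co-speech motion generation.

Details

Motivation: Existing generative methods typically operate on local joint rotations, which causes hierarchical error accumulation and produces unstable, implausible motion at end-effectors. The paper therefore proposes operating directly in global rotation space.

Result: On standard co-speech motion generation benchmarks, GlobalDiff improves performance by 46.0% over the current SOTA and generates smoother, more accurate motion.

Insight: Operating directly in global rotation space substantially reduces hierarchical error accumulation, but requires additional structural constraints to keep motion plausible. The multi-level constraint scheme is key for this kind of generation task.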

Abstract: Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint’s prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving the performance by 46.0% compared to the current SOTA under multiple speaker identities.
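The error accumulation described above can be seen in a toy example. The sketch below assumes a planar kinematic chain in which each rotation is a single angle, which is far simpler than the 3D rotations and tree-shaped skeletons used in practice; it does not reproduce GlobalDiff, and the angles and error size are made-up values.

```python
def global_angles(local_angles):
    """Accumulate parent-relative joint angles into global angles along a chain."""
    result, total = [], 0.0
    for angle in local_angles:
        total += angle  # each joint inherits every upstream rotation
        result.append(total)
    return result

true_local = [0.30, 0.20, 0.10, 0.05, 0.05]  # radians, root to end-effector
per_joint_error = 0.01                       # same prediction error at every joint

# Predicting local rotations: errors add up along the chain.
noisy_local = [a + per_joint_error for a in true_local]
local_space_error = [
    abs(n - t)
    for n, t in zip(global_angles(noisy_local), global_angles(true_local))
]

# Predicting global rotations directly: each joint carries only its own error.
global_space_error = [per_joint_error] * len(true_local)

print([round(e, 3) for e in local_space_error])
print(global_space_error)
```

With local prediction the orientation error grows with depth in the chain, so the end-effector is the least accurate joint; with direct global prediction every joint has the same error. The structural relationships between joints are no longer enforced by construction in the second case, which is why the abstract introduces explicit constraints.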

[42] GridPrune: From “Where to Look” to “What to Select” in Visual Token Pruning for MLLMs cs.CVPDF

Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang

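TL;DR: GridPrune is a new visual token pruning method whose two-stage “guide-globally, select-locally” strategy substantially improves the efficiency of MLLMs.

Details

Motivation: Research shows that the human visual system allocates attention in two stages, “where to look” first and “what to select” second. Existing visual token pruning methods, however, optimize “what to select” directly and neglect spatial allocation.

Result: On LLaVA-NeXT-7B, GridPrune retains 96.98% of full performance using only 11.1% of the tokens, beating the best baseline by 2.34% at the same pruning rate.

Insight: Human attention allocation can inspire efficient token pruning; the combination of global guidance and local selection is the key.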

Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to (“where to look”) before deciding which specific elements within those regions to process in detail (“what to select”). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing “what to select”, typically using attention scores or similarity metrics. They rarely consider “where to look”, which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a “guide-globally, select-locally” zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
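The two-step procedure in the abstract (allocate a token budget across zones, then select locally within each zone) can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's algorithm: the proportional budget rule, the per-zone relevance values standing in for text-conditional guidance, and all function names are hypothetical.

```python
def allocate_budget(zone_relevance, total_budget):
    """Split a token budget across zones in proportion to (positive) relevance."""
    total = sum(zone_relevance)
    raw = [total_budget * r / total for r in zone_relevance]
    budget = [int(x) for x in raw]
    # Hand out tokens lost to rounding, largest fractional remainder first.
    leftover = total_budget - sum(budget)
    order = sorted(range(len(raw)), key=lambda z: raw[z] - budget[z], reverse=True)
    for zone in order[:leftover]:
        budget[zone] += 1
    return budget

def grid_select(token_scores, zone_of_token, zone_relevance, total_budget):
    """Keep the top-scoring tokens inside each zone, up to that zone's budget."""
    budget = allocate_budget(zone_relevance, total_budget)
    kept = []
    for zone, k in enumerate(budget):
        members = [t for t, z in enumerate(zone_of_token) if z == zone]
        members.sort(key=lambda t: token_scores[t], reverse=True)
        kept.extend(members[:k])  # a zone smaller than its budget keeps all its tokens
    return sorted(kept)

token_scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.4]
zone_of_token = [0, 0, 0, 0, 1, 1, 1, 1]
zone_relevance = [1.0, 3.0]  # assumed: the text query points mostly at zone 1

print(grid_select(token_scores, zone_of_token, zone_relevance, total_budget=4))
# A global Top-K over the same scores would keep tokens [0, 2, 3, 5] instead.
```

The contrast with the global Top-K result shows the intended effect: the zone that the guidance marks as relevant keeps three of the four tokens, even though its individual token scores are lower.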

[43] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition cs.CVPDF

Qilang Ye, Yu Zhou, Lian He, Jie Zhang, Xuanming Guo

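TL;DR: SUGAR is a novel paradigm that learns action representations by combining visual-motion knowledge with skeleton data, then uses Large Language Models (LLMs) for action classification and description.

Details

Motivation: Traditional action recognition relies on hand-designed features or end-to-end deep learning, whereas LLMs hold rich implicit knowledge and strong transferability. The paper explores how to combine LLMs with human skeleton data, addressing how LLMs can understand skeletons and distinguish among action classes.

Result: Performs strongly on several skeleton-based action classification benchmarks, and in zero-shot scenarios proves more versatile than linear-based methods.

Insight: SUGAR shows the potential of combining LLMs with prior knowledge and offers a new direction for multimodal action recognition.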

Abstract: Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating LLM as a recognizer, two questions arise: 1) How can LLMs understand skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual, motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.

[44] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models cs.CVPDF

Zihan Wang, Guansong Pang, Wenjun Miao, Jin Zheng, Xiao Bai

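TL;DR: MTAttack is a multi-target backdoor attack framework against Large Vision-Language Models (LVLMs). A novel optimization method enforces accurate mappings between multiple triggers and their targets, exposing the vulnerability of LVLMs to multi-target attacks.

Details

Motivation: Existing backdoor attacks focus on single-target attacks; multi-target attacks are hard to carry out because of feature interference among triggers. The authors identify this threat and propose MTAttack to address the challenges of multi-target attacks.

Result: Experiments show that MTAttack achieves a high success rate on multi-target attacks, substantially outperforming existing methods, with strong generalizability across datasets and robustness against backdoor defense strategies.

Insight: MTAttack reveals how vulnerable LVLMs are to multi-target backdoor attacks and underscores the urgency of defending against such threats.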

Abstract: Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.

[45] RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo cs.CVPDF

Jueun Ko, Hyewon Park, Hyesong Choi, Dongbo Min

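TL;DR: RobIA is a robust, instance-aware continual test-time adaptation framework that addresses dynamic domain shift in stereo depth estimation, adapting efficiently through dynamic routing and a robust teacher model.

Details

Motivation: Stereo depth estimation in real-world environments faces dynamic domain shifts, sparse or unreliable supervision, and the high cost of dense labels; conventional test-time adaptation methods have limited effectiveness under continual shift.

Result: Experiments show that RobIA adapts well across dynamic target domains while remaining computationally efficient.

Insight: Instance-aware dynamic adaptation combined with pseudo-supervision gives RobIA greater robustness and adaptability under continual domain shift.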

Abstract: Stereo Depth Estimation in real-world environments poses significant challenges due to dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense ground-truth labels. While recent Test-Time Adaptation (TTA) methods offer promising solutions, most rely on static target domain assumptions and input-invariant adaptation strategies, limiting their effectiveness under continual shifts. In this paper, we propose RobIA, a novel Robust, Instance-Aware framework for Continual Test-Time Adaptation (CTTA) in stereo depth estimation. RobIA integrates two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), a parameter-efficient module that dynamically routes input to frozen experts via lightweight self-attention mechanism tailored to epipolar geometry, and (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. This strategy enables input-specific flexibility, broad supervision coverage, improving generalization under domain shift. Extensive experiments demonstrate that RobIA achieves superior adaptation performance across dynamic target domains while maintaining computational efficiency.

Mingda Jia, Weiliang Meng, Zenghuang Fu, Yiheng Li, Qi Zeng

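TL;DR: For dense video captioning, the paper proposes CACMI, an explicit temporal-semantic modeling framework that improves performance through cross-modal retrieval and context-aware feature enhancement.

Details

Motivation: Existing dense video captioning methods mostly rely on implicit modeling (e.g., frame-level features) and neglect the temporal coherence of event sequences and the semantic completeness of the visual context, which limits their effectiveness.

Result: Achieves SOTA performance on the ActivityNet Captions and YouCook2 datasets.

Insight: Explicitly modeling temporal and semantic information is key to dense video captioning; cross-modal interaction and context awareness compensate for the shortcomings of earlier methods.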

Abstract: Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves the state-of-the-art performance on dense video captioning task.

[47] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation cs.CV | cs.AIPDF

Mayank Vatsa, Aparna Bharati, Richa Singh

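TL;DR: This survey identifies a weakness of current mainstream text-to-image models in logical composition, in particular a collapse in performance on negation, counting, and spatial relations, and analyzes three key factors behind the failure.

Details

Motivation: Existing text-to-image models do well on individual logical primitives, but performance drops sharply when negation, counting, and spatial relations are combined, exposing a shortcoming of these models.

Result: The study finds that existing models perform very poorly on compositional tasks, and that simple scaling or incremental adjustments cannot fix the problem.

Insight: Genuine compositionality requires fundamental advances in representation and reasoning rather than minor patches to existing architectures.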

Abstract: The architectural blueprint of today’s leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives-negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.

[48] Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space cs.CVPDF

Zhicheng Cai, Hao Zhu, Linsen Chen, Qiu Shen, Xun Cao

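TL;DR: The paper proposes the split-layer, which divides each layer of a multilayer perceptron (MLP) into multiple parallel branches and integrates their outputs via Hadamard product, substantially increasing the representational capacity of implicit neural representations (INR) without a steep rise in computational cost.

Details

Motivation: INRs are widely used in signal modeling and inverse problems, but the low-dimensional feature space of conventional MLPs limits their representational capacity. Widening the MLP helps, but computational and memory costs grow quadratically, so a more efficient approach is needed.

Result: Experiments show that the split-layer outperforms existing methods on 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.

Insight: Parallel branches integrated by Hadamard product expand the feature space dimensionality efficiently, providing a low-overhead way to increase INR representational capacity.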

Abstract: Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR’s representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
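A single split-layer, as described in the abstract, can be sketched in a few lines. This is an illustrative sketch only: the abstract does not say where activations or normalization are applied, so none are included, and the helper names and weights are made up.

```python
def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def split_layer(branches, x):
    """Apply every branch to the same input and multiply the outputs elementwise."""
    out = None
    for weights, bias in branches:
        y = linear(weights, bias, x)
        # Hadamard product of the branch outputs.
        out = y if out is None else [a * b for a, b in zip(out, y)]
    return out

branches = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # branch 1: identity
    ([[1.0, 1.0], [1.0, -1.0]], [0.5, 0.0]),  # branch 2: sum and difference
]
print(split_layer(branches, [1.0, 2.0]))
```

Because each branch is affine in the input, the product of n branches is a degree-n polynomial in the input, which matches the abstract's description of constructing a high-degree polynomial space.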

[49] Physically Interpretable Multi-Degradation Image Restoration via Deep Unfolding and Explainable Convolution cs.CVPDF

Hu Gao, Xiaoning Lei, Xichen Xu, Depeng Dang, Lizhuang Ma

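TL;DR: The paper proposes InterIR, which uses a deep unfolding network and an explainable convolution module to tackle multi-degradation image restoration while preserving physical interpretability.

Details

Motivation: Real-world images often contain several degradations at once (such as rain, noise, and haze), whereas existing methods usually target a single one. In addition, methods that gain performance by stacking modules often lack interpretability.

Result: InterIR performs well on multi-degradation restoration and remains competitive on single-degradation tasks.

Insight: Combining a deep unfolding structure derived from a mathematical optimization algorithm with explainable module design can improve performance while preserving physical interpretability.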

Abstract: Although image restoration has advanced significantly, most existing methods target only a single type of degradation. In real-world scenarios, images often contain multiple degradations simultaneously, such as rain, noise, and haze, requiring models capable of handling diverse degradation types. Moreover, methods that improve performance through module stacking often suffer from limited interpretability. In this paper, we propose a novel interpretability-driven approach for multi-degradation image restoration, built upon a deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. Specifically, we employ an improved second-order semi-smooth Newton algorithm to ensure that each module maintains clear physical interpretability. To further enhance interpretability and adaptability, we design an explainable convolution module inspired by the human brain’s flexible information processing and the intrinsic characteristics of images, allowing the network to flexibly leverage learned knowledge and autonomously adjust parameters for different input. The resulting tightly integrated architecture, named InterIR, demonstrates excellent performance in multi-degradation restoration while remaining highly competitive on single-degradation tasks.

[50] CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection cs.CVPDF

Ahmed Jaheen, Islam Hassan, Mohanad Abouserie, Abdelaty Rehab, Adham Elasfar

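TL;DR: Proposes CephRes-MHNet, a multi-head residual network for efficient and accurate landmark detection in lateral skull X-rays, outperforming existing methods.

Details

Motivation: Manually annotating landmarks on skull X-rays is time-consuming and error-prone, while existing automated approaches struggle with low contrast and anatomical complexity.

Result: On a dataset of 1,000 radiographs, achieves a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) within 2 mm of 85.5%, outperforming the baselines.

Insight: Architectural efficiency is key: residual and multi-head design can improve performance while reducing the parameter count.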

Abstract: Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
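The two metrics quoted above are straightforward to compute from predicted and ground-truth landmark coordinates. The sketch below assumes the coordinates are already expressed in millimetres and uses made-up points; the function names are hypothetical and the numbers are unrelated to the paper's results.

```python
import math

def mean_radial_error(pred, truth):
    """MRE: mean Euclidean distance between predicted and true landmarks."""
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(errors) / len(errors)

def success_detection_rate(pred, truth, threshold_mm=2.0):
    """SDR: percentage of landmarks whose radial error is within the threshold."""
    hits = sum(math.dist(p, t) <= threshold_mm for p, t in zip(pred, truth))
    return 100.0 * hits / len(pred)

pred = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0)]
truth = [(0.0, 1.0), (0.0, 0.0), (1.0, 0.0)]

print(mean_radial_error(pred, truth))       # radial errors are 1, 5 and 0 mm
print(success_detection_rate(pred, truth))  # two of three landmarks are within 2 mm
```

The example also shows why both metrics are reported: one large miss raises the MRE noticeably, while the SDR only records that a single landmark fell outside the threshold.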

Stephane Da Silva Martins, Emanuel Aldea, Sylvie Le Hégarat-Mascle

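TL;DR: VISTA is a recursive goal-conditioned transformer for multi-agent trajectory prediction that combines long-term intent with social interaction, markedly improving trajectory realism and safety.

Details

Motivation: Existing methods struggle to capture agents' long-term goals and fine-grained social interactions at the same time, which yields unrealistic multi-agent predictions.

Result: On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and far fewer collisions (on MADRAS the collision rate drops from 2.14% to 0.03%; on SDD there are zero collisions).

Insight: By jointly modeling goals and interactions, VISTA generates more realistic, safer, and more interpretable trajectories, making it suitable for safety-critical autonomous systems.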

Abstract: Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents’ long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
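The displacement and collision metrics mentioned above can be sketched as follows. ADE and FDE use their usual definitions; the collision rate shown here (percentage of agent pairs that come closer than a radius at any shared timestep) is only one plausible definition and is an assumption, since the abstract does not state the exact formula. minFDE, which takes the best of several sampled futures, is omitted.

```python
import math
from itertools import combinations

def ade(pred, truth):
    """Average displacement error over all timesteps of one trajectory."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

def fde(pred, truth):
    """Final displacement error: distance at the last timestep."""
    return math.dist(pred[-1], truth[-1])

def collision_rate(trajectories, radius):
    """Percentage of agent pairs closer than `radius` at any shared timestep."""
    pairs = list(combinations(trajectories, 2))
    collided = sum(
        any(math.dist(a, b) < radius for a, b in zip(traj_a, traj_b))
        for traj_a, traj_b in pairs
    )
    return 100.0 * collided / len(pairs)

# Toy scene: agents A and B nearly touch at the second timestep; C stays far away.
agent_a = [(0.0, 0.0), (1.0, 0.0)]
agent_b = [(0.0, 1.0), (1.0, 0.1)]
agent_c = [(5.0, 5.0), (6.0, 5.0)]

print(ade([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 1.0)]))
print(fde([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 1.0)]))
print(collision_rate([agent_a, agent_b, agent_c], radius=0.2))
```

Displacement metrics are computed per agent, so a set of predictions can score well on ADE and FDE while still placing two agents in the same spot; a joint metric such as the collision rate is what exposes that.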

[52] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction cs.CVPDF

Yueran Zhao, Zhang Zhang, Chao Sun, Tianze Wang, Chao Yue

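TL;DR: HeatV2X is a scalable heterogeneous framework for V2X collaborative perception that addresses multi-modal heterogeneity and scalability through efficient heterogeneous alignment and multi-agent interaction.

Details

Motivation: Existing V2X collaborative perception frameworks face multi-modal heterogeneity and scalability challenges; as more agents participate, heterogeneity and training cost become bottlenecks.

Result: Performs strongly on the OPV2V-H and DAIR-V2X datasets, significantly reducing training cost while outperforming existing methods.

Insight: Heterogeneous alignment and lightweight fine-tuning are key to scalable collaborative perception; HeatV2X offers an efficient solution for multi-agent collaboration.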

Abstract: Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.

[53] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization cs.CVPDF

Ashutosh Anshul, Shreyas Gopal, Deepu Rajan, Eng Siong Chng

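TL;DR: The paper proposes a single-stage training framework that combines next-frame prediction with a window-level attention mechanism, improving the generalization of multimodal deepfake detection and enabling precise temporal localization.

Details

Motivation: Existing multimodal deepfake detection methods fall short in generalization and against manipulations that preserve audio-visual alignment.

Result: The model shows strong generalization and precise temporal localization on multiple benchmark datasets.

Insight: Combining temporal prediction with attention improves deepfake detection, particularly the localization of fake segments in partially spoofed samples.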

Abstract: Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.

[54] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding cs.CV

Jinxuan Li, Yi Zhang, Jian-Fang Hu, Chaolei Tan, Tianming Liang

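TL;DR: TubeRMC (Tube-conditioned Reconstruction with Mutual Constraints) is a new framework for weakly-supervised spatio-temporal video grounding (STVG) that addresses failed target identification and inconsistent tracking through text-conditioned candidate tube generation and tube-conditioned reconstruction.

Details

Motivation: Existing weakly-supervised STVG methods usually adopt a simple late-fusion strategy, which leads to failed target identification and inconsistent tracking. This paper aims to improve on this with text-conditioned candidate tube generation and reconstruction under spatio-temporal constraints.

Result: On two public benchmarks, VidSTG and HCSTVG, TubeRMC outperforms existing methods, and visualizations show that it reduces target identification errors and inconsistent tracking.

Insight: With text-conditioned tube generation and spatio-temporally constrained reconstruction, TubeRMC achieves higher-quality grounding in weakly-supervised STVG, showing an effective combination of spatio-temporal reasoning and vision-language understanding.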

Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as the condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks, VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.

[55] FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment cs.CV | cs.AI | cs.HC

Yongji Zhang, Siqi Li, Yue Gao, Yu Jiang

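TL;DR: This paper presents the FineSkiing dataset and the JudgeMind method. It provides the first fine-grained sub-score and deduction annotations for aerial skiing, and improves AQA performance and reliability by simulating how judges score.

Details

Motivation: Existing AQA methods typically score from features extracted over the entire video, which limits interpretability and reliability, and existing datasets lack fine-grained annotations for action scores.

Result: Experiments show that JudgeMind achieves state-of-the-art performance on the FineSkiing dataset.

Insight: Stage-wise scoring combined with judges' knowledge can markedly improve the interpretability and reliability of AQA, especially in fine-grained scoring tasks.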

Abstract: Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoint switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.

[56] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis cs.CV

Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia

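TL;DR: Facial-R1 proposes a three-stage alignment framework that uses instruction fine-tuning, reinforcement training, and data synthesis to address hallucinated reasoning and reasoning-recognition misalignment in facial emotion analysis, achieving SOTA performance on multiple benchmarks.

Details

Motivation: Conventional emotion analysis methods suffer from hallucinated reasoning and misalignment between reasoning and recognition. Facial-R1 aims to provide finer-grained, explainable analysis by combining emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning.

Result: It achieves the best performance on eight standard benchmarks, with strong generalization and interpretability.

Insight: Combining reinforcement training with data synthesis can effectively improve performance on fine-grained emotion analysis while making the model more interpretable.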

Abstract: Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.

[57] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning cs.CV

Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau

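TL;DR: PROPA integrates MCTS with GRPO to optimize visual reasoning with dense, process-level rewards, without human annotations.

Details

Motivation: Vision-language models rely on multi-step dependencies in complex reasoning, so early errors cascade. Among existing approaches, SFT depends on costly annotations and RLVR provides only sparse feedback, which limits optimization.

Result: Across 7 benchmarks and 4 VLM backbones, PROPA outperforms SFT and RLVR baselines, with gains of up to 17.0% on in-domain tasks and up to 21.0% on out-of-domain tasks.

Insight: Dense process-level rewards and interleaved optimization markedly improve capability and generalization in complex visual reasoning without relying on costly human annotation.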

Abstract: Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
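
As a rough illustration of how outcome-level feedback can be turned into dense, step-level signals, the sketch below scores every reasoning prefix by the mean final outcome of the rollouts that pass through it. This is a simplified Monte Carlo estimate, not the authors' MCTS or GRPO implementation; the function name `step_values` and the toy rollouts are hypothetical.

```python
from collections import defaultdict

def step_values(rollouts):
    """Estimate a value for each reasoning prefix as the mean final
    outcome of all rollouts that share that prefix (illustrative only)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, outcome in rollouts:
        for i in range(1, len(steps) + 1):
            prefix = tuple(steps[:i])
            totals[prefix] += outcome
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}

# Hypothetical rollouts: (list of reasoning steps, final reward in {0, 1}).
rollouts = [
    (["count objects", "add 2+3", "answer 5"], 1.0),
    (["count objects", "add 2+3", "answer 6"], 0.0),
    (["count objects", "add 2+4", "answer 6"], 0.0),
    (["misread image", "answer 9"], 0.0),
]
values = step_values(rollouts)
```

Here the prefix `("count objects", "add 2+3")` receives a higher value than `("count objects", "add 2+4")`, so an intermediate step is rewarded or penalized even though only final answers were checked.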

Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla

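TL;DR: The paper studies lightweight vision-language models (VLMs) for blind and low-vision (BLV) users, evaluating SmolVLM2 models of different sizes and proposing two new evaluation frameworks focused on BLV accessibility.

Details

Motivation: Large VLMs perform well at video description, but their high resource demands limit practical use for BLV users. It is therefore important to study the feasibility and effectiveness of lightweight models for these users.

Result: The paper demonstrates the potential of lightweight VLMs for BLV users and provides a more comprehensive performance analysis through the new evaluation frameworks.

Insight: Lightweight VLMs can be deployed on resource-limited devices while meeting the particular needs of BLV users, and the new evaluation frameworks offer a direction for future research.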

Abstract: Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.

[59] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models cs.CV | cs.AI

Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen

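TL;DR: The paper proposes RUDDER, a low-overhead framework that reduces object hallucination in large vision-language models through adaptive residual-update steering, balancing efficiency and reliability.

Details

Motivation: Large vision-language models (LVLMs) often generate text inconsistent with the visual input because of object hallucination, which undermines reliability. Existing methods incur extra computational overhead; RUDDER aims to resolve this tension between efficiency and effectiveness.

Result: On benchmarks including POPE and CHAIR, RUDDER performs comparably to SOTA methods with negligible added latency, supporting its efficiency and effectiveness.

Insight: RUDDER improves reliability using a single standard forward pass, offering a practical option for real-world deployment.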

Abstract: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model’s deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs’ reliability without a significant compromise on efficiency.
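
The token-wise gated injection described above can be pictured with a small sketch. The abstract does not specify the Bayesian-inspired gate, so the cosine-based gate below is purely an assumption for illustration, as are the names `steer` and `alpha`: the further a hidden state points away from the steering direction, the stronger the correction applied.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def steer(hidden, direction, alpha=1.0):
    """Add a gated multiple of `direction` to one token's hidden state.

    Assumed gate: 0 when the state is aligned with the direction,
    1 when it points the opposite way (not the paper's gate)."""
    gate = (1.0 - cosine(hidden, direction)) / 2.0
    steered = [h + alpha * gate * d for h, d in zip(hidden, direction)]
    return steered, gate

direction = [1.0, 0.0]                      # stand-in for a visual evidence vector
aligned, g_aligned = steer([2.0, 0.0], direction)
opposed, g_opposed = steer([-2.0, 0.0], direction)
orthogonal, g_orth = steer([0.0, 3.0], direction)
```

Because the direction is computed once and the gate is a cheap per-token scalar, a scheme of this shape adds little work on top of the normal forward pass, which is the trade-off the abstract emphasizes.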

[60] Rethinking Visual Information Processing in Multimodal LLMs cs.CV | cs.AI

Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C

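TL;DR: The paper proposes LLaViT, which improves visual information processing in multimodal LLMs by having the LLM also act as a vision encoder, significantly outperforming the baseline.

Details

Motivation: The LLaVA architecture performs well on multimodal tasks, but the mismatch between text and vision modalities makes its integration of visual features ineffective. The paper addresses this by proposing that the LLM serve not only as a language model but also as a powerful vision encoder.

Result: LLaViT significantly outperforms the LLaVA baseline on multiple benchmarks, even surpassing models with twice its parameter count.

Insight: LLMs are not limited to language tasks; with suitable modifications they can serve as vision encoders, offering a more effective approach to vision-language modeling.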

Abstract: Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
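
Modification (2), bidirectional attention on visual tokens, can be illustrated with an attention mask. The sketch below is an assumption about one plausible mask layout, not the authors' code: text tokens stay causal, while visual tokens may also attend to later visual tokens. It assumes a single image, so all visual tokens belong to one block; `build_attention_mask` is a hypothetical name.

```python
def build_attention_mask(is_visual):
    """Return mask[i][j] == True when token i may attend to token j.

    Causal everywhere, plus full (bidirectional) attention among
    visual tokens. Assumes all visual tokens come from one image."""
    n = len(is_visual)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            both_visual = is_visual[i] and is_visual[j]
            mask[i][j] = causal or both_visual
    return mask

# Three image tokens followed by two text tokens.
is_visual = [True, True, True, False, False]
mask = build_attention_mask(is_visual)
```

With this layout the first image token can see the last image token, but no token can see a later text token, so autoregressive text generation is unaffected.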

[61] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification cs.CV

Xiaomei Yang, Xizhan Gao, Sijie Niu, Fa Zhu, Guang Feng

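TL;DR: The paper proposes CLIP4VI-ReID, a CLIP-based modality-shared representation learning network that achieves modality alignment and shared representation learning for visible-infrared person re-identification through three modules: Text Semantic Generation, Infrared Feature Embedding, and High-level Semantic Alignment.

Details

Motivation: Because visible and infrared images differ greatly in physical characteristics, conventional methods struggle to achieve effective cross-modal alignment, so a new way to produce modality-shared representations is needed.

Result: Experiments on several VI-ReID datasets show that CLIP4VI-ReID outperforms other state-of-the-art methods.

Insight: Text semantics can serve as an effective bridge for cross-modal alignment, and high-level semantic alignment further improves the quality of the shared representation.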

Abstract: This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for the visible-infrared person re-identification (VI-ReID) task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in the physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modality-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID outperforms other state-of-the-art methods on some widely used VI-ReID datasets.

[62] Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision cs.CV | cs.AI

Yu Deng, Baozhu Zhao, Junyan Su, Xiaohan Zhang, Qi Liu

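TL;DR: This paper proposes a 3D Gaussian Splatting method that combines depth-of-field supervision with multi-view consistency supervision, addressing inconsistent depth estimation in scenes with extreme depth variation and improving depth fidelity.

Details

Motivation: In scenes with very large depth variation, existing methods cannot simultaneously address inaccurate depth estimation in far-field regions and structural degradation in near-field regions, which calls for an approach that combines physical imaging principles with learning-based depth regularization.

Result: On the Waymo Open Dataset, PSNR improves by 0.8 dB over the state-of-the-art method, which the authors present as evidence of improved depth fidelity.

Insight: The contribution lies in combining physical imaging principles with multi-view geometric constraints, offering a scalable solution to complex depth stratification in urban environments.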

Abstract: Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforce depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
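
For intuition about why defocus carries depth information, the sketch below evaluates the standard thin-lens circle-of-confusion formula, which gives the blur diameter of a point as a function of its depth. The abstract does not state which defocus model the authors use, so this is the textbook relation rather than their implementation; the function name and the lens parameters are hypothetical.

```python
def circle_of_confusion(depth, focus_dist, focal_len, aperture):
    """Thin-lens blur-circle diameter for a point at `depth`.

    All quantities share one length unit (metres here). `aperture` is the
    aperture diameter and `focus_dist` the distance that is in sharp focus."""
    if depth <= 0 or focus_dist <= focal_len:
        raise ValueError("need depth > 0 and focus_dist > focal_len")
    return (aperture * abs(depth - focus_dist) / depth
            * focal_len / (focus_dist - focal_len))

# Hypothetical 50 mm lens at f/2 (25 mm aperture), focused at 2 m.
focal_len, aperture, focus_dist = 0.05, 0.025, 2.0
in_focus = circle_of_confusion(2.0, focus_dist, focal_len, aperture)
mid = circle_of_confusion(4.0, focus_dist, focal_len, aperture)
far = circle_of_confusion(8.0, focus_dist, focal_len, aperture)
```

The blur is zero at the focus distance and grows as a point moves away from it, so a rendered depth map that is wrong produces a visibly wrong defocused image, which is what a depth-of-field loss can penalize.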

[63] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment cs.CV

Wenti Yin, Huaxin Zhang, Xiang Wang, Yuqing Lu, Yicheng Zhang

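TL;DR: The paper proposes DSANet, a weakly supervised video anomaly detection method that explicitly separates abnormal and normal features through disentangled semantic alignment, improving fine-grained classification accuracy.

Details

Motivation: Existing weakly supervised video anomaly detection methods tend to detect only the most salient response segments, neglect mining diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to poor fine-grained classification.

Result: On two benchmarks, XD-Violence and UCF-Crime, DSANet outperforms existing state-of-the-art methods.

Insight: Explicitly disentangling abnormal from normal features, combined with vision-language contrastive learning, effectively improves fine-grained classification and temporal separation in video anomaly detection.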

Abstract: Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
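
The temporal decomposition step at the fine-grained level can be sketched as follows. This is a minimal illustration assuming a hard threshold on frame-level anomaly scores and mean pooling; the abstract does not give the actual decomposition rule, and `decompose`, `mean_vector`, and the threshold of 0.5 are hypothetical.

```python
def mean_vector(vectors, dim):
    """Average a list of vectors; return a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def decompose(features, scores, threshold=0.5):
    """Split frame features into event-centric and background-centric
    parts by anomaly score, then pool each part into one vector."""
    dim = len(features[0])
    event = [f for f, s in zip(features, scores) if s >= threshold]
    background = [f for f, s in zip(features, scores) if s < threshold]
    return mean_vector(event, dim), mean_vector(background, dim)

# Toy 2-D frame features with frame-level anomaly scores.
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [0.0, 2.0]]
scores = [0.9, 0.7, 0.1, 0.2]
event_feat, background_feat = decompose(features, scores)
```

The two pooled vectors could then be contrasted against class text embeddings and a background prompt respectively, which is the role the abstract assigns to visual-language contrastive learning.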

[64] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile cs.CV | cs.AI

Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fábio Papais

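TL;DR: DermAI is a lightweight smartphone-based application that aims to advance AI in dermatology through real-time capture, annotation, and classification of skin lesion images. It targets dataset bias, variable image quality, and limited validation.

Details

Motivation: Adoption of AI in dermatology is limited by biased datasets, inconsistent image quality, and insufficient validation; DermAI aims to address these issues.

Result: Preliminary experiments show that models trained on public datasets generalize poorly to the DermAI dataset, while fine-tuning on local data improves performance.

Insight: The study highlights the importance of standardized, diverse data collection that is aligned with healthcare needs and oriented toward machine learning development.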

Abstract: AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks and local model adaptation. The DermAI clinical dataset encompasses a wide range of skin tones, ethnicities, and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang

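TL;DR: MSGNav is a zero-shot navigation system built on a Multi-modal 3D Scene Graph (M3DSG). It preserves visual cues by using dynamically assigned images as relational edges, combines several new modules to address the limitations of prior methods, and achieves state-of-the-art performance in experiments.

Details

Motivation: Existing zero-shot navigation methods usually compress visual observations into textual relations, leading to high construction cost, loss of visual information, and constrained vocabularies. A method is needed that preserves visual cues and supports open vocabulary.

Result: MSGNav achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets.

Insight: Multimodal scene graphs that retain visual relations improve generalization and accuracy in zero-shot navigation, underscoring the importance of visual information in complex tasks.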

Abstract: Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We further identify the last-mile problem in zero-shot navigation, namely determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets. The code will be made publicly available.

[66] SAMIRO: Spatial Attention Mutual Information Regularization with a Pre-trained Model as Oracle for Lane Detection cs.CV

Hyunjong Lee, Jangho Lee, Jaekoo Lee

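TL;DR: The paper proposes SAMIRO, which uses a pre-trained model as an oracle together with spatial attention mutual information regularization to improve lane detection, and which applies to a range of state-of-the-art models and datasets.

Details

Motivation: Real-world problems such as background clutter, varying illumination, and occlusion make lane detection challenging, especially for data-driven methods where data collection and annotation are costly. A method is needed that exploits contextual and global information while reducing dependence on large amounts of labeled data.

Result: On major benchmarks including CULane, Tusimple, and LLAMAS, SAMIRO consistently improves the performance of different lane detection models.

Insight: SAMIRO shows how a pre-trained model and regularization can improve task-specific performance, and its plug-and-play design makes it flexible and extensible.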

Abstract: Lane detection is an important topic for future mobility solutions. Real-world environmental challenges such as background clutter, varying illumination, and occlusions pose significant obstacles to effective lane detection, particularly when relying on data-driven approaches that require substantial effort and cost for data collection and annotation. To address these issues, lane detection methods must leverage contextual and global information from surrounding lanes and objects. In this paper, we propose a Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle, called SAMIRO. SAMIRO enhances lane detection performance by transferring knowledge from a pretrained model while preserving domain-agnostic spatial information. Leveraging SAMIRO’s plug-and-play characteristic, we integrate it into various state-of-the-art lane detection approaches and conduct extensive experiments on major benchmarks such as CULane, Tusimple, and LLAMAS. The results demonstrate that SAMIRO consistently improves performance across different models and datasets. The code will be made available upon publication.

[67] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns cs.CV | cs.AI

Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye

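TL;DR: MonkeyOCR v1.5 is a unified vision-language framework that tackles OCR on complex document layouts with a two-stage parsing pipeline. The first stage uses a large multimodal model to predict layout and reading order; the second performs localized content recognition. A visual consistency-based reinforcement learning scheme and specialized modules improve the parsing of complex tables, and experiments show it outperforms existing methods.

Details

Motivation: To address the challenges that complex real-world document layouts (multi-level tables, embedded images or formulas, cross-page structures) pose to existing OCR systems, and to improve the robustness and accuracy of document parsing.

Result: On OmniDocBench v1.5, MonkeyOCR v1.5 outperforms PPOCR-VL and MinerU 2.5 and shows strong robustness in visually complex document scenarios.

Insight: A two-stage parsing pipeline that combines visual and language information, together with dedicated handling of complex tables, is key to improving OCR on complex documents.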

Abstract: Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.

[68] LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components cs.CV

Yaru Li, Yanxue Wang, Meng Li, Xinming Li, Jianbo Feng

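TL;DR: This paper proposes the LLM-YOLOMS framework, which combines YOLOMS with a large language model (LLM) for semantic interpretation and fault diagnosis of wind turbine components. Enhanced feature extraction and semantic reasoning improve fault detection accuracy and the interpretability of maintenance recommendations.

Details

Motivation: Existing fault detection methods for wind turbine components rely mainly on visual recognition; their outputs lack semantic interpretability and cannot support maintenance decisions. The paper aims to make diagnostic results more semantic and practical by combining visual detection with a language model.

Result: Experiments show a fault detection accuracy of 90.6% and an average maintenance report accuracy of 89%, improving interpretability and decision support.

Insight: Multimodal methods that combine vision and language models can substantially enhance the semantic interpretability of industrial equipment fault diagnosis, offering a new direction for intelligent operation and maintenance.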

Abstract: The health condition of wind turbine (WT) components is crucial for ensuring stable and reliable operation. However, existing fault detection methods are largely limited to visual recognition, producing structured outputs that lack semantic interpretability and fail to support maintenance decision-making. To address these limitations, this study proposes an integrated framework that combines YOLOMS with a large language model (LLM) for intelligent fault analysis and diagnosis. Specifically, YOLOMS employs multi-scale detection and sliding-window cropping to enhance fault feature extraction, while a lightweight key-value (KV) mapping module bridges the gap between visual outputs and textual inputs. This module converts YOLOMS detection results into structured textual representations enriched with both qualitative and quantitative attributes. A domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Experiments on real-world datasets demonstrate that the proposed framework achieves a fault detection accuracy of 90.6% and generates maintenance reports with an average accuracy of 89%, thereby improving the interpretability of diagnostic results and providing practical decision support for the operation and maintenance of wind turbines.
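
To make the key-value mapping idea concrete, the sketch below turns one detection result into a structured text line with qualitative and quantitative attributes that an LLM prompt could consume. The field names, the output format, and the function `describe` are all hypothetical; the abstract does not specify the module's actual schema.

```python
def describe(det, img_w, img_h):
    """Render one detection as 'key=value' text (illustrative schema)."""
    x1, y1, x2, y2 = det["bbox"]
    # Quantitative attribute: share of the image covered by the box.
    area_pct = 100.0 * (x2 - x1) * (y2 - y1) / (img_w * img_h)
    # Qualitative attribute: coarse horizontal position of the box centre.
    cx = (x1 + x2) / 2 / img_w
    side = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    return (f"fault={det['label']}; confidence={det['confidence']:.2f}; "
            f"area={area_pct:.1f}% of image; position={side}")

detection = {"label": "crack", "confidence": 0.912, "bbox": (100, 50, 300, 150)}
line = describe(detection, img_w=1000, img_h=500)
```

For this detection the resulting line is `fault=crack; confidence=0.91; area=4.0% of image; position=left`, a form that is easy to concatenate into a prompt for the downstream language model.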

[69] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation cs.CV

Daniele Perlo, Vladimir Despotovic, Selma Boudissa, Sang-Yoon Kim, Petr Nazarov

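TL;DR: The paper introduces RodEpil, a video dataset of laboratory rodents for seizure detection, with clips labeled as normal activity or seizure, and reports an average F1-score of 97% with a TimeSformer model.

Details

Motivation: To meet the need for non-invasive video monitoring in preclinical epilepsy research, the authors provide a high-quality video dataset and a benchmark evaluation.

Result: Experiments show that TimeSformer distinguishes seizures from normal activity with an average F1-score of 97%.

Insight: The dataset and baseline code provide a reproducible basis for non-invasive video monitoring in preclinical epilepsy research.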

Abstract: We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
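
Subject-wise partitioning means folds are formed over animals rather than over clips, so clips from one animal never appear in both training and test data. The sketch below shows one simple way to do this with round-robin assignment; the abstract does not describe how the authors assigned subjects to folds, and the subject IDs and the function `subject_folds` are hypothetical.

```python
def subject_folds(clip_subjects, k=5):
    """Assign each clip to a fold so that all clips of one subject
    share a fold. Subjects are distributed round-robin (an assumption)."""
    subjects = sorted(set(clip_subjects))
    fold_of = {s: i % k for i, s in enumerate(subjects)}
    folds = [[] for _ in range(k)]
    for clip_idx, subject in enumerate(clip_subjects):
        folds[fold_of[subject]].append(clip_idx)
    return folds, fold_of

# Toy data: 95 clips spread over 19 hypothetical subject IDs.
clip_subjects = [f"S{(i % 19) + 1:02d}" for i in range(95)]
folds, fold_of = subject_folds(clip_subjects)
```

In practice a library utility such as scikit-learn's `GroupKFold` serves the same purpose. With 19 subjects and five folds, four folds hold four subjects and one holds three, so fold sizes are uneven by construction.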

[70] OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data cs.CV | cs.LG

Simon Donike, Cesar Aybar, Julio Contreras, Luis Gómez-Chova

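TL;DR: OpenSR-SRGAN is an open-source, modular super-resolution framework designed for multispectral Earth observation data that aims to simplify configuring and applying SRGAN-style models.

Details

Motivation: Existing super-resolution implementations usually require code changes to fit different tasks and datasets, which raises the barrier to use. OpenSR-SRGAN aims to lower it through a configuration-driven approach.

Result: OpenSR-SRGAN provides a ready-to-use solution that supports multispectral satellite data such as Sentinel-2 and can serve as a benchmark implementation.

Insight: Turning super-resolution into a configuration-driven workflow improves flexibility and reproducibility across a wide range of Earth observation datasets.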

Abstract: We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.

[71] SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers cs.CV | eess.IV

Oded Schlesinger, Amirhossein Farzam, J. Matias Di Martino, Guillermo Sapiro

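TL;DR: SPOT proposes an attention-based dynamic token sparsification method that predicts token relevance and prunes redundant tokens, yielding efficiency gains of up to 40% for ViTs while maintaining or improving accuracy.

Details

Motivation: The computational demands of Vision Transformers (ViTs) grow quadratically with the number of tokens, so an efficient way to remove redundant tokens is needed.

Result: Experiments show efficiency gains of up to 40% over standard ViTs while maintaining or improving accuracy, and the predictors plug into various ViT architectures.

Insight: SPOT's token pruning is general and interpretable, offering a new direction for efficient ViT deployment.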

Abstract: While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT .
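
Once a relevance score is available for each token, sparsification itself reduces to keeping the highest-scoring tokens before attention is computed. The sketch below shows only that selection step with made-up scores; in SPOT the scores come from learned lightweight predictors, which are not reproduced here, and `prune_tokens` and `keep_ratio` are hypothetical names.

```python
def prune_tokens(tokens, scores, keep_ratio=0.6):
    """Keep the top-scoring fraction of tokens, preserving their order."""
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore original token order
    return [tokens[i] for i in keep], keep

tokens = ["a", "b", "c", "d", "e"]       # stand-ins for patch embeddings
scores = [0.9, 0.1, 0.5, 0.7, 0.2]       # stand-ins for predicted relevance
kept_tokens, kept_indices = prune_tokens(tokens, scores)
```

Since self-attention cost grows quadratically with sequence length, keeping 60% of the tokens cuts the attention term to roughly 36% of its original cost in the layers after pruning, before accounting for the predictor's own overhead.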

[72] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation cs.CV | cs.ROPDF

Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou

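TL;DR: SemanticVLA proposes a new vision-language-action (VLA) framework that uses semantic-aligned sparsification and enhancement to improve the efficiency and performance of robotic manipulation.

Details

Motivation: Current VLA models for robotic manipulation have two key problems: 1) perceptual redundancy in the visual input leads to inefficient processing; 2) shallow instruction-vision alignment hampers the semantic grounding of actions.

Result: On the LIBERO benchmark, SemanticVLA's success rate exceeds OpenVLA's by 21.1%, while training cost and inference latency are reduced by 3.0-fold and 2.7-fold, respectively.

Insight: Semantic-aligned sparsification and enhancement can improve both the efficiency and the performance of robotic manipulation while lowering compute cost.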

Abstract: Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, Semantic-guided Dual Visual Pruner (SD-Pruner) performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA

[73] Dynamic Avatar-Scene Rendering from Human-centric Context cs.CVPDF

Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu

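TL;DR: The paper proposes a Separate-then-Map (StM) strategy for reconstructing dynamic humans interacting with real-world environments from monocular video. A dedicated information mapping mechanism bridges separately optimized models, markedly improving visual quality and rendering accuracy.

Details

Motivation: Existing methods either model the dynamic scene holistically or model scenes and backgrounds separately while introducing parametric human priors. They either fail to handle the distinct motion characteristics of different components (especially humans) or ignore information exchange between components, leading to spatial inconsistencies and visual artifacts.

Result: Experiments on monocular video datasets show that StM significantly outperforms existing methods in visual quality and rendering accuracy, especially at complex human-scene interaction boundaries.

Insight: Modeling components separately and adding a dedicated information mapping mechanism effectively addresses spatial inconsistencies and visual artifacts in human-scene interaction.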

Abstract: Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model scenes and backgrounds separately while introducing parametric human priors. However, these approaches either neglect the distinct motion characteristics of the various components in a scene, especially humans, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.

[74] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space cs.CV | cs.AIPDF

Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu

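TL;DR: The paper introduces a new task and an open-source method, CoTyle, which generates images with novel and consistent visual styles from a numerical style code, avoiding the reliance on text prompts, reference images, or complex style representations found in conventional methods.

Details

Motivation: Existing style generation methods usually rely on text prompts or reference images and struggle to guarantee style consistency and diversity. To address this, the paper explores controlling style generation directly through a numerical code.

Result: Experiments confirm that CoTyle can generate diverse and consistent visual styles from a single numerical code, demonstrating the method's effectiveness and creativity.

Insight: Complex visual styles can be controlled efficiently with a simple numerical code, giving the style generation field a new approach and tool.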

Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating that a style is worth one code.

[75] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded cs.CVPDF

Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo

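TL;DR: OmniVGGT is a multimodality-driven visual geometry framework that uses a GeoAdapter and a stochastic multimodal fusion strategy to exploit geometric information (such as depth and camera parameters), achieving SOTA results with both RGB-only and multimodal inputs.

Details

Motivation: Existing foundation models mostly rely on RGB input and overlook geometric information. OmniVGGT aims to make full use of these auxiliary modalities to improve model performance.

Result: It outperforms existing methods on monocular/multi-view depth estimation, multi-view stereo, and related tasks; integrated into VLA models, it improves performance and beats a point-cloud-based baseline.

Insight: Geometric information matters for vision tasks, and a lightweight multimodal design can improve results without hurting efficiency.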

Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model’s representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
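
The stochastic multimodal fusion regimen, which samples a modality subset per training instance, might look like the sketch below. It is illustrative only: the modality names, the independent keep-or-drop rule, and the `keep_prob` parameter are assumptions, since the abstract does not specify the sampling distribution.

```python
import random
from typing import List, Sequence

# Illustrative modality names; the abstract mentions depth and camera intrinsics/extrinsics.
AUX_MODALITIES = ("depth", "intrinsics", "extrinsics")


def sample_modalities(rng: random.Random,
                      available: Sequence[str] = AUX_MODALITIES,
                      keep_prob: float = 0.5) -> List[str]:
    """Independently keep each auxiliary modality with probability keep_prob.

    Assumed sampling rule; the returned subset may be empty (RGB-only).
    """
    return [m for m in available if rng.random() < keep_prob]


rng = random.Random(0)
# One subset per training instance, so the model sees every combination over time.
subsets = [sample_modalities(rng) for _ in range(4)]
```

Training on randomly varying subsets is what lets a single model accept any number of auxiliary inputs at test time, including none.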

[76] From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis cs.CVPDF

Yen Nhi Truong Vu, Dan Guo, Sripad Joshi, Harshit Kumar, Jason Su

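TL;DR: The paper proposes M&M-3D, an architecture that transfers a 2D FFDM model to 3D DBT data without adding parameters, significantly improving breast cancer detection performance.

Details

Motivation: Annotated DBT data is limited, and existing methods either discard 3D information or need complex architectures and large amounts of data. M&M-3D aims to reuse the 2D model's parameters efficiently while retaining 3D reasoning.

Result: M&M-3D surpasses 2D projection and 3D slice-based baselines by 11-54% on localization and 3-10% on classification, and clearly outperforms complex 3D reasoning variants in the low-data regime.

Insight: 3D reasoning does not require complex architectures or large datasets; efficient feature design and transfer learning can significantly improve performance. The approach may carry over to other 3D medical imaging tasks.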

Abstract: Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in the high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms the previous top baseline by 4% for classification and 10% for localization.

[77] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models cs.CVPDF

Aleksandr Razin, Danil Kazantsev, Ilya Makarov

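TL;DR: This paper proposes the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly in the diffusion model's latent space, avoiding the latency and artifacts of conventional pixel-space super-resolution.

Details

Motivation: Diffusion models are slow and costly to sample beyond their training resolution, while conventional image super-resolution operates after decoding and introduces artifacts and extra latency.

Result: LUA adds only 0.42 s for 1024 px generation, versus 1.87 s for pixel-space super-resolution, giving nearly 3x lower decoding and upscaling time at comparable perceptual quality.

Insight: LUA generalizes well across the latent spaces of different VAEs, so it can be deployed without retraining from scratch for each new decoder.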

Abstract: Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator’s latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.

[78] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling cs.CVPDF

Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu

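TL;DR: The paper proposes Self-Consistency Sampling (SCS), which uses visual perturbations and trajectory resampling to address unfaithful trajectories in outcome-reward reinforcement learning (RL) training of multimodal large language models (MLLMs), significantly improving results on multimodal benchmarks.

Details

Motivation: In the multiple-choice setting of multimodal reasoning benchmarks, outcome-reward RL training has a common flaw: a model can guess the correct answer even when its reasoning chain is wrong, so unfaithful trajectories receive the same reward as genuine reasoning.

Result: On Qwen2.5-VL-7B-Instruct, SCS improves accuracy by up to 7.7 percentage points, and it also yields notable gains on Qwen2.5-VL-3B-Instruct and InternVL3-8B.

Insight: SCS is a simple, general remedy for unfaithful trajectories in multimodal RL training, with negligible extra computation and applicability to several RL algorithms.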

Abstract: Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
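
The idea of scoring a trajectory by agreement among its resampled continuations can be sketched as follows. This is a simplified illustration, not the paper's formulation: it uses a plain agreement fraction rather than the differentiable score the abstract mentions, and the names `consistency_score` and `weighted_reward` are hypothetical.

```python
from typing import Sequence


def consistency_score(initial_answer: str, resampled_answers: Sequence[str]) -> float:
    """Fraction of resampled trajectories that reach the initial answer."""
    if not resampled_answers:
        return 0.0
    agree = sum(1 for a in resampled_answers if a == initial_answer)
    return agree / len(resampled_answers)


def weighted_reward(outcome_reward: float, score: float) -> float:
    """Scale the outcome reward so that low-agreement traces contribute less."""
    return outcome_reward * score


# A trace that picked "B", re-run four times under perturbation and resampling.
score = consistency_score("B", ["B", "B", "C", "B"])
reward = weighted_reward(1.0, score)
```

A lucky guess after faulty reasoning tends to change its answer when the image is perturbed or the chain is resampled, so its score, and with it the reward it receives, drops.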

cs.CL [Back]

[79] Order Matters: Rethinking Prompt Construction in In-Context Learning cs.CLPDF

Warren Li, Yiqian Wang, Zihan Wang, Jingbo Shang

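TL;DR: This paper studies the importance of example ordering in in-context learning (ICL), finds that its effect on performance is comparable to that of example selection, and shows that strong orderings can be identified with a development set.

Details

Motivation: Prior work has focused on how example selection affects ICL performance and overlooked example ordering. This paper re-evaluates the relative importance of selection and ordering.

Result: Performance variance across example orderings is comparable to that from using entirely different example sets; a development-set-based method identifies near-optimal orderings.

Insight: Example selection and ordering are equally important in prompt design, and the assumptions held in ICL need re-examination.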

Abstract: In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
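
Choosing an ordering with a development set can be sketched as a search over permutations. This is a minimal illustration under stated assumptions: `dev_score` is a hypothetical callable standing in for "build the prompt in this order and measure accuracy on the development set", and exhaustive enumeration is used only because the example set is tiny.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple


def best_ordering(examples: Sequence[str],
                  dev_score: Callable[[Tuple[str, ...]], float]) -> Tuple[str, ...]:
    """Return the permutation of the examples with the highest dev-set score."""
    return max(permutations(examples), key=dev_score)


def toy_score(order: Tuple[str, ...]) -> float:
    """Stand-in for a real dev-set evaluation of a prompt built in this order."""
    return float(order[0] == "b") + float(order[-1] == "a")


best = best_ordering(["a", "b", "c"], toy_score)
```

The number of permutations grows factorially with the number of examples, so for more than a handful of demonstrations one would score a random sample of orderings instead of all of them.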

C. LeMay, A. Lane, J. Seales, M. Winstead, S. Baty

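TL;DR: This paper explores the potential of Natural Language Processing (NLP) in social science research. A case study identifying signaling themes in Presidential Directives from the Reagan through Clinton administrations shows how NLP can help analyze large text corpora, and also reveals discrepancies between NLP and human-labeled results.

Details

Motivation: Social science research often involves large-scale text analysis, where traditional methods are slow and subjective. The authors hope NLP can make analysis more efficient and objective, and they test its suitability for this domain.

Result: NLP identified relevant documents effectively, but gaps remained relative to human labels, indicating that existing tools need improvement.

Insight: NLP has potential in social science research, but its accuracy needs further validation. Tools change quickly as the technology evolves, so research must keep pace.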

Abstract: Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLP in research involving large written corpora. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving landscape of AI/ML means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.

[81] How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation cs.CL | cs.AIPDF

Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa

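TL;DR: The paper studies how to deploy compact language models (sub-2B parameters) on edge devices for efficient Critical Error Detection (CED) in machine translation, and finds that models around one billion parameters (such as Gemma-3-1B) offer the best quality-efficiency trade-off.

Details

Motivation: Large Language Models (LLMs) excel at machine translation evaluation, but their scale and cost limit use on edge devices and in privacy-sensitive settings. The paper asks whether compact models can stay accurate while being efficient to deploy.

Result: Gemma-3-1B gives the best trade-off (MCC=0.77, F1-ERR=0.98) with 400 ms single-sample latency on a MacBook Pro M4 Pro; the larger Qwen-3-1.7B attains a higher MCC but at higher compute cost.

Insight: Compact, instruction-tuned LLMs combined with lightweight calibration and small-sample supervision can deliver efficient, privacy-preserving, real-time error detection for practical translation pipelines.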

Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
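
Two pieces of the evaluation described above, majority voting over repeated predictions and the Matthews correlation coefficient (MCC), are standard and can be written out in a few lines. The sketch below is generic and is not taken from the authors' scripts; the label strings and the encoding 1 = ERR, 0 = NOT are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = ERR, 0 = NOT)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


vote = majority_vote(["ERR", "NOT", "ERR"])
perfect = mcc([1, 1, 0, 0], [1, 1, 0, 0])
```

MCC ranges from -1 to 1 and stays informative when critical errors are rare, which is one reason it is often reported alongside per-class F1 for this kind of task.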

[82] Improving Graduate Outcomes by Identifying Skills Gaps and Recommending Courses Based on Career Interests cs.CL | cs.CYPDF

Rahul Soni, Basem Suleiman, Sonit Singh

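TL;DR: The paper proposes a course recommendation system based on data analytics and machine learning that combines user preferences, academic criteria, and industry trends to help students choose courses aligned with their career interests.

Details

Motivation: Students currently lack concrete guidance when selecting courses and struggle to match industry needs. The paper aims to build an intelligent recommendation system that helps students make better-informed course choices.

Result: The resulting system provides personalized course recommendations that help students make better-informed academic and career decisions.

Insight: The system's success rests on combining multiple data sources with user feedback, addressing the disconnect between education and industry needs.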

Abstract: This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. The front-end design prioritises visual clarity, interaction, and simplicity through iterative prototyping and user input revisions, guaranteeing a smooth and captivating user experience. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations. We hope that the proposed course recommendation system will help university students make data-driven and industry-informed course decisions, in turn improving graduate outcomes for the university sector.

[83] Answering Students’ Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM cs.CL | cs.CYPDF

Neo Wang, Sonit Singh

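TL;DR: To address delayed answers and repetitive questions on course forums, the paper proposes a question answering system based on retrieval-augmented generation (RAG) and a large language model (LLM), and uses multiple chain-of-thought reasoning to reduce hallucination.

Details

Motivation: As course enrollment grows, instructors struggle to answer forum questions promptly, and repetitive questions are frequent. These challenges motivate an automated question answering system to improve efficiency.

Result: Experiments show that a fine-tuned LLM combined with RAG performs strongly on question answering with the HotpotQA dataset.

Insight: The study shows the potential of RAG and multiple chain-of-thought reasoning for improving LLM question answering, particularly in education, and offers a scalable approach for similar settings.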

Abstract: Course forums are increasingly important and play a vital role in facilitating student discussions and answering course-related questions. They provide a platform for students to post questions about course content and administrative issues. However, growing enrollment creates several challenges. The primary one is that students’ queries cannot be answered immediately and instructors face many repetitive questions. To mitigate these issues, we propose a question answering system based on a large language model with the retrieval-augmented generation (RAG) method. This work focuses on designing a question answering system with an open-source Large Language Model (LLM) and fine-tuning it on the relevant course dataset. To further improve performance, we use a local knowledge base containing all the course content and apply the RAG method to retrieve documents relevant to students’ queries. To mitigate hallucination in LLMs, we also integrate multi chain-of-thought reasoning. In this work, we evaluate the fine-tuned LLM with the RAG method on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with the RAG method performs strongly on the question answering task.

[84] In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback cs.CLPDF

Mingye Zhu, Yi Liu, Zheren Fu, Quan Wang, Yongdong Zhang

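TL;DR: InTRO (In-Token Rationality Optimization) is a new framework that achieves accurate and concise LLM reasoning through self-feedback and token-level exploration. It avoids the generalization problems and computational cost of conventional approaches, raising accuracy on math reasoning tasks by up to 20% relative to the base model while reducing verbosity.

Details

Motivation: Supervised fine-tuning on a single golden rationale penalizes other valid alternatives and limits generalization, while reinforcement learning with verifiable rewards faces credit assignment problems and high computational cost. InTRO aims to address these limitations.

Result: On six math reasoning benchmarks, InTRO raises accuracy by up to 20% relative to the base model and produces more concise reasoning chains; it also transfers to reasoning tasks outside mathematics.

Insight: Combining token-level exploration with self-feedback is key to optimizing LLM reasoning efficiently; correction factors provide a lightweight credit assignment mechanism that avoids the complexity of conventional reinforcement learning.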

Abstract: Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single “golden” rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.

[85] HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning cs.CL | cs.LGPDF

Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna

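TL;DR: HierRouter proposes a reinforcement-learning-based hierarchical routing method that dynamically assembles inference pipelines from multiple lightweight specialized language models to improve performance and reduce cost.

Details

Motivation: Large Language Models (LLMs) perform well but carry high computational and memory costs, limiting their use in resource-constrained or real-time settings. The authors propose HierRouter to address this.

Result: Across six benchmarks, HierRouter improves response quality by up to 2.4x over using individual models independently, with only a small additional inference cost.

Insight: Hierarchical routing coordinates model resources efficiently and offers a workable path to high-performance LLM inference in resource-constrained settings.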

Abstract: Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All code can be found at https://github.com/Nikunj-Gupta/hierouter.

[86] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models cs.CL | cs.CRPDF

Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang

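TL;DR: EnchTable is a new framework for maintaining safety alignment in fine-tuned large language models (LLMs) without retraining. It uses NTK-based safety vector distillation and interference-aware merging to balance safety and utility.

Details

Motivation: Fine-tuning LLMs can systematically degrade safety alignment and raise the risk of harmful outputs. Existing methods need extensive retraining or sacrifice performance, so an efficient and general solution is needed.

Result: Evaluated on 11 datasets, EnchTable achieves a significantly lower unsafe rate and higher utility score than six parameter modification methods and two inference-time alignment baselines, and it resists jailbreaking attacks well.

Insight: Safety alignment can be transferred efficiently with distillation and interference-aware merging, without retraining; generality across models and task domains is feasible.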

Abstract: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable’s generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.

[87] HI-TransPA: Hearing Impairments Translation Personal Assistant cs.CL | cs.MM | cs.SDPDF

Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan

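TL;DR: HI-TransPA is a multimodal personal assistant for hearing-impaired individuals that fuses speech with lip dynamics for translation and dialogue, using curriculum learning and high-quality data preprocessing to improve robustness.

Details

Motivation: To provide hearing-impaired individuals with a unified solution for daily communication, and to address the limitations of existing Omni-Models on noisy data and hearing-impaired speech.

Result: On the HI-Dialogue dataset, HI-TransPA achieves SOTA performance in both literal accuracy and semantic fidelity.

Insight: The Omni-Model paradigm can be applied effectively to assistive technology, and future research can build on it.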

Abstract: To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
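
The quality-guided curriculum, which starts from clean, high-confidence samples and progressively admits harder ones, can be sketched as threshold-based staging. The thresholds, the sample representation, and the function name `curriculum_stages` are assumptions for illustration; the abstract does not give the actual schedule.

```python
from typing import List, Sequence, Tuple


def curriculum_stages(samples: Sequence[Tuple[str, float]],
                      thresholds: Sequence[float] = (0.8, 0.5, 0.0)) -> List[List[str]]:
    """For each threshold, list the ids of samples whose quality score meets it.

    Thresholds are assumed to decrease, so each stage extends the previous one.
    """
    return [[sid for sid, quality in samples if quality >= t] for t in thresholds]


# (sample id, quality score in [0, 1]); the scores here are made up.
stages = curriculum_stages([("a", 0.9), ("b", 0.6), ("c", 0.2)])
```

Each later stage is a superset of the one before it, so the model keeps seeing the clean samples while the harder cases are introduced.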

Pritish Sahu, Anirudh Som, Dimitra Vergyri, Ajay Divakaran

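TL;DR: The paper proposes Norm-RAG, a retrieval-augmented framework for social norm inference in multi-turn dialogues, and introduces the MINDS dataset of Mandarin-English and Spanish-English bilingual conversations for norm classification and adherence detection.

Details

Motivation: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Existing annotated datasets mostly cover isolated utterances or synthetic dialogues and cannot capture the multi-turn flow of real conversations, so better models and multicultural datasets are needed.

Result: Experiments show that Norm-RAG improves norm detection and generalization, benefiting culturally adaptive and socially intelligent dialogue systems.

Insight: Social norm inference needs multi-attribute modeling and cultural context; semantic retrieval improves the model's interpretability and adaptability.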

Abstract: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes, including communicative intent, speaker roles, interpersonal framing, and linguistic cues, and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization, demonstrating improved performance for culturally adaptive and socially intelligent dialogue systems.

[89] Leveraging Large Language Models for Identifying Knowledge Components cs.CL | cs.HCPDF

Canwen Wang, Jionghao Lin, Kenneth R. Koedinger

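TL;DR: The paper studies how to automate the identification of knowledge components (KCs) with large language models (LLMs) and proposes a semantic merging strategy based on cosine similarity that reduces redundant labels and improves performance.

Details

Motivation: Manual annotation of KCs is a bottleneck for adaptive learning systems, while prior LLM-based approaches were limited to small datasets and produced redundant labels.

Result: Merging reduced the KC count from 569 to 428 and improved RMSE from 0.4285 to 0.4259 (lower is better), moving toward the expert-designed model (0.4206).

Insight: LLM-generated KC labels alone are of limited use, but combining them with semantic merging noticeably improves the automated identification pipeline.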

Abstract: Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a “simulated textbook” LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model’s performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
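As a rough illustration of merging semantically similar KC labels by cosine similarity, the sketch below greedily maps each label to the first earlier label whose embedding is at least as similar as the threshold. The 0.8 threshold comes from the abstract; the greedy procedure, the function names, and the toy three-dimensional vectors are assumptions, since the abstract does not specify the embedding model or the merging algorithm.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_labels(
    embeddings: Dict[str, List[float]], threshold: float = 0.8
) -> Dict[str, str]:
    """Map each label to a representative label (hypothetical greedy scheme)."""
    representatives: List[str] = []
    mapping: Dict[str, str] = {}
    for label, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                mapping[label] = rep  # close enough: reuse the existing KC
                break
        else:
            representatives.append(label)  # no match: this label starts a new KC
            mapping[label] = label
    return mapping


# Toy embeddings; real ones would come from a sentence-embedding model.
toy = {
    "adding fractions": [1.0, 0.0, 0.1],
    "fraction addition": [0.9, 0.1, 0.1],
    "solving linear equations": [0.0, 1.0, 0.0],
}
mapping = merge_labels(toy)
print(mapping)
print("KC count after merging:", len(set(mapping.values())))
```

A greedy pass depends on label order; clustering on the full similarity matrix would be an alternative design with the same threshold.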

[90] REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering cs.CLPDF

Yijie Zhu, Haojie Zhou, Wanting Hong, Tailin Liu, Ning Wang

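TL;DR: The paper proposes REAP, which improves multi-hop question answering through recursive evaluation and adaptive planning, addressing the lack of global planning and the under-use of retrieved clues in existing RAG methods.

Details

Motivation: Existing RAG methods lack global planning on multi-hop reasoning tasks and can get stuck in local reasoning impasses; they also under-exploit retrieved content, which leads to inaccurate answers.

Result: On multiple multi-hop datasets, REAP significantly outperforms existing RAG methods in both in-domain and out-of-domain settings.

Insight: Global planning and dynamic optimization of the solving trajectory are key to multi-hop QA performance; a unified task paradigm enables multi-task fine-tuning that helps on data-scarce tasks.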

Abstract: Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP’s performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.

[91] NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction cs.CLPDF

Peter Røysland Aarnes, Vinay Setty

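TL;DR: The paper uses numerical perturbations to evaluate large language models on veracity prediction and finds that models are weak at numerical reasoning and sensitive to perturbations; accuracy generally drops as context grows but largely recovers when the extended context includes perturbed demonstrations.

Details

Motivation: Large language models perform well on knowledge-intensive tasks but struggle with numerical reasoning. The authors run a systematic evaluation to expose the limits and robustness problems of models in numerical veracity prediction.

Result: Leading models lose up to 62% accuracy under certain perturbations; longer context reduces accuracy, but enriching it with perturbed demonstrations substantially restores performance for most models.

Insight: Numerical reasoning, and robustness in particular, remains a weak point of large language models; longer context does not help by itself and is useful mainly when it carries perturbed demonstrations.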

Abstract: Large language models show strong performance on knowledge intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
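To make the idea of a controlled numerical perturbation concrete, here is a minimal sketch that rescales every number in a claim. The abstract does not enumerate the perturbation types used, so this scaling rule, the function name, and the example claim are purely illustrative assumptions.

```python
import re


def perturb_numbers(claim: str, factor: float) -> str:
    """Multiply every integer or decimal literal in the claim by `factor`.

    A hypothetical perturbation: a claim that was supported by its evidence
    should typically stop being supported once its numbers are rescaled.
    """

    def repl(match: re.Match) -> str:
        value = float(match.group()) * factor
        # Print whole numbers without a trailing ".0".
        return str(int(value)) if value.is_integer() else f"{value:g}"

    return re.sub(r"\d+(?:\.\d+)?", repl, claim)


original = "The city has 120 schools and 4.5 percent unemployment."
perturbed = perturb_numbers(original, factor=2.0)
print(perturbed)  # The city has 240 schools and 9 percent unemployment.
```

A robustness probe would then compare a model's verdict on the original and perturbed claims against the same evidence.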

[92] Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation cs.CLPDF

Bo Li, Zhenghua Xu, Rui Xie

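TL;DR: The paper studies language drift in multilingual retrieval-augmented generation (RAG), finds that it stems from decoder-level collapse rather than comprehension failure, and proposes Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that mitigates it.

Details

Motivation: When retrieved evidence is in a different language from the query, multilingual RAG models drift into an unintended output language, especially during reasoning-intensive decoding, which hurts multilingual task performance.

Result: Across three multilingual datasets and typologically diverse languages, SCD consistently improves language alignment and task performance.

Insight: Language drift comes mainly from decoder-level collapse, not comprehension failure; English acts as both the strongest source of interference and the most frequent fallback language.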

Abstract: Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
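The core mechanism named in this abstract, penalizing non-target-language tokens at decoding time, can be sketched as a logit adjustment followed by a softmax. This is an assumption-laden illustration rather than the paper's SCD: the constant additive penalty, the token-to-language assignment, and the toy vocabulary are all hypothetical.

```python
import math
from typing import Dict, Set


def soft_constrained_probs(
    logits: Dict[str, float], target_tokens: Set[str], penalty: float
) -> Dict[str, float]:
    """Subtract `penalty` from the logit of every token outside the target
    language, then renormalize with a numerically stable softmax."""
    adjusted = {
        tok: (logit if tok in target_tokens else logit - penalty)
        for tok, logit in logits.items()
    }
    peak = max(adjusted.values())
    exps = {tok: math.exp(logit - peak) for tok, logit in adjusted.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Toy next-token logits where an English token dominates a German generation.
logits = {"the": 2.0, "der": 1.5, "und": 1.0}
probs_before = soft_constrained_probs(logits, target_tokens=set(logits), penalty=0.0)
probs_after = soft_constrained_probs(logits, target_tokens={"der", "und"}, penalty=3.0)
print(max(probs_before, key=probs_before.get))  # the
print(max(probs_after, key=probs_after.get))    # der
```

Because the penalty is finite, non-target tokens stay available when the model strongly prefers them (for example, named entities), which is what makes the constraint soft rather than hard.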

[93] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models cs.CL | cs.AIPDF

Shivam Sharma, Riya Naik, Tejas Gawas, Heramb Patil, Kunal Korgaonkar

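TL;DR: The paper presents PustakAI, a framework for designing and evaluating NCERT-QA, a question-answering dataset aligned with India's NCERT curriculum (English and Science, grades 6 to 8), and uses several prompting techniques and evaluation metrics to analyze the suitability and limits of open-source and high-end LLMs in education.

Details

Motivation: Large language models (LLMs) hold great promise in education, especially for delivering personalized and interactive learning in regions with limited teaching resources. Adapting them to curriculum-specific content such as India's NCERT syllabus, however, still raises challenges of accuracy, alignment, and pedagogical relevance.

Result: The results suggest that some prompting techniques (such as CoT) fit curriculum demands better and that high-end models perform better in educational settings, while open-source models remain practical where resources are limited.

Insight: The paper shows both the potential and the challenges of LLMs in education, the importance of prompting technique for curriculum alignment, and the trade-offs between model sizes in resource-constrained settings.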

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework “PustakAI” (Pustak means ‘book’ in many Indian languages) for the design and evaluation of a novel question-answering dataset “NCERT-QA” aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.

[94] ScaleFormer: Span Representation Cumulation for Long-Context Transformer cs.CLPDF

Jiangshu Du, Wenpeng Yin, Philip Yu

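TL;DR: ScaleFormer is a plug-and-play framework that adapts pre-trained models to long-context tasks through chunking and context cumulation, without architectural changes or training from scratch.

Details

Motivation: The quadratic complexity of standard self-attention limits Transformers on long-context tasks, and existing efficient variants usually require architectural changes and pre-training from scratch.

Result: On long-document summarization, ScaleFormer is highly competitive with and often outperforms state-of-the-art approaches, without architectural modifications or external retrieval mechanisms.

Insight: Simple chunking combined with cumulative context improves how pre-trained models handle long text, which points to the value of structural awareness in long-context tasks.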

Abstract: The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer), a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk’s representation with structural awareness of its position within the document. It achieves this by enriching each chunk’s boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document’s narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
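The two ingredients named in this abstract, overlapping chunks and cumulative context from preceding and succeeding chunks, can be sketched in a few lines. This is not the authors' fusion mechanism: how the cumulative vectors are combined with each chunk's boundary representations is not described here, so the code only computes the chunks and the per-chunk (preceding, succeeding) sums in one linear pass each. The function names and the toy vectors are assumptions.

```python
from typing import List, Tuple

Vector = List[float]


def overlapping_chunks(tokens: list, size: int, overlap: int) -> List[list]:
    """Split `tokens` into windows of `size` that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks = []
    for start in range(0, len(tokens), size - overlap):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


def cumulative_context(chunk_vectors: List[Vector]) -> List[Tuple[Vector, Vector]]:
    """For chunk i, return (sum of vectors before i, sum of vectors after i)."""
    dim = len(chunk_vectors[0])
    prefix = [[0.0] * dim]
    for vec in chunk_vectors[:-1]:
        prefix.append([p + x for p, x in zip(prefix[-1], vec)])
    suffix = [[0.0] * dim]
    for vec in reversed(chunk_vectors[1:]):
        suffix.append([s + x for s, x in zip(suffix[-1], vec)])
    suffix.reverse()
    return list(zip(prefix, suffix))


chunks = overlapping_chunks(list(range(10)), size=4, overlap=2)
print(chunks)

# One toy summary vector per chunk (a real system would use encoder states).
context = cumulative_context([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
print(context)
```

Running prefix and suffix sums keep the cost linear in the number of chunks, which matches the linear-complexity claim at the level of this bookkeeping step.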

[95] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism cs.CLPDF

Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu

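TL;DR: The paper examines how multimodal large language models (MLLMs) associate sound with meaning, using sound symbolism to probe how they process auditory information, and introduces the LEX-ICON dataset to support the study.

Details

Motivation: Sound symbolism refers to non-arbitrary associations between phonetic forms and meanings. The study asks whether MLLMs capture these associations, as a window into their cross-modal language processing.

Result: MLLMs' phonetic intuitions align with existing linguistic research, and their attention concentrates on iconic phonemes.

Insight: The study provides an empirical basis for work at the intersection of AI and cognitive linguistics, showing that MLLMs can capture sound symbolism and offering a new angle on cross-modal language understanding.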

Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.

[96] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts cs.CLPDF

Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Florian Boudin, Atsuhiro Takasu

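TL;DR: The paper studies how robust multimodal large language models (LLMs) are when verifying scientific claims against tables versus charts. Models perform better on tables than on charts, and smaller models (under 8B) show limited cross-modal generalization.

Details

Motivation: As the number of submitted papers grows, systems that help reviewers verify research claims are needed. Experimental results are often presented as tables or charts, but the robustness of current multimodal LLMs across these evidence formats is unclear.

Result: Models do better with table evidence and struggle with charts; humans perform well on both; smaller models (under 8B) generalize poorly across the two formats.

Insight: Chart understanding is a clear weakness of current multimodal LLMs and needs targeted improvement to support scientific claim verification.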

Abstract: With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models’ multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.

[97] ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks cs.CLPDF

Haroun Elleuch, Youssef Saidi, Salima Mdhaffar, Yannick Estève, Fethi Bougares

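TL;DR: The paper describes Elyadata and LIA's joint submission to the NADI 2025 multi-dialectal Arabic speech processing shared task, which ranked first in Spoken Arabic Dialect Identification (ADI) and second in multi-dialectal Arabic automatic speech recognition (ASR).

Details

Motivation: Arabic has many dialects, which makes speech processing challenging and calls for systems that can identify and transcribe multi-dialectal Arabic.

Result: 79.83% accuracy on the official ADI test set; average WER of 38.54% and CER of 14.53% on the ASR test set.

Insight: Task-specific data augmentation and per-dialect fine-tuning are key to improving Arabic speech processing performance.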

Abstract: This paper describes Elyadata & LIA’s joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of 79.83% on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of 38.54% and 14.53%, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.
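For readers unfamiliar with the two ASR metrics reported here, the sketch below computes word error rate (WER) and character error rate (CER) from the standard Levenshtein edit distance, normalized by reference length. It uses English toy strings and no text normalization; the shared task's official scoring script may normalize text differently, so treat this as the textbook definition only.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference item
                cur[j - 1] + 1,          # insert a hypothesis item
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
```

Both rates can exceed 100% when the hypothesis contains many insertions, which is why they are error rates rather than accuracies.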

[98] On the Military Applications of Large Language Models cs.CL | cs.AIPDF

Satu Johansson, Taneli Riihonen

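TL;DR: The paper explores military applications of large language models such as GPT, analyzing how their summarization and generation abilities support such applications and assessing the feasibility of building them on commercial cloud services such as Microsoft Azure.

Details

Motivation: With the rapid rise of generative pre-trained models such as ChatGPT, the authors explore their potential military applications, with a view to improving task efficiency and decision support.

Result: The summarization and generative properties of language models directly facilitate many military applications, and other features may find particular uses.

Insight: Commercial cloud services and existing language models already have the potential to support military applications, but their reliability and security need further validation.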

Abstract: In this paper, military use cases or applications and implementation thereof are considered for natural language processing and large language models, which have broken into fame with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about their potential military applications and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large and other features may find particular uses.

[99] Generalizing to Unseen Disaster Events: A Causal View cs.CL | cs.LGPDF

Philipp Seeberger, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer

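TL;DR: The paper takes a causal view of bias in disaster event classification and proposes a method that reduces event- and domain-related biases to improve generalization to future events.

Details

Motivation: Existing systems for processing disaster event data are exposed to event-related biases, which limits their generalization. Recent advances in causal learning and debiasing are promising but remain underexplored in the disaster event domain.

Result: The method outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.

Insight: A causal perspective offers a new way to handle disaster event data, reducing bias and improving generalization, particularly on emerging events.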

Abstract: Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.

[100] Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA cs.CLPDF

Yiran Zhang, Mingyang Lin, Mark Dras, Usman Naseem

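TL;DR: The paper presents VISTA, a web-based interactive visualization system for analyzing reasoning in multi-turn LLM tasks, with support for editing conversation context and automatically generating reasoning dependency trees.

Details

Motivation: LLM reasoning in multi-turn interactions is complex and lacks specialized visualization tools, which puts a high cognitive load on researchers; a tool is needed to make the analysis more transparent and simpler.

Result: VISTA significantly reduces the complexity of analyzing reasoning chains and supports a deeper understanding of LLM capabilities and limitations.

Insight: Visualization tools make it easier to inspect an LLM's reasoning process and expose its step-by-step logical path.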

Abstract: Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct “what-if” analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model’s step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.

[101] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL cs.CL | cs.DBPDF

Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang

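TL;DR: Text2SQL-Flow is a SQL-aware data augmentation framework that generates large-scale, high-quality Text-to-SQL pairs across six augmentation dimensions and is used to build the SQLFlow dataset, which improves model performance.

Details

Motivation: Text-to-SQL performance is limited by scarce, simplistic, and low-diversity data.

Result: SQLFlow contains 89,544 annotated examples; it improves open-source LLMs through fine-tuning and closed-source LLMs through retrieval, and the retrieval strategy outperforms existing methods.

Insight: High-quality structured data is critical for Text-to-SQL systems, and data diversity improves model generalization.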

Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow’s high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.

[102] EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models cs.CLPDF

Junquan Huang, Haotian Wu, Yubo Gao, Yibo Yan, Junyan Zhang

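TL;DR: EffiReason-Bench is a unified benchmark for evaluating efficient reasoning methods in large language models (LLMs), addressing the fragmentation of current evaluation practices.

Details

Motivation: LLMs with CoT prompting reason well but produce long explanations that raise cost and can reduce accuracy, and there is no unified framework for evaluating efficiency.

Result: No single method is best in every setting; the optimal strategy depends on backbone scale, task complexity, and architecture.

Insight: The effectiveness of efficient reasoning methods is highly scenario-dependent, so a unified benchmark and a stable metric are essential for comparing methods.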

Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.

[103] Rectify Evaluation Preference: Improving LLMs’ Critique on Math Reasoning via Perplexity-aware Reinforcement Learning cs.CLPDF

Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li

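TL;DR: The paper addresses the weak critiquing ability of large language models (LLMs) in multi-step mathematical reasoning (MsMR): it identifies an imbalanced evaluation preference in LLMs and proposes a perplexity-aware reinforcement learning algorithm to rectify it.

Details

Motivation: Existing methods rely on high-quality supervised fine-tuning to strengthen critiquing and overlook the underlying cause of poor performance. The paper finds that LLMs tend to judge lower-perplexity solutions as correct, which limits their critiquing ability.

Result: Experiments on the OPS benchmark and existing critique benchmarks show that the method improves LLMs' critiquing ability.

Insight: 1. LLMs exhibit a perplexity-driven bias when evaluating solutions. 2. Directly rectifying this preference improves critiquing performance. 3. Perplexity can serve as a useful signal to guide policy optimization in reinforcement learning.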

Abstract: To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason – imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs’ critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by itself and others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon – “LLMs incline to judge solutions with lower perplexity as correct”, which is dubbed as imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our built OPS and existing available critic benchmarks demonstrate the validity of our method.
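Since the analysis in this abstract hinges on perplexity, a short sketch of the standard definition may help: perplexity is the exponential of the negative mean token log-probability. The token log-probabilities below are made up, and the sketch does not reproduce how the paper feeds perplexity into Group Relative Policy Optimization, which the abstract does not detail.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities a model assigns to the tokens of two solutions.
fluent_solution = [math.log(0.5)] * 4     # every token has probability 0.5
surprising_solution = [math.log(0.25)] * 4  # every token has probability 0.25

ppl_fluent = perplexity(fluent_solution)
ppl_surprising = perplexity(surprising_solution)
print(ppl_fluent, ppl_surprising)

# The reported bias: the lower-perplexity solution tends to be judged correct,
# regardless of whether it actually is.
print("lower perplexity:", "fluent" if ppl_fluent < ppl_surprising else "surprising")
```

A lower value means the model finds the text more predictable, which is a statement about fluency under the model, not about mathematical correctness.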

[104] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages cs.CL | cs.AIPDF

Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya

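TL;DR: The paper systematically studies how to build large-scale synthetic pretraining data for Indic languages, presents the BhashaKritika dataset (540B tokens), and examines generation techniques, language choice, and evaluation methods.

Details

Motivation: Pretraining data is scarce for low-resource languages, so the benefits of LLMs are unevenly distributed across languages; synthetic data offers a way to address this.

Result: Experiments reveal key trade-offs among generation strategies and highlight best practices for building multilingual corpora.

Insight: Language choice, in both prompt instructions and document grounding, affects synthetic data quality, and the evaluation pipeline must handle diverse scripts and linguistic contexts.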

Abstract: In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
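One stage of the quality pipeline listed in this abstract, n-gram repetition analysis, can be sketched as the share of n-gram occurrences that belong to a repeated n-gram. The whitespace tokenization, the choice of n, and any cut-off applied to the score are assumptions; the paper's exact statistic is not given here, and the KenLM perplexity filter is not shown.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


looping = "a b c a b c a b c"
varied = "one two three four five"
print(repeated_ngram_fraction(looping))  # 1.0: every trigram recurs
print(repeated_ngram_fraction(varied))   # 0.0: no trigram recurs
```

A filter built on this would drop documents whose score exceeds a chosen threshold; degenerate generations that loop on a phrase score close to 1.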

[105] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates cs.CLPDF

Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp

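TL;DR: The paper proposes ATR4CH, a systematic methodology that combines LLMs with ontological engineering to generate knowledge graphs from cultural heritage texts, and validates it through a case study.

Details

Motivation: Cultural heritage texts contain rich knowledge that is hard to query systematically because converting unstructured text into structured knowledge graphs is difficult. The methodology aims to address this.

Result: The method reaches F1 0.96-0.99 for metadata extraction, F1 0.7-0.8 for entity recognition, F1 0.65-0.75 for hypothesis extraction, 0.95-0.97 for evidence extraction, and G-EVAL 0.62 for discourse representation; smaller models performed competitively.

Insight: ATR4CH offers a replicable framework adaptable across cultural heritage domains and institutional resources. Despite encouraging results, post-processing still requires human oversight, and the produced knowledge graph is currently limited to Wikipedia articles.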

Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts…), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

[106] Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning cs.CLPDF

Jason Chan, Zhixue Zhao, Robert Gaizauskas

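TL;DR: This position paper argues that evaluating the reasoning abilities of base large language models (LLMs) raises inherent methodological problems, pointing to an overlooked mismatch between the pretraining objective and the normative criteria, such as correctness, by which reasoning is assessed.

Details

Motivation: Existing work evaluates base LLMs' reasoning to uncover limitations or biases, but the authors argue that this practice overlooks fundamental methodological concerns.

Result: The paper argues that base LLMs' outputs cannot be taken as bona fide attempts at correct answers, because logically valid or invalid conclusions arise as coincidental byproducts of conforming to linguistic patterns of statistical plausibility.

Insight: The paper calls for a re-examination of the assumptions behind existing evaluations of LLM reasoning and asks future work to account for these methodological pitfalls.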

Abstract: Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs’ reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs’ pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs’ outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs’ reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.

[107] Reasoning About Intent for Ambiguous Requests cs.CL | cs.AI

Irina Saparina, Mirella Lapata

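TL;DR: To address the tendency of LLMs to commit to a single interpretation of an ambiguous request, this paper proposes producing multiple interpretation-answer pairs in one structured response generated in a single step, trained with reinforcement learning and customized reward functions to improve coverage and accuracy.

Details

Motivation: LLMs often implicitly pick one interpretation of an ambiguous request, which can frustrate users and create safety risks. To improve transparency and accuracy, the paper proposes generating multiple interpretation-answer pairs.

Result: Experiments show higher coverage of valid answers than baselines, and human evaluation confirms that the predicted interpretations are highly aligned with their answers.

Insight: Structured responses improve transparency and support downstream applications, while single-step generation keeps the approach efficient.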

Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
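
The coverage notion mentioned above can be illustrated with a small sketch. It assumes a structured response has already been parsed into (interpretation, answer) pairs and that coverage is the fraction of valid answers matched by at least one predicted answer. The function name `answer_coverage` and the case-insensitive exact match are illustrative assumptions, not the paper's metric or reward function.

```python
def answer_coverage(pairs, valid_answers):
    """Fraction of valid answers matched by at least one predicted answer."""
    if not valid_answers:
        return 0.0
    # Normalize for a case-insensitive exact match (an illustrative assumption).
    predicted = {answer.strip().lower() for _, answer in pairs}
    gold = {answer.strip().lower() for answer in valid_answers}
    return len(predicted & gold) / len(gold)


# One structured response: a list of (interpretation, answer) pairs.
response = [
    ("interpretation A", "answer A"),
    ("interpretation B", "answer B"),
]
gold_answers = ["answer A", "answer B"]

full = answer_coverage(response, gold_answers)         # both valid answers covered
partial = answer_coverage(response[:1], gold_answers)  # only one interpretation given
print(full, partial)
```

A response that commits to a single interpretation covers only part of the valid answers, which is the behavior a coverage-based reward would penalize.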

[108] Exploring State Tracking Capabilities of Large Language Models cs.CL

Kiamehr Rezaee, Jose Camacho-Collados, Mohammad Taher Pilehvar

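TL;DR: This paper studies how well large language models (LLMs) perform state tracking, proposing a benchmark and analysing the capabilities of different models.

Details

Motivation: State tracking is a complex task that requires a model to maintain the changing state of multiple entities. The study evaluates LLMs on this task and examines their limitations.

Result: Recent LLMs (such as GPT-4 and Llama3) track state well, especially when combined with Chain of Thought, whereas earlier-generation models often fail after a certain number of steps.

Insight: State tracking ability differs markedly between model generations, and techniques such as Chain of Thought can substantially improve performance.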

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.

[109] LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning cs.CL | cs.AI | cs.CY

Zihan Gao, Yifei Xu, Jacob Thebault-Spieker

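TL;DR: LocalBench is the first benchmark to evaluate large language models (LLMs) on county-level local knowledge and reasoning in the United States. Using 14,782 validated question-answer pairs covering 526 counties, it shows that current LLMs have significant limitations on local knowledge, especially on numerical reasoning and narrative-style questions.

Details

Motivation: Existing benchmarks fail to capture how well LLMs handle hyper-local knowledge (such as neighborhood dynamics, cultural narratives, and local governance), which matters for real-world applications such as civic platforms and community journalism.

Result: The best-performing models reach only 56.8% accuracy on narrative-style questions and stay below 15.5% on numerical reasoning. Web augmentation helps some models (Gemini gains 13.6%) but hurts others (the GPT series loses 11.4%).

Insight: Larger model size and web augmentation do not always improve performance, underscoring the need for AI systems that are equitable and place-aware.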

Abstract: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance; for example, search improves Gemini’s accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.

[110] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks cs.CL | cs.AI

Yunzhe Xu, Zhuosheng Zhang, Zhe Liu

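TL;DR: The paper proposes KPPO, a knowledge-provision-based prompt optimization framework for knowledge-intensive tasks that improves language model performance through systematic knowledge integration rather than elicitation of latent capabilities.

Details

Motivation: Existing prompt optimization methods focus on elicitation strategies, which are inherently limited on knowledge-intensive tasks because they cannot supply the required factual knowledge, terminological precision, and reasoning patterns.

Result: Across 15 knowledge-intensive benchmarks, KPPO improves average performance by about 6% over the strongest baseline with comparable or lower token consumption; its adaptive pruning reduces token usage by up to 29%.

Insight: Knowledge-intensive tasks call for going beyond conventional prompt optimization; systematically integrating domain knowledge improves performance more effectively.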

Abstract: While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models’ capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO’s superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.

[111] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following cs.CL

Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam

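TL;DR: The paper introduces AdvancedIF, a benchmark with over 1,600 complex instruction prompts, and RIFL, a post-training pipeline that uses rubric-based reward signals for reinforcement learning to substantially improve LLM instruction following.

Details

Motivation: Current LLMs fall short on complex, multi-turn, and system-level instruction following, and the lack of high-quality benchmarks and reliable reward signals hinders both training and evaluation.

Result: RIFL achieves a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Ablation studies confirm the effectiveness of each component.

Insight: Rubrics are useful both for training LLMs and for measuring their instruction-following ability.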

Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs’ ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
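
To make the idea of a rubric-derived reward concrete, here is a minimal sketch. It assumes a verifier has already produced one boolean verdict per rubric criterion; the abstract does not say how RIFL aggregates verdicts or shapes the reward, so both aggregation modes below (partial credit and all-or-nothing) and the name `rubric_reward` are hypothetical.

```python
def rubric_reward(verdicts, all_or_nothing=False):
    """Turn per-criterion verifier verdicts (booleans) into a scalar reward.

    Both aggregation modes are illustrative assumptions, not RIFL's method.
    """
    if not verdicts:
        return 0.0
    if all_or_nothing:
        # Reward only responses that satisfy every rubric criterion.
        return 1.0 if all(verdicts) else 0.0
    # Partial credit: fraction of criteria satisfied.
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for one response judged against four criteria.
verdicts = [True, True, False, True]
partial_credit = rubric_reward(verdicts)
strict = rubric_reward(verdicts, all_or_nothing=True)
print(partial_credit, strict)
```

The strict mode gives a sparser signal but cannot be satisfied by meeting only the easy criteria; which trade-off suits a given training setup is an open design choice.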

[112] LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025 cs.CL | cs.AI | physics.ed-ph

Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian

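TL;DR: LOCA-R is an improved logical chain augmentation framework for reasoning, applied to the complex problems of the Chinese Physics Olympiad, where it achieves a near-perfect score.

Details

Motivation: Olympiad-level physics problem solving is a major challenge for both humans and AI, requiring precise calculation, abstract reasoning, and a firm grasp of physical principles.

Result: On the CPhO 2025 theory examination, LOCA-R scores 313 out of 320 points, surpassing the highest-scoring human competitor and all baseline methods.

Insight: Logical chain augmentation works well on complex physics problems, illustrating the potential of AI on advanced reasoning tasks.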

Abstract: Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.

[113] Convomem Benchmark: Why Your First 150 Conversations Don’t Need RAG cs.CL

Egor Pakhomov, Erik Nijkamp, Caiming Xiong

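TL;DR: The paper presents Convomem Benchmark, a comprehensive benchmark for conversational memory evaluation, examines the relationship between conversational memory and retrieval-augmented generation (RAG), and finds that simple full-context approaches outperform RAG systems on small conversation histories.

Details

Motivation: Existing benchmarks are limited in statistical power, data generation consistency, and evaluation flexibility. This work addresses those gaps and explores how conversational memory differs from and relates to RAG.

Result: On histories of fewer than 150 conversations, simple full-context approaches reach 70-82% accuracy, while RAG-based memory systems reach only 30-45%; as the number of conversations grows, hybrid or RAG approaches become necessary.

Insight: The small-corpus advantage of conversational memory makes exhaustive search and complete reranking feasible, which calls for dedicated research rather than direct reuse of general RAG solutions.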

Abstract: We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns–temporal reasoning, implicit extraction, knowledge updates, and graph representations–memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory–where exhaustive search and complete reranking are feasible–deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
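
The transition points reported above (30 and 150 conversations) can be read as a simple routing rule. The sketch below encodes only those two thresholds; the function name and the strategy labels are illustrative, and a real system would presumably also weigh cost and latency budgets.

```python
def choose_memory_strategy(num_conversations):
    """Pick a memory strategy from the size of the conversation history.

    Thresholds follow the transition points reported in the abstract;
    the returned labels are hypothetical names for illustration.
    """
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= 30:
        return "full_context"                 # long context excels
    if num_conversations <= 150:
        return "full_context_with_tradeoffs"  # still viable, higher cost and latency
    return "hybrid_or_rag"                    # cost and latency become prohibitive


for n in (10, 30, 100, 150, 400):
    print(n, choose_memory_strategy(n))
```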

[114] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding cs.CL

Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin

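TL;DR: URaG is a framework that unifies retrieval and generation, exploiting the inherent evidence localization ability of MLLMs for efficient long document understanding.

Details

Motivation: Current multimodal large language models (MLLMs) face information interference and high computational cost on long documents, and existing methods do not balance efficiency against preserving fine-grained detail.

Result: Experiments show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.

Insight: MLLMs exhibit a human-like coarse-to-fine reasoning pattern, which can be used explicitly to unify retrieval and generation in one design.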

Abstract: Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.

[115] Evaluating Prompting Strategies with MedGemma for Medical Order Extraction cs.CL | cs.AI

Abhinand Balachandran, Bavana Durgapraveen, Gowsikkan Sikkan Sudhagar, Vidhya Varshany J S, Sriram Rajkumar

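TL;DR: The paper studies how well MedGemma extracts medical orders from doctor-patient conversations, comparing three prompting strategies (one-shot prompting, the ReAct framework, and a multi-step agentic workflow), and finds that simple one-shot prompting performs best on the validation set.

Details

Motivation: Accurate extraction of medical orders is critical for reducing clinical documentation burden and ensuring patient safety. The paper explores MedGemma under different prompting strategies to guide clinical information extraction.

Result: One-shot prompting performs best on the official validation set, while more complex reasoning approaches (ReAct and the multi-step agentic workflow) may introduce noise through “overthinking”.

Insight: On manually annotated transcripts, a simple prompting strategy can be more robust and efficient because complex reasoning chains tend to introduce noise; MedGemma shows promise on this task, but the choice of prompting strategy should reflect the characteristics of the data.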

Abstract: The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to “overthinking” and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.

[116] SSR: Socratic Self-Refine for Large Language Model Reasoning cs.CL | cs.AI | cs.LG

Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang

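TL;DR: The paper proposes Socratic Self-Refine (SSR), a framework that decomposes model responses at a fine granularity and verifies and refines them step by step to improve LLM reasoning; experiments show it outperforms existing methods on multiple benchmarks.

Details

Motivation: Existing LLM reasoning frameworks typically rely on coarse self-verification and self-correction, which limits their effectiveness on complex tasks; a finer-grained approach is needed to improve reasoning accuracy.

Result: SSR outperforms existing iterative self-refinement baselines across five reasoning benchmarks and three LLMs.

Insight: Beyond performance gains, SSR offers an interpretable black-box tool for analysing and understanding LLM reasoning processes.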

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
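
Step-level confidence estimation through re-solving can be sketched as follows. The sketch assumes each step is a (sub-question, sub-answer) pair and that the same sub-question has been re-solved several times independently; confidence is then the share of re-solves that agree with the original sub-answer. The function names, the exact-match comparison, and the toy steps are assumptions for illustration, not the SSR implementation.

```python
from collections import Counter


def step_confidence(sub_answer, resolved_answers):
    """Share of independent re-solves that agree with the original sub-answer."""
    if not resolved_answers:
        return 0.0
    counts = Counter(answer.strip().lower() for answer in resolved_answers)
    return counts[sub_answer.strip().lower()] / len(resolved_answers)


def least_reliable_step(steps):
    """Return (index, confidence) of the step with the lowest agreement.

    steps: list of (sub_question, sub_answer, resolved_answers) tuples.
    """
    scored = [
        (step_confidence(answer, resolved), index)
        for index, (_, answer, resolved) in enumerate(steps)
    ]
    confidence, index = min(scored)
    return index, confidence


# Toy reasoning chain; the re-solved answers stand in for sampled model outputs.
steps = [
    ("What is 12 * 4?", "48", ["48", "48", "48"]),
    ("What is 48 + 7?", "54", ["55", "55", "54"]),
]
index, confidence = least_reliable_step(steps)
print(index, round(confidence, 2))
```

The step with the lowest agreement is the natural candidate for refinement, after which later steps that depend on it would need to be re-derived.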

[117] Instella: Fully Open Language Models with Stellar Performance cs.CL | cs.AI | cs.LG

Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra

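TL;DR: Instella is a family of fully open three-billion-parameter language models, trained on openly available data and code, that is competitive with leading open models of similar size.

Details

Motivation: Most high-performing language models are closed-source or only partially open, which limits transparency and reproducibility. Instella aims to advance open research by fully releasing its models and data.

Result: At the three-billion-parameter scale, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size.

Insight: Fully open language models can achieve strong performance while promoting transparency and community research; the specialized variants (Instella-Long and Instella-Math) show how the model family can be diversified.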

Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

[118] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference cs.CL

Yesheng Liang, Haisheng Chen, Song Han, Zhijian Liu

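TL;DR: ParoQuant is a weight-only post-training quantization method based on pairwise rotation; it uses hardware-efficient independent Givens rotations together with channel-wise scaling to handle outliers and wide dynamic range in LLMs, improving inference efficiency and accuracy.

Details

Motivation: Weight-only post-training quantization (PTQ) of large language models (LLMs) reduces memory footprint and accelerates inference, but outliers in weights and activations often cause large quantization errors and accuracy loss, and the errors accumulate in multi-step reasoning. Existing methods either fail to suppress outliers sufficiently or add significant inference overhead.

Result: On reasoning tasks, ParoQuant improves accuracy by an average of 2.4% over AWQ with less than 10% overhead.

Insight: Combining rotations with channel-wise scaling suppresses the effect of outliers on quantization and improves reasoning accuracy while remaining hardware-efficient.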

Abstract: Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
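
The effect of a single Givens rotation on a pair of channels can be shown with plain Python. The sketch rotates two weight channels by a fixed angle and checks that the largest magnitude shrinks while the transform stays exactly invertible. The 45-degree angle and the toy weights are assumptions: in ParoQuant the rotations are described as optimizable and are combined with channel-wise scaling and a custom inference kernel, none of which is reproduced here.

```python
import math


def givens_rotate(chan_a, chan_b, theta):
    """Rotate each (a, b) pair of values from two channels by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    rot_a = [c * a + s * b for a, b in zip(chan_a, chan_b)]
    rot_b = [-s * a + c * b for a, b in zip(chan_a, chan_b)]
    return rot_a, rot_b


def max_abs(*channels):
    """Largest absolute value across the given channels."""
    return max(abs(value) for channel in channels for value in channel)


# Toy weights: one channel with outliers, one with small values.
chan_a = [8.0, -6.0]
chan_b = [0.5, 0.25]
theta = math.pi / 4  # fixed angle chosen for illustration only

rot_a, rot_b = givens_rotate(chan_a, chan_b, theta)
# Rotating by -theta undoes the transform, so no information is lost.
back_a, back_b = givens_rotate(rot_a, rot_b, -theta)

print(max_abs(chan_a, chan_b), max_abs(rot_a, rot_b))
```

Because the rotation spreads the outlier's energy across both channels, a quantization group covering them needs a smaller range, which is the intuition behind narrowing the dynamic range before quantizing.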

physics.chem-ph

[119] VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing physics.chem-ph | cs.AI | cs.CV

Peining Zhang, Jinbo Bi, Minghu Song

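TL;DR: VEDA is an SE(3)-equivariant framework that combines variance-exploding (VE) diffusion with annealing to efficiently generate accurate 3D molecular structures, addressing the trade-off in diffusion models between sampling speed and conformational accuracy.

Details

Motivation: Existing diffusion models for 3D molecular generation trade sampling efficiency against conformational accuracy; VEDA aims to resolve this by combining VE diffusion with an annealing strategy.

Result: On QM9 and GEOM-DRUGS, VEDA matches the sampling efficiency of flow-based models with only 100 sampling steps, and its generated structures have a median relaxation energy change of only 1.72 kcal/mol.

Insight: Combining VE diffusion with SE(3)-equivariant architectures can deliver both efficiency and accuracy in 3D molecular generation, pointing to a new direction for molecular design.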

Abstract: Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective, and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA’s generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
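
For readers unfamiliar with variance-exploding diffusion, the sketch below shows the textbook geometric noise schedule and the corresponding perturbation of atomic coordinates, in which noise is added without rescaling the signal. It is not VEDA's arcsin-based scheduler or preconditioning scheme, which the abstract does not specify; the values of `sigma_min` and `sigma_max` and the toy coordinates are assumptions.

```python
import random


def ve_sigmas(num_steps, sigma_min=0.01, sigma_max=50.0):
    """Geometric noise levels from sigma_min to sigma_max (standard VE form)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    ratio = sigma_max / sigma_min
    return [sigma_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def ve_perturb(coords, sigma, rng):
    """Add Gaussian noise of scale sigma to each coordinate; the signal is not rescaled."""
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in atom] for atom in coords]


sigmas = ve_sigmas(100)  # 100 levels, matching the step count quoted above
rng = random.Random(0)
coords = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]]  # toy two-atom geometry
noisy = ve_perturb(coords, sigmas[10], rng)

print(sigmas[0], sigmas[-1], len(noisy))
```

Sampling runs the schedule in reverse, from the largest noise level down to the smallest, which is where the analogy to simulated annealing comes from.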

cs.AI

[120] Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning cs.AI | cs.CL

Yuxuan Zhou, Yubin Wang, Bin Wang, Chen Ning, Xien Liu

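TL;DR: The paper proposes Multifaceted Self-Refinement (MuSeR), a learning approach that strengthens the context-awareness of large language models (LLMs) in the medical domain by having the model self-evaluate and refine its responses along three key facets: decision-making, communication, and safety.

Details

Motivation: Although LLMs perform well on several medical benchmarks, they still underperform in real-world medical scenarios, mainly because they lack awareness of contextual details such as user identity, medical history, and risk factors. MuSeR targets this problem.

Result: Experiments on HealthBench show that MuSeR significantly improves LLM performance, particularly on context-awareness. With knowledge distillation, a smaller backbone LLM (Qwen3-32B) surpasses its teacher model and sets a new state of the art among open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%).

Insight: 1) Self-refinement can effectively improve LLMs in real-world scenarios; 2) simulating diverse user contexts is important for robust performance; 3) knowledge distillation can amplify the benefits of the method and strengthen smaller models.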

Abstract: Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs’ context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model’s context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.

[121] ProgRAG: Hallucination-Resistant Progressive Retrieval and Reasoning over Knowledge Graphs cs.AI | cs.CL

Minbae Park, Hyemin Yang, Jeonghyun Kim, Kunsoo Park, Hyunjoon Kim

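TL;DR: ProgRAG is a multi-hop knowledge graph question answering framework that decomposes complex questions into sub-questions and progressively extends reasoning paths, using uncertainty-aware pruning to refine retrieved evidence and reorganizing the context, to improve reliability and reasoning quality.

Details

Motivation: Large language models (LLMs) reason well but suffer from hallucinations and limited transparency. Knowledge graphs (KGs) can enhance LLM reasoning, yet existing methods still face inaccurate retrieval and reasoning failures. ProgRAG addresses these through progressive retrieval and reasoning.

Result: Experiments on three well-known datasets show that ProgRAG outperforms existing baselines in multi-hop KGQA, with improved reliability and reasoning quality.

Insight: Progressive retrieval and optimized reasoning paths can mitigate retrieval failures and reasoning errors on complex questions; uncertainty-aware pruning is key to more reliable LLM reasoning.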

Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.

[122] FactGuard: Event-Centric and Commonsense-Guided Fake News Detection cs.AI | cs.CL

Jing He, Han Zhang, Yuanhui Xiao, Wei Guo, Shaowen Yao

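TL;DR: FactGuard is a fake news detection framework grounded in event content and commonsense; it uses large language models (LLMs) to extract event-centric content and improves robustness and practicality through a dynamic usability mechanism and knowledge distillation.

Details

Motivation: Style-based fake news detection is losing effectiveness as adversaries imitate the style of authentic news. LLMs have potential, but their real-world use for fake news detection is held back by shallow functionality exploration, ambiguous usability, and prohibitive inference costs.

Result: Experiments on two benchmark datasets show that FactGuard outperforms existing methods in both robustness and accuracy, addressing style sensitivity and LLM usability.

Insight: Event content and commonsense reasoning are key directions for fake news detection, and a dynamic usability mechanism for LLM advice can make detection more reliable.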

Abstract: Fake news detection methods based on writing style have achieved remarkable progress. However, as adversaries increasingly imitate the style of authentic news, the effectiveness of such approaches is gradually diminishing. Recent research has explored incorporating large language models (LLMs) to enhance fake news detection. Yet, despite their transformative potential, LLMs remain an untapped goldmine for fake news detection, with their real-world adoption hampered by shallow functionality exploration, ambiguous usability, and prohibitive inference costs. In this paper, we propose a novel fake news detection framework, dubbed FactGuard, that leverages LLMs to extract event-centric content, thereby reducing the impact of writing style on detection performance. Furthermore, our approach introduces a dynamic usability mechanism that identifies contradictions and ambiguous cases in factual reasoning, adaptively incorporating LLM advice to improve decision reliability. To ensure efficiency and practical deployment, we employ knowledge distillation to derive FactGuard-D, enabling the framework to operate effectively in cold-start and resource-constrained scenarios. Comprehensive experiments on two benchmark datasets demonstrate that our approach consistently outperforms existing methods in both robustness and accuracy, effectively addressing the challenges of style sensitivity and LLM usability in fake news detection.

[123] EgoEMS: A High-Fidelity Multimodal Egocentric Dataset for Cognitive Assistance in Emergency Medical Services cs.AI | cs.CV | cs.LG

Keshara Weerasinghe, Xueren Ge, Tessa Heick, Lahiru Nuwan Wijayasingha, Anthony Cortez

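TL;DR: The paper presents EgoEMS, the first high-fidelity, multimodal egocentric dataset aimed at supporting AI cognitive assistants for emergency medical services, with over 20 hours of data and multiple types of annotation.

Details

Motivation: In emergency medical services (EMS), first responders face high-stakes tasks with heavy cognitive load; AI cognitive assistants could ease this burden through real-time data collection and decision support.

Result: EgoEMS provides high-fidelity data and a suite of benchmarks (real-time multimodal keystep recognition and action quality estimation) to support the development and evaluation of AI tools for EMS.

Insight: The dataset’s realism and multimodality provide a foundation for real-time cognitive assistants, which may ultimately improve emergency response and patient outcomes.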

Abstract: Emergency Medical Services (EMS) are critical to patient survival in emergencies, but first responders often face intense cognitive demands in high-stakes situations. AI cognitive assistants, acting as virtual partners, have the potential to ease this burden by supporting real-time data collection and decision making. In pursuit of this vision, we introduce EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities from an egocentric view in 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system and is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. Emphasizing realism, the dataset includes responder-patient interactions reflecting real-world emergency dynamics. We also present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS. We hope EgoEMS inspires the research community to push the boundaries of intelligent EMS systems and ultimately contribute to improved patient outcomes.

[124] Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models cs.AI | cs.CV

Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng

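TL;DR: The paper proposes a problem generator that reasons explicitly and adapts difficulty to the solver, for training large reasoning models, improving the quality of synthesized data and generalizing across model types.

Details

Motivation: Existing data synthesis methods lack precise control over the direction and difficulty of generated problems, yielding low-value or shallow problems. The paper addresses this with explicit reasoning and solver-adaptive difficulty.

Result: Across 10 mathematical and general reasoning benchmarks, the method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Co-evolution of generator and solver yields a further 0.7% gain.

Insight: Explicit reasoning and adaptive difficulty are key to better data synthesis, and co-designing the generator and solver enables continued improvement.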

Abstract: Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver’s ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver’s ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver’s feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver’s competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available.
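
One way to picture a reward that favors problems near the edge of the solver's competence is a function of the solver's pass rate that peaks when the solver succeeds about half the time. The triangular shape below is purely an assumption for illustration; the abstract does not give the paper's actual reward, and `difficulty_reward` is a hypothetical name.

```python
def difficulty_reward(solver_outcomes):
    """Reward a synthetic problem by how close the solver's pass rate is to 0.5.

    solver_outcomes: list of booleans, one per solver attempt.
    The triangular shape is an illustrative assumption.
    """
    if not solver_outcomes:
        return 0.0
    pass_rate = sum(solver_outcomes) / len(solver_outcomes)
    # 1.0 at a 50% pass rate, falling to 0.0 when always or never solved.
    return 1.0 - abs(2.0 * pass_rate - 1.0)


too_easy = difficulty_reward([True, True, True, True])
too_hard = difficulty_reward([False, False, False, False])
borderline = difficulty_reward([True, False, True, False])
print(too_easy, too_hard, borderline)
```

Problems the solver always or never solves carry little training signal, so a generator optimized against such a reward is pushed toward problems of intermediate difficulty.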

[125] Querying Labeled Time Series Data with Scenario Programs cs.AI | cs.CV | cs.FL | cs.LGPDF

Edward Kim, Devan Shanker, Varun Bharadwaj, Hongbeen Park, Jinkyu Kim

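TL;DR: This paper formally defines how labeled time series data match a scenario program and presents a querying algorithm on that basis, making it more efficient to check whether failure scenarios found in simulation also occur in real-world data.

Details

Motivation: To bridge the gap between simulation-based testing and real systems (the sim-to-real gap), one must verify whether failure scenarios discovered in simulation are reproducible in the real world, which requires an efficient way to locate these scenarios in real data.

Result: Experiments show that the algorithm is more accurate and orders of magnitude faster at querying scenarios than state-of-the-art commercial vision large language models, and that it scales with the duration of the queried data.

Insight: With a formal definition and an efficient querying algorithm, the method provides a new tool for validating simulation-based testing and can make safety validation of autonomous driving systems more efficient.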

Abstract: Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the AV failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.

cs.CR [Back]

[126] Trapped by Their Own Light: Deployable and Stealth Retroreflective Patch Attacks on Traffic Sign Recognition Systems cs.CR | cs.CVPDF

Go Tsuruoka, Takami Sato, Qi Alfred Chen, Kazuki Nomoto, Ryunosuke Kobayashi

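TL;DR: This paper proposes a new adversarial attack, the Adversarial Retroreflective Patch (ARP), which uses retroreflective materials that activate only under the victim's headlight illumination to attack traffic sign recognition (TSR) systems with both high deployability and stealth. Using black-box optimization to maximize attack effectiveness, ARP reaches a success rate of at least 93.4% in dynamic scenarios and is stealthier than conventional patch attacks. The paper also proposes a defense, DPR Shield, with a defense success rate of at least 75% on specific signs.

Details

Motivation: TSR systems are critical to the safety of autonomous driving, but existing adversarial attacks (such as stickers or laser projections) are either visually detectable or constrained in implementation. The paper explores a new attack surface that uses retroreflective materials to mount stealthy, deployable attacks.

Result: 1. ARP achieves an attack success rate of ≥93.4% in dynamic scenarios at 35 meters; 2. the success rate against commercial TSR systems is ≥60%; 3. ARP is stealthier than previous patch attacks (stealthiness scores ≥1.9% higher); 4. DPR Shield achieves a defense success rate of ≥75% on specific signs.

Insight: 1. Retroreflective materials offer a new balance between stealth and effectiveness for adversarial attacks; 2. black-box optimization can be used to generate attacks for complex scenarios; 3. polarized filtering is an effective defense against this class of attack.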

Abstract: Traffic sign recognition (TSR) plays a critical role in ensuring safe and efficient transportation of autonomous vehicles but remains vulnerable to adversarial attacks using stickers or laser projections. While existing attack vectors demonstrate security concerns, they suffer from visual detectability or implementation constraints, suggesting unexplored vulnerability surfaces in TSR systems. We introduce the Adversarial Retroreflective Patch (ARP), a novel attack vector that combines the high deployability of patch attacks with the stealthiness of laser projections by utilizing retroreflective materials activated only under victim headlight illumination. We develop a retroreflection simulation method and employ black-box optimization to maximize attack effectiveness. ARP achieves $\geq$93.4% success rate in dynamic scenarios at 35 meters and $\geq$60% success rate against commercial TSR systems in real-world conditions. Our user study demonstrates that ARP attacks maintain near-identical stealthiness to benign signs while achieving $\geq$1.9% higher stealthiness scores than previous patch attacks. We propose the DPR Shield defense, employing strategically placed polarized filters, which achieves $\geq$75% defense success rates for stop signs and speed limit signs against micro-prism patches.

eess.AS [Back]

[127] Music Flamingo: Scaling Music Understanding in Audio Language Models eess.AS | cs.CLPDF

Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong

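TL;DR: Music Flamingo is a new large audio-language model designed to improve the understanding of music (including songs). By building the high-quality music dataset MF-Skills and using an improved training recipe, the model achieves state-of-the-art results on multiple music understanding and reasoning tasks.

Details

Motivation: Current audio-language models still understand music only at a shallow level, mainly because high-quality music data are scarce and hard to annotate. Music Flamingo aims to address this and move music understanding from surface-level recognition toward layered, human-like perception.

Result: Music Flamingo achieves state-of-the-art results on more than 10 music understanding and reasoning benchmarks, showing strong capability and generalization in the music domain.

Insight: High-quality datasets and training methods grounded in music theory are essential for improving the music understanding of audio-language models. Music Flamingo offers a new direction and reference point for future models.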

Abstract: We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model’s reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.

cs.MA [Back]

[128] Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance cs.MA | cs.AI | cs.CLPDF

Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng

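TL;DR: This paper studies the reliability of multi-agent systems (MAS) built on large language models (LLMs) from the perspective of Byzantine fault tolerance, examines where LLM agents have a reliability advantage, and proposes a new consensus mechanism, CP-WBFT. The method improves system stability through probe-weighted information flow transmission and performs well under extreme fault conditions.

Details

Motivation: LLM agents are now widely used in multi-agent systems, yet their reliability has not been studied thoroughly. Traditional agents handle erroneous message flows poorly, whereas LLM agents show stronger skepticism. This paper aims to quantify that reliability difference and propose an improvement.

Result: Experiments show that CP-WBFT performs well across different network topologies and, in particular, significantly outperforms traditional methods under extreme Byzantine conditions (85.7% fault rate). It maintains high accuracy and reliability on mathematical reasoning and safety assessment tasks.

Insight: The skepticism of LLM agents is important for improving the reliability of MAS. By combining the intrinsic properties of LLMs with a weighted consensus mechanism, CP-WBFT suggests a new approach to building more stable multi-agent systems.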

Abstract: Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored; in particular, it is unclear whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
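
To illustrate why confidence weighting can matter when most agents are faulty, here is a minimal sketch contrasting a plain majority vote with a confidence-weighted vote. It is not the CP-WBFT algorithm: the function names, the `(answer, confidence)` message format, and the idea of summing confidences per answer are all assumptions made for illustration. The example uses 6 faulty agents out of 7, which is one configuration that gives a fault rate of about 85.7%; the abstract does not state the number of agents.

```python
from collections import defaultdict


def majority_vote(messages):
    """Pick the answer reported by the most agents (confidence ignored)."""
    counts = defaultdict(int)
    for answer, _confidence in messages:
        counts[answer] += 1
    return max(counts, key=counts.get)


def weighted_consensus(messages):
    """Pick the answer with the largest summed confidence.

    `messages` is a list of (answer, confidence) pairs with confidence
    in [0, 1]. How confidences are obtained (for example from a probe)
    is outside this sketch.
    """
    totals = defaultdict(float)
    for answer, confidence in messages:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# One honest agent with high confidence, six faulty agents with low confidence.
messages = [("42", 0.95)] + [("17", 0.1)] * 6
print(majority_vote(messages))       # the faulty majority wins
print(weighted_consensus(messages))  # the confident honest answer wins
```

The weighted vote only helps if faulty messages actually receive low confidence, so the quality of the confidence estimate is the critical assumption in this sketch.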

cs.SI [Back]

Raj Gaurav Maurya, Vaibhav Shukla, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

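TL;DR: The paper proposes a large language model (LLM) based framework for simulating the spread of misinformation in social networks. By simulating users with different cognitive biases and ideologies, it quantifies and analyzes how misinformation propagates.

Details

Motivation: Misinformation on social networks often spreads through human cognitive biases and emotional factors, a process that traditional methods struggle to simulate precisely. The authors use LLMs to simulate user behavior and quantify how misinformation diffuses.

Result: Experiments show that ideology-driven personas accelerate the spread of misinformation in politics, marketing, and technology, while expert personas preserve factual stability. Interactions among heterogeneous personas rapidly escalate misinformation to propaganda-level distortion.

Insight: LLMs can not only mimic human biases but also serve as auditors that trace information fidelity, offering a new way to study and mitigate misinformation in digital ecosystems.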

Abstract: Misinformation on social media thrives on surprise, emotion, and identity-driven reasoning, often amplified through human cognitive biases. To investigate these mechanisms, we model large language model (LLM) personas as synthetic agents that mimic user-level biases, ideological alignments, and trust heuristics. Within this setup, we introduce an auditor–node framework to simulate and analyze how misinformation evolves as it circulates through networks of such agents. News articles are propagated across networks of persona-conditioned LLM nodes, each rewriting received content. A question–answering-based auditor then measures factual fidelity at every step, offering interpretable, claim-level tracking of misinformation drift. We formalize a misinformation index and a misinformation propagation rate to quantify factual degradation across homogeneous and heterogeneous branches of up to 30 sequential rewrites. Experiments with 21 personas across 10 domains reveal that identity- and ideology-based personas act as misinformation accelerators, especially in politics, marketing, and technology. By contrast, expert-driven personas preserve factual stability. Controlled-random branch simulations further show that once early distortions emerge, heterogeneous persona interactions rapidly escalate misinformation to propaganda-level distortion. Our taxonomy of misinformation severity – spanning factual errors, lies, and propaganda – connects observed drift to established theories in misinformation studies. These findings demonstrate the dual role of LLMs as both proxies for human-like biases and as auditors capable of tracing information fidelity. The proposed framework provides an interpretable, empirically grounded approach for studying, simulating, and mitigating misinformation diffusion in digital ecosystems.

cs.LG [Back]

[130] Probability-Biased Attention over Directed Bipartite Graphs for Long-Tail ICD Coding cs.LG | cs.AI | cs.CLPDF

Tianlei Chen, Yuxiao Chen, Yang Li, Feifei Wang

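TL;DR: This paper proposes a probability-biased attention method for ICD coding under a long-tail label distribution. By constructing a directed bipartite graph and injecting a co-occurrence probability bias, it significantly improves classification of rare codes.

Details

Motivation: ICD coding faces a very large label space and a long-tail distribution in which rare codes lack sufficient training data. The paper therefore models fine-grained co-occurrence relationships among codes to improve classification of rare codes.

Result: Experiments on three benchmark datasets show notable improvements in Macro-F1, the key metric for long-tail classification, and state-of-the-art performance.

Insight: Combining statistical co-occurrence relationships with external knowledge can substantially improve learning for long-tail labels, especially for classifying rare codes.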

Abstract: Automated International Classification of Diseases (ICD) coding aims to assign multiple disease codes to clinical documents, constituting a crucial multi-label text classification task in healthcare informatics. However, the task is challenging due to its large label space (10,000 to 20,000 codes) and long-tail distribution, where a few codes dominate while many rare codes lack sufficient training data. To address this, we propose a learning method that models fine-grained co-occurrence relationships among codes. Specifically, we construct a Directed Bipartite Graph Encoder with disjoint sets of common and rare code nodes. To facilitate a one-way information flow, edges are directed exclusively from common to rare codes. The nature of these connections is defined by a probability-based bias, which is derived from the conditional probability of a common code co-occurring given the presence of a rare code. This bias is then injected into the encoder’s attention module, a process we term Co-occurrence Encoding. This structure empowers the graph encoder to enrich rare code representations by aggregating latent comorbidity information reflected in the statistical co-occurrence of their common counterparts. To ensure high-quality input to the graph, we utilize a large language model (LLM) to generate comprehensive descriptions for codes, enriching initial embeddings with clinical context and comorbidity information, serving as external knowledge for the statistical co-occurrence relationships in the code system. Experiments on three automated ICD coding benchmark datasets demonstrate that our method achieves state-of-the-art performance with particularly notable improvements in Macro-F1, which is the key metric for long-tail classification.
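
The co-occurrence bias described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it estimates P(common code present | rare code present) from document label sets and adds that value to the attention logits of a rare-code query attending over common-code keys. The additive form of the bias, the function names, and the toy data are assumptions; the abstract only states that a probability-based bias is injected into the encoder's attention module.

```python
import math


def cooccurrence_bias(label_sets, rare_codes, common_codes):
    """Estimate P(common present | rare present) for each (rare, common) pair."""
    bias = {}
    for rare in rare_codes:
        docs_with_rare = [labels for labels in label_sets if rare in labels]
        for common in common_codes:
            if docs_with_rare:
                hits = sum(1 for labels in docs_with_rare if common in labels)
                bias[(rare, common)] = hits / len(docs_with_rare)
            else:
                bias[(rare, common)] = 0.0
    return bias


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias_row):
    """Single-query scaled dot-product attention with an additive bias per key.

    The edge direction (common -> rare) is reflected in that only the rare
    code issues the query, and only common codes supply keys and values.
    """
    scale = math.sqrt(len(query))
    logits = [
        sum(q * k for q, k in zip(query, key)) / scale + b
        for key, b in zip(keys, bias_row)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy label sets: "r1" is a rare code, "A" and "B" are common codes.
label_sets = [{"A", "r1"}, {"A", "B", "r1"}, {"B"}]
bias = cooccurrence_bias(label_sets, ["r1"], ["A", "B"])
bias_row = [bias[("r1", "A")], bias[("r1", "B")]]
output, weights = biased_attention(
    [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], bias_row
)
```

With a zero query the dot products vanish, so the attention weights in this toy case are determined entirely by the bias, and the common code that co-occurs more often with the rare code receives more weight.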

[131] OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models cs.LG | cs.CLPDF

Yuping Yan, Yuhan Xie, Yuanshuai Li, Yingchao Yu, Lingjuan Lyu

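TL;DR: This paper introduces OutSafe-Bench, a content safety benchmark for multimodal large language models (MLLMs) that covers multiple modalities, provides new evaluation metrics and methods, and exposes safety vulnerabilities in existing models.

Details

Motivation: As multimodal large language models are widely deployed, the unsafe content they may output (such as toxic language and biased imagery) raises concern. Current safety benchmarks fall short in both modality coverage and performance evaluation, so a more comprehensive evaluation tool is needed.

Result: Evaluating nine state-of-the-art MLLMs reveals substantial safety vulnerabilities, indicating that existing models need stronger safeguards.

Insight: Multimodal content safety evaluation must cover a broader range of risk categories and modalities, and must also address the bias of single-model judgments.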

Abstract: Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe content, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first and most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.

[132] AgentEvolver: Towards Efficient Self-Evolving Agent System cs.LG | cs.AI | cs.CLPDF

Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen

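TL;DR: AgentEvolver is an autonomous agent system based on large language models (LLMs) that achieves efficient self-evolution through three mechanisms: self-questioning, self-navigating, and self-attributing.

Details

Motivation: Existing autonomous agent systems depend on manually constructed datasets and reinforcement learning, which is costly and inefficient. AgentEvolver aims to address this by using the semantic understanding and reasoning capabilities of LLMs.

Result: Experiments show that AgentEvolver outperforms traditional reinforcement learning baselines in exploration efficiency, sample utilization, and adaptation speed.

Insight: Using the reasoning capabilities of LLMs can substantially reduce the development cost of agent systems and improve efficiency, suggesting a new direction for autonomous agents.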

Abstract: Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.

[133] Impact of Layer Norm on Memorization and Generalization in Transformers cs.LG | cs.AI | cs.CL | cs.CVPDF

Rishi Singhal, Jung-Eun Kim

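TL;DR: This study examines how Layer Normalization (LayerNorm) affects memorization and learning in Pre-LayerNorm and Post-LayerNorm transformers, finding that it stabilizes learning in Pre-LayerNorm models and affects memorization in Post-LayerNorm models.

Details

Motivation: LayerNorm is a core component of transformers, but its specific effect on memorization and learning across architectures is unclear, especially the difference between Pre-LayerNorm and Post-LayerNorm models.

Result: Removing LayerNorm parameters in Pre-LayerNorm models destabilizes learning and worsens memorization, whereas removing them in Post-LayerNorm models effectively reduces memorization and restores genuine labels.

Insight: LayerNorm in early layers has the largest effect on model behavior, and its role differs markedly between Pre- and Post-LayerNorm architectures, offering a new perspective for transformer design.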

Abstract: Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further identify that LayerNorm in early layers is more critical than in middle or later layers, and that its influence varies across Pre- and Post-LayerNorm models. We validate these findings on 13 models across 6 vision and language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
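
For readers less familiar with the two layouts compared above, the following minimal sketch shows where LayerNorm sits in each block. It uses the standard definitions (Post-LayerNorm normalizes after the residual addition, Pre-LayerNorm normalizes the sublayer input and leaves the residual path untouched) and plain Python lists instead of tensors. The toy `sublayer` and the function names are illustrative assumptions, not code from the paper; `gamma` and `beta` stand for LayerNorm's learnable scale and shift.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if gamma is None:
        gamma = [1.0] * n  # identity scale
    if beta is None:
        beta = [0.0] * n   # zero shift
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]


def pre_ln_block(x, sublayer):
    """Pre-LayerNorm: x + sublayer(LN(x)); the residual path is not normalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_ln_block(x, sublayer):
    """Post-LayerNorm: LN(x + sublayer(x)); the block output is normalized."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def sublayer(h):
    """Toy stand-in for an attention or feed-forward sublayer."""
    return [0.5 * v for v in h]


x = [1.0, 2.0, 3.0]
pre_out = pre_ln_block(x, sublayer)    # keeps the scale and mean of x on the residual path
post_out = post_ln_block(x, sublayer)  # zero mean, roughly unit variance
```

In the Pre-LayerNorm block the input passes through the residual connection unchanged, which is the usual explanation for its more stable gradient flow.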

[134] Towards Emotionally Intelligent and Responsible Reinforcement Learning cs.LG | cs.AI | cs.CL | cs.HC | cs.MAPDF

Garapati Keerthana, Manik Gupta

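TL;DR: The paper proposes a Responsible Reinforcement Learning (RRL) framework that brings emotional and ethical considerations into the decision process, addressing personalization systems that overlook them. Using a Constrained Markov Decision Process (CMDP) and a multi-objective reward function, RRL balances short-term behavioral engagement against long-term user well-being.

Details

Motivation: Current personalized decision systems typically rely on static rules or engagement-maximizing heuristics and overlook users' emotional context and ethical constraints, which can lead to insensitive or unsafe interventions.

Result: The RRL framework provides an emotionally intelligent and ethically safe approach to reinforcement learning, suited to human-centric domains such as behavioral health and education.

Insight: Combining affective computing with safe reinforcement learning can make personalization systems more ethically trustworthy and emotionally intelligent.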

Abstract: Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users’ emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
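
The CMDP formulation with Lagrangian regularization mentioned in the abstract can be sketched with three small functions. This is a generic illustration of the standard Lagrangian relaxation for constrained RL, not the paper's method: the weighted-sum reward, the weights, the notion of a scalar "safety cost", and all names below are assumptions made for the example.

```python
def scalarized_reward(engagement, wellbeing, w_engagement=0.5, w_wellbeing=0.5):
    """Combine short-term engagement and long-term well-being into one reward.

    A weighted sum is only one possible multi-objective form (assumed here).
    """
    return w_engagement * engagement + w_wellbeing * wellbeing


def lagrangian_reward(reward, safety_cost, multiplier):
    """Reward seen by the RL algorithm: task reward minus the priced safety cost."""
    return reward - multiplier * safety_cost


def update_multiplier(multiplier, average_cost, cost_budget, step_size=0.1):
    """Dual ascent on the multiplier.

    The multiplier grows while the average cost exceeds the budget and
    shrinks otherwise; it is clipped at zero so it stays non-negative.
    """
    return max(0.0, multiplier + step_size * (average_cost - cost_budget))


# One illustrative update: the policy currently violates the cost budget.
multiplier = 0.0
multiplier = update_multiplier(multiplier, average_cost=0.5, cost_budget=0.2)
shaped = lagrangian_reward(scalarized_reward(1.0, 0.4), safety_cost=0.5, multiplier=multiplier)
```

Any base algorithm such as DQN or PPO could then be trained on the shaped reward while the multiplier is updated between policy updates, which is the usual way this relaxation is applied.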

[135] How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders cs.LG | cs.CVPDF

Yiming Tang, Abhijeet Sinha, Dianbo Liu

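TL;DR: This paper proposes Matryoshka Transcoders, a new framework for automatically discovering and interpreting physical plausibility failure modes in generative models. Using multi-granularity sparse feature learning on intermediate representations from a physical plausibility classifier, it provides an in-depth analysis of where models fail to follow physical constraints.

Details

Motivation: Existing generative models produce realistic, instruction-following outputs but still show notable physical plausibility failures. These failures often escape existing evaluation methods, and no framework exists to identify and interpret them automatically, which hinders targeted improvement.

Result: The method outperforms existing approaches in feature relevance and feature accuracy, and it uncovers a variety of physical plausibility failure modes across eight state-of-the-art generative models, pointing to directions for model improvement.

Insight: By automatically identifying and interpreting physical plausibility errors, the method both reveals common failure modes of generative models and provides a practical tool for improving their physical plausibility.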

Abstract: Although recent generative models are remarkably capable of producing instruction-following and realistic outputs, they remain prone to notable physical plausibility failures. Though critical in applications, these physical plausibility errors often escape detection by existing evaluation methods. Furthermore, no framework exists for automatically identifying and interpreting specific physical error patterns in natural language, preventing targeted model improvements. We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models. Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple granularity levels. By training on intermediate representations from a physical plausibility classifier and leveraging large multimodal models for interpretation, our method identifies diverse physics-related failure modes without manual feature engineering, achieving superior feature relevance and feature accuracy compared to existing approaches. We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models. Our analysis of eight state-of-the-art generative models provides valuable insights into how these models fail to follow physical constraints, paving the way for further model improvements.

Table of Contents

cs.CV [Back]

[1] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation cs.CVPDF

[2] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild cs.CV | cs.LGPDF

[3] Classifying Phonotrauma Severity from Vocal Fold Images with Soft Ordinal Regression cs.CV | cs.LGPDF

[4] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control cs.CVPDF

[5] Density Estimation and Crowd Counting cs.CVPDF

[6] PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model cs.CV | cs.AI | cs.ROPDF

[7] Social LSTM with Dynamic Occupancy Modeling for Realistic Pedestrian Trajectory Prediction cs.CV | cs.AIPDF

[8] Soiling detection for Advanced Driver Assistance Systems cs.CV | cs.AIPDF

[9] Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation cs.CV | cs.AIPDF

[10] STORM: Segment, Track, and Object Re-Localization from a Single 3D Model cs.CVPDF

[11] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models cs.CV | cs.AI | cs.LGPDF

[12] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance cs.CV | cs.AIPDF

[13] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting cs.CVPDF

[14] CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage cs.CV | cs.AIPDF

[15] CORONA-Fields: Leveraging Foundation Models for Classification of Solar Wind Phenomena cs.CV | astro-ph.IM | astro-ph.SRPDF

[16] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies cs.CVPDF

[17] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection cs.CVPDF

[18] RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion cs.CVPDF

[19] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models cs.CVPDF

[20] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding cs.CVPDF

[21] Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers cs.CV | cs.AI | cs.CLPDF

[22] AdaptViG: Adaptive Vision GNN with Exponential Decay Gating cs.CV | cs.AI | cs.LGPDF

[23] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting cs.CVPDF

[24] Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment cs.CV | cs.AIPDF

[25] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching cs.CVPDF

[26] Equivariant Sampling for Improving Diffusion Model-based Image Restoration cs.CVPDF

[27] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models cs.CV | cs.AIPDF

[28] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation cs.CVPDF

[29] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers cs.CVPDF

[30] MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging cs.CV | cs.AIPDF

[31] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models cs.CVPDF

[32] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation cs.CV | cs.AIPDF

[33] DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection cs.CVPDF

[34] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection cs.CVPDF

[35] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples cs.CVPDF

[36] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance cs.CVPDF

[37] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? cs.CVPDF

[38] Multivariate Gaussian Representation Learning for Medical Action Evaluation cs.CV | cs.AIPDF

[39] Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification cs.CVPDF

[40] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System cs.CV | eess.SYPDF

[41] Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints cs.CVPDF

[42] GridPrune: From “Where to Look” to “What to Select” in Visual Token Pruning for MLLMs cs.CVPDF

[43] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition cs.CVPDF

[44] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models cs.CVPDF

[45] RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo cs.CVPDF

[46] Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction cs.CVPDF

[47] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation cs.CV | cs.AIPDF

[48] Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space cs.CVPDF

[49] Physically Interpretable Multi-Degradation Image Restoration via Deep Unfolding and Explainable Convolution cs.CVPDF

[50] CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection cs.CVPDF

[51] VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction cs.CV | cs.AI | cs.ROPDF

[52] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction cs.CVPDF

[53] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization cs.CVPDF

[54] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding cs.CVPDF

[55] FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment cs.CV | cs.AI | cs.HCPDF

[56] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis cs.CVPDF

[57] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning cs.CVPDF

[58] Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals cs.CV | cs.CLPDF

[59] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models cs.CV | cs.AIPDF

[60] Rethinking Visual Information Processing in Multimodal LLMs cs.CV | cs.AIPDF

[61] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification cs.CVPDF

[62] Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision cs.CV | cs.AIPDF

[63] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment cs.CVPDF

[64] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile cs.CV | cs.AIPDF

[65] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation cs.CV | cs.ROPDF

[66] SAMIRO: Spatial Attention Mutual Information Regularization with a Pre-trained Model as Oracle for Lane Detection cs.CVPDF

[67] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns cs.CV | cs.AIPDF

[68] LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components cs.CVPDF

[69] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation cs.CVPDF

[70] OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data cs.CV | cs.LGPDF

[71] SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers cs.CV | eess.IVPDF

[72] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation cs.CV | cs.ROPDF

[73] Dynamic Avatar-Scene Rendering from Human-centric Context cs.CVPDF

[74] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space cs.CV | cs.AIPDF

[75] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded cs.CVPDF

[76] From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis cs.CVPDF

[77] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models cs.CVPDF

[78] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling cs.CVPDF