cs.CV

[1] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment cs.CV

Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong

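TL;DR: UniFit is a universal virtual try-on framework driven by a multimodal large language model (MLLM). Its MLLM-Guided Semantic Alignment module (MGSA) narrows the semantic gap between text instructions and reference images, and a two-stage training strategy lets it learn complex tasks from limited data.

Details

Motivation: Existing virtual try-on methods handle diverse and complex tasks poorly, and a semantic gap remains between text instructions and images. UniFit aims to build a more general framework that addresses both problems.

Result: Experiments show that UniFit supports complex tasks such as multi-garment and model-to-model try-on and achieves state-of-the-art performance.

Insight: Introducing an MLLM effectively reduces the cross-modal semantic gap, and the two-stage training strategy makes efficient use of limited data for complex tasks.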

Abstract: Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.


[2] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3 cs.CV | cs.AI

Chengxi Zeng, Yuxuan Jiang, Aaron Zhang

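TL;DR: EfficientSAM3 uses Progressive Hierarchical Distillation (PHD) to transfer SAM3's capabilities into a family of lightweight models, enabling efficient video concept segmentation with a good balance between performance and efficiency.

Details

Motivation: SAM3 performs well at Promptable Concept Segmentation (PCS) on images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) is too computationally expensive for use on mobile devices.

Result: Benchmarks on popular VOS datasets, with comparisons against related work, show strong performance-efficiency trade-offs.

Insight: Progressive hierarchical distillation effectively transfers the capabilities of a complex teacher model to lightweight student models, which suits efficient vision tasks on mobile devices.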

Abstract: The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets and compare with a variety of related work, achieving strong performance-efficiency trade-offs.


[3] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion cs.CV | cs.AI | cs.LG

Sajjad Pakdamansavoji, Yintao Ma, Amir Rasouli, Tongtong Cao

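TL;DR: The paper proposes extensions to model-based 6D pose estimation that target occlusion: dynamic non-uniform dense sampling, multi-hypothesis inference, iterative refinement, and occlusion-focused training augmentations. Together they significantly improve both the accuracy and the speed of pose estimation for unseen objects.

Details

Motivation: Existing methods perform poorly at 6D pose estimation under occlusion, in particular because early errors in multi-stage pipelines propagate to later stages. The paper aims to improve robustness and generalization under occlusion.

Result: Accuracy improves by more than 5% on ICBIN and more than 2% on BOP, with roughly 3x faster inference.

Insight: Focusing on visible regions and retaining multiple hypotheses mitigates occlusion-induced errors while improving generalization and efficiency.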

Abstract: Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.


[4] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation cs.CV

Lukas Arzoumanidis, Julius Knechtel, Jan-Henrik Haunert, Youness Dehbi

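TL;DR: The paper proposes an automated synthetic-data generation method to address the scarcity of training data for historical map segmentation. It transfers the style of original historical maps onto vector data and applies either a deep generative approach or stochastic degradation to produce synthetic data with visual uncertainty, which significantly improves semantic segmentation performance.

Details

Motivation: Historical map segmentation suffers from scarce training data, especially for homogeneous maps from a specific domain. Manual annotation is costly, and existing synthetic data often lacks realism and diversity.

Result: The generated synthetic data significantly improves semantic segmentation performance, demonstrating its effectiveness against data scarcity.

Insight: The affinity (realism) and diversity (variation) of synthetic data are critical to a model's generalization, especially when data is scarce.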

Abstract: The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.


[5] Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes cs.CV | cs.AI | cs.LG

Yintao Ma, Sajjad Pakdamansavoji, Amir Rasouli, Tongtong Cao

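TL;DR: Box6D is a 6D pose estimation method for storage boxes in warehouse environments. From a single RGB-D observation it quickly infers box dimensions and estimates pose with a category-level CAD template, which substantially lowers computational cost and improves efficiency and accuracy.

Details

Motivation: Existing 6D pose estimation methods fall short in either flexibility or accuracy, which limits their practicality in industrial settings. Box6D aims to provide an efficient and accurate solution for storage boxes in warehouses.

Result: On real-world warehouse scenarios and public benchmarks, Box6D delivers competitive or superior 6D pose accuracy while cutting inference time by about 76%.

Insight: Box6D suggests that tailoring a method to a specific industrial setting, for example by exploiting environment priors and object-category information, can substantially improve the practicality and efficiency of 6D pose estimation.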

Abstract: Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain: model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; model-free methods that rely on a few reference images or videos are more flexible but often fail under challenging conditions; category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.
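
The abstract mentions inferring box dimensions via a fast binary search but gives no details. The following is a minimal sketch of that general idea only, not the authors' algorithm: it assumes a hypothetical, monotone plausibility check `fits(size)` that is true while a scaled box template is still consistent with the depth observation. The function name, the bounds, and the toy check are all illustrative assumptions.

```python
from typing import Callable


def binary_search_dimension(fits: Callable[[float], bool],
                            lo: float, hi: float, tol: float = 1e-3) -> float:
    """Return (within tol) the largest size in [lo, hi] for which fits(size) is True.

    Assumes fits is monotone: True for small sizes and False once the
    template grows beyond what the observation supports, with fits(lo)
    True and fits(hi) False.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fits(mid):
            lo = mid  # template still plausible: try a larger box
        else:
            hi = mid  # template too large: shrink the upper bound
    return lo


# Toy stand-in for a depth-based plausibility check (made-up true length).
true_length = 0.42
estimate = binary_search_dimension(lambda s: s <= true_length, 0.1, 1.0)
print(f"estimated length: {estimate:.3f} m")
```

Each iteration halves the search interval, so the number of plausibility checks grows only logarithmically with the required precision.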


[6] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification cs.CV

Meilong Xu, Di Fu, Jiaxing Zhang, Gong Yu, Jiayu Zheng

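TL;DR: RB-FT improves VLM performance on video classification through a two-stage self-improvement paradigm (generate textual rationales, then fine-tune on them), significantly outperforming direct supervised fine-tuning.

Details

Motivation: Current VLMs perform poorly on domain-specific video classification, mainly because of a "rationale gap" caused by limited data: the semantic distance between complex spatio-temporal content and abstract class labels is hard to bridge.

Result: Experiments on multiple datasets show that the method significantly outperforms direct supervised fine-tuning, validating the effectiveness of self-generated rationales.

Insight: Self-generated rationales can serve as an intermediate supervision signal that bridges the semantic gap between a VLM and a domain-specific task.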

Abstract: Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical "rationale gap", where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model’s pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.


[7] Boosting Medical Visual Understanding From Multi-Granular Language Learning cs.CV

Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan

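TL;DR: The paper proposes Multi-Granular Language Learning (MGLL), a framework that improves multi-label and cross-granularity alignment for medical imaging and outperforms existing methods.

Details

Motivation: Existing CLIP methods have limited effectiveness in medical imaging because they focus on single-label, single-granularity alignment, whereas medical images often carry multiple labels annotated at multiple granularities.

Result: After pretraining on large-scale datasets, MGLL outperforms existing methods on multiple downstream tasks.

Insight: Introducing multi-granularity alignment and multi-label supervision significantly improves medical image understanding.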

Abstract: Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
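
To make the idea of soft-label supervision with a KL term concrete, here is a small sketch of a KL divergence between a multi-label target distribution and the softmax over image-text similarities. It is an illustration under stated assumptions, not MGLL's actual loss: the function names, the uniform weighting of positive labels, and the `eps` smoothing constant are all hypothetical choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kl(similarities, label_weights, eps=1e-8):
    """KL(target || predicted) for one image against a set of text labels.

    similarities: image-text similarity logits, one per label.
    label_weights: non-negative weights; several labels may be positive.
    eps: small constant that keeps the logarithm finite (an assumption here).
    """
    total = sum(label_weights)
    target = [w / total for w in label_weights]
    pred = softmax(similarities)
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, pred) if t > 0)


# An image with two positive labels out of three candidates.
aligned = soft_label_kl([0.0, 0.0, -100.0], [1, 1, 0])
misaligned = soft_label_kl([5.0, 0.0, 0.0], [1, 1, 0])
print(aligned, misaligned)
```

Unlike a one-hot contrastive target, the soft target spreads probability mass over every positive label, so the loss is near zero only when all of them score highly.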


[8] Automated Interpretable 2D Video Extraction from 3D Echocardiography cs.CV

Milos Vukadinovic, Hirotaka Ieki, Yuki Sahasi, David Ouyang, Bryan He

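TL;DR: The paper proposes a method that automatically extracts standard 2D views from 3D echocardiography by combining a deep learning classifier with anatomical heuristics. It reaches 96% accuracy, and AI models confirm that the extracted videos are effective for detecting cardiac abnormalities and generating clinical measurements.

Details

Motivation: Conventional echocardiography relies on 2D views. 3D ultrasound provides more complete information, but clinicians are more familiar with the 2D format. An automatic method for converting 3D data into standard 2D views is therefore needed to combine the advantages of 3D ultrasound with clinical practice.

Result: In a blinded evaluation of 1,600 videos from two hospitals, accuracy reached 96%. The extracted 2D videos were validated for cardiac abnormality detection and clinical measurement using the EchoPrime, PanEcho, and EchoNet-Measurement models.

Insight: The method shows how automation can bridge the gap between 3D ultrasound and the clinical need for 2D views while preserving the integrity and diagnostic features of the 3D data.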

Abstract: Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at https://github.com/echonet/3d-echo.


[9] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click cs.CV

Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath

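TL;DR: Click2Graph is an interactive framework for panoptic video scene graph generation that builds a scene graph from a single user cue, such as a click or bounding box. It combines visual prompting with spatial, temporal, and semantic understanding, and introduces a Dynamic Interaction Discovery Module and a Semantic Classification Head for user-guided, controllable video scene understanding.

Details

Motivation: Existing Video Scene Graph Generation (VSGG) systems are typically closed feed-forward pipelines that cannot incorporate human guidance, while promptable segmentation models such as SAM2 support precise user interaction but lack semantic or relational reasoning. Click2Graph aims to combine the strengths of both to provide interactive video scene graph generation.

Result: On the OpenPVSG benchmark, Click2Graph establishes a strong foundation for user-guided PVSG and enables controllable, interpretable video scene understanding.

Insight: Combining user interaction with automated reasoning can markedly improve the flexibility and accuracy of video scene understanding, and Click2Graph offers a workable framework for doing so.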

Abstract: State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.


[10] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer cs.CV

Muyao Yuan, Yuanhong Zhang, Weizhan Zhang, Lan Ma, Yuan Gao

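TL;DR: InfoCLIP uses information-theoretic alignment transfer to address the degradation of modality alignment when CLIP is fine-tuned for open-vocabulary semantic segmentation, improving segmentation performance.

Details

Motivation: Existing methods that fine-tune CLIP for segmentation on a limited set of seen categories tend to overfit and damage the pretrained multimodal alignment.

Result: Across a range of benchmarks, InfoCLIP significantly improves open-vocabulary semantic segmentation, validating its adaptability and superiority.

Insight: An information-theoretic perspective offers a new approach to transferring multimodal alignment and applies to knowledge transfer between heterogeneous tasks.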

Abstract: Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.


[11] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation cs.CV | cs.AI

Jingru Zhang, Saed Moradi, Ashirbani Saha

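TL;DR: The paper proposes a new multi-task learning method that applies consistency regularization through differentiable BI-RADS features to improve generalization in breast ultrasound tumor segmentation.

Details

Motivation: Multi-task learning suffers from task interference, which can leave jointly trained models worse than single-task baselines and limits generalization. The authors address this with consistency regularization, applied to breast ultrasound tumor segmentation.

Result: The method is validated on three external datasets (UDIAT, BUSI, BUS-UCLM), with substantially higher Dice coefficients for segmentation (for example, 0.81 vs 0.59 on UDIAT).

Insight: Consistency regularization combined with differentiable BI-RADS features mitigates task interference in multi-task learning and improves generalization.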

Abstract: Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and limit generalization. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p<0.001) improvements in generalization on the segmentation task for the proposed multi-task approach vs. the baseline: UDIAT, BUSI, BUS-UCLM (Dice coefficient = 0.81 vs 0.59, 0.66 vs 0.56, 0.69 vs 0.49, respectively). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.
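
For readers unfamiliar with the Dice coefficient reported above, the following sketch computes the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are choices made for this example, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flat binary masks given as lists of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Toy 2x2 masks, flattened: one of two predicted pixels overlaps the target.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
print(dice_coefficient(pred_mask, true_mask))  # 0.5
```

A Dice score of 1.0 means the predicted and reference tumor regions coincide exactly, and 0.0 means they do not overlap at all.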


[12] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition cs.CV

Xinyu Nan, Lingtao Mao, Huangyu Dai, Zexin Zheng, Xinyu Sun

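TL;DR: UniDGF is a unified detection-to-generation framework for hierarchical object visual recognition. It uses detection-guided generation to predict hierarchical category and attribute tokens, and significantly outperforms existing similarity-based pipelines and multi-stage classification systems.

Details

Motivation: Existing methods rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. A unified framework is needed that handles object detection, category prediction, and attribute recognition together.

Result: The method performs strongly on large-scale e-commerce datasets and open-source datasets, significantly outperforming existing methods, particularly in fine-grained recognition and unified inference.

Insight: A unified framework that combines detection with generation captures hierarchical semantic information better and suits visual recognition in complex scenarios.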

Abstract: Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.


[13] Fairness in Multi-modal Medical Diagnosis with Demonstration Selection cs.CV | cs.CY | cs.LG

Dawei Li, Zijian Gu, Peng Wang, Chuhan Song, Zhen Tan

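TL;DR: The paper studies the fairness of multimodal large language models (MLLMs) in medical image reasoning and proposes FADS, a lightweight, tuning-free method that uses clustering-based sampling to build demographically balanced demonstrations, reducing gender-, race-, and ethnicity-related disparities.

Details

Motivation: Existing debiasing methods usually depend on large labeled datasets or fine-tuning, which is impractical for foundation-scale models. The study explores in-context learning (ICL) as a lightweight, tuning-free alternative for improving fairness in multimodal medical diagnosis.

Result: Across multiple medical imaging benchmarks, FADS significantly reduces gender-, race-, and ethnicity-related disparities while maintaining high accuracy.

Insight: In-context learning can serve as an efficient, data-efficient solution that offers a scalable path toward fair medical image reasoning.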

Abstract: Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
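
The notion of demonstrations that are both demographically balanced and semantically relevant can be sketched as follows. This is a deliberately simplified stand-in, not the FADS algorithm: it replaces the clustering-based sampling described in the abstract with a per-group nearest-neighbour ranking followed by a round-robin pick, and the record fields (`id`, `group`, `embedding`) and function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def balanced_demonstrations(query, pool, k):
    """Pick up to k demonstrations, cycling over demographic groups.

    Within each group, candidates are ranked by similarity to the query,
    so every group contributes its most relevant examples first.
    """
    by_group = defaultdict(list)
    for item in pool:
        by_group[item["group"]].append(item)
    for items in by_group.values():
        items.sort(key=lambda it: cosine(query, it["embedding"]), reverse=True)

    selected, rank = [], 0
    groups = sorted(by_group)
    while len(selected) < k:
        picked_any = False
        for g in groups:
            if rank < len(by_group[g]) and len(selected) < k:
                selected.append(by_group[g][rank])
                picked_any = True
        if not picked_any:  # pool exhausted before reaching k
            break
        rank += 1
    return selected


pool = [
    {"id": "a1", "group": "A", "embedding": [1.0, 0.0]},
    {"id": "a2", "group": "A", "embedding": [0.9, 0.1]},
    {"id": "a3", "group": "A", "embedding": [0.8, 0.2]},
    {"id": "b1", "group": "B", "embedding": [0.0, 1.0]},
]
demos = balanced_demonstrations([1.0, 0.0], pool, k=2)
print([d["id"] for d in demos])  # ['a1', 'b1']
```

A purely similarity-based selector would return `a1` and `a2` here, both from group A, which is the kind of demographic imbalance the abstract attributes to conventional demonstration selection.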


[14] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection cs.CV

Nimeshika Udayangani, Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie

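TL;DR: The paper proposes a graph-based method that exploits inter-sample relationships to improve out-of-distribution (OOD) detection on long-tailed datasets. It refines the feature space with Gaussianization and graph convolutional networks (GCNs), significantly lowering the false positive rate (FPR) and raising tail-class accuracy.

Details

Motivation: On long-tailed datasets, existing OOD detection methods perform poorly on tail classes, with high false positive rates and low tail-class classification accuracy. A method that makes effective use of inter-sample information is needed to improve OOD detection.

Result: Experiments on three benchmarks (CIFAR10-LT, CIFAR100-LT, and ImageNet-LT) show that the method significantly outperforms existing methods in FPR and tail-class classification accuracy.

Insight: Exploiting inter-sample relationships together with Gaussianization and GCNs effectively addresses OOD detection on long-tailed datasets, particularly the performance bottleneck on tail classes.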

Abstract: Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.
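
The abstract reports FPR without stating the operating point. OOD detection work commonly measures the false positive rate at a fixed true positive rate, often 95%, so the sketch below assumes that convention. It also assumes that higher scores mean "more in-distribution" and that ID samples are the positive class. The function name and the toy scores are illustrative.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when a fraction tpr of ID samples is kept.

    The threshold is the lowest score among the top ceil(tpr * n) ID samples;
    any OOD sample scoring at or above it counts as a false positive.
    """
    sorted_id = sorted(id_scores, reverse=True)
    n_keep = math.ceil(tpr * len(sorted_id))
    threshold = sorted_id[n_keep - 1]
    false_positives = sum(s >= threshold for s in ood_scores)
    return false_positives / len(ood_scores)


# Toy confidence scores; a lower operating point keeps the arithmetic obvious.
id_scores = [0.9, 0.8, 0.3, 0.2]
ood_scores = [0.85, 0.1, 0.4, 0.95]
print(fpr_at_tpr(id_scores, ood_scores, tpr=0.5))  # 0.5
```

With `tpr=0.5` the threshold is 0.8, and two of the four OOD scores (0.85 and 0.95) exceed it, giving an FPR of 0.5.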


[15] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion cs.CV | cs.AI

Dingkun Zhou, Patrick P. K. Chan, Hengxu Wu, Shikang Zheng, Ruiqi Huang

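TL;DR: The paper proposes a sequence-level optimization framework that generates natural, printable adversarial textures for clothing and hats, keeping the wearer concealed from human detectors across long video sequences in both digital and physical settings.

Details

Motivation: Existing adversarial attacks are mostly optimized per frame and struggle to maintain concealment over long video sequences. The paper addresses this and provides a more realistic threat model.

Result: The adversarial textures show stable concealment in digital and physical settings, with strong robustness to viewpoint changes and strong cross-model transferability. Physically printed garments reliably suppress detection across multiple scenes.

Insight: Sequence-level optimization and physical simulation are effective for adversarial attacks on long videos, and keeping colors printable is key to real-world use.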

Abstract: Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
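
An expectation-over-transformation objective with temporal weighting can be read as a weighted average, over frames, of the detector confidence averaged across sampled transformations. The sketch below only evaluates such an objective on precomputed confidences. The weighting scheme, the function name, and the input layout are assumptions, and the rendering, detection, and optimization steps described in the abstract are out of scope here.

```python
def sequence_objective(confidences, frame_weights):
    """Temporally weighted expectation of detection confidence.

    confidences[t][k]: detector confidence for frame t under the k-th
    sampled transformation (viewpoint, lighting, cloth deformation, ...).
    frame_weights[t]: non-negative importance of frame t.
    Lower values mean the wearer is detected less confidently overall.
    """
    if len(confidences) != len(frame_weights):
        raise ValueError("one weight per frame is required")
    total_weight = sum(frame_weights)
    loss = 0.0
    for frame_conf, weight in zip(confidences, frame_weights):
        # Monte Carlo estimate of the expectation over transformations.
        expected_conf = sum(frame_conf) / len(frame_conf)
        loss += weight * expected_conf
    return loss / total_weight


# Two frames, two sampled transformations each; the second frame counts 3x.
value = sequence_objective([[0.2, 0.4], [0.6, 0.8]], [1.0, 3.0])
print(round(value, 3))  # 0.6
```

Minimizing a quantity of this shape pushes confidence down across the whole sequence, rather than on individual frames in isolation.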


[16] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution cs.CV

Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu

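TL;DR: The paper proposes Mixture-of-Ranks (MoR), a method based on a sparse Mixture-of-Experts (MoE) architecture for one-step real-world image super-resolution (Real-ISR). It improves performance through fine-grained expert partitioning and dynamic routing.

Details

Motivation: Existing Real-ISR methods mainly rely on pretrained diffusion models with LoRA modules, and cannot adaptively capture the heterogeneous characteristics of complex degraded samples or share knowledge under an equivalent compute budget. The authors therefore explore integrating sparse MoE into Real-ISR.

Result: Experiments show that the method is efficient and effective on Real-ISR, reaching state-of-the-art performance.

Insight: With a sparse MoE architecture and dynamic routing, MoR captures sample heterogeneity and shares knowledge efficiently, offering a lightweight and flexible solution for Real-ISR.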

Abstract: The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) modules to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework’s effectiveness and state-of-the-art performance.
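
Treating each LoRA rank as an expert follows from writing the low-rank update as a sum of rank-one terms, ΔW x = Σ_i b_i (a_i · x), and then gating which terms are active. The sketch below illustrates that decomposition with always-on shared ranks and hard top-k routing over the remaining ranks. It is a simplified illustration, not the paper's architecture: the router scores are supplied by hand, gates are binary, and the degradation estimator, zero-expert slots, and load-balancing loss are omitted. All names are hypothetical.

```python
def mixture_of_ranks(x, A, B, router_scores, n_shared, top_k):
    """Apply a LoRA-style update in which each rank is a separate expert.

    x: input vector of length d_in.
    A: list of r down-projection rows, each of length d_in.
    B: list of r up-projection columns, each of length d_out.
    router_scores: one score per rank; the first n_shared ranks ignore it.
    Returns the update vector and the sorted indices of active ranks.
    """
    r = len(A)
    routed = range(n_shared, r)
    chosen = sorted(routed, key=lambda i: router_scores[i], reverse=True)[:top_k]
    active = sorted(list(range(n_shared)) + chosen)

    d_out = len(B[0])
    out = [0.0] * d_out
    for i in active:
        coeff = sum(a * xi for a, xi in zip(A[i], x))  # a_i . x
        for j in range(d_out):
            out[j] += coeff * B[i][j]                  # add b_i * (a_i . x)
    return out, active


x = [1.0, 2.0]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
update, active_ranks = mixture_of_ranks(
    x, A, B, router_scores=[0.0, 0.1, 0.9], n_shared=1, top_k=1)
print(update, active_ranks)  # [4.0, 3.0] [0, 2]
```

Rank 0 is shared and always contributes, while the router picks rank 2 over rank 1, so only two of the three rank-one terms are computed for this input.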


[17] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning cs.CV | cs.AI | cs.LG | cs.RO

Mohamed Abdallah Salem, Hamdy Ahmed Ashur, Ahmed Elshinnawy

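TL;DR: The paper proposes a deep-learning material classification technique based on laser speckle patterns for real-time monitoring and control of laser cutting. It addresses the classification failures of earlier methods when the laser color changes and achieves high accuracy in experiments.

Details

Motivation: Dust and smoke produced during laser cutting threaten the environment and workers' health, and conventional material classification methods perform poorly when the laser color changes, so a more robust classification technique is needed.

Result: The model reaches 98.30% accuracy on the training set and 96.88% on the validation set, and an F1-score of 0.9643 on a new set of 3,000 images covering 30 materials.

Insight: Deep learning handles the classification problems caused by changes in laser color, which suggests that speckle-sensing-based material classification has considerable potential in industrial applications.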

Abstract: Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers’ health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real time. This paper proposes a material classification technique using a speckle pattern of the material’s surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30% on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.


[18] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis cs.CV

Zijian Wu, Mingfeng Jiang, Zidian Lin, Ying Song, Hanjie Ma

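TL;DR: CuriGS is a curriculum-guided framework that uses 3D Gaussian Splatting (3DGS) to address the core challenge of sparse-view synthesis, improving rendering fidelity and geometric consistency by introducing pseudo-views and a progressive training strategy.

Details

Motivation: 3DGS performs well at real-time scene reconstruction and rendering, but in sparse-view settings it overfits easily because of insufficient supervision and limited viewpoint coverage. CuriGS addresses this with pseudo-views and curriculum learning.

Result: Experiments show that CuriGS outperforms existing baselines in rendering fidelity and geometric consistency on both synthetic and real sparse-view scenes.

Insight: With pseudo-views and curriculum learning, CuriGS alleviates insufficient supervision in sparse-view settings and offers a new approach to applying 3DGS to sparse scenes.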

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation levels, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/
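
The curriculum described above, which unlocks larger perturbation levels over time and samples student views around a teacher pose, can be sketched in a few lines. This is a toy illustration under strong assumptions: a pose is reduced to a camera position (real poses also include rotation), levels unlock at fixed step intervals, and perturbations are uniform. None of the names or constants come from the paper, and the regularization, scoring, and promotion steps are not modeled.

```python
import random


def active_level(step, steps_per_level, n_levels):
    """Index of the highest perturbation level unlocked at a training step."""
    return min(step // steps_per_level, n_levels - 1)


def sample_student(teacher_position, level, level_scales, rng):
    """Sample a pseudo-view position near the teacher at the given level.

    level_scales[level] bounds the per-axis offset, so higher levels
    produce student views farther from the ground-truth pose.
    """
    scale = level_scales[level]
    return [p + rng.uniform(-scale, scale) for p in teacher_position]


rng = random.Random(0)                 # fixed seed for reproducibility
level_scales = [0.01, 0.05, 0.10]      # hypothetical offsets in scene units
teacher = [0.0, 1.0, 2.0]

for step in (0, 150, 900):
    level = active_level(step, steps_per_level=100, n_levels=len(level_scales))
    student = sample_student(teacher, level, level_scales, rng)
    print(step, level, [round(v, 3) for v in student])
```

Early steps only see students that stay close to the teacher, and the allowed deviation grows as training progresses.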


[19] Crossmodal learning for Crop Canopy Trait Estimation cs.CVPDF

Timilehin T. Ayanlade, Anirudha Powadi, Talukder Z. Jubery, Baskar Ganapathysubramanian, Soumik Sarkar

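TL;DR: The paper proposes a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Experiments show that the generated UAV-like images outperform real satellite imagery on downstream tasks.

Details

Motivation: Modern agriculture needs efficient crop monitoring. Satellite imagery is limited by spatial resolution and struggles to meet the needs of micro-plot management, while UAV data is fine-grained but limited in coverage. The paper aims to combine the strengths of both to improve agricultural monitoring.

Result: The generated UAV-level images outperform real satellite imagery on downstream tasks such as yield and nitrogen prediction, supporting the effectiveness of the method.

Insight: Cross-modal learning can bridge the gap between satellite and UAV data, providing an efficient solution for agricultural monitoring.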

Abstract: Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.


[20] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets cs.CVPDF

Qing Wang, Chong-Wah Ngo, Ee-Peng Lim, Qianru Sun

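TL;DR: The paper proposes an LLM-based framework that generates food titles and ingredient text and maps text and images into a shared embedding space, addressing long-tailed distributions, domain adaptation, and fine-grained classification in food recognition.

Details

Motivation: Training food recognition models is challenging because training data crawled from the Internet exhibits a domain shift relative to images actually taken by users. In addition, food datasets are usually long-tailed, and dishes from different categories may differ only subtly in appearance.

Result: On two food datasets, the method outperforms existing approaches tailored to long-tailed distributions, domain adaptation, and fine-grained classification.

Insight: LLM-generated text can compensate for the limitations of visual features, especially in long-tailed and fine-grained classification; multimodal alignment further improves robustness and performance.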

Abstract: Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, real-world food datasets tend to follow long-tailed distributions, and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.


[21] VideoSeg-R1:Reasoning Video Object Segmentation via Reinforcement Learning cs.CVPDF

Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li

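TL;DR: VideoSeg-R1 is the first framework to introduce reinforcement learning into video reasoning segmentation. Through a staged pipeline of hierarchical text-guided frame sampling, a reasoning model, and segmentation propagation, it significantly improves performance and efficiency on complex video segmentation tasks.

Details

Motivation: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization and provides no explicit reasoning. VideoSeg-R1 aims to address these problems with reinforcement learning and achieve more efficient video segmentation.

Result: Across multiple benchmarks, VideoSeg-R1 achieves state-of-the-art performance on complex video reasoning and segmentation tasks.

Insight: Applying reinforcement learning to video segmentation can improve generalization and reasoning efficiency, and a task difficulty-aware mechanism can adjust reasoning complexity dynamically, further improving performance.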

Abstract: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.


[22] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments cs.CVPDF

Renxiang Xiao, Wei Liu, Yuanfan Zhang, Yushuai Chen, Jinming Chen

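TL;DR: Rad-GS is a SLAM system that combines 4D radar with a camera and uses 3D Gaussians as a differentiable spatial representation for large-scale outdoor reconstruction. It uses radar point clouds and Doppler information to mask dynamic objects, improving rendering and localization accuracy.

Details

Motivation: Conventional camera- or LiDAR-based SLAM faces problems with dynamic object handling, memory consumption, and texture consistency in large-scale outdoor environments. Rad-GS addresses them through radar-vision fusion.

Result: Rad-GS performs well in large-scale outdoor environments, with performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs.

Insight: 4D mmWave radar shows considerable potential for Gaussian SLAM and is suitable for large-scale scene reconstruction.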

Abstract: We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.


[23] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs cs.CV | cs.AIPDF

Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu

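TL;DR: T2T-VICL proposes a cross-task visual in-context learning approach that uses implicit text-driven VLMs to handle in-context learning across different visual tasks.

Details

Motivation: Existing visual in-context learning (VICL) mainly focuses on same-task settings, while the potential of cross-task VICL remains underexplored. This work aims to unlock the boundaries of VLMs in this area.

Result: The method achieves top-tier results in nine cross-task scenarios and second-tier results in ten additional scenarios, supporting its effectiveness.

Insight: Implicit text prompts can bridge the gap between different visual tasks, offering a new approach to cross-task VICL.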

Abstract: In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.


[24] Clustered Error Correction with Grouped 4D Gaussian Splatting cs.CV | cs.GRPDF

Taeho Kang, Jaeyeon Park, Kyungjin Lee, Youngki Lee

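TL;DR: The paper proposes an improved 4D Gaussian Splatting (4DGS) method that uses error clustering and grouped splatting to address ambiguous pixel correspondences and inadequate densification in dynamic regions during dynamic scene reconstruction, significantly improving temporal consistency and rendering quality.

Details

Motivation: Existing 4DGS methods suffer from ambiguous pixel correspondences and inadequate densification in dynamic regions when reconstructing dynamic scenes. The authors aim to improve reconstruction accuracy and temporal consistency.

Result: Evaluations on the Neural 3D Video and Technicolor datasets show significantly improved temporal consistency and rendering quality, with a PSNR gain of 0.39 dB on the Technicolor Light Field dataset.

Insight: Classifying errors in dynamic regions and applying targeted corrections improves reconstruction accuracy, and grouped splatting improves the consistency between splats and the dynamic objects they represent.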

Abstract: Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving PSNR by 0.39 dB on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method’s capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.


[25] Decoupling Complexity from Scale in Latent Diffusion Model cs.CVPDF

Tianxiong Zhong, Xingye Tian, Xuebo Wang, Boyuan Jiang, Xin Tao

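TL;DR: DCS-LDM proposes a new latent diffusion paradigm that decouples information complexity from scale. A hierarchical, scale-independent latent space supports multi-scale generation and enables a flexible computation-quality tradeoff within a fixed latent representation.

Details

Motivation: Existing latent diffusion models typically couple scale with content complexity, so higher-resolution or higher-frame-rate generation requires more latent tokens. In practice, however, the required latent capacity depends primarily on content complexity rather than scale.

Result: Experiments show that DCS-LDM performs comparably to SOTA methods while supporting flexible generation across scales and visual qualities.

Insight: Latent capacity requirements are determined mainly by content complexity, with scale serving only as an upper bound. Decoupling the two yields a more efficient generation framework.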

Abstract: Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.


[26] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation cs.CVPDF

Chenyang Wu, Jiayi Fu, Chun-Le Guo, Shuhao Han, Chongyi Li

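TL;DR: VTinker proposes a new video frame interpolation (VFI) method that uses guided flow upsampling (GFU) and Texture Mapping to address blur and pixel misalignment in high-resolution video interpolation.

Details

Motivation: Challenges in high-resolution VFI include large pixel motion and high computational cost. The conventional approach of low-resolution motion estimation followed by simple upsampling tends to cause blur and misalignment.

Result: Experiments show that VTinker achieves state-of-the-art performance in VFI.

Insight: Using input-frame information to guide flow upsampling and texture mapping improves the quality and detail preservation of high-resolution video interpolation.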

Abstract: Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows’ edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurring details in bilinear upsampling flows, which makes flows’ edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Codes are available at: https://github.com/Wucy0519/VTinker.
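
To make the blur problem concrete, the sketch below implements the baseline the abstract criticizes, plain bilinear upsampling of a low-resolution flow field, not VTinker's GFU. The function name, the half-pixel sampling convention, and the toy flow are assumptions chosen for illustration. The flow vectors are multiplied by the scale factor because displacements measured in low-resolution pixels grow with the image size.

```python
def bilinear_upsample_flow(flow, scale):
    """Upsample a low-resolution flow field by an integer factor.

    flow: list of rows, each a list of (dx, dy) tuples in low-res pixels.
    Returns a field `scale` times larger whose vectors are expressed in
    high-resolution pixels (hence the multiplication by `scale`).
    """
    h, w = len(flow), len(flow[0])
    out = []
    for Y in range(h * scale):
        # Map the output pixel centre back to low-res coordinates, clamped.
        y = min(max((Y + 0.5) / scale - 0.5, 0.0), h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        wy = y - y0
        row = []
        for X in range(w * scale):
            x = min(max((X + 0.5) / scale - 0.5, 0.0), w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            wx = x - x0
            vec = []
            for c in range(2):
                top = (1 - wx) * flow[y0][x0][c] + wx * flow[y0][x1][c]
                bot = (1 - wx) * flow[y1][x0][c] + wx * flow[y1][x1][c]
                vec.append(scale * ((1 - wy) * top + wy * bot))
            row.append(tuple(vec))
        out.append(row)
    return out

# Toy field: the top row moves 1 px to the right, the bottom row is static.
low = [[(1.0, 0.0), (1.0, 0.0)],
       [(0.0, 0.0), (0.0, 0.0)]]
high = bilinear_upsample_flow(low, 2)
dx_column = [row[0][0] for row in high]  # [2.0, 1.5, 0.5, 0.0]
```

The sharp motion boundary in the low-resolution field becomes a ramp (2.0, 1.5, 0.5, 0.0) after upsampling, which is the kind of blurred flow edge that frame-guided upsampling is meant to sharpen.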


[27] How Noise Benefits AI-generated Image Detection cs.CVPDF

Jiazhen Yan, Ziqiang Li, Fan Wang, Kai Zeng, Zhangjie Fu

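TL;DR: The paper proposes PiN-CLIP, which injects noise constructed under a positive-incentive principle into the feature space to improve the generalization of AI-generated image detection, achieving the best performance on a dataset of synthetic images from 42 different generative models.

Details

Motivation: The rapid progress of generative models has made real and synthetic images hard to distinguish. Despite extensive work on detecting AI-generated images, out-of-distribution generalization remains a challenge. The paper traces this to spurious shortcuts exploited during training and proposes mitigating it with small feature-space perturbations.

Result: Experiments on a dataset of synthetic images from 42 different generative models show that PiN-CLIP sets a new state of the art, improving average accuracy by 5.4 points over existing methods.

Insight: Noise can be an effective tool: introducing controllable perturbations in the feature space improves a model's ability to detect AI-generated images, offering a new approach to out-of-distribution generalization.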

Abstract: The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.


[28] Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video cs.CVPDF

Li Yu, Yingbo Zhao, Shiyu Wu, Siyue Yu, Moncef Gabbouj

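TL;DR: The paper proposes a new blind video quality enhancement method that uses a pretrained degradation representation learning module with multiscale information extraction to address the lack of spatial detail in existing methods, and introduces a hierarchical termination mechanism that dynamically adjusts computation.

Details

Motivation: Existing video quality enhancement methods rely on known quantization parameters (QPs), but QPs may be unknown in practice, which limits applicability. Current blind methods capture only global degradation information, lack spatial detail, and do not adapt computation to the compression level.

Result: The PSNR improvement rises by 110% (from 0.31 dB to 0.65 dB) over a competing blind method at QP = 22, and the average inference time at QP = 22 is half that at QP = 42.

Insight: Degradation representations should incorporate multiscale information to capture spatial detail, and computation should be allocated flexibly according to the compression level.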

Abstract: Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
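
The "110%" figure is a relative change between two PSNR improvements, not an absolute PSNR value. The short check below reproduces it from the numbers in the abstract; the function name is hypothetical, and reading 0.31 dB and 0.65 dB as gains over the compressed input is an assumption based on the abstract's wording.

```python
def relative_gain_percent(baseline_db, proposed_db):
    """Relative increase of one PSNR improvement over another, in percent."""
    return 100.0 * (proposed_db - baseline_db) / baseline_db

# Values quoted in the abstract (assumed to be PSNR gains in dB at QP = 22).
absolute_gap_db = 0.65 - 0.31                 # about 0.34 dB
gain = relative_gain_percent(0.31, 0.65)      # about 109.7, reported as 110%
```

In absolute terms the proposed method adds roughly 0.34 dB over the competing blind method.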


[29] Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions cs.CV | cs.CLPDF

Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu

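TL;DR: The paper exposes the shortcomings of current multimodal large language models (MLLMs) in assessing deception in complex social interactions, introduces a new task, MIDA, with an accompanying dataset, and uses a benchmark to quantify the performance gap. The authors propose the SoCoT pipeline and the DSEM module, which demonstrate potential for improvement.

Details

Motivation: Although existing MLLMs have strong reasoning capabilities, they lack a key part of human social competence: recognizing deception and reading the context of complex social interactions. This limits their use in real social settings.

Result: Experiments show that even highly capable models such as GPT-4o struggle to reliably distinguish truth from falsehood. The proposed SoCoT and DSEM components improve performance.

Insight: MLLMs lack the ability to model human cognition and intent in multimodal social settings. Future work should study how to strengthen their social reasoning.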

Abstract: Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to 'read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.


[30] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval cs.CVPDF

Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu

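TL;DR: The paper proposes Reasoning Guided Embeddings (RGE), which explicitly incorporates the reasoning process into embedding extraction to improve multimodal representation quality, showing that exploiting reasoning ability can markedly improve multimodal retrieval.

Details

Motivation: Multimodal embeddings are widely used in tasks such as multimodal retrieval, but existing methods treat embedding extraction as a direct encoding step, overlooking how the generative and reasoning capabilities of multimodal large language models (MLLMs) could improve representation quality.

Result: On the MMEB benchmark, RGE improves multimodal retrieval performance by 4.9% over the non-reasoning baseline.

Insight: Explicitly using the reasoning ability of MLLMs can significantly improve multimodal embedding quality, indicating that a generative model's reasoning process is valuable for representation learning.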

Abstract: Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
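
The abstract says only that RGE couples rationale generation with "contrastive training". As a rough illustration of what a contrastive objective over query and target embeddings looks like, the sketch below computes an InfoNCE loss with in-batch negatives. The choice of InfoNCE, the temperature value, and the toy vectors are assumptions, not details from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i],
    and all other targets in the batch act as negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings: in `aligned`, each query matches its own target.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query is closest to its own target and large when the pairs are mismatched. In RGE the embeddings would be extracted after the model has generated its rationale, which this sketch does not model.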


[31] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers cs.CVPDF

Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen

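TL;DR: PPCL is a flexible structured pruning framework for DiT architectures. It identifies redundant layers through linear probing combined with trend analysis of similarity metrics, and proposes a plug-and-play teacher-student alternating distillation scheme, achieving a 50% parameter reduction with less than 3% degradation in key metrics.

Details

Motivation: DiTs excel at image generation, but their large parameter counts incur high computational cost and hinder deployment in resource-constrained settings. An efficient pruning method is needed to reduce the computational overhead.

Result: On multi-modal DiT architectures, PPCL achieves a 50% parameter reduction with less than 3% degradation in key metrics while maintaining high-quality image generation.

Insight: PPCL's novelty lies in flexibly identifying redundant layers and retaining knowledge through distillation. It suits resource-constrained environments, and experiments show it preserves performance at high compression ratios.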

Abstract: Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.


[32] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning cs.CVPDF

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang

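TL;DR: Video2Layout proposes a framework for reconstructing metric-grounded spatial layouts, using continuous object boundary coordinates to quantify inter-object distances and object sizes, which improves the spatial reasoning of multimodal large language models.

Details

Motivation: Existing grid-based cognitive map methods rely on discretized raster representations, which limit fine-grained spatial reasoning. The paper addresses this with a continuous metric space.

Result: On QVS-Bench and mainstream spatial reasoning benchmarks, the V2LO-7B model achieves an average improvement of 4.92% over the model trained on grid maps.

Insight: A continuous metric-space representation helps resolve the ambiguity of describing spatial relationships in natural language and improves quantitative reasoning.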

Abstract: Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model’s ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model’s real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.


[33] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion cs.CVPDF

Lirui Zhang, Zhengkai Zhao, Zhi Zuo, Pan Gao, Jie Qin

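TL;DR: Simba proposes a new point cloud completion framework that reformulates point-wise transformation regression as a distribution learning problem. By combining symmetry priors with the generative capabilities of diffusion models, it avoids overfitting and improves robustness to noise, achieving SOTA performance on multiple benchmarks.

Details

Motivation: In point cloud completion, existing regression-based methods are prone to overfitting and sensitive to input noise, leading to poor generalization. Simba aims to solve these problems with diffusion models and distribution learning while preserving geometric detail and global structural consistency.

Result: Simba achieves SOTA performance on the PCN, ShapeNet, and KITTI benchmarks, demonstrating its strengths in high-fidelity completion and geometric consistency.

Insight: Distribution learning with diffusion models can overcome the overfitting and noise sensitivity of conventional regression methods while preserving detail and global structure.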

Abstract: Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting, tending to memorize instance-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression leads to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method’s state-of-the-art (SOTA) performance.


[34] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding cs.CV | cs.AI | cs.CLPDF

Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo

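TL;DR: TimeViper is a vision-language model built on a hybrid Mamba-Transformer architecture for efficient long video understanding. Its TransV module compresses vision token information, enabling the model to process very long videos (more than 10,000 frames).

Details

Motivation: Long video understanding requires efficiently processing large amounts of temporal information, and existing models struggle to balance efficiency and expressivity. TimeViper combines the efficiency of state-space models with the expressivity of attention mechanisms to address this.

Result: TimeViper can process hour-long videos exceeding 10,000 frames and is competitive with state-of-the-art models on multiple benchmarks.

Insight: The hybrid architecture combines efficiency with expressivity, the TransV module reduces vision token redundancy, and the work offers new directions for developing future hybrid models.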

Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.


[35] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction cs.CV | cs.CLPDF

Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo

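TL;DR: SurvAgent is a hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. By integrating pathology images and gene data and drawing on experiential learning from historical cases, it improves the transparency and accuracy of survival prediction.

Details

Motivation: Existing survival analysis methods lack the transparency needed for clinical use and fall short in integrating multimodal data, effectively exploring regions of interest, and leveraging experiential learning from historical cases. SurvAgent aims to address these issues.

Result: Experiments on five TCGA cohorts show that SurvAgent outperforms conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

Insight: SurvAgent's success stems from its hierarchical CoT design and multimodal data integration, and it highlights the importance of experiential learning in multi-agent systems.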

Abstract: Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent’s superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.


[36] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs cs.CVPDF

Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou

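TL;DR: The paper proposes a novel verbose-text induction attack (VTIA) that injects imperceptible adversarial perturbations through a two-stage framework to maximize the output token length of vision-language models (VLMs), improving the stability and controllability of the attack.

Details

Motivation: With the success of VLMs on multimodal tasks, deployment efficiency has become an increasingly prominent concern, and the number of tokens consumed during generation has emerged as a key evaluation metric. Existing methods only implicitly prolong output by delaying the EOS token and lack stability and controllability, so a more direct approach is needed.

Result: Experiments on four popular VLMs show that VTIA has significant advantages in effectiveness, efficiency, and generalization.

Insight: Directly optimizing output token length, rather than manipulating it implicitly, significantly improves attack stability and controllability and offers a new perspective on VLM deployment efficiency.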

Abstract: With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image’s visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.


[37] EvoVLA: Self-Evolving Vision-Language-Action Model cs.CVPDF

Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang

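TL;DR: EvoVLA is a self-supervised vision-language-action (VLA) framework that tackles stage hallucination in multi-stage tasks with three components: stage-aligned reward, pose-based object exploration, and long-horizon memory. It significantly improves task success rate and sample efficiency.

Details

Motivation: In long-horizon robotic manipulation, existing VLA models suffer from stage hallucination: the agent exploits coarse evaluation signals to shortcut multi-step tasks, so the tasks are not truly completed.

Result: On the Discoverse-L benchmark, EvoVLA raises task success by 10.2 percentage points to 69.2%, improves sample efficiency by 1.5x, and reduces stage hallucination by 23.7 percentage points. In real-robot experiments, its success rate exceeds the baseline by 11 percentage points.

Insight: Through self-supervised learning and stage-alignment mechanisms, EvoVLA significantly improves generalization and real-world transfer on long-horizon tasks.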

Abstract: Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
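
The abstract says SAR uses triplet contrastive learning with hard negatives. As a rough illustration only, the standard margin-based triplet loss can be sketched as below; the Euclidean distance, the `margin` value, and the function names are assumptions for this sketch and are not taken from the EvoVLA code.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Generic hinge-style triplet loss (assumed form, not EvoVLA's exact loss).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


# An easy negative (far away) gives zero loss; a hard negative (closer to
# the anchor than the positive) gives a positive loss.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

Hard negatives matter here because easy negatives already satisfy the margin and contribute no gradient.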


[38] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation cs.CV | cs.AI | cs.CLPDF

Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen

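TL;DR: This paper proposes Thinking-while-Generating (TwiG), a new framework that interleaves textual reasoning throughout the visual generation process, enabling more dynamic multimodal interaction and semantically richer outputs.

Details

Motivation: Existing visual generation methods apply textual reasoning either before generation (pre-planning) or after it (post-refinement) and lack dynamic interaction during generation itself. TwiG aims to fill this gap.

Result: The TwiG framework produces more context-aware and semantically rich visual outputs.

Insight: Interleaved textual reasoning can dynamically guide the generation process and points to a new direction for visual generation research.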

Abstract: Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.


[39] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective cs.CVPDF

Jiahao Li, Yang Lu, Yachao Zhang, Yong Xie, Fangyong Wang

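TL;DR: The paper proposes RF-CLIP, a training-free method that redistributes CLIP's attention to improve its dense prediction performance in open-vocabulary semantic segmentation (OVSS), reaching state-of-the-art results on multiple benchmarks.

Details

Motivation: Existing methods that leverage CLIP's vision-language alignment have rarely examined, from an interpretability perspective, the performance boundaries of CLIP for dense prediction, in particular its attention distraction problem.

Result: State-of-the-art performance on eight benchmarks while maintaining high inference efficiency.

Insight: CLIP's attention distraction stems from dimension-specific over-activation; redistributing attention significantly improves its dense prediction capability.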

Abstract: Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP’s vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP’s internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP’s dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP’s multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.


[40] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight cs.CV | cs.AIPDF

Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang

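TL;DR: The paper proposes Mantis, a new Vision-Language-Action (VLA) framework that uses Disentangled Visual Foresight (DVF) to relieve the backbone's burden and strengthen language supervision, significantly improving performance and generalization.

Details

Motivation: Current VLA models that predict high-dimensional visual states suffer from diluted model capacity and high training cost, and their neglect of language supervision leads to poor comprehension and reasoning.

Result: Mantis achieves a 96.7% success rate on the LIBERO benchmark, surpassing existing methods. In real-world evaluations, it outperforms $π_{0.5}$ in instruction following, generalization, and reasoning.

Insight: Decoupling visual foresight prediction from the backbone reduces the backbone's burden while strengthening language supervision, which significantly improves model performance.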

Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.


[41] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction cs.CVPDF

Deniz Sayin Mercadier, Hieu Le, Yihong Chen, Jiancheng Yang, Udaranga Wickramasinghe

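TL;DR: PrIntMesh is a template-based, topology-preserving framework that reconstructs organs as unified systems, jointly deforming all substructures to match patient-specific anatomy while preserving internal boundaries and producing smooth surfaces.

Details

Motivation: Existing deep-learning methods typically process an organ's substructures independently, yielding anatomically implausible reconstructions. PrIntMesh addresses this by jointly modeling the organ's geometry and spatial relationships.

Result: On the heart, hippocampus, and lungs, PrIntMesh achieves high geometric accuracy and correct topology, and remains robust with limited or noisy training data.

Insight: Jointly modeling an organ's substructures and their spatial relationships is key to reconstruction quality; the results also show the potential of template-based methods in medical image reconstruction.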

Abstract: Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.


[42] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models cs.CV | cs.AIPDF

Yuping Yan, Yuhan Xie, Yinxin Zhang, Lingjuan Lyu, Yaochu Jin

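TL;DR: The paper studies the vulnerability of vision-language-action (VLA) models to multimodal adversarial attacks and proposes VLA-Fool, a comprehensive attack framework covering adversarial perturbations at the textual, visual, and cross-modal alignment levels, exposing robustness problems of these models in realistic settings.

Details

Motivation: Although VLA models perform well in embodied environments, their robustness to multimodal adversarial attacks remains underexplored, especially under realistic multimodal and black-box conditions. Existing studies focus mainly on single-modality perturbations and overlook how cross-modal misalignment affects decision-making.

Result: Experiments show that even minor perturbations cause significant behavioral deviations of VLA models on the LIBERO benchmark, revealing the fragility of multimodal alignment.

Insight: Cross-modal alignment is a key weak point for the robustness of multimodal models; future work should address multimodal adversarial defenses in realistic settings.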

Abstract: Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.


[43] SwiTrack: Tri-State Switch for Cross-Modal Object Tracking cs.CVPDF

Boyue Xu, Ruichao Hou, Tongwei Ren, Dongming Zhou, Gangshan Wu

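TL;DR: SwiTrack proposes a tri-state switching framework for cross-modal object tracking that uses three specialized streams to handle RGB, NIR, and invalid modalities, addressing the insufficient modality-specific feature extraction and object drift of existing methods.

Details

Motivation: Existing cross-modal trackers connect RGB and NIR branches to a shared backbone, which fails to fully extract modality-specific features and struggles with invalid inputs and object drift.

Result: On the latest benchmarks, SwiTrack achieves SOTA performance, improving precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 FPS.

Insight: Cross-modal object tracking needs to handle the specific features of each modality separately and can gain robustness from dynamic template updates and temporal prediction.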

Abstract: Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, improving precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.


[44] TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid cs.CV | cs.GRPDF

Seonghun Oh, Youngjung Uh, Jin-Hwa Kim

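TL;DR: TetraSDF proposes a precise mesh extraction framework based on a multi-resolution tetrahedral grid; by composing a ReLU MLP with a tetrahedral positional encoder, it extracts meshes from SDFs efficiently and accurately.

Details

Motivation: Existing SDF mesh extraction methods either introduce discretization error or apply only to plain ReLU MLPs, so a more precise and general solution is needed.

Result: Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy and produces highly self-consistent meshes.

Insight: By introducing a tetrahedral grid encoder and an input preconditioner, TetraSDF improves mesh extraction accuracy while remaining efficient.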

Abstract: Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder’s barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder’s metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
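
The claim that barycentric interpolation preserves CPWA structure rests on a simple fact: inside one tetrahedron, the interpolated value is an affine function of the query point. The sketch below shows that fact for a single tetrahedron with scalar vertex values. It is not the TetraSDF encoder; the function names and the Cramer's-rule solver are choices made for this illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def barycentric(p, verts):
    """Barycentric weights of point p in the tetrahedron verts = (v0, v1, v2, v3)."""
    v0 = verts[0]
    # Column j of T is the edge vector v_{j+1} - v0.
    T = [[verts[j + 1][i] - v0[i] for j in range(3)] for i in range(3)]
    rhs = [p[i] - v0[i] for i in range(3)]
    d = det3(T)
    lams = []
    for j in range(3):
        # Cramer's rule: replace column j with the right-hand side.
        Tj = [row[:] for row in T]
        for i in range(3):
            Tj[i][j] = rhs[i]
        lams.append(det3(Tj) / d)
    return [1.0 - sum(lams)] + lams


def interpolate(p, verts, values):
    """Affine interpolation of per-vertex scalar values at point p."""
    return sum(w * v for w, v in zip(barycentric(p, verts), values))


# Vertex values sampled from f(x, y, z) = x + 2y + 3z are reproduced
# exactly at interior points, because the interpolant is affine.
tet = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
vals = [0.0, 1.0, 2.0, 3.0]
center_value = interpolate((0.25, 0.25, 0.25), tet, vals)
```

Because each cell contributes an affine map, composing such an encoder with a ReLU MLP keeps the whole function piecewise affine, which is the property the analytic extractor relies on.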


[45] Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM cs.CVPDF

Gergely Dinya, Péter Halász, András Lőrincz, Kristóf Karacs, Anna Gelencsér-Horváth

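TL;DR: Proposes a fast spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT) for building temporally coherent 3D semantic SLAM maps; it processes the image stream with a sliding window to reduce memory demands.

Details

Motivation: Current semantic SLAM systems face high memory demands and a lack of temporal consistency when handling continuous 3D scene updates, calling for an efficient, memory-friendly solution.

Result: Evaluation on well-known benchmarks and custom datasets demonstrates the framework's applicability to real-world scenarios such as assistive navigation.

Insight: Efficient use of VGGT in semantic SLAM, together with the memory savings of sliding-window processing, offers a new approach to real-time 3D scene understanding.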

Abstract: We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT’s high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.


[46] Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability cs.CVPDF

Abishek Karthik, Pandiyaraju V, Sreya Mynampati

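TL;DR: Proposes a hybrid deep learning framework for weed detection that combines CNNs, ViTs, and GNNs, improved with GAN-based augmentation and self-supervised contrastive pre-training; it reaches 99.33% accuracy and is suitable for real-time deployment on edge devices.

Details

Motivation: In precision agriculture, weed detection is essential for selective herbicide use. Existing methods are not robust enough to varying field conditions, so an efficient, interpretable solution that adapts to limited annotated data is needed.

Result: Achieves 99.33% accuracy, precision, recall, and F1-score on multiple benchmark datasets, showing strong robustness and generalization.

Insight: Combining the strengths of different networks improves detection in complex scenes; data augmentation and self-supervised learning are crucial for mitigating limited annotations; the framework's efficiency supports edge deployment and sustainable agriculture.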

Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was applied to balance class distributions and improve the model's generalization. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments show 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment on edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.


[47] Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling cs.CVPDF

Minseok Seo, Mark Hamilton, Changick Kim

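TL;DR: Proposes Upsample Anything, a lightweight test-time optimization framework that upsamples low-resolution features into high-resolution, pixel-wise outputs without any training, significantly improving the usability of Vision Foundation Models for pixel-level tasks.

Details

Motivation: Vision Foundation Models typically output low-resolution features downsampled by 14x/16x, which limits their direct use in pixel-level tasks. Existing upsampling methods rely on dataset-specific retraining or heavy implicit optimization, limiting scalability and generalization.

Result: Takes only about 0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and upsampling of depth and probability maps.

Insight: A simple per-image optimization can effectively improve Vision Foundation Models on pixel-level tasks while retaining generality and efficiency.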

Abstract: We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only about 0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
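
For readers unfamiliar with Joint Bilateral Upsampling, which the abstract names as one of the two ideas being bridged, here is a minimal 1D sketch of the classical, fixed-kernel version. It is not the paper's learned anisotropic kernel; `sigma_s`, `sigma_r`, `radius`, the integer `scale`, and the assumption that low-resolution sample `j` sits at high-resolution index `j * scale` are all choices made for this illustration.

```python
import math


def jbu_1d(low_feat, guide_hr, scale, sigma_s=1.0, sigma_r=0.1, radius=2):
    """Classical joint bilateral upsampling on a 1D signal.

    low_feat : low-resolution feature values (one scalar per coarse sample)
    guide_hr : high-resolution guidance signal (e.g. image intensity)
    scale    : integer upsampling factor (assumed for this sketch)
    """
    n_lr, n_hr = len(low_feat), len(guide_hr)
    out = []
    for i, g in enumerate(guide_hr):
        c = i / scale  # query position in low-resolution coordinates
        center = int(round(c))
        num = den = 0.0
        for j in range(max(0, center - radius), min(n_lr, center + radius + 1)):
            gj = guide_hr[min(n_hr - 1, j * scale)]  # guidance at coarse sample j
            w_spatial = math.exp(-((j - c) ** 2) / (2 * sigma_s ** 2))
            w_range = math.exp(-((gj - g) ** 2) / (2 * sigma_r ** 2))
            num += w_spatial * w_range * low_feat[j]
            den += w_spatial * w_range
        out.append(num / den if den > 0 else low_feat[min(n_lr - 1, center)])
    return out


# A step edge in the guidance keeps the upsampled features from blurring
# across the edge, unlike plain linear interpolation.
guide = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
upsampled = jbu_1d([10.0, 10.0, 20.0, 20.0], guide, scale=2)
```

The range term is what makes the operator edge-aware: coarse samples whose guidance value differs from the query pixel receive almost no weight.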


[48] BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks cs.CVPDF

Samuel Stevens

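TL;DR: BioBench is an ecology vision benchmark designed to address ImageNet's shortcomings on scientific imagery; by unifying multiple publicly released datasets and evaluation procedures, it provides a more reliable evaluation standard for computer vision in ecology.

Details

Motivation: ImageNet-1K linear-probe transfer accuracy, the default proxy for visual representation quality, no longer predicts model performance on scientific imagery. A more accurate benchmark is needed to fill this gap.

Result: Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. BioBench provides a stronger signal and more accurate model ranking.

Insight: BioBench gives ecology a more accurate evaluation tool and offers a reusable template for AI benchmarks in other scientific domains.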

Abstract: ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
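
Since the benchmark reports class-balanced macro-F1, a short sketch of that metric may help show why it differs from plain accuracy on imbalanced data. This is a generic implementation written for this digest, not BioBench's API; taking the class set as the union of true and predicted labels is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of exactly matching labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# A classifier that always predicts the majority class looks fine on
# accuracy but is penalized by macro-F1.
labels = [0, 0, 0, 0, 1]
preds = [0, 0, 0, 0, 0]
acc = accuracy(labels, preds)
f1 = macro_f1(labels, preds)
```

In this toy case accuracy is 0.8 while macro-F1 is 4/9, because the missed minority class contributes an F1 of zero with equal weight.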


[49] WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement cs.CVPDF

Ching-Heng Cheng, Jen-Wei Lee, Chia-Ming Lee, Chih-Chung Hsu

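TL;DR: WWE-UIE is an efficient underwater image enhancement network that combines adaptive white balance, a wavelet-based enhancement block, and a gradient-aware module, substantially reducing computational cost while maintaining strong restoration quality.

Details

Motivation: Underwater images suffer from color distortion and reduced visibility caused by wavelength-dependent absorption and scattering. Existing hybrid methods perform well but are computationally expensive, which makes real-time use difficult.

Result: On benchmark datasets, WWE-UIE achieves competitive restoration quality with fewer parameters and FLOPs, supporting real-time inference on resource-limited platforms.

Insight: Combining domain priors with lightweight design effectively balances performance and computational efficiency, making the approach suitable for real-time underwater image enhancement.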

Abstract: Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.
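
To make the white-balance prior concrete, the sketch below implements the classical gray-world correction, a fixed, non-learned baseline. It stands in for the idea only; the paper's adaptive white balance module is not specified in the abstract, and the function name and the pixel representation (a flat list of RGB tuples in [0, 1]) are assumptions.

```python
def gray_world(pixels):
    """Gray-world white balance on a list of (r, g, b) floats in [0, 1].

    Each channel is scaled so that its mean matches the mean over all
    three channels, which counteracts a global color cast.
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [tuple(min(1.0, p[c] * gains[c]) for c in range(3)) for p in pixels]


# A blue-green cast (weak red channel) is pulled back toward neutral.
cast = [(0.1, 0.4, 0.5), (0.3, 0.6, 0.7)]
balanced = gray_world(cast)
```

After correction the three channel means coincide (as long as no value is clipped at 1.0), which is the gray-world assumption stated as a constraint.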


[50] Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks cs.CVPDF

Yi Ting Tsai, Yu Wei Chen, Hong-Han Shuai, Ching-Chun Huang

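TL;DR: ARASFSR is an arbitrary-resolution, arbitrary-scale face super-resolution method based on implicit representation networks, addressing existing methods' fixed up-sampling scales and sensitivity to input size.

Details

Motivation: Existing face super-resolution methods are limited to fixed up-sampling scales and perform poorly when input size varies, which restricts their range of application.

Result: Quantitative and qualitative experiments show that ARASFSR outperforms existing methods across diverse input sizes and up-sampling scales.

Insight: Implicit representation networks show promise for arbitrary-scale, arbitrary-resolution tasks, particularly in face super-resolution.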

Abstract: Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.
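
The first design feeds local relative coordinates and the up-sampling scale ratio to a predictor for each target pixel. A common way to build those inputs in implicit image functions is sketched below in 1D; the normalization to [-1, 1], the nearest-cell lookup, and the name `query_inputs` are assumptions for this illustration and may differ from ARASFSR's actual formulation.

```python
def query_inputs(i_out, n_out, n_in):
    """Inputs for one target pixel along one axis.

    Returns (j, rel, scale):
      j     : index of the low-resolution cell containing the target pixel
      rel   : offset from that cell's center, in units of low-res pixels
      scale : up-sampling ratio n_out / n_in (may be non-integer)
    """
    coord = (2 * i_out + 1) / n_out - 1.0          # target pixel center in [-1, 1]
    j = min(n_in - 1, int((coord + 1.0) / 2.0 * n_in))
    center = (2 * j + 1) / n_in - 1.0              # center of low-res cell j
    rel = (coord - center) * n_in / 2.0            # roughly in [-0.5, 0.5)
    return j, rel, n_out / n_in


# A non-integer scale (4 -> 10 pixels, i.e. 2.5x) poses no problem,
# because every target pixel is queried at a continuous coordinate.
first = query_inputs(0, 10, 4)
last = query_inputs(9, 10, 4)
```

A predictor would then take the deep feature at cell `j` together with `rel` and `scale` and output the RGB value, so the same network serves any output size.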


[51] Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach cs.CVPDF

Chi-Han Chen, Chieh-Ming Chen, Wen-Huang Cheng, Ching-Chun Huang

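TL;DR: The paper proposes a weakly supervised, context-aware temporal consistency distillation method for river landform video segmentation in UAV remote sensing; it addresses scarce annotations and temporal inconsistency, improving segmentation performance and temporal consistency with only 30% of the labeled data.

Details

Motivation: Terrain classification from UAV remote sensing differs significantly from ground-based tasks and faces scarce labeled data, complex annotation, and insufficient temporal consistency. A method is needed that improves segmentation performance and temporal consistency without relying on fully labeled data.

Result: Experiments show that the method, using only 30% of the labeled data, improves both mIoU and temporal consistency, overcoming the deficiencies of traditional methods in UAV tasks.

Insight: Key-frame selection and updating strategies are crucial for improving temporal consistency in weakly supervised learning, and a teacher-student architecture can effectively distill knowledge to overcome limited annotations.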

Abstract: The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method, utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection


[52] Multi-Order Matching Network for Alignment-Free Depth Super-Resolution cs.CVPDF

Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li

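TL;DR: The paper proposes the Multi-Order Matching Network (MOMNet) for depth super-resolution when RGB-D data are not strictly aligned, achieving high performance and robustness through a multi-order matching mechanism and a multi-order aggregation strategy.

Details

Motivation: In real-world scenarios, strictly aligned RGB-D data are hard to obtain, so existing methods degrade on misaligned data. The paper aims to develop an alignment-free method for high-quality depth super-resolution.

Result: Experiments show that MOMNet achieves state-of-the-art performance and robustness.

Insight: Combining multi-order matching with multi-order aggregation effectively handles RGB-D misalignment and offers a new approach to depth super-resolution.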

Abstract: Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.


[53] CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement cs.CVPDF

Pan Yang, Cheng Deng, Jing Yang, Han Zhao, Yun Liu

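TL;DR: CAMS proposes a CLIP-based method with Gated Cross-Attention and Multi-Space Disentanglement to improve compositional zero-shot learning, performing especially well on unseen attribute-object compositions.

Details

Motivation: Most CLIP-based compositional zero-shot learning methods rely only on the global semantic representation from the image encoder, which limits disentanglement. CAMS aims to improve generalization to attribute-object compositions through finer-grained semantic feature extraction and disentanglement in multidimensional spaces.

Result: On three benchmarks (MIT-States, UT-Zappos, and C-GQA), CAMS achieves state-of-the-art performance in both closed-world and open-world settings.

Insight: By combining fine-grained semantic feature extraction with multi-space disentanglement, CAMS shows the importance of disentangling attributes from objects in compositional zero-shot learning and suggests a direction for future work.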

Abstract: Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.


[54] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation cs.CVPDF

Samer Abualhanud, Christian Grannemann, Max Mehltretter

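TL;DR: The paper proposes CylinderDepth, a self-supervised, multi-view-consistent depth estimation method based on cylindrical spatial attention; it projects the 3D points from multi-camera images onto a shared cylinder and applies an explicit spatial attention mechanism to improve the consistency and accuracy of depth estimates.

Details

Motivation: Existing self-supervised surround depth estimation methods produce inconsistent depth in regions where views overlap, which limits their use for dense 3D perception.

Result: Evaluated on the DDAD and nuScenes datasets, the method improves both cross-view consistency and overall depth accuracy over existing methods.

Insight: Geometric constraints combined with an explicit attention mechanism can resolve cross-view depth inconsistency while keeping the low cost of self-supervised learning.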

Abstract: Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
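
The projection onto a shared unit cylinder can be illustrated with a few lines of geometry. The sketch below assumes the cylinder axis is the z-axis of the rig frame and uses a central projection through the origin; the abstract does not state these conventions, so treat the axis choice, the function names, and the distance definition as assumptions.

```python
import math


def project_to_unit_cylinder(point):
    """Map a 3D point (x, y, z) in the rig frame to (azimuth, height)
    on a unit cylinder around the z-axis (assumed convention)."""
    x, y, z = point
    r = math.hypot(x, y)
    if r == 0.0:
        raise ValueError("point lies on the cylinder axis")
    return math.atan2(y, x), z / r


def cylinder_distance(a, b):
    """Geodesic distance between two (azimuth, height) positions on the
    unit cylinder, with wrap-around in azimuth."""
    dtheta = abs(a[0] - b[0])
    dtheta = min(dtheta, 2.0 * math.pi - dtheta)
    return math.hypot(dtheta, a[1] - b[1])


# Points seen by different cameras land close together on the cylinder
# when they are close in viewing direction, even across the +/- pi seam.
p = project_to_unit_cylinder((2.0, 0.0, 2.0))
seam = cylinder_distance((3.1, 0.0), (-3.1, 0.0))
```

A non-learned attention could then weight feature aggregation by a decreasing function of `cylinder_distance`, which is the role the position maps play in the abstract's description.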


[55] Graph Neural Networks for Surgical Scene Segmentation cs.CV | cs.LGPDF

Yihan Li, Nikhil Churamani, Maria Robu, Imanol Luengo, Danail Stoyanov

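TL;DR: This paper proposes two graph-neural-network-based segmentation methods that combine Vision Transformers with GNNs, improving the accuracy and anatomical consistency of laparoscopic surgical scene segmentation.

Details

Motivation: Accurately identifying hepatocystic anatomy during laparoscopic cholecystectomy is critical to avoiding surgical complications. Existing deep learning models struggle with occlusions, long-range dependencies, and the fine-scale geometry of rare structures.

Result: On the Endoscapes-Seg50 and CholecSeg8k benchmarks, mIoU and mDice improve by up to 7-8% and 6%, respectively, with particularly strong results on rare and safety-critical structures.

Insight: Combining ViT global context with graph-based relational reasoning improves segmentation performance as well as interpretability and reliability, supporting surgical safety.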

Abstract: Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analysis. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. They produce anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through precise identification of critical anatomical features.
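The static graph in model (1) can be illustrated with a small k-NN construction over feature vectors. This is a sketch under assumptions: the abstract does not say which features or distance define the neighbours, so Euclidean distance over per-node feature tuples and the function name `knn_graph` are hypothetical choices.

```python
import math


def knn_graph(features, k):
    """Return directed edges (i, j) linking each node to its k nearest neighbours.

    `features` is a list of equal-length numeric tuples, one per node (for
    example, one per ViT patch). Euclidean distance is an assumed metric.
    """
    edges = []
    for i, fi in enumerate(features):
        # Sort all other nodes by distance to node i; ties break on index.
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        edges.extend((i, j) for _, j in dists[:k])
    return edges
```

A graph convolution such as GCNII would then propagate information along these fixed edges, whereas the dynamic variant in model (2) learns the topology instead of fixing it in advance.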


[56] Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation cs.CVPDF

Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen

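TL;DR: The paper proposes a Language-Driven Attribute Generalization (LDAG) architecture that uses language descriptions of the target class to build a robust support strategy, addressing the bias that visual features introduce in few-shot segmentation.

Details

Motivation: Existing few-shot segmentation methods rely mainly on visual representations of the support images, but because of intra-class visual variation the extracted meta information cannot accurately guide the segmentation of unseen classes. The authors argue that the key role of the support is to provide meta guidance that is unbiased for both seen and unseen classes.

Result: Experiments show that the method outperforms existing approaches by a clear margin and sets a new state of the art.

Insight: The key to the support role is providing unbiased meta guidance rather than relying on visual information alone; language descriptions can compensate for the shortcomings of visual features through multi-modal interaction.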

Abstract: Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture that utilizes inherent target property language descriptions to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because the text-vision modal shift makes it hard for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.


[57] StreetView-Waste: A Multi-Task Dataset for Urban Waste Management cs.CVPDF

Diogo J. Paulo, João Martins, Hugo Proença, João C. Neves

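TL;DR: The paper introduces the StreetView-Waste dataset, which targets multi-task evaluation for urban waste management, covering waste container detection, tracking, and overflow segmentation, and proposes two strategies that improve baseline performance.

Details

Motivation: Although many litter detection datasets exist, monitoring waste containers in dynamic scenes (such as images captured from garbage trucks) has received little study. Existing datasets often lack specific annotations or are limited to static environments, which restricts their use in real-world logistics.

Result: The heuristic method reduces the mean absolute counting error by 79.6%; the geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models.

Insight: Multimodal inputs and geometric priors are valuable for improving performance on waste management tasks, especially in dynamic scenes.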

Abstract: Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.


[58] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference cs.CV | cs.AIPDF

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang

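TL;DR: VLA-Pruner proposes a dynamic visual token pruning method for Vision-Language-Action (VLA) models that uses a dual-level importance criterion and temporal awareness to improve real-time inference efficiency.

Details

Motivation: VLA models perform well in embodied AI, but processing continuous visual streams is computationally expensive, which limits real-time deployment. Existing pruning methods rely solely on semantic salience and overlook the dual nature of VLA models: semantic understanding and action execution.

Result: Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and robotic tasks.

Insight: Pruning for VLA models must serve both semantic and action needs, and temporal continuity plays an important role in action generation.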

Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA’s intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
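One way to picture the dual-level criterion is the sketch below. It is not the paper's selection strategy, which the abstract does not spell out: the exponential moving average used for temporal smoothing, the alternating merge of the two rankings, and the names `smooth_action_attention`, `select_tokens`, `decay`, and `budget` are assumptions for illustration only.

```python
def smooth_action_attention(previous, current, decay=0.8):
    """Estimate action-level importance by exponentially smoothing over time.

    `previous` is the running estimate from earlier steps and `current` is the
    action decode attention observed at the latest step (both per token).
    """
    return [decay * p + (1 - decay) * c for p, c in zip(previous, current)]


def select_tokens(semantic_scores, action_scores, budget):
    """Keep `budget` token indices, alternating between the two rankings.

    Alternation is an assumed merge rule chosen so that neither the semantic
    nor the action criterion can crowd out the other.
    """
    by_semantic = sorted(range(len(semantic_scores)), key=lambda i: -semantic_scores[i])
    by_action = sorted(range(len(action_scores)), key=lambda i: -action_scores[i])
    kept = []
    for s, a in zip(by_semantic, by_action):
        for idx in (s, a):
            if idx not in kept and len(kept) < budget:
                kept.append(idx)
    return sorted(kept)
```

With a budget of two tokens, the merge keeps the top semantic token and the top action token, which is the behaviour a purely semantic ranking would miss.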


[59] LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs cs.CVPDF

Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe

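TL;DR: LLaVA$^3$ represents a 3D scene in the manner of a Cubist painter, as multi-view 2D images, improving the 3D scene understanding of vision-language models (VLMs) and outperforming previous 2D-based methods without any fine-tuning.

Details

Motivation: Training data for 3D scene understanding is scarce while 2D datasets are abundant, which limits the performance of multi-modal language models on 3D scene understanding.

Result: The method outperforms previous 2D-based methods on 3D visual question answering and 3D language grounding.

Insight: Representing 3D information indirectly through 2D data is an effective alternative for improving 3D scene understanding.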

Abstract: Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.


[60] FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry cs.CVPDF

Clemens Pollak, Kersten Diers, Santiago Estrada, David Kügler, Martin Reuter

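TL;DR: FastSurfer-CC is an efficient, fully automated framework for corpus callosum morphometry that automates multiple tasks and outperforms existing tools.

Details

Motivation: The corpus callosum is an important structure in research on aging and neurological diseases, but existing tools lack a comprehensive, automated analysis pipeline.

Result: FastSurfer-CC outperforms existing tools across the individual tasks and, in a Huntington's disease study, reveals significant differences that existing methods fail to detect.

Insight: The versatility of FastSurfer-CC makes it a powerful tool for clinical research and neuroscience and may help identify new biomarkers.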

Abstract: The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington’s disease patients and healthy controls that are not detected by the current state-of-the-art.


[61] Flow and Depth Assisted Video Prediction with Latent Transformer cs.CVPDF

Eliyas Suleyman, Paul Henderson, Eksan Firkat, Nicolas Pugeault

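TL;DR: The paper studies how to incorporate point-flow and depth-map information into video prediction to improve performance in occluded scenes. By modifying a latent transformer architecture, the authors verify that this combination is effective.

Details

Motivation: Occlusion is an inherent challenge in video prediction. The paper hypothesizes that providing explicit information about motion (point flow) and geometric structure (depth maps) improves a model's predictions under occlusion and background motion.

Result: Experiments show that prediction models assisted with point flow and depth perform better in occluded scenes and predict background motion more accurately than models without this information.

Insight: Explicit motion and geometry information (such as point flow and depth maps) can improve video prediction in complex scenes such as occlusion, especially in capturing background motion and object distribution.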

Abstract: Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.


[62] Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI cs.CVPDF

Rui Wang, Yuexi Du, John Lewin, R. Todd Constable, Nicha C. Dvornek

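TL;DR: The paper proposes using image acquisition time to improve breast tumor segmentation in dynamic contrast-enhanced MRI (DCE-MRI). Feature-wise linear modulation (FiLM) layers inject the timing information into the model, improving segmentation performance and generalization.

Details

Motivation: DCE-MRI is essential in breast cancer screening and assessment, but differing acquisition protocols and individual factors cause large variation in tissue appearance, which makes automated tumor segmentation challenging. The paper's core motivation is to use acquisition-time information to improve segmentation.

Result: Experiments show that models incorporating acquisition-time information achieve better segmentation performance and stronger generalization on both in-domain and out-of-domain datasets.

Insight: Timing information is a key feature in dynamic MRI tasks, and incorporating it in a lightweight way (such as FiLM) can improve model performance without adding heavy computation.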

Abstract: Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variable number of images acquired per imaging study. We trained a baseline and different configurations of the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
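A FiLM layer scales and shifts each feature channel with parameters computed from a conditioning input, here the acquisition time. The sketch below is a simplified illustration: it assumes the scale and shift are affine functions of a scalar time, whereas the paper's FiLM generator and its inputs are not specified in the abstract. The name `film` and its parameters are hypothetical.

```python
def film(features, acquisition_time, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation: out_c = gamma_c * x_c + beta_c.

    gamma_c and beta_c are computed per channel from the acquisition time by
    an assumed affine map (a real FiLM generator is usually a small network).
    """
    out = []
    for x, wg, bg, wb, bb in zip(features, w_gamma, b_gamma, w_beta, b_beta):
        gamma = wg * acquisition_time + bg   # per-channel scale
        beta = wb * acquisition_time + bb    # per-channel shift
        out.append(gamma * x + beta)
    return out
```

Because the modulation is applied per image, the same layer can be reused for however many post-contrast phases a study contains, which matches the abstract's point about handling a variable number of images.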


[63] YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras cs.CVPDF

Fan Yang, Sosuke Yamao, Ikuo Kusajima, Atsunori Moteki, Shoichi Masui

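TL;DR: The paper proposes YOWO, a method that jointly performs indoor scene reconstruction and ceiling-mounted camera registration in a single walk-through, avoiding the inefficiency and ambiguity of traditional manual or visual-localization approaches.

Details

Motivation: Indoor visual capture with ceiling-mounted cameras (CMCs) has a wide range of applications, but registering them to the target scene layout is challenging. Manual registration is inefficient and costly, while automatic visual localization performs poorly when visual ambiguity exists.

Result: Experimental results show that the method not only accomplishes both tasks within a unified framework but also improves their performance through joint optimization.

Insight: YOWO offers an efficient and reliable solution for indoor scene reconstruction and multi-view camera registration, facilitating downstream position-aware applications.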

Abstract: Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.


[64] BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization cs.CVPDF

Rahul Kumar, Vipul Baghel, Sudhanshu Singh, Bikash Kumar Badatya, Shivam Yadav

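TL;DR: The paper presents BoxingVI, a multi-modal benchmark dataset designed for boxing action recognition and localization. It contains 6,915 high-quality punch clips labeled with six punch types and aims to support research on real-time vision-based action recognition in low-resource and unconstrained environments.

Details

Motivation: Because actions are dynamic and unstructured and recording environments vary, robust datasets remain a bottleneck for visual analysis of combat sports. To address this, the authors present a comprehensive, precisely annotated dataset to advance action recognition and performance assessment in boxing.

Result: The work produces an annotated dataset covering six punch types, supporting research on real-time vision-based action recognition in low-resource and unconstrained environments.

Insight: The dataset's diversity and precise annotation provide an important foundation for boxing action recognition and performance analysis, and suit practical applications such as automated coaching and athletic performance assessment.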

Abstract: Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.


[65] Contrastive vision-language learning with paraphrasing and negation cs.CV | cs.LGPDF

Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d’Avila Garcez

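TL;DR: The paper proposes SemCLIP, which improves CLIP with a contrastive learning approach that combines paraphrasing and negation, increasing robustness to semantic transformations.

Details

Motivation: CLIP performs inconsistently on negated or paraphrased text, so its robustness to semantic transformations and its alignment need improvement.

Result: On the CC-Neg benchmark, accuracy improves from 68.1% to 78.1%. Results against CLIP on the Sugarcrepe++ benchmark are mixed, although SemCLIP generally outperforms models trained with negated captions; on downstream zero-shot classification it outperforms CLIP on all tested tasks.

Insight: Contrastive learning that combines paraphrasing and negation can improve the robustness of vision-language models to semantic transformations, particularly for negated text.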

Abstract: Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP’s performance while considerably increasing the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP’s performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.


[66] Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation cs.CVPDF

Haytham Ziani

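TL;DR: The paper analyzes and compares local optical flow methods (such as Lucas-Kanade) and global methods (such as Horn-Schunck), and implements a multiresolution Horn-Schunck algorithm that uses bilinear interpolation.

Details

Motivation: Optical flow computation is a core problem in computer vision, but the accuracy and robustness of existing methods in complex scenes leave room for improvement. The paper aims to improve the accuracy of optical flow estimation by combining Horn-Schunck with a multiresolution strategy and interpolation.

Result: The multiresolution Horn-Schunck implementation is intended to improve accuracy and convergence; the study examines how effective the combined strategies are under varying image conditions.

Insight: Multiresolution strategies and interpolation can improve optical flow algorithms, especially for scenes with large displacements or a wide dynamic range.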

Abstract: This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
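The two building blocks named in the abstract can be sketched as follows: the classical Horn-Schunck per-pixel update, and a bilinear prolongation that carries a coarse flow field to the next finer level. This is a minimal sketch rather than the paper's implementation; the function names, the list-of-lists grid layout, and the simple corner-aligned sampling in `prolong` are assumptions.

```python
def horn_schunck_update(ix, iy, it, u_avg, v_avg, alpha):
    """One Jacobi update of the flow (u, v) at a single pixel.

    ix, iy, it are the image derivatives, (u_avg, v_avg) the neighbourhood
    average of the current flow, and alpha the smoothness weight.
    """
    residual = (ix * u_avg + iy * v_avg + it) / (alpha ** 2 + ix ** 2 + iy ** 2)
    return u_avg - ix * residual, v_avg - iy * residual


def bilinear_sample(grid, x, y):
    """Bilinearly interpolate grid[row][col] at real-valued (x, y), clamped."""
    rows, cols = len(grid), len(grid[0])
    x = min(max(x, 0.0), cols - 1.0)
    y = min(max(y, 0.0), rows - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * grid[y0][x0] + fx * grid[y0][x1]
    bottom = (1 - fx) * grid[y1][x0] + fx * grid[y1][x1]
    return (1 - fy) * top + fy * bottom


def prolong(flow, factor=2):
    """Upsample one flow component to the finer level and rescale its values.

    Flow vectors are multiplied by `factor` because displacements measured in
    coarse pixels span `factor` times as many fine pixels.
    """
    rows, cols = len(flow) * factor, len(flow[0]) * factor
    return [
        [factor * bilinear_sample(flow, c / factor, r / factor) for c in range(cols)]
        for r in range(rows)
    ]
```

In a coarse-to-fine scheme, the flow is first solved on the smallest image, prolonged, and then used as the starting point for the iterations at the next resolution, which helps with displacements larger than a pixel.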


[67] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution cs.CV | cs.AIPDF

Jaime Álvarez Urueña, David Camacho, Javier Huertas Tato

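TL;DR: The paper proposes a novel two-stage detection framework that uses supervised contrastive learning and few-shot learning to address the generalization challenge in synthetic image detection.

Details

Motivation: Rapid progress in generative AI makes synthetic images increasingly hard to distinguish from authentic ones, and traditional detection methods become impractical because they depend on periodic retraining.

Result: In the few-shot setting, detection accuracy reaches 91.3%, and on the source attribution task AUC and OSCR improve by 14.70% and 4.27% respectively.

Insight: Supervised contrastive learning combined with few-shot learning can handle new generative models without frequent retraining.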

Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
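The second stage can be pictured as a plain k-NN vote over embeddings. The sketch below assumes Euclidean distance and majority voting; the abstract does not state the metric, the value of k, or the voting rule, and `knn_predict` with its toy 2D embeddings is purely illustrative.

```python
import math
from collections import Counter


def knn_predict(support, query, k=3):
    """Classify `query` by majority vote among its k nearest support samples.

    `support` is a list of (embedding, label) pairs, standing in for the few
    labeled images drawn from a previously unseen generator.
    """
    nearest = sorted(support, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Adapting to a new generator then only requires adding a handful of its embeddings to `support`; the embedding model from the first stage stays frozen.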


[68] Progressive Supernet Training for Efficient Visual Autoregressive Modeling cs.CVPDF

Xiaoyue Chen, Yuling Shi, Kaiyuan Li, Huandong Wang, Yong Li

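TL;DR: The paper proposes VARiant, which exploits a scale-depth asymmetric dependency and a progressive training strategy to reduce memory use and improve efficiency in visual autoregressive models while preserving generation quality.

Details

Motivation: Visual autoregressive models incur large memory overhead in multi-scale generation, which limits practical deployment. The proposed optimization builds on the observed scale-depth asymmetric dependency.

Result: On ImageNet, VARiant reduces memory consumption by 40-80% while keeping generation quality close to that of the original model (similar FID).

Insight: The scale-depth asymmetric dependency offers a new perspective on model optimization; combining weight sharing with progressive training is key to improving both efficiency and quality.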

Abstract: Visual Auto-Regressive (VAR) models significantly reduce inference steps through the “next-scale” prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize a subnet. The subnets and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between the subnets and the full network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant’s single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
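Equidistant sampling of layers for a subnet could look like the sketch below. The abstract does not say exactly which layer indices are chosen, so including both the first and the last layer and rounding to the nearest index are assumptions, as is the function name `equidistant_layers`.

```python
def equidistant_layers(total_layers, subnet_depth):
    """Pick `subnet_depth` evenly spaced layer indices out of `total_layers`.

    Assumption: the first and last layers are always kept and intermediate
    picks are rounded to the nearest index.
    """
    if not 1 <= subnet_depth <= total_layers:
        raise ValueError("subnet_depth must be between 1 and total_layers")
    if subnet_depth == 1:
        return [0]
    step = (total_layers - 1) / (subnet_depth - 1)
    return [round(i * step) for i in range(subnet_depth)]
```

Because every subnet reuses layers of the full network by index, switching depth at runtime amounts to choosing a different index list, which is consistent with the weight sharing described in the abstract.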


[69] NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening cs.CV | cs.AIPDF

Misaal Khan, Mayank Vatsa, Kuldeep Singh, Richa Singh

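TL;DR: NutriScreener combines CLIP visual embeddings, knowledge retrieval, and context awareness in a multi-pose graph attention network to screen for child malnutrition and predict anthropometric measurements, improving recall and reducing error.

Details

Motivation: Child malnutrition is a serious global problem; existing screening methods are laborious and scale poorly, which hinders early intervention.

Result: In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency. Evaluated on AnthroVision and additional cross-continent datasets, it achieves 0.79 recall and 0.82 AUC with significantly lower anthropometric RMSE.

Insight: Demographically matched knowledge bases improve model performance, with up to 25% recall gain in cross-dataset tests. NutriScreener offers a scalable solution for low-resource settings.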

Abstract: Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children’s images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.


[70] Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks cs.CVPDF

Nirjhor Datta, Md. Golam Rabiul Alam

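TL;DR: The paper proposes Erase to Retain, a controllable unlearning framework for medical image segmentation networks that uses Low-Rank Adaptation (LoRA) to forget selected knowledge while preserving global anatomical understanding.

Details

Motivation: With growing demands for privacy compliance and ethical deployment, a method is needed that can selectively remove specific knowledge (such as lesions or classes) from medical segmentation networks without full retraining.

Result: On ISIC segmentation, forget-set IoU drops from 0.875 to 0.509 while performance on the retain and validation splits remains competitive; on ISIC classification, forget-set accuracy drops from 87.0% to 64.1% while retain accuracy rises from 83.9% to 90.6%.

Insight: LoRA-based subspace unlearning offers a controllable and reversible way to unlearn in medical image analysis, suited to removing sensitive data or structures.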

Abstract: The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher’s confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent. These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
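The low-rank update at the heart of LoRA can be written as W' = W + scale * (B A), where only the small matrices B and A are trained. The sketch below shows just that arithmetic on nested lists; it does not reproduce the paper's adversarial unlearning objective or its restoration phase, and the names `matmul`, `lora_weight`, and `scale` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, scale=1.0):
    """Return the effective weight W + scale * (B @ A).

    `w` is the frozen d_out x d_in weight, `b` is d_out x r and `a` is
    r x d_in, with rank r much smaller than d_out and d_in in practice.
    """
    delta = matmul(b, a)
    return [
        [wij + scale * dij for wij, dij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

Since the base weight `w` is never modified, dropping the adapter (or setting `b` to zeros) restores the original model, which is one way to read the abstract's claim that this style of unlearning is reversible.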


[71] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking cs.CV | eess.IV | q-bio.TOPDF

Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin

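TL;DR: SAM2S is a foundation model that enhances SAM2 for semantic long-term tracking in surgical videos, improving segmentation through a diverse memory mechanism, temporal semantic learning, and ambiguity-resilient learning.

Details

Motivation: Surgical video segmentation is crucial for computer-assisted surgery, but existing iVOS models such as SAM2 face a domain gap and limited long-term tracking.

Result: SAM2S reaches an average J&F score of 80.42 on SA-SV, 17.10 points above vanilla SAM2, with real-time inference at 68 FPS.

Insight: Diverse memory and semantic learning are key to long-term tracking in surgical videos, and ambiguity-resilient learning mitigates annotation inconsistencies.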

Abstract: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.


[72] Adaptive Guided Upsampling for Low-light Image Enhancement cs.CV | cs.LG | eess.IVPDF

Angela Vivian Dcosta, Chunbo Song, Rafael Radkowski

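TL;DR: This paper proposes Adaptive Guided Upsampling (AGU), a method for low-light image enhancement that optimizes several image quality characteristics at once, such as reducing noise and increasing sharpness.

Details

Motivation: Existing guided image methods perform poorly in low light because the images are noisy and dark and lack sufficient characteristics to transfer. The paper addresses this with multi-parameter optimization.

Result: Experiments show that AGU outperforms existing methods in low-light scenes and renders high-quality images in real time.

Insight: By learning the association between low-light and bright image characteristics, AGU overcomes the weakness of conventional guided methods in low light.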

Abstract: We introduce Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images capable of optimizing multiple image quality characteristics at the same time, such as reducing noise and increasing sharpness. It is based on a guided image method, which transfers image characteristics from a guidance image to the target image. With state-of-the-art guided methods, low-light images lack sufficient characteristics for this purpose due to their high noise level and low brightness, so the resulting images are suboptimal or not significantly improved. We solve this problem with multi-parameter optimization, learning the association between multiple low-light and bright image characteristics. Our proposed machine learning method learns these characteristics from a few sample image pairs. AGU can render high-quality images in real time using low-quality, low-resolution input; our experiments demonstrate that it is superior to state-of-the-art methods in the addressed low-light use case.


[73] TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming cs.CVPDF

Zeyuan Yin, Xiaoming Liu

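TL;DR: TRIM proposes temporal and spatial trimming strategies that use a lightweight selector model and instance mask denoising to significantly improve the inference efficiency and quality of 3D Gaussian diffusion models.

Details

Motivation: Existing 3D Gaussian diffusion models spend too long on denoising and post-denoising processing because of the massive number of Gaussian primitives, which makes generation slow and hard to scale. TRIM targets this problem.

Result: Experiments show that TRIM significantly improves the efficiency and scalability of 3D generation without compromising output quality.

Insight: Temporal and spatial trimming balances efficiency and quality in 3D diffusion models and shows the potential of post-training approaches.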

Abstract: Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at https://github.com/zeyuanyin/TRIM.


[74] Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision cs.CVPDF

Shuyu Cao, Chongshou Li, Jie Xu, Tianrui Li, Na Zhao

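TL;DR: This paper proposes a novel 3D hierarchical semantic segmentation framework that uses late-decoupled branches and semantic prototype discrimination to address multi-hierarchy conflicts and class imbalance, achieving state-of-the-art performance on multiple datasets.

Details

Motivation: Existing 3D hierarchical semantic segmentation methods suffer from multi-hierarchy optimization conflicts and class imbalance, which hurt model performance.

Result: The approach achieves state-of-the-art performance across multiple datasets and backbones, and its core components work as plug-and-play enhancements for previous methods.

Insight: Late decoupling across hierarchies and semantic prototype discrimination are effective ways to address multi-hierarchy conflicts and class imbalance.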

Abstract: 3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which causes model performance to be dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.


[75] Solving Spatial Supersensing Without Spatial Supersensing cs.CV | cs.LGPDF

Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge

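TL;DR: Through a critical analysis of Cambrian-S and its two benchmarks (VSR and VSC), this paper shows that current spatial supersensing benchmarks fail to measure spatial cognition effectively and that the associated inference methods likely rely on shortcuts rather than genuine spatial supersensing.

Details

Motivation: The paper aims to expose the limitations of current spatial supersensing benchmarks (VSR and VSC) and their inference methods, arguing that these benchmarks may not truly evaluate spatial cognition and world modeling.

Result: NoSense reaches 95% accuracy on the VSR benchmark; under VSC-Repeat, the mean relative accuracy of Cambrian-S collapses from 42% to 0%.

Insight: Current spatial supersensing benchmarks may have design flaws, and the performance of inference methods may owe more to shortcuts in the data than to genuine spatial cognition.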

Abstract: Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: we concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
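The VSC-Repeat check can be illustrated with a toy sketch. Everything below is hypothetical: a video is modeled as a list of frames, each a set of visible object IDs, and `shortcut_count` is a deliberately naive counter that assumes scenes are never revisited. It is not the Cambrian-S inference algorithm, only an illustration of why that kind of assumption fails under repetition.

```python
def repeat_video(frames, extra_copies):
    """Concatenate a video with itself `extra_copies` additional times."""
    return frames * (1 + extra_copies)

def count_unique_objects(frames):
    """Ground truth: number of distinct object IDs seen in any frame."""
    return len({obj for frame in frames for obj in frame})

def shortcut_count(frames, segment_len):
    """Naive counter (hypothetical): treats every segment as a new,
    never-revisited scene and sums the per-segment counts."""
    total = 0
    for start in range(0, len(frames), segment_len):
        total += count_unique_objects(frames[start:start + segment_len])
    return total

# Toy video: each frame is the set of object IDs visible in it.
video = [{"chair_1", "lamp_1"}, {"chair_1", "chair_2"}, {"table_1"}]

for k in range(1, 6):
    repeated = repeat_video(video, k)
    # Repetition never changes the true number of unique objects...
    assert count_unique_objects(repeated) == count_unique_objects(video)
    # ...but a counter that assumes no revisits over-counts by a factor of k + 1.
    assert shortcut_count(repeated, len(video)) == (k + 1) * count_unique_objects(video)
```

A counter that recognizes revisited scenes would return the same value for every `k`, which is the behavior the abstract argues a genuine spatial supersensing system should show.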


[76] TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing cs.CVPDF

Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie

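TL;DR: TriDiff-4D is a diffusion-based triplane re-posing method for fast generation of high-quality 4D avatars. It generates 4D sequences of arbitrary length with an auto-regressive strategy and markedly improves generation speed, temporal consistency, motion accuracy, and visual fidelity.

Details

Motivation: Current 4D generation methods suffer from temporal and geometric inconsistencies, motion irregularities, and high computational costs. TriDiff-4D aims to overcome these limitations with a more efficient and controllable 4D generation pipeline.

Result: TriDiff-4D cuts generation time from hours to seconds while substantially improving the quality of complex motion generation and the accuracy of 3D geometry.

Insight: Combining diffusion models with a triplane representation captures 3D structure and motion priors effectively, enabling efficient, high-quality 4D generation.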

Abstract: With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.


[77] SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation cs.CVPDF

Zhenyuan Qin, Xincheng Shuai, Henghui Ding

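TL;DR: SceneDesigner proposes a controllable multi-object image generation method that achieves precise control through 9-degree-of-freedom (9-DoF) pose manipulation, addressing the limitations of existing methods in multi-object pose control.

Details

Motivation: Existing methods fall short on simultaneous control of the 9D poses (location, size, and orientation) of multiple objects, with limited controllability and degraded quality. A more flexible solution is needed.

Result: Experiments show that SceneDesigner significantly outperforms existing methods in both controllability and generation quality.

Insight: The CNOCS map and the reinforcement-learning fine-tuning strategy are the key innovations for improving multi-object pose control.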

Abstract: Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.


[78] V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models cs.CVPDF

Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang

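TL;DR: V-ReasonBench is a unified benchmark suite for evaluating the reasoning abilities of video generation models across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics.

Details

Motivation: Generative video models such as Veo-3 have shown notable progress in zero-shot reasoning, creating an urgent need for systematic and reliable evaluation.

Result: The evaluation reveals clear differences among video models across reasoning dimensions and analyzes hallucination behaviors and the effect of video duration on reasoning.

Insight: Video reasoning ability is uneven across models, and video duration and task type both affect reasoning performance.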

Abstract: Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.


[79] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO cs.CVPDF

Junhao Cheng, Liang Hou, Xin Tao, Jing Liao

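TL;DR: This paper proposes the Video-as-Answer (VANS) model, which uses Joint-GRPO to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM), addressing the challenges of multimodal input understanding, instruction-conditioned reasoning, and video generation in the Video-Next-Event Prediction (VNEP) task.

Details

Motivation: Video can convey physical-world information more effectively than language, yet video generation today is used mainly for entertainment. The paper introduces the VNEP task, which treats video as an answer modality for predicting and generating the next event, making learning more intuitive.

Result: On procedural and predictive benchmarks, VANS achieves state-of-the-art performance in both video event prediction and visualization.

Insight: VNEP, which treats video as an answer modality, has the potential to become a new research direction, and Joint-GRPO offers an effective solution to multimodal alignment.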

Abstract: While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video’s inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.


[80] Learning to Think Fast and Slow for Visual Language Models cs.CVPDF

Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou

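TL;DR: The paper proposes DualMindVLM, a dual-mode visual language model that uses reinforcement learning to switch automatically between fast and slow thinking according to task difficulty, significantly improving computational efficiency and reasoning performance.

Details

Motivation: Existing visual language models typically rely on long, detailed reasoning chains, which leads to excessive computational cost. Inspired by the human fast and slow thinking mechanism, the authors set out to design a model that adjusts its reasoning mode dynamically to task difficulty.

Result: DualMindVLM outperforms the base model and matches state-of-the-art visual reasoning models while maintaining high token efficiency.

Insight: Emulating the human fast/slow thinking mechanism can markedly improve the computational efficiency and reasoning ability of visual language models. Model output length can serve as an effective proxy for task difficulty.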

Abstract: When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
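The first stage described above, labeling each sample by the length of the model's own output, might look roughly like the sketch below. The function name, the whitespace tokenization, and the threshold value are all assumptions made for illustration; the abstract does not specify how lengths are measured or where the cut-off lies.

```python
def label_by_output_length(outputs, threshold):
    """Assign a thinking-mode label to each model output.

    `threshold` is a hypothetical cut-off in tokens; outputs longer than
    it are treated as evidence that the question needs slow thinking.
    """
    labels = []
    for text in outputs:
        length = len(text.split())  # whitespace tokens stand in for model tokens
        labels.append("slow" if length > threshold else "fast")
    return labels

# Toy outputs from a pre-trained model (invented for illustration).
outputs = [
    "The answer is 4.",
    "First, note that the triangle is right-angled, so by the Pythagorean "
    "theorem the hypotenuse is 5 and the perimeter is 12.",
]
labels = label_by_output_length(outputs, threshold=10)
print(labels)
```

In the second stage these labels would accompany the training data for GRPO; that step depends on details not given in the abstract and is not sketched here.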


[81] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards cs.CVPDF

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal

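TL;DR: EvoLMM is a self-evolving multimodal model framework that learns without supervision through an internal reward mechanism, improving performance on math-reasoning tasks.

Details

Motivation: Existing multimodal models depend on human-annotated data or external reward models, which limits their autonomy and scalability. EvoLMM aims to improve reasoning in an unsupervised way.

Result: On multimodal math-reasoning benchmarks including ChartQA, MathVista, and MathVision, performance improves by up to about 3%.

Insight: An internal reward mechanism makes unsupervised learning possible and opens a new direction for research on self-evolving multimodal models.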

Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.


[82] Dataset Distillation for Pre-Trained Self-Supervised Vision Models cs.CV | cs.AI | cs.LGPDF

George Cazenavette, Antonio Torralba, Vincent Sitzmann

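TL;DR: This paper studies dataset distillation for pre-trained self-supervised vision models: synthesizing a small set of images such that a linear classifier trained on them performs comparably to one trained on a large real dataset. The authors propose Linear Gradient Matching, which optimizes the synthetic images so that the gradients they induce in the linear classifier match those produced by real data.

Details

Motivation: Existing dataset distillation methods mainly target training randomly initialized models, whereas modern vision methods increasingly build on pre-trained self-supervised models. The authors therefore study how to synthesize effective datasets for pre-trained models.

Result: The synthetic data outperform real-image baselines and generalize across pre-trained models. For example, a dataset distilled via a DINO backbone can train a linear CLIP probe that performs competitively.

Insight: 1. The synthetic datasets not only improve performance but also transfer across models; 2. the method provides a new tool for model interpretability, such as predicting how similar two models' embedding spaces are or detecting sensitivity to spurious correlations in adversarial datasets.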

Abstract: The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models’ embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
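The gradient-matching idea can be sketched on precomputed features. The sketch below assumes a softmax cross-entropy linear classifier and uses cosine distance between flattened gradients as the matching loss; both are assumptions for illustration, since the abstract does not state the exact loss. In the described method the synthetic images themselves would be optimized through the frozen feature extractor, which is omitted here.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def linear_grad(W, feats, labels):
    """Mean cross-entropy gradient w.r.t. W (classes x dims), flattened."""
    C, D = len(W), len(W[0])
    g = [0.0] * (C * D)
    for f, y in zip(feats, labels):
        logits = [sum(w * x for w, x in zip(row, f)) for row in W]
        p = softmax(logits)
        for c in range(C):
            err = p[c] - (1.0 if c == y else 0.0)
            for d in range(D):
                g[c * D + d] += err * f[d] / len(feats)
    return g

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Hypothetical 2-class, 2-dimensional setup; features stand in for the
# outputs of a frozen pre-trained extractor.
W = [[0.1, -0.2], [0.0, 0.3]]
real_feats, real_labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0, 1, 0]
syn_feats, syn_labels = [[1.0, 0.5], [0.0, 1.0]], [0, 1]

# Matching loss: small when synthetic and real data push W the same way.
loss = cosine_distance(linear_grad(W, real_feats, real_labels),
                       linear_grad(W, syn_feats, syn_labels))
print(loss)
```

Minimizing such a loss with respect to the synthetic inputs is what would make a handful of images stand in for the full dataset when training the linear probe.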


cs.CL [Back]

[83] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning cs.CLPDF

Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani

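TL;DR: This paper studies the attribution patterns of Chain-of-Thought (CoT) reasoning in multilingual large language models (LLMs) and finds that CoT-generated reasoning chains have limitations, particularly in multilingual robustness and interpretability.

Details

Motivation: Although CoT prompting has been shown to improve task performance, questions remain about the faithfulness and interpretability of the reasoning chains it generates. The authors therefore study CoT attribution patterns in multilingual models.

Result: Experiments show that attribution scores overemphasize the final reasoning step, that structured CoT prompting improves accuracy mainly for high-resource Latin-script languages, and that controlled perturbations reduce model accuracy and attribution coherence.

Insight: The study exposes the limitations of CoT prompting, particularly for low-resource languages and interpretive transparency, pointing to directions for improving CoT.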

Abstract: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods (ContextCite for step-level attribution and Inseq for token-level attribution) to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.


[84] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language cs.CLPDF

Seungbeen Lee, Jinhong Jeong, Donghyun Kim, Yejin Son, Youngjae Yu

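TL;DR: This paper presents Motion2Mind, a framework for evaluating machines' Theory of Mind (ToM) capabilities in interpreting nonverbal cues (NVCs). Using a finely annotated video dataset, it reveals significant shortcomings of current AI systems in interpreting NVCs.

Details

Motivation: Existing Theory of Mind benchmarks focus mainly on false-belief tasks and reasoning with asymmetric information, overlooking mental states beyond belief and the richness of nonverbal communication. A new framework is needed to evaluate machines' ToM capabilities in interpreting nonverbal cues.

Result: Current AI systems perform poorly at NVC interpretation, with a substantial performance gap in the Detection task and a tendency to over-interpret in the Explanation task.

Insight: The study exposes the limitations of AI systems in understanding complex nonverbal cues and mental states and points to directions for improvement.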

Abstract: Our ability to interpret others’ mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.


[85] Liars’ Bench: Evaluating Lie Detectors for Language Models cs.CL | cs.AIPDF

Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks

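TL;DR: The paper introduces LIARS’ BENCH, a testbed for evaluating lie detection techniques for large language models (LLMs), and reveals the limitations of existing methods.

Details

Motivation: Existing techniques are typically validated in narrow settings that do not cover the diverse lies LLMs can produce. A more comprehensive testbed is needed.

Result: Existing techniques systematically fail on certain types of lies, especially in settings where the transcript alone cannot reveal whether the model lied.

Insight: Lie detection may need to draw on the model's internal states (white-box); relying on the text alone (black-box) may not be enough to assess LLM lying comprehensively.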

Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS’ BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model’s reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS’ BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it’s not possible to determine whether the model lied from the transcript alone. Overall, LIARS’ BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.


[86] Learning Tractable Distributions Of Language Model Continuations cs.CL | cs.AIPDF

Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck

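TL;DR: The paper proposes LTLA, a hybrid approach that pairs a base language model with a fixed tractable surrogate model to handle language generation under constraints that depend on future tokens, improving generation quality and efficiency.

Details

Motivation: Conventional surrogate models such as HMMs struggle with sequence-level constraints that depend on future tokens and are only weakly context aware, which degrades generation quality.

Result: LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.

Insight: By keeping the surrogate's decoder fixed and conditioning only its latent state prior on the LM's hidden representations, LTLA achieves efficient reuse of computation across prefixes together with context awareness, offering a new approach to language generation under complex constraints.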

Abstract: Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model’s next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate’s latent state prior on the LM’s hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
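The reuse argument can be made concrete with a small HMM. In the sketch below, the transition matrix `T`, emission matrix `E`, the constraint (every one of the next `horizon` tokens lies in an allowed set), and the two latent-state priors are all invented for illustration. The point is structural: because the surrogate's parameters are fixed, the backward vector is computed once, and each prefix only contributes a different prior over the current latent state. How LTLA derives that prior from the LM's hidden representations is not sketched here.

```python
def backward_constraint(T, E, allowed, horizon):
    """beta[z] = P(next `horizon` tokens are all in `allowed` | current state z).

    T[z][z2] is the transition probability, E[z][x] the emission probability.
    Depends only on the fixed HMM, so it can be shared across prefixes.
    """
    n = len(T)
    beta = [1.0] * n
    for _ in range(horizon):
        emit_ok = [sum(E[z][x] for x in allowed) * beta[z] for z in range(n)]
        beta = [sum(T[z][z2] * emit_ok[z2] for z2 in range(n)) for z in range(n)]
    return beta

def constraint_probability(prior, beta):
    """Exact continuation probability for one prefix: a dot product."""
    return sum(p * b for p, b in zip(prior, beta))

# Hypothetical 2-state HMM over a 2-token vocabulary {0, 1}.
T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.7, 0.3], [0.1, 0.9]]

beta = backward_constraint(T, E, allowed={0}, horizon=2)  # computed once

# Two prefixes, two (assumed) latent-state priors, same beta reused.
prob_a = constraint_probability([1.0, 0.0], beta)
prob_b = constraint_probability([0.5, 0.5], beta)
print(prob_a, prob_b)
```

If the surrogate's transition or emission parameters changed with every prefix, `beta` would have to be recomputed each time, which is the second efficiency pitfall the abstract describes.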


[87] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating cs.CL | cs.AIPDF

Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Ye Wang

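TL;DR: This paper proposes TS-PEFT, which uses learnable threshold gating to apply PEFT modifications selectively to a subset of position indices, avoiding the redundant modification of all indices in conventional PEFT and improving downstream task performance.

Details

Motivation: Conventional PEFT methods modify all position indices. The paper questions whether this is necessary, proposes that selective modification may be more effective, and studies how such a selective method performs.

Result: Experiments show that applying PEFT to all indices is not only superfluous but may also be counterproductive, whereas TS-PEFT can improve downstream task performance.

Insight: PEFT modifications should be more targeted; a selective approach saves resources and also improves model performance, offering a new optimization direction for fine-tuning large models.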

Abstract: In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
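One way to picture a selection function S over position indices is a per-token gate on the PEFT update. The sketch below is an assumption-laden illustration, not the paper's definition: it gates on the L2 norm of each token's PEFT delta against a scalar threshold `tau`, and treats `tau` as a fixed number, whereas the title indicates the threshold is learned.

```python
import math

def token_selective_update(base_out, peft_delta, tau):
    """Apply the PEFT delta only at positions selected by a threshold gate.

    base_out, peft_delta: lists of per-token vectors (same shape).
    tau: hypothetical threshold on the L2 norm of each token's delta.
    Returns the gated outputs and the boolean selection mask.
    """
    out, mask = [], []
    for h, d in zip(base_out, peft_delta):
        selected = math.sqrt(sum(v * v for v in d)) > tau
        mask.append(selected)
        # Unselected positions keep the frozen model's output unchanged.
        out.append([a + b for a, b in zip(h, d)] if selected else list(h))
    return out, mask

# Three token positions with invented values.
base_out = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
peft_delta = [[0.01, 0.0], [0.5, 0.5], [0.0, 0.02]]

gated, mask = token_selective_update(base_out, peft_delta, tau=0.1)
print(mask)
print(gated)
```

Conventional PEFT corresponds to a gate that selects every position; the abstract's claim is that such indiscriminate application can be counterproductive.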


[88] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning cs.CL | cs.DLPDF

Sebastian Haan

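TL;DR: SemanticCite is an AI-powered system that verifies citation accuracy through full-text analysis and evidence-based reasoning, addressing semantic citation errors and AI-generated hallucinated references in academic literature.

Details

Motivation: Academic literature suffers from citation errors and AI-generated hallucinated references, and traditional citation formats cannot point to the passages that support a specific claim. An efficient, transparent citation verification method is needed.

Result: Experiments show that fine-tuned lightweight models perform comparably to large commercial systems at lower computational cost, making them suitable for large-scale citation verification.

Insight: Beyond improving citation accuracy, the system supports peer review and quality control for AI-generated content, providing an open-source foundation for research integrity.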

Abstract: Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.


[89] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning cs.CL | cs.AIPDF

Wei Xia, Zhi-Hong Deng

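TL;DR: The paper proposes SDA (Steering-Driven Distribution Alignment), a training-free method that dynamically adjusts the output distribution of open-source large language models (LLMs) to better match human intent without fine-tuning.

Details

Motivation: As LLMs are deployed more widely, keeping their behavior aligned with human intent has become a key challenge. Conventional approaches usually require costly fine-tuning or extensive supervision; SDA targets this efficiency and engineering problem.

Result: Across 8 open-source LLMs, SDA improves performance on the 3H dimensions, with average gains of 64.4% in helpfulness, 11.5% in harmlessness, and 30% in honesty.

Insight: SDA shows that LLM behavior can be aligned effectively without fine-tuning, offering an efficient and flexible solution for model alignment.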

Abstract: With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
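The idea of "redistributing output probabilities" at inference time can be pictured with a small sketch. This is not SDA's actual procedure, which the abstract does not specify. It assumes next-token logits are available both without and with the alignment instruction in the prompt, and it moves the output distribution along the difference between the two. The function names and the `alpha` strength parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_distribution(base_logits, instructed_logits, alpha=1.0):
    """Shift logits toward the instruction-conditioned ones (illustrative only).

    alpha = 0 reproduces the base distribution, alpha = 1 reproduces the
    instruction-conditioned one, and alpha > 1 extrapolates past it.
    """
    steered = [b + alpha * (i - b) for b, i in zip(base_logits, instructed_logits)]
    return softmax(steered)


base = [2.0, 1.0, 0.0]        # hypothetical logits without the instruction
instructed = [1.0, 2.0, 0.0]  # hypothetical logits with the instruction
probs = steer_distribution(base, instructed, alpha=2.0)
```

Because the steering happens on logits, a scheme like this needs no gradient updates, which matches the training-free framing; how SDA derives and scales its own steering signal may differ.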


[90] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement cs.CL

Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, WangJie You

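TL;DR: The paper proposes a self-rewriting framework in which large reasoning models (LRMs) rewrite their own reasoning text to improve the quality of their internal thinking. Combined with selective rewriting and an efficient RL implementation, it improves reasoning accuracy while substantially shortening reasoning.

Details

Motivation: Current reinforcement learning (RL) for large reasoning models relies only on final-correctness rewards and gives no detailed supervision of the internal reasoning process, which degrades reasoning quality (over-thinking, under-thinking, and similar issues). A method for improving internal reasoning quality is therefore needed.

Result: Self-rewriting is validated across diverse tasks and model sizes: (1) accuracy improves by 0.6 while reasoning length drops by 46%; (2) the LLM-as-a-judge score rises by 7.2, substantially mitigating internal reasoning flaws.

Insight: Self-rewriting improves reasoning quality through internal feedback and avoids dependence on human annotation, while selective rewriting keeps RL efficient, offering a new direction for optimizing reasoning models.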

Abstract: Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over the internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce a self-rewriting framework, where a model rewrites its own reasoning texts and subsequently learns from the rewritten reasoning to improve the quality of its internal thought process. For algorithm design, we propose a selective rewriting approach wherein only “simple” samples, defined by the model’s consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
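The selection rule can be sketched as follows, assuming the commonly described GRPO formulation in which each rollout's advantage is its reward normalized by the mean and standard deviation of its group. Under that assumption, a query whose rollouts are all correct contributes zero advantage, which is one reading of why rewriting only those "simple" samples leaves the original reward signal intact. The function names and the 0/1 reward encoding are hypothetical, and the paper's implementation may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the GRPO style (assumed formulation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_simple_queries(rollout_rewards):
    """Return ids of queries whose sampled rollouts were all correct (reward 1.0)."""
    return [
        qid
        for qid, rewards in rollout_rewards.items()
        if rewards and all(r == 1.0 for r in rewards)
    ]


# Hypothetical batch: query id -> correctness rewards of its sampled rollouts.
batch = {"q1": [1.0, 1.0, 1.0], "q2": [1.0, 0.0, 1.0], "q3": [0.0, 0.0, 0.0]}
to_rewrite = select_simple_queries(batch)
```

Only `q1` qualifies here; `q2` still carries a non-zero learning signal and is left to ordinary generation.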


[91] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports cs.CL | cs.IR

Sherine George, Nithish Saji

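TL;DR: ESGBench is a benchmark dataset and evaluation framework for assessing explainable ESG question answering systems over corporate sustainability reports. It contains questions across multiple themes with human-curated answers and exposes challenges for current LLMs in areas such as factual consistency.

Details

Motivation: There has been no evaluation tool for explainable question answering on ESG (environmental, social, and governance) topics in corporate sustainability reports; ESGBench fills this gap.

Result: The analysis highlights key challenges for large language models in factual consistency, traceability, and domain alignment.

Insight: ESGBench provides a tool for research on transparent and accountable ESG-focused AI systems and aims to accelerate progress in the field.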

Abstract: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.


[92] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems cs.CL

Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter

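TL;DR: The paper compares text-based and image-based retrieval in multimodal retrieval-augmented generation (RAG) systems and finds that direct multimodal embedding retrieval significantly outperforms the LLM-summary-based approach.

Details

Motivation: Existing RAG systems rely on an LLM to convert images into text summaries, which loses visual detail and contextual information and hurts downstream performance. The paper analyzes how the two retrieval approaches differ.

Result: Direct multimodal embedding retrieval beats the LLM-summary approach by 13% on mAP@5 and 11% on nDCG@5 (absolute), and it yields more accurate answers.

Insight: LLM summarization loses information, whereas direct multimodal embeddings preserve visual context and improve both retrieval and question answering.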

Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate all three approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
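For readers unfamiliar with the two metrics, here is a minimal sketch of AP@k and nDCG@k for a single query, plus the arithmetic linking the absolute and relative gains. Several conventions exist; this sketch normalizes AP by the number of relevant items retrieved in the top k and computes the ideal DCG from the given list only, which may not match the paper's exact definitions. The baseline values computed at the end are merely implied by the rounded percentages in the abstract and are not reported by the authors.

```python
import math


def average_precision_at_k(relevance, k=5):
    """AP@k for one ranked list of 0/1 relevance labels (one common convention)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def ndcg_at_k(relevance, k=5):
    """nDCG@k with a log2 discount; the ideal ranking is taken from the same list."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0


def implied_baseline(absolute_gain, relative_gain):
    """If new = base + absolute_gain and relative_gain = absolute_gain / base."""
    return absolute_gain / relative_gain


map_baseline = implied_baseline(0.13, 0.32)   # roughly 0.41
ndcg_baseline = implied_baseline(0.11, 0.20)  # roughly 0.55
```

mAP@5 is then the mean of `average_precision_at_k` over all queries in the benchmark.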


[93] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs cs.CL

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski

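TL;DR: The paper proposes Nemotron Elastic, a framework that embeds multiple nested submodels in a single parent model to obtain efficient reasoning LLMs at several scales while sharply reducing training cost.

Details

Motivation: Conventional practice trains a separate LLM for each size and deployment target, which is expensive; existing compression methods lower the cost but still need substantial training resources.

Result: Applied to Nemotron Nano V2 12B, the method yields 9B and 6B submodels using only 110B training tokens, with accuracy on par with or better than the SOTA.

Insight: Nesting multiple submodels in one parent model substantially reduces training cost while maintaining strong performance, which suits multi-budget inference scenarios.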

Abstract: Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens' worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba’s structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par with or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
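The weight-sharing idea can be illustrated with a toy MLP in which a smaller submodel simply uses the first `width` hidden units of the parent's matrices. This prefix slicing is an assumption made for illustration: the paper selects what each budget keeps through a learned router and importance scores, and it also elastifies SSM and depth dimensions, none of which is modeled here. `mlp_forward` and its parameters are hypothetical names.

```python
def mlp_forward(x, w_in, w_out, width=None):
    """Two-layer ReLU MLP that uses only the first `width` hidden units.

    w_in has shape hidden x d_in and w_out has shape d_out x hidden.
    No weights are copied, so every width shares the parent's parameters.
    """
    hidden = len(w_in) if width is None else width
    h = [
        max(0.0, sum(w * xi for w, xi in zip(w_in[j], x)))
        for j in range(hidden)
    ]
    return [sum(row[j] * h[j] for j in range(hidden)) for row in w_out]


# Hypothetical parent with three hidden units.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[1.0, 1.0, 1.0]]
x = [1.0, 2.0]

parent_out = mlp_forward(x, w_in, w_out)           # all three hidden units
child_out = mlp_forward(x, w_in, w_out, width=2)   # nested two-unit submodel
```

Because every submodel reads from the same arrays, deployment memory stays that of the parent regardless of how many widths are served, which is the property the abstract's last sentence describes.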


cs.RO

[94] MiMo-Embodied: X-Embodied Foundation Model Technical Report cs.RO | cs.CL | cs.CV

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang

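TL;DR: MiMo-Embodied is the first cross-embodied foundation model to successfully integrate autonomous driving and embodied AI. It performs strongly on 17 embodied AI benchmarks and 12 autonomous driving benchmarks, achieving positive cross-domain transfer through multi-stage learning, data construction, and CoT/RL fine-tuning.

Details

Motivation: The goal is a unified cross-domain foundation model that tests the complementarity and positive-transfer potential between autonomous driving and embodied AI.

Result: Across 29 benchmarks, the model significantly outperforms existing open-source, closed-source, and specialized baselines.

Insight: Autonomous driving and embodied AI can complement each other under a unified model design; multi-domain data fusion and the fine-tuning strategy are key.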

Abstract: We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.


[95] How Robot Dogs See the Unseeable cs.RO | cs.CV

Oliver Bimber, Karl Dietrich von Ellenrieder, Michael Haller, Rakesh John Amala Arokia Nathan, Gianni Lunardi

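TL;DR: The paper proposes an animal-inspired robot vision method that uses a side-to-side peering motion to form a wide synthetic aperture, which effectively handles partial occlusion and improves scene understanding.

Details

Motivation: Conventional robot cameras, with small apertures and large depth of field, render foreground occluders and the background in sharp focus at once, so critical information is lost. Inspired by animals that estimate distance through motion parallax, the authors seek a wavelength-independent, computationally efficient way to overcome partial occlusion.

Result: Experiments show that the method restores basic scene understanding and also supports advanced visual reasoning in large multimodal models, which fail on conventionally occluded imagery. Compared with multi-view 3D vision or LiDAR, it is more robust to occlusion, computationally more efficient, and directly deployable on mobile robots.

Insight: The work shows how animal behavior can inform robot perception and suggests that synthetic aperture sensing via peering motion is key to advanced scene understanding in complex environments. It also illustrates the potential of combining biological inspiration with robotics.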

Abstract: Peering, a side-to-side motion used by animals to estimate distance through motion parallax, offers a powerful bio-inspired strategy to overcome a fundamental limitation in robotic vision: partial occlusion. Conventional robot cameras, with their small apertures and large depth of field, render both foreground obstacles and background objects in sharp focus, causing occluders to obscure critical scene information. This work establishes a formal connection between animal peering and synthetic aperture (SA) sensing from optical imaging. By having a robot execute a peering motion, its camera describes a wide synthetic aperture. Computational integration of the captured images synthesizes an image with an extremely shallow depth of field, effectively blurring out occluding elements while bringing the background into sharp focus. This efficient, wavelength-independent technique enables real-time, high-resolution perception across various spectral bands. We demonstrate that this approach not only restores basic scene understanding but also empowers advanced visual reasoning in large multimodal models, which fail with conventionally occluded imagery. Unlike feature-dependent multi-view 3D vision methods or active sensors like LiDAR, SA sensing via peering is robust to occlusion, computationally efficient, and immediately deployable on any mobile robot. This research bridges animal behavior and robotics, suggesting that peering motions for synthetic aperture sensing are a key to advanced scene understanding in complex, cluttered environments.
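The "computational integration" step can be pictured as shift-and-average: each captured view is shifted so that a chosen focal plane lines up across views, and the aligned views are averaged. The one-dimensional sketch below is a simplification under stated assumptions (integer pixel shifts, known per-view shifts, plain mean) and is not the authors' pipeline. All names and the toy pixel values are hypothetical.

```python
def shift(row, s, fill=0.0):
    """Shift a 1D image right by s pixels (left if s is negative), padding with fill."""
    n = len(row)
    return [row[i - s] if 0 <= i - s < n else fill for i in range(n)]


def integrate(views, shifts):
    """Align every view to the focal plane, then average pixel-wise."""
    aligned = [shift(v, s) for v, s in zip(views, shifts)]
    n = len(aligned)
    return [sum(col) / n for col in zip(*aligned)]


# Toy scene: a background point (value 5) and a nearer occluder (value 9).
# The occluder has the larger parallax, so it lands on a different pixel in
# each view once the background is aligned. In the centre view it hides the
# background point entirely.
views = [
    [0, 0, 0, 0, 5, 0, 9],  # camera at the left end of the peering motion
    [0, 0, 0, 9, 0, 0, 0],  # centre view: background occluded
    [9, 0, 5, 0, 0, 0, 0],  # camera at the right end
]
focused = integrate(views, shifts=[-1, 0, 1])
```

After integration the background point reinforces at one pixel while the occluder's contribution is spread over several pixels. With more views across a wider aperture, the occluder's share at any single pixel shrinks further.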


cs.IR

[96] Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation cs.IR | cs.CL

Elena V. Epure, Yashar Deldjoo, Bruno Sguerra, Markus Schedl, Manuel Moussallam

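TL;DR: The paper examines how large language models (LLMs) change the evaluation framework for music recommender systems (MRS), argues that the traditional information-retrieval paradigm no longer fits generative LLMs, and proposes new evaluation approaches and dimensions.

Details

Motivation: Traditional MRS evaluation relies mainly on accuracy in information-retrieval tasks, a paradigm that cannot fully capture recommendation quality. LLMs, being generative rather than ranking-based, expose the limits of these methods and call for rethinking the evaluation framework.

Result: LLMs bring new opportunities to MRS (such as natural-language interaction) and new challenges (such as hallucination), which require redefining evaluation criteria and practice.

Insight: LLMs change the technical paradigm of MRS and also require evaluation to move from task accuracy alone to a multi-dimensional assessment (for example, user satisfaction and fairness).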

Abstract: Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators. This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights, focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.


cs.MA

[97] The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems cs.MA | cs.CL

Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su

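TL;DR: The paper proposes a new framework for simulating and analyzing how uncooperative behaviors cause LLM-based multi-agent systems to collapse, and it shows that such behaviors severely undermine system stability.

Details

Motivation: Existing work lacks a systematic analysis of uncooperative behaviors in multi-agent systems, especially of how they evolve dynamically and affect stability. The paper aims to fill this gap.

Result: The framework generates realistic uncooperative behaviors with 96.7% accuracy. Cooperative agents maintain 100% stability, whereas uncooperative behavior leads to system collapse within 1 to 7 rounds.

Insight: The negative impact of uncooperative behavior on multi-agent systems is far greater than expected, so more robust systems must be designed to cope with it.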

Abstract: This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents’ states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.


cs.SE

[98] Green Resilience of Cyber-Physical Systems: Doctoral Dissertation cs.SE | cs.AI | cs.CV | cs.RO

Diaeddin Rimawi

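TL;DR: The research proposes an approach to balancing greenness and resilience in Online Collaborative AI Systems (OL-CAIS). By modeling system states, developing optimization policies (multi-objective optimization, game theory, and reinforcement learning), and defining quantitative metrics, it demonstrates shorter recovery time, more stable performance, and reduced human dependency.

Details

Motivation: When disruptive events occur, OL-CAIS must restore performance while limiting energy consumption, so a method is needed that optimizes resilience and greenness together.

Result: Experiments show that GResilience policies shorten recovery time, stabilize performance, and reduce human dependency. The reinforcement learning policy performs best, at the cost of a marginal increase in CO2 emissions; containerized execution halves CO2 emissions.

Insight: (1) Reinforcement learning stands out at balancing resilience and greenness; (2) containerization markedly reduces CO2 emissions; (3) catastrophic forgetting needs continued attention.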

Abstract: Cyber-physical systems (CPS) combine computational and physical components. An Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.


cs.AI

[99] Chain of Summaries: Summarization Through Iterative Questioning cs.AI | cs.CL

William Brach, Lukas Galke Poech

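TL;DR: The paper proposes Chain of Summaries (CoS), a method that produces information-dense summaries through iterative questioning, helping large language models (LLMs) handle external web content and substantially improving question-answering performance.

Details

Motivation: Web content often comes in formats unsuited to LLMs, and context length is limited, so LLMs struggle to use it directly. CoS aims to generate general-purpose, information-dense summaries that improve how LLMs process such content.

Result: On the TriviaQA, TruthfulQA, and SQUAD datasets, CoS outperforms zero-shot LLM baselines by up to 66% and specialized methods such as BRIO and PEGASUS by up to 27%. The generated summaries yield higher Q&A performance than the source content while requiring substantially fewer tokens.

Insight: CoS offers an efficient way to make web content easier for LLMs to process while retaining the possibility of human oversight, giving website maintainers and LLM users a practical tool.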

Abstract: Large Language Models (LLMs) are increasingly using external web content. However, much of this content is not easily digestible by LLMs due to LLM-unfriendly formats and limitations of context length. To address this issue, we propose a method for generating general-purpose, information-dense summaries that act as plain-text repositories of web content. Inspired by Hegel’s dialectical method, our approach, denoted as Chain of Summaries (CoS), iteratively refines an initial summary (thesis) by identifying its limitations through questioning (antithesis), leading to a general-purpose summary (synthesis) that can satisfy current and anticipate future information needs. Experiments on the TriviaQA, TruthfulQA, and SQUAD datasets demonstrate that CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as BRIO and PEGASUS by up to 27%. CoS-generated summaries yield higher Q&A performance compared to the source content, while requiring substantially fewer tokens and being agnostic to the specific downstream LLM. CoS thus represents an appealing option for website maintainers to make their content more accessible for LLMs, while retaining possibilities for human oversight.
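The thesis, antithesis, synthesis loop can be sketched as a small control flow with the model calls injected as plain callables. This is only a reading of the abstract's description; the callable names, the stopping rule, and `max_rounds` are hypothetical, and in practice each callable would wrap an LLM prompt rather than the toy lambdas used here.

```python
def chain_of_summaries(source, summarize, generate_questions, can_answer,
                       refine, max_rounds=3):
    """Iteratively refine a summary by probing it with questions.

    summarize(source) -> summary                      (thesis)
    generate_questions(source) -> list of questions
    can_answer(summary, question) -> bool             (gaps form the antithesis)
    refine(source, summary, gaps) -> summary          (synthesis)
    """
    summary = summarize(source)
    for _ in range(max_rounds):
        gaps = [q for q in generate_questions(source)
                if not can_answer(summary, q)]
        if not gaps:
            break  # the summary already covers every probe question
        summary = refine(source, summary, gaps)
    return summary


# Toy stand-ins: the "source" is a dict of facts and a question is a key.
facts = {"who": "Ada", "when": "1843", "what": "notes"}
final = chain_of_summaries(
    source=facts,
    summarize=lambda src: {"who": src["who"]},
    generate_questions=lambda src: list(src),
    can_answer=lambda summary, q: q in summary,
    refine=lambda src, summary, gaps: {**summary, **{g: src[g] for g in gaps}},
)
```

Keeping the model calls behind callables also leaves room for the human oversight the abstract mentions, since a reviewer can inspect or veto each refined summary between rounds.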


[100] Step-Audio-R1 Technical Report cs.AI | cs.CL | cs.SD

Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li

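TL;DR: Step-Audio-R1 is the first model to successfully unlock reasoning in the audio domain. Through the Modality-Grounded Reasoning Distillation (MGRD) framework, it generates reasoning chains grounded in acoustic features rather than hallucinated, disconnected reasoning. It performs strongly on audio understanding and reasoning tasks, surpassing Gemini 2.5 Pro and approaching Gemini 3 Pro.

Details

Motivation: Existing audio language models perform better with little or no reasoning, which raises the question of whether audio intelligence can truly benefit from deliberate thinking.

Result: Step-Audio-R1 surpasses Gemini 2.5 Pro and approaches Gemini 3 Pro on audio understanding and reasoning tasks, supporting the view that reasoning carries over across modalities.

Insight: Reasoning is transferable across modalities; the key is anchoring it tightly to modality-specific features.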

Abstract: Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.


[101] JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation cs.AI | cs.CL

Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy

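TL;DR: JudgeBoard proposes a new way to evaluate the judging ability of small language models (SLMs): it directly asks the model whether a candidate answer is correct, avoiding the indirection of earlier methods. Its MAJ framework substantially improves SLM judging performance through multi-agent collaboration.

Details

Motivation: Existing LLM-based evaluation frameworks rely on indirect comparison, which is hard to automate and offers limited fine-grained evaluation of reasoning outputs. JudgeBoard aims to assess SLMs' judgment ability directly and to narrow their gap to LLMs.

Result: The MAJ framework substantially improves the reliability and consistency of SLMs, and on the MATH dataset it performs comparably to or better than larger models.

Insight: Multi-agent collaboration can compensate for SLMs' weaknesses in judging tasks, making them viable for efficient evaluation.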

Abstract: While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparably to or even better than their larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
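For context on the Elo-based leaderboard, the standard Elo update is sketched below. How JudgeBoard turns judging results into pairwise outcomes is not stated in the abstract; this sketch assumes two judge models see the same item and the one whose verdict matches the ground truth "wins" (a draw if both or neither are right). The K-factor of 32 and the starting rating of 1500 are conventional defaults, not values from the paper.

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 for a win, 0.5 draw, 0.0 loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Hypothetical match: judge A is right on an item where judge B is wrong.
rating_a, rating_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

Unlike plain accuracy, ratings of this kind weight a win by the strength of the opponent, which is one reason to report both rankings side by side.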


[102] CARE-RAG - Clinical Assessment and Reasoning in RAG cs.AI | cs.CL

Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao

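TL;DR: CARE-RAG proposes an evaluation framework that measures the accuracy, consistency, and fidelity of reasoning in retrieval-augmented generation (RAG) for clinical settings, particularly where outputs must follow structured protocols.

Details

Motivation: In clinical settings, LLMs may reason incorrectly even when they retrieve the right evidence. The authors study this gap between retrieval and reasoning using Written Exposure Therapy (WET) guidelines as a testbed.

Result: Errors persist even when authoritative passages are provided, indicating that assessing reasoning matters as much as assessing retrieval.

Insight: RAG can constrain LLM outputs, but safe deployment requires rigorous evaluation of reasoning, especially in high-stakes domains such as clinical care.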

Abstract: Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.


[103] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe cs.AI | cs.CL

Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang

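TL;DR: OpenMMReasoner proposes a transparent, reproducible two-stage recipe for multimodal reasoning (supervised fine-tuning followed by reinforcement learning). With high-quality datasets and careful training design, it substantially improves multimodal reasoning and performs strongly on multiple benchmarks.

Details

Motivation: Multimodal reasoning research lacks transparent, reproducible data curation and training methods, which hinders scalable research. OpenMMReasoner aims to fill this gap with an open, general recipe.

Result: The recipe improves on the Qwen2.5-VL-7B-Instruct baseline by 11.6% across nine multimodal reasoning benchmarks, confirming the key role of data quality and training design.

Insight: High-quality data and transparent training methods are central to multimodal reasoning performance; open resources help drive future research.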

Abstract: Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.


[104] TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models cs.AI | cs.CL

Li Zhang, Zhongxuan Han, XiaoHua Feng, Jiaming Zhang, Yuyuan Li

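TL;DR: The paper proposes TOFA, a training-free federated method for one-shot adaptation of vision-language models (VLMs), addressing the high communication cost and inadequate handling of data heterogeneity in existing methods.

Details

Motivation: Existing federated methods adapt VLMs through iterative training, which incurs high communication cost and increases exposure to attacks. The aim is to solve these problems with one-shot adaptation that fully exploits multimodal information.

Result: Experiments on 9 datasets show that TOFA performs well across various federated settings.

Insight: TOFA's novelty lies in adapting VLMs efficiently without training, using multimodal features and an adaptive mechanism to balance personalization and robustness.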

Abstract: Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases susceptibility to potential attacks. Motivated by one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) a lack of specialized adaptation strategies to systematically handle severe data heterogeneity; and (3) the need for additional training resources on the client or server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.


[105] How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI cs.AI | cs.CV | cs.MA

Bo Wen, Chen Wang, Erhan Bilal

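TL;DR: The paper studies how different modalities (text and image) affect model perception and reasoning on ARC-AGI tasks and proposes a way to separate perception errors from execution errors. Comparing nine text and image modalities, it finds that structured text suits sparse features, images suit 2D shapes but are resolution-sensitive, and combining the two improves task performance.

Details

Motivation: Instruction-first systems for ARC-AGI lack a systematic account of how modality affects perception and reasoning, in particular how to tell perception errors from execution errors. The authors compare modalities to reveal their effect on task performance.

Result: Structured text is more precise on sparse features; images capture 2D shapes but are resolution-sensitive; combining the two improves execution (about 8 perception points and about 0.20 median similarity).

Insight: (1) Modality choice significantly affects model perception; (2) combining text and image modalities offsets the weaknesses of each; (3) aligning representations with transformer inductive biases improves task performance.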

Abstract: ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks – text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing – thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline, finding that structured text yields precise coordinates on sparse features, images capture 2D shapes yet are resolution-sensitive, and combining them improves execution (about 8 perception points; about 0.20 median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.


[106] FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos cs.AI | cs.CV

Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris

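TL;DR: FOOTPASS is the first full-match, multi-modal, multi-agent tactical-context dataset for soccer. It supports computer-vision-based play-by-play action spotting and uses tactical knowledge to make data extraction more automated and reliable.

Details

Motivation: Existing soccer video understanding methods still depend on human assistance for play-by-play annotation and lack automated approaches that incorporate tactical knowledge, which limits data-driven sports analytics.

Result: FOOTPASS enables methods that generate reliable play-by-play data streams, an essential input for data-driven sports analytics.

Insight: Using tactical knowledge as a prior can improve the accuracy of computer-vision predictions in complex sports scenes and advance automated sports analytics.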

Abstract: Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.