Table of Contents
- cs.CV [Total: 85]
- cs.CL [Total: 21]
- cs.AI [Total: 5]
- cs.LG [Total: 4]
- cs.HC [Total: 1]
- cs.RO [Total: 2]
cs.CV
[1] ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects cs.CV | cs.AI
Woojin Lee, Hyugjae Chang, Jaeho Moon, Jaehyup Lee, Munchurl Kim
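TL;DR: This paper proposes ABBSPO, a weakly supervised oriented object detection framework for aerial images. Through adaptive bounding box scaling and symmetric-prior-based orientation prediction, it addresses the inaccurate scale estimation and learning collapse seen in existing horizontal-bounding-box-supervised methods, achieving state-of-the-art performance in the weakly supervised setting.
Details
Motivation: Weakly supervised oriented object detection (WS-OOD) has drawn attention as a cost-effective alternative, but existing horizontal-box-supervised methods compare the ground-truth horizontal box directly with the minimum circumscribed rectangle of the predicted rotated box. This leads to inaccurate scale estimation, and in multi-view augmented learning it can cause learning to collapse when all predictions are wrong.
Result: Extensive experiments show that ABBSPO achieves state-of-the-art performance on weakly supervised oriented object detection, outperforming existing methods.
Insight: The main contributions are: 1) Adaptive Bounding Box Scaling (ABBS), which scales the ground-truth horizontal box appropriately to optimize the size of each predicted rotated box, giving more accurate scale prediction; 2) a Symmetric Prior Angle (SPA) loss, which exploits the inherent symmetry of aerial objects for self-supervised learning and resolves the collapse that occurs when all predictions in multi-view augmented learning are consistently wrong. Viewed objectively, the method combines a geometric prior (symmetry) with adaptive scale adjustment, providing a more robust optimization target for weakly supervised rotated-box detection.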
Abstract: Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces adaptive bounding box scaling and symmetry-prior-based orientation prediction, called ABBSPO, a framework for WS-OOD. Our ABBSPO addresses limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccurate scale estimation. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits inherent symmetry of aerial objects for self-supervised learning, resolving issues in previous methods where learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.
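To make the scale issue concrete, the following is a minimal sketch of the standard geometry of a minimum circumscribed (axis-aligned) rectangle around a rotated box. It is not the authors' ABBS implementation; the function name `circumscribed_hbox` and the `(cx, cy, w, h, theta)` parameterization are assumptions chosen for illustration.

```python
import math

def circumscribed_hbox(cx, cy, w, h, theta):
    """Return (cx, cy, W, H) of the axis-aligned box enclosing a rotated box.

    theta is the rotation angle in radians (hypothetical convention).
    """
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    return cx, cy, w * c + h * s, w * s + h * c

# A 4x2 box rotated by 90 degrees is enclosed by a 2x4 box.
_, _, W90, H90 = circumscribed_hbox(0.0, 0.0, 4.0, 2.0, math.pi / 2)

# At 45 degrees, boxes of different shape share the same enclosing square,
# so matching enclosing boxes alone does not pin down the rotated box's size.
box_a = circumscribed_hbox(0.0, 0.0, 4.0, 2.0, math.pi / 4)
box_b = circumscribed_hbox(0.0, 0.0, 5.0, 1.0, math.pi / 4)
```

The last two calls illustrate one plausible source of the inaccurate scale estimation the abstract mentions: a 4x2 and a 5x1 rotated box yield the same enclosing rectangle at 45 degrees.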
[2] Diffusion Is Your Friend in Show, Suggest and Tell cs.CV | cs.CL
Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
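TL;DR: This paper proposes SST, a new image captioning paradigm that combines a diffusion model, used as a suggestion module, with an autoregressive model rather than replacing the latter, achieving state-of-the-art performance on the COCO dataset.
Details
Motivation: Diffusion denoising models perform well on generative computer vision tasks but still fail to surpass standard autoregressive models in the discrete domain. The work therefore aims to combine the strengths of both, using a diffusion model to provide suggestions that enhance autoregressive generation.
Result: SST achieves a CIDEr-D score of 125.1 on COCO without reinforcement learning, exceeding the state-of-the-art autoregressive and diffusion results by 1.5 and 2.5 points respectively.
Insight: The novelty lies in having the diffusion model work alongside the autoregressive model as a suggestion module, combining the bidirectional and refining capabilities of the former with the strong linguistic structure of the latter. Experiments show that suggestion quality correlates positively with caption quality, pointing to an underexplored research direction.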
Abstract: Diffusion Denoising models demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.
[3] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata cs.CV | cs.AI
Yihao Liu, Chenyu Gao, Lianrui Zuo, Michael E. Kim, Brian D. Boyd
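TL;DR: MetaVoxel is a joint diffusion modeling framework that models the joint distribution of medical imaging data and clinical metadata. With a single diffusion process it unifies tasks that traditionally require separate models, such as image generation, age estimation, and sex prediction, and it supports flexible zero-shot inference from arbitrary subsets of inputs.
Details
Motivation: Existing deep learning methods typically train conditional models for a specific predictive direction and a specific set of input variables. The work aims to unify tasks and improve flexibility in clinical use by modeling the joint distribution instead.
Result: Validated on more than 10,000 T1-weighted MRI scans with clinical metadata, a single MetaVoxel model achieves performance comparable to task-specific baselines on image generation, age estimation, and sex prediction.
Insight: The novelty is a joint multimodal diffusion modeling approach that captures the joint distribution of imaging and metadata, enabling task unification and flexible inference without retraining, and offering a new direction for unified medical AI models.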
Abstract: Modern deep learning methods have achieved impressive results across tasks ranging from disease classification and estimation of continuous biomarkers to generation of realistic medical images. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.
[4] Independent Density Estimation cs.CV | cs.LG
Jiahao Liu
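TL;DR: This paper proposes Independent Density Estimation (IDE), a new method aimed at the difficulty large-scale vision-language models have with compositional generalization. IDE achieves compositional generalization by learning the association between individual words in a sentence and the corresponding features in an image. Two models are built: one takes fully disentangled visual representations as input, and the other uses a variational autoencoder to obtain partially disentangled features from raw images. An entropy-based compositional inference method is also proposed to combine the predictions for each word in the sentence.
Details
Motivation: Large-scale vision-language models have achieved notable results in areas such as image captioning and conditional image generation, but still struggle to achieve human-like compositional generalization. The paper aims to address this problem.
Result: Evaluation on multiple datasets shows that the proposed models generalize better to unseen compositions than existing models.
Insight: The novelty lies in introducing IDE to model independent associations between words and image features, combined with disentangled representations and entropy-based inference, to improve compositional generalization. Viewed objectively, by emphasizing the correspondence between local features and language, the method may strengthen the model's ability to reason about novel compositions.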
Abstract: Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.
[5] TraceFlow: Dynamic 3D Reconstruction of Specular Scenes Driven by Ray Tracing cs.CV
Jiachen Tao, Junyi Wu, Haoxuan Wang, Zongxin Yang, Dawen Cai
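TL;DR: TraceFlow is a new framework for high-fidelity rendering of dynamic specular scenes. By addressing two key challenges, precise reflection direction estimation and physically accurate reflection modeling, it achieves high-quality reconstruction of dynamic specular reflections.
Details
Motivation: The work targets the challenges of precise reflection direction estimation and physically accurate reflection modeling in rendering dynamic specular scenes, in order to produce more realistic dynamic specular reflections.
Result: Extensive experiments on dynamic scene benchmarks show that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.
Insight: The contributions include a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, a Dynamic Environment Gaussian with a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, and a coarse-to-fine training strategy that improves optimization stability.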
Abstract: We present TraceFlow, a novel framework for high-fidelity rendering of dynamic specular scenes by addressing two key challenges: precise reflection direction estimation and physically accurate reflection modeling. To achieve this, we propose a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, allowing accurate reflection ray computation. Furthermore, we introduce a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, enabling physically grounded specular synthesis via rasterization and ray tracing. Finally, we devise a coarse-to-fine training strategy to improve optimization stability and promote physically meaningful decomposition. Extensive experiments on dynamic scene benchmarks demonstrate that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.
[6] Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information cs.CV
Neelima Prasad, Jarek Reynolds, Neel Karsanbhai, Tanusree Sharma, Lotus Zhang
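TL;DR: This paper proposes a novel hierarchical instance tracking task, which tracks all instances of predefined categories (both objects and parts) while maintaining the hierarchical relationships among them. To support it, the authors build the first benchmark dataset for the task, containing 2,765 unique entities in 552 videos across 40 categories. Evaluation of seven variants of four models shows that the dataset is challenging.
Details
Motivation: The work addresses tracking instances of objects and their parts in video simultaneously while preserving their hierarchical structure, in order to balance privacy preservation with information accessibility.
Result: Seven variants of four models were evaluated on the authors' newly built benchmark dataset. The results show the dataset is challenging, but no specific quantitative results or state-of-the-art claims are reported.
Insight: The novelty is the hierarchical instance tracking task and its first benchmark dataset, which unify object and part tracking within one hierarchical framework and open a new research direction for fine-grained, structured video understanding in computer vision.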
Abstract: We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals the new dataset is challenging. Our dataset is available at https://vizwiz.org/tasks-and-datasets/hierarchical-instance-tracking/
[7] Feature Coding for Scalable Machine Vision cs.CV
Md Eimran Hossain Eimon, Juan Merlos, Ashan Perera, Hari Kalva, Velibor Adzic
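TL;DR: This paper presents the Feature Coding for Machines (FCM) standard initiated by MPEG and its test model FCTM, which target the bandwidth bottleneck of transmitting intermediate features in edge-cloud collaborative inference. With a dedicated bitstream syntax and codec pipeline, FCTM achieves an average bitrate reduction of 85.14% across multiple vision tasks while preserving accuracy, offering a scalable way to deploy intelligent features in bandwidth-limited and privacy-sensitive applications.
Details
Motivation: Deep neural networks face high compute demands when deployed on edge devices, and traditional fully on-device or fully cloud-based inference involves trade-offs in latency, bandwidth, and privacy. Edge-cloud collaborative inference requires transmitting intermediate features, but existing approaches incur high bandwidth overhead, so efficient feature compression is needed for scalable machine vision deployment.
Result: On benchmarks covering multiple vision tasks (such as classification and detection), FCTM achieves an average bitrate reduction of 85.14% relative to transmitting raw features while maintaining task accuracy, providing an effective standardized solution for feature compression.
Insight: The novelty is a feature coding standard (FCM) designed for machines rather than human perception. Through a tailored bitstream syntax and codec pipeline, it substantially compresses intermediate feature data, balances bandwidth efficiency against task accuracy, and makes edge-cloud collaborative inference more practical in resource-constrained settings.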
Abstract: Deep neural networks (DNNs) drive modern machine vision but are challenging to deploy on edge devices due to high compute demands. Traditional approaches, running the full model on-device or offloading to the cloud, face trade-offs in latency, bandwidth, and privacy. Splitting the inference workload between the edge and the cloud offers a balanced solution, but transmitting intermediate features to enable such splitting introduces new bandwidth challenges. To address this, the Moving Picture Experts Group (MPEG) initiated the Feature Coding for Machines (FCM) standard, establishing a bitstream syntax and codec pipeline tailored for compressing intermediate features. This paper presents the design and performance of the Feature Coding Test Model (FCTM), showing significant bitrate reductions, averaging 85.14%, across multiple vision tasks while preserving accuracy. FCM offers a scalable path for efficient and interoperable deployment of intelligent features in bandwidth-limited and privacy-sensitive consumer applications.
[8] Latent Chain-of-Thought World Modeling for End-to-End Driving cs.CV | cs.RO
Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You
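TL;DR: This paper proposes Latent-CoT-Drive (LCDrive), an end-to-end autonomous driving model that performs chain-of-thought reasoning in a latent language rather than natural language. It interleaves action-proposal tokens with tokens grounded in a learned latent world model to predict the future outcomes of driving actions, unifying reasoning and decision making.
Details
Motivation: Existing vision-language-action driving models mostly reason in natural language, but text may not be the most efficient representation. The paper therefore explores a more efficient latent representation for chain-of-thought reasoning to improve driving performance and safety.
Result: On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger gains from interactive reinforcement learning than both non-reasoning and text-reasoning baselines.
Insight: The novelty lies in unifying chain-of-thought reasoning and decision making in an action-aligned latent space, reasoning with action-proposal tokens and world model tokens, and training with supervised learning followed by closed-loop reinforcement learning. This offers a new approach to efficient, interpretable end-to-end driving systems.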
Abstract: Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model’s output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model’s action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.
[9] Emerging Standards for Machine-to-Machine Video Coding cs.CV
Md Eimran Hossain Eimon, Velibor Adzic, Hari Kalva, Borko Furht
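TL;DR: This paper examines emerging standards for machine-to-machine (M2M) video coding, noting that traditional pixel-based codecs optimized for human vision are bandwidth intensive, scale poorly, and pose a high risk of privacy exposure in M2M scenarios. In response, MPEG has proposed two new paradigms: Video Coding for Machines (VCM) applies task-aware coding tools in the pixel domain, while Feature Coding for Machines (FCM) compresses intermediate neural network features to reduce bitrate, preserve privacy, and support compute offload. Experiments show that FCM maintains accuracy close to edge inference while significantly reducing bitrate, and the paper analyzes how different inner codecs (H.264/AVC, H.265/HEVC, H.266/VVC) affect machine task performance.
Details
Motivation: The work addresses the high bandwidth consumption, poor scalability, and privacy exposure of traditional human-oriented video codecs in machine-to-machine communication, and promotes coding standards better suited to machine consumption of visual data.
Result: FCM maintains accuracy close to edge inference while significantly reducing bitrate. In the comparison of FCM inner codecs, H.265/HEVC and H.266/VVC achieve almost identical machine task performance (replacing VVC with HEVC raises the average BD-Rate by only 1.39%), while H.264/AVC raises the average BD-Rate by 32.28% relative to VVC. For the tracking task, codec choice has little impact, and HEVC even slightly outperforms VVC (BD-Rate of -1.81%), indicating that hardware for already deployed codecs can support M2M communication without degrading performance.
Insight: The novelty is the shift from compressing pixels to compressing intermediate neural network features (FCM), which fundamentally reduces bitrate, preserves privacy, and supports compute offload. Viewed objectively, the key insight is that for machine tasks the performance gap between recent codecs (HEVC/VVC) is much smaller than commonly assumed, and for some tasks (such as tracking) an older standard such as HEVC is already good enough, which gives a practical basis for deploying efficient M2M systems on existing hardware infrastructure.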
Abstract: Machines are increasingly becoming the primary consumers of visual data, yet most deployments of machine-to-machine systems still rely on remote inference where pixel-based video is streamed using codecs optimized for human perception. Consequently, this paradigm is bandwidth intensive, scales poorly, and exposes raw images to third parties. Recent efforts in the Moving Picture Experts Group (MPEG) redesigned the pipeline for machine-to-machine communication: Video Coding for Machines (VCM) is designed to apply task-aware coding tools in the pixel domain, and Feature Coding for Machines (FCM) is designed to compress intermediate neural features to reduce bitrate, preserve privacy, and support compute offload. Experiments show that FCM is capable of maintaining accuracy close to edge inference while significantly reducing bitrate. Additional analysis of H.26X codecs used as inner codecs in FCM reveals that H.265/High Efficiency Video Coding (HEVC) and H.266/Versatile Video Coding (VVC) achieve almost identical machine task performance, with an average BD-Rate increase of 1.39% when VVC is replaced with HEVC. In contrast, H.264/Advanced Video Coding (AVC) yields an average BD-Rate increase of 32.28% compared to VVC. However, for the tracking task, the impact of codec choice is minimal: HEVC outperforms VVC with a BD-Rate of -1.81%, while AVC shows a BD-Rate of 8.79%, indicating that existing hardware for already deployed codecs can support machine-to-machine communication without degrading performance.
[10] Multi-dimensional Preference Alignment by Conditioning Reward Itself cs.CV
Jiho Jang, Jinyoung Kim, Kyungjune Baek, Nojun Kwak
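TL;DR: This paper proposes Multi Reward Conditional DPO (MCDPO), a new method that addresses the reward conflict caused by the single scalar reward used in standard DPO for diffusion model alignment. By introducing a disentangled Bradley-Terry objective and conditional training, MCDPO lets the model learn the optimization direction for multiple evaluation dimensions (such as aesthetic quality and semantic alignment) independently, and enables dynamic multi-axis control at inference time.
Details
Motivation: Standard DPO relies on the Bradley-Terry model to aggregate multiple evaluation dimensions into a single scalar reward. This creates reward conflict: the model may be forced to unlearn desirable features of a specific dimension when they appear in a globally non-preferred sample. The paper aims to resolve this fundamental limitation.
Result: Extensive experiments on Stable Diffusion 1.5 and SDXL show that MCDPO achieves superior performance on benchmarks.
Insight: The main contributions are: 1) a disentangled Bradley-Terry objective that resolves reward conflict; 2) explicit injection of a preference outcome vector as a condition during training, so that a single network learns the optimization direction for each reward axis independently; 3) dimensional reward dropout to ensure balanced optimization across dimensions; 4) a conditional framework that supports dynamic multi-axis control at inference time using classifier-free guidance, without additional training or external reward models.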
Abstract: Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks. Notably, our conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.
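The reward conflict described above can be illustrated with a small sketch. It uses the standard Bradley-Terry preference probability and the generic per-pair DPO loss written in terms of policy/reference log-ratios; the diffusion-specific form and the MCDPO objective itself are not given in the abstract and are not reproduced here. The reward values, axis names, and the summed aggregation are hypothetical.

```python
import math

def bt_prob(r_w, r_l):
    """Bradley-Terry probability that the item with score r_w is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def dpo_loss(logratio_w, logratio_l, beta=0.1):
    """Generic DPO loss for one pair; log-ratios are log(pi_theta / pi_ref)."""
    return -math.log(bt_prob(beta * logratio_w, beta * logratio_l))

# Hypothetical per-axis rewards for two samples generated from the same prompt.
rewards_a = {"aesthetic": 0.9, "alignment": 0.3}
rewards_b = {"aesthetic": 0.4, "alignment": 0.6}

# Aggregating to one scalar (here a plain sum, an assumption) picks one winner.
total_a, total_b = sum(rewards_a.values()), sum(rewards_b.values())
global_winner = "a" if total_a > total_b else "b"

# Per-axis outcomes disagree: sample b is better aligned but globally loses.
per_axis_winner = {
    axis: ("a" if rewards_a[axis] > rewards_b[axis] else "b")
    for axis in rewards_a
}
```

Under the scalar view, sample `b` is simply the losing sample, so its better semantic alignment is pushed down along with everything else. A per-axis outcome vector such as `per_axis_winner` is the kind of signal a conditional objective could use instead.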
[11] Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective cs.CV | cs.LG
Tian Liu, Anwesha Basu, James Caverlee, Shu Kong
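TL;DR: This paper proposes SWIFT (Stage-Wise Finetuning with Temperature Tuning) for semi-supervised few-shot learning (SSFSL), approached from an auto-annotation perspective. Using simple classifier initialization and temperature tuning, the method addresses the low utilization of unlabeled data and weak supervision signals caused by overly flat softmax probability distributions when finetuning a vision-language model (VLM), yielding a significant performance gain.
Details
Motivation: Semi-supervised few-shot learning aims to learn a model from a few labeled and many unlabeled examples for auto-annotation, but existing work overlooks resources such as open-source vision-language models and their pretraining data, even though the related field of few-shot learning has already used them to improve performance. The paper explores how to use these open-source resources effectively for SSFSL.
Result: Extensive experiments on five SSFSL benchmarks show that SWIFT improves accuracy by about 5 points over recent FSL and SSL methods and even rivals supervised learning that finetunes the VLM with ground-truth labels.
Insight: The novelty lies in identifying the root cause, flat softmax probability distributions during VLM finetuning that lead to low utilization of unlabeled data, and resolving it with simple classifier initialization and temperature tuning. This leads to the stage-wise finetuning framework SWIFT, which can also exploit task-relevant but noisy data to improve performance. Viewed objectively, the method combines open-source VLM resources with semi-supervised learning and offers an efficient, simple solution for practical auto-annotation.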
Abstract: Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation", as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area of few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data, and strengthening supervision signals. Building on this, we propose: Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by $\sim$5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth!
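The effect of temperature on pseudo-label confidence can be sketched with a plain softmax. This is only an illustration of the general mechanism, not the SWIFT procedure: the logits, the temperature values, and the confidence threshold of 0.95 (a common choice in threshold-based SSL methods) are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature divisor."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity-like scores that lie in a narrow range.
logits = [0.30, 0.25, 0.20, 0.22]

flat = softmax(logits, temperature=1.0)    # nearly uniform probabilities
sharp = softmax(logits, temperature=0.01)  # same ranking, far more confident

threshold = 0.95  # assumed pseudo-label confidence threshold
used_flat = max(flat) >= threshold    # sample would be discarded
used_sharp = max(sharp) >= threshold  # sample would be pseudo-labeled
```

With the flat distribution no pseudo-label passes the threshold, which matches the "zero utilization of unlabeled data" symptom in the abstract. Lowering the temperature keeps the predicted class unchanged but raises its confidence above the threshold.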
[12] RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection cs.CV | cs.AI
Zhuo Wang, Xiliang Liu, Ligang Sun
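TL;DR: This paper proposes the RobustSora benchmark to evaluate how digital watermarks affect the robustness of AI-generated video detectors. It builds a dataset of 6,500 videos covering authentic and generated content, with watermarked, de-watermarked, and spoofed-watermark variants, defines two evaluation tasks, and systematically tests how ten detection models change under watermark manipulation, revealing that existing models partially depend on watermarks.
Details
Motivation: Existing benchmarks for AI-generated video detection ignore the digital watermarks commonly embedded in generative model outputs. Detectors may rely on these watermark patterns, which undermines their true detection ability. The paper aims to evaluate how watermarks affect detector robustness.
Result: Ten models were tested on RobustSora, including specialized AIGC detectors, transformer architectures, and MLLM approaches. Performance varies by 2-8 percentage points under watermark manipulation. Transformer-based models show moderate and consistent dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp).
Insight: The novelty is a benchmark for AIGC video detection that systematically evaluates watermark robustness, which the summary describes as the first of its kind, and the finding that existing detectors have an unintended partial dependency on watermarks. This underlines the importance of watermark-aware training strategies and provides tools and direction for developing more robust detectors.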
Abstract: The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, the benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.
[13] GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule cs.CV
Rui Wang, Yimu Sun, Jingxing Guo, Huisi Wu, Jing Qin
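TL;DR: This paper proposes GDKVM, a new architecture for echocardiography video segmentation. It models inter-frame correlations with linear key-value association, introduces a gated delta rule to store intermediate memory states efficiently, and designs a key-pixel feature fusion module that integrates multi-scale local and global features to improve segmentation accuracy and robustness.
Details
Motivation: The work addresses the challenges that imaging noise, artifacts, and the deformation and motion of the heart pose for cardiac chamber segmentation in echocardiography sequences, and seeks to balance long-range spatiotemporal dependency modeling against computational efficiency and fine-grained feature representation.
Result: Validated on two mainstream echocardiography video datasets, CAMUS and EchoNet-Dynamic, GDKVM outperforms various state-of-the-art methods in segmentation accuracy and robustness while maintaining real-time performance.
Insight: The contributions include the linear key-value association mechanism, the efficient memory storage strategy based on the gated delta rule, and the multi-scale feature integration of the key-pixel feature fusion module. Together these improve robustness to boundary blurring and noise interference.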
Abstract: Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space-time memory networks have improved segmentation accuracy, they often struggle with the trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency with fine-grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key-Value Association (LKVA) to effectively model inter-frame correlations, and introduces Gated Delta Rule (GDR) to efficiently store intermediate memory states. Key-Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. Code is available at https://github.com/wangrui2025/GDKVM.
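For readers unfamiliar with delta-rule memories, here is a minimal sketch of a gated delta-rule update for a key-value memory matrix, in the form commonly used in the linear-attention literature: S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T. The abstract does not give GDKVM's exact equations, so the formula, the scalar gates `alpha` and `beta`, and the function names are assumptions for illustration only.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a d_v x d_k memory S (list of rows).

    alpha in [0, 1] decays the old memory; beta in [0, 1] is the write strength.
    """
    d_k = len(k)
    new_S = []
    for i, row in enumerate(S):
        # Value currently stored under key k, component i: (S k)_i
        pred = sum(row[j] * k[j] for j in range(d_k))
        # Row i of alpha * S (I - beta k k^T) + beta * v k^T
        new_S.append(
            [alpha * (row[j] - beta * pred * k[j]) + beta * v[i] * k[j]
             for j in range(d_k)]
        )
    return new_S

def read(S, q):
    """Retrieve the value associated with query q: S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]

S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]  # unit-norm key
S = gated_delta_step(S, key, [3.0, 4.0], alpha=1.0, beta=1.0)
S = gated_delta_step(S, key, [5.0, 6.0], alpha=1.0, beta=1.0)
latest = read(S, key)
```

With a unit-norm key and `beta = 1`, the second write replaces the first value instead of adding to it, which is the property that distinguishes a delta-rule memory from a purely additive linear-attention state. The gate `alpha` additionally lets the memory forget old content between frames.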
[14] VLM-NCD: Novel Class Discovery with Vision-Based Large Language Models cs.CV
Yuetong Su, Baoguo Wei, Xinyu Wang, Xu Li, Lixin Li
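TL;DR: This paper proposes VLM-NCD, a multimodal framework that fuses visual-textual semantics with prototype-guided clustering to discover novel classes in unlabeled data using prior knowledge of known classes. It models cluster centers and semantic prototypes by jointly optimizing image and text features of known classes, and uses a dual-phase discovery mechanism based on semantic affinity thresholds and adaptive clustering to separate known from novel samples dynamically.
Details
Motivation: Existing novel class discovery methods for images rely mainly on visual features and suffer from limitations such as insufficient feature discriminability and long-tail data distributions. This bottleneck needs to be overcome.
Result: Experiments on CIFAR-100 show that, compared with current methods, the approach improves accuracy on unknown classes by up to 25.3%, and it is reported as the first in the novel class discovery literature to show distinctive resilience to long-tail distributions.
Insight: The novelty lies in strengthening feature discriminability through multimodal (visual-textual) semantic fusion and in a dual-phase discovery mechanism that handles known and unknown samples dynamically. Viewed objectively, combining the semantic priors of large language models with visual prototype clustering offers a new approach to novel class discovery under long-tail distributions.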
Abstract: Novel Class Discovery aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype-guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known class image and text features, and a dual-phase discovery mechanism that dynamically separates known or novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that compared to the current methods, this method achieves up to 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long-tail distributions, a first in NCD literature.
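The idea of separating known from novel samples with an affinity threshold can be sketched generically as follows. This is a hypothetical illustration, not the paper's mechanism: the cosine-similarity affinity, the threshold `tau`, the toy two-dimensional prototypes, and the `route` helper are all assumptions, and the subsequent adaptive clustering of novel samples is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route(feature, prototypes, tau=0.8):
    """Return the best-matching known class, or None if the sample looks novel."""
    best = max(prototypes, key=lambda name: cosine(feature, prototypes[name]))
    return best if cosine(feature, prototypes[best]) >= tau else None

# Toy prototypes for two known classes (hypothetical values).
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

known = route([0.9, 0.1], prototypes)  # close to the "cat" prototype
novel = route([1.0, 1.0], prototypes)  # equally far from both prototypes
```

Samples routed to `None` would be handed to a clustering stage to form candidate novel classes, while the rest are assigned to their nearest known prototype.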
[15] Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction cs.CV
Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, Li Fuxin
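TL;DR: Long-LRM++ combines a semi-explicit scene representation with a lightweight decoder. It targets two problems: generalizable Gaussian splatting (GS) methods blur fine details because small errors creep in when millions of Gaussian parameters are predicted at once, and implicit representation methods are too computationally intensive for real-time rendering. The model achieves real-time rendering while maintaining high rendering quality.
Details
Motivation: Existing generalizable Gaussian splatting methods are error-sensitive when predicting Gaussian parameters in one pass, which blurs fine details such as text, while implicit representation methods (such as LVSM and LaCT) cannot render in real time because of their computationally intensive decoding.
Result: On the DL3DV benchmark, Long-LRM++ matches the rendering quality of LaCT while achieving real-time rendering at 14 FPS on an A100 GPU. The model scales to 64 input views at 950×540 resolution and delivers better depth prediction on ScanNetv2 than rendering depth directly from Gaussians.
Insight: The novelty is the design of a semi-explicit scene representation paired with a lightweight decoder, which balances rendering quality and speed and avoids the deep, sequential decompression required by implicit methods. It achieves real-time performance while preserving high fidelity and generalizes well as the number of input views increases.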
Abstract: Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential “decompression” process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.
[16] MotionEdit: Benchmarking and Learning Motion-Centric Image Editing cs.CV | cs.AI | cs.CL
Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu
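TL;DR: This paper introduces the MotionEdit dataset and the MotionEdit-Bench benchmark for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. To address the shortcomings of existing models on this task, the authors propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards to improve editing quality and motion fidelity.
Details
Motivation: Existing image editing datasets focus mainly on static appearance changes and lack high-quality, realistic motion edits. A dedicated dataset and benchmark are therefore needed to advance motion-centric image editing, a task that is scientifically challenging and practically valuable for applications such as frame-controlled video synthesis and animation.
Result: Evaluation on MotionEdit-Bench shows that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves the editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability.
Insight: The contributions are: 1) MotionEdit, which the summary describes as the first image editing dataset focused on high-quality, realistic motion transformations; 2) MotionEdit-Bench, a comprehensive benchmark with generative, discriminative, and preference-based metrics; 3) the MotionNFT post-training framework, which uses how well the motion flow between the input image and the model-edited image matches the ground-truth motion as a reward signal, guiding the model toward accurate motion transformations. This is a novel fine-tuning approach based on motion alignment.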
Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
[17] ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions cs.CV
Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao
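TL;DR: ShotDirector is an efficient framework for directorially controllable multi-shot video generation. It integrates parameter-level camera control with hierarchical editing-pattern-aware prompting to achieve film-like, controllable shot transitions.
Details
Motivation: Existing multi-shot video generation methods focus mainly on low-level visual consistency across shots and neglect how transitions are designed and how cinematographic language contributes to coherent narrative expression. The work aims to avoid producing mere sequential shot changes without intentional editing patterns.
Result: Extensive experiments on the constructed ShotWeaver40K dataset demonstrate the effectiveness of the framework, and a set of evaluation metrics for controllable multi-shot video generation is developed.
Insight: The contributions include a camera control module that uses 6-DoF poses and intrinsic settings for precise camera information injection, and a shot-aware mask mechanism that introduces hierarchical prompts for fine-grained control over shot content. Combining parameter-level conditions with high-level semantic guidance yields film-like, controllable shot transitions.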
Abstract: Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
[18] Physically Aware 360° View Generation from a Single Image using Disentangled Scene Embeddings cs.CV
Karthikeya KV, Narendra Bandaru
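TL;DR: This paper presents Disentangled360, a 3D-aware technique that combines direction-disentangled volume rendering with single-image 360° view synthesis for medical imaging and natural scene reconstruction. The framework explicitly separates isotropic and anisotropic contributions inside a Gaussian Splatting backbone and uses a dual-branch conditioning framework: one branch handles CT-intensity-driven scattering in volumetric data, the other handles real-world RGB scenes through normalized camera embeddings. To resolve scale ambiguity and maintain structural realism, it proposes a hybrid pose-agnostic anchoring method that adaptively samples scene depth and material transitions. Evaluations on Mip-NeRF 360, RealEstate10K, and DeepDRR show superior SSIM and LPIPS, and runtime assessments confirm its viability for interactive applications.
Details
Motivation: Current techniques either oversimplify anisotropic light behavior or fail to generalize across different scenes; this paper aims to address both issues and achieve more physically aware single-image 360° view generation.
Result: On the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets, the method shows superior SSIM and LPIPS performance, reaching SOTA-level results, and runtime assessments confirm its viability for interactive applications.
Insight: The contributions are a dual-branch conditioning framework that explicitly separates isotropic and anisotropic contributions within a Gaussian Splatting backbone, and a hybrid pose-agnostic anchoring method. These designs improve the physical realism and generalization of view synthesis without scene-specific fine-tuning or expensive photon simulations.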
Abstract: We introduce Disentangled360, an innovative 3D-aware technology that integrates the advantages of direction disentangled volume rendering with single-image 360° unique view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, one optimized for CT intensity driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360° rendering into a singular inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.
[19] Efficient-VLN: A Training-Efficient Vision-Language Navigation Model cs.CV
Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang
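TL;DR: This paper proposes Efficient-VLN, a training-efficient vision-language navigation model that targets the excessive training overhead of multimodal large language models on VLN tasks. It designs a progressive memory and a learnable recursive memory to reduce the token-processing burden of long observation histories, and introduces a dynamic mixed policy to balance exploration and efficiency.
Details
Motivation: MLLMs show great potential in vision-language navigation, but practical development is limited by substantial training overhead, which stems mainly from the quadratic computational cost of long-horizon historical observations and the exploration-efficiency trade-off in DAgger data aggregation.
Result: The model reaches success rates of 64.2% on R2R-CE and 67.0% on RxR-CE, achieving SOTA performance, while training consumes only 282 H800 GPU hours, a large reduction in training overhead compared with existing methods.
Insight: The contributions are two efficient memory mechanisms (progressive memory and learnable recursive memory) that reduce token processing, and a dynamic mixed policy that balances exploration and efficiency. Viewed objectively, these designs address the bottlenecks of computational efficiency and training-data collection in VLN and offer a practical path toward deployment.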
Abstract: Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
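The idea of a progressive memory that gives recent observations more tokens can be illustrated with a minimal sketch. The abstract does not state the actual allocation rule, so the halving schedule, the function name `progressive_token_budget`, and its parameters are all assumptions made for illustration only.

```python
def progressive_token_budget(num_frames, max_tokens=64, decay=0.5, min_tokens=1):
    """Return per-frame token counts, oldest frame first.

    Hypothetical schedule: the newest frame keeps `max_tokens`, and each
    step further into the past multiplies the budget by `decay`, never
    dropping below `min_tokens`. This is an assumed rule, not the paper's.
    """
    budgets = []
    for age in range(num_frames - 1, -1, -1):  # age 0 is the most recent frame
        tokens = max(min_tokens, int(max_tokens * decay ** age))
        budgets.append(tokens)
    return budgets


if __name__ == "__main__":
    budgets = progressive_token_budget(5)
    print(budgets)       # per-frame budgets, oldest first
    print(sum(budgets))  # total tokens kept for the whole history
```

Under a geometric schedule like this, the total token count stays bounded as the history grows, which is the kind of saving the abstract attributes to the memory design.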
[20] DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation cs.CV
Anh M. Vu, Khang P. Le, Trang T. K. Vo, Ha Thach, Huy Hung Nguyen
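TL;DR: This paper proposes DualProtoSeg, a simple and efficient framework for weakly supervised histopathology image segmentation. It combines learnable text prompt tuning with image prototypes to build a dual-modal prototype bank that captures semantic and appearance cues, and adds a multi-scale pyramid module to improve spatial precision. On the BCSS-WSSS benchmark it surpasses existing state-of-the-art methods.
Details
Motivation: Weakly supervised semantic segmentation of histopathology images suffers from inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision; the work addresses these issues with the goal of reducing annotation cost.
Result: The method surpasses existing state-of-the-art methods on the BCSS-WSSS benchmark, and analyses show the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes.
Insight: The novelty lies in jointly exploiting textual semantics and visual prototype learning: a dual-modal prototype bank integrates semantic and appearance information, and a multi-scale pyramid module mitigates oversmoothing in ViT representations, improving region discovery under weak supervision.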
Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
[21] ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation cs.CV
Khang Le, Ha Thach, Anh M. Vu, Trang T. K. Vo, Han H. Huynh
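TL;DR: This paper proposes ConStruct, a prototype learning framework for weakly supervised semantic segmentation of histopathology images. It integrates morphology-aware representations from the CONCH vision-language model, multi-scale structural cues from the SegFormer segmentation model, and text-guided semantic alignment to produce prototypes that are semantically discriminative and spatially coherent, yielding high-quality pseudo-masks without pixel-level annotations.
Details
Motivation: In histopathology WSSS, classification backbones localize only the most discriminative regions and struggle to capture the full spatial extent of tissue; the work addresses this and combines the complementary strengths of vision-language models and segmentation models.
Result: Experiments on the BCSS-WSSS dataset show that the framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.
Insight: The contributions are text-guided prototype initialization, which incorporates pathology descriptions to generate more complete and accurate pseudo-masks, and a structural distillation mechanism that transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning.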
Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.
[22] Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset cs.CV
Hyunsoo Lee, Daeum Jeon, Hyeokjae Oh
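TL;DR: This paper proposes Point2Pose, a generative framework for 3D human pose estimation from multi-view point cloud data. A spatio-temporal point cloud encoder and a pose feature encoder extract joint-wise features, and an attention-based generative regressor models the distribution of human poses conditioned on sequential point clouds and pose history. The paper also releases MVPose3D, a large-scale indoor dataset containing IMU data, dense multi-view point clouds, and RGB images.
Details
Motivation: 3D human pose estimation faces key challenges from the complex geometry of the human body, self-occluding joints, and the need for large-scale real-world motion data.
Result: Experiments show that the proposed method outperforms baseline models across multiple datasets.
Insight: The novelty lies in a conditional generative framework that models the pose distribution, and in the large-scale multimodal MVPose3D dataset, which should help advance point-cloud-based 3D pose estimation. Viewed objectively, combining generative modeling with spatio-temporal point cloud feature extraction is a promising direction for handling occlusion and complex motion.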
Abstract: We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.
[23] EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs cs.CV
Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu
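TL;DR: This paper proposes EchoingPixels, a framework that addresses the prohibitive computational overhead audio-visual large language models (AV-LLMs) incur from massive numbers of audio and video tokens. Its core is the Cross-Modal Semantic Sieve (CS2) module, which uses early cross-modal interaction to merge audio and video tokens into a single pool for joint reduction and adaptively allocates the token budget instead of fixing a budget per modality. A Synchronization-Augmented RoPE (Sync-RoPE) preserves important temporal relationships among the sparsely selected tokens. Experiments show that the method matches strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
Details
Motivation: AV-LLMs are bottlenecked by the computational cost of massive audio and video tokens. Existing token reduction methods mainly target video-only LLMs, cannot exploit audio-visual cross-modal synergies, and use static per-modality budgets that cannot adapt to dynamically changing information densities. How to reduce tokens on a joint audio-visual stream therefore remains an open problem.
Result: Extensive experiments show that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x inference speedup and reduced memory usage.
Insight: The novelty is a cross-modal adaptive token reduction framework. Its core CS2 module uses early cross-modal attention to treat audio and video tokens as one joint pool, enabling adaptive budget allocation across modalities and joint, dynamic identification of salient tokens. The co-designed Sync-RoPE preserves critical temporal modeling capability under aggressive reduction. Viewed objectively, moving cross-modal interaction earlier, unifying the token pool, and designing a positional encoding specifically to retain temporal relationships form an effective and novel route to efficient multimodal modeling.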
Abstract: Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
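The single-pool selection described above can be contrasted with fixed per-modality budgets in a short sketch. It assumes each token already has a scalar saliency score (in the paper these would come from the CS2 co-attention, which is not reproduced here); the function name `select_joint_pool` and the plain top-k rule are illustrative assumptions.

```python
def select_joint_pool(audio_scores, video_scores, keep_ratio=0.2):
    """Keep the highest-scoring tokens from one combined audio-visual pool.

    The scores are assumed inputs. No per-modality quota is enforced, so
    the split between audio and video follows the scores themselves.
    Returns kept token indices per modality, in temporal order.
    """
    pool = [("audio", i, s) for i, s in enumerate(audio_scores)]
    pool += [("video", i, s) for i, s in enumerate(video_scores)]
    budget = max(1, round(keep_ratio * len(pool)))
    top = sorted(pool, key=lambda item: item[2], reverse=True)[:budget]
    kept = {"audio": [], "video": []}
    for modality, index, _score in top:
        kept[modality].append(index)
    # Sorting restores the original temporal order within each modality.
    return {modality: sorted(indices) for modality, indices in kept.items()}


if __name__ == "__main__":
    audio = [0.9, 0.1, 0.8, 0.2]
    video = [0.3, 0.05, 0.1, 0.0, 0.2, 0.1]
    print(select_joint_pool(audio, video, keep_ratio=0.3))
```

In this toy input the audio stream is more informative, so it receives two of the three kept tokens even though it contributes fewer tokens to the pool, which is the adaptive behavior a static per-modality budget cannot provide.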
[24] StainNet: A Special Staining Self-Supervised Vision Transformer for Computational Pathology cs.CV
Jiawen Li, Jiali Hu, Xitong Ling, Yongqiang Lv, Yuxuan Chen
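TL;DR: This paper proposes StainNet, a self-supervised foundation model based on the vision transformer architecture and designed specifically for special-stain pathology images. It addresses the limitation that existing pathology foundation models are pre-trained mainly on H&E-stained images and therefore transfer poorly to special stains.
Details
Motivation: Existing pathology foundation models are pre-trained mainly on H&E-stained images, whereas special stains are frequently used in clinical practice, which limits model performance in clinical applications involving special stains.
Result: StainNet's ability is validated on an in-house liver malignancy classification task and two public ROI-level datasets; few-ratio learning and retrieval evaluations, along with comparisons against recent larger pathology foundation models, further highlight its strengths.
Insight: The novelty lies in building a large-scale self-supervised pre-trained model specifically for special-stain pathology images, using a self-distillation SSL approach trained on a large number of special-stain WSIs from the public HISTAI database to improve generalization in special-stain image analysis.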
Abstract: Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recent larger PFMs to further highlight its strengths. We have released the StainNet model weights at: https://huggingface.co/JWonderLand/StainNet.
[25] Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering cs.CV | cs.MM
Cai Xu, Jinlong Liu, Yilin Zhang, Ziyu Guan, Wei Zhao
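TL;DR: This paper proposes Informativeness-based Selective imputation Multi-View Clustering (ISMVC) for incomplete multi-view clustering. The method evaluates the informativeness of each missing position in terms of intra-view similarity and cross-view consistency, imputes only when sufficient information is available, and combines this with a variational autoencoder with a mixture-of-Gaussians prior to learn clustering-friendly latent representations.
Details
Motivation: On incomplete multi-view data, conventional imputation methods introduce noise and bias, while imputation-free methods are limited under severe incompleteness by the lack of cross-view complementarity.
Result: On multiple benchmark datasets, under a more realistic and challenging unbalanced missing scenario, ISMVC outperforms existing imputation-based and imputation-free methods.
Insight: The novelty is a lightweight, data-driven, model-agnostic selective imputation strategy. Distribution-level imputation stabilizes the aggregation of posterior distributions and explicitly models imputation uncertainty, and the strategy can be integrated into existing incomplete multi-view clustering models as a plug-in module.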
Abstract: Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.
[26] Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution cs.CV
Yi-Cheng Liao, Shyang-En Weng, Yu-Syuan Xu, Chi-Wei Hsiao, Wei-Chen Chiu
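TL;DR: This paper proposes HD-CLIP (Hierarchical Degradation CLIP), a plug-and-play module for handling unknown, complex degradations in real-world image super-resolution (Real-ISR). It decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding and integrates them into diffusion models via classifier-free projection guidance (CFPG) to guide generative restoration and suppress artifacts. The result is zero-shot adaptation: without training, it improves detail fidelity and perceptual realism of various super-resolution frameworks on real-world datasets.
Details
Motivation: In real-world image super-resolution, degradation factors are unknown, complex, and coupled. Existing methods typically assume known degradation severity or rely on CLIP text encoders that cannot capture numerical severity, which limits generalization.
Result: As a plug-and-play module, HD-CLIP can be integrated into various super-resolution frameworks without training and significantly improves detail fidelity and perceptual realism across diverse real-world datasets.
Insight: The novelty lies in decomposing a low-quality image into a semantic embedding and an ordinal degradation embedding, where the latter captures ordered relationships and supports interpolation across unseen degradation levels. CFPG integrates the degradation representation into diffusion models, using semantic cues to guide restoration and degradation cues to suppress hallucinations and artifacts, which gives zero-shot adaptation and plug-and-play flexibility.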
Abstract: Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate it into diffusion models via classifier-free guidance (CFG) and propose classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
[27] CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates cs.CV
Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat
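TL;DR: This paper introduces CoSPlan, a benchmark for evaluating large-scale vision-language models on visual sequential planning, particularly in scenarios containing erroneous steps. It also proposes a training-free method, Scene Graph Incremental updates (SGI), to improve models' ability to detect errors and complete planning sequences.
Details
Motivation: Large-scale VLMs excel at complex reasoning but remain underexplored in visual sequential planning, i.e., executing multi-step actions toward a goal. Practical planning often contains non-optimal (erroneous) steps, which challenges models to detect and correct them.
Result: Across the four CoSPlan domains (maze navigation, block rearrangement, image reconstruction, and object reorganization), even SOTA models such as Intern-VLM and Qwen2 perform poorly despite using advanced techniques such as Chain-of-Thought and Scene Graphs. The proposed SGI method yields an average performance gain of 5.2% and generalizes to traditional planning tasks such as Plan-Bench and VQA.
Insight: The novelty is a sequential planning benchmark focused on error detection and correction, together with a training-free method that strengthens planning reasoning by introducing intermediate reasoning steps (scene graph incremental updates) between the initial and goal states. The core insight is that incremental scene graph updates supply richer contextual cues to support sequence reasoning.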
Abstract: Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. To address this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
[28] Topology-Agnostic Animal Motion Generation from Text Prompt cs.CV
Keyi Chen, Mingze Sun, Zhenyu Liu, Zhangquan Chen, Ruqi Huang
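TL;DR: This paper proposes a text-driven animal motion generation method that does not depend on a fixed skeletal topology. To overcome the inability of existing methods, which rely on fixed skeletal templates, to generalize to different or perturbed topologies, the authors build OmniZoo, a large-scale multi-species animal motion dataset, and on top of it propose a generalized autoregressive motion generation framework. The core of the framework is a Topology-aware Skeleton Embedding Module that encodes the geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, the method generates temporally coherent, physically plausible, and semantically aligned motions, and supports cross-species motion style transfer.
Details
Motivation: The core limitation of current motion generation methods is the combined lack of large-scale heterogeneous animal motion data and of generative frameworks that jointly model arbitrary skeletal topologies and textual conditions, which prevents them from generalizing to skeletons with different or perturbed topologies.
Result: The paper builds the OmniZoo dataset, covering 140 species and 32,979 sequences, and proposes a generalized generation framework. The method generates text-driven motions for arbitrary skeletal topologies and supports cross-species style transfer. The abstract does not report specific quantitative metrics (such as FID or diversity scores) or comparisons with SOTA methods on particular benchmarks (such as AMASS or BABEL).
Insight: The main contributions are: 1) OmniZoo, a large-scale, multi-species animal motion dataset with multimodal annotations, which addresses data scarcity; 2) a topology-agnostic generalized generation framework whose Topology-aware Skeleton Embedding Module encodes any skeletal structure into a unified space and thereby generalizes to arbitrary skeletal topologies, a notable advance over methods that depend on fixed templates; 3) unified modeling of text-driven motion generation and cross-species style transfer.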
Abstract: Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods - the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.
[29] Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation cs.CV
Yiheng Lyu, Lian Xu, Mohammed Bennamoun, Farid Boussaid, Coen Arrow
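TL;DR: This paper proposes TranSamba, a hybrid Transformer-Mamba architecture for weakly supervised volumetric medical image segmentation. It augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks to capture 3D context efficiently, addressing the fact that existing 2D encoders ignore the inherently volumetric nature of the data.
Details
Motivation: Existing weakly supervised volumetric medical segmentation methods mostly rely on 2D encoders and neglect the inherent 3D volumetric nature of the data, so they cannot model 3D context effectively. This work aims to design an architecture that captures 3D context efficiently for weakly supervised volumetric medical segmentation.
Result: Extensive experiments on three datasets show that TranSamba establishes new state-of-the-art (SOTA) performance, consistently outperforming existing methods across modalities and pathologies.
Insight: The main novelty is the hybrid Transformer-Mamba architecture: Cross-Plane Mamba blocks exploit the linear complexity of state space models to exchange information efficiently across neighboring slices, which enhances the pairwise self-attention within slices computed by the Transformer blocks. The design achieves volumetric modeling with time complexity that scales linearly with input volume depth and constant memory usage for batch processing.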
Abstract: Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.
[30] mmCounter: Static People Counting in Dense Indoor Scenarios Using mmWave Radar cs.CV
Tarik Reza Toha, Shao-Jung Lu, Shahriar Nirjon
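TL;DR: This paper presents mmCounter, a system that uses mmWave radar to accurately count static people in dense indoor scenarios. It extracts ultra-low-frequency signals produced by breathing and micro-scale body movements and applies novel signal processing techniques to distinguish these weak signals from background noise and static objects, enabling accurate counting of static people at densities of up to three people per square meter.
Details
Motivation: Existing mmWave radars struggle to detect dense, static groups because of limited spatial resolution and reliance on movement for detection. This work addresses accurate counting of static people in dense indoor environments.
Result: Extensive evaluations in various environments show that mmCounter achieves an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. The system can count up to seven people in a three-square-meter space (no side-by-side spacing and only one meter of front-to-back distance).
Insight: The novelty lies in using ultra-low-frequency signals (< 1 Hz) to count static people, and in a multi-stage signal processing pipeline that extracts the relevant low-frequency sources with their spatial information and maps them to individual people, enabling accurate counting without prior knowledge of the number of people. Viewed objectively, the method extends mmWave radar from motion-dependent detection to static scenes and handles the signal separation challenge of high-density environments.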
Abstract: mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three square meter space, such that there is no side-by-side spacing and only a one-meter front-to-back distance.
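Isolating sub-1 Hz content from a sampled signal can be illustrated with a small frequency-domain low-pass filter. This is only a sketch of that single step: the paper's multi-stage pipeline is not described in the abstract, and the function name `lowpass_below`, the naive DFT, and the synthetic test signal are assumptions chosen to keep the example dependency-free.

```python
import cmath
import math


def lowpass_below(signal, sample_rate_hz, cutoff_hz=1.0):
    """Zero every DFT bin at or above `cutoff_hz` and return the real signal.

    Uses a naive O(n^2) DFT so that no third-party library is required;
    this is fine for short windows but not for production use.
    """
    n = len(signal)
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins above n/2 mirror negative frequencies, hence min(k, n - k).
        freq_hz = min(k, n - k) * sample_rate_hz / n
        if freq_hz >= cutoff_hz:
            spectrum[k] = 0
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


if __name__ == "__main__":
    fs, n = 10.0, 100  # 10 s window sampled at 10 Hz (illustrative values)
    slow = [math.sin(2 * math.pi * 0.3 * t / fs) for t in range(n)]       # breathing-like
    fast = [0.5 * math.sin(2 * math.pi * 3.0 * t / fs) for t in range(n)]  # faster motion
    filtered = lowpass_below([s + f for s, f in zip(slow, fast)], fs)
    print(max(abs(a - b) for a, b in zip(filtered, slow)))  # residual error
```

Both synthetic components fall exactly on DFT bins here, so the filter recovers the slow component almost exactly; real radar returns would additionally need windowing and noise handling.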
[31] Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task cs.CV
Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang
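TL;DR: This paper proposes a tool-augmented Spatiotemporal Reasoning Framework (STAR) that equips multimodal large language models (MLLMs) with a comprehensive and extensible video toolkit to strengthen spatiotemporal reasoning on complex video question answering (VideoQA). By strategically scheduling temporal and spatial tools, the framework progressively localizes the key area in the video, addressing the difficulty existing MLLMs have in simultaneously modeling spatial relationships within frames and understanding the causal dynamics of temporal evolution.
Details
Motivation: On complex, reasoning-intensive VideoQA tasks, existing MLLMs struggle to model spatial relationships within video frames while also understanding the causal dynamics of temporal evolution, so their spatiotemporal reasoning needs to be strengthened.
Result: The STAR framework enhances GPT-4o with lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench.
Insight: The novelty is a comprehensive video toolkit and a strategic spatiotemporal reasoning framework (STAR) that schedules tools to progressively localize key video regions, systematically strengthening MLLM spatiotemporal reasoning while avoiding toolchain shortcut issues. Viewed objectively, this is a modular, extensible enhancement approach that may help build more autonomous and intelligent video analysis assistants.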
Abstract: The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships within video frames and understand the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and ensure harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
[32] Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models cs.CV | cs.AI
Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim
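TL;DR: This paper proposes Visual Funnel, a training-free two-step approach that addresses the “Contextual Blindness” multimodal large language models (MLLMs) exhibit when perceiving fine-grained visual details. It uses Contextual Anchoring to identify the region of interest, then builds an Entropy-Scaled Portfolio whose crop sizes are determined dynamically from attention entropy, preserving hierarchical context from focal detail to broader surroundings and improving performance on tasks that require precise visual understanding.
Details
Motivation: MLLMs show strong reasoning capabilities but fall short in perceiving fine-grained visual details, which limits their use in precision-demanding tasks. Existing methods, such as cropping salient image regions, introduce a critical limitation, “Contextual Blindness”: a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. The authors argue that this limitation stems from a lack of ‘Structural Diversity’ in the model's input rather than a lack of information ‘Quantity’.
Result: Extensive experiments show that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. The results further confirm that simply adding more unstructured crops provides limited or even detrimental benefits, and that the hierarchical structure of the proposed portfolio is key to resolving Contextual Blindness.
Insight: The novelty lies in explicitly identifying the Contextual Blindness problem and attributing it to a lack of structural diversity in the input rather than insufficient information. Visual Funnel builds a hierarchical visual input through a training-free two-step process (Contextual Anchoring and construction of an Entropy-Scaled Portfolio), connecting local detail with global context. Viewed objectively, it offers a general framework for improving fine-grained visual perception in MLLMs without additional training, and its attention-entropy-based dynamic cropping strategy is worth borrowing.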
Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: “Contextual Blindness”. This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information ‘Quantity’, but from a lack of ‘Structural Diversity’ in the model’s input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
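Choosing crop sizes from attention entropy can be sketched as follows. The abstract does not give the mapping, so everything here is an assumption for illustration: the normalization by log(n), the rule that more diffuse attention yields a larger focal crop, the linear spacing of the nested crops, and the names `attention_entropy` and `entropy_scaled_crops`. Crop-center refinement, which the abstract also mentions, is omitted.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative attention weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def entropy_scaled_crops(weights, image_size, min_frac=0.2, levels=3):
    """Return `levels` nested square crop side lengths, focal crop first.

    Assumed rule: normalized entropy 0 (sharply peaked attention) gives a
    focal crop of `min_frac` of the image; normalized entropy 1 (uniform
    attention) gives the full image. Requires levels >= 2.
    """
    n = len(weights)
    h_norm = attention_entropy(weights) / math.log(n) if n > 1 else 0.0
    focal = min_frac + (1.0 - min_frac) * h_norm
    # Interpolate linearly from the focal crop up to the whole image.
    fracs = [focal + (1.0 - focal) * k / (levels - 1) for k in range(levels)]
    return [round(f * image_size) for f in fracs]


if __name__ == "__main__":
    print(entropy_scaled_crops([1, 0, 0, 0], image_size=1000))  # peaked attention
    print(entropy_scaled_crops([1, 1, 1, 1], image_size=1000))  # uniform attention
```

The returned list keeps a hierarchy from focal detail to the full frame, which mirrors the structured portfolio the abstract contrasts with unstructured multi-crop inputs.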
[33] Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos cs.CV
Mingyu Jeon, Jisoo Yang, Sungjin Han, Jinkwon Hwang, Sunjae Yoon
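TL;DR: This paper proposes Point-to-Span (P2S), a training-free framework for zero-shot moment retrieval in very long videos. With an Adaptive Span Generator and Query Decomposition, it addresses the candidate explosion in the search phase and the high computational cost of the refine phase in existing methods, and it surpasses supervised SOTA methods on benchmarks such as MAD.
Details
Motivation: Because processing an entire long video in a single pass is computationally infeasible, zero-shot long video moment retrieval adopts a ‘Search-then-Refine’ paradigm, which faces two challenges: candidate explosion in the search phase causes inefficiency, and the refine phase relies on high-cost vision-language models for verification, incurring large computational overhead.
Result: On the MAD benchmark, P2S improves R5@0.1 by 3.7% over supervised SOTA methods and is presented as the first zero-shot framework capable of temporal grounding in hour-long videos.
Insight: The contributions are: 1) an Adaptive Span Generator that dynamically generates candidate segments to avoid candidate explosion in the search phase; 2) Query Decomposition, which splits complex queries into sub-queries for refinement without high-cost VLM verification, substantially reducing computational overhead and improving generalization.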
Abstract: Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a ‘Search-then-Refine’ approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a ‘search’ phase candidate explosion, and (2) the ‘refine’ phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient ‘search’ and costly ‘refine’ phases. P2S overcomes these challenges with two key innovations: an ‘Adaptive Span Generator’ to prevent the search phase candidate explosion, and ‘Query Decomposition’ to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).
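The reported metric, R5@0.1, is commonly read in moment retrieval as Recall@5 at a temporal IoU threshold of 0.1; that reading is assumed here rather than stated in the abstract. A minimal sketch of the per-query computation follows, with the function names `temporal_iou` and `recall_at_k` chosen for illustration.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_preds, gt, k=5, iou_thresh=0.1):
    """True if any of the top-k ranked spans reaches the IoU threshold."""
    return any(temporal_iou(p, gt) >= iou_thresh for p in ranked_preds[:k])


if __name__ == "__main__":
    ground_truth = (5.0, 15.0)
    ranked = [(100.0, 110.0), (0.0, 10.0)]  # best-scored prediction first
    print(temporal_iou(ranked[1], ground_truth))
    print(recall_at_k(ranked, ground_truth, k=1))
    print(recall_at_k(ranked, ground_truth, k=2))
```

A benchmark score is then the fraction of queries for which `recall_at_k` is true; the low 0.1 threshold reflects how hard localization is when the target moment is a few seconds inside an hour-long video.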
[34] Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views cs.CVPDF
Zhankuo Xu, Chaoran Feng, Yingtao Li, Jianbin Zhao, Jiashu Yang
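TL;DR: This paper proposes CoherentGS, a new framework for high-fidelity 3D reconstruction from sparse, motion-blurred images. A dual-prior strategy that combines a specialized deblurring network with a diffusion model breaks the vicious cycle in which sparse views and motion blur aggravate each other, enabling high-quality 3D Gaussian Splatting reconstruction from very few input views.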
Details
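Motivation: The performance of 3D Gaussian Splatting (3DGS) depends heavily on dense, high-quality input images, whereas real-world data is usually sparse and motion-blurred. Sparse views and motion blur form a vicious cycle that causes reconstruction to fail, producing fragmented views and a low-frequency bias. The paper aims to break this cycle.
Result: Quantitative and qualitative experiments on synthetic and real-world scenes use as few as 3, 6, and 9 input views. The results show that CoherentGS significantly outperforms existing methods and sets a new state of the art (SOTA) for this challenging task.
Insight: The core innovation is a dual-prior strategy that combines a deblurring network (photometric guidance) with a diffusion model (geometric priors) to handle the compound degradation of sparsity and blur. In addition, a consistency-guided camera exploration module and a depth regularization loss keep the generative process adaptive and geometrically plausible.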
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.
[35] RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds cs.CVPDF
Jingyun Fu, Zhiyu Xiang, Na Zhao
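TL;DR: This paper proposes RaLiFlow, the first scene flow estimation framework that fuses 4D radar and LiDAR point clouds. It builds a radar-LiDAR scene flow dataset from a real-world autonomous driving dataset and designs a dynamic-aware bidirectional cross-modal fusion module with dedicated loss functions, effectively combining radar's dynamic velocity information with LiDAR's geometric information and significantly improving scene flow estimation.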
Details
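Motivation: Existing scene flow estimation methods mostly fuse images with LiDAR, while the fusion of 4D radar and LiDAR remains unexplored. Radar is cheaper, more robust to weather, and provides point-wise velocity, making it a valuable complement to LiDAR, but its data is noisy, low-resolution, and sparse, and no suitable fusion dataset exists.
Result: Extensive experiments on a scene flow dataset repurposed from a public real-world automotive dataset show that the method significantly outperforms existing LiDAR-based and radar-based single-modal methods.
Insight: The innovations are: building the first radar-LiDAR scene flow dataset together with an effective preprocessing strategy for radar denoising and scene flow label generation; a dynamic-aware bidirectional cross-modal fusion module that injects radar's dynamic cues into a local cross-attention mechanism; and dedicated loss functions that mitigate the effect of unreliable radar data and strengthen instance-level consistency.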
Abstract: Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.
[36] Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching cs.CVPDF
Alberto Rota, Elena De Momi
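TL;DR: This paper proposes a self-supervised contrastive embedding adaptation method for endoscopic image matching. It generates ground-truth correspondences through novel-view synthesis and uses contrastive learning to adapt a DINOv2 backbone, improving pixel-level matching accuracy on endoscopic image pairs.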
Details
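Motivation: In endoscopic surgery, weak perspective cues, non-Lambertian tissue reflections, and complex deformable anatomy degrade conventional computer vision techniques, and features learned by deep models on natural scenes are not suited to fine-grained matching in surgical images.
Result: On the SCARED dataset, the method surpasses existing state-of-the-art approaches, with higher matching precision and lower epipolar error.
Insight: The innovations are the use of novel-view synthesis to generate ground-truth correspondences for self-supervised training, and the adaptation of the DINOv2 backbone through contrastive triplet mining and an additional Transformer layer, so that the resulting embeddings support direct matching by cosine similarity thresholding.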
Abstract: Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED dataset, with improved matching precision and lower epipolar error compared to related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
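Matching by cosine similarity thresholding, as mentioned in the abstract above, can be illustrated in a few lines. This is only a sketch under stated assumptions: descriptors are plain Python lists, each descriptor in the first image is paired with its most similar descriptor in the second, and the threshold value of 0.9 is a placeholder rather than a value from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def match_by_threshold(desc_a, desc_b, threshold=0.9):
    """Return (index_a, index_b, similarity) for confident nearest neighbours."""
    matches = []
    for i, u in enumerate(desc_a):
        best_j, best_s = -1, -1.0
        for j, v in enumerate(desc_b):
            s = cosine(u, v)
            if s > best_s:
                best_j, best_s = j, s
        # Keep the pair only when the best similarity clears the threshold.
        if best_j >= 0 and best_s >= threshold:
            matches.append((i, best_j, best_s))
    return matches
```

A practical pipeline would use batched tensor operations instead of nested loops; the loops here only make the decision rule explicit.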
[37] Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies cs.CV | cs.AIPDF
Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou
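TL;DR: To address the weakness of large vision-language models in fine-grained recognition, this paper introduces the Fine-grained Recognition Open World (FROW) benchmark and designs an optimization strategy covering data construction and the training process to improve model performance on fine-grained recognition tasks.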
Details
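Motivation: Existing benchmarks focus mainly on reasoning tasks and neglect fine-grained recognition, a capability that is crucial for practical applications, so a dedicated benchmark and optimization method are needed to close this gap.
Result: Experiments show that the proposed mosaic data improves category recognition accuracy by 1%, and open-world data improves FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Adding fine-grained data in the pre-training phase improves category recognition accuracy by up to 10%.
Insight: The innovations are the FROW benchmark, which focuses on fine-grained recognition, and an optimization strategy that works on both data construction (mosaic data, open-world data) and the training process, effectively improving the fine-grained visual understanding of LVLMs.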
Abstract: Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model’s category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
[38] MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos cs.CVPDF
Qiyue Sun, Tailin Chen, Yinghui Zhang, Yuchen Zhang, Jiangbei Yue
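TL;DR: This paper proposes MultiHateLoc, the first framework for weakly supervised temporal localisation of multimodal hate content. Using modality-aware temporal encoders, dynamic cross-modal fusion with a contrastive alignment strategy, and a modality-aware multiple-instance learning objective, it produces fine-grained frame-level predictions from video-level labels alone and achieves state-of-the-art performance on the HateMM and MultiHateClip datasets.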
Details
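Motivation: Multimodal hate content in online videos appears subtly and asynchronously across the visual, acoustic, and textual streams, and needs fine-grained temporal localisation. Existing research focuses mostly on video-level classification, and under weak supervision with only video-level labels, static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics.
Result: Experiments on the HateMM and MultiHateClip datasets show that the method achieves state-of-the-art performance on the temporal localisation task.
Insight: The innovations are: 1) modality-aware temporal encoders that handle heterogeneous sequential patterns; 2) dynamic cross-modal fusion and a contrastive alignment strategy that adaptively emphasise the most informative modality and strengthen feature consistency; 3) a modality-aware multiple-instance learning objective that identifies discriminative segments under video-level supervision. Viewed objectively, the framework's design for producing fine-grained, interpretable predictions under weak supervision is worth borrowing.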
Abstract: The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
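The abstract above relies on a multiple-instance learning (MIL) objective to learn from video-level labels. The sketch below shows a generic top-k MIL formulation, not the paper's modality-aware variant: the video-level score is assumed to be the mean of the k highest segment scores, and it is trained against the video label with binary cross-entropy. The choice of k and the pooling rule are assumptions for illustration.

```python
import math


def topk_mil_score(segment_scores, k=3):
    """Video-level score: mean of the k highest per-segment probabilities."""
    if not segment_scores:
        return 0.0
    k = max(1, min(k, len(segment_scores)))
    top = sorted(segment_scores, reverse=True)[:k]
    return sum(top) / k


def binary_cross_entropy(p, y, eps=1e-7):
    """BCE between a predicted probability p and a 0/1 video-level label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mil_loss(segment_scores, video_label, k=3):
    """Loss for one video: only the video-level label supervises the segments."""
    return binary_cross_entropy(topk_mil_score(segment_scores, k), video_label)
```

At inference time the per-segment scores themselves serve as the localisation output, which is why a video-level loss can still yield frame-level predictions.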
[39] Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation cs.CV | cs.AI | cs.CLPDF
Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen
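TL;DR: This paper presents the first systematic study of reinforcement learning for text-to-3D autoregressive generation. By evaluating reward designs and RL algorithms, introducing a new benchmark (MME-3DR), and proposing a hierarchical optimization method (Hi-GRPO), it develops AR3D-R1, the first RL-enhanced text-to-3D model, targeting the high spatial complexity of 3D generation, which demands globally consistent geometry and fine-grained local textures.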
Details
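Motivation: 3D objects have higher spatial complexity and require globally consistent geometry and fine-grained local textures, which makes 3D generation highly sensitive to reward designs and RL algorithms. Existing work has not adequately explored RL for 3D generation, and the paper aims to fill this gap.
Result: The paper develops the AR3D-R1 model, evaluates implicit reasoning ability on the newly introduced MME-3DR benchmark, and, through reward design and Hi-GRPO, achieves hierarchical generation from coarse shape to texture refinement. The abstract does not report specific quantitative results or comparisons with SOTA.
Insight: The innovations are: a systematic study of RL for text-to-3D generation along several dimensions; the MME-3DR benchmark for measuring the implicit reasoning ability of 3D generation models; the hierarchical RL paradigm Hi-GRPO, which optimizes global-to-local generation through dedicated reward ensembles; and an emphasis on the importance of human preference alignment and reward signals from multi-modal models, offering new insight into RL-driven reasoning for 3D generation.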
Abstract: Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
[40] Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction cs.CV | cs.AIPDF
Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang
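TL;DR: This paper addresses vectorized road network extraction in off-road environments with the WildRoad dataset and the MaGRoad method. WildRoad is a global off-road road network dataset built efficiently with a dedicated interactive annotation tool. MaGRoad is a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly, resolving the topological errors that existing node-centric methods make at occlusions and ambiguous junctions.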
Details
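Motivation: Deep learning has advanced vectorized road extraction in urban settings, but off-road environments remain underexplored and challenging. The main problems are the lack of large-scale vectorized datasets and the fact that prevailing methods (node-centric paradigms such as SAM-Road) reason at sparse endpoints, which makes them fragile to occlusions and ambiguous junctions and leads to topological errors.
Result: On the proposed WildRoad benchmark, MaGRoad achieves state-of-the-art performance while generalizing well to urban datasets. Its streamlined pipeline also makes inference roughly 2.5x faster, improving practical applicability.
Insight: The innovations are: 1) the release of WildRoad, a global dataset dedicated to off-road road network annotation; 2) the path-centric reasoning paradigm MaGRoad, which strengthens robustness on complex terrain by aggregating path-level visual evidence. Viewed objectively, the shift from nodes to paths addresses the problem of topological integrity, and the combination of an efficient annotation tool with fast inference gives mapping roads in the wild a firmer foundation.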
Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.
[41] Neural Collapse in Test-Time Adaptation cs.CVPDF
Xiao Chen, Zhongjing Du, Jiazhen Huang, Xu Jiang, Li Lu
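TL;DR: This paper proposes NCTTA, a new test-time adaptation (TTA) method based on the sample-wise neural collapse phenomenon (NC3+). It uses feature-classifier alignment with hybrid targets to mitigate unreliable pseudo-labels under domain shift, improving robustness to out-of-distribution data.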
Details
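Motivation: Existing test-time adaptation methods lack theoretical insight into the root cause of performance degradation under domain shift. By extending neural collapse theory to the sample level, the paper shows that the degradation stems from misalignment between sample-wise features and classifier weights, and designs a method to address it.
Result: Extensive experiments on benchmarks such as ImageNet-C show that NCTTA significantly improves robustness to domain shift. For example, it outperforms Tent by 14.52% on ImageNet-C.
Insight: The innovation is extending neural collapse to the sample level for the first time, discovering Sample-wise Alignment Collapse (NC3+), and building on it a hybrid-target alignment method that blends geometric proximity with predictive confidence so that pseudo-labels can be used more reliably for feature-classifier realignment.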
Abstract: Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample’s feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation which exacerbates under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.
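The alignment that NC3+ describes in the abstract above, between a sample's feature embedding and its classifier weight, can be measured with a cosine similarity. The sketch below only illustrates that measurement for a bias-free linear classifier; it does not implement NCTTA's hybrid targets, whose exact form the abstract does not give. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def predicted_alignment(feature, class_weights):
    """Return (predicted class, cosine between feature and that class weight).

    class_weights: one weight vector per class of a linear classifier
    (bias omitted for simplicity, which is an assumption of this sketch).
    """
    logits = [sum(f * w for f, w in zip(feature, row)) for row in class_weights]
    pred = max(range(len(logits)), key=logits.__getitem__)
    return pred, cosine(feature, class_weights[pred])
```

Under this reading, a low cosine for the predicted class would flag a sample whose pseudo-label deserves less trust, which is the intuition the abstract attributes to larger distribution shifts.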
[42] An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time cs.CV | cs.AIPDF
Stylianos Kandylakis, Christos Orfanopoulos, Georgios Siolas, Panayiotis Tsanakas
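TL;DR: This paper proposes a real-time algorithmic framework on mobile devices for identifying, classifying, and evaluating human physiotherapy exercises. The method treats a dynamic movement as a sequence of static poses: a pose-estimation neural network extracts body keypoints from camera input, the keypoints are converted to trigonometric angle-based features, and lightweight supervised models classify them to produce frame-level pose predictions and accuracy scores. To recognize full movements and detect deviations from prescribed patterns, it uses a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and error localization. The system runs entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the method's effectiveness and its applicability to remote physiotherapy supervision and mobile health applications.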
Details
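Motivation: Mobile health (m-health) applications need to identify and assess physiotherapy exercises accurately and in real time in order to support remote supervision and guidance.
Result: Experimental evaluation demonstrates the method's effectiveness and highlights its applicability to remote physiotherapy supervision and mobile health applications.
Insight: The innovations are: 1) decomposing a dynamic movement into a sequence of static poses; 2) frame-level classification with trigonometric angle-based features and lightweight models; 3) a dynamic-programming algorithm based on a modified Levenshtein distance for sequence matching and error localization; 4) a fully client-side design that ensures real-time performance and scalability. The method offers an efficient, deployable framework for real-time movement analysis.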
Abstract: This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
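Two steps from the abstract above lend themselves to a short sketch: turning keypoints into an angle feature, and comparing a sequence of predicted poses against a prescribed pattern. The code below uses the standard Levenshtein distance, since the abstract does not describe the authors' modification, and the pose labels are made up for illustration.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by the 2D points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def collapse_repeats(labels):
    """Merge consecutive identical frame-level labels into one pose each."""
    out = []
    for label in labels:
        if not out or out[-1] != label:
            out.append(label)
    return out


def levenshtein(observed, template):
    """Standard edit distance between two pose-label sequences."""
    prev = list(range(len(template) + 1))
    for i, o in enumerate(observed, 1):
        cur = [i]
        for j, t in enumerate(template, 1):
            cur.append(min(prev[j] + 1,              # extra pose in observed
                           cur[j - 1] + 1,           # pose missing from observed
                           prev[j - 1] + (o != t)))  # match or wrong pose
        prev = cur
    return prev[-1]
```

A distance of zero after collapsing repeats would mean the exercise followed the prescribed pose order; localizing the error would additionally require backtracking through the table, which is omitted here.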
[43] Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment cs.CVPDF
Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Xinlong Pan
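TL;DR: This paper proposes a novel unified-transform framework for learned video compression. With dual-domain progressive temporal alignment and a quality-conditioned mixture-of-experts module, it resolves the dilemma between inaccurate temporal alignment and error propagation that existing methods face in motion estimation and compensation.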
Details
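Motivation: For motion estimation and compensation, existing learned video compression frameworks face a trade-off between the separate-transform framework (evident error propagation) and the unified-transform framework (inaccurate temporal alignment).
Result: Experiments show that the proposed method achieves rate-distortion performance competitive with the state of the art while eliminating error propagation.
Insight: The main innovations are: 1) dual-domain progressive temporal alignment, which combines coarse pixel-domain alignment with refined latent-domain alignment to strengthen temporal context modeling; 2) a quality-conditioned mixture-of-experts module that provides continuous bit-rate adaptation based on target quality and content instead of a single quantization step.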
Abstract: Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-experts (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with the state of the art, while successfully eliminating error propagation.
[44] 3D Blood Pulsation Maps cs.CVPDF
Maurice Rohr, Tobias Reinhardt, Tizian Dege, Justus Thies, Christoph Hoog Antink
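TL;DR: This paper introduces Pulse3DFace, the first dataset for estimating 3D blood pulsation maps. It contains raw multi-view videos, pulse reference measurements, and 3D facial scans, and can be used to develop models of dynamic facial blood pulsation, to generate synthetic videos that improve remote photoplethysmography imaging methods, and to support research on new multi-view approaches for mitigating illumination effects.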
Details
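Motivation: Existing remote pulse estimation methods lack high-quality 3D blood pulsation data. The dataset is meant to provide a basis for synthetic data used in developing and validating photoplethysmography imaging models, and to encourage research on multi-view methods that reduce illumination interference in blood pulsation analysis.
Result: The dataset contains RGB videos of 15 subjects recorded at 30 Hz from 23 viewpoints, pulse reference measurements, and 3D facial scans generated with monocular structure-from-motion. It provides 3D pulsation maps compatible with the texture space of the FLAME head model, including signal-to-noise ratio, local pulse amplitude, and phase information, and the authors evaluate illumination conditions, map consistency, and the ability to capture physiological features of the facial and neck skin.
Insight: The innovation is the first dataset of 3D blood pulsation maps, which combines multi-view capture with 3D modeling and provides spatially resolved dynamic information for physiological signal analysis. Viewed objectively, the dataset could advance research on the robustness of pulse estimation methods based on synthetic data and open new directions for multimodal physiological sensing.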
Abstract: We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset’s illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.
[45] Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding cs.CVPDF
Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu
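TL;DR: This paper proposes Blink, a dynamic visual token resolution framework that emulates the 'blink-like' dynamic scanning of human vision to strengthen the visual perception of multimodal large language models (MLLMs). With two modules, saliency-guided scanning and dynamic token resolution, it adaptively allocates computation to important visual regions within a single forward pass, improving understanding of complex scenes while remaining efficient.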
Details
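Motivation: Existing MLLMs are limited in visual perception, whereas humans perceive complex scenes efficiently through a 'blink-like' process of dynamically scanning and focusing on salient regions. The paper investigates whether MLLMs exhibit similar behavior and, on that basis, develops a human-inspired dynamic mechanism to strengthen their visual perception.
Result: Extensive experiments validate Blink and show that it significantly enhances visual perception and multimodal understanding. The abstract does not say on which benchmarks, or at what level (for example SOTA), these results are obtained.
Insight: The innovation is bringing the dynamic scanning mechanism of human vision into MLLMs: attention-map-based saliency estimation and a plug-and-play token super-resolution module adjust visual token resolution dynamically within a single forward pass, balancing broad exploration and fine-grained focus. It is an efficient and adaptive way to enhance visual perception.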
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential “blink-like” process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
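The saliency-guided scanning step in the abstract above estimates token saliency from an attention map and then picks important tokens. The sketch below shows one simple reading of that step, assuming saliency is the mean attention each visual token receives and that a fixed fraction of tokens is selected. Both choices are assumptions for illustration; the TokenSR expansion itself is not modelled.

```python
def token_saliency(attn):
    """Mean attention received by each visual token.

    attn[q][t] is the attention weight from query position q to visual token t.
    """
    if not attn:
        return []
    n_q = len(attn)
    n_t = len(attn[0])
    return [sum(row[t] for row in attn) / n_q for t in range(n_t)]


def select_salient(saliency, ratio=0.25):
    """Indices of the top `ratio` fraction of tokens, in original order."""
    if not saliency:
        return []
    k = max(1, int(len(saliency) * ratio))
    order = sorted(range(len(saliency)), key=lambda t: saliency[t], reverse=True)
    return sorted(order[:k])
```

Re-running the selection at each layer and dropping tokens that fall out of the selected set would mirror the "lose focus" behaviour the abstract describes.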
[46] Grounding Everything in Tokens for Multimodal Large Language Models cs.CVPDF
Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang
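TL;DR: This paper proposes GETok, a spatial representation method that improves how precisely multimodal large language models (MLLMs) localize objects in 2D image space. By introducing learnable grid tokens and offset tokens, it embeds spatial relationships directly into tokens and significantly strengthens native 2D spatial reasoning without changing the autoregressive Transformer architecture.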
Details
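Motivation: The autoregressive Transformer architecture used by current MLLMs requires tokenization of input images, which limits how precisely the model can localize objects in 2D image space. The paper therefore asks how sequential language tokens can be improved to ground objects better in 2D space.
Result: Extensive experiments show that GETok outperforms state-of-the-art methods on a variety of referring tasks under both supervised fine-tuning and reinforcement learning settings.
Insight: The innovation is a specialized vocabulary of learnable tokens (grid tokens and offset tokens) that encodes spatial relationships directly into the token sequence, allowing precise and iterative refinement of object locations while keeping the original MLLM architecture simple.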
Abstract: Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
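The idea of a coarse grid anchor followed by an offset refinement, as described in the abstract above, can be illustrated with simple coordinate arithmetic. This is a hypothetical sketch: the grid size, the use of cell centres as anchors, and continuous (unquantized) offsets are assumptions made for illustration, not details of GETok's learnable tokens.

```python
def encode_point(x, y, grid=8):
    """Map a normalized point (x, y in [0, 1]) to (cell index, offset).

    The cell index plays the role of a grid anchor; the offset is the
    displacement from that cell's centre to the exact point.
    """
    col = min(int(x * grid), grid - 1)
    row = min(int(y * grid), grid - 1)
    cx, cy = (col + 0.5) / grid, (row + 0.5) / grid
    return row * grid + col, (x - cx, y - cy)


def decode_point(cell, offset, grid=8):
    """Recover the normalized point from a cell index and its offset."""
    row, col = divmod(cell, grid)
    return (col + 0.5) / grid + offset[0], (row + 0.5) / grid + offset[1]
```

In a token-based model the offset would itself be drawn from a discrete vocabulary, so the round trip would be approximate rather than exact as it is here.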
[47] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner cs.CVPDF
Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi
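TL;DR: This paper proposes AVI-Edit, a framework for audio-synchronized video instance editing. A granularity-aware mask refiner iteratively refines coarse user-provided masks, and a self-feedback audio agent produces high-quality audio guidance, giving fine-grained spatial and temporal control. The authors also build a large-scale dataset to support the task.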
Details
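Motivation: Most existing video editing methods overlook audio-visual synchronization and lack the fine-grained spatial and temporal control needed for instance-level edits. The paper aims to address this.
Result: Extensive experiments show that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization.
Insight: The main innovations are: 1) a granularity-aware mask refiner for precise instance-level region segmentation; 2) a self-feedback audio agent for fine-grained temporal control; 3) a large-scale dataset with instance-centric correspondence and comprehensive annotations.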
Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.
[48] Salient Object Detection in Complex Weather Conditions via Noise Indicators cs.CVPDF
Quan Chen, Xiaokai Yang, Tingyu Wang, Rongfeng Lu, Xichun Sheng
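TL;DR: This paper proposes a salient object detection framework for complex weather conditions, consisting of a specific encoder and a replaceable decoder. To handle different kinds of weather noise, the authors introduce a one-hot vector as a noise indicator that represents the weather type and design a Noise Indicator Fusion Module. The module takes semantic features and the noise indicator as dual inputs and embeds weather-aware priors through adaptive feature modulation.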
Details
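Motivation: Most existing salient object detection methods assume low-noise visual conditions and ignore how weather-induced noise in real scenes degrades segmentation accuracy. The paper aims to address the drop in salient object detection performance under complex weather.
Result: Extensive experiments on the WXSOD dataset cover different training data scales (100%, 50%, and 30% of the full training set), three encoders, and seven decoder configurations. The results show that the proposed framework, in particular the NIFM-enhanced specific encoder, improves segmentation accuracy under complex weather compared with a vanilla encoder.
Insight: The innovation is a learnable noise indicator that serves as prior information about the weather condition, adaptively modulated into the encoder's features by a dedicated fusion module (NIFM). This strengthens robustness to weather noise while keeping compatibility with mainstream decoders.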
Abstract: Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set), three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.
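The abstract above conditions encoder features on a one-hot noise indicator through adaptive feature modulation. The sketch below shows one common way such conditioning is done, a scale-and-shift modulation whose parameters are linear functions of the indicator. The weather label set, the modulation form, and all names are assumptions for illustration and are not taken from the paper's NIFM.

```python
# Hypothetical weather label set; the dataset's actual categories may differ.
WEATHER_TYPES = ["clear", "rain", "snow", "fog"]


def one_hot(weather):
    """One-hot noise indicator for a weather type in WEATHER_TYPES."""
    return [1.0 if w == weather else 0.0 for w in WEATHER_TYPES]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def modulate(features, indicator, w_gamma, w_beta):
    """Scale-and-shift each feature channel using the noise indicator.

    w_gamma, w_beta: (channels x weather types) matrices, learned in practice.
    """
    gamma = matvec(w_gamma, indicator)
    beta = matvec(w_beta, indicator)
    return [f * (1.0 + g) + b for f, g, b in zip(features, gamma, beta)]
```

With a one-hot indicator, the matrix product simply selects one column of each matrix, so every weather type gets its own per-channel scale and shift.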
[49] Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval cs.CV | cs.AIPDF
J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira
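TL;DR: This paper proposes TRSLLaVA, a training-free, text-only framework for remote sensing image retrieval. It reformulates cross-modal retrieval as a text-to-text matching problem and retrieves within a unified text embedding space using rich VLM-generated descriptions. The authors also build the Remote Sensing Rich Text (RSRT) dataset as an evaluation benchmark. Experiments show that the method is competitive on the RSITMD and RSICD benchmarks and even surpasses some supervised models.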
Details
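Motivation: Semantic retrieval of remote sensing images suffers from the 'semantic gap', the discrepancy between low-level visual features and high-level human concepts. Existing methods based on large vision-language models usually depend on costly, domain-specific training, and there is no benchmark for evaluating the usefulness of VLM-generated text in zero-shot retrieval.
Result: On the RSITMD benchmark, the method achieves a mean Recall of 42.62%, nearly double the 23.86% of the standard zero-shot CLIP baseline, and surpasses several top supervised models. It is also highly competitive on the RSICD benchmark.
Insight: The innovation is avoiding model training or fine-tuning entirely: structured text descriptions serve as both queries and database, and matching is done purely on text in a unified embedding space, offering a powerful and cost-effective paradigm for remote sensing image retrieval. The new RSRT dataset provides a benchmark for evaluating the usefulness of VLM-generated text.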
Abstract: Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the “semantic gap”, the discrepancy between a model’s low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
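The text-to-text matching described in the abstract above ranks stored captions by their similarity to a text query. The sketch below shows the retrieval loop only, with a bag-of-words vector standing in for the text encoder. That stand-in is an assumption made to keep the example self-contained; a real system would use a learned text embedding model, and the captions are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy text embedding: lower-cased bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0


def retrieve(query, captions, top_k=1):
    """Indices of the top_k captions most similar to the query text.

    Each caption index stands for the image that the caption describes.
    """
    q = embed(query)
    scored = [(cosine(q, embed(c)), i) for i, c in enumerate(captions)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, stable ties
    return [i for _, i in scored[:top_k]]
```

Because both the query and the database entries are text, no image encoder is involved at query time, which is the property the abstract highlights.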
[50] Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos cs.CV
Bishoy Galoaa, Sarah Ostadabbas
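TL;DR: TCAM is a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Using a motion-field attention mechanism, it autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory.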
Details
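Motivation: Under challenging conditions such as occlusion, camouflage, or rapid movement, video understanding depends more on motion dynamics than on static appearance. The work aims at query-free, automatic motion discovery and description.
Result: On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval and a J&F score of 64.9 for spatial grounding, and discovers 4.8 relevant expressions per video on average with 84.7% precision, demonstrating strong cross-task generalization.
Insight: The core idea is that motion patterns, once aligned with contrastive vision-language representations, provide strong semantic signals for recognizing and describing actions. Unified training combines global video-text alignment with fine-grained spatial correspondence, and multi-head cross-attention enables query-free discovery of multiple motion expressions.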
Abstract: We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 J&F for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.
[51] Robust Multi-Disease Retinal Classification via Xception-Based Transfer Learning and W-Net Vessel Segmentation cs.CV
Mohammad Sadegh Gholizadeh, Amir Arsalan Rezapour
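TL;DR: This paper proposes a deep learning framework that combines Xception-based transfer learning with W-Net vessel segmentation for automated diagnosis of multiple diseases from retinal images, using interpretable image processing modules to make the model more trustworthy and more viable for clinical deployment.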
Details
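Motivation: The rising incidence of vision-threatening eye diseases calls for scalable and accurate screening. To address the “black-box” limitations of standard convolutional neural networks, the work combines deep feature extraction with interpretable image processing to bridge the gap between algorithmic output and expert medical validation.
Result: The abstract reports no specific quantitative results or benchmarks. It emphasizes reducing false positives and improving clinical deployment viability by grounding predictions in clinically relevant morphological features such as retinal vessel segmentation.
Insight: High-fidelity retinal vessel segmentation serves as an auxiliary task that guides classification, grounding the model's predictions in clinically interpretable morphological features and thereby improving interpretability and clinical usefulness.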
Abstract: In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the “black-box” limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model’s predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.
[52] Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces cs.CV
Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas
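TL;DR: Lang2Motion is a framework for language-guided point trajectory generation that aligns motion manifolds with joint embedding spaces. It uses point tracking to extract the motion of arbitrary objects from real-world videos and generates explicit trajectories. A transformer-based auto-encoder learns trajectory representations under dual supervision from textual motion descriptions and rendered trajectory visualizations, both mapped through frozen CLIP encoders.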
Details
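Motivation: Prior work focuses on human motion or video synthesis and does not generate explicit, language-controllable motion trajectories for arbitrary objects.
Result: Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, 12.5 points above video-based methods, and improves motion accuracy by 33-52% over video generation baselines (12.4 ADE vs. 18.3-25.3). Trained only on diverse object motions, it reaches 88.3% Top-1 accuracy on human action recognition, showing effective transfer across motion domains.
Insight: Language is linked to explicit point-trajectory motion of arbitrary objects through a CLIP-aligned joint embedding space, with trajectory representations learned under dual (textual and rendered visual) supervision. This supports style transfer, semantic interpolation, and latent-space editing, and shows strong generalization from object motion to human actions.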
Abstract: We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP’s frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
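The ADE figures quoted above are easier to read with the metric written out. The sketch below assumes the usual definition of average displacement error (mean Euclidean distance between corresponding trajectory points) and a plain relative-reduction formula; the abstract does not spell out either, and the toy trajectories are invented.

```python
import math

def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matched points of two trajectories."""
    assert len(pred) == len(gt)
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def relative_improvement(ours, baseline):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline

# Toy trajectories: the prediction is offset from the ground truth by one unit.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(average_displacement_error(pred, gt))

# Relative reduction computed from the rounded ADE values in the abstract.
print(relative_improvement(12.4, 18.3))
print(relative_improvement(12.4, 25.3))
```

Computed from the rounded values, the two reductions come out at roughly 0.32 and 0.51, close to the reported 33-52% range; the small gap is presumably due to rounding of the published ADE numbers.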
[53] DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM cs.CV
Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen
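TL;DR: This paper presents DOCR-Inspector, a system that uses a vision language model (VLM) for fine-grained, automated evaluation of document parsing results. It formalizes evaluation as error detection and analysis, assessing a document image and its parsed output against a taxonomy of 28 predefined error types. The system is trained on the newly built DOCRcase-200K dataset and uses the Chain-of-Checklist reasoning paradigm. On the real-world benchmark DOCRcaseBench, the 7B version outperforms commercial models such as Gemini 2.5 Pro and leading open-source models, and its assessments effectively guide refinement of parsing results.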
Details
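Motivation: Document parsing is currently evaluated on standard benchmarks, which may carry dataset-specific biases that lead to inconsistent model rankings and limited correlation with real-world performance. Existing metrics also typically give only overall scores, obscuring specific error patterns in the output. A method is needed that assesses document parsing quality reliably and comprehensively in real-world settings.
Result: On DOCRcaseBench, a new benchmark of 882 real-world cases built by the authors, DOCR-Inspector-7B achieves the best performance, outperforming commercial models such as Gemini 2.5 Pro as well as leading open-source models.
Insight: The main contributions are: 1) formalizing document parsing evaluation as fine-grained error detection and analysis with 28 defined error types, which makes evaluation detailed and interpretable; 2) the Chain-of-Checklist reasoning paradigm, which supports the hierarchical structure of parsing quality assessment; 3) the large-scale training dataset DOCRcase-200K and the real-world evaluation benchmark DOCRcaseBench, which are valuable resources for the field. Viewed objectively, the VLM-as-a-Judge approach to automated, fine-grained evaluation, together with the closed loop in which assessments directly guide system improvement, has good practical value and potential for wider adoption.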
Abstract: Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.
[54] K-Track: Kalman-Enhanced Tracking for Accelerating Deep Point Trackers on Edge Devices cs.CV
Bishoy Galoaa, Pau Closas, Sarah Ostadabbas
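TL;DR: K-Track is a general-purpose, tracker-agnostic acceleration framework that addresses the difficulty of deploying deep learning-based point trackers on resource-constrained edge devices. It combines sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using Bayesian uncertainty propagation to maintain temporal coherence, and achieves a 5-10x speedup while retaining over 85% of the original trackers' accuracy.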
Details
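Motivation: Deep learning-based point trackers achieve state-of-the-art accuracy on challenging benchmarks, but the cost of per-frame GPU inference hinders deployment on edge devices where compute, power, and connectivity are limited.
Result: Evaluated across multiple state-of-the-art point trackers, K-Track reaches real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan, retaining over 85% of the original trackers' accuracy at a substantial speedup.
Insight: The core contribution is a hybrid strategy that combines sparse deep learning keyframe updates with lightweight Kalman filter prediction and uses Bayesian uncertainty propagation to manage prediction uncertainty. This balances accuracy against efficiency and offers a practical path to deploying high-quality point tracking on edge devices.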
Abstract: Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers’ accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.
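The hybrid strategy described in this abstract, expensive tracker calls on sparse keyframes with Kalman prediction in between, can be sketched for a single 1D point coordinate. This is only an illustration under stated assumptions: a constant-velocity motion model, hand-picked noise parameters `q` and `r`, a fixed `keyframe_interval`, and a hypothetical `deep_tracker` callable standing in for a real point tracker. It is not the K-Track implementation.

```python
class ConstantVelocityKalman:
    """1D constant-velocity Kalman filter (state: position, velocity)."""

    def __init__(self, pos, q=1e-2, r=1.0):
        self.x = [pos, 0.0]                  # state estimate
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # state covariance
        self.q = q                           # process noise (assumed value)
        self.r = r                           # measurement noise (assumed value)

    def predict(self, dt=1.0):
        p, v = self.x
        self.x = [p + dt * v, v]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.x[0]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                       # innovation variance
        k0, k1 = a / s, c / s                # Kalman gain for H = [1, 0]
        y = z - self.x[0]                    # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]

def track(frames, deep_tracker, keyframe_interval=5):
    """Call the expensive tracker only on keyframes; predict in between."""
    kf = ConstantVelocityKalman(deep_tracker(frames[0]))
    out, calls = [kf.x[0]], 1
    for i, frame in enumerate(frames[1:], start=1):
        est = kf.predict()
        if i % keyframe_interval == 0:
            kf.update(deep_tracker(frame))
            calls += 1
            est = kf.x[0]
        out.append(est)
    return out, calls

# Hypothetical stand-in tracker: the point moves 2 px per frame.
frames = list(range(21))
positions, tracker_calls = track(frames, deep_tracker=lambda f: 2.0 * f)
print(tracker_calls, "tracker calls for", len(frames), "frames")
```

In this toy run the expensive tracker is invoked on 5 of 21 frames, which is where the saving comes from; the covariance `P` grows between keyframes, so each keyframe measurement is weighted more heavily the longer the filter has been predicting on its own.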
[55] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection cs.CV | cs.CR
Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang
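TL;DR: TriDF is a comprehensive benchmark for interpretable DeepFake detection. It covers 16 forgery types across image, video, and audio modalities and evaluates three key aspects of a model: perception, detection, and hallucination. Experiments show that accurate perception is essential for reliable detection but that hallucination can severely disrupt decision-making, revealing the interdependence of the three aspects.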
Details
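Motivation: Advances in generative modeling make it easy to fabricate realistic portrayals of individuals, posing serious risks to security, communication, and public trust. Systems are needed that not only distinguish manipulated content from authentic media but also provide clear and reliable reasoning.
Result: Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, while hallucination can severely disrupt decision-making. The TriDF benchmark evaluates models on perception, detection, and hallucination.
Insight: The paper proposes a unified framework (the TriDF benchmark) that jointly evaluates detection accuracy, evidence identification, and explanation reliability, laying a foundation for trustworthy systems that address real-world synthetic media threats. Its core insight is that perception, detection, and hallucination are interdependent, so detection performance should not be evaluated in isolation.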
Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.
[56] XDen-1K: A Density Field Dataset of Real-World Objects cs.CV
Jingxuan Zhang, Tianqi Yu, Yatu Zhang, Jinze Wu, Kaixin Yao
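TL;DR: This paper introduces XDen-1K, the first large-scale, multi-modal dataset for estimating physical properties of real-world objects. Its core consists of 1,000 objects across 148 categories, each with a high-resolution 3D geometric model, part-level annotations, and corresponding biplanar X-ray scans. The paper also proposes an optimization framework that recovers a high-fidelity volumetric density field from sparse X-ray views, and demonstrates the dataset's practical value in volumetric segmentation and downstream robotics tasks (center-of-mass estimation and manipulation success rate).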
Details
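Motivation: Current models excel at capturing an object's surface geometry and appearance but largely neglect internal physical properties such as volumetric density, which are critical for applications like robotic manipulation and physical simulation. The main bottleneck is the lack of large-scale real-world data.
Result: Experiments show that using the dataset effectively improves the accuracy of center-of-mass estimation and the success rate of robotic manipulation, providing a new foundational resource and benchmark for physically grounded visual inference and embodied AI.
Insight: The contributions are XDen-1K, the first large-scale volumetric density dataset of real-world objects, and an optimization framework that recovers density fields from sparse X-ray views. Together they move physical property estimation from simulation to the real world and supply key physically grounded data for embodied AI.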
Abstract: A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object’s surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object’s center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.
[57] Optimal transport unlocks end-to-end learning for single-molecule localization cs.CV | cs.LG
Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal
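TL;DR: This paper proposes an end-to-end training method based on optimal transport for localizing high-density fluorophore emissions in single-molecule localization microscopy (SMLM). Recasting the training objective as a set-matching problem removes the dependence on non-maximum suppression (NMS) at inference, and the authors design an iterative neural network architecture that incorporates knowledge of the microscope's optical system.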
Details
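Motivation: Existing SMLM methods require non-overlapping fluorophore emissions, which leads to long acquisition times and hinders live-cell imaging. Existing deep learning methods can handle denser emissions but rely on non-differentiable NMS layers that may discard true signals.
Result: Experiments on synthetic benchmarks and real biological data show that the proposed loss function and architecture both surpass the state of the art (SOTA) at moderate and high emitter densities.
Insight: The SMLM training objective is reformulated as a set-matching problem, and an optimal-transport loss enables end-to-end training. Integrating knowledge of the optical system into an iterative neural network further improves localization accuracy and robustness in dense-emission scenarios.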
Abstract: Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores – fluorescent molecules stained onto the observed specimen – over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope’s optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
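To make the set-matching view concrete, the sketch below computes a generic entropic optimal-transport cost between a set of predicted emitter positions and a set of ground-truth positions using Sinkhorn iterations. The abstract does not give the exact form of the paper's loss, so the squared-distance cost, uniform weights, and the `eps` and `iters` values are all assumptions, and the point sets are invented.

```python
import math

def sinkhorn_cost(pred, target, eps=0.1, iters=200):
    """Entropic OT cost between two uniformly weighted 2D point sets."""
    n, m = len(pred), len(target)
    # Pairwise squared Euclidean distances.
    cost = [[(px - tx) ** 2 + (py - ty) ** 2 for tx, ty in target]
            for px, py in pred]
    # Gibbs kernel; coordinates are assumed to be O(1) to avoid underflow.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling so the plan matches both marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))

pred = [(0.0, 0.0), (1.0, 0.0)]
same_set_reordered = [(1.0, 0.0), (0.0, 0.0)]   # same emitters, different order
shifted = [(1.0, 0.5), (0.0, 0.5)]              # emitters displaced by 0.5

print(sinkhorn_cost(pred, same_set_reordered))
print(sinkhorn_cost(pred, shifted))
```

The cost depends only on the two sets, not on the order of their elements, and every step is differentiable in the predicted coordinates, which is the property that lets a loss of this family replace a non-differentiable NMS step during training.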
[58] CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images cs.CV
Matias Cosarinsky, Nicolas Gaggion, Rodrigo Echeveste, Enzo Ferrante
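TL;DR: This paper proposes an uncertainty quantification method for landmark-based anatomical segmentation of chest X-ray images. Using a hybrid neural network architecture that combines a convolutional encoder with a graph-based generative decoder, it derives two complementary measures from the variational latent space, latent uncertainty and predictive uncertainty, and releases CheXmask-U, a large-scale dataset with uncertainty estimates for about 657,000 images.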
Details
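Motivation: Medical image segmentation systems need uncertainty estimates for safe clinical deployment. Existing work mostly addresses pixel-level uncertainty, whereas landmark-based segmentation, despite its topological guarantees, remains underexplored from an uncertainty perspective.
Result: Controlled corruption experiments show that both uncertainty measures increase with perturbation severity. On the CheXmask dataset, the measures identify unreliable predictions and support out-of-distribution detection, providing a basis for assessing the robustness of landmark-based anatomical segmentation.
Insight: The work combines a variational latent space with a graph-based generative decoder to quantify, for the first time in a systematic way, spatial uncertainty in landmark-based segmentation. The released large-scale dataset with per-node uncertainty estimates is a key resource for follow-up research and supports safe clinical deployment of anatomical segmentation methods.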
Abstract: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (huggingface.co/datasets/mcosarinsky/CheXmask-U), a large scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at huggingface.co/spaces/matiasky/CheXmask-U and the source code at github.com/mcosarinsky/CheXmask-U.
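The two uncertainty measures described above can be illustrated with a toy variational model. In the sketch below, `decode` is a hypothetical stand-in for a graph-based generative decoder, the latent is two-dimensional, and latent uncertainty is summarized as the mean of the latent standard deviations; all of these are assumptions for illustration rather than the authors' definitions.

```python
import random
import statistics

def decode(z):
    """Hypothetical decoder: maps a 2D latent to three landmark (x, y) nodes."""
    return [(10 + z[0], 20 + z[1]), (30 + 2 * z[0], 40.0), (50.0, 60 + 3 * z[1])]

def latent_uncertainty(sigma):
    """Summary of the learned distribution parameters (assumed: mean std)."""
    return sum(sigma) / len(sigma)

def predictive_uncertainty(mu, sigma, n_samples=200, seed=0):
    """Per-node std of decoded landmarks over latent samples z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    outputs = [decode([rng.gauss(m, s) for m, s in zip(mu, sigma)])
               for _ in range(n_samples)]
    per_node = []
    for node in range(len(outputs[0])):
        sx = statistics.pstdev([o[node][0] for o in outputs])
        sy = statistics.pstdev([o[node][1] for o in outputs])
        per_node.append((sx, sy))
    return per_node

# A confident encoding versus a less confident one (e.g. a corrupted input).
low = predictive_uncertainty(mu=[0.0, 0.0], sigma=[0.1, 0.1])
high = predictive_uncertainty(mu=[0.0, 0.0], sigma=[1.0, 1.0])
print(latent_uncertainty([0.1, 0.1]), low)
print(latent_uncertainty([1.0, 1.0]), high)
```

The per-node output is what makes the estimate spatial: nodes whose position barely depends on the latent get low uncertainty, while nodes that move a lot across samples are flagged, which matches the idea of accounting for spatial variation in segmentation quality.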
[59] SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving cs.CV
Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang
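TL;DR: SpaceDrive is an end-to-end autonomous driving framework built on a vision language model (VLM). It injects 3D spatial information into the model as explicit positional encodings (PEs), addressing the weakness of existing VLMs in understanding fine-grained 3D spatial relationships and thereby improving planning accuracy.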
Details
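Motivation: Current VLM-based end-to-end driving methods have strong general visual understanding and reasoning abilities but struggle with fine-grained 3D spatial relationships, which are a basic requirement for systems that interact with the physical world.
Result: SpaceDrive achieves state-of-the-art (SOTA) open-loop performance on the nuScenes dataset and the second-best Driving Score (78.02) among existing VLM-based methods on the Bench2Drive closed-loop benchmark.
Insight: 3D spatial information (from multi-view depth estimation, historical ego-states, and text prompts) is treated as a task-agnostic coordinate representation and injected into the VLM directly as positional encodings, replacing digit-by-digit text tokens. This enables joint reasoning over semantics and space and lets the model regress trajectory coordinates directly instead of generating them digit by digit, improving spatial reasoning and planning accuracy.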
Abstract: End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.
[60] Video Depth Propagation cs.CV
Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis, Wim Abbeloos, Luc Van Gool
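TL;DR: This paper proposes VeloDepth, an efficient and robust online video depth estimation method. By exploiting spatiotemporal priors from previous depth predictions and propagating deep features, it addresses the shortcomings of existing methods in temporal consistency and real-time performance.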
Details
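Motivation: Existing video depth estimation methods either rely on frame-by-frame monocular models, which leads to temporal inconsistency and inaccuracy, or use computationally demanding temporal modeling that is unsuitable for real-time applications. Both limit general applicability and performance in practice.
Result: Zero-shot evaluation on multiple benchmarks shows that VeloDepth reaches SOTA temporal consistency with competitive accuracy, and that its inference is significantly faster than existing video-based depth estimators.
Insight: A novel Propagation Module refines and propagates depth features and predictions by combining flow-based warping with learned residual corrections, and the design structurally enforces temporal consistency, giving stable and efficient depth predictions across consecutive frames.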
Abstract: Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
[61] IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation cs.CV
Yuan-Ming Li, Qize Yang, Nan Lei, Shenghao Fu, Ling-An Zeng
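TL;DR: This paper proposes IRG-MotionLLM, a model built on a new paradigm for text-to-motion generation. Through iterative text-motion dialogue, it tightly couples motion generation, assessment, and refinement, enabling bidirectional knowledge flow between understanding and generation and improving generation performance.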
Details
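Motivation: Existing motion-aware large language models usually handle motion understanding and generation separately, which limits the mutual benefit that interactive feedback between the tasks could bring. The paper uses assessment and refinement tasks as a bridge to enable bidirectional knowledge flow between understanding and generation.
Result: Experiments show that assessment and refinement tasks significantly improve text-motion alignment; that interleaving the three steps yields consistent gains across training stages; and that IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness.
Insight: The core contribution is the IRMoGen (Interleaved Reasoning for Motion Generation) paradigm, which for the first time interleaves motion generation, assessment, and refinement seamlessly in a single model. It is realized through a novel three-stage training scheme and an automated data engine that synthesizes interleaved reasoning annotations from existing datasets. The key insight is that an iterative, interactive generate-assess-refine loop improves generation quality.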
Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.
[62] LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation cs.CV
Tianyu Zhou, Junyi Tang, Zehui Li, Dahong Qian, Suncheng Xiang
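TL;DR: This paper proposes LDP, a framework that uses multimodal large language models to generate professional polyp diagnosis reports. It builds the MMEndo dataset, applies LoRA for parameter-efficient fine-tuning, and uses DPO to align with clinical standards, substantially reducing training cost while improving report quality.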
Details
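Motivation: In colonoscopic polyp diagnosis, traditional automated reporting suffers from inconsistencies and hallucinations caused by the scarcity of high-quality multimodal medical data.
Result: LDP outperforms existing baselines on both automated metrics and clinical expert evaluation (Physician Score of 7.2/10), reduces training computational cost by 833x compared with full fine-tuning, and is further validated for robustness on the IU-XRay dataset.
Insight: LoRA and DPO are combined to achieve parameter-efficient fine-tuning together with clinical alignment, and the specialized multimodal medical dataset MMEndo is constructed, offering a scalable and clinically viable solution for primary healthcare.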
Abstract: Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.
[63] What matters for Representation Alignment: Global Information or Spatial Structure? cs.CV | cs.AI | cs.GR | cs.LG | stat.ML
Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang
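TL;DR: Through a large-scale empirical analysis, this paper finds that in generative model training the effectiveness of representation alignment (REPA) depends mainly on the spatial structure of the target representation (the pairwise cosine similarity between patch tokens) rather than on its global semantic information (e.g., ImageNet-1K accuracy). The authors further propose iREPA, which strengthens the transfer of spatial information with a simple convolution layer and a spatial normalization layer, improving REPA's convergence speed.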
Details
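Motivation: The paper asks which aspect of the target representation matters more for generation in representation alignment (REPA), global semantic information or spatial structure, with the goal of improving how generative models are trained.
Result: Experiments across 27 different vision encoders and multiple model scales show that spatial structure is the key driver of generation performance. The proposed iREPA (a change of fewer than 4 lines of code) consistently improves convergence speed across encoders, model sizes, and training variants such as REPA, REPA-E, Meanflow, and JiT.
Insight: The paper reveals the central role of spatial structure in representation alignment and transfers spatial information effectively through simple architectural changes (a convolutional projection layer and spatial normalization). This challenges the prevailing view that global semantic performance determines generation quality and offers a new perspective on training generative models.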
Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in fewer than 4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
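The notion of spatial structure used in this abstract, pairwise cosine similarity between patch tokens, can be computed directly. The sketch below also includes a `spatial_normalize` function, but its form (standardizing each channel across the patch axis) is purely an assumption for illustration; the abstract does not specify how the paper's spatial normalization layer is defined, and the toy tokens are invented.

```python
import math

def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between patch tokens (list of vectors)."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    return [[sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
             for tj, nj in zip(tokens, norms)]
            for ti, ni in zip(tokens, norms)]

def spatial_normalize(tokens, eps=1e-6):
    """Assumed form: standardize each channel across the patch (spatial) axis."""
    n, dim = len(tokens), len(tokens[0])
    out = [[0.0] * dim for _ in range(n)]
    for c in range(dim):
        col = [t[c] for t in tokens]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        for i in range(n):
            out[i][c] = (col[i] - mean) / (std + eps)
    return out

# Four toy patch tokens that share a large global component (10, 10)
# plus a small patch-specific offset.
tokens = [[11.0, 10.0], [9.0, 10.0], [10.0, 11.0], [10.0, 9.0]]

before = cosine_similarity_matrix(tokens)
after = cosine_similarity_matrix(spatial_normalize(tokens))
print(before[0][1], after[0][1])
```

In this toy case the shared component makes every pair of patches look almost identical before normalization, while removing the per-channel mean across patches exposes the contrast between them; this is one plausible reading of why a normalization over the spatial axis could accentuate spatial information.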
[64] Self-Ensemble Post Learning for Noisy Domain Generalization cs.CV
Wang Lu, Jindong Wang
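TL;DR: This paper proposes Self-Ensemble Post Learning (SEPL) to address the performance drop that domain generalization (DG) suffers under label noise. SEPL has two parts, feature probing training and prediction ensemble inference: it trains multiple probing classifiers on the model's intermediate feature representations and ensembles their outputs into the final prediction, making existing methods more robust.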
Details
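Motivation: When domain generalization meets noisy labels, the noise amplifies spurious features in deep layers and degrades existing algorithms. The paper explores how to make existing methods effective again under noise.
Result: Extensive experiments show that the proposed method not only makes existing methods more robust but also offers high flexibility and significant potential for real-world applications.
Insight: The method exploits the discriminative power of the model's internal latent features: it trains multiple probing classifiers that attend to different parts of the image to diversify the usable features, trains them with semi-supervised algorithms to cope with noisy labels, and integrates their predictions through crowdsourcing-style inference. The result is a flexible post-hoc framework for improving generalization under noise and distribution shift.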
Abstract: While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers, i.e. spurious feature enlargement, leading to a degradation in the performance of existing algorithms. This paper, starting from domain generalization, explores how to make existing methods work again when they encounter noise. We find that the latent features inside the model have certain discriminative capabilities, and different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify the features that can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained through the integration of outputs from these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach. Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.
[65] PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning cs.CV
Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka
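TL;DR: This paper proposes PoseGAM, a geometry-aware multi-view framework for 6D pose estimation of unseen objects. It predicts object pose directly from a query image and multiple template images without explicit feature matching. By integrating explicit point-based geometry with features learned by geometry representation networks, and training on a large-scale synthetic dataset, it achieves state-of-the-art performance on multiple benchmarks.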
Details
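Motivation: 6D pose estimation for unseen objects remains challenging. Existing methods usually rely on explicit feature correspondences between the query image and the object model or template images; this work aims to remove the need for such explicit matching.
Result: Extensive evaluation on multiple benchmarks shows state-of-the-art performance, with an average AR (average recall) improvement of 5.1% over prior methods and gains of up to 17.6% on individual datasets, indicating strong generalization to unseen objects.
Insight: The paper proposes a geometry-aware multi-view reasoning framework that needs no explicit matching. It integrates geometric information through two complementary mechanisms, explicit point-based geometry and features learned by geometry representation networks, and uses a large-scale synthetic dataset to improve robustness and generalization.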
Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .
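[66] SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation cs.CV
Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang
TL;DR: SWiT-4D is a sliding-window Transformer for lossless, parameter-free temporal 4D mesh generation. It converts monocular videos into high-quality animated 3D assets with explicit 4D meshes by integrating seamlessly with any Diffusion Transformer-based image-to-3D generator and adding spatial-temporal modeling, enabling 4D reconstruction from videos of arbitrary length.
Details
Motivation: Existing methods struggle to convert monocular videos into high-quality animated 3D assets with explicit 4D meshes, and the lack of large-scale 4D mesh datasets limits the training of purely data-driven models. This paper aims to make better use of existing strong image-to-3D generative priors while minimizing reliance on 4D supervision.
Result: Comprehensive experiments on in-domain test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Fine-tuning on a single video shorter than 10 seconds is enough to achieve high-fidelity geometry and stable temporal consistency.
Insight: The novelty is a lossless, parameter-free sliding-window Transformer architecture that integrates seamlessly with existing DiT-based image-to-3D generators, preserving their single-image forward process while adding spatial-temporal modeling across video frames. In addition, an optimization-based trajectory module recovers global translation for static-camera monocular videos. Together these show strong data efficiency and practical deployability under extremely limited 4D supervision.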
Abstract: Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
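[67] MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence cs.CV | cs.AI
Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao
TL;DR: This paper introduces MMSI-Video-Bench, a benchmark for holistically evaluating the video-based spatial intelligence of multimodal large language models (MLLMs). Built on a four-level framework (Perception, Planning, Prediction, and Cross-Video Reasoning), it contains 1,106 human-annotated questions grounded in 1,278 clips from 25 datasets and in-house videos. The authors evaluate 25 open-source and proprietary models, reveal a substantial human-AI gap, and carry out a fine-grained error analysis.
Details
Motivation: There is currently no comprehensive benchmark for assessing the spatial understanding of MLLMs over continuous visual input (video), even though this ability is crucial for MLLMs to become general-purpose assistants in physical environments.
Result: Evaluating 25 strong MLLMs on MMSI-Video-Bench reveals a striking human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. Spatially fine-tuned models generalize poorly to this benchmark.
Insight: The paper proposes a video spatial intelligence benchmark with diverse data sources and holistic task coverage, built on a four-level framework (Perception, Planning, Prediction, Cross-Video Reasoning). The benchmark also supports three domain-oriented sub-benchmarks for targeted capability assessment. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence, and shows that typical frame-sampling strategies, 3D spatial cues, and chain-of-thought prompting bring limited benefit on such reasoning-intensive tasks.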
Abstract: Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human–AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
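[68] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models cs.CV
Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li
TL;DR: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI) and builds the MiSI-Bench benchmark framework to evaluate how well vision-language models (VLMs) perceive and reason about microscopic spatial relationships in molecules. The benchmark contains over 163,000 question-answer pairs and 587,000 images, covering nine complementary tasks. Experiments show that current state-of-the-art VLMs perform far below human level on this benchmark. A fine-tuned 7B model shows potential on spatial transformation tasks, even surpassing humans, but performs poorly on scientific tasks such as hydrogen bond recognition, underscoring the importance of integrating explicit domain knowledge.
Details
Motivation: The goal is to assess the potential of VLMs to perceive and reason about the spatial relationships of invisible microscopic entities such as molecules (i.e., microscopic spatial intelligence), which is fundamental to scientific discovery, whereas existing benchmarks focus mainly on the macroscopic world.
Result: On the proposed MiSI-Bench, current state-of-the-art VLMs perform significantly below human level. However, a fine-tuned 7B model shows substantial potential on spatial transformation tasks, even surpassing humans, while performing poorly on scientifically grounded tasks such as hydrogen bond recognition.
Insight: The novelty lies in systematically defining Microscopic Spatial Intelligence (MiSI) for the first time and building a large-scale, multi-task molecular visual question answering benchmark (MiSI-Bench). Viewed objectively, its core contribution is extending VLM evaluation from macroscopic visual scenes to the microscopic scientific domain; the benchmark results clearly expose the lack of domain knowledge in current VLMs' scientific reasoning and point toward integrating explicit knowledge in research on scientific AGI.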
Abstract: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
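[69] MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos cs.CV
Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang
TL;DR: This paper presents MoCapAnything, a unified, category-agnostic motion capture framework that generates animations driving 3D assets with arbitrary skeletons from monocular videos. Using a reference-guided, factorized framework, it first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics, enabling animation reconstruction for arbitrary skeletons.
Details
Motivation: Most existing motion capture pipelines are specific to a species or template and lack generality. This paper addresses category-agnostic motion capture: given a monocular video and an arbitrary rigged 3D asset, directly produce a rotation-based animation (such as a BVH file) that drives that asset.
Result: Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything produces high-quality skeletal animations and achieves meaningful cross-species motion retargeting across heterogeneous rigs, supporting scalable, prompt-driven 3D motion capture.
Insight: The novelty is formalizing motion capture as a category-agnostic task and proposing a factorized framework with learnable modules (Reference Prompt Encoder, Video Feature Extractor, Unified Motion Decoder) and a lightweight inverse kinematics stage. The method reconstructs a coarse 4D deforming mesh to bridge video and joint space, and trains on a dataset of standardized skeleton-mesh-render triads, enabling general-purpose driving of arbitrary assets.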
Abstract: Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset’s skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
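[70] PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction cs.CV
Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam
TL;DR: This paper introduces PubTables-v2, a large-scale dataset for full-page and multi-page table extraction, addressing the lack of annotated data for table extraction in visual document understanding.
Details
Motivation: Interest has surged in table extraction methods, such as vision-language models, that extract tables directly in their full page or document context, but progress has been hard to demonstrate and advance because annotated data is lacking. A large-scale benchmark dataset is therefore needed.
Result: The paper demonstrates the dataset's usefulness by evaluating domain-specialized vision-language models on it, and uses the dataset to develop the Page-Object Table Transformer (POTATR), which extends the Table Transformer to comprehensive page-level table extraction.
Insight: The novelty is providing the first large-scale benchmark for multi-page table structure recognition and proposing POTATR, an image-to-graph extension model, offering new data and methods for full-page and multi-page table extraction.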
Abstract: Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.
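[71] DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance cs.CV
Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao
TL;DR: DuetSVG is a unified multimodal model that jointly generates image tokens and the corresponding SVG tokens end to end. It targets a weakness of existing vision-language model (VLM)-based SVG generation: because these methods lack visual signals, they handle complex semantics poorly and produce geometrically inconsistent output.
Details
Motivation: Existing VLM-based SVG generation methods produce only text during decoding and lack visual signals, so they struggle with complex semantics, and the generated SVGs fall short in visual appeal and geometric coherence.
Result: Extensive experiments show that the method outperforms existing methods across a wide range of applications, producing SVGs that are visually faithful, semantically aligned, and syntactically clean.
Insight: The novelty is a unified multimodal generation framework trained jointly on image and SVG data, together with a novel test-time scaling strategy that uses the model's own visual predictions as guidance to improve SVG decoding quality.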
Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model’s native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
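[72] FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos cs.CV
Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin
TL;DR: This paper presents FoundationMotion, a fully automated data curation pipeline for generating large-scale, fine-grained motion understanding datasets. It detects and tracks objects in videos to extract trajectories, then uses Large Language Models (LLMs) with these trajectories and video frames to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Fine-tuning open-source models (such as NVILA-Video-15B and Qwen2.5-7B) on datasets produced by this pipeline substantially improves motion understanding without hurting performance on other tasks.
Details
Motivation: State-of-the-art models still perform poorly on motion understanding benchmarks, mainly because large-scale, fine-grained motion datasets are scarce. Existing datasets usually depend on costly manual annotation, which severely limits scalability.
Result: Models fine-tuned on datasets generated by the FoundationMotion pipeline outperform strong closed-source baselines (such as Gemini-2.5 Flash) and large open-source models (such as Qwen2.5-VL-72B) across diverse motion understanding datasets and benchmarks.
Insight: The core novelty is a fully automated, scalable data curation pipeline that combines object trajectory detection with LLMs to generate high-quality annotations, avoiding costly manual labeling. This provides an efficient dataset construction paradigm for motion understanding and spatial reasoning that can be applied to fine-tune diverse models.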
Abstract: Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
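[73] BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models cs.CV | cs.AI
Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham
TL;DR: BabyVLM-V2 is a vision-language modeling framework inspired by infant development. It substantially improves on BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and the DevCV Toolbox for cognitive evaluation. The framework aims to mirror infants' audiovisual experience for efficient pretraining and introduces a benchmark suite of multimodal tasks aligned with early childhood capabilities.
Details
Motivation: The paper takes early children's developmental trajectories as a natural goal for sample-efficient pretraining of vision foundation models, asking how to pretrain vision-language models in a way that better matches human cognitive development.
Result: Experiments show that a compact model pretrained from scratch achieves competitive performance on the ten multimodal task benchmarks of the DevCV Toolbox, even outperforming GPT-4o on some tasks.
Insight: The novelty is a unified pretraining and evaluation framework grounded in developmental psychology. Its core is turning infants' longitudinal audiovisual experience into pretraining data and creating a multimodal evaluation benchmark (the DevCV Toolbox) aligned with early childhood cognitive capabilities, opening a new path toward developmentally plausible vision foundation models.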
Abstract: Early children’s developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children’s capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
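[74] Any4D: Unified Feed-Forward Metric 4D Reconstruction cs.CV | cs.AI | cs.LG | cs.RO
Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma
TL;DR: Any4D is a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. It directly generates per-pixel motion and geometry predictions for N frames and can flexibly process multiple modalities such as RGB, RGB-D, IMU, and Radar data.
Details
Motivation: Existing methods are typically limited to either 2-view dense scene flow or sparse 3D point tracking, and methods for 4D reconstruction from monocular RGB videos cannot make effective use of multimodal sensor data.
Result: The method achieves superior performance across diverse setups, with 2-3x lower error and 15x faster computation.
Insight: The key innovation is a modular 4D scene representation: per-view 4D predictions are encoded as egocentric factors expressed in local camera coordinates (such as depth maps) and allocentric factors expressed in global world coordinates (such as camera extrinsics and scene flow), yielding a flexible, unified framework.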
Abstract: We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
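To make the split between egocentric and allocentric factors concrete, the sketch below composes one pixel's depth and intrinsics (local camera coordinates) with extrinsics and scene flow (world coordinates). It assumes a pinhole camera, a camera-to-world extrinsic convention, and scene flow given as a world-space displacement; none of these conventions, nor the function names, are stated in the abstract.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Egocentric step: lift pixel (u, v) with metric depth to a 3D point
    in the local camera frame, assuming a pinhole camera model."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def to_world(p_cam, rotation, translation):
    """Allocentric step: apply camera-to-world extrinsics (3x3 rotation as
    nested lists, translation as a 3-tuple). The convention is an assumption."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def moved_point(u, v, depth, intrinsics, rotation, translation, flow):
    """World-space position of the pixel's 3D point after adding scene flow."""
    p_world = to_world(backproject(u, v, depth, *intrinsics), rotation, translation)
    return tuple(p + f for p, f in zip(p_world, flow))


# Hypothetical usage with an identity rotation and made-up numbers.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(moved_point(150, 50, 2.0, (100.0, 100.0, 50.0, 50.0),
                  identity, (0.0, 0.0, 1.0), (0.5, 0.0, 0.0)))
```

Keeping the camera-frame and world-frame quantities separate is what lets extra sensors slot in: a depth sensor constrains only the egocentric factors, while IMU egomotion constrains only the extrinsics.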
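[75] GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting cs.CV
Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh
TL;DR: This paper proposes GaussianHeadTalk, an audio-driven 3D Gaussian Splatting method for generating wobble-free, high-fidelity talking head videos in real time. It builds person-specific avatars by mapping Gaussian Splatting using 3D Morphable Models, and uses a transformer-based model to predict parameters directly from audio for temporal stability.
Details
Motivation: Existing speech-driven talking head methods fall short in real-time capability and temporal stability. In particular, Gaussian Splatting methods suffer from wobbling outputs and video artifacts caused by inaccurate facial tracking or inconsistent Gaussian mappings.
Result: From monocular video and independent audio inputs, the method generates talking head videos in real time, with competitive performance reported in both quantitative and qualitative evaluations.
Insight: The novelty is combining 3D Morphable Models with Gaussian Splatting to improve the accuracy of the facial representation, and introducing transformer-based audio-to-parameter prediction to strengthen temporal consistency, improving visual stability while remaining real-time.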
Abstract: Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.
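[76] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis cs.CV | cs.AI
Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei
TL;DR: OmniView is a unified diffusion framework for a range of 3D and 4D view synthesis tasks, including static and dynamic novel view synthesis and text- or image-to-video generation. It represents space, time, and view conditions separately so they can be combined flexibly, and it matches or exceeds task-specific models on multiple benchmarks.
Details
Motivation: Existing methods are usually trained for specific 4D consistency tasks (such as novel view synthesis or text-to-video with camera control), which fragments the models and prevents them from making full use of the available data. OmniView aims to be a general framework that handles a wide range of 4D tasks in a unified way, avoiding this data fragmentation.
Result: OmniView performs strongly on multiple benchmarks: image quality scores improve by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, and 20% for static camera control on RE-10K, and camera trajectory errors drop by 4x in text-conditioned video generation. It is competitive with task-specific models.
Insight: The novelty is representing space, time, and view conditions separately, which allows flexible combinations of input conditions and thus unified handling of many 4D tasks. Viewed objectively, this modular design improves generalization and demonstrates the feasibility of a generalist 4D video model.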
Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% in multiview NVS LLFF dataset, 60% in dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/
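[77] Mull-Tokens: Modality-Agnostic Latent Thinking cs.CV | cs.AI
Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko
TL;DR: This paper proposes Mull-Tokens, modality-agnostic latent tokens that can flexibly hold intermediate information in either image or text form during multimodal reasoning, letting the model think free-form toward the correct answer.
Details
Motivation: Existing multimodal models that explore reasoning with images are brittle and hard to scale. They rely on calling specialist tools, costly image generation, or handcrafted reasoning data to switch between text and image thoughts, so a simpler alternative is needed.
Result: On four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, Mull-Tokens improve over several baselines that use text-only reasoning or interleaved image-text reasoning, with a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split.
Insight: The novelty is introducing modality-agnostic latent tokens, trained with an approach inspired by latent reasoning frameworks, that let the model think abstractly and freely across modalities during reasoning, offering a simple solution to the challenges of grounding textual and visual reasoning.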
Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative – Mull-Tokens – modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
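[78] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language cs.CV
Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu
TL;DR: This paper introduces VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Unlike classical vision-language models that generate tokens autoregressively, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison (same vision encoder and training data), VL-JEPA achieves stronger performance with 50% fewer trainable parameters. At inference time, a lightweight text decoder converts predicted embeddings into text only when needed. The model natively supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining similar performance compared with non-adaptive uniform decoding. In addition, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative visual question answering without any architecture modification.
Details
Motivation: The goal is to improve on the autoregressive generation paradigm of classical vision-language models by predicting in an abstract embedding space, capturing semantics more efficiently, reducing dependence on surface linguistic form, and lowering parameter count and compute cost.
Result: On eight video classification and eight video retrieval datasets, VL-JEPA's average performance surpasses that of CLIP, SigLIP2, and Perception Encoder. On four visual question answering datasets (GQA, TallyQA, POPE, and POPEv2), it performs comparably to classical vision-language models such as InstructBLIP and QwenVL despite having only 1.6B parameters.
Insight: The core novelty is bringing the Joint Embedding Predictive Architecture (JEPA) to vision-language modeling, replacing autoregressive token generation with continuous embedding prediction, which yields a more compact model and more efficient inference (for example, selective decoding). Its embedding space is versatile, supporting generation, retrieval, and classification, and shows the potential of unified representation learning.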
Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.
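The idea of invoking the text decoder "only when needed" can be sketched as follows. The abstract does not say how VL-JEPA decides when to decode, so the rule used here (decode a step only when its predicted embedding drifts away from the last decoded one, measured by cosine similarity) is purely an assumption, as are the names `select_decode_steps` and `threshold`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_decode_steps(embeddings, threshold=0.9):
    """Return indices of the steps whose embedding should be sent to the decoder.

    Assumed rule: always decode the first step, then decode again only when
    similarity to the most recently decoded embedding falls below `threshold`.
    """
    selected = []
    last_decoded = None
    for index, embedding in enumerate(embeddings):
        if last_decoded is None or cosine(embedding, last_decoded) < threshold:
            selected.append(index)
            last_decoded = embedding
    return selected


# Hypothetical stream of four predicted embeddings; only two need decoding.
stream = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (0.0, 1.0)]
print(select_decode_steps(stream))
```

Uniform decoding would instead call the decoder at a fixed interval regardless of content; the saving from a selective rule depends on how slowly the predicted embeddings change.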
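[79] AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation cs.CV | cs.AI
Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin
TL;DR: AlcheMinT is a unified framework for multi-reference consistent video generation. By introducing explicit timestamp conditioning, it gives fine-grained temporal control over when multiple subjects appear and disappear in a video. It uses a novel positional encoding mechanism that associates temporal intervals with subject identities and integrates seamlessly with the positional embeddings of the pretrained video generation model, and it adds subject-descriptive text tokens to strengthen the binding between visual identity and video captions.
Details
Motivation: Existing subject-driven video generation methods based on large diffusion models lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation.
Result: Experiments show that AlcheMinT matches state-of-the-art video personalization methods in visual quality while, for the first time, enabling precise temporal control over multi-subject generation within videos. It also performs well on a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence.
Insight: The novelties are: explicit timestamp conditioning for fine-grained temporal control; a positional encoding mechanism that ties temporal interval encodings to subject identities, requiring no additional cross-attention modules and incurring negligible parameter overhead; and subject-descriptive text tokens that reduce ambiguity during generation and improve identity binding.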
Abstract: Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT
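[80] MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation cs.CV
Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang
TL;DR: This paper presents MeViS, a large-scale multi-modal dataset for referring motion expression video segmentation, which segments and tracks objects based on language descriptions of their motion, aiming at pixel-level video understanding through motion expressions and motion reasoning clues. The dataset covers 8,171 objects in 2,006 videos of complex scenarios, annotated with 33,072 motion expressions in both text and audio. The paper evaluates 15 existing methods on four tasks and reveals their limitations in motion expression-guided video understanding.
Details
Motivation: Existing referring video segmentation datasets often focus on salient objects and rely on language expressions rich in static attributes, which may allow the target object to be identified from a single frame and thus underemphasize the role of motion in both video and language. This paper explores the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding.
Result: Fifteen existing methods are benchmarked on MeViS: 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results expose weaknesses and limitations of existing methods in motion expression-guided video understanding. The proposed LMPM++ approach achieves new state-of-the-art results on RVOS/AVOS/RMOT.
Insight: The novelty is MeViS itself, a large-scale multi-modal dataset focused on motion expressions that emphasizes the central role of motion in video and language descriptions, together with a unified evaluation platform and the new LMPM++ method, advancing motion expression-guided video understanding in complex scenes. Viewed objectively, the work fills a gap in existing datasets regarding motion reasoning and opens a new research direction for multi-modal video understanding.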
Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects’ motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method’s source code are publicly available at https://henghuiding.com/MeViS/
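[81] Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving cs.CV
Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li
TL;DR: This paper presents Flex, an efficient multi-camera scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. It uses a set of learnable scene tokens to jointly encode image information across cameras and timesteps, without relying on explicit 3D priors (such as BEV, occupancy, or tri-plane representations), thereby aggressively compressing the visual input for the downstream Large Language Model-based policy model.
Details
Motivation: The goal is to resolve the computational efficiency bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving, and to challenge the assumption that explicit 3D priors (such as BEV) are required, exploring a more scalable, data-driven encoding strategy.
Result: Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput than state-of-the-art methods while improving driving performance by a large margin.
Insight: The novelty is a geometry-agnostic joint encoding strategy that learns a compact scene representation directly from data without explicit 3D inductive biases. In addition, the scene tokens develop a capability for scene decomposition without any explicit supervision, suggesting that data-driven encoding may be more scalable and efficient than approaches that depend on 3D priors.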
Abstract: We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.
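The compression idea in the abstract, a small set of scene tokens summarizing many image tokens, can be illustrated with single-head cross-attention. This is a simplified sketch rather than Flex's architecture: it omits the learned query/key/value projections, multi-head structure, and any stacking of layers, and the name `cross_attend` is hypothetical.

```python
import math


def cross_attend(scene_tokens, image_tokens):
    """Let each scene token (query) attend over all image tokens (keys/values).

    Simplifying assumptions: one head, no learned projections, and keys equal
    values. The output has one vector per scene token, so K scene tokens
    summarize N image tokens regardless of how large N is.
    """
    dim = len(image_tokens[0])
    outputs = []
    for query in scene_tokens:
        # Scaled dot-product scores against every image token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_tokens
        ]
        # Numerically stable softmax.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        weights = [weight / total for weight in weights]
        # Weighted average of the image tokens.
        outputs.append([
            sum(weight * value[j] for weight, value in zip(weights, image_tokens))
            for j in range(dim)
        ])
    return outputs


# Hypothetical usage: one all-zero scene token attends uniformly,
# so its output is the mean of the two image tokens.
print(cross_attend([[0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

Because the output length is fixed by the number of scene tokens, the downstream policy model sees the same number of visual tokens no matter how many cameras or timesteps are fed in, which is where the throughput gain would come from.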
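[82] E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training cs.CV
Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang
TL;DR: This paper presents E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled multi-view images. Unlike prior methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This eliminates shortcut solutions and yields geometrically grounded representations. To ensure convergence and scalability, the authors introduce a novel fine-grained learning curriculum that organizes training samples from easy to hard and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments show that E-RayZer significantly outperforms RayZer on pose estimation and surpasses leading visual pre-training models when its representations are transferred to 3D downstream tasks.
Details
Motivation: Self-supervised pre-training has revolutionized language, 2D image, and video modeling, but learning 3D-aware representations from multi-view images remains largely unexplored. Existing methods such as RayZer infer 3D indirectly through latent-space view synthesis, which admits shortcut solutions and lacks true geometric grounding. This paper addresses the problem with a self-supervised method that operates directly in 3D space to learn geometrically reliable 3D representations.
Result: Experiments show that E-RayZer significantly outperforms its predecessor RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. More importantly, when transferred to 3D downstream tasks, its learned representations outperform leading visual pre-training models (such as DINOv3, CroCo v2, VideoMAE V2, and RayZer), establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
Insight: The core novelty is a paradigm of self-supervised reconstruction performed directly in 3D space, using explicit geometry to avoid the shortcut solutions that latent-space methods can fall into, and thereby learning geometrically grounded representations. In addition, the fine-grained unsupervised learning curriculum, which orders samples from easy to hard and harmonizes heterogeneous data, addresses convergence and scalability in large-scale training and offers a new direction for 3D visual pre-training.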
Abstract: Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
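[83] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization cs.CV
Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov
TL;DR: This paper proposes Omni-Attribute, an open-vocabulary image attribute encoder for visual concept personalization. It targets the difficulty of isolating attributes when general-purpose image encoders produce entangled embeddings. By jointly designing data and model, with semantically linked image pairs and a dual-objective training paradigm, it learns high-fidelity, attribute-specific representations.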
Details
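Motivation: Existing visual concept personalization methods rely on holistic embeddings from general-purpose image encoders. These embeddings entangle multiple visual factors, which makes a single attribute hard to isolate and leads to information leakage and incoherent synthesis. This paper aims to remove that limitation.
Result: The method achieves state-of-the-art performance on multiple benchmarks, and its embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation.
Insight: The novelty lies in the joint design of data and model: semantically linked image pairs annotated with positive and negative attributes tell the encoder explicitly what to preserve or suppress, and a dual-objective training paradigm balances generative fidelity against contrastive disentanglement, yielding disentangled, attribute-specific representations.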
Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
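[84] Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision cs.CV
Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra
TL;DR: This paper proposes StereoWalker, an augmented robot navigation foundation model that adds stereo visual input and explicit mid-level vision modules (such as depth estimation and dense pixel tracking) to address the geometric and dynamic understanding required for dynamic urban navigation.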
Details
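Motivation: Existing end-to-end navigation foundation models rely on monocular vision alone and ignore mid-level vision priors. As a result, they need large amounts of pixel-to-action supervision in dynamic, unstructured environments and cannot resolve depth-scale ambiguity.
Result: On a large, automatically annotated stereo navigation dataset, StereoWalker matches the state of the art using only 1.5% of the training data and surpasses it with the full data. Stereo input also improves navigation performance over monocular input.
Insight: The novelty is combining stereo vision with explicit mid-level vision modules inside a navigation foundation model, which exploits geometric and motion structure priors, sharply reduces data requirements, and improves robustness in dynamic scenes.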
Abstract: The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
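[85] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World cs.CV
Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang
TL;DR: WorldLens is a benchmark for full-spectrum evaluation of generative driving world models. It covers five aspects (generation, reconstruction, action-following, downstream tasks, and human preference) to measure the visual realism, geometric consistency, physical plausibility, and functional reliability of generated worlds.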
Details
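Motivation: Current generative world models can synthesize realistic 4D driving environments, but these often fail physically or behaviorally, and the field lacks a unified way to judge whether a generated world is faithful in geometry, physics, and control.
Result: The evaluation finds that no existing world model excels in every dimension: models with strong textures often violate physics, while geometry-stable models lack behavioral fidelity.
Insight: The paper contributes a multi-dimensional benchmark framework, the large-scale human-annotated dataset WorldLens-26K, and the distilled evaluation agent WorldLens-Agent. Together they form a unified ecosystem for standardized measurement of world fidelity, judging generated worlds not only by how real they look but by how real they behave.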
Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects – Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference – jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity – standardizing how future models are judged not only by how real they look, but by how real they behave.
cs.CL [Back]
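[86] What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models cs.CL | cs.AI
Luciano Floridi, Jessica Morley, Claudio Novelli, David Watson
TL;DR: This paper examines the reasoning mechanism of current token-completion large language models (LLMs). It argues that they are stochastic generators rather than genuine abductive reasoners, and that their apparent reasoning stems from patterns in the human-written text they were trained on.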
Details
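Motivation: The paper seeks to clarify what kind of reasoning LLMs actually perform and to correct the misconception that they carry out genuine logical inference, stressing that the resemblance between their stochastic generation and human abductive reasoning is only an appearance.
Result: Analysis of examples shows that LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers, but these outputs are not grounded in truth, semantics, verification, or understanding.
Insight: The contribution is to expose the dual nature of LLMs, stochastic at the base but abductive in appearance, which has consequences for how they are evaluated and applied: they can help humans generate ideas, but their outputs need critical assessment because the models cannot identify truth or verify their explanations.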
Abstract: This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.
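[87] Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models cs.CL | cs.HC
Yumou Wei, John Stamper, Paulo F. Carvalho
TL;DR: This paper proposes a “generate-then-validate” approach to question generation with small language models (SLMs). It uses the SLM's text generation and probabilistic reasoning abilities to first generate many candidate questions and then refine them through selective validation based on probabilistic reasoning, producing high-quality questions.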
Details
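Motivation: The paper explores SLMs as a complement to large language models for automatic question generation, addressing the need for high-quality questions in learning analytics.
Result: In evaluations by human experts and by a large language model (LLM), most judges (human or LLM) agreed that the generated questions had clear answers and generally aligned with the intended learning objectives, indicating that the pipeline can generate high-quality questions.
Insight: The novelty is the “generate-then-validate” strategy, which combines the generative and probabilistic reasoning abilities of SLMs and improves question quality through selective validation. Viewed objectively, it offers a practical way to generate questions efficiently with lightweight models in resource-constrained settings.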
Abstract: We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a “generate-then-validate” strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.
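[88] PARAN: Persona-Augmented Review ANswering system on Food Delivery Review Dataset cs.CL | cs.AI
Moonsoo Park, Jeongseok Yun, Bohyung Kim
TL;DR: This paper proposes PARAN, a two-stage prompting framework for generating personalized review replies on food delivery platforms where user information is limited. The framework first infers explicit and implicit persona traits from short reviews, then incorporates them into the generation prompt to produce tailored replies. Decoding temperature is adjusted to balance diversity and faithfulness, and experiments on a real-world Korean food delivery dataset show improved relevance, personalization, and semantic consistency.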
Details
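Motivation: In domains with limited user information, such as food delivery, LLMs that lack contextual user data produce generic replies, which reduces engagement and effectiveness.
Result: Evaluation on a real-world dataset from a Korean food delivery app, in terms of precision, diversity, and semantic consistency, shows that the method improves the relevance and personalization of automated replies without model fine-tuning.
Insight: The novelty is inferring explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) persona traits directly from short review texts and using them in a two-stage prompting framework to generate personalized replies. Viewed objectively, strengthening LLM personalization through prompt engineering and decoding-temperature adjustment alone, with no fine-tuning, is a practical idea worth borrowing.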
Abstract: Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.
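[89] Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning cs.CL | cs.AI
Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud, Philip Torr, Adel Bibi
TL;DR: This paper studies the safety degradation that occurs when large language models (LLMs) are adapted to new tasks, attributes it to catastrophic forgetting, and frames safety preservation during fine-tuning as a continual learning (CL) problem. In a fine-tuning-as-a-service setting, the authors evaluate several CL approaches (regularization-based, memory-based, and model merging) for mitigating safety degradation, under both benign and poisoned user data. CL approaches, DER in particular, consistently achieve lower attack success rates than standard fine-tuning while maintaining task utility, and the finding holds across multiple downstream tasks and model families.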
Details
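Motivation: As LLMs become widely accessible, their safety alignment matters more and more. This paper addresses the safety degradation that appears when LLMs are fine-tuned for new tasks, which it attributes to catastrophic forgetting.
Result: With both benign and poisoned data, CL approaches consistently achieve lower attack success rates than standard fine-tuning. DER outperforms other CL methods and existing safety-preserving baselines while maintaining task utility. The results are validated on three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical way to preserve safety.
Insight: The core contribution is formalizing safety preservation in LLM fine-tuning as a continual learning problem and systematically evaluating CL methods in that setting. Viewed objectively, this gives a novel and actionable framework for mitigating safety forgetting during adaptation, and shows that existing CL techniques such as DER can balance task performance and safety in the practical fine-tuning-as-a-service scenario.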
Abstract: The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user’s selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
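The abstract names DER (Dark Experience Replay) but does not describe how it is adapted to LLM fine-tuning. As a rough illustration of the memory-based idea, the sketch below follows the standard DER formulation from the continual learning literature: the usual task loss plus a penalty that keeps the model's current logits on stored examples close to the logits recorded before fine-tuning. The function names, the `alpha` weight, and the toy numbers are assumptions for illustration, not values from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def mse(current, stored):
    """Mean squared error between two logit vectors."""
    return sum((c - s) ** 2 for c, s in zip(current, stored)) / len(current)


def der_loss(task_logits, task_target, replay_logits_now, replay_logits_stored,
             alpha=0.5):
    """Task loss on new data plus a logit-matching penalty on replayed data.

    replay_logits_stored: logits saved from the aligned model (the buffer).
    replay_logits_now: logits the model being fine-tuned gives on the same input.
    alpha: hypothetical trade-off weight.
    """
    return (cross_entropy(task_logits, task_target)
            + alpha * mse(replay_logits_now, replay_logits_stored))


stored = [2.0, -1.0, 0.5]            # toy logits recorded before fine-tuning
no_drift = der_loss([0.2, 1.5, -0.3], 1, stored, stored)
drifted = der_loss([0.2, 1.5, -0.3], 1, [-1.0, 2.0, 0.5], stored)
print(no_drift < drifted)  # drift on the replayed example raises the loss
```

In this reading, a buffer of safety-relevant prompts with their original logits would act as the memory, so that fine-tuning on user data is penalized for moving the model away from its aligned behavior.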
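[90] Multilingual VLM Training: Adapting an English-Trained VLM to French cs.CL | cs.AI
Jules Lahmi, Alexis Roger
TL;DR: This paper studies the challenges of adapting an English-trained vision-language model (VLM) to non-English languages such as French. It explores and compares a translation-based pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy that separates vision adaptation from language adaptation, and finds that dataset translation is the main bottleneck for multilingual VLM performance.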
Details
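Motivation: Progress in vision-language models is largely confined to English, which limits accessibility for non-English speakers, so VLM capabilities need to be extended to more languages.
Result: Using standard multimodal benchmarks translated into the target language together with manual assessment by native experts, the study finds that data quality limits the effectiveness of both training and evaluation and that dataset translation is the performance bottleneck.
Insight: The contribution is a systematic comparison of adaptation strategies (translation pipeline, LoRA, two-stage fine-tuning) and an emphasis on native-language dataset collection and better translation strategies for future work. Viewed objectively, the two-stage strategy that separates vision from language adaptation is a structured approach that other multilingual VLM efforts can reuse.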
Abstract: Artificial intelligence has made great progress in recent years, particularly in the development of Vision–Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non–English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we will explore and compare different methods for their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
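[91] Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale cs.CL | cs.AI | cs.LG | cs.SE
Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin
TL;DR: This paper introduces the Confucius Code Agent (CCA), an open-source AI software engineer for industrial scale, and its underlying development platform, the Confucius SDK. The platform is designed around agent experience, user experience, and developer experience, and provides a unified orchestrator, hierarchical working memory, a persistent note-taking system, and a modular extension module to support long-context reasoning, cross-session continual learning, and robust tool use. A meta-agent automates the synthesis, evaluation, and refinement of agent configurations. CCA performs strongly on real-world software engineering tasks and achieves state-of-the-art results on SWE-Bench-Pro.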
Details
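Motivation: Existing open-source coding agents fall short on industrial-scale workloads (reasoning over massive repositories, memory across long sessions, coordination of complex toolchains), while proprietary agents offer limited extensibility, interpretability, and controllability. The paper aims to provide a transparent, extensible, and reproducible foundation for industrial-grade AI agents.
Result: On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 of 54.3%, substantially improving over prior coding agents.
Insight: The novelty is designing the agent development platform systematically from three complementary perspectives (AX, UX, DX); introducing hierarchical working memory, a persistent note-taking system, and modular extensions to meet industrial requirements; and using a meta-agent to automate a build-test-improve loop over configurations. This offers a framework for building extensible, reproducible, production-grade AI agents.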
Abstract: Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
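[92] Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers cs.CL | cs.AI
Youmin Ko, Sungjong Seo, Hyunjoon Kim
TL;DR: This paper proposes CoopRAG, a retrieval-augmented generation framework that addresses the retrieval errors and hallucinations of existing RAG methods in question answering. The retriever and the large language model cooperate by exchanging information, and retrieved documents are reranked by contrasting different layers of the retriever model, which improves both multi-hop and simple QA.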
Details
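Motivation: Existing RAG methods remain prone to incorrect retrievals and hallucinated answers on simple and multi-hop QA, which calls for a framework in which the retriever and the LLM cooperate more closely.
Result: Experiments show that CoopRAG consistently outperforms state-of-the-art QA methods in both retrieval and QA performance on three multi-hop QA datasets and one simple QA dataset.
Insight: The core contribution is a cooperative framework with two-way information exchange between the retriever and the LLM, plus document reranking that contrasts different internal layers of the retriever, improving retrieval accuracy and the reliability of generated answers.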
Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at https://github.com/meaningful96/CoopRAG.
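[93] T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground cs.CL
Dmitrii Stoianov, Danil Taranets, Olga Tsymboi, Ramil Latypov, Almaz Dautov
TL;DR: T-pro 2.0 is an open-weight Russian LLM focused on hybrid reasoning and efficient inference. It supports direct answering and reasoning-trace generation, and reduces latency with a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline. The paper also releases the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights, aiming at an accessible open system for Russian reasoning research and applications.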
Details
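Motivation: Resources for hybrid reasoning and efficient inference in Russian LLMs are scarce. The paper aims to provide a reproducible, extensible research platform and a practical system for applications.
Result: The paper releases the model and related resources (such as the T-Math benchmark), and a public web demo illustrates the speedups of the inference stack across domains. The abstract reports no specific quantitative benchmark results (such as accuracy) and no comparison with the state of the art.
Insight: The contributions are a Cyrillic-dense tokenizer designed for Russian, an adapted EAGLE speculative-decoding pipeline for faster inference, and the accompanying dataset and benchmark, which together give a complete toolchain for reasoning research and efficient deployment of Russian LLMs.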
Abstract: We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.
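The latency gain from speculative decoding comes from letting a cheap draft model propose several tokens that the large target model then checks together. The sketch below shows the generic greedy draft-and-verify loop with toy next-token functions. It is not EAGLE itself (which drafts at the feature level) and not the adapted pipeline from the paper; `draft_next`, `target_next`, and `k` are hypothetical names chosen for illustration.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Run one draft-and-verify step under greedy decoding.

    Returns the tokens accepted in this step (always at least one).
    """
    # The draft model proposes k tokens autoregressively.
    proposal, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # The target model checks each position; a real system scores all
    # positions in a single batched forward pass instead of a loop.
    accepted, context = [], list(prefix)
    for token in proposal:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # all drafts matched: one bonus token
    return accepted


def generate(prefix, draft_next, target_next, num_tokens, k=4):
    """Generate num_tokens tokens; also return how many verify steps were used."""
    output, steps = list(prefix), 0
    while len(output) - len(prefix) < num_tokens:
        output.extend(speculative_step(output, draft_next, target_next, k))
        steps += 1
    return output[:len(prefix) + num_tokens], steps


# Toy models: the target counts modulo 10; the draft agrees except after a 5.
def target_next(context):
    return (context[-1] + 1) % 10


def draft_next(context):
    return 0 if context[-1] == 5 else (context[-1] + 1) % 10


tokens, steps = generate([0], draft_next, target_next, num_tokens=12)
print(tokens, steps)
```

Because every accepted token is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target alone; the saving is in the number of target passes, which is smaller than the number of generated tokens whenever the draft is often right.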
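[94] Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring “Tortured Phrases” in Scientific Literature cs.CL
Agniva Maiti, Prajwal Panth, Suresh Chandra Satapathy
TL;DR: This paper proposes SRAP, a framework that detects and restores “tortured phrases” in scientific literature, produced by adversarial text generation such as automated paraphrasing tools (e.g., “counterfeit consciousness” in place of “artificial intelligence”), in order to counter plagiarism. It uses a two-stage architecture: statistical anomaly detection with a domain-specific masked language model, followed by source-based semantic reconstruction through dense vector retrieval and sentence alignment.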
Details
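Motivation: The integrity and reliability of scientific literature are seriously threatened by adversarial text generation, in particular automated paraphrasing tools used to mask plagiarism. Existing detectors depend on static blocklists or general-domain language models, which have high false-negative rates on novel obfuscations and cannot identify the plagiarized source.
Result: On a parallel corpus of adversarial scientific text, zero-shot baselines fail completely (0.00% restoration accuracy), while the proposed retrieval-augmented approach reaches 23.67% restoration accuracy, significantly outperforming the baselines. The experiments also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text.
Insight: The novelty is combining statistical anomaly detection with a domain-specific language model (SciBERT) and semantic reconstruction based on dense retrieval (FAISS) and sentence embeddings (SBERT), which recovers the original terminology from obfuscated expressions and links them to likely source documents, opening a new route for forensic analysis.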
Abstract: The integrity and reliability of scientific literature are facing a serious threat from adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate “tortured phrases”, statistically improbable synonyms (e.g. “counterfeit consciousness” for “artificial intelligence”), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
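Stage (1) of the pipeline scores each token by how improbable it is under a masked language model. A minimal sketch of that scoring follows, assuming the per-token probabilities have already been obtained by masking each position in turn. The probabilities below are made-up illustrative numbers, not SciBERT outputs, and the function names and the fixed threshold of 4.0 nats are hypothetical.

```python
import math


def token_surprisals(masked_probs):
    """Negative log-probability of each original token at its masked position."""
    return [-math.log(p) for p in masked_probs]


def pseudo_perplexity(masked_probs):
    """Exponentiated mean surprisal over the sentence."""
    surprisals = token_surprisals(masked_probs)
    return math.exp(sum(surprisals) / len(surprisals))


def flag_anomalies(tokens, masked_probs, threshold=4.0):
    """Return tokens whose surprisal exceeds a static (fixed) threshold."""
    return [token
            for token, s in zip(tokens, token_surprisals(masked_probs))
            if s > threshold]


tokens = ["counterfeit", "consciousness", "has", "improved", "rapidly"]
masked_probs = [0.001, 0.002, 0.6, 0.3, 0.4]  # illustrative values only

print(flag_anomalies(tokens, masked_probs))  # ['counterfeit', 'consciousness']
print(round(pseudo_perplexity(masked_probs), 2))
```

A fixed threshold corresponds to the static decision boundary mentioned in the abstract; a dynamic threshold derived from each sentence's own statistics would shift with the variance of jargon-heavy text, which is the failure mode the authors report.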
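[95] Enhancing Next-Generation Language Models with Knowledge Graphs: Extending Claude, Mistral IA, and GPT-4 via KG-BERT cs.CL
Nour El Houda Ben Chaabene, Hamza Hammami
TL;DR: This paper proposes enhancing next-generation large language models (such as Claude, Mistral IA, and GPT-4) by integrating knowledge graphs (KGs) through KG-BERT, addressing the factual inconsistencies caused by their lack of structured knowledge. Experiments show significant gains on knowledge-intensive tasks such as question answering and entity linking, improving factual reliability and enabling more context-aware LLMs.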
Details
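Motivation: LLMs lack structured knowledge, which leads to factual inconsistencies. The paper integrates knowledge graphs to strengthen grounding and reasoning.
Result: Experiments on knowledge-intensive tasks (such as question answering and entity linking) show significant gains, indicating improved factual reliability. The abstract does not say whether the results are state of the art and names no specific benchmark.
Insight: The novelty is integrating knowledge graphs into LLMs through KG-BERT, supplementing the models with structured knowledge to improve factual consistency and context awareness. Viewed objectively, this is a hybrid of symbolic knowledge and statistical models that can improve interpretability and reliability.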
Abstract: Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.
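[96] Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis cs.CL
Nour El Houda Ben Chaabene, Hamza Hammami, Laid Kahloul
TL;DR: This paper presents a psychologically aware conversational agent designed to improve both learning performance and emotional well-being in educational settings. The system combines large language models (LLMs), a knowledge-graph-enhanced BERT (KG-BERT), and a bidirectional long short-term memory (LSTM) network with attention to classify students' cognitive and affective states in real time. Unlike earlier chatbots limited to tutoring or affective support, it uses multimodal data (textual semantics, prosodic speech features, and temporal behavioral trends) to infer engagement, stress, and conceptual understanding. A pilot study with university students showed improved motivation, reduced stress, and moderate academic gains compared with baseline methods.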
Details
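Motivation: Existing educational conversational agents are limited to a single function, either tutoring or affective support. The paper aims to improve both learning performance and emotional well-being through multimodal data fusion and real-time state analysis.
Result: In a pilot study with university students, the system improved motivation, reduced stress, and produced moderate academic gains compared with baseline methods.
Insight: The novelty is combining LLMs, knowledge-graph-enhanced semantic understanding (KG-BERT), and attention-based temporal modeling (bidirectional LSTM) with multimodal text, speech, and behavioral data to infer students' cognitive and affective states jointly and in real time, supporting adaptive, student-centered educational interventions.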
Abstract: This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students’ cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data (including textual semantics, prosodic speech features, and temporal behavioral trends) to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.
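[97] Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models cs.CL | cs.LG
Amartya Roy, Elamparithy M, Kripabandhu Ghosh, Ponnurangam Kumaraguru, Adrian de Wynter
TL;DR: This paper compares architectures (encoder, encoder-decoder, decoder-only) on causal reasoning tasks. It finds that decoder-only models relying on in-context learning (ICL) alone are unreliable for causal reasoning and are sensitive to distribution shift and irrelevant input features, while fine-tuned encoder or encoder-decoder models are more cost-effective and robust, especially at small scale.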
Details
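Motivation: The paper investigates the role and performance of in-context learning (ICL) in causal reasoning. Causal reasoning requires multi-hop composition and strict conjunctive control, and decoder-only models may rely too heavily on spurious lexical relations in the input, producing misleading results.
Result: In both natural-language and non-natural-language settings, zero-shot and few-shot ICL with decoder-only models is brittle for causal reasoning and sensitive to distribution shift. Fine-tuned encoder and encoder-decoder models generalize more robustly and are matched or surpassed by decoder-only models only at very large scale.
Insight: The paper argues that encoder architectures, which project the input into a latent space, are better suited to multi-hop conjunctive reasoning. It underlines the value of task-specific fine-tuning for causal reasoning and gives guidance on choosing an architecture for cost-effective, robust causal reasoning.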
Abstract: In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multi-hop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited for said multi-hop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural-language and non-natural-language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models can generalize more robustly across our tests, including the non-natural-language split. Both architectures are only matched or surpassed by decoder-only architectures at large scales. We conclude by noting that for cost-effective, short-horizon robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.
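[98] RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems cs.CL
Hang Ding, Qiming Feng, Dongqi Liu, Qi Zhao, Tao Yao
TL;DR: This paper presents the RoleRMBench benchmark and the RoleRM reward model, targeting the severe degradation of existing reward models in subjective, open-ended domains such as role-play dialogue. RoleRMBench is the first systematic benchmark for reward modeling in role-playing dialogue and covers seven fine-grained capabilities. RoleRM is trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous, consistent pairwise supervision under multiple structuring strategies, and it significantly outperforms existing models on the benchmark.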
Details
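Motivation: When aligning LLMs with human preferences, existing reward models perform poorly in subjective, open-ended domains such as role play and struggle to capture nuanced, persona-grounded human judgments.
Result: On RoleRMBench, RoleRM surpasses strong open- and closed-source reward models by more than 24% on average, with substantial gains in narrative coherence and stylistic fidelity.
Insight: The contributions are the first benchmark for role-play reward modeling and a training method based on continuous implicit preferences that turns subjective evaluation into continuous, consistent pairwise supervision. The work highlights the importance of continuous preference representation and annotation consistency, laying a foundation for subjective alignment in human-centered dialogue systems.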
Abstract: Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.
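Pairwise supervision for a reward model is commonly expressed with a Bradley-Terry style loss on the difference between two scalar rewards. The sketch below shows that standard loss, plus a soft-label variant in which the preference is a value in [0, 1] rather than a hard choice. The abstract does not specify how CIP is formulated, so the `preference` parameter and the soft-label form are assumptions for illustration, not the paper's method.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_loss(reward_a, reward_b, preference=1.0):
    """Cross-entropy between a target preference and the model's preference.

    reward_a, reward_b: scalar rewards the model assigns to two responses.
    preference: target probability that response A is better; 1.0 gives the
        usual hard-label Bradley-Terry loss (hypothetical soft-label extension
        for values strictly between 0 and 1).
    """
    margin = reward_a - reward_b
    return -(preference * log_sigmoid(margin)
             + (1.0 - preference) * log_sigmoid(-margin))


print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))  # True
```

With a hard label the loss keeps pushing the two rewards apart, whereas a soft label such as 0.5 is minimized when the rewards are equal, which is one way a graded judgment could be represented.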
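[99] AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence cs.CL
Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang
TL;DR: This paper presents AgriGPT-Omni, a unified multimodal LLM framework for agriculture that integrates speech, vision, and text. To address the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks in agricultural applications, the authors build a large-scale data synthesis pipeline, create the largest agricultural speech dataset to date, and train the first agricultural omni-model with a three-stage paradigm. They also propose AgriBench-Omni-2K, the first multilingual benchmark covering speech-vision-text tasks. Experiments show that the model significantly outperforms general-purpose baselines on multilingual multimodal reasoning and real-world speech understanding.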
Details
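Motivation: Multimodal LLMs are advancing rapidly, but agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks.
Result: Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual multimodal reasoning and on real-world speech understanding.
Insight: The main contributions are: 1) a scalable data synthesis and collection pipeline that produces the largest multilingual agricultural speech dataset to date; 2) a three-stage training paradigm (textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning) used to train the first agricultural omni-model; 3) AgriBench-Omni-2K, the first tri-modal (speech-vision-text) multilingual benchmark for agriculture, with standardized protocols and reproducible tools. Viewed objectively, applying a unified multimodal framework systematically to the under-resourced agricultural domain, with particular attention to multilingual and low-resource settings, has practical value.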
Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.
[100] Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving cs.CL | cs.AIPDF
Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang
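TL;DR: This paper proposes OPV (Outcome-based Process Verifier), a new verifier that addresses the difficulty existing verifiers have in reliably assessing intermediate steps in long-chain LLM reasoning. OPV verifies the reasoning process behind outcomes summarized from long reasoning chains, giving accurate and efficient verification and supporting large-scale annotation.
Details
Motivation: Current outcome-based verifiers cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers struggle to reliably detect errors in complex long chains because high-quality human annotation is costly and scarce. A verification method that balances accuracy and efficiency is therefore needed.
Result: On the held-out benchmark OPV-Bench, OPV achieves a new state-of-the-art F1 score of 83.1, outperforming much larger open-source models such as Qwen3-Max-Preview (F1 76.3). On AIME2025, when collaborating with a policy model, it raises the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3%.
Insight: The core contribution is the OPV verifier, which balances accuracy and scalability by verifying the reasoning process behind summarized outcomes. Methodologically, it adopts an iterative active learning framework with expert annotations, using rejection fine-tuning and RLVR to improve verification capability efficiently while reducing annotation cost.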
Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV’s superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives in synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
[101] TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage cs.CLPDF
Elroy Galbraith, Chadwick Sutherland, Donahue Morgan
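TL;DR: This paper presents TRIDENT, a three-layer redundant architecture designed to support dispatchers handling emergency calls in Caribbean-accented English, even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local LLM-based entity extraction, and bio-acoustic distress detection to give dispatchers three complementary signals (transcription confidence, structured clinical entities, and vocal stress indicators) that help them apply established triage protocols.
Details
Motivation: Existing emergency speech recognition systems degrade markedly on non-standard English varieties such as Caribbean-accented English, leaving a critical gap in services for Caribbean populations.
Result: The paper reports no quantitative experimental results; empirical validation is left as future work.
Insight: The core idea is to treat low ASR confidence as a valuable queue-prioritization signal rather than a system failure, especially when it coincides with elevated vocal distress markers, which suggest a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight is that trained responders or composed bystanders may report life-threatening emergencies without elevated vocal stress, so semantic analysis is needed to capture clinical indicators that paralinguistic features may miss. The architecture is grounded in psycholinguistic research on stress-induced code-switching and accounts for offline operation during disaster scenarios.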
Abstract: Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal – particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss. We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios. This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.
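The way the three signals might combine into a queue hint can be sketched as follows. This is a minimal illustration only: the `CallSignals` fields, the threshold values, and the queue labels are hypothetical and are not taken from the paper, which reports no implementation details or empirical validation. The function only orders calls for human review; it does not perform triage.

```python
from dataclasses import dataclass


@dataclass
class CallSignals:
    # All three fields are hypothetical stand-ins for the signals named in the abstract.
    asr_confidence: float                      # 0.0 (unreliable transcript) .. 1.0 (reliable)
    vocal_distress: float                      # 0.0 (calm) .. 1.0 (highly distressed)
    clinical_entities: tuple[str, ...] = ()    # e.g. ("chest pain",) from entity extraction


def queue_priority(signals: CallSignals,
                   low_conf: float = 0.5,
                   high_distress: float = 0.7) -> str:
    """Return an illustrative queue label for a dispatcher (thresholds are assumptions)."""
    low_confidence = signals.asr_confidence < low_conf
    distressed = signals.vocal_distress >= high_distress
    # Low ASR confidence together with high distress is treated as a priority signal,
    # not as a system failure.
    if low_confidence and distressed:
        return "review-first"
    # Calm callers can still report serious conditions, so extracted clinical
    # entities raise priority even without vocal stress.
    if signals.clinical_entities or low_confidence or distressed:
        return "review-soon"
    return "standard"
```

For example, `queue_priority(CallSignals(0.2, 0.9))` yields `"review-first"`, while a calm, clearly transcribed call that mentions a clinical entity yields `"review-soon"`.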
[102] OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification cs.CL | cs.LGPDF
Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu
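TL;DR: This paper proposes the Outcome-based Process Verifier (OPV) for efficiently verifying long chain-of-thought (CoT) reasoning. OPV verifies the reasoning process behind outcomes summarized from long CoTs, combining the strengths of outcome-based and process-based verifiers, and uses an iterative active learning framework with expert annotations to reduce annotation cost and improve verification capability.
Details
Motivation: Current outcome-based verifiers (OVs) cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs because high-quality annotations are scarce (human annotation is prohibitively expensive). A method is needed that verifies long CoTs accurately and also enables efficient large-scale annotation.
Result: On the authors' OPV-Bench, OPV achieves a new state-of-the-art (SOTA) F1 score of 83.1, outperforming much larger open-source models such as Qwen3-Max-Preview (F1 76.3). It effectively detects false positives in synthetic datasets, in close agreement with expert assessment. When collaborating with policy models it yields consistent gains, for example raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025.
Insight: The contribution is OPV itself, which balances accuracy and efficiency by verifying the reasoning process behind summarized outcomes, together with iterative active learning (combining rejection fine-tuning and RLVR) that progressively improves verification capability and reduces reliance on expensive human annotation. Viewed objectively, the method offers a scalable approach to automated verification of long, complex reasoning chains.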
Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV’s superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives in synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
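The selection step of the active learning loop ("the most uncertain cases of the current best OPV are annotated") can be sketched as below. The abstract does not say how uncertainty is measured, so this sketch assumes the verifier emits a probability that a solution is correct and treats scores nearest 0.5 as most uncertain; the function name and the example scores are hypothetical.

```python
def select_most_uncertain(scores: dict[str, float], budget: int) -> list[str]:
    """Pick the `budget` case ids whose verifier score is closest to 0.5.

    `scores` maps a case id to an assumed verifier probability in [0, 1].
    """
    ranked = sorted(scores, key=lambda case_id: abs(scores[case_id] - 0.5))
    return ranked[:budget]


# Made-up verifier scores for four cases.
example_scores = {"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.41}
print(select_most_uncertain(example_scores, budget=2))  # ['b', 'd']
```

The selected cases would go to expert annotators, and the labeled results would then feed the next round of training.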
[103] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting cs.CL | cs.LGPDF
Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi
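TL;DR: This paper studies how romanized text versus native script affects LLM performance on a real-world clinical triage task in Indian languages. Romanized text causes a significant drop: F1 scores are 5-12 points lower than for native scripts, which at the partner organization could translate into nearly 2 million excess triage errors.
Details
Motivation: In high-stakes clinical applications in India, users often write in romanized text rather than native scripts, yet existing research rarely uses real-world data to evaluate how this orthographic variation affects LLM reliability, particularly in the critical domain of maternal and newborn healthcare triage.
Result: Leading LLMs are benchmarked on a real-world dataset of user queries spanning five Indian languages and Nepali. Performance on romanized messages degrades consistently, with F1 scores 5-12 points below those for native scripts. At the partner organization, this gap could cause nearly 2 million excess triage errors.
Insight: The paper exposes a critical safety blind spot for LLMs in health systems: a model may appear to understand the semantic intent of romanized input, yet its final classification output remains brittle under orthographic noise. The performance gap therefore stems not from failed clinical reasoning but from insufficient robustness to variation in input form, an important finding for safe LLM deployment in multilingual and low-resource settings.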
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
[104] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality cs.CL | cs.AIPDF
Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong
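TL;DR: This paper introduces the FACTS Leaderboard, a comprehensive online evaluation suite and benchmark for assessing how well large language models generate factually accurate text across scenarios. The suite provides a holistic factuality measure by aggregating model performance on four sub-leaderboards: multimodal factuality, parametric knowledge, search-based information seeking, and document-grounded generation.
Details
Motivation: To address the shortcomings of existing benchmarks in comprehensively evaluating LLM factuality, the authors aim to build a more comprehensive, balanced, and automated evaluation framework.
Result: The paper proposes a new benchmark suite that will be actively maintained, with public and private splits to protect its integrity; the abstract reports no quantitative results for specific models and no SOTA comparison.
Insight: The contribution lies in decomposing factuality evaluation into four complementary dimensions and scoring them with automated judge models, yielding a more comprehensive and scalable evaluation framework that can help standardize research on model factuality.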
Abstract: We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models’ world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model’s overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
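The aggregation rule stated in the abstract (the suite score is the average of the four component scores) can be written out as a short sketch. The dictionary keys and the example numbers are hypothetical; they are not real leaderboard results, and this is not the leaderboard's own code.

```python
from statistics import mean

# Hypothetical keys for the four sub-leaderboards named in the abstract.
SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding")


def facts_suite_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the four sub-leaderboard scores."""
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return mean(scores[name] for name in SUB_LEADERBOARDS)


# Made-up component scores, for illustration only.
example = {"multimodal": 40.0, "parametric": 60.0, "search": 50.0, "grounding": 70.0}
print(facts_suite_score(example))  # 55.0
```

Because the mean is unweighted, a weak score on any one component lowers the suite score as much as a weak score on any other.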
[105] LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification cs.CL | cs.AIPDF
Michael Schlee, Christoph Weisser, Timo Kivimäki, Melchizedek Mashiku, Benjamin Saefken
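TL;DR: LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more large language models (e.g., GPT, Gemini, or DeepSeek) to deliver accurate, cost-aware predictions on multi-class and multi-label tasks. It concatenates the backbone model's embeddings with LLM-derived per-class scores and feeds them into a compact multi-layer perceptron, trained end to end.
Details
Motivation: To combine the efficiency of traditional transformer classifiers with the reasoning ability of LLMs so that text classification is robust, accurate, and cost-controlled.
Result: It reaches 92.4% accuracy on AG News and 92.3% on the 10-class Reuters 21578 topic classification task.
Insight: The key idea is to obtain per-class LLM scores through structured prompt-engineering strategies, concatenate them with the traditional model's embeddings, and fuse them with a learnable FusionMLP, capturing the complementary strengths of both models and enabling practical trade-offs between accuracy, latency, and cost.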
Abstract: LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone’s embeddings with the LLM-derived per-class scores – obtained through structured prompt-engineering strategies – and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains – achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification – while enabling practical trade-offs between accuracy, latency, and cost.
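The fusion step (concatenate the backbone embedding with the LLM's per-class scores, then pass the joint vector through a small MLP) can be illustrated with a pure-Python forward pass. This is a sketch under assumptions, not the package's API: `fusion_forward`, the layer sizes, the random weights, and the softmax output (suited to the multi-class case) are all hypothetical choices made for illustration.

```python
import math
import random


def fusion_forward(embedding, llm_scores, w1, b1, w2, b2):
    """One forward pass of a tiny fusion MLP: concat -> ReLU hidden layer -> softmax."""
    joint = list(embedding) + list(llm_scores)  # the concatenation step
    hidden = [max(0.0, sum(w * x for w, x in zip(row, joint)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
embed_dim, n_classes, hidden_dim = 8, 4, 6  # illustrative sizes only
embedding = [random.gauss(0, 1) for _ in range(embed_dim)]  # stand-in for a backbone embedding
llm_scores = [0.1, 0.7, 0.1, 0.1]                           # stand-in for LLM per-class scores
w1 = [[random.gauss(0, 0.1) for _ in range(embed_dim + n_classes)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [[random.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(n_classes)]
b2 = [0.0] * n_classes

probs = fusion_forward(embedding, llm_scores, w1, b1, w2, b2)
print(len(probs), round(sum(probs), 6))
```

In a trained system the weights would be learned from labeled data, and a multi-label variant would typically replace the softmax with independent per-class sigmoids.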
[106] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity cs.CLPDF
Hauke Licht
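TL;DR: This paper evaluates the effectiveness of multimodal large language models (mLLMs) for video-based emotion analysis. Under ideal conditions they rate emotional arousal reliably and show little to no indication of demographic bias, but they perform poorly on real-world parliamentary debates, which may affect downstream statistical inference.
Details
Motivation: Evidence on the effectiveness of multimodal AI for emotion analysis is lacking, even as political communication research increasingly uses audio-visual material to analyze emotional displays.
Result: On human-labeled video datasets, mLLM emotional arousal ratings are highly reliable under ideal conditions and show little to no indication of demographic bias; on recordings of real parliamentary debates, the ratings perform poorly, with potential negative consequences for statistical inference.
Insight: The contribution is a systematic evaluation of mLLMs in practical political emotion analysis, together with a replicable evaluation framework, underscoring the need for continued, thorough evaluation of emerging generative AI methods in political analysis.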
Abstract: Emotions are central to politics and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs’ emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs’ arousal ratings fail to deliver on this promise, with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.
cs.AI [Back]
[107] Exploring LLMs for Scientific Information Extraction Using The SciEx Framework cs.AI | cs.CLPDF
Sha Li, Ayush Sadekar, Nathan Self, Yiqi Su, Lars Andersland
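TL;DR: This paper presents SciEx, a modular and composable framework for automating scientific information extraction with large language models (LLMs). By decoupling key components (PDF parsing, multi-modal retrieval, extraction, and aggregation), it addresses long contexts, multi-modal content, and cross-publication standardization in scientific literature, and supports rapid adaptation to changing data schemas.
Details
Motivation: Existing LLM-based extraction methods and tools struggle with long scientific documents, multi-modal content, and standardizing fine-grained information across multiple publications, especially when the data schema changes rapidly and systems are hard to re-architect or fine-tune.
Result: SciEx is evaluated on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently, and the findings offer practical insight into the strengths and limitations of current LLM-based pipelines.
Insight: The main contribution is a modular, composable design that decouples the extraction pipeline into independent components, improving extensibility and flexibility and making it easy to integrate new models, prompting strategies, and reasoning mechanisms as scientific information extraction needs change.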
Abstract: Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.
[108] Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution cs.AI | cs.CLPDF
Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu
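TL;DR: This paper proposes ReMe (Remember Me, Refine Me), a framework that addresses the static, passive accumulation of experience in existing procedural memory frameworks for LLM agents. Three mechanisms (multi-faceted distillation, context-adaptive reuse, and utility-based refinement) enable experience-driven, dynamic agent evolution.
Details
Motivation: Existing procedural memory frameworks mostly follow a "passive accumulation" paradigm that treats memory as a static, append-only archive and cannot support dynamic reasoning. ReMe aims to bridge the gap between static storage and dynamic reasoning.
Result: On BFCL-V3 and AppWorld, ReMe sets a new SOTA for agent memory systems. A key finding is a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms the larger, memoryless Qwen3-14B.
Insight: The contribution is a dynamic framework covering the full memory lifecycle, which optimizes the experience pool through active extraction, contextualized reuse, and autonomous maintenance, offering a computation-efficient path to lifelong learning.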
Abstract: Procedural memory enables large language model (LLM) agents to internalize “how-to” knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a “passive accumulation” paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose ReMe (Remember Me, Refine Me), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) multi-faceted distillation, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers and generating comparative insights; 2) context-adaptive reuse, which tailors historical insights to new contexts via scenario-aware indexing; and 3) utility-based refinement, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state-of-the-art in agent memory systems. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the reme.library dataset to facilitate further research.
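The third mechanism, utility-based refinement, can be pictured with a toy pruning rule. The abstract does not define how utility is computed, so everything here is an assumption: the `Memory` fields, the success-rate utility, and the two thresholds are hypothetical and chosen only to show the shape of "keep valid memories, prune outdated ones".

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    retrieved: int = 0   # times this memory was reused (hypothetical counter)
    helped: int = 0      # times the task succeeded after reuse (hypothetical counter)

    @property
    def utility(self) -> float:
        """Assumed utility: fraction of reuses that were followed by success."""
        return self.helped / self.retrieved if self.retrieved else 0.0


def refine(pool: list[Memory], min_trials: int = 3, min_utility: float = 0.3) -> list[Memory]:
    """Keep memories that are still untested or useful enough; drop the rest."""
    return [m for m in pool if m.retrieved < min_trials or m.utility >= min_utility]


pool = [
    Memory("check the API schema before calling", retrieved=5, helped=4),
    Memory("retry the same call unchanged", retrieved=5, helped=0),
    Memory("log in before listing files", retrieved=1, helped=0),
]
print([m.text for m in refine(pool)])
```

Here the second memory is pruned because it has been tried often and never helped, while the third is kept because it has not yet been tried enough to judge.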
[109] Echo-CoPilot: A Multi-View, Multi-Task Agent for Echocardiography Interpretation and Reporting cs.AI | cs.CV | cs.LG | eess.IVPDF
Moein Heidari, Mohammad Amin Roohi, Armin Khosravi, Ilker Hacihaliloglu
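TL;DR: This paper presents Echo-CoPilot, a multi-view, multi-task agent for echocardiography interpretation and reporting. It uses a large language model to orchestrate a suite of specialized echocardiography tools: within a ReAct-style loop it decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis, and integrates the outputs into guideline-aware answers and narrative summaries.
Details
Motivation: Echocardiography is central to cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view manual task. Existing echocardiography foundation models perform well on individual perceptual subtasks such as view classification, segmentation, or disease prediction, but they typically operate in isolation and do not provide a unified, clinically coherent assessment.
Result: On the public MIMIC-EchoQA benchmark, Echo-CoPilot reaches 50.8% accuracy, outperforming general-purpose and biomedical video vision-language models. Qualitative analysis further shows that the agent uses quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds.
Insight: The contribution is an LLM-driven multi-task orchestration framework that integrates multiple specialized tools into a unified, clinically coherent workflow, automating the path from perception to reasoning and report generation and handling difficult borderline cases, which improves the automation and clinical usefulness of echocardiography interpretation.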
Abstract: Echocardiography is central to contemporary cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view task that is still performed manually. While recent foundation models for echocardiography can achieve strong performance on individual perceptual subtasks such as view classification, segmentation, or disease prediction, they typically operate in isolation and do not provide a unified, clinically coherent assessment. In this work, we introduce Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate a suite of specialized echocardiography tools. Within a ReAct-style loop, the agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis, and integrates their outputs into guideline-aware answers and narrative summaries. We evaluate Echo-CoPilot on the public MIMIC-EchoQA benchmark, where it achieves an accuracy of 50.8%, outperforming both general-purpose and biomedical video vision-language models. Qualitative analyses further show that the agent leverages quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds, such as borderline left ventricular hypertrophy or pericardial effusion severity. The code will be released upon acceptance of the paper.
[110] Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning cs.AI | cs.CVPDF
Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt
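TL;DR: This paper studies the effect of reinforcement learning (RL) and explicit intermediate reasoning ("thinking") in chest X-ray (CXR) vision-language models (VLMs). The authors build RadVLM on Qwen3-VL with large-scale supervised fine-tuning (SFT), add basic thinking ability through a cold-start SFT stage, and then apply GRPO with clinically grounded, task-specific rewards. Strong SFT is crucial for base performance, RL brings additional gains on both report generation and visual grounding, and explicit thinking does not further improve results. The optimized RadVLM models outperform their baselines under a unified evaluation framework and reach state-of-the-art performance on report generation and grounding.
Details
Motivation: Many medical VLMs rely solely on SFT, which optimizes next-token prediction without evaluating answer quality. RL can incorporate task-specific feedback, and combined with explicit intermediate reasoning it has shown substantial gains on verifiable math and coding tasks. This study explores how RL and "thinking" affect report generation and visual grounding in a CXR VLM.
Result: Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts on both report generation and visual grounding and reach state-of-the-art (SOTA) performance.
Insight: The contribution is the systematic application of RL with clinically grounded, task-specific rewards (via GRPO) to a medical VLM to optimize report generation and visual grounding. Viewed objectively, the core result is that clinically aligned RL is a strong complement to SFT that improves medical VLMs on key tasks, while explicit intermediate reasoning showed no additional benefit in the tasks and settings studied, a finding that is informative in its own right.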
Abstract: Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning (“thinking”) has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
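The "group relative" part of GRPO can be shown in a few lines: several responses are sampled for the same prompt, each gets a reward, and each reward is normalized against the group's mean and standard deviation to form an advantage. The sketch below uses the population standard deviation and a small epsilon; implementations differ on these details, and the reward values are made up rather than taken from the paper's clinical reward functions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled reports on the same image.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason reward design matters for this method.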
[111] Agile Deliberation: Concept Deliberation for Subjective Visual Classification cs.AI | cs.CV | cs.HC | cs.LGPDF
Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan
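TL;DR: This paper proposes "Agile Deliberation", a human-in-the-loop framework that helps users iteratively define and refine subjective visual concepts for training image classifiers. Through two stages, concept scoping and concept iteration, it helps users work with vague and evolving concepts; a user study finds that it improves classification performance and reduces cognitive burden.
Details
Motivation: Existing human-in-the-loop approaches typically assume users have a clear, stable understanding of the concept, but in practice users often start from a vague idea and must refine it iteratively through "concept deliberation". This paper addresses such subjective, evolving concept definition, particularly in applications like content moderation.
Result: Evaluated through 18 user sessions of 1.5 hours each (rather than on standard benchmark datasets), Agile Deliberation achieves 7.5% higher F1 than automated decomposition baselines and more than 3% higher than manual deliberation, while participants report clearer conceptual understanding and lower cognitive effort.
Insight: The contribution is to formalize the deliberation strategies actually used by content moderation experts into a human-in-the-loop framework that explicitly supports evolving, subjective concepts, guiding users to iteratively align the classifier with their intent through exposure to borderline cases and structured hierarchical decomposition, and highlighting the importance of interactive evaluation for subjective tasks.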
Abstract: From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through “concept deliberation”, a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called “Agile Deliberation” that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user’s evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
cs.LG [Back]
[112] Asynchronous Reasoning: Training-Free Interactive Thinking LLMs cs.LG | cs.CLPDF
George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Alina Shutova
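TL;DR: This paper proposes a training-free asynchronous reasoning method that lets reasoning-capable LLMs think, receive input, and generate output at the same time, substantially reducing latency in real-time interaction.
Details
Motivation: Reasoning LLMs must interact sequentially (think first, then answer), which limits real-time use. The goal is to let them process information asynchronously, as humans do, for real-time applications such as voice assistants.
Result: Evaluated on math, commonsense, and safety reasoning tasks, the method cuts time to first non-thinking token from minutes to at most 5 seconds and reduces overall real-time delays by 6-11x.
Insight: By exploiting properties of rotary embeddings, LLMs built for sequential interaction can reason asynchronously without retraining, thinking and responding in real time, which makes them more usable in dynamic environments.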
Abstract: Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing the time to first non-thinking token from minutes to <= 5s and the overall real-time delays by 6-11x.
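The abstract does not say which properties of rotary embeddings the method relies on. One well-known property, shown below as background only, is that the dot product of a rotated query and a rotated key depends on the difference between their positions and not on the absolute positions. Whether this is the property the authors use is an assumption; the `rope` helper is a plain-Python sketch of the standard rotation and is not the paper's code.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Apply rotary position embedding to an even-length vector."""
    dim = len(vec)
    out: list[float] = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # rotate the (x, y) pair by theta
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (4) at very different absolute positions.
near = dot(rope(q, 7), rope(k, 3))
far = dot(rope(q, 1007), rope(k, 1003))
print(abs(near - far) < 1e-9)  # True
```

Because attention scores depend only on relative offsets, the same cached keys can in principle be viewed from different absolute positions, which is the kind of flexibility a training-free rearrangement of token streams would need.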
[113] Stronger Normalization-Free Transformers cs.LG | cs.AI | cs.CL | cs.CVPDF
Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
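TL;DR: This paper proposes Derf, a new point-wise function that replaces normalization layers in Transformer architectures. Derf is based on the rescaled Gaussian cumulative distribution function and was identified through a large-scale search; across several domains (including vision, speech, and DNA sequence modeling) it outperforms LayerNorm, RMSNorm, and Dynamic Tanh, with stronger performance and generalization.
Details
Motivation: Normalization layers have long been regarded as indispensable in deep learning architectures, but Dynamic Tanh (DyT) showed that alternatives are possible. This paper looks for function designs that surpass DyT to further improve normalization-free Transformers.
Result: On benchmarks in image recognition and generation, speech representation, and DNA sequence modeling, Derf outperforms LayerNorm, RMSNorm, and DyT, making it a practical choice for normalization-free Transformer architectures.
Insight: The contribution is the Derf function, found through large-scale search, whose gains stem mainly from improved generalization rather than stronger fitting capacity. Its simplicity and strong performance make it an effective option for normalization-free Transformers and challenge the assumed necessity of normalization layers.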
Abstract: Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
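The formula in the abstract is simple enough to evaluate directly. The sketch below applies Derf to scalars, treating alpha and s as plain arguments (the abstract does not say how they are set or whether any further scaling follows), and checks the standard identity erf(x) = 2 * Phi(sqrt(2) * x) - 1, which is what "rescaled Gaussian cumulative distribution function" refers to. The function names and the example parameter values are illustrative.

```python
import math
from statistics import NormalDist


def derf(x: float, alpha: float = 1.0, shift: float = 0.0) -> float:
    """Point-wise Derf(x) = erf(alpha * x + shift); the output is bounded in [-1, 1]."""
    return math.erf(alpha * x + shift)


def erf_via_gaussian_cdf(x: float) -> float:
    """erf written as a rescaled standard normal CDF: 2 * Phi(sqrt(2) * x) - 1."""
    return 2.0 * NormalDist().cdf(math.sqrt(2.0) * x) - 1.0


# Extreme inputs are squashed toward -1 or 1, which is how a bounded
# point-wise function constrains extreme activations.
for value in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(value, derf(value, alpha=0.8, shift=0.1))
```

Like tanh in DyT, the function is applied element by element and needs no batch or layer statistics.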
[114] CC-GRMAS: A Multi-Agent Graph Neural System for Spatiotemporal Landslide Risk Assessment in High Mountain Asia cs.LG | cs.AI | cs.CVPDF
Mihir Panchal, Ying-Jung Chen, Surya Parkash
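TL;DR: This paper presents CC-GRMAS, a multi-agent graph neural system that uses satellite observations and environmental signals for landslide risk assessment, aiming to improve landslide forecasting accuracy in High Mountain Asia. It consists of three interlinked agents (Prediction, Planning, and Execution) that collaborate to provide real-time situational awareness, response planning, and intervention.
Details
Motivation: Landslides are a growing climate-induced hazard, particularly in High Mountain Asia; despite increasing availability of satellite and temporal data, timely detection and disaster response remain underdeveloped and fragmented.
Result: The abstract reports no quantitative experimental results or benchmark comparisons.
Insight: The contribution is the combination of multi-agent coordination with graph neural networks and the incorporation of local environmental factors, offering a scalable, proactive approach to climate-resilient disaster preparedness in vulnerable mountain regions. Viewed objectively, the architecture's modular split into prediction, planning, and execution, coordinated through agents, is a noteworthy integration approach for disaster management.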
Abstract: Landslides are a growing climate-induced hazard with severe environmental and human consequences, particularly in High Mountain Asia. Despite increasing access to satellite and temporal datasets, timely detection and disaster response remain underdeveloped and fragmented. This work introduces CC-GRMAS, a framework leveraging a series of satellite observations and environmental signals to enhance the accuracy of landslide forecasting. The system is structured around three interlinked agents (Prediction, Planning, and Execution), which collaboratively enable real-time situational awareness, response planning, and intervention. By incorporating local environmental factors and operationalizing multi-agent coordination, this approach offers a scalable and proactive solution for climate-resilient disaster preparedness across vulnerable mountainous terrains.
[115] Interpretable and Steerable Concept Bottleneck Sparse Autoencoders cs.LG | cs.CVPDF
Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla
TL;DR: 本文提出了一种名为概念瓶颈稀疏自编码器(CB-SAE)的新型后处理框架,旨在解决稀疏自编码器(SAE)在大型语言模型和大型视觉语言模型中存在的可解释性和可操控性不足的问题。通过引入两种新的计算成本低廉的评估指标,作者发现多数SAE神经元在可解释性或可操控性上表现不佳,且由于无监督学习特性,用户所需概念常缺失。CB-SAE通过剪枝低效用神经元并利用轻量级概念瓶颈增强潜在空间,显著提升了模型性能。
Details
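Motivation: Sparse autoencoders (SAEs) show promise for mechanistic interpretability, concept discovery, and model steering, but the features learned by existing methods often lack interpretability and steerability, and user-desired concepts are frequently missing, which limits practical use.
Result: On large vision-language models and image generation tasks, CB-SAE improves interpretability by 32.1% and steerability by 14.5%.
Insight: The contributions are two new metrics for interpretability and steerability, and the CB-SAE framework, which makes sparse autoencoders more useful by pruning low-utility neurons and adding a concept bottleneck aligned to a user-defined concept set. Viewed objectively, the method combines post-hoc optimization with concept alignment and offers a reusable approach to improving model transparency and controllability.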
Abstract: Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations: (i) a majority of SAE neurons exhibit low interpretability, low steerability, or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE), a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.
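The prune-then-augment idea described in this abstract can be sketched at a high level as follows. This is only an illustration under stated assumptions, not the authors' implementation: the per-neuron scores, the thresholds `interp_min` and `steer_min`, the rule "keep a neuron only if both scores pass", and the function name `prune_and_augment` are all hypothetical, since the abstract does not define the metrics or the utility criterion.

```python
from typing import List, Sequence, Tuple


def prune_and_augment(
    interp_scores: Sequence[float],
    steer_scores: Sequence[float],
    concepts: Sequence[str],
    interp_min: float = 0.5,  # hypothetical threshold
    steer_min: float = 0.5,   # hypothetical threshold
) -> Tuple[List[int], List[Tuple[str, object]]]:
    """Return kept SAE neuron indices and the layout of the augmented latent space."""
    if len(interp_scores) != len(steer_scores):
        raise ValueError("need one interpretability and one steerability score per neuron")
    # Assumed utility rule: a neuron survives only if it is both interpretable and steerable.
    kept = [
        i
        for i, (interp, steer) in enumerate(zip(interp_scores, steer_scores))
        if interp >= interp_min and steer >= steer_min
    ]
    # Surviving SAE neurons come first, followed by one slot per user-defined concept.
    layout: List[Tuple[str, object]] = [("sae", i) for i in kept]
    layout += [("concept", name) for name in concepts]
    return kept, layout


if __name__ == "__main__":
    kept, layout = prune_and_augment(
        interp_scores=[0.9, 0.2, 0.8, 0.7],
        steer_scores=[0.8, 0.9, 0.1, 0.6],
        concepts=["dog", "red"],
    )
    print(kept)
    print(layout)
```

The sketch covers only the bookkeeping of which latent dimensions remain; training the concept bottleneck itself is outside its scope.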
cs.HC
[116] CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences cs.HC | cs.CL
Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball
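TL;DR: This paper presents CompanionCast, a multi-agent conversational AI framework for social co-viewing experiences. The framework orchestrates multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio, aiming to recreate the social presence of watching together.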
Details
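Motivation: Modern media consumption is increasingly solitary and lacks social presence. The paper investigates whether multi-agent conversational AI systems can recreate the dynamics and social interaction of shared viewing across different types of video content.
Result: A pilot study on soccer viewing, a domain with rich dynamics and strong social traditions, suggests that multi-agent interaction improves users' perceived social presence compared with solo viewing.
Insight: The main contributions are: 1) a general framework for orchestrating multi-agent conversations around multimodal video content; 2) a novel LLM-as-a-Judge evaluator-agent pipeline that iteratively scores and refines conversation quality along five dimensions (relevance, authenticity, engagement, diversity, personality consistency); 3) exploratory evidence that AI-mediated co-viewing can increase social presence.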
Abstract: Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.
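The iterative score-and-refine loop attributed to the LLM-as-a-Judge module can be sketched roughly as below. Everything beyond the five dimension names is an assumption for illustration: the `judge` and `rewrite` callables stand in for model calls, and the score scale, the `threshold`, the `max_rounds` cap, and the "fix the weakest dimension first" policy are hypothetical choices not described in the abstract.

```python
from typing import Callable, Dict, Tuple

# The five dimensions named in the abstract.
DIMENSIONS = (
    "relevance",
    "authenticity",
    "engagement",
    "diversity",
    "personality_consistency",
)

Scores = Dict[str, float]


def refine_with_judge(
    draft: str,
    judge: Callable[[str], Scores],      # hypothetical: returns one score per dimension
    rewrite: Callable[[str, str], str],  # hypothetical: revises the draft for one dimension
    threshold: float = 4.0,              # assumed pass mark on an assumed 1-5 scale
    max_rounds: int = 3,                 # assumed cap on refinement rounds
) -> Tuple[str, Scores, int]:
    """Score a draft, then revise its weakest dimension until all pass or rounds run out."""
    scores = judge(draft)
    rounds = 0
    while rounds < max_rounds and min(scores[d] for d in DIMENSIONS) < threshold:
        weakest = min(DIMENSIONS, key=lambda d: scores[d])
        draft = rewrite(draft, weakest)
        scores = judge(draft)
        rounds += 1
    return draft, scores, rounds
```

In a real system both callables would wrap model calls; passing them in as parameters keeps the control flow testable with simple stubs.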
cs.RO
[117] Design of a six wheel suspension and a three-axis linear actuation mechanism for a laser weeding robot cs.RO | cs.CV | eess.SY
Muhammad Usama, Muhammad Ibrahim Khan, Ahmad Hasan, Muhammad Shaaf Nadeem, Khawaja Fahad Iqbal
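TL;DR: This paper designs a six-wheel suspension system and a three-dimensional linear actuation mechanism for a laser weeding robot. The robot uses a novel double four-bar suspension for higher stability, and the three-dimensional linear actuation mechanism guides the laser precisely onto detected weeds.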
Details
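Motivation: Conventional mechanical weeding is inefficient over large fields, and herbicides harm the soil ecosystem. Laser weeding, a sustainable alternative in precision farming, needs a mobile robot platform to remove weeds efficiently and accurately.
Result: Field tests show that the robot navigates agricultural terrain effectively, overcoming obstacles up to 15 cm in height. At an optimal speed of 42.5 cm/s, it achieves a weed detection rate of 86.2% and an operating time of 87 seconds per meter. The laser actuation mechanism has a mean positional error of only 1.54 mm and a hit rate of 97%.
Insight: The novelty lies in combining a new double four-bar six-wheel suspension (better terrain adaptability and stability) with a three-dimensional linear actuation mechanism (high-precision laser positioning). The design achieves accurate, efficient laser weeding while preserving mobility, providing a practical mechanical design for precision-agriculture robots.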
Abstract: Mobile robots are increasingly utilized in agriculture to automate labor-intensive tasks such as weeding, sowing, harvesting and soil analysis. Recently, agricultural robots have been developed to detect and remove weeds using mechanical tools or precise herbicide sprays. Mechanical weeding is inefficient over large fields, and herbicides harm the soil ecosystem. Laser weeding with mobile robots has emerged as a sustainable alternative in precision farming. In this paper, we present an autonomous weeding robot that uses controlled exposure to a low energy laser beam for weed removal. The proposed robot is six-wheeled with a novel double four-bar suspension for higher stability. The laser is guided towards the detected weeds by a three-dimensional linear actuation mechanism. Field tests have demonstrated the robot’s capability to navigate agricultural terrains effectively by overcoming obstacles up to 15 cm in height. At an optimal speed of 42.5 cm/s, the robot achieves a weed detection rate of 86.2% and operating time of 87 seconds per meter. The laser actuation mechanism maintains a minimal mean positional error of 1.54 mm, combined with a high hit rate of 97%, ensuring effective and accurate weed removal. This combination of speed, accuracy, and efficiency highlights the robot’s potential for significantly enhancing precision farming practices.
[118] Evaluating Gemini Robotics Policies in a Veo World Simulator cs.RO | cs.AI | cs.CV | cs.LG
Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao
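TL;DR: This paper presents a generative evaluation system, built on the frontier video foundation model Veo, for comprehensively evaluating robot policies. The system generates realistic scene variations to simulate policy behavior across many environments, covering nominal performance, out-of-distribution generalization, and physical and semantic safety testing.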
Details
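Motivation: The use of video models in robotics has been limited mainly to in-distribution evaluation, i.e., scenarios similar to the training data. This work explores the potential of video models across the full spectrum of robot policy evaluation: nominal performance, out-of-distribution generalization, and safety probing.
Result: More than 1,600 real-world evaluations validate that the system accurately predicts the relative performance of different policies under nominal and out-of-distribution conditions and identifies how different axes of generalization affect policy performance. The evaluation covers eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
Insight: The novelty lies in building a generative evaluation system that supports robot action conditioning and multi-view consistency, and in integrating generative image editing and multi-view completion to synthesize variations of real-world scenes along multiple axes of generalization. The system preserves the capabilities of the base video model and accurately simulates edited scenes that contain novel interaction objects, novel visual backgrounds, and novel distractor objects.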
Abstract: Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.