cs.CV

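[1] Detection of Cyberbullying in GIF using AI cs.CV | cs.LG | cs.MM

Pal Dave, Xiaohong Yuan, Madhuri Siddula, Kaushik Roy

TL;DR: The paper proposes an AI-based method for detecting cyberbullying in GIFs, filling a gap in existing research on this medium, and achieves 97% accuracy using the pre-trained VGG16 model.

Details

Motivation: Cyberbullying on social media is becoming increasingly severe. Existing research focuses mainly on text and images, while detection in GIFs or stickers has received little study.

Result: The model achieves 97% accuracy in detecting cyberbullying in GIFs.

Insight: GIFs are a popular form of expression on social media, yet cyberbullying detection for them remains understudied; future work could extend to multimodal analysis.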

Abstract: Cyberbullying is a well-known social issue, and it is escalating day by day. Due to the vigorous development of the internet, social media provide many different ways for users to express their opinions and exchange information. Cyberbullying occurs on social media through text messages, comments, shared images, GIFs or stickers, and audio and video. Much research has been done to detect cyberbullying in textual data; some is available for images. Very few studies address cyberbullying detection in GIFs/stickers. We collect a GIF dataset from Twitter and apply a deep learning model to detect cyberbullying in the dataset. First, we extracted hashtags related to cyberbullying from Twitter. We used these hashtags to download GIF files through the publicly available GIPHY API. We collected over 4100 GIFs, including cyberbullying and non-cyberbullying examples. We applied the pre-trained deep learning model VGG16 to detect cyberbullying. The model achieved an accuracy of 97%. Our work provides the GIF dataset for researchers working in this area.


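[2] Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality cs.CV

Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu

TL;DR: This paper proposes LivingSwap, the first video reference guided face swapping model. It achieves high fidelity and temporal consistency through keyframes and target identity injection, and significantly reduces manual effort in production workflows.

Details

Motivation: High-fidelity face swapping that stays consistent over long sequences remains a challenge in film and television production. Existing methods struggle to balance identity preservation and temporal continuity; this paper explores using video references to improve swapping quality.

Result: Experiments show that the method outperforms existing methods in identity preservation and video consistency, significantly improving visual quality and production efficiency.

Insight: Combining video references with keyframes is an effective way to address long-term consistency; building a high-quality dataset is crucial for model training.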

Abstract: Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video’s expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap


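[3] Restrictive Hierarchical Semantic Segmentation for Stratified Tooth Layer Detection cs.CV | cs.AI

Ryan Banks, Camila Lindoni Azevedo, Hongying Tang, Yunpeng Li

TL;DR: The paper proposes a hierarchical semantic segmentation framework that explicitly embeds the anatomical hierarchy through a level-wise prediction scheme, restrictive output heads, and top-down feature conditioning, improving the performance and anatomical consistency of stratified tooth layer detection.

Details

Motivation: Existing hierarchy-aware segmentation methods encode anatomical structure mainly indirectly through loss functions, which provides weak supervision. To model anatomical hierarchy more directly, the authors propose an explicit hierarchical segmentation framework.

Result: Validated on the TL-pano dataset, hierarchical variants of UNet and HRNet consistently improve IoU, Dice, and recall, especially for fine-grained anatomical structures, but the recall gain comes at the cost of more false positives.

Insight: Explicit hierarchical structure not only improves performance but also enhances clinical plausibility, and it is especially effective in low-data dental imaging.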

Abstract: Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. A way of introducing this within semantic segmentation models is by utilising hierarchy-aware methodologies. However, existing hierarchy-aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level-wise prediction scheme with restrictive output heads and top-down feature conditioning. At each depth of the class tree, the backbone is re-run on the original image concatenated with logits from the previous level. Child class features are conditioned using Feature-wise Linear Modulation of their parent class probabilities, to modulate child feature spaces for fine grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. Hierarchical loss combines per-level class weighted Dice and cross entropy loss and a consistency term loss, ensuring parent predictions are the sum of their children. We validate our approach on our proposed dataset, TL-pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations, of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5-fold cross validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine-grained anatomies, and produce more anatomically coherent masks. However, hierarchical variants also demonstrated increased recall over precision, implying increased false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low data dental imaging regimes.
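The parent/child consistency idea described in this abstract (parent predictions equal to the sum of their children) can be illustrated with a minimal sketch. The two-level class tree, the class names, the probabilities, and the squared-difference penalty are all illustrative assumptions, not the paper's implementation.

```python
# Hypothetical class tree: each parent maps to its child classes (illustrative names).
CLASS_TREE = {"tooth": ["enamel", "dentin", "pulp"], "bone": ["alveolar_bone"]}


def compose_parent_probs(child_probs, tree=CLASS_TREE):
    """Derive each parent's probability as the sum of its children's probabilities."""
    return {parent: sum(child_probs[c] for c in children)
            for parent, children in tree.items()}


def consistency_penalty(parent_probs, child_probs, tree=CLASS_TREE):
    """Squared gap between each predicted parent probability and its children's sum."""
    composed = compose_parent_probs(child_probs, tree)
    return sum((parent_probs[p] - composed[p]) ** 2 for p in tree)


# Per-pixel child probabilities (made-up values for one pixel).
child = {"enamel": 0.2, "dentin": 0.3, "pulp": 0.1, "alveolar_bone": 0.4}
consistent_parent = compose_parent_probs(child)
inconsistent_parent = {"tooth": 0.9, "bone": 0.1}

print(consistency_penalty(consistent_parent, child))    # zero by construction
print(consistency_penalty(inconsistent_parent, child))  # positive
```

A penalty of this kind is zero only when the parent-level prediction agrees with what the child-level predictions imply, which is the behaviour the abstract attributes to its consistency term.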


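[4] FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models cs.CV | cs.AI

Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang

TL;DR: FRIEDA is a benchmark for evaluating large vision-language models (LVLMs) on complex, open-ended cartographic reasoning tasks, and it reveals significant shortcomings of existing models in multi-step map reasoning.

Details

Motivation: Cartographic reasoning is fundamental both as a cognitive capability and for critical tasks such as disaster response and urban planning, yet it has not been adequately evaluated. Current map visual question answering (VQA) research often treats maps as a special case of charts, overlooking their distinctive layered symbology and spatial relations.

Result: The strongest models, Gemini-2.5-Pro and GPT-5-Think, reach only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%.

Insight: FRIEDA highlights the limitations of LVLMs in multi-step cartographic reasoning and provides a rigorous, diverse evaluation standard for future research.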

Abstract: Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.


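[5] Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment cs.CV

Youngjoon Jang, Liliane Momeni, Zifan Jiang, Joon Son Chung, Gül Varol

TL;DR: The paper proposes a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). It combines a lightweight visual backbone, a Sliding Perceiver mapping network, and a multi-task training strategy, and achieves state-of-the-art performance on top of multilingual pretraining.

Details

Motivation: To develop a unified sign language understanding model that performs both sign language translation and subtitle alignment, supporting practical communication, large-scale corpus construction, and educational applications.

Result: The model achieves state-of-the-art SLT and SSA performance on the BOBSL dataset and demonstrates zero-shot generalisation and finetuned performance on How2Sign.

Insight: Multilingual pretraining and a unified model architecture are key to sign language translation and subtitle alignment across sign languages.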

Abstract: Our aim is to develop a unified model for sign language understanding, that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles – both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.


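[6] Towards Sustainable Universal Deepfake Detection with Frequency-Domain Masking cs.CV

Chandler Timm C. Doloriel, Habib Ullah, Kristian Hovde Liland, Fadi Al Machot, Ngai-Man Cheung

TL;DR: The paper proposes a universal deepfake detection method based on frequency-domain masking, which uses random masking and geometric transformations to improve detector generalization while reducing computational overhead.

Details

Motivation: Deepfake techniques are advancing rapidly, so detectors must be robust to newly emerging fake images while consuming fewer computational resources to support large-scale screening.

Result: The method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and remains robust under structured pruning.

Insight: Frequency-domain masking offers a practical path toward sustainable, universal deepfake detection that balances performance and resource efficiency.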

Abstract: Universal deepfake detection aims to identify AI-generated images across a broad range of generative models, including unseen ones. This requires robust generalization to new and unseen deepfakes, which emerge frequently, while minimizing computational overhead to enable large-scale deepfake screening, a critical objective in the era of Green AI. In this work, we explore frequency-domain masking as a training strategy for deepfake detectors. Unlike traditional methods that rely heavily on spatial features or large-scale pretrained models, our approach introduces random masking and geometric transformations, with a focus on frequency masking due to its superior generalization properties. We demonstrate that frequency masking not only enhances detection accuracy across diverse generators but also maintains performance under significant model pruning, offering a scalable and resource-conscious solution. Our method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and exhibits consistent robustness under structured pruning. These results highlight the potential of frequency-based masking as a practical step toward sustainable and generalizable deepfake detection. Code and models are available at: https://github.com/chandlerbing65nm/FakeImageDetection.
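As a rough illustration of what frequency masking means as a training-time augmentation, the sketch below transforms a signal to the frequency domain, zeroes a random subset of coefficients, and transforms back. It is a simplification under stated assumptions: it works on a 1D list standing in for one image row, uses a naive DFT for self-containment, and the function names and mask ratio are hypothetical rather than taken from the authors' code.

```python
import cmath
import random


def dft(signal):
    """Naive discrete Fourier transform of a 1D sequence (O(n^2), for illustration)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns complex values."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def frequency_mask(signal, mask_ratio, rng):
    """Zero a random fraction of frequency coefficients and keep the real part."""
    spectrum = dft(signal)
    n = len(spectrum)
    masked = set(rng.sample(range(n), int(round(mask_ratio * n))))
    kept = [0j if k in masked else c for k, c in enumerate(spectrum)]
    return [v.real for v in idft(kept)]


rng = random.Random(0)
row = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]  # stand-in for one image row
unmasked = frequency_mask(row, 0.0, rng)   # no masking: reconstructs the input
augmented = frequency_mask(row, 0.25, rng)  # two of eight coefficients zeroed
```

With a mask ratio of zero the round trip reproduces the input up to floating-point error; a real pipeline would apply a 2D FFT to whole images instead.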


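[7] Mask to Adapt: Simple Random Masking Enables Robust Continual Test-Time Learning cs.CV

Chandler Timm C. Doloriel

TL;DR: The paper proposes M2A, a simple continual test-time adaptation method that generates multiple views with random spatial or frequency masking and combines a consistency loss with an entropy minimization loss to achieve robust performance under strong corruption.

Details

Motivation: Distribution shift at test time degrades image classifiers, and existing continual test-time adaptation methods typically rely on complex masking designs or uncertainty calibration. This paper asks whether simple random masking is sufficient for robust adaptation.

Result: On CIFAR10C/CIFAR100C/ImageNetC, M2A (spatial masking) outperforms or matches baseline methods, while frequency masking lags slightly behind. Experiments show that random masking is simple and effective.

Insight: A simple random masking design, combined with consistency and entropy objectives, is sufficient to drive robust test-time adaptation without relying on uncertainty or attention signals.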

Abstract: Distribution shifts at test time degrade image classifiers. Recent continual test-time adaptation (CTTA) methods use masking to regulate learning, but often depend on calibrated uncertainty or stable attention scores and introduce added complexity. We ask: do we need custom-made masking designs, or can a simple random masking schedule suffice under strong corruption? We introduce Mask to Adapt (M2A), a simple CTTA approach that generates a short sequence of masked views (spatial or frequency) and adapts with two objectives: a mask consistency loss that aligns predictions across different views and an entropy minimization loss that encourages confident outputs. Motivated by masked image modeling, we study two common masking families (spatial masking and frequency masking) and further compare subtypes within each (spatial: patch vs. pixel; frequency: all vs. low vs. high). On CIFAR10C/CIFAR100C/ImageNetC (severity 5), M2A (Spatial) attains 8.3%/19.8%/39.2% mean error, outperforming or matching strong CTTA baselines, while M2A (Frequency) lags behind. Ablations further show that simple random masking is effective and robust. These results indicate that a simple random masking schedule, coupled with consistency and entropy objectives, is sufficient to drive effective test-time adaptation without relying on uncertainty or attention signals.
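The two objectives named in this abstract, a consistency loss across masked views and an entropy minimization loss, can be sketched as follows. The exact loss forms are not given in the abstract, so the mean-squared consistency term, the equal weighting, and the function name `m2a_style_loss` are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def consistency(probs_a, probs_b):
    """Mean squared difference between two predictions (one assumed consistency form)."""
    return sum((a - b) ** 2 for a, b in zip(probs_a, probs_b)) / len(probs_a)


def m2a_style_loss(view_logits, weight=1.0):
    """Average entropy over views plus consistency of each view with the first view."""
    probs = [softmax(logits) for logits in view_logits]
    ent = sum(entropy(p) for p in probs) / len(probs)
    con = sum(consistency(probs[0], p) for p in probs[1:]) / max(len(probs) - 1, 1)
    return ent + weight * con


# Logits a classifier might produce for two masked views of the same image.
agreeing = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
disagreeing = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
print(m2a_style_loss(agreeing) < m2a_style_loss(disagreeing))
```

Both examples are equally confident per view, so the difference in loss comes entirely from the consistency term penalising views that disagree.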


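[8] Identification of Deforestation Areas in the Amazon Rainforest Using Change Detection Models cs.CV

Christian Massao Konishi, Helio Pedrini

TL;DR: The paper proposes a change-detection-based approach to identifying deforested areas in the Amazon rainforest, evaluates multiple models on a unified dataset, and demonstrates the gains from pre- and post-processing techniques.

Details

Motivation: Preserving the Amazon rainforest is a priority in combating global climate change. Existing methods for detecting deforested areas suffer from unsatisfactory model effectiveness, a lack of modern architectures, and a lack of methodological standardization.

Result: By combining models, the paper reaches an F1-score of 80.41%, better than the individual models and comparable to recent results in the literature.

Insight: Pre- and post-processing techniques are crucial for improving model effectiveness, and model combination further improves performance, offering a new methodological reference for deforestation detection.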

Abstract: The preservation of the Amazon Rainforest is one of the global priorities in combating climate change, protecting biodiversity, and safeguarding indigenous cultures. The Satellite-based Monitoring Project of Deforestation in the Brazilian Legal Amazon (PRODES), a project of the National Institute for Space Research (INPE), stands out as a fundamental initiative in this effort, annually monitoring deforested areas not only in the Amazon but also in other Brazilian biomes. Recently, machine learning models have been developed using PRODES data to support this effort through the comparative analysis of multitemporal satellite images, treating deforestation detection as a change detection problem. However, existing approaches present significant limitations: models evaluated in the literature still show unsatisfactory effectiveness, many do not incorporate modern architectures, such as those based on self-attention mechanisms, and there is a lack of methodological standardization that allows direct comparisons between different studies. In this work, we address these gaps by evaluating various change detection models in a unified dataset, including fully convolutional models and networks incorporating self-attention mechanisms based on Transformers. We investigate the impact of different pre- and post-processing techniques, such as filtering deforested areas predicted by the models based on the size of connected components, texture replacement, and image enhancements; we demonstrate that such approaches can significantly improve individual model effectiveness. Additionally, we test different strategies for combining the evaluated models to achieve results superior to those obtained individually, reaching an F1-score of 80.41%, a value comparable to other recent works in the literature.
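One of the post-processing steps mentioned in this abstract, filtering predicted deforested areas by connected-component size, can be sketched in a few lines together with a pixel-wise F1-score. The 4-connectivity, the `min_size` threshold, and the toy masks are assumptions chosen for illustration; the paper's actual settings are not stated in the abstract.

```python
from collections import deque


def filter_small_components(mask, min_size):
    """Remove 4-connected regions of 1s smaller than min_size from a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) >= min_size:
                for y, x in component:
                    out[y][x] = 1
    return out


def f1_score(pred, truth):
    """Pixel-wise F1 = 2*TP / (2*TP + FP + FN) for binary masks."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            tp += p == 1 and t == 1
            fp += p == 1 and t == 0
            fn += p == 0 and t == 1
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0


truth = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
pred = [[1, 1, 0, 1], [1, 1, 0, 0], [0, 0, 0, 0]]  # one isolated false-positive pixel
cleaned = filter_small_components(pred, min_size=2)
print(f1_score(pred, truth), f1_score(cleaned, truth))
```

In this toy case the isolated pixel is removed and the F1-score rises from 8/9 to 1.0; on real data the threshold trades removed noise against lost small true regions.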


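[9] CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning cs.CV

Zeyuan Chen, Xiang Zhang, Haiyang Xu, Jianwen Xie, Zhuowen Tu

TL;DR: CVP is a multimodal model inspired by human central and peripheral vision. It improves spatial reasoning through a target-affinity token and a global (allocentric) grid, achieving state-of-the-art performance on multiple 3D scene understanding tasks.

Details

Motivation: Existing spatial reasoning methods typically rely on unstructured representations (such as point clouds and voxels) and inject scene context implicitly via coordinate embeddings, which leaves them without explicit, high-level structural understanding. Drawing on the mechanisms of human central and peripheral vision, CVP proposes a more structured solution.

Result: CVP achieves state-of-the-art performance on multiple 3D scene understanding benchmarks.

Insight: 1. Human visual mechanisms (central and peripheral vision) can effectively inspire multimodal model design; 2. Explicit structural representations outperform implicit embeddings and can significantly improve a model's spatial reasoning.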

Abstract: We present a central-peripheral vision-inspired framework (CVP), a simple yet effective multimodal model for spatial reasoning that draws inspiration from the two types of human visual fields – central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: target-affinity token, analogous to central vision, that guides the model’s attention toward query-relevant objects; and allocentric grid, akin to peripheral vision, that captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.


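[10] Fourier-RWKV: A Multi-State Perception Network for Efficient Image Dehazing cs.CV

Lirong Zheng, Yanshan Li, Rui Yu, Kaihao Zhang

TL;DR: Fourier-RWKV is a new dehazing framework based on a Multi-State Perception paradigm. It models non-uniform haze through three cooperating states (spatial-form perception, frequency-domain perception, and semantic-relation perception) while keeping linear computational complexity.

Details

Motivation: Current Transformer-based dehazing methods capture global context, but their quadratic computational complexity limits real-time deployment. To address this, the authors propose an efficient multi-state perception network.

Result: The method achieves state-of-the-art performance on multiple benchmarks while significantly reducing computational overhead, combining restoration quality with practical efficiency.

Insight: 1. Frequency-domain modeling can balance global dependencies and computational efficiency in dehazing; 2. The multi-state perception paradigm offers a new approach to complex visual degradation problems.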

Abstract: Image dehazing is crucial for reliable visual perception, yet it remains highly challenging under real-world non-uniform haze conditions. Although Transformer-based methods excel at capturing global context, their quadratic computational complexity hinders real-time deployment. To address this, we propose Fourier Receptance Weighted Key Value (Fourier-RWKV), a novel dehazing framework based on a Multi-State Perception paradigm. The model achieves comprehensive haze degradation modeling with linear complexity by synergistically integrating three distinct perceptual states: (1) Spatial-form Perception, realized through the Deformable Quad-directional Token Shift (DQ-Shift) operation, which dynamically adjusts receptive fields to accommodate local haze variations; (2) Frequency-domain Perception, implemented within the Fourier Mix block, which extends the core WKV attention mechanism of RWKV from the spatial domain to the Fourier domain, preserving the long-range dependencies essential for global haze estimation while mitigating spatial attenuation; (3) Semantic-relation Perception, facilitated by the Semantic Bridge Module (SBM), which utilizes Dynamic Semantic Kernel Fusion (DSK-Fusion) to precisely align encoder-decoder features and suppress artifacts. Extensive experiments on multiple benchmarks demonstrate that Fourier-RWKV delivers state-of-the-art performance across diverse haze scenarios while significantly reducing computational overhead, establishing a favorable trade-off between restoration quality and practical efficiency. Code is available at: https://github.com/Dilizlr/Fourier-RWKV.


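[11] Accuracy Does Not Guarantee Human-Likeness in Monocular Depth Estimators cs.CV

Yuki Kubota, Taiki Fukiage

TL;DR: The paper examines the trade-off between accuracy and similarity to human perception in monocular depth estimation models. It shows that higher accuracy does not guarantee more human-like behavior and stresses the need for multifaceted, human-centric evaluation.

Details

Motivation: Although deep neural networks have achieved superhuman accuracy on physically based benchmarks, aligning model representations with human perception remains a key challenge. The study asks whether depth estimation exhibits a trade-off between accuracy and human-like behavior similar to the one found in object recognition.

Result: Although humans and models show positive correlations in certain estimation biases, there is a clear trade-off between model accuracy and human similarity: higher accuracy does not necessarily lead to more human-like behavior.

Insight: Monocular depth estimation needs multifaceted, human-centric evaluations that go beyond traditional accuracy in order to better align model behavior with human perception.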

Abstract: Monocular depth estimation is a fundamental capability for real-world applications such as autonomous driving and robotics. Although deep neural networks (DNNs) have achieved superhuman accuracy on physically based benchmarks, a key challenge remains: aligning model representations with human perception, a promising strategy for enhancing model robustness and interpretability. Research in object recognition has revealed a complex trade-off between model accuracy and human-like behavior, raising the question of whether a similar divergence exists in depth estimation, particularly for natural outdoor scenes where benchmarks rely on sensor-based ground truth rather than human perceptual estimates. In this study, we systematically investigated the relationship between model accuracy and human similarity across 69 monocular depth estimators using the KITTI dataset. To dissect the structure of error patterns on a factor-by-factor basis, we applied affine fitting to decompose prediction errors into interpretable components. Intriguingly, our results reveal that while humans and DNNs share certain estimation biases (positive error correlations), there are distinct trade-off relationships between model accuracy and human similarity. This finding indicates that improving accuracy does not necessarily lead to more human-like behavior, underscoring the necessity of developing multifaceted, human-centric evaluations beyond traditional accuracy.
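The abstract mentions using affine fitting to decompose prediction errors into interpretable components. A minimal sketch of one such decomposition is shown below: a least-squares scale and shift are fitted between ground truth and prediction, and the error is split into the part explained by that affine map and a residual. The specific decomposition and the toy depth values are assumptions; the paper may define its components differently.

```python
def affine_fit(gt, pred):
    """Least-squares scale a and shift b such that pred is approximately a * gt + b."""
    n = len(gt)
    mean_g = sum(gt) / n
    mean_p = sum(pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gt)
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gt, pred))
    a = cov / var_g
    b = mean_p - a * mean_g
    return a, b


def decompose_error(gt, pred):
    """Split pred - gt into an affine part and a residual the affine fit cannot explain."""
    a, b = affine_fit(gt, pred)
    affine_part = [(a - 1.0) * g + b for g in gt]
    residual = [p - (a * g + b) for g, p in zip(gt, pred)]
    return a, b, affine_part, residual


gt = [2.0, 4.0, 6.0, 8.0]       # toy ground-truth depths
pred = [1.5, 2.5, 3.5, 4.5]     # exactly 0.5 * gt + 0.5, i.e. compressed depth
a, b, affine_part, residual = decompose_error(gt, pred)
print(a, b)
```

Here the whole error is a scale-and-shift bias (a = 0.5, b = 0.5) and the residual is zero, so two estimators with similar overall error could still differ in which component dominates.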


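[12] GeoLoom: High-quality Geometric Diagram Generation from Textual Input cs.CV

Xiaojing Wei, Ting Zhang, Wei He, Jingdong Wang, Hua Huang

TL;DR: GeoLoom is a new framework for generating high-quality geometric diagrams from text. It combines a formal language with optimization to generate diagrams with high precision.

Details

Motivation: Geometric diagram generation requires strict geometric constraints and spatial accuracy, and existing methods fall short in structural fidelity.

Result: GeoLoom significantly outperforms existing baselines in structural fidelity.

Insight: Formal languages and mathematical optimization provide a foundation for interpretable and scalable geometric diagram generation.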

Abstract: High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into a specifically designed generation-oriented formal language GeoLingua, and a coordinate solver that maps formal constraints to precise coordinates using the efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.
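To make the idea of a coordinate solver concrete, the sketch below maps distance constraints to point coordinates with a simple random local search, which is one basic Monte Carlo style procedure. It is not GeoLoom's solver: the constraint format, the violation measure, and all parameters are hypothetical, and GeoLingua itself is not modelled.

```python
import math
import random

# Hypothetical constraints: required distances between named points (a 3-4-5 triangle).
CONSTRAINTS = [("A", "B", 3.0), ("B", "C", 4.0), ("A", "C", 5.0)]


def violation(coords, constraints=CONSTRAINTS):
    """Sum of squared differences between actual and required distances."""
    total = 0.0
    for p, q, target in constraints:
        (x1, y1), (x2, y2) = coords[p], coords[q]
        total += (math.hypot(x2 - x1, y2 - y1) - target) ** 2
    return total


def monte_carlo_solve(coords, steps=5000, step_size=0.5, seed=0):
    """Random local search: perturb one point at a time, keep moves that reduce violation."""
    rng = random.Random(seed)
    best = dict(coords)
    best_cost = violation(best)
    names = sorted(best)
    for _ in range(steps):
        name = rng.choice(names)
        x, y = best[name]
        trial = dict(best)
        trial[name] = (x + rng.gauss(0.0, step_size), y + rng.gauss(0.0, step_size))
        cost = violation(trial)
        if cost < best_cost:
            best, best_cost = trial, cost
    return best, best_cost


start = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 1.0)}
solution, cost = monte_carlo_solve(start)
print(violation(start), cost)
```

The same violation score doubles as a constraint-based measure of structural deviation, in the spirit of the evaluation metric the abstract describes.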


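[13] Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement cs.CV

Chia-Hern Lai, I-Hsuan Lo, Yen-Ku Yeh, Thanh-Nguyen Truong, Ching-Chun Huang

TL;DR: Blur2Sharp is a new framework that combines 3D-aware neural rendering with diffusion models to generate sharp, geometrically consistent multi-view images from a single reference view.

Details

Motivation: When generating diverse viewpoints and complex motions, existing methods either produce geometrically inconsistent results or sacrifice photorealism, yielding blurry outputs. Blur2Sharp aims to solve this problem.

Result: Experiments show that Blur2Sharp outperforms prior techniques in challenging scenarios (such as loose clothing and occlusions), producing sharper and more geometrically consistent results.

Insight: Combining 3D-aware neural rendering with diffusion models can effectively improve the geometric consistency and detail quality of generated images, especially for complex human motion.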

Abstract: The creation of lifelike human avatars capable of realistic pose variation and viewpoint flexibility remains a fundamental challenge in computer vision and graphics. Current approaches typically yield either geometrically inconsistent multi-view images or sacrifice photorealism, resulting in blurry outputs under diverse viewing angles and complex motions. To address these issues, we propose Blur2Sharp, a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images from only a single reference view. Our method employs a dual-conditioning architecture: initially, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. Subsequently, a diffusion model conditioned on these renderings refines the generated images, preserving fine-grained details and structural fidelity. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy. Extensive experiments demonstrate that Blur2Sharp consistently surpasses state-of-the-art techniques in both novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.


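[14] VisKnow: Constructing Visual Knowledge Base for Object Understanding cs.CV

Ziwei Yao, Qiyang Wan, Ruiping Wang, Xilin Chen

TL;DR: The paper proposes the Visual Knowledge Base, which structures multi-modal object knowledge, together with a construction framework named VisKnow. As a case study it builds AnimalKB, covering 406 animal categories, which improves zero-shot recognition and fine-grained visual question answering.

Details

Motivation: Data for existing object understanding tasks is usually task-specific and not systematically organized, so structured multi-modal knowledge is needed to support more comprehensive object understanding.

Result: AnimalKB contains 22K textual knowledge triplets and 420K images covering 406 animal categories, and it improves performance on zero-shot recognition and fine-grained visual question answering.

Insight: The study shows that automatically constructing visual knowledge bases can advance the depth and breadth of visual understanding, and it provides new benchmarks for tasks such as knowledge graph completion and part segmentation.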

Abstract: Understanding objects is fundamental to computer vision. Beyond object recognition that provides only a category label as typical output, in-depth object understanding represents a comprehensive perception of an object category, involving its components, appearance characteristics, inter-category relationships, contextual background knowledge, etc. Developing such capability requires sufficient multi-modal data, including visual annotations such as parts, attributes, and co-occurrences for specific tasks, as well as textual knowledge to support high-level tasks like reasoning and question answering. However, these data are generally task-oriented and not systematically organized enough to achieve the expected understanding of object categories. In response, we propose the Visual Knowledge Base that structures multi-modal object knowledge as graphs, and present a construction framework named VisKnow that extracts multi-modal, object-level knowledge for object understanding. This framework integrates enriched aligned text and image-source knowledge with region annotations at both object and part levels through a combination of expert design and large-scale model application. As a specific case study, we construct AnimalKB, a structured animal knowledge base covering 406 animal categories, which contains 22K textual knowledge triplets extracted from encyclopedic documents, 420K images, and corresponding region annotations. A series of experiments showcase how AnimalKB enhances object-level visual tasks such as zero-shot recognition and fine-grained VQA, and serves as challenging benchmarks for knowledge graph completion and part segmentation. Our findings highlight the potential of automatically constructing visual knowledge bases to advance visual understanding and its practical applications. The project page is available at https://vipl-vsu.github.io/VisKnow.


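[15] SOP^2: Transfer Learning with Scene-Oriented Prompt Pool on 3D Object Detection cs.CV

Ching-Hung Cheng, Hsiu-Fu Wu, Bing-Chen Wu, Khanh-Phong Bui, Van-Tin Luu

TL;DR: The paper examines the effectiveness of prompt tuning methods in 3D object detection and proposes a Scene-Oriented Prompt Pool (SOP^2), demonstrating its potential for 3D detection.

Details

Motivation: Inspired by the strong generalization that large language models (such as GPT-3) show in NLP through prompt tuning, the paper investigates whether similar transfer learning can be achieved in 3D object detection.

Result: SOP^2 proves effective in 3D object detection, demonstrating the potential of prompt tuning in the 3D domain.

Insight: Prompt tuning applies not only to NLP tasks but also to 3D vision tasks, opening a new direction for future research.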

Abstract: Large Language Models (LLMs) such as GPT-3 exhibit strong generalization capabilities. Through transfer learning techniques such as fine-tuning and prompt tuning, they can be adapted to various downstream tasks with minimal parameter adjustments. This approach is particularly common in the field of Natural Language Processing (NLP). This paper aims to explore the effectiveness of common prompt tuning methods in 3D object detection. We investigate whether a model trained on the large-scale Waymo dataset can serve as a foundation model and adapt to other scenarios within the 3D object detection field. This paper sequentially examines the impact of prompt tokens and prompt generators, and further proposes a Scene-Oriented Prompt Pool (SOP^2). We demonstrate the effectiveness of prompt pools in 3D object detection, with the goal of inspiring future researchers to delve deeper into the potential of prompts in the 3D field.


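[16] New VVC profiles targeting Feature Coding for Machines cs.CV

Md Eimran Hossain Eimon, Ashan Perera, Juan Merlos, Velibor Adzic, Hari Kalva

TL;DR: The paper proposes three lightweight VVC (Versatile Video Coding) profiles for feature coding in machine learning tasks, improving coding efficiency and reducing computational overhead.

Details

Motivation: Traditional video codecs are optimized mainly for human vision, but split inference systems transmit intermediate neural network features rather than pixel data. These features are abstract, sparse, and task-specific, so conventional codec assumptions no longer apply.

Result: The Fast profile achieves a 2.96% BD-Rate gain while reducing encoding time by 21.8%; Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup; Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate.

Insight: For feature coding in machine learning tasks, lightweight VVC profiles can substantially reduce computational overhead while maintaining high compression efficiency and task accuracy.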

Abstract: Modern video codecs have been extensively optimized to preserve perceptual quality, leveraging models of the human visual system. However, in split inference systems (where intermediate features from a neural network are transmitted instead of pixel data), these assumptions no longer apply. Intermediate features are abstract, sparse, and task-specific, making perceptual fidelity irrelevant. In this paper, we investigate the use of Versatile Video Coding (VVC) for compressing such features under the MPEG-AI Feature Coding for Machines (FCM) standard. We perform a tool-level analysis to understand the impact of individual coding components on compression efficiency and downstream vision task accuracy. Based on these insights, we propose three lightweight essential VVC profiles: Fast, Faster, and Fastest. The Fast profile provides a 2.96% BD-Rate gain while reducing encoding time by 21.8%. Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup. Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate.


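[17] MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models cs.CV | cs.AI

Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv

TL;DR: MM-CoT is a diagnostic benchmark designed to evaluate the visual grounding and logical coherence of multimodal models (MMs) in visual Chain-of-Thought (CoT) reasoning. Whereas existing benchmarks emphasize generation, MM-CoT emphasizes verification and uses adversarial distractors to expose reasoning failures. Experiments show that even advanced models perform poorly, revealing a large gap between generative fluency and true reasoning fidelity.

Details

Motivation: Existing multimodal benchmarks focus mainly on the surface quality of generated reasoning chains and neglect whether those chains are grounded in visual evidence and logically coherent. MM-CoT aims to fill this gap by verifying visual consistency and logical validity, pushing models toward more faithful reasoning.

Result: Even the most advanced vision-language models perform poorly on MM-CoT, showing a marked gap between generative fluency and true reasoning ability. MM-CoT correlates weakly with existing benchmarks, confirming that it measures something distinct.

Insight: MM-CoT exposes a key challenge for multimodal models in visual reasoning: the ability to generate is not the same as the ability to reason. Future models need to combine visual grounding with logical coherence rather than merely pursuing fluent generation.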

Abstract: The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.


[18] Geometry-Aware Sparse Depth Sampling for High-Fidelity RGB-D Depth Completion in Robotic Systems cs.CV | cs.RO

Tony Salloom, Dandi Zhou, Xinhai Sun

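TL;DR: The paper proposes a normal-guided sparse depth sampling strategy that uses PCA-based normal estimation to measure depth reliability. It improves depth completion by producing more realistic training conditions and better edge and geometric accuracy.

Details

Motivation: Modern industrial robotic systems need accurate 3D perception, but depth maps from RGB-D sensors are often noisy and incomplete. A key limitation of current depth completion methods is the unrealistic way sparse depth is sampled.

Result: Experiments on NYU Depth v2 show that geometry-aware sparse depth sampling improves depth completion accuracy, reduces edge artifacts, and simulates sensor behavior more realistically.

Insight: Geometry-dependent depth sampling better reflects real sensor behavior and provides more reasonable training conditions for depth completion.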

Abstract: Accurate three-dimensional perception is essential for modern industrial robotic systems that perform manipulation, inspection, and navigation tasks. RGB-D and stereo vision sensors are widely used for this purpose, but the depth maps they produce are often noisy, incomplete, or biased due to sensor limitations and environmental conditions. Depth completion methods aim to generate dense, reliable depth maps from RGB images and sparse depth input. However, a key limitation in current depth completion pipelines is the unrealistic generation of sparse depth: sparse pixels are typically selected uniformly at random from dense ground-truth depth, ignoring the fact that real sensors exhibit geometry-dependent and spatially nonuniform reliability. In this work, we propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure. The sparse depth samples are then drawn according to this reliability distribution. We integrate this sampling method with the Marigold-DC diffusion-based depth completion model and evaluate it on NYU Depth v2 using the standard metrics. Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions that better reflect real sensor behavior.
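
The sampling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unit surface normals have already been estimated (the paper uses PCA on the RGB-D point cloud), and it assumes a hypothetical reliability score equal to the absolute cosine between the normal and the viewing direction. The function names are invented for this example.

```python
import random


def reliability_from_normals(normals, view_dir=(0.0, 0.0, 1.0)):
    """Map each unit surface normal to a reliability score in [0, 1].

    Assumption (not from the paper): reliability is |cos(angle)| between
    the normal and the viewing direction, so camera-facing surfaces
    score highest and grazing surfaces score lowest.
    """
    scores = []
    for nx, ny, nz in normals:
        dot = nx * view_dir[0] + ny * view_dir[1] + nz * view_dir[2]
        scores.append(abs(dot))
    return scores


def sample_sparse_indices(reliability, k, rng=None, eps=1e-6):
    """Draw k distinct pixel indices, favoring high-reliability pixels.

    Uses Efraimidis-Spirakis weighted sampling without replacement:
    each index gets the key u ** (1 / w) and the k largest keys win.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    keyed = []
    for idx, w in enumerate(reliability):
        w = max(w, eps)  # keep weights positive so the key is defined
        u = rng.random()
        keyed.append((u ** (1.0 / w), idx))
    keyed.sort(reverse=True)
    return sorted(idx for _, idx in keyed[:k])
```

With uniform weights this reduces to the usual uniform random sampling; with the reliability weights, pixels on grazing surfaces are rarely selected as sparse depth inputs.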


[19] HybridToken-VLM: Hybrid Token Compression for Vision-Language Models cs.CV | cs.AI

Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan

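TL;DR: HTC-VLM proposes a hybrid token compression framework that separates semantics and appearance through two channels (continuous ViT patches and discrete MGVQ quantization). It addresses the high computational cost of vision-language models and performs well on multiple benchmarks.

Details

Motivation: When vision-language models (VLMs) process hundreds of visual patch tokens, computational cost grows quadratically, straining memory and context windows. Traditional methods face a trade-off: continuous compression dilutes high-level semantics, while discrete quantization loses fine-grained detail.

Result: Across seven benchmarks (GQA, VQAv2, and others), the model retains 87.2% of performance on average at a 580:1 compression ratio, outperforming the continuous baseline at 81.0%. Attention analysis shows that the compressed token prioritizes the discrete anchor.

Insight: A hybrid design can balance efficiency and fidelity, offering a new direction for scalable vision-language models.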

Abstract: Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.


[20] Residual-SwinCA-Net: A Channel-Aware Integrated Residual CNN-Swin Transformer for Malignant Lesion Segmentation in BUSI cs.CV | cs.AI | cs.LG

Saeeda Naz, Saddam Hussain Khan

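TL;DR: This paper proposes Residual-SwinCA-Net, a hybrid deep learning framework for malignant lesion segmentation in breast ultrasound images (BUSI). It combines residual CNN modules with Swin Transformer modules and substantially improves segmentation accuracy.

Details

Motivation: Segmenting breast ultrasound images requires extracting both locally correlated and globally dependent features while suppressing ultrasound noise and preserving lesion morphology. The paper addresses these challenges with a hybrid framework.

Result: On the BUSI dataset, Residual-SwinCA-Net achieves 99.29% mean accuracy, 98.74% IoU, and a Dice coefficient of 0.9041, outperforming existing CNN and ViT methods.

Insight: A hybrid CNN-Transformer framework effectively combines local and global features, while attention mechanisms and multi-scale processing improve segmentation accuracy and support for clinical decision-making.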

Abstract: This study proposes Residual-SwinCA-Net, a novel deep hybrid segmentation framework that addresses the challenges of breast ultrasound lesion segmentation by incorporating residual CNN modules to extract locally correlated and robust features. To learn global dependencies, Swin Transformer blocks are customized with internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. To enhance tissue continuity, suppress ultrasound noise, and accentuate fine structural transitions, a Laplacian-of-Gaussian regional operator is applied, and a boundary-oriented operator is incorporated to maintain the morphological integrity of malignant lesion contours. Subsequently, a stage-wise contraction strategy progressively reduces the feature maps to capture scale invariance and improve robustness to structural variability. In addition, each decoder level, prior to augmentation, integrates a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. The MSCAS module selectively emphasizes salient encoder maps and retains discriminative global context and complementary local structures at minimal computational cost while suppressing redundant activations. Finally, the Pixel-Attention module encodes class-relevant spatial cues by adaptively weighting malignant lesion pixels while suppressing background interference. Residual-SwinCA-Net and existing CNN and ViT techniques were implemented on the publicly available BUSI dataset. The proposed framework outperformed them, achieving 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation. It improves BUSI lesion diagnostic performance and supports timely clinical decision-making.
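
As a reminder of what a Laplacian-of-Gaussian operator computes, the sketch below builds the textbook LoG kernel and applies it at one pixel. It illustrates the general operator only; the abstract does not specify the paper's "regional" variant, kernel size, or sigma, so those values and the function names here are assumptions.

```python
import math


def log_kernel(size, sigma):
    """Return a size x size Laplacian-of-Gaussian kernel (size must be odd)."""
    half = size // 2
    s2 = sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            value = (
                -(1.0 / (math.pi * s2 * s2))
                * (1.0 - r2 / (2.0 * s2))
                * math.exp(-r2 / (2.0 * s2))
            )
            row.append(value)
        kernel.append(row)
    # Shift the taps so they sum to zero: flat regions then give no response.
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def filter_response(img, kernel, cy, cx):
    """Apply the kernel centered on pixel (cy, cx) of a 2D list image."""
    half = len(kernel) // 2
    return sum(
        kernel[j + half][i + half] * img[cy + j][cx + i]
        for j in range(-half, half + 1)
        for i in range(-half, half + 1)
    )
```

The response is near zero in uniform tissue and changes sign across an intensity transition, which is why LoG-style filters are commonly used to accentuate structural boundaries.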


[21] Beyond Real Weights: Hypercomplex Representations for Stable Quantization cs.CV | cs.CL

Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers

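TL;DR: The paper proposes compressing multimodal language models through a progressive reparameterization strategy that replaces dense feed-forward network blocks with Parameterized Hypercomplex Multiplication (PHM) layers, reducing parameters and computation while preserving performance.

Details

Motivation: Multimodal language models (MLLMs) need many parameters to align high-dimensional visual features with linguistic representations, which makes them computationally heavy and hard to deploy efficiently. The paper proposes a progressive compression method to address this.

Result: The method is validated on multiple vision-language models (VLMs): model size and inference latency drop significantly while performance remains comparable to the original models.

Insight: Progressive PHM substitution offers an architecture-compatible path to more efficient multimodal reasoning and complements low-bit quantization techniques.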

Abstract: Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
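
For intuition about where the parameter savings come from: in its usual formulation, a PHM layer builds its weight matrix as a sum of Kronecker products, W = sum_i kron(A_i, S_i), with n matrices A_i of size n x n and n matrices S_i of size (out/n) x (in/n). The sketch below illustrates that formulation only; it is not taken from the paper, and the helper names and layer sizes are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows = []
    for a_row in a:
        for b_row in b:
            rows.append([x * y for x in a_row for y in b_row])
    return rows


def phm_weight(a_mats, s_mats):
    """Assemble W = sum_i kron(A_i, S_i) from the small factor matrices."""
    total = None
    for a, s in zip(a_mats, s_mats):
        term = kron(a, s)
        if total is None:
            total = term
        else:
            total = [
                [x + y for x, y in zip(row_t, row_n)]
                for row_t, row_n in zip(total, term)
            ]
    return total


def param_counts(out_dim, in_dim, n):
    """Return (dense, phm) parameter counts for one linear map, bias ignored."""
    dense = out_dim * in_dim
    phm = n ** 3 + (out_dim * in_dim) // n  # n matrices A_i plus n matrices S_i
    return dense, phm
```

For example, `param_counts(4096, 1024, 4)` gives 4,194,304 dense parameters against 1,048,640 for the PHM form, roughly a factor of n fewer.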


[22] Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture cs.CV | cs.CL

Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye

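TL;DR: The paper proposes a pose-based end-to-end encoder architecture for detecting a specific sign within sign language sequences, a task it calls Sign Language Spotting. It avoids the intermediate gloss recognition or text matching that conventional methods rely on and operates directly on pose keypoints, lowering computational cost and reducing visual noise.

Details

Motivation: Automatic Sign Language Recognition (ASLR) is an important bridge between deaf and hearing communities, yet detecting or retrieving a specific sign within a continuous sign sequence remains little studied. The paper aims to fill this gap by processing pose data directly for efficient sign retrieval.

Result: On the Word Presence Prediction dataset from the WSLP 2025 shared task, the model achieves 61.88% accuracy and a 60.00% F1-score, supporting its effectiveness.

Insight: Pose-based methods show promise for sign language retrieval: they reduce computational overhead and noise, and they lay a foundation for future work on automatic sign language retrieval and verification.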

Abstract: Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting


[23] SFP: Real-World Scene Recovery Using Spatial and Frequency Priors cs.CV

Yun Liu, Tao Li, Cosmin Ancuti, Wenqi Ren, Weisi Lin

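TL;DR: SFP is a real-world scene recovery method that combines spatial and frequency priors. Transmission map estimation in the spatial domain and adaptive enhancement in the frequency domain markedly improve recovery under diverse degradation conditions.

Details

Motivation: Existing scene recovery methods usually rely on a single prior or on complex networks, and they struggle with the varied degradations of real-world scenes. SFP addresses this limitation by combining complementary information from the spatial and frequency domains.

Result: Experiments show that SFP outperforms existing methods under various degradation conditions, demonstrating its effectiveness for real-world scene recovery.

Insight: Combining priors from multiple domains (spatial and frequency) markedly improves the robustness and adaptability of scene recovery, especially for complex real-world degradations.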

Abstract: Scene recovery serves as a critical task for various computer vision applications. Existing methods typically rely on a single prior, which is inherently insufficient to handle multiple degradations, or employ complex network architectures trained on synthetic data, which suffer from poor generalization for diverse real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery. In the spatial domain, we observe that the inverse of the degraded image exhibits a projection along its spectral direction that resembles the scene transmission. Leveraging this spatial prior, the transmission map is estimated to recover the scene from scattering degradation. In the frequency domain, a mask is constructed for adaptive frequency enhancement, with two parameters estimated using our proposed novel priors. Specifically, one prior assumes that the mean intensity of the degraded image’s direct current (DC) components across three channels in the frequency domain closely approximates that of each channel in the clear image. The second prior is based on the observation that, for clear images, the magnitude of low radial frequencies below 0.001 constitutes approximately 1% of the total spectrum. Finally, we design a weighted fusion strategy to integrate spatial-domain restoration, frequency-domain enhancement, and salient features from the input image, yielding the final recovered result. Extensive evaluations demonstrate the effectiveness and superiority of our proposed SFP for scene recovery under various degradation conditions.


[24] RLCNet: An end-to-end deep learning framework for simultaneous online calibration of LiDAR, RADAR, and Camera cs.CV | cs.RO

Hafeez Husain Cholakkal, Stefano Arrigoni, Francesco Braghin

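TL;DR: RLCNet is an end-to-end deep learning framework for real-time online calibration of LiDAR, RADAR, and camera sensors, addressing sensor drift and noise in dynamic environments.

Details

Motivation: Reliable perception in autonomous vehicles depends on accurate extrinsic calibration of multimodal sensors, but mechanical vibration and sensor drift in dynamic environments make calibration difficult.

Result: Validation on real-world datasets shows that RLCNet performs well under diverse conditions, with better calibration accuracy and robustness than existing methods.

Insight: Architectural choices and the dynamic adjustment mechanism are critical to the performance of online multimodal sensor calibration. RLCNet's results suggest that deep learning can effectively handle complex sensor fusion problems.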

Abstract: Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.
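
The online stage, a weighted moving average with outlier rejection, can be sketched as below. The abstract does not give the weighting scheme or the rejection rule, so this example assumes linearly increasing weights for newer predictions and a fixed per-component jump threshold; the class name and parameters are hypothetical. Calibration parameters are treated as a plain vector, which is only a reasonable approximation for small rotation offsets.

```python
from collections import deque


class OnlineCalibrationFilter:
    """Smooth a stream of predicted calibration vectors."""

    def __init__(self, window=5, max_jump=0.5):
        self.history = deque(maxlen=window)  # most recent accepted predictions
        self.max_jump = max_jump             # assumed rejection threshold

    def estimate(self):
        """Weighted average of the history; newer entries weigh more."""
        if not self.history:
            return None
        weights = range(1, len(self.history) + 1)
        total = sum(weights)
        dim = len(self.history[0])
        return [
            sum(w * p[i] for w, p in zip(weights, self.history)) / total
            for i in range(dim)
        ]

    def update(self, prediction):
        """Add a prediction unless it deviates too far from the estimate."""
        current = self.estimate()
        if current is not None:
            deviation = max(abs(p - c) for p, c in zip(prediction, current))
            if deviation > self.max_jump:
                return current  # rejected as an outlier
        self.history.append(list(prediction))
        return self.estimate()
```

A fixed threshold like this one would also reject a genuine sudden change in calibration, so a deployed system would need a rule for re-accepting persistent deviations.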


[25] EgoX: Egocentric Video Generation from a Single Exocentric Video cs.CV

Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung

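TL;DR: EgoX is a new framework that generates egocentric (first-person) video from a single exocentric (third-person) video, using lightweight LoRA adaptation and a geometry-guided self-attention mechanism to achieve high-quality generation.

Details

Motivation: The goal is to address the challenges of translating third-person video into first-person video, including large camera pose variation and minimal view overlap, and thereby enable immersive understanding.

Result: EgoX shows strong scalability and robustness on unseen and in-the-wild videos and generates coherent, realistic first-person video.

Insight: Leveraging the knowledge of pretrained models together with geometry-guided self-attention can substantially improve the quality and consistency of egocentric video generation.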

Abstract: Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width-wise and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.


[26] PAVAS: Physics-Aware Video-to-Audio Synthesis cs.CV | cs.MM | cs.SD

Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji

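TL;DR: PAVAS is a physics-aware video-to-audio synthesis method. It introduces a Physics-Driven Audio Adapter (Phy-Adapter) and a Physical Parameter Estimator (PPE), which combines a vision-language model with a dynamic 3D reconstruction module, to generate audio that better follows physical laws.

Details

Motivation: Existing video-to-audio synthesis methods rely mainly on visual-acoustic correlations and ignore the physical factors that shape sound, so the generated audio lacks physical realism.

Result: On the VGG-Impact benchmark, PAVAS outperforms existing methods in both physical realism and perceptual quality, and the APCC metric confirms its physical consistency.

Insight: Incorporating physical factors markedly improves the realism and plausibility of video-to-audio synthesis and opens a new research direction for multimodal generation.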

Abstract: Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
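
The velocity computation that the PPE performs on the recovered trajectory can be illustrated with simple finite differences. This is a sketch under stated assumptions, not the paper's estimator: it assumes the trajectory is a list of 3D positions in meters sampled at a known frame rate, and the kinetic-energy feature is an illustrative addition that the abstract does not mention.

```python
import math


def speeds_from_trajectory(positions, fps):
    """Finite-difference speed (units per second) between consecutive frames."""
    dt = 1.0 / fps
    return [math.dist(p, q) / dt for p, q in zip(positions, positions[1:])]


def impact_features(mass_kg, positions, fps):
    """Bundle the physical cues for the last tracked step of the motion."""
    speeds = speeds_from_trajectory(positions, fps)
    v = speeds[-1]  # speed over the final pair of frames
    return {
        "mass": mass_kg,
        "speed": v,
        "kinetic_energy": 0.5 * mass_kg * v * v,  # illustrative extra feature
    }
```

A heavier or faster object yields larger values, which is the kind of cue an adapter could use to condition the loudness or character of an impact sound.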


[27] OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation cs.CV

Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng

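TL;DR: OpenSubject is a large-scale video-derived dataset for improving subject-driven image generation and manipulation. It is built with a four-stage pipeline that exploits cross-frame identity priors to improve model performance.

Details

Motivation: Existing subject-driven image generation models struggle to preserve reference identities in complex multi-subject scenes. To address this, OpenSubject provides a large-scale video-derived dataset and an accompanying construction pipeline.

Result: Experiments show that OpenSubject markedly improves subject-driven generation and manipulation, particularly in complex multi-subject scenes.

Insight: Cross-frame identity priors from video, combined with large-scale data, effectively improve model performance in complex scenes. Data quality and diversity are key.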

Abstract: Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.


[28] GeoDM: Geometry-aware Distribution Matching for Dataset Distillation cs.CV | cs.AI

Xuhui Li, Zhengquan Luo, Zihui Cui, Zhiqiang Xu

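TL;DR: GeoDM proposes a geometry-aware distribution matching framework that combines Euclidean, hyperbolic, and spherical manifolds to capture linear, hierarchical, and cyclical structure in data, improving dataset distillation.

Details

Motivation: Existing distribution matching methods are confined to Euclidean space and cannot capture the intrinsic geometry of real data, such as curvature. Because high-dimensional data often lie on low-dimensional manifolds, dataset distillation needs a method that aligns the distilled data with the original data manifold.

Result: Experiments show that GeoDM outperforms existing dataset distillation methods on standard benchmarks and remains effective across distribution matching strategies for the individual geometries.

Insight: Geometry-aware distribution matching lets GeoDM capture nonlinear structure in data more accurately, improving the generalization of models trained on the distilled data.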

Abstract: Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, so they capture only linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called GeoDM, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.
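
To make the product-space idea concrete, the sketch below computes a distance between two points whose coordinates are split into a Euclidean part, a hyperbolic part (Poincare ball model), and a spherical part. The distance formulas are the standard ones for each geometry; combining them as a weighted root-sum-of-squares, and the specific models, parameter names, and defaults, are assumptions made for illustration rather than details from the paper.

```python
import math


def euclidean_dist(x, y):
    """Straight-line distance in flat space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def hyperbolic_dist(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0)."""
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * sq_x) * (1.0 - c * sq_y)
    return math.acosh(1.0 + 2.0 * c * sq_diff / denom) / math.sqrt(c)


def spherical_dist(x, y, k=1.0):
    """Great-circle distance on a sphere of curvature k (radius 1/sqrt(k))."""
    dot = sum(a * b for a, b in zip(x, y))
    cos_angle = max(-1.0, min(1.0, k * dot))  # clamp against rounding error
    return math.acos(cos_angle) / math.sqrt(k)


def product_dist(x_parts, y_parts, weights=(1.0, 1.0, 1.0), c=1.0, k=1.0):
    """Weighted combination of the three component distances (assumed form)."""
    d_e = euclidean_dist(x_parts[0], y_parts[0])
    d_h = hyperbolic_dist(x_parts[1], y_parts[1], c)
    d_s = spherical_dist(x_parts[2], y_parts[2], k)
    w_e, w_h, w_s = weights
    return math.sqrt(w_e * d_e ** 2 + w_h * d_h ** 2 + w_s * d_s ** 2)
```

In a learnable setting, `c`, `k`, and `weights` would play the role of the curvature and weight parameters mentioned in the abstract, letting the model decide how much each geometry contributes.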


[29] GeoDiffMM: Geometry-Guided Conditional Diffusion for Motion Magnification cs.CV

Xuedeng Liu, Jiabao Guo, Zheng Zhang, Fei Wang, Zhi Liu

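TL;DR: GeoDiffMM proposes a diffusion-based Lagrangian video motion magnification framework that conditions on optical flow as a geometric cue to achieve structurally consistent motion magnification, markedly improving results.

Details

Motivation: Existing Eulerian methods struggle to separate photon noise from true displacement when magnifying tiny motions. GeoDiffMM aims to solve this with geometry-guided optical flow conditioning.

Result: Experiments on real and synthetic datasets show that GeoDiffMM significantly outperforms existing methods and improves motion magnification.

Insight: Geometry-guided optical flow conditioning effectively separates noise from true motion, and diffusion models show promise for high-fidelity motion magnification.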

Abstract: Video Motion Magnification (VMM) amplifies subtle macroscopic motions to a perceptible level. Recently, existing mainstream Eulerian approaches address amplification-induced noise via decoupling representation learning such as texture, shape and frequency schemes, but they still struggle to separate photon noise from true micro-motion when motion displacements are very small. We propose GeoDiffMM, a novel diffusion-based Lagrangian VMM framework conditioned on optical flow as a geometric cue, enabling structurally consistent motion magnification. Specifically, we design a Noise-free Optical Flow Augmentation strategy that synthesizes diverse nonrigid motion fields without photon noise as supervision, helping the model learn more accurate geometry-aware optical flow and generalize better. Next, we develop a Diffusion Motion Magnifier that conditions the denoising process on (i) optical flow as a geometry prior and (ii) a learnable magnification factor controlling magnitude, thereby selectively amplifying motion components consistent with scene semantics and structure while suppressing content-irrelevant perturbations. Finally, we perform Flow-based Video Synthesis to map the amplified motion back to the image domain with high fidelity. Extensive experiments on real and synthetic datasets show that GeoDiffMM outperforms state-of-the-art methods and significantly improves motion magnification.


[30] Interpreting Structured Perturbations in Image Protection Methods for Diffusion Models cs.CV | cs.AI | cs.LG

Michael R. Martin, Garrick Chan, Kwan-Liu Ma

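TL;DR: The paper systematically analyzes the structured perturbations that image protection mechanisms (such as Glaze and Nightshade) introduce against diffusion models. It shows that these perturbations are low-entropy and tightly coupled to image content, and it examines their detectability across representational, spatial, and spectral domains.

Details

Motivation: As text-to-image generative models spread, image protection mechanisms (such as Glaze and Nightshade) that disrupt downstream models with small adversarial perturbations are growing in importance, yet their internal structure, detectability, and representational behavior remain poorly understood.

Result: The study finds that (1) protection mechanisms act as structured, low-entropy perturbations; (2) the perturbations are tightly coupled to image content; and (3) frequency analysis shows that they redistribute energy along dominant frequency axes rather than introducing diffuse noise.

Insight: Contemporary image protection works through structured feature-level deformation, which explains why protection signals are visually subtle yet readily detectable. This informs the design of future defense and detection strategies.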

Abstract: Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.


[31] PointDico: Contrastive 3D Representation Learning Guided by Diffusion Models cs.CV

Pengbo Li, Yiding Sun, Haozhe Cheng

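TL;DR: PointDico combines diffusion models and contrastive learning in a new 3D representation learning method. Multi-scale geometric feature extraction and a dual-channel design markedly improve the accuracy of 3D representations.

Details

Motivation: Existing 3D representation learning methods have difficulty with unordered, unevenly dense point clouds: contrastive methods tend to overfit, while generative methods (such as 3D Mask Autoencoders) struggle with unordered data.

Result: The model achieves 94.32% accuracy on ScanObjectNN and 86.5% instance mIoU on ShapeNetPart, setting a new state of the art in 3D representation learning.

Insight: Guidance from the diffusion model effectively mitigates overfitting in contrastive learning, while multi-scale feature extraction and the dual-channel design markedly improve the model's ability to capture geometric information.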

Abstract: Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because it is unordered and unevenly dense. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Mask Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrast models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose PointDico, a novel model that seamlessly integrates these methods. PointDico learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. PointDico achieves a new state-of-the-art in 3D representation learning, e.g., 94.32% accuracy on ScanObjectNN, 86.5% Inst. mIoU on ShapeNetPart.


[32] HybridSplat: Fast Reflection-baked Gaussian Tracing using Hybrid Splatting cs.CV

Chang Liu, Hongliang Yuan, Lianghao Zhang, Sichao Wang, Jianwei Guo

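TL;DR: HybridSplat proposes a Hybrid Splatting mechanism for high-fidelity scene reconstruction that accelerates the rendering of complex reflective scenes through reflection-baked Gaussian tracing.

Details

Motivation: Existing 3D Gaussian splatting methods face bottlenecks in rendering speed and memory storage for complex reflective scenes, so a more efficient approach is needed.

Result: On the Ref-NeRF and NeRF-Casting datasets, rendering is about 7x faster with 4x fewer Gaussian primitives, while reflection rendering quality is preserved.

Insight: Baking reflections and using hybrid splatting can markedly increase rendering speed and reduce storage requirements without sacrificing quality.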

Abstract: Rendering complex reflections of real-world scenes using 3D Gaussian splatting has been a promising solution for photorealistic novel view synthesis, but it still faces bottlenecks, especially in rendering speed and memory storage. This paper proposes a new Hybrid Splatting (HybridSplat) mechanism for Gaussian primitives. Our key idea is a new reflection-baked Gaussian tracing, which bakes the view-dependent reflection within each Gaussian primitive while rendering the reflection using tile-based Gaussian splatting. Then we integrate the reflective Gaussian primitives with base Gaussian primitives using a unified hybrid splatting framework for high-fidelity scene reconstruction. Moreover, we further introduce a pipeline-level acceleration for the hybrid splatting, and reflection-sensitive Gaussian pruning to reduce the model size, thus achieving much faster rendering speed and lower memory storage while preserving the reflection rendering quality. In extensive evaluation, HybridSplat accelerates rendering by about 7x across complex reflective scenes from Ref-NeRF and NeRF-Casting, with 4x fewer Gaussian primitives than similar ray-tracing based Gaussian splatting baselines, serving as a new state-of-the-art method especially for complex reflective scenes.


[33] TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels cs.CV

Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang

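TL;DR: The paper proposes TrackingWorld, a new monocular 3D tracking method that upsamples 2D tracks and back-projects them into a world coordinate system. It achieves dense 3D tracking of almost all pixels and separates camera motion from dynamic foreground motion.

Details

Motivation: Existing monocular 3D tracking methods struggle to separate camera motion from dynamic foreground motion and cannot densely track dynamic objects that newly appear in a video.

Result: Experiments on synthetic and real-world datasets show that the method achieves accurate, dense 3D tracking in a world coordinate frame.

Insight: Combining 2D track upsampling with back-projection into world coordinates is key to improving monocular 3D tracking.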

Abstract: Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
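
The final back-projection step can be illustrated with a pinhole camera model. The paper estimates camera poses and 3D coordinates jointly through optimization; the sketch below skips that and assumes per-frame depth, intrinsics, and camera-to-world poses are already known, purely to show how a 2D track becomes a world-centric trajectory. All function and parameter names are hypothetical.

```python
def backproject_to_world(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Lift pixel (u, v) with known depth into world coordinates.

    rotation (3x3 list of rows) and translation (3-vector) map camera
    coordinates to world coordinates: X_world = R @ X_cam + t.
    """
    x_cam = (u - cx) / fx * depth
    y_cam = (v - cy) / fy * depth
    p_cam = (x_cam, y_cam, depth)
    return tuple(
        sum(rotation[r][c] * p_cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def lift_track(track_2d, depths, intrinsics, poses):
    """Convert one 2D track (a pixel per frame) into a 3D world trajectory."""
    fx, fy, cx, cy = intrinsics
    trajectory = []
    for (u, v), depth, (rotation, translation) in zip(track_2d, depths, poses):
        trajectory.append(
            backproject_to_world(u, v, depth, fx, fy, cx, cy, rotation, translation)
        )
    return trajectory
```

A static scene point observed by a moving camera maps to the same world coordinate in every frame, so any remaining motion in the world-centric trajectory belongs to the object rather than the camera.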


[34] The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss cs.CV

Bozhou Li, Xinda Xue, Sihan Yang, Yang Shi, Xinlong Chen

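TL;DR: The study finds that Pre-Norm MLLMs suffer from a norm disparity between visual and text tokens that causes information loss. It proposes a simple fix, inserting a LayerNorm layer, that significantly improves performance.

Details

Motivation: Multimodal Large Language Models (MLLMs) perform well at fusing vision and language, but the Pre-Norm architecture leaves visual and text tokens with mismatched norms, which seriously impairs cross-modal feature fusion.

Result: Experiments on the LLaVA-1.5 architecture show significant gains on both multimodal and text-only evaluations (such as MMLU).

Insight: Norm alignment not only improves cross-modal fusion but also raises overall model capability, indicating that architectural balance is critical for MLLMs.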

Abstract: Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic – the persistence of norm disparity and the resulting asymmetric update rates – is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
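
The effect of placing a LayerNorm after the visual projector can be seen in a few lines. After LayerNorm with zero bias, a d-dimensional token has L2 norm close to gain * sqrt(d), regardless of its input scale. The sketch below uses that fact; the initialization rule (choosing the gain so visual tokens match a target text-token norm) is an assumption for illustration, since the abstract says only that the layer is "carefully initialized".

```python
import math


def l2_norm(v):
    """Euclidean norm of a token vector."""
    return math.sqrt(sum(x * x for x in v))


def layer_norm(v, gain=1.0, bias=0.0, eps=1e-5):
    """Standard LayerNorm over one token with a scalar gain and bias."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [gain * (x - mean) / math.sqrt(var + eps) + bias for x in v]


def gain_for_target_norm(target_norm, dim):
    """Assumed init: pick the gain so output tokens have the target norm."""
    return target_norm / math.sqrt(dim)
```

For example, a visual token with norm above 100 and a text-token target norm of 2.0 in four dimensions gives `gain_for_target_norm(2.0, 4) == 1.0`, and the normalized visual token then has norm of about 2.0.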


[35] Simultaneous Enhancement and Noise Suppression under Complex Illumination Conditions cs.CVPDF

Jing Tao, You Li, Banglei Guan, Yang Shang, Qifeng Yu

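TL;DR: The paper proposes a new framework that simultaneously enhances images and suppresses noise under complex illumination. It combines a gradient-domain weighted guided filter with the Retinex model and improves image quality by optimizing the illumination and reflection layers; experiments confirm its advantage.

Details

Motivation: Under complex illumination, image quality often degrades because of noise and uneven lighting, and existing methods either amplify noise or work only under specific conditions. The paper aims at a more robust method for image enhancement and denoising.

Result: Experiments show the method outperforms existing techniques in both contrast enhancement and noise suppression.

Insight: Decomposing the image and optimizing the illumination and reflection layers in parallel handles complex lighting more flexibly while suppressing noise.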

Abstract: Under challenging light conditions, captured images often suffer from various degradations, leading to a decline in the performance of vision-based applications. Although numerous methods have been proposed to enhance image quality, they either significantly amplify inherent noise or are only effective under specific illumination conditions. To address these issues, we propose a novel framework for simultaneous enhancement and noise suppression under complex illumination conditions. Firstly, a gradient-domain weighted guided filter (GDWGIF) is employed to accurately estimate illumination and improve image quality. Next, the Retinex model is applied to decompose the captured image into separate illumination and reflection layers. These layers undergo parallel processing, with the illumination layer being corrected to optimize lighting conditions and the reflection layer enhanced to improve image quality. Finally, the dynamic range of the image is optimized through multi-exposure fusion and a linear stretching strategy. The proposed method is evaluated on real-world datasets obtained from practical applications. Experimental results demonstrate that our proposed method achieves better performance compared to state-of-the-art methods in both contrast enhancement and noise suppression.
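The Retinex decomposition mentioned above models an image as the product of an illumination layer and a reflection layer. The sketch below shows that decomposition on a single row of pixel intensities in (0, 1]. It is only an illustration under stated assumptions: a box blur stands in for the paper's gradient-domain weighted guided filter, the illumination correction is a plain gamma curve, and the multi-exposure fusion step is omitted.

```python
def box_blur_1d(signal, radius):
    """Simple box blur, used here as a stand-in illumination estimator."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def retinex_enhance(row, radius=2, gamma=0.5, eps=1e-6):
    """Split a row into illumination and reflectance, brighten the illumination, recombine."""
    illumination = [max(v, eps) for v in box_blur_1d(row, radius)]
    # Retinex model: pixel = illumination * reflectance.
    reflectance = [p / l for p, l in zip(row, illumination)]
    # gamma < 1 brightens dark illumination values that lie in (0, 1].
    corrected = [l ** gamma for l in illumination]
    enhanced = [min(1.0, r * l) for r, l in zip(reflectance, corrected)]
    return illumination, reflectance, enhanced
```

Because only the smooth illumination layer is brightened, fine detail carried by the reflectance layer is rescaled rather than re-estimated, which is the motivation for processing the two layers separately.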


[36] Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries cs.CVPDF

Samitha Nuwan Thilakarathna, Ercan Avsar, Martin Mathias Nielsen, Malte Pedersen

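TL;DR: The paper presents an optimized deep learning pipeline for visual re-identification of fish in fisheries electronic monitoring. Custom image transformations and hard triplet mining, combined with the Swin-T architecture, substantially improve performance.

Details

Motivation: Accurate fisheries data are essential for sustainable marine resource management, but electronic monitoring systems produce far more video than can be reviewed manually. Efficient automated fish re-identification is therefore a key challenge.

Result: Swin-T outperforms ResNet-50, reaching 90.43% Rank-1 accuracy and 41.65% mAP@k. Viewpoint inconsistency among fish of the same species proves more challenging than partial occlusion.

Insight: Visual similarity between individuals of the same species is the main challenge, and viewpoint inconsistency hurts re-identification more than partial occlusion. Vision Transformers perform well on fine-grained classification tasks.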

Abstract: Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates EM systems with conveyor belts with six similar-looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (Intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: https://github.com/msamdk/Fish_Re_Identification.git
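Hard triplet mining and the Rank-1 metric named above can be sketched in a few lines. This is a generic "batch-hard" formulation on toy 2D embeddings, assuming Euclidean distance and a margin of 0.2; the mining variant, margin, and embedding model actually used by the authors are not stated in the abstract.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """For each anchor, use the farthest positive and the closest negative in the batch."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if j != i and labels[j] == label]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != label]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


def rank1_accuracy(queries, q_labels, gallery, g_labels):
    """Fraction of queries whose nearest gallery embedding shares their identity."""
    hits = 0
    for q, ql in zip(queries, q_labels):
        nearest = min(range(len(gallery)), key=lambda k: euclidean(q, gallery[k]))
        hits += int(g_labels[nearest] == ql)
    return hits / len(queries)
```

The loss is zero once every identity cluster is separated from the others by more than the margin, so training effort concentrates on the hardest, most confusable individuals.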


[37] SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos cs.CVPDF

Mingqi Gao, Yunqi Miao, Jungong Han

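TL;DR: SAM-Body4D is a training-free framework that recovers temporally consistent, occlusion-robust 4D human meshes from videos. It generates identity-consistent masklets and refines them with an occlusion-aware module, improving human mesh recovery in videos.

Details

Motivation: Existing image-based HMR methods run per-frame inference on videos, which causes temporal inconsistency and degraded performance under occlusion. The paper addresses this without extra training.

Result: Experiments show SAM-Body4D improves temporal stability and robustness on challenging in-the-wild videos without retraining.

Insight: Exploiting human continuity in videos can improve 4D human mesh recovery without additional training data or model fine-tuning.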

Abstract: Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.


[38] Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval cs.CVPDF

Tao Chen, Shaobo Ju, Qiong Wu, Chenxin Fang, Kun Zhang

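TL;DR: The paper proposes OneClip-RAG, which augments long-video understanding with one-shot clip retrieval. It addresses the memory limits that restrict multimodal large language models (MLLMs) to a limited number of frames and significantly improves performance and efficiency.

Details

Motivation: Because of excessive memory overhead, existing MLLMs can only process videos with a limited number of frames, which limits long-video understanding. An effective and efficient remedy is needed.

Result: OneClip-RAG markedly boosts MLLMs such as InternLV2 8B and Qwen2-VL 7B, reaching GPT-4o level on some tasks, and handles videos of up to an hour efficiently on a single 4090 GPU.

Insight: Combining the completeness and semantic coherence of clip retrieval with an efficient unified chunking algorithm can significantly improve the performance and efficiency of long-video understanding.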

Abstract: Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos with a limited number of frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting InternLV2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.


[39] LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training cs.CVPDF

Qing Xu, Kun Yuan, Yuxiang Luo, Yuhao Zhai, Wenting Duan

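TL;DR: LapFM is a surgical segmentation foundation model built on hierarchical concept evolving pre-training. By constructing a Laparoscopic Concept Hierarchy (LCH) and using confidence-driven evolving labeling, it significantly improves laparoscopic image segmentation.

Details

Motivation: Existing surgical segmentation methods rely on limited annotations and act as domain adapters, so they fail to generalize across diverse surgical targets. LapFM aims to train a segmentation foundation model on large-scale unlabeled data, addressing semantic inconsistency and annotation scarcity.

Result: LapFM significantly outperforms existing methods on universal laparoscopic segmentation and achieves granularity-adaptive generalization.

Insight: A hierarchical concept structure and confidence-driven pseudo-label generation are effective against semantic inconsistency and annotation scarcity in surgical segmentation.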

Abstract: Surgical segmentation is pivotal for scene understanding yet remains hindered by annotation scarcity and semantic inconsistency across diverse procedures. Existing approaches typically fine-tune natural foundation models (e.g., SAM) with limited supervision, functioning merely as domain adapters rather than surgical foundation models. Consequently, they struggle to generalize across the vast variability of surgical targets. To bridge this gap, we present LapFM, a foundation model designed to evolve robust segmentation capabilities from massive unlabeled surgical images. Distinct from medical foundation models relying on inefficient self-supervised proxy tasks, LapFM leverages a Hierarchical Concept Evolving Pre-training paradigm. First, we establish a Laparoscopic Concept Hierarchy (LCH) via a hierarchical mask decoder with parent-child query embeddings, unifying diverse entities (i.e., Anatomy, Tissue, and Instrument) into a scalable knowledge structure with cross-granularity semantic consistency. Second, we propose a Confidence-driven Evolving Labeling that iteratively generates and filters pseudo-labels based on hierarchical consistency, progressively incorporating reliable samples from unlabeled images into training. This process yields LapBench-114K, a large-scale benchmark comprising 114K image-mask pairs. Extensive experiments demonstrate that LapFM significantly outperforms state-of-the-art methods, establishing new standards for granularity-adaptive generalization in universal laparoscopic segmentation. The source code is available at https://github.com/xq141839/LapFM.


[40] Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts cs.CV | cs.LGPDF

Madhav Gupta, Vishak Prasad C, Ganesh Ramakrishnan

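TL;DR: The paper proposes an uncertainty-aware subset selection method for visual explainability, aiming to make explanations of deep vision models more robust and reliable under distribution shift (OOD).

Details

Motivation: Existing subset selection-based explainability methods work well in in-distribution (ID) settings, but under out-of-distribution (OOD) conditions their explanations tend to be redundant, unstable, and sensitive to uncertainty. A method that is more robust in OOD settings is needed.

Result: Experiments show the method not only improves explanations under OOD conditions but also performs better in ID settings.

Insight: Uncertainty-driven optimization can markedly improve the robustness and transparency of visual explanations, offering a new direction for trustworthy AI in real-world applications.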

Abstract: Subset selection-based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID-OOD sets, we find that the reliability of existing subset-based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.
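How uncertainty estimates might steer a submodular selection can be shown with a generic greedy sketch. The objective below is a confidence-weighted coverage function, f(S) = Σ_f w_f · max over selected regions covering f of (1 − uncertainty), which is monotone submodular. It is an assumed stand-in chosen for illustration; the abstract does not give the authors' objective or how their gradient-based uncertainty enters it, and the region and feature names are hypothetical.

```python
def coverage_value(selected, regions, confidence, weights):
    """Confidence-weighted coverage: each feature counts once, at its best covering confidence."""
    total = 0.0
    for feature, w in weights.items():
        best = max((confidence[r] for r in selected if feature in regions[r]), default=0.0)
        total += w * best
    return total


def greedy_select(regions, confidence, weights, k):
    """Standard greedy maximization: repeatedly add the region with the largest marginal gain."""
    selected = []
    for _ in range(k):
        current = coverage_value(selected, regions, confidence, weights)
        best_region, best_gain = None, 0.0
        for r in regions:
            if r in selected:
                continue
            gain = coverage_value(selected + [r], regions, confidence, weights) - current
            if gain > best_gain:
                best_region, best_gain = r, gain
        if best_region is None:
            break  # no remaining region adds value
        selected.append(best_region)
    return selected


# Hypothetical image regions, the features they cover, and confidence = 1 - uncertainty.
regions = {"a": {"f1", "f2"}, "b": {"f1", "f2"}, "c": {"f3"}}
confidence = {"a": 0.9, "b": 0.8, "c": 0.5}
weights = {"f1": 1.0, "f2": 1.0, "f3": 1.0}
chosen = greedy_select(regions, confidence, weights, k=2)
```

In this toy case region "b" duplicates what "a" already covers, so the greedy step prefers the less confident but non-redundant region "c", which is the diversity behavior that submodularity encourages.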


[41] Team-Aware Football Player Tracking with SAM: An Appearance-Based Approach to Occlusion Recovery cs.CVPDF

Chamath Ranasinghe, Uthayasanker Thayasivam

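TL;DR: The paper proposes a lightweight SAM-based football player tracking method that combines CSRT trackers with color-based appearance models for occlusion recovery. It performs well under resource constraints but poorly in long-term occlusion scenarios.

Details

Motivation: Football player tracking faces frequent occlusions, similar appearances, and rapid motion in crowded scenes, and calls for a lightweight, efficient solution.

Result: 100% tracking success under light occlusion and 90% in crowded scenes, but poor recovery from long-term occlusion (8.66%).

Insight: Classical tracking methods work well when players remain continuously visible, but need stronger re-identification mechanisms to handle long-term occlusion.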

Abstract: Football player tracking is challenged by frequent occlusions, similar appearances, and rapid motion in crowded scenes. This paper presents a lightweight SAM-based tracking method combining the Segment Anything Model (SAM) with CSRT trackers and jersey color-based appearance models. We propose a team-aware tracking system that uses SAM for precise initialization and HSV histogram-based re-identification to improve occlusion recovery. Our evaluation measures three dimensions: processing speed (FPS and memory), tracking accuracy (success rate and box stability), and robustness (occlusion recovery and identity consistency). Experiments on football video sequences show that the approach achieves 7.6-7.7 FPS with stable memory usage (~1880 MB), maintaining 100 percent tracking success in light occlusions and 90 percent in crowded penalty-box scenarios with 5 or more players. Appearance-based re-identification recovers 50 percent of heavy occlusions, demonstrating the value of domain-specific cues. Analysis reveals key trade-offs: the SAM + CSRT combination provides consistent performance across crowd densities but struggles with long-term occlusions where players leave the frame, achieving only 8.66 percent re-acquisition success. These results offer practical guidelines for deploying football tracking systems under resource constraints, showing that classical tracker-based methods work well with continuous visibility but require stronger re-identification mechanisms for extended absences.
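The jersey-color re-identification idea can be sketched with the standard library alone. The example below builds a normalized hue histogram per team and assigns a candidate patch to the team with the highest histogram intersection. It is a simplification under stated assumptions: only the hue channel is used (the paper describes HSV histograms), histogram intersection is an assumed similarity measure, and the pixel values and team names are invented.

```python
import colorsys


def hue_histogram(pixels, bins=8):
    """Normalized hue histogram of a non-empty list of (R, G, B) pixels in 0..255."""
    hist = [0.0] * bins
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * bins), bins - 1)] += 1.0
    total = sum(hist)
    return [c / total for c in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def reidentify(candidate_pixels, team_models):
    """Return the team whose stored histogram best matches the candidate patch."""
    hist = hue_histogram(candidate_pixels)
    return max(team_models, key=lambda name: histogram_intersection(hist, team_models[name]))


# Hypothetical appearance models built from jersey pixels seen before the occlusion.
team_models = {
    "red": hue_histogram([(220, 20, 20)] * 10),
    "blue": hue_histogram([(20, 20, 220)] * 10),
}
team = reidentify([(200, 30, 30)] * 5 + [(25, 25, 210)], team_models)
```

A team-level cue like this can narrow the candidates after an occlusion, but it cannot separate teammates, which is consistent with the weak re-acquisition rate the abstract reports for long absences.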


[42] Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform cs.CV | cs.AI | cs.GRPDF

Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li

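TL;DR: Visionary is a WebGPU-based Gaussian Splatting (3DGS) platform that provides lightweight, real-time, in-browser rendering with support for dynamic content, lowering the barrier to deploying 3DGS methods.

Details

Motivation: Existing 3DGS rendering solutions are fragmented, heavy, or constrained by legacy pipelines, which makes deployment difficult and limits support for dynamic content and generative models.

Result: With identical 3DGS assets, Visionary achieves higher rendering efficiency thanks to GPU-based primitive sorting, and it supports multiple variants including MLP-based 3DGS and 4DGS.

Insight: By unifying inference and rendering directly in the browser, Visionary eases the reproduction, comparison, and deployment of 3DGS methods and serves as a unified carrier for reconstructive and generative paradigms.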

Abstract: Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time rendering of various Gaussian Splatting variants and meshes. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, “click-to-run” browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug-in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.


[43] On-the-fly Large-scale 3D Reconstruction from Multi-Camera Rigs cs.CVPDF

Yijia Guo, Tong Hu, Zhiwei Li, Liwen Hu, Keming Qian

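TL;DR: The paper proposes an on-the-fly large-scale 3D reconstruction framework for multi-camera rigs. It incrementally fuses multi-camera RGB streams into a unified 3D Gaussian representation, achieving efficient, drift-free reconstruction.

Details

Motivation: Existing on-the-fly 3D reconstruction methods based on a monocular camera cannot achieve complete 3D coverage because of the limited field of view; a multi-camera rig addresses this limitation.

Result: The method reconstructs hundreds of meters of 3D scenes in just 2 minutes, demonstrating efficient, robust, and high-quality on-the-fly reconstruction.

Insight: Multi-camera rigs combined with a 3D Gaussian representation are an efficient solution for on-the-fly 3D reconstruction of large-scale scenes.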

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled efficient free-viewpoint rendering and photorealistic scene reconstruction. While on-the-fly extensions of 3DGS have shown promise for real-time reconstruction from monocular RGB streams, they often fail to achieve complete 3D coverage due to the limited field of view (FOV). Employing a multi-camera rig fundamentally addresses this limitation. In this paper, we present the first on-the-fly 3D reconstruction framework for multi-camera rigs. Our method incrementally fuses dense RGB streams from multiple overlapping cameras into a unified Gaussian representation, achieving drift-free trajectory estimation and efficient online reconstruction. We propose a hierarchical camera initialization scheme that enables coarse inter-camera alignment without calibration, followed by a lightweight multi-camera bundle adjustment that stabilizes trajectories while maintaining real-time performance. Furthermore, we introduce a redundancy-free Gaussian sampling strategy and a frequency-aware optimization scheduler to reduce the number of Gaussian primitives and the required optimization iterations, thereby maintaining both efficiency and reconstruction fidelity. Our method reconstructs hundreds of meters of 3D scenes within just 2 minutes using only raw multi-camera video streams, demonstrating unprecedented speed, robustness, and fidelity for on-the-fly 3D scene reconstruction.


[44] Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models cs.CV | cs.AIPDF

Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim

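TL;DR: The paper proposes ReasonBreak, a new adversarial framework that uses concept-aware perturbations to disrupt hierarchical reasoning in multimodal large reasoning models (MLRMs) and thereby protect geographic privacy. It also contributes GeoPrivacy-6K, a dataset of 6,341 ultra-high-resolution images with hierarchical concept annotations, and validates the method on several MLRMs.

Details

Motivation: Existing privacy protection techniques mainly target perception-based models and are ineffective against the privacy risk posed by MLRMs, which infer geographic location through multi-step hierarchical reasoning. A protection method designed specifically for hierarchical reasoning is therefore needed.

Result: Experiments show that ReasonBreak significantly outperforms existing methods at protecting geographic privacy, e.g., a 14.4% improvement in tract-level protection (33.8% vs 19.4%) and nearly doubled block-level protection (33.5% vs 16.8%).

Insight: The paper exposes the limits of traditional privacy protection against hierarchical reasoning models and highlights the importance of concept-aware perturbations, offering a new direction for privacy protection against reasoning-based threats.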

Abstract: Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs’ sophisticated multi-step reasoning processes that analyze environmental cues. We introduce ReasonBreak, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute GeoPrivacy-6K, a comprehensive dataset comprising 6,341 ultra-high-resolution images ($\geq$2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak’s superior effectiveness, achieving a 14.4% improvement in tract-level protection (33.8% vs 19.4%) and nearly doubling block-level protection (33.5% vs 16.8%). This work establishes a new paradigm for privacy protection against reasoning-based threats.


[45] OCCDiff: Occupancy Diffusion Model for High-Fidelity 3D Building Reconstruction from Noisy Point Clouds cs.CVPDF

Jialu Sui, Rui Liu, Hongsheng Zhang

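TL;DR: OCCDiff uses a latent diffusion model and a function autoencoder architecture, together with a multi-task training strategy, to achieve high-fidelity 3D building reconstruction from noisy point clouds.

Details

Motivation: When reconstructing buildings from LiDAR point clouds, uneven point density and noise lead to inaccurate surface reconstruction. OCCDiff aims to solve this and achieve high-fidelity 3D building reconstruction.

Result: Experiments show OCCDiff generates physically consistent, high-fidelity samples and is highly robust to noisy data.

Insight: By combining latent diffusion with the function space, OCCDiff offers a flexible and robust approach to point cloud reconstruction that suits multi-resolution scenarios.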

Abstract: A major challenge in reconstructing buildings from LiDAR point clouds lies in accurately capturing building surfaces under varying point densities and noise interference. To flexibly gather high-quality 3D profiles of the building at diverse resolutions, we propose OCCDiff, which applies latent diffusion in the occupancy function space. Our OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. Moreover, a point encoder is proposed to provide condition features to diffusion learning, constrain the final occupancy prediction for the occupancy decoder, and insert multi-modal features for latent generation into the latent encoder. To further enhance the model performance, a multi-task training strategy is employed, ensuring that the point encoder learns diverse and robust feature representations. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy data.


[46] Thinking with Images via Self-Calling Agent cs.CVPDF

Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye

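TL;DR: The paper proposes Self-Calling Chain-of-Thought (sCoT), which decomposes multimodal visual reasoning tasks into atomic subtasks solved by virtual subagents, avoiding explicit interleaving between modalities and improving training efficiency and performance.

Details

Motivation: Existing chain-of-thought (CoT) methods for multimodal visual reasoning are hard to optimize when modalities are interleaved, and high-quality reasoning data is scarce.

Result: On HR-Bench 4K, sCoT improves reasoning performance by up to 1.9% while using about 75% fewer GPU hours for training.

Insight: Converting multimodal tasks into language-only CoT through a self-calling mechanism avoids explicit modality interleaving and improves training efficiency and generalization.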

Abstract: Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
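The group-relative policy optimization mentioned above scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that advantage computation follows, assuming the common formulation (reward minus group mean, divided by group standard deviation); the reward values are invented, and the rest of the training loop (policy ratio, clipping, KL term) is omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four reasoning traces on one question (1.0 = correct answer).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every trace earns the same reward contributes zero advantage.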


[47] PaintFlow: A Unified Framework for Interactive Oil Paintings Editing and Generation cs.CVPDF

Zhangli Hu, Ye Chen, Jiajun Yao, Bingbing Ni

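TL;DR: PaintFlow proposes a unified multimodal framework that lets users interactively generate and edit oil paintings through reference images, hand-drawn sketches, and natural language prompts while maintaining a unified painting style.

Details

Motivation: The intricate brushstroke dynamics and stylized characteristics of oil painting make digital generation and editing challenging; existing methods are constrained by the training data distribution and mainly target modifications of real photographs.

Result: Experiments show the system enables fine-grained editing, preserves the artistic qualities of oil paintings, and achieves an unprecedented level of imaginative expression in stylized generation and editing.

Insight: Combining multimodal inputs (images, sketches, text) with self-supervised style transfer is a novel direction for improving oil painting generation and editing.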

Abstract: Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with a spatial alignment and semantic enhancement conditioning strategy, which maps masks and sketches into spatial constraints and encodes contextual embeddings from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil painting generation and editing.
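The AdaIN operator used at inference re-normalizes content features so that their mean and standard deviation match those of the style features: AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y). The sketch below applies it to one feature channel stored as a flat list; real implementations do this per channel over spatial positions, and the feature values here are invented.

```python
import math


def mean_std(x, eps=1e-5):
    """Mean and (eps-stabilized) standard deviation of one feature channel."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, math.sqrt(var + eps)


def adain(content, style):
    """Adaptive Instance Normalization for a single channel."""
    c_mu, c_sigma = mean_std(content)
    s_mu, s_sigma = mean_std(style)
    return [s_sigma * (v - c_mu) / c_sigma + s_mu for v in content]


# Hypothetical content and style activations for one channel.
stylized = adain([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

The relative ordering of the content activations is preserved while their statistics take on the style's, which is why the operator suits enforcing a consistent painting style across outputs.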


[48] Fast-ARDiff: An Entropy-informed Acceleration Framework for Continuous Space Autoregressive Generation cs.CVPDF

Zhen Zou, Xiaoxiao Ma, Jie Huang, Zichao Yu, Feng Zhao

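TL;DR: Fast-ARDiff proposes a unified AR-diffusion framework that uses an entropy-informed acceleration strategy and a dynamic scheduler to significantly reduce the latency of AR generation and diffusion decoding, achieving lossless acceleration and efficient synthesis.

Details

Motivation: Existing AR-diffusion hybrid methods suffer from high latency due to sequential AR generation and iterative diffusion denoising, which limits their practical use.

Result: 4.3× lossless speedup on ImageNet 256×256 and 3× acceleration on text-conditioned generation.

Insight: With entropy alignment and end-to-end optimization, AR-diffusion hybrids can achieve efficient, high-quality generation.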

Abstract: Autoregressive (AR)-diffusion hybrid paradigms combine AR’s structured modeling with diffusion’s photorealistic synthesis, yet suffer from high latency due to sequential AR generation and iterative denoising. In this work, we tackle this bottleneck and propose a unified AR-diffusion framework Fast-ARDiff that jointly optimizes both components, accelerating AR speculative decoding while simultaneously facilitating faster diffusion decoding. Specifically: (1) The entropy-informed speculative strategy encourages the draft model to produce higher-entropy representations aligned with the target model’s entropy characteristics, mitigating entropy mismatch and high rejection rates caused by draft overconfidence. (2) For diffusion decoding, rather than treating it as an independent module, we integrate it into the same end-to-end framework using a dynamic scheduler that prioritizes AR optimization to guide the diffusion part in further steps. The diffusion part is optimized through a joint distillation framework combining trajectory and distribution matching, ensuring stable training and high-quality synthesis with extremely few steps. During inference, shallow feature entropy from the AR module is used to pre-filter low-entropy drafts, avoiding redundant computation and improving latency. Fast-ARDiff achieves state-of-the-art acceleration across diverse models: on ImageNet 256$\times$256, TransDiff attains 4.3$\times$ lossless speedup, and NextStep-1 achieves 3$\times$ acceleration on text-conditioned generation. Code will be available at https://github.com/aSleepyTree/Fast-ARDiff.


[49] An Iteration-Free Fixed-Point Estimator for Diffusion Inversion cs.CVPDF

Yifei Chen, Kaiyu Song, Yan Pan, Jianxing Yu, Jian Yin

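TL;DR: The paper proposes an iteration-free fixed-point estimator that addresses the high computational cost and hyperparameter selection problems of diffusion inversion, using error approximation to obtain an efficient, unbiased noise estimate.

Details

Motivation: Fixed-point iteration in diffusion inversion can minimize reconstruction error, but its iterative nature makes it computationally expensive and its hyperparameters hard to choose. The paper aims at an efficient inversion method that needs no iteration.

Result: On the NOCAPS and MS-COCO datasets, the method outperforms DDIM inversion and other fixed-point iteration methods in reconstruction tasks, without additional iterations or training.

Insight: Through error approximation, the fixed-point estimator avoids iteration while remaining unbiased with low variance, giving an efficient solution for diffusion inversion.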

Abstract: Diffusion inversion aims to recover the initial noise corresponding to a given image such that this noise can reconstruct the original image through the denoising diffusion process. The key component of diffusion inversion is to minimize errors at each inversion step, thereby mitigating cumulative inaccuracies. Recently, fixed-point iteration has emerged as a widely adopted approach to minimize reconstruction errors at each inversion step. However, it suffers from high computational costs due to its iterative nature and the complexity of hyperparameter selection. To address these issues, we propose an iteration-free fixed-point estimator for diffusion inversion. First, we derive an explicit expression of the fixed point from an ideal inversion step. Unfortunately, it inherently contains an unknown data prediction error. Building upon this, we introduce the error approximation, which uses the calculable error from the previous inversion step to approximate the unknown error at the current inversion step. This yields a calculable, approximate expression for the fixed point, which is an unbiased estimator characterized by low variance, as shown by our theoretical analysis. We evaluate reconstruction performance on two text-image datasets, NOCAPS and MS-COCO. Compared to DDIM inversion and other inversion methods based on the fixed-point iteration, our method achieves consistent and superior performance in reconstruction tasks without additional iterations or training.
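For context, the baseline that the paper sets out to replace can be sketched on a scalar toy problem. Inverting one deterministic DDIM step requires solving the implicit equation x_t = c1 · x_{t-1} + c2 · ε(x_t), where c1 = sqrt(a_t / a_{t-1}) and c2 = sqrt(1 − a_t) − c1 · sqrt(1 − a_{t-1}); fixed-point iteration solves it by repeated substitution. The noise predictor `toy_eps` and the schedule values are hypothetical stand-ins, and this sketch does not implement the paper's iteration-free estimator.

```python
import math


def toy_eps(x):
    """Hypothetical stand-in for the noise predictor eps_theta(x_t, t)."""
    return 0.5 * math.tanh(x)


def ddim_step(x_t, a_t, a_prev, eps_fn=toy_eps):
    """Deterministic DDIM denoising step x_t -> x_{t-1}; a_* are cumulative alphas."""
    eps = eps_fn(x_t)
    x0 = (x_t - math.sqrt(1 - a_t) * eps) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * eps


def invert_fixed_point(x_prev, a_t, a_prev, iters=20, eps_fn=toy_eps):
    """Solve x_t = c1 * x_prev + c2 * eps(x_t) by fixed-point iteration."""
    c1 = math.sqrt(a_t / a_prev)
    c2 = math.sqrt(1 - a_t) - c1 * math.sqrt(1 - a_prev)
    x_t = x_prev  # with iters=1 this reduces to plain DDIM inversion
    for _ in range(iters):
        x_t = c1 * x_prev + c2 * eps_fn(x_t)
    return x_t


x_prev = 0.8
err_one = abs(ddim_step(invert_fixed_point(x_prev, 0.5, 0.6, iters=1), 0.5, 0.6) - x_prev)
err_many = abs(ddim_step(invert_fixed_point(x_prev, 0.5, 0.6, iters=20), 0.5, 0.6) - x_prev)
```

Each extra iteration costs one more noise-predictor call, which for a real diffusion model is a full network forward pass; that per-step cost is what motivates an estimator that needs no iteration.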


[50] BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain cs.CVPDF

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba

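TL;DR: The paper proposes BrainExplore, a large-scale automated framework for discovering and explaining visual representations in the human brain, addressing the small scale and reliance on manual inspection of traditional studies.

Details

Motivation: Understanding how the human brain represents visual concepts, and in which regions they are encoded, is a long-standing challenge; traditional studies are limited by scale, complexity, and manual methods and cannot explore the question systematically.

Result: The framework reveals thousands of interpretable visual patterns spanning many visual concepts, including fine-grained representations not previously reported.

Insight: Automated methods can explore complex brain signals efficiently and at scale, providing a new tool for understanding visual representations.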

Abstract: Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.


[51] Modular Neural Image Signal Processing cs.CVPDF

Mahmoud Afifi, Zhongling Wang, Ran Zhang, Michael S. Brown

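TL;DR: The paper proposes a modular neural image signal processing (ISP) framework that processes raw inputs and produces high-quality display-referred images through a highly modular design, improving rendering accuracy, scalability, and flexibility for user customization.

Details

Motivation: Existing neural ISP designs lack modularity, which limits control over and debugging of the rendering process. The paper addresses this through modular design while supporting diverse user preferences and editing operations.

Result: The method delivers competitive qualitative and quantitative results across multiple test sets and supports unlimited post-editable re-rendering.

Insight: Modular design not only improves rendering quality but also gives ISP systems more flexibility for debugging, extension, and user customization.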

Abstract: This paper presents a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images. Unlike prior neural ISP designs, our method introduces a high degree of modularity, providing full control over multiple intermediate stages of the rendering process. This modular design not only achieves high rendering accuracy but also improves scalability, debuggability, generalization to unseen cameras, and flexibility to match different user-preference styles. To demonstrate the advantages of this design, we built a user-interactive photo-editing tool that leverages our neural ISP to support diverse editing operations and picture styles. The tool is carefully engineered to take advantage of the high-quality rendering of our neural ISP and to enable unlimited post-editable re-rendering. Our method is a fully learning-based framework with variants of different capacities, all of moderate size (ranging from ~0.5 M to ~3.9 M parameters for the entire pipeline), and consistently delivers competitive qualitative and quantitative results across multiple test sets. Watch the supplemental video at: https://youtu.be/ByhQjQSjxVM


[52] Instance-Aware Test-Time Segmentation for Continual Domain Shifts cs.CVPDF

Seunghwan Lee, Inyoung Jung, Hojoon Lee, Eunil Park, Sungeun Hong

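TL;DR: The paper proposes an instance-aware test-time segmentation method that dynamically adjusts pseudo labels and balances learning to improve semantic segmentation under continual domain shifts.

Details

Motivation: Existing methods for continual domain shifts typically rely on fixed or batch-level thresholds, which cannot account for varying difficulty across classes and instances and lead to poor semantic segmentation.

Result: Across eight CTTA and TTA scenarios, the method clearly outperforms existing techniques, especially on synthetic-to-real and long-term shifts.

Insight: Instance- and class-aware adaptation strategies are critical for improving semantic segmentation under continual domain shifts.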

Abstract: Continual Test-Time Adaptation (CTTA) enables pre-trained models to adapt to continuously evolving domains. Existing methods have improved robustness but typically rely on fixed or batch-level thresholds, which cannot account for varying difficulty across classes and instances. This limitation is especially problematic in semantic segmentation, where each image requires dense, multi-class predictions. We propose an approach that adaptively adjusts pseudo labels to reflect the confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts. This fine-grained, class- and instance-aware adaptation produces more reliable supervision and mitigates error accumulation throughout continual adaptation. Extensive experiments across eight CTTA and TTA scenarios, including synthetic-to-real and long-term shifts, show that our method consistently outperforms state-of-the-art techniques, setting a new standard for semantic segmentation under evolving conditions.
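The contrast between a fixed threshold and an image-adaptive one can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name, the `keep_ratio` parameter, and the per-class top-fraction rule are all assumptions chosen only to show how a threshold might follow the confidence distribution inside a single image.

```python
from collections import defaultdict

IGNORE = 255  # conventional "ignore" index in segmentation losses


def image_adaptive_pseudo_labels(pred_classes, confidences, keep_ratio=0.5):
    """Keep, per class, roughly the top `keep_ratio` most confident pixels
    of THIS image. Hypothetical rule, used here only for illustration."""
    by_class = defaultdict(list)
    for cls, conf in zip(pred_classes, confidences):
        by_class[cls].append(conf)

    # One threshold per class, derived from this image's own confidences.
    thresholds = {}
    for cls, confs in by_class.items():
        ranked = sorted(confs, reverse=True)
        k = max(1, round(keep_ratio * len(ranked)))
        thresholds[cls] = ranked[k - 1]

    labels = [cls if conf >= thresholds[cls] else IGNORE
              for cls, conf in zip(pred_classes, confidences)]
    return labels, thresholds


# Six pixels: class 0 is predicted confidently, class 1 is not.
labels, thresholds = image_adaptive_pseudo_labels(
    [0, 0, 0, 0, 1, 1], [0.9, 0.8, 0.4, 0.3, 0.55, 0.35])
```

In this toy input, a fixed global threshold of 0.6 would discard every class-1 pixel, whereas the per-class threshold still keeps the most confident one.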


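[53] From Cells to Survival: Hierarchical Analysis of Cell Inter-Relations in Multiplex Microscopy for Lung Cancer Prognosis cs.CV

Olle Edgren Schüllerqvist, Jens Baumann, Joakim Lindblad, Love Nordling, Artur Mezheyeuski

TL;DR: The paper proposes HiGINE, a hierarchical graph-based method that predicts lung cancer patient survival from multiplex immunofluorescence (mIF) images and improves risk stratification by combining cell-type and morphology information.

Details

Motivation: The tumor microenvironment (TME) has become an important source of prognostic biomarkers, but existing methods do not fully capture the complex interactions between different cell types, which limits its potential.

Result: HiGINE is validated on two public datasets, showing improved risk stratification, robustness, and generalizability.

Insight: Capturing multi-level inter-relations between cells and fusing them with clinical data can effectively improve prognostic prediction for lung cancer.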

Abstract: The tumor microenvironment (TME) has emerged as a promising source of prognostic biomarkers. To fully leverage its potential, analysis methods must capture complex interactions between different cell types. We propose HiGINE – a hierarchical graph-based approach to predict patient survival (short vs. long) from TME characterization in multiplex immunofluorescence (mIF) images and enhance risk stratification in lung cancer. Our model encodes both local and global inter-relations in cell neighborhoods, incorporating information about cell types and morphology. Multimodal fusion, aggregating cancer stage with mIF-derived features, further boosts performance. We validate HiGINE on two public datasets, demonstrating improved risk stratification, robustness, and generalizability.


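[54] Disturbance-Free Surgical Video Generation from Multi-Camera Shadowless Lamps for Open Surgery cs.CV | cs.AI | cs.LG | cs.RO

Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita

TL;DR: The paper proposes an automated method that takes open-surgery video captured by multiple cameras mounted on a shadowless lamp, automatically aligns the views, and selects the least-occluded one to generate high-quality surgical video.

Details

Motivation: Open-surgery videos are important for education and research, but with conventional methods the camera is often occluded and alignment must be adjusted manually, which is inefficient.

Result: A user study shows that the generated videos are superior to those from conventional methods in ease of confirming the surgical area and in viewing comfort, and video quality also improves.

Insight: Combining multiple cameras with automatic alignment effectively addresses occlusion in surgical video, and the analysis of user preferences points to directions for future optimization.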

Abstract: Video recordings of open surgeries are in high demand for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons’ preferences for each option.


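[55] Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning cs.CV | cs.AI

Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li

TL;DR: This paper introduces empty prompts to reduce the bias that template-sample similarity (TSS) causes in CLIP few-shot learning, improving classification accuracy and robustness.

Details

Motivation: The study finds that TSS biases CLIP, leading the model to rely on the template rather than on true sample-to-category alignment, which hurts few-shot performance.

Result: Experiments show that the method significantly reduces TSS-induced performance fluctuations and improves classification accuracy and robustness.

Insight: Empty prompts are a simple but effective way to quantify and correct template bias in CLIP, offering a new research direction for few-shot learning.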

Abstract: The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of “emptiness” without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.


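[56] Trajectory Densification and Depth from Perspective-based Blur cs.CV

Tianchen Qiu, Qirun Zhang, Jiajian He, Zhengyue Zhuge, Jiahui Xu

TL;DR: The paper proposes estimating depth by analyzing blur patterns and dense trajectories in a video stream, combining an optical design algorithm with a vision-language model to achieve high-accuracy depth estimation and trajectory reconstruction.

Details

Motivation: Without a mechanical stabilizer, a camera undergoes unavoidable rotational motion during capture, which causes perspective-based blur, especially in long exposures. Because this blur depends on an object's depth position, it can be used for depth estimation.

Result: The method performs well on multiple depth datasets and maintains high accuracy over a large depth range; in handheld shooting, its trajectory reconstruction is more accurate than baseline methods.

Insight: Perspective-based blur is depth-dependent, so analyzing blur patterns enables depth estimation; combining deep learning with traditional optical design improves the reconstruction accuracy of sparse trajectories.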

Abstract: In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capturing, which induces perspective-based blur, especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream and the dense trajectory via a joint optical design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate the depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.


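[57] Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning cs.CV | cs.AI

Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu

TL;DR: This paper proposes a unified framework for spatial, temporal, and embodied reasoning for monocular vision-language navigation (VLN) on unmanned aerial vehicles (UAVs). It relies only on monocular RGB images and natural language instructions and improves performance through multi-task learning and keyframe selection.

Details

Motivation: Existing UAV VLN methods usually rely on panoramic images, depth inputs, or odometry, which increases system cost and complexity. This paper aims for a lightweight solution that uses only monocular RGB images and natural language instructions to facilitate practical deployment.

Result: The method performs strongly on the Aerial VLN benchmark, significantly outperforming existing RGB-only baselines and narrowing the gap with RGB-D methods.

Insight: Keyframe selection and action merging can effectively improve lightweight navigation systems, and multi-task learning is a powerful tool for optimizing navigation tasks.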

Abstract: Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.


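[58] Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation cs.CV

Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro

TL;DR: The paper proposes the Chain-of-Image Generation (CoIG) framework, which makes image generation more monitorable and controllable by decomposing it into semantic steps that resemble how humans create art.

Details

Motivation: The internal process of current image generation models is a “black box” that cannot be readily observed or intervened in, which limits model reliability, safety, and control.

Result: Experiments show that CoIG substantially improves the monitorability of generation while remaining competitive with baseline models in compositional robustness.

Insight: Treating image generation as a sequence of human-like semantic steps not only improves transparency and controllability but also avoids the entity collapse that arises as tasks become more complex.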

Abstract: While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a “black box.” This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.


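[59] Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank cs.CV

Shaofeng Zhang, Xuanqi Chen, Ning Liao, Haoxiang Zhao, Xiaoxing Wang

TL;DR: Repulsor is a contrastive memory bank framework that needs no external encoder and substantially speeds up generative model training while improving generation quality.

Details

Motivation: Denoising generative models (e.g., diffusion, flow matching) dominate visual synthesis, but they are costly to train and learn representations inefficiently. Existing methods depend on external pre-trained encoders, which add overhead and domain shift.

Result: On ImageNet-256, Repulsor reaches an FID of 2.40 within 400k steps, significantly outperforming comparable methods.

Insight: Repulsor shows that, with efficient negative-sample management and a self-contained design, generative models can improve performance and training efficiency without external dependencies.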

Abstract: The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersed-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose Repulsor, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. Repulsor offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, Repulsor achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.
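A toy sketch of a queue-based memory bank can show how the number of negatives becomes independent of the mini-batch size. It is not the paper's implementation: the class name, the InfoNCE loss form, the choice of positive, and the temperature are assumptions.

```python
import math
from collections import deque


class NegativeMemoryBank:
    """FIFO queue of past embeddings (hypothetical helper, not the paper's code)."""

    def __init__(self, capacity):
        # deque(maxlen=...) evicts the oldest entries once the queue is full.
        self.queue = deque(maxlen=capacity)

    def enqueue(self, embeddings):
        self.queue.extend(embeddings)

    def negatives(self):
        return list(self.queue)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (assumed loss form)."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


bank = NegativeMemoryBank(capacity=3)
bank.enqueue([[1.0, 0.0], [0.0, 1.0]])   # "iteration 1" embeddings
bank.enqueue([[0.5, 0.5], [0.2, 0.8]])   # "iteration 2" evicts the oldest entry
loss = info_nce([1.0, 0.0], [1.0, 0.0], bank.negatives())
```

The bank capacity, not the batch size, bounds how many negatives enter the loss, and each training step only pays for the dot products against the queue.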


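[60] What really matters for person re-identification? A Mixture-of-Experts Framework for Semantic Attribute Importance cs.CV

Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos

TL;DR: The paper proposes MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for ReID, reveals the contribution of key attributes such as clothing color, and provides an interpretability framework.

Details

Motivation: Existing person re-identification (ReID) methods achieve high accuracy but still lack interpretability; the authors aim to fill this gap by quantifying the role of semantic attributes in ReID.

Result: The framework achieves competitive performance on Market-1501 and DukeMTMC and shows that frequent attributes such as clothing color contribute significantly to ReID, whereas infrequent attributes such as accessories have little effect.

Insight: The role of semantic attributes in ReID can be analyzed systematically with quantitative methods; frequent attributes such as color contribute more to performance, which offers guidance for designing interpretable ReID models.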

Abstract: State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at https://github.com/psaltaath/MoSAIC-ReID


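[61] SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images cs.CV

Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng

TL;DR: The paper explores applying SAM 3, without any training, to open-vocabulary semantic segmentation (OVSS) of remote sensing images, proposing a mask fusion strategy and presence-score-based filtering that significantly improve performance.

Details

Motivation: Existing CLIP-based OVSS methods suffer from imprecise localization or complex pipelines in remote sensing scenes; SAM 3 makes unified recognition and segmentation possible.

Result: The method's effectiveness is validated on multiple remote sensing datasets, demonstrating the potential of SAM 3 for remote sensing OVSS.

Insight: SAM 3's unified framework offers a simple and efficient solution for complex segmentation tasks in remote sensing scenes, achieving high performance without training.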

Abstract: Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3’s semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
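The two steps described above, fusing the outputs of two heads and dropping categories judged absent from the scene, can be sketched on toy data. This sketch does not call SAM 3. The function name, the element-wise maximum as the fusion rule, and the 0.5 presence threshold are assumptions, and the score maps are hand-written lists with one score per pixel.

```python
def fuse_and_filter(semantic, instance, presence, presence_threshold=0.5):
    """Fuse two per-category score maps and label each pixel.

    semantic / instance: {category: [score per pixel]} from two heads.
    presence: {category: scene-level presence score}.
    Fusion by element-wise max is an assumed rule, not the paper's.
    """
    # Step 1: drop categories that are unlikely to appear in the scene.
    kept = [c for c in semantic if presence[c] >= presence_threshold]
    if not kept:
        return []

    # Step 2: fuse the two heads for each remaining category.
    fused = {c: [max(s, i) for s, i in zip(semantic[c], instance[c])]
             for c in kept}

    # Step 3: label every pixel with its best-scoring category.
    n_pixels = len(fused[kept[0]])
    return [max(kept, key=lambda c: fused[c][p]) for p in range(n_pixels)]


semantic = {"building": [0.2, 0.6], "water": [0.9, 0.1], "road": [0.3, 0.2]}
instance = {"building": [0.7, 0.1], "water": [0.0, 0.0], "road": [0.1, 0.8]}
presence = {"building": 0.9, "water": 0.1, "road": 0.8}
labels = fuse_and_filter(semantic, instance, presence)
```

Without the presence filter, the spurious "water" score of 0.9 would win the first pixel, which is the kind of false positive the filtering step is meant to suppress.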


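[62] A Scalable Pipeline Combining Procedural 3D Graphics and Guided Diffusion for Photorealistic Synthetic Training Data Generation in White Button Mushroom Segmentation cs.CV

Artúr I. Károly, Péter Galambos

TL;DR: This paper proposes a new workflow that combines 3D rendering with a guided diffusion model to generate high-quality, photorealistic synthetic training data for mushrooms without requiring specialized computer graphics expertise, and it achieves state-of-the-art mushroom segmentation in a zero-shot setting.

Details

Motivation: Industrial mushroom cultivation relies on computer vision for monitoring and automated harvesting, but large, precisely annotated real datasets are expensive to obtain. Synthetic data is a scalable alternative, yet it often lacks sufficient realism. This paper aims to solve that problem.

Result: Mask R-CNN models trained on the synthetic data achieve state-of-the-art segmentation on two real-world datasets, reaching F1 = 0.859 on M18K.

Insight: The method applies beyond mushrooms and can be extended to object detection tasks in other agricultural domains, demonstrating the potential of synthetic data for computer vision.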

Abstract: Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting. However, developing accurate detection and segmentation models requires large, precisely annotated datasets that are costly to produce. Synthetic data provides a scalable alternative, yet often lacks sufficient realism to generalize to real-world scenarios. This paper presents a novel workflow that integrates 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus Bisporus mushrooms. This approach preserves full control over 3D scene configuration and annotations while achieving photorealism without the need for specialized computer graphics expertise. We release two synthetic datasets (each containing 6,000 images depicting over 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. When tested on two independent real-world datasets (including a newly collected benchmark), our method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K), despite using only synthetic training data. Although the approach is demonstrated on Agaricus Bisporus mushrooms, the proposed pipeline can be readily adapted to other mushroom species or to other agricultural domains, such as fruit and leaf detection.


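[63] Skewness-Guided Pruning of Multimodal Swin Transformers for Federated Skin Lesion Classification on Edge Devices cs.CV | cs.DC

Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos

TL;DR: This paper proposes a skewness-guided pruning method for deploying federated multimodal Swin Transformer models on edge devices, achieving about 36% model compression with no loss in accuracy.

Details

Motivation: High-performance computer vision models excel in medical imaging but are computationally intensive and large, making them unsuitable for edge deployment. Privacy constraints also motivate federated learning.

Result: Experiments show that model size is reduced by about 36% with no drop in classification accuracy.

Insight: Skewness-guided pruning enables efficient model compression together with privacy-preserving distributed learning, making it suitable for edge deployment.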

Abstract: In recent years, high-performance computer vision models have achieved remarkable success in medical imaging, with some skin lesion classification systems even surpassing dermatology specialists in diagnostic accuracy. However, such models are computationally intensive and large in size, making them unsuitable for deployment on edge devices. In addition, strict privacy constraints hinder centralized data management, motivating the adoption of Federated Learning (FL). To address these challenges, this study proposes a skewness-guided pruning method that selectively prunes the Multi-Head Self-Attention and Multi-Layer Perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions. The proposed method was validated in a horizontal FL environment and shown to maintain performance while substantially reducing model complexity. Experiments on the compact Swin Transformer demonstrate approximately 36% model size reduction with no loss in accuracy. These findings highlight the feasibility of achieving efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices.
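The statistic that guides the pruning can be made concrete with a small sketch that computes the sample skewness of each layer's outputs. The abstract does not state which skewness values trigger pruning, so the selection rule below (flag layers whose absolute skewness is at most a threshold) is purely an assumption, as are the function names and layer names.

```python
def skewness(values):
    """Sample skewness g1 = m3 / m2**1.5 (Fisher-Pearson coefficient)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0.0:
        return 0.0  # constant output: treat as unskewed
    return m3 / m2 ** 1.5


def pruning_candidates(layer_outputs, max_abs_skew=0.5):
    """Return layer names whose output skewness is small in magnitude.

    Assumed criterion for illustration only; the paper's actual rule
    may differ in both direction and threshold.
    """
    return [name for name, outputs in layer_outputs.items()
            if abs(skewness(outputs)) <= max_abs_skew]


layer_outputs = {
    "mlp_0": [1.0, 2.0, 3.0, 4.0, 5.0],    # symmetric outputs
    "attn_0": [1.0, 1.0, 1.0, 1.0, 10.0],  # heavy right tail
}
candidates = pruning_candidates(layer_outputs)
```

In a federated setting, each client could compute such statistics locally and share only the per-layer numbers, although how the paper aggregates them is not described in the abstract.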


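[64] Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance cs.CV

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu

TL;DR: Wan-Move is a framework for motion-controllable video generation via latent trajectory guidance. It addresses the coarse control granularity and limited scalability of existing methods and supports high-quality, fine-grained motion control.

Details

Motivation: Existing video generation models suffer from coarse control granularity and poor scalability in motion control, which limits their practical usefulness. Wan-Move aims to achieve fine-grained, high-quality motion control by directly leveraging motion-aware features.

Result: Wan-Move generates 5-second, 480p videos whose motion controllability rivals the Motion Brush of the commercial tool Kling 1.5 Pro. Experiments show that it outperforms existing methods on MoveBench and a public dataset.

Insight: Propagating features directly in latent space is an efficient and scalable way to achieve motion control; it removes the need for an extra motion encoder while preserving the flexibility of video generation.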

Abstract: We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame’s features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro’s commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move’s superior motion quality. Code, models, and benchmark data are made publicly available.
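The propagation step, copying first-frame features along each point trajectory to build a spatiotemporal condition map, can be sketched with nested lists. This is a simplified illustration, not the Wan-Move code. It assumes trajectories are already given in integer latent-grid coordinates, fills untouched cells with zeros, and lets later tracks overwrite earlier ones where they overlap.

```python
def propagate_first_frame(first_frame, trajectories, num_frames):
    """Build a [time][y][x][channel] condition map.

    first_frame: [y][x][channel] latent features of frame 0.
    trajectories: list of tracks, each a list of (x, y) per frame
                  (assumed to be in latent-grid coordinates).
    """
    height, width = len(first_frame), len(first_frame[0])
    channels = len(first_frame[0][0])
    condition = [[[[0.0] * channels for _ in range(width)]
                  for _ in range(height)]
                 for _ in range(num_frames)]

    for track in trajectories:
        x0, y0 = track[0]
        source = first_frame[y0][x0]  # feature attached to this point
        for t, (x, y) in enumerate(track[:num_frames]):
            if 0 <= x < width and 0 <= y < height:  # skip out-of-frame points
                condition[t][y][x] = list(source)
    return condition


# A 1x3 latent grid with one channel; one point moves left to right.
first_frame = [[[5.0], [0.0], [0.0]]]
trajectories = [[(0, 0), (1, 0), (2, 0)]]
condition = propagate_first_frame(first_frame, trajectories, num_frames=3)
```

Because the result has the same layout as the latent condition itself, it could in principle be fed to an image-to-video model without extra modules, which matches the abstract's claim of no architecture change.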


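[65] LoFA: Learning to Predict Personalized Priors for Fast Adaptation of Visual Generative Models cs.CV

Yiming Hao, Mutian Xu, Chongjie Ye, Jie Qin, Shunlin Lu

TL;DR: The paper proposes LoFA, a framework that learns to predict personalized priors for fast adaptation of visual generative models, removing the dependence of existing methods such as LoRA on task-specific data and lengthy optimization.

Details

Motivation: Current adaptation methods for personalized visual generative models, such as LoRA, need large amounts of task-specific data and lengthy optimization, which is inefficient. Hypernetworks try to predict adaptation weights directly but struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practicality.

Result: Experiments show that LoFA predicts high-quality personalized priors within seconds and, across multiple tasks and user prompts, outperforms conventional LoRA that requires hours of optimization.

Insight: The relative changes in LoRA parameters exhibit structured patterns, a finding that suggests a new approach to fast adaptation of visual generative models.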

Abstract: Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting relative distribution patterns that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing. Project page: https://jaeger416.github.io/lofa/.


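[66] MatteViT: High-Frequency-Aware Document Shadow Removal with Shadow Matte Guidance cs.CV | cs.AI

Chaewon Kim, Seoyeon Lee, Jonghyuk Park

TL;DR: MatteViT is a new document shadow removal framework that combines spatial and frequency-domain information and preserves high-frequency details through a high-frequency amplification module and a continuous luminance-based shadow matte, achieving state-of-the-art performance.

Details

Motivation: Document shadows blur or distort high-frequency details such as text edges and lines, degrading digitization quality. The key challenge is to remove shadows effectively while preserving these details.

Result: The method achieves state-of-the-art performance on the public RDD and Kligler datasets and significantly improves text recognition in downstream tasks such as OCR.

Insight: Combining frequency-domain information with precise spatial guidance effectively improves the preservation of high-frequency details in shadow removal, providing a practical solution for real-world applications.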

Abstract: Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.


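[67] Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning cs.CV | cs.AI

Yi Zhang, Chun-Wun Cheng, Junyi He, Ke Yu, Yushun Tang

TL;DR: The paper proposes T-DHA, a training-free cross-modal reasoning method that models hierarchical vision-language relationships in hyperbolic space and significantly improves few-shot learning and domain generalization.

Details

Motivation: Existing vision-language models suffer performance degradation under domain changes in cross-modal reasoning and need substantial computational resources for fine-tuning. The paper aims to address these problems.

Result: Experiments on multiple datasets show that T-DHA significantly outperforms existing methods in few-shot image recognition and domain generalization.

Insight: The exponential volume growth of hyperbolic space makes it better suited to embedding hierarchically structured data, offering a new modeling approach for cross-modal reasoning.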

Abstract: Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
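For readers unfamiliar with the Poincaré ball model, the standard geodesic distance for curvature -1 is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The sketch below implements only this textbook formula. How T-DHA uses distances inside its adapters is not specified in the abstract and is not modeled here.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance in the unit Poincaré ball (curvature -1)."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(
        1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Two pairs separated by the same right angle, at different radii.
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
```

Points close to the boundary end up much farther apart than points near the origin, even at the same angular separation. This is the room that tree-like hierarchies can exploit: leaves spread out near the boundary while ancestors sit near the center.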


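[68] InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models cs.CV | cs.AI

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang

TL;DR: InfiniteVL is a linear-complexity vision-language model architecture that combines sliding window attention (SWA) with Gated DeltaNet to address the weaknesses of window attention and linear attention on long-sequence tasks. With a three-stage training strategy, InfiniteVL outperforms comparable models using little data while keeping latency and memory footprint constant.

Details

Motivation: Window attention degrades when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR. The authors propose InfiniteVL to overcome these limitations.

Result: With little training data, InfiniteVL outperforms existing linear-complexity VLMs and matches Transformer performance; inference is more than 3.6× faster, and it sustains real-time 24 FPS on streaming video.

Insight: Designing linear attention and sliding window attention to work together can significantly improve both efficiency and performance; the three-stage training strategy is essential for optimizing the model when data is scarce.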

Abstract: Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
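The limitation of window attention noted above is visible in the attention mask itself. The sketch below builds a causal sliding-window mask in plain Python. It shows the generic SWA pattern only, not InfiniteVL's implementation, and the function name is an assumption.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[q][k] is True when query q may attend to key k.

    Each query sees itself and the previous `window - 1` tokens only.
    """
    return [[(q - window < k <= q) for k in range(seq_len)]
            for q in range(seq_len)]


mask = sliding_window_causal_mask(seq_len=4, window=2)
# Visible key positions for each query: [[0], [0, 1], [1, 2], [2, 3]]
visible = [[k for k, allowed in enumerate(row) if allowed] for row in mask]
```

Once the sequence is longer than the window, early tokens fall out of view (query 3 can no longer see token 0). A recurrent linear-attention state such as Gated DeltaNet is one way to carry that older context forward at constant memory cost.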


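[69] Generation is Required for Data-Efficient Perception cs.CV | cs.LG

Jack Brady, Bernhard Schölkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel

TL;DR: The paper examines whether generative methods are necessary for data-efficient visual perception and shows, theoretically and empirically, their advantage in compositional generalization.

Details

Motivation: To investigate whether generative methods are a necessary condition for human-level visual perception and to assess their potential for compositional generalization.

Result: Generative methods significantly outperform non-generative methods in compositional generalization, improving performance without additional data or supervision.

Insight: Thanks to the structural advantage of their decoder, generative methods are better suited to efficient compositional generalization, especially when data is limited.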

Abstract: It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today’s most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.


[70] Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference cs.CVPDF

Amit Bendkhale

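TL;DR: Tri-Bench is a compact benchmark that tests the reliability of VLM spatial reasoning under camera tilt and object interference. The evaluated VLMs perform poorly in these settings, especially at recognizing minority shape classes.

Details

Motivation: Although VLMs perform well on multimodal tasks, their geometric reasoning in realistic scenes remains weak, particularly when camera pose and object interference are involved, so a reliable evaluation is needed to verify their controllability and trustworthiness.

Result: Mean VLM accuracy against 3D ground truth is ~69%, and agreement with 2D image-plane projections is slightly higher (~72%). Recognition of minority shape classes fails almost completely (~0%). Camera tilt lowers accuracy by ~4.1%, while object interference has no significant effect.

Insight: The VLMs fail to use the frame-of-reference hint in the prompt and default to 2D image-plane cues, revealing limits in their geometric reasoning, especially in complex scenes.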

Abstract: Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.


[71] SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing cs.CVPDF

Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng

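TL;DR: The paper proposes SATGround, which adds a structured localization mechanism to a pretrained vision-language model (VLM), substantially improving visual grounding in satellite imagery and outperforming existing methods on several remote sensing benchmarks.

Details

Motivation: In remote sensing, VLMs can integrate information across tasks, but their visual grounding in complex satellite scenes is still limited. The authors aim to improve localization accuracy through structured spatial reasoning.

Result: SATGround outperforms existing methods on several remote sensing benchmarks, with a 24.8% relative improvement on visual grounding.

Insight: Structured spatial reasoning effectively strengthens VLM localization in complex scenes and offers a more reliable basis for satellite data analysis.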

Abstract: Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model’s ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.


[72] No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers cs.CV | cs.AIPDF

Damiano Marsili, Georgia Gkioxari

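TL;DR: The paper proposes an annotation-free training framework that uses multimodal verifiers to improve visual reasoning and object grounding, combining the strengths of LLMs and VLMs and performing strongly across spatial reasoning tasks.

Details

Motivation: Current visual reasoning methods typically depend on large-scale annotated data or on pretrained models, and suffer from flawed logic or inaccurate grounding. The paper uses AI verifiers (an LLM and a VLM) to train without annotations and improve both reasoning and grounding.

Result: The method performs strongly across diverse spatial reasoning tasks, surpassing open-source and proprietary models; with the improved visual grounding model it also outperforms recent text-only visual reasoning methods.

Insight: Combining the strengths of language and vision models and training through verifiers rather than labels yields an efficient, scalable approach to visual reasoning.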

Abstract: Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/


[73] UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation cs.CVPDF

Zeyang Liu, Le Wang, Sanping Zhou, Yuxuan Wu, Xiaolong Sun

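TL;DR: UniLayDiff is a unified diffusion Transformer that, for the first time, handles multiple content-aware layout generation tasks with a single end-to-end trainable model, reaching state-of-the-art performance from unconditional to various conditional generation tasks.

Details

Motivation: Existing methods handle only a subset of content-aware layout generation tasks or require separately trained models for different conditions, so no unified solution exists. This motivated a single model that covers all of these tasks.

Result: Experiments show that UniLayDiff achieves state-of-the-art performance on unconditional and various conditional generation tasks and, to the authors' knowledge, is the first model to unify the full range of content-aware layout generation tasks.

Insight: Treating constraints as a modality within a multimodal framework improves adaptability across tasks and generation quality, and suggests a design direction for future unified multi-task generative models.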

Abstract: Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned by element types, sizes, or their relationships. Current methods either address only a subset of these tasks or necessitate separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff, a Unified Diffusion Transformer that, for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints through fine-tuning the model with LoRA after pretraining the model on other tasks. Such a schema not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance across tasks ranging from unconditional to various conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.


[74] Self-Evolving 3D Scene Generation from a Single Image cs.CVPDF

Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He

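TL;DR: EvoScene is a training-free, self-evolving framework that progressively reconstructs a complete 3D scene from a single image by combining geometric reasoning from 3D generation models with visual knowledge from video generation models.

Details

Motivation: Generating high-quality, richly textured 3D scenes from a single image is a core challenge in computer vision and graphics, and existing methods generalize poorly to complex scenes.

Result: Across diverse scenes, EvoScene shows superior geometric stability, view-consistent textures, and completion of unseen regions.

Insight: Combining the complementary strengths of 2D and 3D models and refining iteratively gives scene reconstruction a self-evolving capability.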

Abstract: Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages–Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation–EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.


[75] LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception cs.CV | cs.ROPDF

Simon de Moreau, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde

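TL;DR: LiDAS is a dynamic active illumination system that improves nighttime perception by optimizing how light is allocated, enabling zero-shot nighttime generalization without retraining while reducing energy use.

Details

Motivation: Camera-based perception is limited at night, and conventional methods rely passively on scene lighting. LiDAS aims to make nighttime perception more robust by controlling illumination dynamically.

Result: In real-world closed-loop driving scenarios, LiDAS improves on standard low-beam lighting at equal power (+18.7% mAP50, +5.0% mIoU), and can instead maintain performance while using 40% less energy.

Insight: Dynamic illumination control can substantially improve perception and offers an efficient, low-cost approach to nighttime perception.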

Abstract: Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it on object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performances while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.


[76] Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration cs.CVPDF

Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee

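TL;DR: The paper proposes UniT, a unified text-aware image restoration framework that combines a diffusion transformer, a vision-language model, and a text spotting module to address text hallucination, and shows strong experimental performance.

Details

Motivation: Existing diffusion models tend to hallucinate text in text-aware image restoration because they lack explicit linguistic guidance.

Result: On the SA-Text and Real-Text benchmarks, UniT substantially reduces text hallucinations and achieves state-of-the-art end-to-end F1 scores.

Insight: Explicitly introducing textual knowledge and combining visual and linguistic information iteratively is key to performance in text-aware image restoration.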

Abstract: Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.


[77] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time cs.CVPDF

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni

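TL;DR: D4RT is an efficient dynamic scene reconstruction method that uses a unified Transformer architecture to jointly infer depth, spatio-temporal correspondence, and camera parameters, avoiding the dense computation and multi-decoder complexity of earlier methods.

Details

Motivation: Reconstructing the geometry and motion of dynamic scenes is a hard problem in computer vision, and conventional methods are computationally heavy and complex.

Result: D4RT performs strongly on 4D reconstruction tasks and outperforms existing methods.

Insight: A unified Transformer design with a flexible querying mechanism markedly improves both the efficiency and the quality of dynamic scene reconstruction.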

Abstract: Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.


[78] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment cs.CV | cs.GRPDF

Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun

TL;DR: Selfi is a self-improving 3D reconstruction pipeline that enhances the geometric consistency of VGGT features through feature alignment, achieving state-of-the-art performance in Novel View Synthesis (NVS) and camera pose estimation.

Details

Motivation: Traditional NVS models rely on explicit 3D inductive biases and known camera parameters, while recent foundation models like VGGT learn 3D knowledge implicitly but lack geometric consistency. Selfi addresses this gap by improving 3D feature consistency.

Result: Selfi achieves state-of-the-art performance in NVS and camera pose estimation, demonstrating the benefits of feature alignment for downstream 3D tasks.

Insight: Feature alignment is a crucial step for enhancing the geometric consistency of learned 3D representations, enabling better performance in 3D reasoning tasks.

Abstract: Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach – 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.


[79] Astra: General Interactive World Model with Autoregressive Denoising cs.CV | cs.AI | cs.LGPDF

Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao

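TL;DR: The paper proposes Astra, a general interactive world model that uses an autoregressive denoising architecture and precise action control to support long-horizon video prediction and interaction across diverse scenarios.

Details

Motivation: Video generation models can now produce high-quality clips from text or images, but world models that predict long-horizon futures from past observations and actions in general-purpose scenarios remain underexplored. The paper aims to fill this gap.

Result: Experiments on multiple datasets show that Astra outperforms existing world models in fidelity, long-horizon prediction, and action alignment.

Insight: By combining autoregressive denoising with precise action control, Astra shows strong potential for general long-horizon video prediction, particularly in multimodal interaction scenarios.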

Abstract: Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.


cs.CL [Back]

[80] Adaptation of Embedding Models to Financial Filings via LLM Distillation cs.CLPDF

Eliot Brenner, Dominic Seyler, Manjunath Hegde, Andrei Simion, Koustuv Dasgupta

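TL;DR: The paper proposes using LLM distillation to adapt a general-purpose embedding model to domain-specific retrieval in finance, substantially improving retrieval performance.

Details

Motivation: Despite progress in generative large language models (LLMs), general-purpose embedding models still underperform on information retrieval in specialized domains such as finance, and computation cost and latency requirements limit the practical use of specialized conversational AI agents.

Result: Across 21,800 query-document pairs covering 14 financial filing types, MRR@5 improves by 27.7% and mean DCG@5 by 44.6% on average, and NDCG improves on 3 of 4 document classes in FinanceBench.

Insight: Training on iteratively mined hard examples closes the gap between general-purpose models and a specialized domain at low cost, without human annotation.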

Abstract: Despite advances in generative large language models (LLMs), practical application of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains like finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general purpose retrieval embedding model as foundation. Our method yields an average of 27.7% improvement in MRR@5, 44.6% improvement in mean DCG@5 across 14 financial filing types measured over 21,800 query-document pairs, and improved NDCG on 3 of 4 document classes in FinanceBench. We adapt retrieval embeddings (bi-encoder) for RAG, not LLM generators, using LLM-judged relevance to distill domain knowledge into a compact retriever. There are prior works which pair synthetically generated queries with real passages to directly fine-tune the retrieval model. Our pipeline differs from these by introducing interaction between student and teacher models that interleaves retrieval-based mining of hard positive/negative examples from the unlabeled corpus with iterative retraining of the student model’s weights using these examples. Each retrieval iteration uses the refined student model to mine the corpus for progressively harder training examples for the subsequent training iteration. The methodology provides a cost-effective solution to bridging the gap between general-purpose models and specialized domains without requiring labor-intensive human annotation.
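
For readers unfamiliar with the ranking metrics quoted in this abstract, the following is a minimal Python sketch of how MRR@5, DCG@5, and NDCG@5 are commonly computed. It is not taken from the paper: the function names are hypothetical, relevance is assumed to be a per-result graded score (0 means irrelevant), and the NDCG here normalizes against the ideal ordering of the retrieved list only, which is one of several conventions.

```python
import math

def reciprocal_rank_at_k(relevances, k=5):
    """1/rank of the first relevant result within the top k, else 0.0."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=5):
    """DCG divided by the DCG of the same results in ideal order (assumed convention)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mean_metric(metric, ranked_lists, k=5):
    """Average a per-query metric over a non-empty list of queries."""
    return sum(metric(r, k) for r in ranked_lists) / len(ranked_lists)

# Hypothetical relevance lists for two queries, in ranked order.
queries = [[0, 0, 1, 0, 0], [1, 0, 1, 0, 0]]
mrr_at_5 = mean_metric(reciprocal_rank_at_k, queries)
mean_dcg_at_5 = mean_metric(dcg_at_k, queries)
```

MRR rewards placing the first relevant passage early, while DCG also credits further relevant passages lower in the list, which is why the two metrics can move by different amounts for the same retriever.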


[81] Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing cs.CLPDF

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling

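TL;DR: SEA is a universal method for aligning subtitles to continuous sign language video that avoids the limitations of earlier approaches trained for a specific language or dataset.

Details

Motivation: Current methods typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. SEA aims to provide a single framework that works across languages and domains.

Result: Experiments on four sign language datasets show state-of-the-art alignment performance, indicating that SEA can supply high-quality parallel data for sign language processing.

Insight: SEA's flexibility and efficiency suggest that combining pretrained models with a lightweight alignment procedure is promising for cross-lingual tasks.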

Abstract: The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA’s code and models are openly available.
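
The abstract describes the alignment step only as a lightweight dynamic programming procedure, so the sketch below is an illustration of the general idea rather than SEA's algorithm. It assumes a precomputed similarity matrix `sim[i][j]` between subtitle `i` and sign segment `j` (for example, from embeddings in a shared space) and assigns each segment to a subtitle so that subtitle indices never decrease over time. The function name and the scoring rule are hypothetical.

```python
def monotonic_align(sim):
    """Assign each sign segment to one subtitle, keeping subtitle order.

    sim[i][j] is the similarity between subtitle i and segment j.
    Returns a list where entry j is the subtitle index chosen for segment j;
    the indices are non-decreasing and the summed similarity is maximal.
    """
    n_subs = len(sim)
    n_segs = len(sim[0]) if sim else 0
    if n_subs == 0 or n_segs == 0:
        return []

    # score[j][i]: best total similarity with segment j assigned to subtitle i.
    score = [[0.0] * n_subs for _ in range(n_segs)]
    back = [[0] * n_subs for _ in range(n_segs)]
    for i in range(n_subs):
        score[0][i] = sim[i][0]

    for j in range(1, n_segs):
        best_prev, best_idx = float("-inf"), 0
        for i in range(n_subs):
            # Running maximum over subtitles <= i for the previous segment.
            if score[j - 1][i] > best_prev:
                best_prev, best_idx = score[j - 1][i], i
            score[j][i] = sim[i][j] + best_prev
            back[j][i] = best_idx

    # Backtrack from the best final state.
    i = max(range(n_subs), key=lambda s: score[n_segs - 1][s])
    assignment = [0] * n_segs
    for j in range(n_segs - 1, -1, -1):
        assignment[j] = i
        i = back[j][i]
    return assignment

# Two subtitles, four sign segments (made-up similarities).
example = monotonic_align([[0.9, 0.8, 0.1, 0.2],
                           [0.1, 0.3, 0.7, 0.9]])
```

The running maximum keeps the cost at O(subtitles × segments), which is consistent with the abstract's point that alignment of this kind is cheap enough to run on a CPU.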


[82] Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation cs.CLPDF

Sampriti Soor, Suklav Ghosh, Arijit Sur

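TL;DR: The paper studies universal adversarial suffixes: short token sequences (4-10 tokens) that can be appended to any input to broadly reduce accuracy across tasks and models. The suffix is learned in a differentiable "soft" form using Gumbel-Softmax relaxation and discretized for inference, giving an attack that is effective across tasks and models.

Details

Motivation: Language models used for zero-shot or few-shot classification are vulnerable to adversarial prompts, while existing adversarial triggers are usually task- or model-specific, which limits their generality and makes results hard to compare. This motivates the study of universal adversarial suffixes.

Result: On sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning, the method consistently lowers the accuracy and calibrated confidence of Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B.

Insight: Universal adversarial suffixes expose the fragility of language models and open a research direction on the transferability and universality of adversarial attacks.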

Abstract: Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable “soft” form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
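
As background for the Gumbel-Softmax relaxation mentioned in this abstract, here is a small standard-library Python sketch of the sampling step and the later discretization. It shows the generic technique only: the paper's training loop, loss, and calibration are not reproduced, and the function names, toy logits, and temperature are assumptions made for illustration.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Sample a relaxed one-hot vector over token ids.

    Adds Gumbel(0, 1) noise to each logit, divides by the temperature,
    and applies a softmax. Lower temperatures give sharper samples.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

def discretize(soft_token):
    """Pick the highest-weight token id, as done when moving to inference."""
    return max(range(len(soft_token)), key=soft_token.__getitem__)

# One suffix position over a toy 3-token vocabulary (hypothetical logits).
rng = random.Random(0)
soft = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
token_id = discretize(soft)
```

During training the soft vector can be used to mix token embeddings so that gradients reach the logits; at inference each position collapses to a single token, which matches the "soft form, then discretize" description above.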


[83] Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward cs.CLPDF

Sampriti Soor, Suklav Ghosh, Arijit Sur

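TL;DR: The paper proposes a reinforcement learning method for generating universal adversarial suffixes, training them with a calibrated reward so that they generalize and transfer better across tasks and language models.

Details

Motivation: Language models are vulnerable to short adversarial suffixes, but current approaches such as gradient search or rule-based methods tend to be brittle and tied to a single task or model. The paper aims to address this.

Result: In tests on five NLP benchmark datasets and three language models, RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers.

Insight: The calibrated reward design is key: it improves both the effectiveness and the generality of the suffixes, and suggests a new direction for adversarial attack research.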

Abstract: Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used where the suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen model as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of similar genres.


[84] ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access cs.CL | cs.AI | cs.HC | cs.IRPDF

Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao

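TL;DR: ClinicalTrialsHub is an interactive platform that integrates data from ClinicalTrials.gov and PubMed and uses large language models to make clinical trial data more accessible; its effectiveness is validated through a user study and automatic evaluation.

Details

Motivation: Clinical trial data is scattered across platforms such as ClinicalTrials.gov and PubMed, with no unified structured access, which limits how efficiently patients, clinicians, and researchers can use it.

Result: The platform increases access to structured clinical trial data by 83.8% compared with ClinicalTrials.gov alone, and a user study and automatic evaluation confirm its utility for information extraction and question answering.

Insight: Large language models show potential for integrating multi-source medical data and can support evidence-based medicine.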

Abstract: We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.


[85] Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models cs.CLPDF

Ju-Young Kim, Ji-Hong Park, Se-Yeon Lee, Sujin Park, Gun-Woo Kim

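TL;DR: The paper proposes a soft inductive bias approach that explicitly defines reasoning perspectives, improving the accuracy of a Korean large language model on inappropriate utterance detection.

Details

Motivation: Inappropriate remarks are frequent in anonymous online environments, so techniques for detecting them are needed to build safer communication environments. Research applying Korean large language models and chain-of-thought reasoning to this task remains limited.

Result: The Kanana-1.5 model reaches an average accuracy of 87.0046, an improvement of approximately 3.89 percent over standard supervised learning.

Insight: By constraining reasoning perspectives, the method goes beyond simple knowledge imitation by large language models and produces more precise and consistent judgments for inappropriate utterance detection.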

Abstract: Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.


[86] Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks cs.CL | cs.AI | cs.CV | cs.MAPDF

Indrajit Kar, Kalathur Chenchu Kishore Kumar

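TL;DR: The paper proposes a hierarchical multi-agent architecture that uses a spatial curriculum and a Thompson Sampling curriculum manager to address the decomposition and computation cost of long-horizon complex tasks.

Details

Motivation: Current large language models and multi-agent systems struggle with long-horizon tasks because of high computation cost and limited decomposition ability.

Result: On a spatially grounded Tower of Hanoi benchmark, the system shows improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.

Insight: Hierarchical design combined with curriculum learning can decompose and solve long-horizon tasks effectively while lowering computation cost.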

Abstract: Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
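
To make the "Thompson Sampling curriculum manager" concrete, the following is a minimal Beta-Bernoulli sketch in Python. It is an assumption-laden illustration, not the paper's implementation: the class name, the zone labels, and the use of a single reward in [0, 1] (standing in for the competence and NLL-driven signals the abstract mentions) are all hypothetical.

```python
import random

class ThompsonCurriculum:
    """Choose the next training zone by sampling from per-zone Beta posteriors."""

    def __init__(self, zones, rng=None):
        self.rng = rng or random.Random()
        # Beta(1, 1) prior for every zone, stored as [alpha, beta].
        self.posterior = {zone: [1.0, 1.0] for zone in zones}

    def choose_zone(self):
        """Draw one sample per zone and return the zone with the largest draw."""
        draws = {zone: self.rng.betavariate(a, b)
                 for zone, (a, b) in self.posterior.items()}
        return max(draws, key=draws.get)

    def update(self, zone, reward):
        """Update a zone with a reward clipped to [0, 1] (fractional counts)."""
        reward = min(max(reward, 0.0), 1.0)
        self.posterior[zone][0] += reward
        self.posterior[zone][1] += 1.0 - reward

# Hypothetical usage: the central zone keeps paying off, the edge does not.
manager = ThompsonCurriculum(["center", "edge"], rng=random.Random(0))
for _ in range(200):
    manager.update("center", 1.0)
    manager.update("edge", 0.0)
picks = [manager.choose_zone() for _ in range(20)]
```

Sampling from the posterior, rather than always taking the zone with the best average, keeps some exploration of zones with little evidence while concentrating effort where the reward signal is high.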


[87] Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts cs.CLPDF

Yifan Lyu, Liang Zhang

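TL;DR: ROME combines psychological knowledge with the role-play capability of LLMs to turn free-form text into answers to psychometric questionnaires, which supplies intermediate supervision within a multi-task learning framework and markedly improves personality detection accuracy.

Details

Motivation: Existing personality detection methods are limited by scarce supervision signals and under-specified semantic mappings between text and psychological constructs. ROME aims to improve model performance by injecting psychological knowledge.

Result: On two real-world datasets, ROME outperforms existing baselines (for example, a 15.41% improvement on the Kaggle dataset).

Insight: 1. Combining psychological knowledge with LLMs effectively mitigates the shortage of supervision signals; 2. Questionnaire answers generated through role-play make the model more interpretable.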

Abstract: Understanding human personality is crucial for web applications such as personalized recommendation and mental health assessment. Existing studies on personality detection predominantly adopt a “posts -> user vector -> labels” modeling paradigm, which encodes social media posts into user representations for predicting personality labels (e.g., MBTI labels). While recent advances in large language models (LLMs) have improved text encoding capacities, these approaches remain constrained by limited supervision signals due to label scarcity, and under-specified semantic mappings between user language and abstract psychological constructs. We address these challenges by proposing ROME, a novel framework that explicitly injects psychological knowledge into personality detection. Inspired by standardized self-assessment tests, ROME leverages LLMs’ role-play capability to simulate user responses to validated psychometric questionnaires. These generated question-level answers transform free-form user posts into interpretable, questionnaire-grounded evidence linking linguistic cues to personality labels, thereby providing rich intermediate supervision to mitigate label scarcity while offering a semantic reasoning chain that guides and simplifies the text-to-personality mapping learning. A question-conditioned Mixture-of-Experts module then jointly routes over post and question representations, learning to answer questionnaire items under explicit supervision. The predicted answers are summarized into an interpretable answer vector and fused with the user representation for final prediction within a multi-task learning framework, where question answering serves as a powerful auxiliary task for personality detection. Extensive experiments on two real-world datasets demonstrate that ROME consistently outperforms state-of-the-art baselines, achieving improvements (15.41% on Kaggle dataset).


[88] Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis cs.CL | cs.AI | cs.LGPDF

Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald

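TL;DR: This paper examines how gradually growing the depth of Transformer models (as in MIDAS) uses depth more effectively and overcomes the “Curse of Depth” seen in conventional models. Depth-wise analyses show how gradual growth alters the residual stream structure and forms permutable computational blocks.

Details

Motivation: In conventionally pre-trained Transformers, layers in the second half contribute little to the final output (the “Curse of Depth”), whereas gradually growing depth both reduces training cost and improves reasoning performance. This study seeks a mechanistic understanding of where these gains come from.

Result: Gradually grown models use depth more effectively, form distinct computational circuits, and perform better on downstream reasoning tasks.

Insight: Growing depth gradually not only changes the model's structure but may also encourage more efficient composition of computational blocks, overcoming the limited depth utilization of conventional models.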

Abstract: Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025, Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.


[89] Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders cs.CL | cs.AIPDF

Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang

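TL;DR: The paper proposes RAGLens, a lightweight hallucination detector based on sparse autoencoders (SAEs) that disentangles LLM activations to accurately identify hallucinated RAG output and provide interpretable rationales for its decisions.

Details

Motivation: Retrieval-augmented generation (RAG) markedly improves the factuality of large language models, but outputs can still deviate from the retrieved evidence, producing faithfulness failures (hallucinations). Existing detectors rely on large-scale annotated data or costly queries to external LLMs, so an efficient and accurate alternative is needed.

Result: RAGLens outperforms existing methods at hallucination detection while providing interpretable rationales that support post-hoc mitigation of unfaithful outputs.

Insight: The work reveals how hallucination-related signals are distributed within LLMs, offering a new perspective on faithfulness failures in RAG.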

Abstract: Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.


cs.AI [Back]

[90] Reasoning Models Ace the CFA Exams cs.AI | cs.CL | q-fin.GNPDF

Jaisal Patel, Yunzhe Chen, Kaiwen He, Keyi Wang, David Li

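TL;DR: Modern reasoning models perform strongly on the CFA exams: Gemini 3.0 Pro scores 97.6% on Level I, and other models such as GPT-5 and Gemini 2.5 Pro also excel at different levels.

Details

Motivation: Earlier studies found that LLMs performed poorly on the CFA exams, while recent reasoning models have excelled on academic and professional examinations; the paper therefore examines how these models fare on the CFA exams.

Result: Gemini 3.0 Pro scores 97.6% on Level I; GPT-5 scores 94.3% on Level II; on Level III, Gemini 2.5 Pro scores 86.4% on multiple-choice questions and Gemini 3.0 Pro scores 92.0% on constructed-response questions.

Insight: Reasoning models perform well on complex professional exams, indicating their potential for standardized tests and practical applications.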

Abstract: Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.


[91] The High Cost of Incivility: Quantifying Interaction Inefficiency via Multi-Agent Monte Carlo Simulations cs.AI | cs.CL | cs.CY | cs.MAPDF

Benedikt Mangold

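TL;DR: The paper uses an LLM-based multi-agent system and Monte Carlo methods to simulate the effect of toxic behavior on discussion efficiency. It finds that toxic behavior lengthens conversations by about 25% and presents this approach as an ethical and methodological alternative for quantifying the impact of social friction.

Details

Motivation: Workplace toxicity harms organizational culture, but directly quantifying its effect on operational efficiency poses ethical and methodological challenges; the study addresses this through multi-agent simulation.

Result: Toxic behavior produces a statistically significant increase of about 25% in conversation duration, demonstrating a “latency of toxicity”.

Insight: Multi-agent modeling offers a reproducible and ethical alternative for measuring the mechanics of social friction.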

Abstract: Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled “sociological sandbox”. We employ a Monte Carlo method to simulate hundreds of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with “toxic” system prompts. Our results demonstrate a statistically significant increase of approximately 25% in the duration of conversations involving toxic participants. We propose that this “latency of toxicity” serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.
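The Monte Carlo comparison described in the abstract can be sketched as follows. This is a toy illustration only: the LLM debate is replaced by a hypothetical stand-in in which each argument ends the discussion with a fixed probability. The function names and the two probabilities are assumptions chosen for illustration, not values or code from the paper.

```python
import random
import statistics

def simulate_debate(p_agree, rng, max_turns=1000):
    """Return the number of arguments exchanged until a conclusion is reached.

    Stand-in for an LLM debate: each argument ends the discussion with
    probability p_agree (a hypothetical parameter, not from the paper).
    """
    for turn in range(1, max_turns + 1):
        if rng.random() < p_agree:
            return turn
    return max_turns

def mean_convergence_time(p_agree, n_runs, seed):
    """Monte Carlo estimate of the mean convergence time over n_runs debates."""
    rng = random.Random(seed)
    return statistics.mean(simulate_debate(p_agree, rng) for _ in range(n_runs))

# Assumed per-argument agreement probabilities for the two groups.
control = mean_convergence_time(p_agree=0.25, n_runs=5000, seed=0)
toxic = mean_convergence_time(p_agree=0.20, n_runs=5000, seed=1)
relative_increase = (toxic - control) / control
print(f"control={control:.2f} toxic={toxic:.2f} increase={relative_increase:.1%}")
```

In a real replication, `simulate_debate` would wrap the multi-agent conversation itself, and the two groups would differ only in their system prompts.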


[92] See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm cs.AI | cs.CV | cs.HCPDF

Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang

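TL;DR: The paper proposes See-Control, a multimodal agent framework that operates a smartphone directly through a low-DoF robotic arm, removing existing methods' dependence on ADB.

Details

Motivation: Current MLLM-based smartphone agents depend on ADB, which restricts them to Android devices. The authors aim for a platform-agnostic solution through physical interaction.

Result: See-Control bridges the gap between digital agents and the physical world, providing a foundation for home robots to carry out smartphone tasks.

Insight: Direct physical interaction is more flexible than relying on ADB and suits cross-platform use, an important step toward connecting intelligent agents with the physical world.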

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.


physics.geo-ph [Back]

[93] Self-Reinforced Deep Priors for Reparameterized Full Waveform Inversion physics.geo-ph | cs.CVPDF

Guangyuan Zou, Junlun Li, Feng Liu, Xuejing Zheng, Jianjian Xie

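TL;DR: The paper proposes a self-reinforced deep image prior framework (SRDIP-FWI) to address the nonlinearity of full waveform inversion (FWI) and its tendency to converge to local minima.

Details

Motivation: Conventional deep image prior FWI (DIP-FWI) uses a fixed random input throughout inversion, which fails to exploit the mapping between network input and output; under complex geological conditions, the lack of an informative prior worsens the ill-posedness of the inverse problem.

Result: Synthetic tests and a field land data application show that SRDIP-FWI surpasses multiscale FWI in resolution, accuracy, and depth penetration, and that it removes the need for manual frequency-band selection and time-window picking, substantially simplifying the inversion workflow.

Insight: SRDIP-FWI overcomes the limitations of conventional DIP-FWI through an adaptive input-update mechanism, offering a new, adaptive, and robust framework for accurate subsurface velocity model reconstruction.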

Abstract: Full waveform inversion (FWI) has become a widely adopted technique for high-resolution subsurface imaging. However, its inherent strong nonlinearity often results in convergence toward local minima. Recently, deep image prior-based reparameterized FWI (DIP-FWI) has been proposed to alleviate the dependence on massive training data. By exploiting the spectral bias and implicit regularization in the neural network architecture, DIP-FWI can effectively avoid local minima and reconstruct more geologically plausible velocity models. Nevertheless, existing DIP-FWI methods typically use a fixed random input throughout the inversion process, which fails to utilize the mapping and correlation between the input and output of the network. Moreover, under complex geological conditions, the lack of an informative prior in the input can exacerbate the ill-posedness of the inverse problem, leading to artifacts and unstable reconstructions. To address these limitations, we propose a self-reinforced DIP-FWI (SRDIP-FWI) framework, in which a steering algorithm alternately updates both the network parameters and the input at each iteration using feedback from the current network output. This design allows adaptive structural enhancement and improved regularization, thereby effectively mitigating the ill-posedness in FWI. Additionally, we analyze the spectral bias of the network in SRDIP-FWI and quantify its role in multiscale velocity model building. Synthetic tests and a field land data application demonstrate that SRDIP-FWI achieves superior resolution, improved accuracy and greater depth penetration compared to multiscale FWI. More importantly, SRDIP-FWI eliminates the need for manual frequency-band selection and time-window picking, substantially simplifying the inversion workflow. Overall, the proposed method provides a novel, adaptive and robust framework for accurate subsurface velocity model reconstruction.


math.NA [Back]

[94] Generalizations of the Normalized Radon Cumulative Distribution Transform for Limited Data Recognition math.NA | cs.CV | cs.ITPDF

Matthias Beckmann, Robert Beinert, Jonas Bresch

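TL;DR: The paper proposes generalized normalizations of the Radon cumulative distribution transform (R-CDT) to make it more flexible for limited-data recognition tasks, and extends it to multi-dimensional and non-Euclidean settings.

Details

Motivation: In limited-data recognition tasks such as watermark recognition, data are often subject to affine transformations, and the conventional R-CDT is not invariant under certain of these transformations. The paper aims to design more flexible normalizations and extend them to more complex data types.

Result: Experiments show that the new methods achieve near-perfect accuracy in classification and clustering tasks.

Insight: With generalized normalization and multi-dimensional extensions, the R-CDT performs well on limited-data tasks, especially where affine transformations are common.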

Abstract: The Radon cumulative distribution transform (R-CDT) exploits one-dimensional Wasserstein transport and the Radon transform to represent prominent features in images. It is closely related to the sliced Wasserstein distance and facilitates classification tasks, especially in the small data regime, like the recognition of watermarks in filigranology. Here, a typical issue is that the given data may be subject to affine transformations caused by the measuring process. To make the R-CDT invariant under arbitrary affine transformations, a two-step normalization of the R-CDT has been proposed in our earlier works. The aim of this paper is twofold. First, we propose a family of generalized normalizations to enhance flexibility for applications. Second, we study multi-dimensional and non-Euclidean settings by making use of generalized Radon transforms. We prove that our novel feature representations are invariant under certain transformations and allow for linear separation in feature space. Our theoretical results are supported by numerical experiments based on 2d images, 3d shapes and 3d rotation matrices, showing near perfect classification accuracies and clustering results.


cs.IR [Back]

[95] Ontology-Based Knowledge Graph Framework for Industrial Standard Documents via Hierarchical and Propositional Structuring cs.IR | cs.CLPDF

Jiin Park, Hyuna Jeon, Yoonseo Lee, Jisu Hong, Misuk Kim

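TL;DR: The paper proposes an ontology-based knowledge graph construction method that models the hierarchical and propositional structure of industrial standard documents, uses LLM-based triple extraction for efficient knowledge representation, and significantly outperforms existing methods on several types of QA tasks.

Details

Motivation: Industrial standard documents contain complex technical information and rules; conventional methods struggle to capture their hierarchical and logical structure, which limits knowledge graph construction and use.

Result: On rule, table, and multi-hop QA tasks, the proposed method significantly improves on existing KG-RAG methods.

Insight: The complex structure and rules of industrial documents can be represented more effectively through hierarchical and propositional modeling, suggesting new directions for domain-specific RAG and intelligent document management.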

Abstract: Ontology-based knowledge graph (KG) construction is a core technology that enables multidimensional understanding and advanced reasoning over domain knowledge. Industrial standards, in particular, contain extensive technical information and complex rules presented in highly structured formats that combine tables, scopes of application, constraints, exceptions, and numerical calculations, making KG construction especially challenging. In this study, we propose a method that organizes such documents into a hierarchical semantic structure, decomposes sentences and tables into atomic propositions derived from conditional and numerical rules, and integrates them into an ontology-knowledge graph through LLM-based triple extraction. Our approach captures both the hierarchical and logical structures of documents, effectively representing domain-specific semantics that conventional methods fail to reflect. To verify its effectiveness, we constructed rule, table, and multi-hop QA datasets, as well as a toxic clause detection dataset, from industrial standards, and implemented an ontology-aware KG-RAG framework for comparative evaluation. Experimental results show that our method achieves significant performance improvements across all QA types compared to existing KG-RAG approaches. This study demonstrates that reliable and scalable knowledge representation is feasible even for industrial documents with intertwined conditions, constraints, and scopes, contributing to future domain-specific RAG development and intelligent document management.


cs.GR [Back]

[96] Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions cs.GR | cs.CVPDF

Jianan Li, Xiao Chen, Tao Huang, Tien-Tsin Wong

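TL;DR: This paper proposes Mimic2DM, a framework that learns control policies directly from 2D keypoint trajectories extracted from videos, without relying on 3D motion data, and generates diverse physically simulated motions.

Details

Motivation: Video data is more cost-effective than motion capture data, but synthesizing diverse and realistic 3D motion directly from video remains challenging. Existing methods rely on 3D reconstruction techniques, which are limited in generalizability and physical plausibility.

Result: Experiments show that Mimic2DM generates physically plausible and diverse motions across domains including dancing, soccer dribbling, and animal movements, without 3D data.

Insight: Learning motion control policies directly from 2D data is promising and can extend to 3D tasks; multi-view aggregation is an effective route to 3D capability.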

Abstract: Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data. Project Website: https://jiann-li.github.io/mimic2dm/


cs.RO [Back]

[97] VLD: Visual Language Goal Distance for Reinforcement Learning Navigation cs.RO | cs.CVPDF

Lazar Milikic, Manthan Patel, Jonas Frey

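TL;DR: The paper proposes the Vision-Language Distance (VLD) learning framework, which decouples perception learning from policy learning. By training a self-supervised distance-to-goal predictor on internet-scale video data, it improves the robustness of reinforcement learning navigation and supports multimodal goals.

Details

Motivation: Existing end-to-end navigation policies rely on raw sensory inputs or limited labeled data and suffer from the sim-to-real gap or data scarcity. VLD decouples perception from policy learning and uses large-scale visual training to improve generalization.

Result: VLD outperforms temporal distance methods such as ViNT and VIP, achieves competitive navigation performance in simulation, and supports flexible goal modalities.

Insight: Decoupling perception from policy learning, combined with large-scale pretraining, is an effective path to more robust and generalizable navigation.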

Abstract: Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information (“where to go”) from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
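One plausible reading of “ordinal consistency” is rank agreement between predicted distances and the true temporal gap to the goal along a trajectory. The sketch below measures this with Kendall's tau. The choice of statistic, the function name, and the example values are all assumptions for illustration; the abstract does not give the paper's exact definition.

```python
from itertools import combinations

def kendall_tau(predicted, reference):
    """Kendall's tau-a between two equal-length sequences (tied pairs count as neither)."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (reference[i] - reference[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical example: five frames of a video whose goal is the last frame.
steps_to_goal = [4, 3, 2, 1, 0]                    # true temporal distance
predicted_distance = [0.9, 0.7, 0.75, 0.2, 0.05]   # made-up predictor outputs
print(kendall_tau(predicted_distance, steps_to_goal))
```

A score of 1 means the predictor orders every pair of frames correctly, while a score near 0 means its ordering is no better than chance. Only the ordering matters, so the predictor's output scale does not need to match the true distance.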


[98] DIJIT: A Robotic Head for an Active Observer cs.RO | cs.CVPDF

Mostafa Kamali Tabrizi, Mingshi Chi, Bir Bikram Dey, Yu Qing Yuan, Markus D. Solbach

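TL;DR: The paper introduces DIJIT, a binocular robotic head designed for mobile agents, for studying active vision and the interplay of human-like eye and head movements. The design has nine mechanical and four optical degrees of freedom, with performance close to that of humans. The paper also presents a new method for saccadic camera movements.

Details

Motivation: To study active vision and its interplay with eye and head movements, and to explore the differences between human vision and computer vision.

Result: DIJIT's saccadic movements approach human accuracy, and the system can be used for active vision research.

Insight: By emulating human eye and head movements, DIJIT offers a new tool for studying active vision and the differences between human and machine vision.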

Abstract: We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT’s unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT’s design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. The exploration of the utility of these to both human and machine vision is ongoing. Here, we present the design of DIJIT and evaluate aspects of its performance. We present a new method for saccadic camera movements. In this method, a direct relationship between camera orientation and motor values is developed. The resulting saccadic camera movements are close to human movements in terms of their accuracy.


[99] Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model cs.RO | cs.AI | cs.CVPDF

Wenjiang Xu, Cindy Wang, Rui Fang, Mingkang Zhang, Lusong Li

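TL;DR: The paper proposes the Embodied Tree of Thoughts (EToT) framework, which uses a physics simulator as an embodied world model and optimizes robot manipulation planning through tree search and a feedback mechanism.

Details

Motivation: Current world models based on video generation lack physical grounding and consistency in long-horizon constraints, which limits the reliability of robot manipulation planning.

Result: On short- and long-horizon manipulation tasks, EToT outperforms baselines by effectively predicting physical dynamics and adapting to potential failures.

Insight: Combining a physics simulator with an efficient feedback mechanism can markedly improve planning robustness and reliability in complex manipulation tasks.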

Abstract: World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long-horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website at https://embodied-tree-of-thoughts.github.io .


[100] Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation cs.RO | cs.CV | cs.LG | eess.IVPDF

Srijan Dokania, Dharini Raghavan

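TL;DR: Zero-Splat TeleAssist is a zero-shot pose estimation framework that fuses vision-language segmentation, monocular depth, and 3D Gaussian Splatting to turn commodity CCTV streams into a shared 6-DoF world model for multilateral teleoperation.

Details

Motivation: Existing teleoperation systems depend on fiducial markers or depth sensors, which limits their flexibility and adoption. The paper aims for a zero-shot method that estimates the poses of multiple robots in real time without extra sensors.

Result: The system provides real-time 6-DoF poses of multiple robots and supports interaction-centric teleoperation without fiducials or depth sensors.

Insight: Fusing vision-language models with 3D reconstruction shows the potential of zero-shot methods in complex scenes and gives teleoperation a more flexible solution.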

Abstract: We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.


cs.CY [Back]

[101] Accelerating Urban Science Research with AI Urban Scientist cs.CY | cs.CL | cs.MAPDF

Tong Xia, Jiankun Zhang, Ruiwen You, Ao Xu, Linghao Zhang

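TL;DR: AI Urban Scientist is a knowledge-driven, multi-agent framework designed to accelerate urban science research by generating hypotheses, processing heterogeneous data, automating analysis and simulation, and producing insights compatible with urban scientific reasoning.

Details

Motivation: Cities are complex adaptive systems whose underlying principles remain hard to disentangle despite abundant data. Urban science must turn fragmented, interdisciplinary information into coherent explanations of how cities function and evolve.

Result: The system provides reusable analytical tools and supports community-driven extensions, helping to reveal the mechanisms of urban systems and to design more resilient and equitable cities.

Insight: AI Urban Scientist acts as a collaborator rather than merely an assistant, offering a general-purpose tool for accelerating urban science research while underscoring the importance of domain knowledge.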

Abstract: Cities are complex, adaptive systems whose underlying principles remain difficult to disentangle despite unprecedented data abundance. Urban science therefore faces a fundamental challenge: converting vast, fragmented and interdisciplinary information into coherent explanations of how cities function and evolve. The emergence of AI scientists, i.e., agents capable of autonomous reasoning, hypothesis formation and data-driven experimentation, offers a new pathway toward accelerating this transformation, yet general-purpose systems fall short of the domain knowledge and methodological depth required for urban science research. Here we introduce a knowledge-driven AI Urban Scientist, built from hypotheses, peer-review signals, datasets and analytical patterns distilled from thousands of high-quality studies, and implemented as a coordinated multi-agent framework for end-to-end inquiry. The system generates structured hypotheses, retrieves and harmonizes heterogeneous datasets, conducts automated empirical analysis and simulation, and synthesizes insights in forms compatible with urban scientific reasoning. By providing reusable analytical tools and supporting community-driven extensions, the AI Urban Scientist lowers barriers to advanced urban analytics and acts not merely as an assistant but as an active collaborator in revealing the mechanisms that shape urban systems and in guiding the design of more resilient and equitable cities.


cs.LG [Back]

[102] ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models cs.LG | cs.AI | cs.CLPDF

Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li

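TL;DR: ThreadWeaver is an adaptive parallel reasoning framework that significantly reduces language model inference latency while keeping accuracy on par with sequential reasoning models.

Details

Motivation: On realistic tasks, existing parallel reasoning methods are either limited to supervised behavior cloning or lose significant accuracy, and many need customized inference engines that complicate deployment.

Result: Across six mathematical reasoning benchmarks, ThreadWeaver trained on Qwen3-8B matches the accuracy of leading sequential reasoning models (71.9% on average, 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency.

Insight: ThreadWeaver markedly improves inference efficiency while preserving accuracy, offering a deployable solution for parallel reasoning in language models.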

Abstract: Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver’s performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.


[103] Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training cs.LG | cs.AI | cs.CLPDF

Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram

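TL;DR: This study challenges the conventional view by proposing a framework that directly models how LLM performance on downstream tasks scales, finding that a simple power law in the training budget accurately describes scaling behavior on multiple downstream tasks.

Details

Motivation: Scaling laws for LLMs have traditionally focused on proxy metrics such as pretraining loss, while predicting downstream task performance has been considered unreliable. The paper tests whether directly modeling the relationship between training budget and downstream performance improves prediction.

Result: The direct approach extrapolates better than the conventional two-stage approach, and the new functional forms predict performance effectively under different conditions.

Insight: Downstream performance scaling can be predicted reliably by directly modeling the relationship between training budget and task accuracy, and the power law is an effective tool for doing so.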

Abstract: While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
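The idea of fitting a power law to log accuracy can be sketched in a few lines. The functional form used here, -log(accuracy) = a * C^(-b) with training budget C, is an assumed parameterization for illustration; the abstract does not spell out the paper's exact form. The data points are synthetic, generated from the assumed law itself, and are not results from the paper.

```python
import math

def fit_power_law(compute, accuracy):
    """Least-squares fit of -log(accuracy) = a * compute**(-b) in log-log space.

    Requires every accuracy to lie strictly between 0 and 1.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(-math.log(acc)) for acc in accuracy]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict_accuracy(a, b, compute):
    """Invert the fitted law to get accuracy at a given training budget."""
    return math.exp(-a * compute ** (-b))

# Synthetic points generated from the assumed law itself (not real results).
true_a, true_b = 50.0, 0.2
budgets = [1e18, 1e19, 1e20, 1e21]
accs = [predict_accuracy(true_a, true_b, c) for c in budgets]
a, b = fit_power_law(budgets, accs)
print(a, b, predict_accuracy(a, b, 1e22))
```

Because the law is linear in log-log space, ordinary least squares is enough for the fit, and extrapolating to a larger budget is a single call to `predict_accuracy`.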


[104] LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model cs.LG | cs.CVPDF

Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin

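TL;DR: LAPA is a dynamic sparsity accelerator based on log-domain attention prediction. It applies cross-stage sparse acceleration to the computational bottlenecks of Transformer models and significantly improves energy efficiency through algorithm-architecture co-design.

Details

Motivation: When Transformer models process variable-length input sequences, their computational bottlenecks shift dynamically; most existing sparse acceleration methods are single-stage designs with high power overhead, which LAPA aims to address.

Result: Experiments show that LAPA achieves 3.52x, 3.24x, and 2.79x higher energy efficiency than the SOTA works Spatten, Sanger, and FACT, respectively.

Insight: Dynamic sparse acceleration and algorithm-hardware co-design are key directions for improving Transformer computational efficiency.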

Abstract: Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence varies, the computational bottlenecks in Transformer models exhibit dynamic behavior across stages, which calls for a cross-stage sparse acceleration strategy. Unfortunately, most existing sparse Transformer approaches are single-stage based, and their sparsity prediction mechanisms lead to significant power overhead when applied across multiple stages. To this end, this paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA. First, an asymmetric leading one computing (ALOC) scheme is designed to eliminate expensive multiplications. Next, a mixed-precision multi-round shifting accumulation (MRSA) mechanism is further proposed to mitigate the accumulation overhead. A data-feature dependent filter (DDF) strategy is designed to work in concert with the MRSA process. Finally, an elaborate accelerator is designed to translate the theoretical enhancement into practical hardware improvement. Experimental results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works Spatten, Sanger and FACT, respectively.
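To illustrate how a leading-one scheme can avoid multiplications, the sketch below replaces each operand with the power of two at its most significant set bit, so a product becomes an exponent addition and a shift. This shows the generic log-domain idea only: the abstract does not describe the asymmetric part of ALOC, and the function names and the restriction to non-negative integers are assumptions.

```python
def leading_one(x):
    """Index of the most significant set bit of a positive integer."""
    if x <= 0:
        raise ValueError("expects a positive integer")
    return x.bit_length() - 1

def approx_product(q, k):
    """Approximate q*k as 2**(lo(q) + lo(k)): one add and one shift, no multiply."""
    return 1 << (leading_one(q) + leading_one(k))

def approx_dot(qs, ks):
    """Approximate dot product of non-negative integer vectors; zero terms contribute 0."""
    return sum(approx_product(q, k) for q, k in zip(qs, ks) if q and k)

print(approx_product(13, 6), 13 * 6)
```

For positive integers the estimate never exceeds the exact product and is always more than a quarter of it. That coarse accuracy may be enough for ranking attention scores when predicting sparsity, with exact arithmetic reserved for the entries that are kept.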


[105] GSPN-2: Efficient Parallel Sequence Modeling cs.LG | cs.AI | cs.CVPDF

Hongjun Wang, Yitong Jiang, Collin McCarthy, David Wehr, Hanrong Ye

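TL;DR: GSPN-2 is an efficient parallel sequence modeling method that, through joint algorithm-system design, removes the overhead in GSPN's GPU implementation while retaining high accuracy on vision tasks.

Details

Motivation: Although GSPN reduces computational complexity through line-scan propagation, its GPU implementation still suffers from repeated kernel launches, data transfers, and redundant computation. GSPN-2 aims to remove these bottlenecks.

Result: GSPN-2 matches Transformer-level accuracy at significantly lower computational cost, setting a new efficiency frontier for modeling global spatial context in vision tasks.

Insight: Joint algorithm-hardware optimization is key to computational efficiency; combining structured matrix transformations with GPU optimization can provide efficient solutions for vision tasks.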

Abstract: The efficiency of vision transformers remains a bottleneck for real-world applications involving high-resolution images and long videos. Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme, bringing the cost close to linear in the number of rows or columns, while retaining accuracy. Despite this advancement, the existing GSPN implementation still suffers from (i) heavy overhead due to repeatedly launching GPU kernels, (ii) excessive data transfers from global GPU memory, and (iii) redundant computations caused by maintaining separate propagation weights for each channel. We introduce GSPN-2, a joint algorithm-system redesign. In particular, we consolidate thousands of micro-launches from the previous implementation into a single 2D kernel, explicitly pin one warp to each channel slice, and stage the previous column’s activations in shared memory. On the model side, we introduce a compact channel propagation strategy that replaces per-channel matrices, trimming parameters and aligning naturally with the affinity map used in transformer attention. Experiments demonstrate GSPN-2’s effectiveness across image classification and text-to-image synthesis tasks, matching transformer-level accuracy with significantly lower computational cost. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications through its unique combination of structured matrix transformations and GPU-optimized implementation. Project page: https://whj363636.github.io/GSPN2/


[106] TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models cs.LG | cs.AI | cs.CVPDF

Zheng Ding, Weirui Ye

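TL;DR: The paper proposes TreeGRPO, a new reinforcement learning framework that models the denoising process as a search tree and substantially improves training efficiency, achieving a 2.4x speedup over baseline methods.

Details

Motivation: Existing RL post-training methods for aligning generative models with human preferences are computationally prohibitive, which hinders wide adoption. The paper proposes TreeGRPO to address this.

Result: Experiments on diffusion and flow-based models show that TreeGRPO achieves 2.4x faster training and establishes a superior Pareto frontier in the efficiency-reward trade-off space.

Insight: Through its tree-structured design, TreeGRPO offers a scalable and efficient solution to the high computational cost of RL post-training.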

Abstract: Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) High sample efficiency, achieving better performance with the same number of training samples; (2) Fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) Amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4x faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.
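The step-specific credit assignment can be illustrated with a small sketch. The abstract says rewards are backpropagated through the tree but does not give the rule, so the version below makes two assumptions: leaf advantages are group-normalized in the GRPO style, and each internal node receives the mean advantage of the leaves beneath it. The tree encoding, function names, and reward values are hypothetical.

```python
import statistics

def leaf_advantages(rewards):
    """GRPO-style group normalization: (r - mean) / std over all leaves."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def backup(node, adv, path, out):
    """Assign each node the mean advantage of the leaves below it.

    `node` is a leaf index (int) or a non-empty list of child nodes; `out`
    maps the path from the root (tuple of child positions) to the node's
    advantage. Returns (sum_of_leaf_advantages, leaf_count) for the subtree.
    """
    if isinstance(node, int):
        total, count = adv[node], 1
    else:
        total, count = 0.0, 0
        for i, child in enumerate(node):
            child_total, child_count = backup(child, adv, path + (i,), out)
            total += child_total
            count += child_count
    out[path] = total / count
    return total, count

rewards = [1.0, 0.8, 0.2, 0.0]   # made-up rewards for four final samples
tree = [[0, 1], [2, 3]]          # shared noise -> two branches -> two leaves each
adv = leaf_advantages(rewards)
step_adv = {}
backup(tree, adv, (), step_adv)
print(step_adv)
```

Here the first branch gets a positive advantage and the second a negative one, so the denoising step that separates them is credited differently. This contrasts with assigning one trajectory-level value uniformly to every step.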