cs.CV [Total: 153]
cs.CL [Total: 17]
math.OC [Total: 1]
cs.CR [Total: 1]
eess.SP [Total: 1]
cs.LG [Total: 8]
cs.RO [Total: 2]
cs.AI [Total: 4]
cs.DC [Total: 1]
cs.MM [Total: 1]
eess.IV [Total: 3]

cs.CV

[1] Personalized Reward Modeling for Text-to-Image Generation cs.CV | cs.AI

Jeongeun Lee, Ryang Heo, Dongha Lee

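TL;DR: This paper proposes PIGReward, a personalized reward model that evaluates whether images produced by text-to-image (T2I) generation match an individual user's preferences. Using a self-bootstrapping strategy and CoT reasoning, PIGReward dynamically generates user-specific evaluation dimensions even when user data is scarce, and provides personalized feedback. Experiments show advantages in both accuracy and interpretability.

Details

Motivation: Existing evaluation methods for T2I models are mostly general reward functions or similarity-based metrics, which cannot capture a user's personal visual preferences. The paper addresses this with an evaluation method that adapts to individual needs.

Result: Experiments show that PIGReward outperforms existing methods at judging whether images match user preferences, while being more interpretable. PIGBench also exhibits diverse personal visual preferences.

Insight: Personalized evaluation is key to making T2I models more practical. Reasoning combined with a self-bootstrapping strategy enables effective personalization without large amounts of user data.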

Abstract: Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. Conventional evaluation methods, general reward functions or similarity-based metrics, fail to capture the diversity and complexity of personal visual tastes. In this work, we present PIGReward, a personalized reward model that dynamically generates user-conditioned evaluation dimensions and assesses images through CoT reasoning. To address the scarcity of user data, PIGReward adopts a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. Beyond evaluation, PIGReward provides personalized feedback that drives user-specific prompt optimization, improving alignment between generated images and individual intent. We further introduce PIGBench, a per-user preference benchmark capturing diverse visual interpretations of shared prompts. Extensive experiments demonstrate that PIGReward surpasses existing methods in both accuracy and interpretability, establishing a scalable and reasoning-based foundation for personalized T2I evaluation and optimization. Taken together, our findings highlight PIGReward as a robust step toward individually aligned T2I generation.

[2] SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data cs.CV | cs.AI | cs.LG

Penghao Rao, Runmin Jiang, Min Xu

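TL;DR: SG-OIF is a stability-guided online influence framework that uses real-time control and a modular design to estimate, efficiently and reliably, how training samples influence test predictions in deep-learning vision models.

Details

Motivation: Applying existing influence-function methods to deep-learning vision models is computationally expensive, does not adapt to changing training dynamics, and lacks confidence calibration, which leads to inaccurate influence rankings. SG-OIF targets these problems.

Result: Achieves SOTA performance on noisy-label and out-of-distribution detection tasks: 91.1% accuracy on the top 1% of samples on CIFAR-10 (20% asymmetric noise) and a 99.8% AUPR score on MNIST.

Insight: Algorithmic stability can serve as an effective signal for real-time control, and a modular design adapts flexibly to dynamic training while providing robust influence estimates.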

Abstract: Approximating training-point influence on test predictions is critical for deploying deep-learning vision models and essential for locating noisy data. Though the influence function was proposed for attributing how infinitesimal up-weighting or removal of individual training examples affects model outputs, its implementation is still challenging in deep-learning vision models: inverse-curvature computations are expensive, and training non-stationarity invalidates static approximations. Prior works use iterative solvers and low-rank surrogates to reduce cost, but offline computation lags behind training dynamics, and missing confidence calibration yields fragile rankings that misidentify critical examples. To address these challenges, we introduce a Stability-Guided Online Influence Framework (SG-OIF), the first framework that treats algorithmic stability as a real-time controller, which (i) maintains lightweight anchor inverse-Hessian-vector products (IHVPs) via stochastic Richardson and preconditioned Neumann; (ii) proposes modular curvature backends to modulate per-example influence scores using stability-guided residual thresholds, anomaly gating, and confidence. Experimental results show that SG-OIF achieves state-of-the-art (SOTA) results on noisy-label and out-of-distribution detection tasks across multiple datasets with various corruptions. Notably, our approach achieves 91.1% accuracy in the top 1% prediction samples on the CIFAR-10 (20% asym), and gets 99.8% AUPR score on MNIST, effectively demonstrating that this framework is a practical controller for online influence estimation.
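
Note: the abstract mentions maintaining IHVPs with Richardson and Neumann iterations. The sketch below shows the plain deterministic Richardson iteration, whose iterates equal the partial sums of a scaled Neumann series, on a toy matrix. It is only an illustration of the general idea, not the SG-OIF algorithm: the function name `richardson_ihvp`, the toy Hessian `H`, and the step-size choice are all assumptions made here.

```python
import numpy as np

def richardson_ihvp(hvp, v, step, num_iters=200):
    """Approximate H^{-1} v using only Hessian-vector products.

    Iterates x <- x + step * (v - H x). For a symmetric positive-definite H
    and 0 < step <= 1 / lambda_max(H), this converges to H^{-1} v.
    """
    x = np.zeros_like(v)
    for _ in range(num_iters):
        x = x + step * (v - hvp(x))
    return x

# Toy problem (hypothetical data): a small SPD matrix stands in for the Hessian.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5.0 * np.eye(5)
v = rng.normal(size=5)

step = 1.0 / np.linalg.norm(H, 2)      # 1 / largest eigenvalue of H
approx = richardson_ihvp(lambda x: H @ x, v, step, num_iters=500)
exact = np.linalg.solve(H, v)          # reference solution for comparison
```

In a real network the `hvp` callable would come from automatic differentiation rather than an explicit matrix, and a stochastic variant would use mini-batch Hessian-vector products.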

[3] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks cs.CV | cs.AI | cs.MM

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You

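TL;DR: Pistachio is a video anomaly detection and understanding (VAD/VAU) benchmark built through a generation pipeline. It is designed to address the limited scene diversity, unbalanced anomaly coverage, and missing temporal complexity of existing datasets.

Details

Motivation: Existing video anomaly detection benchmarks fall short in scene diversity, anomaly coverage, and temporal complexity, while manual annotation for video anomaly understanding is extremely time-consuming.

Result: Experiments show that Pistachio is large in scale, highly diverse, and complex. It exposes challenges for existing methods and points to directions for future research.

Insight: Generative methods can efficiently build high-quality video benchmarks, offering a new way to evaluate complex anomaly-understanding tasks.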

Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

[4] Tracking and Segmenting Anything in Any Modality cs.CV | cs.AI | cs.MM

Tianlu Zhang, Qiang Zhang, Guiguang Ding, Jungong Han

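TL;DR: The paper proposes SATA, a universal tracking and segmentation framework. By decoupling multi-modal and multi-task representation learning, it unifies many tasks and input modalities and improves generalization.

Details

Motivation: Existing approaches to tracking and segmentation typically use specialized architectures or modality-specific parameters, which limits their generalization and scalability. SATA aims to overcome the representation gaps across modalities and tasks.

Result: Strong performance on 18 challenging tracking and segmentation benchmarks.

Insight: Decoupling cross-modal and multi-task representation learning is key to more general video understanding.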

Abstract: Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.

[5] The Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks cs.CV | eess.IV

Andrew J. Hanson, Sonya M. Hanson

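TL;DR: The paper proposes a new method, the determinant ratio matrix (DRaM), for pose estimation in 3D matching and 2D orthographic projection alignment tasks, and offers a unified view of the families of solutions to the EnP and OnP problems.

Details

Motivation: Pose estimation is a core problem in computer vision. For 3D-3D alignment (EnP) and 3D-2D orthographic projection alignment (OnP) in particular, existing methods lack a unified mathematical framework. The paper aims to fill this gap and provide a more general solution.

Result: DRaM provides a new solution that solves the EnP problem exactly and gives a closed-form approximate solution to the OnP problem. The DRaM framework also reveals the mathematical unity of existing methods.

Insight: 1. DRaM rests on classical mathematics: its derivation could have been discovered in the time of Gauss.
2. The method is not limited to 3D and 2D problems; it generalizes to higher-dimensional Euclidean spaces.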

Abstract: Pose estimation is a general problem in computer vision with wide applications. The relative orientation of a 3D reference object can be determined from a 3D rotated version of that object, or from a projection of the rotated object to a 2D planar image. This projection can be a perspective projection (the PnP problem) or an orthographic projection (the OnP problem). We restrict our attention here to the OnP problem and the full 3D pose estimation task (the EnP problem). Here we solve the least squares systems for both the error-free EnP and OnP problems in terms of the determinant ratio matrix (DRaM) approach. The noisy-data case can be addressed with a straightforward rotation correction scheme. While the SVD and optimal quaternion eigensystem methods solve the noisy EnP 3D-3D alignment exactly, the noisy 3D-2D orthographic (OnP) task has no known comparable closed form, and can be solved by DRaM-class methods. We note that while previous similar work has been presented in the literature exploiting both the QR decomposition and the Moore-Penrose pseudoinverse transformations, here we place these methods in a larger context that has not previously been fully recognized in the absence of the corresponding DRaM solution. We term this class of solutions as the DRaM family, and conduct comparisons of the behavior of the families of solutions for the EnP and OnP rotation estimation problems. Overall, this work presents both a new solution to the 3D and 2D orthographic pose estimation problems and provides valuable insight into these classes of problems. With hindsight, we are able to show that our DRaM solutions to the exact EnP and OnP problems possess derivations that could have been discovered in the time of Gauss, and in fact generalize to all analogous N-dimensional Euclidean pose estimation problems.
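
Note: the abstract cites the SVD method as an exact solver for 3D-3D alignment. A minimal sketch of that SVD baseline (commonly called the Kabsch procedure) follows. It is not the DRaM method. The function names, the test rotation, and the random points are assumptions made for illustration, and the point sets are assumed to be already centered so that only a rotation is estimated.

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Rotation R minimizing sum_i ||R p_i - q_i||^2.

    P and Q are (N, 3) arrays whose rows are corresponding 3D points.
    """
    H = P.T @ Q                               # 3x3 cross-covariance sum_i p_i q_i^T
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# Error-free toy data (hypothetical): rotate a random reference cloud.
R_true = rot_z(0.7) @ rot_x(0.3)
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
Q = P @ R_true.T                              # each row is R_true applied to a row of P
R_est = kabsch_rotation(P, Q)
```

With noise-free correspondences the recovered matrix matches the true rotation up to floating-point error; with noisy data the same call returns the least-squares optimum.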

[6] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning cs.CV

Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou

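TL;DR: The paper proposes GroundingAgent, a visual grounding method that needs no task-specific fine-tuning. An iterative reasoning mechanism combines pretrained open-vocabulary object detectors with multimodal large language models and reaches 65.1% zero-shot accuracy.

Details

Motivation: Existing visual grounding methods depend on extensive task-specific annotation and fine-tuning and generalize poorly. GroundingAgent aims at efficient, general visual grounding without fine-tuning.

Result: On RefCOCO, RefCOCO+, and RefCOCOg, average zero-shot accuracy reaches 65.1%. When the MLLM-generated captions are replaced with the original query texts, accuracy at the selection stage approaches 90%.

Insight: The reasoning ability of large language models is critical for visual grounding. Combining open-vocabulary detectors with multimodal models markedly improves generalization.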

Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

[7] Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning cs.CV | cs.AI | cs.IT | cs.LG

Zhaoqi Xu, Yingying Zhang, Jian Li, Jianwei Guo, Qiannan Zhu

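TL;DR: The paper proposes InfoPrune, an information-theoretic framework for adaptive structural compression of vision-language models (VLMs). It uses the Information Bottleneck principle to quantify redundancy in attention heads and FFN layers, achieving efficient compression and speedup.

Details

Motivation: Existing VLM compression methods mostly rely on heuristic importance metrics or empirical pruning rules and lack theoretical guarantees of information preservation, which hurts model performance and efficiency.

Result: On VQAv2, TextVQA, and GQA, InfoPrune achieves up to 3.2x FLOP reduction and 1.8x speedup with negligible performance loss.

Insight: Information theory gives model compression a theoretical footing. Combining adaptive pruning with low-rank approximation is an effective route to more efficient VLMs.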

Abstract: Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov–Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
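
Note: the abstract names an entropy-based effective rank (eRank) and a low-rank FFN approximation without giving formulas. The sketch below uses one common definition of effective rank, the exponential of the Shannon entropy of the normalized singular values, together with a truncated-SVD approximation. Both are assumptions made here for illustration and may differ from InfoPrune's exact definitions; the function names are hypothetical.

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """exp(entropy) of the normalized singular-value distribution of W."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                       # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

def low_rank_approx(W, rank):
    """Best rank-`rank` approximation of W in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Toy matrices (hypothetical): identity has full effective rank,
# an outer product has effective rank close to one.
full = np.eye(4)
rank_one = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])
```

A matrix whose effective rank is far below its nominal rank is a natural candidate for aggressive low-rank compression, which is the intuition this measure captures.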

[8] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning cs.CV | cs.MA

Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu

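TL;DR: VideoChat-M1 is a video understanding framework based on multi-agent reinforcement learning (MARL). Its Collaborative Policy Planning (CPP) dynamically optimizes tool invocation and improves understanding of complex videos.

Details

Motivation: Existing video understanding frameworks mostly use static, non-learnable tool invocation mechanisms, which limits their ability to discover diverse clues in complex videos.

Result: SOTA performance across eight benchmarks. On LongVideoBench it outperforms Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.

Insight: Combining collaborative policy planning with reinforcement learning markedly improves the robustness and adaptability of video understanding.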

Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user’s query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user’s query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1’s performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.

[9] Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models cs.CV

Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng

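TL;DR: The paper proposes Perceptual Taxonomy, a structured approach to scene understanding that evaluates vision-language models through hierarchical, multi-attribute reasoning, and builds a benchmark covering several question types.

Details

Motivation: Existing vision-language benchmarks focus on surface-level recognition or image-text alignment and lack a comprehensive evaluation of hierarchical scene reasoning. The paper aims to fill this gap.

Result: Experiments show that leading vision-language models perform well on recognition tasks but drop by 10 to 20 percent on property-driven questions that require multi-step reasoning. In-context reasoning examples improve performance.

Insight: Current models rely on pattern matching and lack structured visual understanding. Prompting guided by hierarchical reasoning improves performance on complex tasks.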

Abstract: We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5802 images across both synthetic and real domains. The benchmark contains 28033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.

[10] MapRF: Weakly Supervised Online HD Map Construction via NeRF-Guided Self-Training cs.CV

Hongyu Lyu, Thomas Monninger, Julie Stephany Berrio Perez, Mao Shan, Zhenxing Ming

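TL;DR: MapRF is a weakly supervised framework for online HD map construction that learns 3D maps from 2D image labels alone and improves performance through NeRF and self-training.

Details

Motivation: Existing online HD map construction methods rely on costly 3D annotations, which limits their generalization and scalability. MapRF uses weak supervision to reduce the dependence on annotated data.

Result: On Argoverse 2 and nuScenes, MapRF approaches fully supervised methods, reaching around 75% of the baseline, and surpasses several approaches that use only 2D labels.

Insight: MapRF shows the potential of weak supervision for HD map construction. Combining NeRF with self-training lowers annotation cost and offers a scalable solution for autonomous driving.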

Abstract: Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach to generate local maps from on-board sensors. However, existing methods typically rely on costly 3D map annotations for training, which limits their generalization and scalability across diverse driving environments. In this work, we propose MapRF, a weakly supervised framework that learns to construct 3D maps using only 2D image labels. To generate high-quality pseudo labels, we introduce a novel Neural Radiance Fields (NeRF) module conditioned on map predictions, which reconstructs view-consistent 3D geometry and semantics. These pseudo labels are then iteratively used to refine the map network in a self-training manner, enabling progressive improvement without additional supervision. Furthermore, to mitigate error accumulation during self-training, we propose a Map-to-Ray Matching strategy that aligns map predictions with camera rays derived from 2D labels. Extensive experiments on the Argoverse 2 and nuScenes datasets demonstrate that MapRF achieves performance comparable to fully supervised methods, attaining around 75% of the baseline while surpassing several approaches using only 2D labels. This highlights the potential of MapRF to enable scalable and cost-effective online HD map construction for autonomous driving.

[11] Vidi2: Large Multimodal Models for Video Understanding and Creation cs.CV

Vidi Team, Celong Liu, Chia-Wen Kuo, Chuang Huang, Dawei Du

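TL;DR: Vidi2 is a large multimodal model for video understanding and creation. It adds fine-grained spatio-temporal grounding (STG) and video question answering (Video QA), introduces the new benchmarks VUE-STG and VUE-TR-V2, and outperforms leading proprietary systems.

Details

Motivation: Video has become the primary medium for communication and creativity on the Internet, creating strong demand for scalable, high-quality video production and understanding.

Result: Vidi2 substantially outperforms Gemini 3 Pro (Preview) and GPT-5 on VUE-STG and VUE-TR-V2, and is competitive with open-source models of similar scale on Video QA.

Insight: 1. End-to-end spatio-temporal grounding opens new possibilities for complex video editing; 2. High-quality benchmarks are essential for model evaluation.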

Abstract: Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
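
Note: the abstract mentions a refined vIoU/tIoU/vIoU-Intersection evaluation scheme without defining it. As background only, the sketch below computes the standard temporal IoU between two time ranges and the standard box IoU between two bounding boxes. These are the common textbook definitions, assumed here for illustration, and not the benchmark's refined metric; the function names and sample values are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two time ranges given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted range of 0-10 s against a ground-truth range of 5-15 s
# overlaps for 5 s out of a 15 s union.
t = temporal_iou((0.0, 10.0), (5.0, 15.0))
b = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Spatio-temporal metrics typically combine the two, for instance by averaging per-frame box IoU over a set of frames chosen from the predicted and ground-truth time ranges.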

[12] Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment cs.CV | cs.AI | cs.LG | eess.IV

Muhao Guo, Yang Weng

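TL;DR: This paper studies how well multimodal large language models generalize across domains in photovoltaic (PV) assessment. With structured prompts and fine-tuning, the model integrates detection, localization, and quantification, and outperforms traditional computer vision methods.

Details

Motivation: The rapid growth of distributed PV systems poses challenges for power grid management, and traditional computer vision models need extensive labeled data and generalize poorly. The study explores the cross-domain generalization of a multimodal LLM.

Result: Experiments show that the method performs best on unseen regions with the smallest performance degradation, outperforming conventional CV and transformer baselines.

Insight: Multimodal LLMs are robust under domain shift and are scalable, transferable, and interpretable, which makes them suitable for global PV assessment.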

Abstract: The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the ΔF1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.

[13] Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration cs.CV | cs.CL | cs.DL

Remi Petitpierre

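TL;DR: This thesis presents methods and datasets for investigating cartographic heritage at scale, and analyzes the semantic-symbolic systems of maps and their political and epistemic context from a historical and cultural perspective. Drawing on more than 770,000 map records and nearly 100,000 digitized images, it relates the spatial focus of maps to political dynamics and introduces semantic segmentation and object detection models for recognizing map content.

Details

Motivation: Heritage institutions have digitized a large number of maps, but existing automated methods engage little with the historical and cultural context of maps. This work aims to fill that gap and reveal the semantic-symbolic systems of maps as cultural objects, along with their evolution.

Result: 1. Reveals links between map design and political dynamics such as colonial expansion. 2. Shows the local consistency and evolution of cartographic sign systems, for example the replacement of relief hachures by terrain contours. 3. Finds that major cities and large institutions play a role in spreading figurative norms and semiotic cultures.

Insight: 1. Maps are not only geographic tools but also carriers of cultural symbols and political understanding. 2. Combining automated techniques with historical analysis can reveal deep patterns in large-scale cultural heritage. 3. The evolution of sign systems reflects shifts in technology and power.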

Abstract: This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.

[14] Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation cs.CV

Jaeyeong Kim, Seungwoo Yoo, Minhyuk Sung

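TL;DR: SpLap is a proxy-free deformation method for Gaussian splats (GS). It builds a Laplacian operator from a surface-aware splat graph, which avoids the dependence on proxies while preserving detail and topology under deformation.

Details

Motivation: Existing GS deformation methods usually rely on proxies such as cages or meshes, but they depend on proxy quality and add computational overhead. Applying Laplacian deformation directly to splats treated as a point cloud fails to capture surface information accurately because there is no explicit structure.

Result: Experiments on the ShapeNet, Objaverse, Sketchfab, and NeRF-Synthetic datasets show that SpLap outperforms both proxy-based and proxy-free baselines in deformation quality and rendering quality.

Insight: The intersections between splats, rather than plain point-cloud distances, are key to capturing surface structure. A Laplacian operator combined with a surface-aware graph effectively improves deformation quality.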

Abstract: We introduce SpLap, a proxy-free deformation method for Gaussian splats (GS) based on a Laplacian operator computed from our novel surface-aware splat graph. Existing approaches to GS deformation typically rely on deformation proxies such as cages or meshes, but they suffer from dependency on proxy quality and additional computational overhead. An alternative is to directly apply Laplacian-based deformation techniques by treating splats as point clouds. However, this often fails to properly capture surface information due to the lack of explicit structure. To address this, we propose a novel method that constructs a surface-aware splat graph, enabling the Laplacian operator derived from it to support more plausible deformations that preserve details and topology. Our key idea is to leverage the spatial arrangement encoded in splats, defining neighboring splats not merely by the distance between their centers, but by their intersections. Furthermore, we introduce a Gaussian kernel adaptation technique that preserves surface structure under deformation, thereby improving rendering quality after deformation. In our experiments, we demonstrate the superior performance of our method compared to both proxy-based and proxy-free baselines, evaluated on 50 challenging objects from the ShapeNet, Objaverse, and Sketchfab datasets, as well as the NeRF-Synthetic dataset. Code is available at https://github.com/kjae0/SpLap.

[15] Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment cs.CV | cs.AI | cs.LG

Ehsan Karimi, Nhut Le, Maryam Rahnemoonfar

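TL;DR: ThiFAN-VQA is a two-stage reasoning framework for visual question answering (VQA) in disaster scenarios. It generates structured reasoning traces with chain-of-thought (CoT) prompting and in-context learning (ICL), and adds an answer selection module to improve model performance. The framework shows superior accuracy, interpretability, and adaptability on the FloodNet and RescueNet-VQA datasets.

Details

Motivation: Timely and accurate damage assessment after natural disasters is essential for emergency response. Existing AI methods rely either on classification frameworks with fixed answer spaces or on generative models, and suffer from costly data collection, poor generalization, and unreliable generated output. A flexible and reliable solution is needed.

Result: On the FloodNet and RescueNet-VQA datasets, ThiFAN-VQA outperforms existing methods in accuracy, interpretability, and adaptability.

Insight: 1) Combining CoT prompting with ICL strengthens reasoning under limited supervision; 2) the answer selection module effectively reduces hallucination in generative models; 3) the two-stage design balances flexibility and consistency.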

Abstract: Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles, providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improving model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.

[16] HunyuanOCR Technical Report cs.CV | cs.AI

Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng

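TL;DR: HunyuanOCR is a lightweight (1B parameters), open-source vision-language model (VLM) built for OCR tasks. It outperforms commercial APIs and larger models and took first place in the ICDAR 2025 DIMT Challenge (Small Model Track).

Details

Motivation: Address the narrow functionality of traditional OCR expert models and the inefficiency of general vision-language models, while avoiding the error propagation of traditional pipelines.

Result: Strong performance on tasks such as Text Spotting and Parsing, and SOTA results on OCRBench among VLMs with fewer than 3B parameters.

Insight: High-quality data and reinforcement learning significantly improve OCR performance, and an end-to-end design reduces system complexity.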

Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow “OCR expert models” and inefficient “General VLMs”. 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

[17] Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach cs.CV

Maria Thoma, Michalis A. Savelonas, Dimitris K. Iakovidis

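TL;DR: The paper proposes a GAN-based semi-supervised segmentation method for accurately delineating early ischemic stroke regions in NCCT images, improving model performance by combining labeled and unlabeled data.

Details

Motivation: Ischemic stroke diagnosis depends on fast, accurate image analysis, but NCCT struggles to reveal subtle ischemic changes in the hyperacute phase, which can delay intervention. A method is therefore needed that makes effective use of limited annotated data together with a large pool of unlabeled data.

Result: Experiments on the public AISD dataset show that the method can enhance diagnostic capability, reduce the manual annotation burden, and support more efficient clinical decision-making.

Insight: Exploiting unlabeled data through semi-supervised learning can substantially improve medical image segmentation while reducing reliance on annotated data.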

Abstract: Ischemic stroke is a time-critical medical emergency where rapid diagnosis is essential for improving patient outcomes. Non-contrast computed tomography (NCCT) serves as the frontline imaging tool, yet it often fails to reveal the subtle ischemic changes present in the early, hyperacute phase. This limitation can delay crucial interventions. To address this diagnostic challenge, we introduce a semi-supervised segmentation method using generative adversarial networks (GANs) to accurately delineate early ischemic stroke regions. The proposed method employs an adversarial framework to effectively learn from a limited number of annotated NCCT scans, while simultaneously leveraging a larger pool of unlabeled scans. By employing Dice loss, cross-entropy loss, a feature matching loss and a self-training loss, the model learns to identify and delineate early infarcts, even when they are faint or their size is small. Experiments on the publicly available Acute Ischemic Stroke Dataset (AISD) demonstrate the potential of the proposed method to enhance diagnostic capabilities, reduce the burden of manual annotation, and support more efficient clinical decision-making in stroke care.
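
The abstract names Dice loss and cross-entropy loss among the training terms. A minimal sketch of those two supervised terms follows, written in plain Python over flattened per-pixel probabilities. The function names, the smoothing constants, and the equal default weighting are assumptions for illustration, and the feature matching and self-training terms are omitted because the abstract does not specify them.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for flat lists of foreground probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def cross_entropy_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def supervised_loss(pred, target, dice_weight=1.0, ce_weight=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return dice_weight * dice_loss(pred, target) + ce_weight * cross_entropy_loss(pred, target)
```

Dice loss is driven by region overlap, so it stays informative when the infarct covers only a few pixels, whereas cross-entropy scores every pixel independently. That is one plausible reason for pairing them when the target regions are small or faint.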

[18] SkillSight: Efficient First-Person Skill Assessment with Gaze cs.CV

Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman

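TL;DR: SkillSight is an efficient first-person skill assessment method that combines gaze and video data to predict skill level and uses only gaze data at inference, sharply reducing power consumption.

Details

Motivation: Egocentric perception on smart glasses could help people learn new skills in the physical world, but automatic skill assessment remains a technical challenge.

Result: The SkillSight teacher model achieves state-of-the-art performance, while the student model needs only gaze input and uses 73x less power than competing methods.

Insight: Gaze is an important cue for skill assessment, and combining it with distillation preserves high accuracy while saving power.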

Abstract: Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.

[19] On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction cs.CV | cs.AI

Ruimin Feng, Xingxin He, Ronald Mercer, Zachary Stewart, Fang Liu

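TL;DR: The paper studies how a pre-trained multimodal foundation model (a vision-language model) can improve undersampled MRI reconstruction by encoding the reconstructed image and auxiliary information into high-level semantic features and optimizing semantic consistency with a contrastive objective.

Details

Motivation: Conventional MRI reconstruction methods rely mainly on low-level priors and make little use of high-level semantic information. The paper explores how a vision-language foundation model can supply high-level contextual information to improve reconstruction quality.

Result: Experiments show that semantic priors better preserve fine anatomical structures and improve perceptual quality (lower LPIPS, higher Tenengrad scores, better reader-study scores) compared with conventional regularization, and that image-language information additionally enables high-level control over reconstruction attributes.

Insight: Vision-language foundation models open new possibilities for MRI reconstruction through semantic-space optimization; introducing high-level semantic information can markedly improve perceptual quality and semantic consistency.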

Abstract: Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Methods: We proposed a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution, ensuring consistency with high-level perceptual cues. The proposed objective works with various deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources. To test the effectiveness of these semantic priors, we evaluated reconstruction results guided by priors derived from either image-only or image-language auxiliary information. Results: Experiments on knee and brain datasets demonstrate that semantic priors from images preserve fine anatomical structures and achieve superior perceptual quality, as reflected in lower LPIPS values, higher Tenengrad scores, and improved scores in the reader study, compared with conventional regularization. The image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes. Across all evaluations, the contrastive objective consistently guided the reconstructed features toward the desired semantic distributions while maintaining data fidelity, demonstrating the effectiveness of the proposed optimization framework. Conclusion: The study highlights that vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization.
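
The abstract does not give the form of its contrastive objective. As a rough illustration of how such an objective pulls a reconstructed representation toward target semantic features, here is a minimal sketch that assumes an InfoNCE-style loss over cosine similarities. The function names, the temperature, and the choice of negatives are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to the positive than to the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]
```

In this reading, `anchor` would be the semantic feature of the current reconstruction, `positive` a feature drawn from the target semantic distribution, and `negatives` features from undesired distributions. In a full pipeline the term would be added to a data-fidelity loss rather than used alone.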

[20] Navigating Gigapixel Pathology Images with Large Multimodal Models cs.CV

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai

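TL;DR: The paper proposes GIANT, a framework that lets large multimodal models (LMMs) iteratively navigate whole-slide images (WSIs) like a pathologist, and releases a new benchmark, MultiPathQA. Experiments show that GIANT substantially outperforms conventional patch- and thumbnail-based baselines on pathology tasks, approaching or surpassing specially trained models.

Details

Motivation: Existing LMMs perform poorly at interpreting medical images, particularly gigapixel pathology images, possibly because earlier studies used low-resolution thumbnails or random patches. The paper addresses this and explores the potential of LMMs in pathology.

Result: GIANT performs well on pathology tasks. For example, on pathologist-authored questions, GPT-5 with GIANT reaches 62.5% accuracy, ahead of TITAN (43.8%) and SlideChat (37.5%).

Insight: With a suitable navigation strategy, LMMs can play a meaningful role in pathology. GIANT's results reveal both the potential and the limitations of foundation models in specialized domains.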

Abstract: Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically-relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.

[21] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization cs.CV

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia Liu

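TL;DR: The paper exposes unfaithful tool use in current visual agents and proposes CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO) to improve faithfulness.

Details

Motivation: When existing vision-language models call image operations, final-answer accuracy can be high while intermediate tool use is often unfaithful: tools are invoked on regions unrelated to the evidence, or tool outputs are ignored. This motivates studying faithfulness directly.

Result: After two-stage SFT+RL training, CodeV substantially increases faithful tool-use rates on visual search tasks while maintaining high accuracy, and performs strongly on multimodal reasoning and math benchmarks.

Insight: Explicitly supervising intermediate tool behavior is key to building trustworthy visual reasoning systems; representing tools as code and applying process-level reinforcement learning improve faithfulness and verifiability.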

Abstract: Agentic vision-language models are increasingly trained to “think with images” by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.

[22] OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis cs.CV

Istiak Ahmed, Galib Ahmed, K. Shahriar Sanjid, Md. Tanzim Hossain, Md. Nishan Khan

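TL;DR: OncoVision is a multimodal AI system that improves breast cancer diagnostic accuracy by combining mammograms with clinical data. It uses an attention-driven encoder-decoder model to segment multiple ROIs and predict clinical features, and improves diagnostic precision through late fusion.

Details

Motivation: Early diagnosis of breast cancer is critical to treatment outcomes. Existing AI methods typically rely on imaging data alone and overlook clinical information, which limits diagnostic precision. OncoVision aims to close this gap by integrating multimodal data.

Result: OncoVision achieves SOTA accuracy on ROI segmentation, robustly predicts structured clinical features, and reduces inter-observer variability in diagnosis.

Insight: Integrating imaging and clinical data can substantially improve the performance and trustworthiness of AI diagnostic systems; late fusion is an effective way to raise diagnostic precision.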

Abstract: OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for better breast cancer diagnosis. Employing an attention-based encoder-decoder backbone, it jointly segments four ROIs - masses, calcifications, axillary findings, and breast tissues - with state-of-the-art accuracy and robustly predicts ten structured clinical features: mass morphology, calcification type, ACR breast density, and BI-RADS categories. To fuse imaging and clinical insights, we developed two late-fusion strategies. By utilizing complementary multimodal data, late fusion strategies improve diagnostic precision and reduce inter-observer variability. Operationalized as a secure, user-friendly web application, OncoVision produces structured reports with dual-confidence scoring and attention-weighted visualizations for real-time diagnostic support to improve clinician trust and facilitate medical teaching. It can be easily incorporated into the clinic, making screening available in underprivileged areas around the world, such as rural South Asia. Combining accurate segmentation with clinical intuition, OncoVision raises the bar for AI-based mammography, offering a scalable and equitable solution to detect breast cancer at an earlier stage and enhancing treatment through timely interventions.

[23] INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models cs.CV

Parsa Madinei, Ryan Solgi, Ziqi Wen, Jonathan Skaza, Miguel Eckstein

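TL;DR: INTERLACE is a new framework that prunes redundant layers in large vision-language models (VLMs) while preserving performance through efficient finetuning.

Details

Motivation: Existing layer pruning methods cause significant performance drops when applied to VLMs, so a method is needed that removes redundant layers without sacrificing performance.

Result: After pruning 25% of the network and finetuning for one epoch on only 1% of the data, INTERLACE retains 88.9% of average performance, a SOTA result.

Insight: Local redundancy analysis combined with an interleaved finetune-freeze design allows performance to be recovered efficiently after pruning, at much lower compute cost.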

Abstract: We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to a significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy: we remove the more redundant of the first two layers, finetune the remaining layer to compensate for the lost capacity, and freeze the third layer to serve as a stable anchor during finetuning. We found that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, achieving SOTA performance. Our code is available at: https://github.com/pmadinei/Interlace.git
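
The triplet rule described in the abstract can be sketched as a small planning function. This is an illustration under stated assumptions, not the authors' implementation: the per-layer `redundancy` scores are taken as given, triplets are assumed to be non-overlapping, and the rule for choosing which triplets to prune (highest redundancy first, up to `num_to_drop`) is hypothetical.

```python
def plan_triplet_pruning(redundancy, num_to_drop):
    """Assign each layer a role: 'drop', 'finetune', 'freeze', or 'keep'.

    redundancy: one score per layer, higher meaning more redundant (assumed given).
    num_to_drop: how many layers to remove in total (hypothetical budget).
    """
    n = len(redundancy)
    roles = ["keep"] * n
    candidates = []
    # Walk over non-overlapping triplets of consecutive layers.
    for start in range(0, n - 2, 3):
        first, second, third = start, start + 1, start + 2
        # Within a triplet, the more redundant of the first two layers is the drop candidate.
        if redundancy[first] >= redundancy[second]:
            drop, tune = first, second
        else:
            drop, tune = second, first
        candidates.append((redundancy[drop], drop, tune, third))
    # Prune the triplets whose drop candidate is most redundant.
    candidates.sort(reverse=True)
    for _, drop, tune, anchor in candidates[:num_to_drop]:
        roles[drop] = "drop"       # removed from the network
        roles[tune] = "finetune"   # trained to absorb the lost capacity
        roles[anchor] = "freeze"   # held fixed as a stable anchor
    return roles
```

With eight layers and `num_to_drop=2`, the plan removes 25% of the layers, matching the pruning ratio quoted in the abstract, and leaves every dropped layer next to one trainable layer and one frozen layer.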

[24] IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants cs.CV | cs.AI | cs.HC | cs.RO

Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang

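TL;DR: The paper introduces IndEgo, a multimodal dataset of industrial and collaborative tasks that covers egocentric and exocentric views and provides rich annotations and benchmarks.

Details

Motivation: Collaborative tasks in industrial settings demand substantial cognitive and physical effort, yet high-quality multimodal datasets to support research are lacking. IndEgo aims to fill this gap.

Result: Baseline evaluations show that current state-of-the-art multimodal models still find the dataset challenging.

Insight: IndEgo provides new resources and challenges for research on collaborative task understanding, mistake detection, and question answering in industrial settings.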

Abstract: We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo

[25] CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation cs.CV

Abdurahman Ali Mohammed, Wallapak Tavanapong, Catherine Fonder, Donald S. Sakaguchi

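TL;DR: CountXplain proposes an interpretable cell counting method based on prototype learning, achieving transparency through density map estimation.

Details

Motivation: Cell counting in biomedical images is a key task, but the poor interpretability of deep learning models hinders clinical adoption.

Result: Experiments on two public datasets show that the method achieves interpretability without compromising counting accuracy.

Insight: With prototype learning, the model not only counts but also provides visual explanations, strengthening the trust of clinical users.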

Abstract: Cell counting in biomedical imaging is pivotal for various clinical applications, yet the interpretability of deep learning models in this domain remains a significant challenge. We propose a novel prototype-based method for interpretable cell counting via density map estimation. Our approach integrates a prototype layer into the density estimation network, enabling the model to learn representative visual patterns for both cells and background artifacts. The learned prototypes were evaluated through a survey of biologists, who confirmed the relevance of the visual patterns identified, further validating the interpretability of the model. By generating interpretations that highlight regions in the input image most similar to each prototype, our method offers a clear understanding of how the model identifies and counts cells. Extensive experiments on two public datasets demonstrate that our method achieves interpretability without compromising counting effectiveness. This work provides researchers and clinicians with a transparent and reliable tool for cell counting, potentially increasing trust and accelerating the adoption of deep learning in critical biomedical applications. Code is available at https://github.com/NRT-D4/CountXplain.

[26] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models cs.CV

Omar Alama, Darshil Jariwala, Avigyan Bhattacharya, Seungchan Kim, Wenshan Wang

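TL;DR: RADSeg is a parameter- and compute-efficient zero-shot open-vocabulary segmentation method that builds on the RADIO model and uses techniques such as self-correlating recursive attention to improve segmentation while lowering compute and memory cost.

Details

Motivation: Existing open-vocabulary semantic segmentation methods either depend on limited training data, which hurts generalization, or combine multiple models at high compute and memory cost. RADSeg aims to address both problems.

Result: In the base ViT class, RADSeg improves mIoU by 6-30% while running 3.95x faster and using 2.5x fewer parameters, and it even outperforms combinations of larger models.

Insight: Efficiently adapting a foundation model can substantially improve open-vocabulary segmentation without increasing compute cost.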

Abstract: Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g., CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (105M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

[27] Rethinking Vision Transformer Depth via Structural Reparameterization cs.CV

Chengwei Zhou, Vipin Chaudhary, Gourav Datta

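TL;DR: The paper proposes a branch-based structural reparameterization technique that reduces the number of Vision Transformer layers while preserving representational capacity, speeding up inference.

Details

Motivation: The computational overhead of Vision Transformers stems mainly from their deep architectures. Existing acceleration strategies focus on algorithm-level optimizations (such as token pruning and attention speedup) and neglect reducing depth directly. The paper asks whether inference can be accelerated by reducing the number of layers.

Result: On ViT-Tiny, the 12 layers are reduced to 6, 4, or even 3 while maintaining ImageNet-1K classification accuracy, with inference speedups of up to 37% on mobile CPUs.

Insight: The conventional preference for very deep transformer stacks may be overly conservative; structural reparameterization offers a new route to efficient vision transformers.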

Abstract: The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
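
The exactness claim rests on a standard identity: parallel linear branches that feed the same nonlinearity can be summed into a single linear map with no approximation. The sketch below demonstrates only that identity in plain Python. The function names are hypothetical, and the paper's gradual merging schedule and its handling of FFN and MHSA modules are not reproduced.

```python
def linear(W, b, x):
    """Apply y = W x + b for a weight matrix (list of rows) and bias vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def multi_branch(branches, x):
    """Training-time form: sum the outputs of several parallel linear branches."""
    outputs = [linear(W, b, x) for W, b in branches]
    return [sum(vals) for vals in zip(*outputs)]


def merge_branches(branches):
    """Inference-time form: fold all branches into one (W, b) pair by summation."""
    rows = len(branches[0][0])
    cols = len(branches[0][0][0])
    W = [[sum(Wb[r][c] for Wb, _ in branches) for c in range(cols)] for r in range(rows)]
    b = [sum(bb[r] for _, bb in branches) for r in range(rows)]
    return W, b
```

Because the sum is taken before any nonlinearity, `linear(*merge_branches(branches), x)` returns the same vector as `multi_branch(branches, x)` up to floating-point rounding, which is why merging at the entry point of a nonlinear component introduces no test-time error.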

[28] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans cs.CV

Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

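TL;DR: The paper proposes an efficient, transferable optimal transport method based on min-Sliced Transport Plans (min-STP), which lowers computational cost by optimizing a one-dimensional projection (slice), and studies how well slicers transfer under distribution shift.

Details

Motivation: Optimal transport (OT) is widely used in computer vision, but its high computational cost limits scalability. Slicing methods reduce the cost, but their transferability under distribution shift is unclear. The paper addresses this question.

Result: Experiments show that min-STP achieves strong one-shot matching performance and supports amortized training for point cloud alignment and flow-based generative modeling.

Insight: Optimized sliced transport plans are robust to small changes in the distributions and transfer efficiently across related tasks, offering a new approach to large-scale OT problems.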

Abstract: Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
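
The slicing idea in the abstract can be illustrated with a small example: project both point sets onto a direction, match points by sorted order (the closed-form 1D solution), then score that matching in the original space. The sketch assumes two equal-size, uniformly weighted point sets, a squared Euclidean cost, and a brute-force search over a few candidate directions in place of the optimization the paper performs. All function names are hypothetical.

```python
def sliced_plan(X, Y, theta):
    """Match points of X and Y by the sorted order of their projections onto theta."""
    def project(p):
        return sum(a * b for a, b in zip(p, theta))
    order_x = sorted(range(len(X)), key=lambda i: project(X[i]))
    order_y = sorted(range(len(Y)), key=lambda j: project(Y[j]))
    return list(zip(order_x, order_y))


def plan_cost(X, Y, plan):
    """Mean squared Euclidean cost of a matching, measured in the ambient space."""
    total = 0.0
    for i, j in plan:
        total += sum((a - b) ** 2 for a, b in zip(X[i], Y[j]))
    return total / len(plan)


def min_sliced_plan(X, Y, thetas):
    """Pick the candidate direction whose induced matching has the lowest ambient cost."""
    best = min(thetas, key=lambda th: plan_cost(X, Y, sliced_plan(X, Y, th)))
    return best, sliced_plan(X, Y, best)
```

Each candidate costs only two sorts and one pass over the matched pairs. In this reading, transferability means reusing a direction chosen on one distribution pair for a new, related pair instead of searching again.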

[29] Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools cs.CV | cs.AI

Abdul Rahman Diab, Emily E. Karn, Renchin Wu, Emily S. Ruiz, William Lotter

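TL;DR: The paper introduces PathFMTools, a lightweight Python package for efficiently running, analyzing, and visualizing pathology foundation models. Using the tool, it evaluates two vision-language foundation models, CONCH and MUSK, on histological grading of cutaneous squamous cell carcinoma (cSCC), and validates the potential of training small specialist models on foundation model embeddings.

Details

Motivation: Despite the promise of computational pathology foundation models, applying them to specific clinical tasks remains difficult because of the complexity of whole-slide image (WSI) processing, the opacity of learned features, and the many possible adaptation strategies.

Result: Experiments show that foundation model embeddings can be used to train effective small specialist models, and reveal trade-offs among prediction approaches.

Insight: Pathology foundation models show promise for real-world clinical use, and PathFMTools supports efficient analysis and validation.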

Abstract: Despite the promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging due to the complexity of whole-slide image (WSI) processing, the opacity of learned features, and the wide range of potential adaptation strategies. To address these challenges, we introduce PathFMTools, a lightweight, extensible Python package that enables efficient execution, analysis, and visualization of pathology foundation models. We use this tool to interface with and evaluate two state-of-the-art vision-language foundation models, CONCH and MUSK, on the task of histological grading in cutaneous squamous cell carcinoma (cSCC), a critical criterion that informs cSCC staging and patient management. Using a cohort of 440 cSCC H&E WSIs, we benchmark multiple adaptation strategies, demonstrating trade-offs across prediction approaches and validating the potential of using foundation model embeddings to train small specialist models. These findings underscore the promise of pathology foundation models for real-world clinical applications, with PathFMTools enabling efficient analysis and validation.

[30] What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities cs.CV

Muchang Bahng, Charlie Berens, Jon Donnelly, Eric Chen, Chaofan Chen

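TL;DR: The paper proposes multimodal prototype networks that combine several data modalities (such as images and genetic data) to address the black-box nature and high cost of species detection while keeping the model interpretable and efficient.

Details

Motivation: Species detection is essential for monitoring ecosystem health and identifying invasive species, but conventional multimodal neural networks are hard to interpret and rely on data that is costly to collect. The authors address these problems by extending prototype networks (ProtoPNets).

Result: The method allocates expensive genetic data to fine-grained distinctions while relying on image data elsewhere, reaching accuracy comparable to models that always use both modalities.

Insight: 1. Multimodal prototype networks reduce data cost substantially while remaining interpretable; 2. A dynamic modality selection strategy gives flexibility in when to use costly data.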

Abstract: Species detection is important for monitoring the health of ecosystems and identifying invasive species, serving a crucial role in guiding conservation efforts. Multimodal neural networks have seen increasing use for identifying species to help automate this task, but they have two major drawbacks. First, their black-box nature prevents interpretation of their decision-making process. Second, collecting genetic data is often expensive and requires invasive procedures, often requiring researchers to capture or kill the target specimen. We address both of these problems by extending prototype networks (ProtoPNets), which are a popular and interpretable alternative to traditional neural networks, to the multimodal, cost-aware setting. We ensemble prototypes from each modality, using an associated weight to determine how much a given prediction relies on each modality. We further introduce methods to identify cases for which we do not need the expensive genetic information to make confident predictions. We demonstrate that our approach can intelligently allocate expensive genetic data for fine-grained distinctions while using abundant image data for clearer visual classifications and achieving comparable accuracy to models that consistently use both modalities.
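
The abstract does not say how the model decides that the genetic modality is unnecessary. One simple way to picture cost-aware abstention is a confidence threshold on the image-only prediction, sketched below. The threshold rule, the weighted-average fusion, and every name in the code are assumptions for illustration rather than the paper's method.

```python
def predict_with_abstention(image_probs, fetch_genetic_probs, threshold=0.9, image_weight=0.5):
    """Return (predicted_class, used_genetic_data).

    image_probs: class probabilities from the cheap image modality.
    fetch_genetic_probs: zero-argument callable that acquires the expensive
        genetic prediction; it is only called when the image is not confident.
    threshold, image_weight: hypothetical tuning parameters.
    """
    top = max(image_probs)
    if top >= threshold:
        # Confident from the image alone: skip the expensive modality.
        return image_probs.index(top), False
    genetic_probs = fetch_genetic_probs()
    combined = [
        image_weight * p + (1.0 - image_weight) * g
        for p, g in zip(image_probs, genetic_probs)
    ]
    return combined.index(max(combined)), True
```

Passing the genetic prediction as a callable keeps the cost explicit: the expensive acquisition happens only on the branch where the image-based confidence falls below the threshold.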

[31] Vision–Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation cs.CV

Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan

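TL;DR: VESSA combines vision-language models (VLMs) with semi-supervised learning (SSL) in a two-stage approach that improves medical image segmentation accuracy, especially when annotated data is extremely scarce.

Details

Motivation: Medical image segmentation usually requires extensive expert annotation, and semi-supervised learning can reduce this dependence. Meanwhile, vision-language models have shown strong generalization and few-shot ability across domains. The paper's core motivation is to integrate VLMs into semi-supervised medical image segmentation.

Result: Experiments across multiple datasets and domains show that VESSA significantly improves segmentation accuracy and outperforms existing methods, particularly under extremely limited annotation.

Insight: Combining the semantic understanding of vision-language models with the dynamic feedback of semi-supervised learning is an effective way to improve medical image segmentation when labels are scarce.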

Abstract: Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.

[32] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception cs.CV | cs.AI | cs.CL | cs.LG

Miguel Carvalho, Helder Dias, Bruno Martins

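TL;DR: CropVLM is trained with reinforcement learning to focus dynamically on image regions, improving vision-language model (VLM) performance on fine-grained tasks such as scene-text recognition without human annotations or expensive synthetic evaluations.

Details

Motivation: VLMs perform poorly on fine-grained image understanding tasks, mainly because of perception limitations and visual fragmentation, and need a low-cost way to improve.

Result: Performance improves significantly on tasks requiring high-resolution image understanding, especially on benchmarks that are out-of-domain for the target VLM.

Insight: An external module that dynamically focuses on image regions can compensate for VLM weaknesses on fine-grained tasks while avoiding catastrophic forgetting.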

Abstract: Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically “zoom in” on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

[33] MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization cs.CV | cs.AI | cs.CL | cs.LG | cs.RO

Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, Zsolt Kira

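TL;DR: The paper proposes MAPS (Module-Wise Proximity Scheduling), a robust fine-tuning framework for Vision-Language-Action (VLA) models that schedules proximity constraints per module to balance stability and flexibility, significantly improving performance.

Details

Motivation: Existing approaches to fine-tuning VLA models, such as freezing modules or applying uniform regularization, either overconstrain adaptation or ignore the differing roles of modules, which hurts generalization.

Result: Across models such as MiniVLA-VQ and OpenVLA-OFT, benchmarks such as SimplerEnv, CALVIN, and LIBERO, and a real-world Franka Emika Panda robot platform, MAPS improves both in-distribution and out-of-distribution performance (up to +30%).

Insight: Empirically guided scheduling of module-wise proximity constraints is a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.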

Abstract: Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes – freezing modules or applying uniform regularization – either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
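
The abstract gives no formula for the schedule. The sketch below shows one possible reading, in which the strength of an L2 proximity penalty toward the pretrained weights decreases linearly along an ordered list of modules, so that early modules (such as the visual encoder) stay close to their priors and later ones adapt more freely. The module names, the endpoints of the schedule, and the L2 form of the penalty are all assumptions.

```python
def linear_proximity_weights(module_order, strongest=1.0, weakest=0.0):
    """Assign each module a constraint strength that falls linearly along the order."""
    n = len(module_order)
    if n == 1:
        return {module_order[0]: strongest}
    step = (strongest - weakest) / (n - 1)
    return {name: strongest - i * step for i, name in enumerate(module_order)}


def proximity_penalty(params, pretrained, weights):
    """Weighted squared distance between current and pretrained parameters."""
    total = 0.0
    for name, values in params.items():
        squared_dist = sum((p - q) ** 2 for p, q in zip(values, pretrained[name]))
        total += weights[name] * squared_dist
    return total
```

Under this reading the penalty would be added to the task loss during fine-tuning. It introduces no new parameters, which is consistent with the abstract's statement that MAPS adds no parameters or data.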

[34] Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation cs.CV

Ali Torabi, Sanjog Gaihre, Yaqoob Majeed

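TL;DR: CrispFormer modifies the SegFormer decoder with a boundary branch, an uncertainty-guided refiner, and a dynamic multi-scale fusion layer, markedly improving weakly supervised semantic segmentation while keeping compute overhead low.

Details

Motivation: Weakly supervised semantic segmentation (WSSS) must cope with noisy, incomplete labels. The paper aims to make weak supervision more effective through decoder design, without changing the MiT backbone or relying on heavy post-processing.

Result: CrispFormer outperforms SegFormer baselines on boundary F-score, small-object recall, and mIoU, with minimal added compute.

Insight: Small decoder changes can substantially improve WSSS without increasing backbone complexity; a lightweight design is key.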

Abstract: Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective, without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
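
The third decoder change, softmax gating over multi-resolution features, can be pictured with a single-channel toy example. The sketch assumes the feature maps have already been resampled to a common resolution and flattened, and that the gate is a softmax across scales at each location. This is one reading of the abstract, not the authors' code, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_logits):
    """Fuse per-scale features with location-wise softmax weights.

    features[s][p]: feature value of scale s at pixel p (single channel).
    gate_logits[s][p]: unnormalized gate score of scale s at pixel p.
    """
    num_scales = len(features)
    num_pixels = len(features[0])
    fused = []
    for p in range(num_pixels):
        weights = softmax([gate_logits[s][p] for s in range(num_scales)])
        fused.append(sum(w * features[s][p] for s, w in enumerate(weights)))
    return fused
```

Static concatenation treats every scale the same at every location, whereas a gate of this kind lets each pixel lean on the scale that suits it, for example a fine scale near thin boundaries and a coarse scale inside large regions.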

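[35] CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding cs.CV | cs.CL

Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang

TL;DR: The paper introduces CounterVQA, a benchmark for evaluating the counterfactual reasoning of vision-language models in video understanding, and develops CFGPT, a post-training method that improves it.

Details

Motivation: The counterfactual reasoning ability of existing multimodal models in video understanding remains underexplored, even though it is essential for understanding the causal relationships in a video.

Result: Experiments show that CFGPT yields consistent improvements across all CounterVQA difficulty levels.

Insight: Counterfactual reasoning ability in the language modality can be transferred effectively to the visual modality, strengthening a model's multimodal reasoning.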

Abstract: Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model’s visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. Dataset and code will be further released.

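[36] Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering cs.CV | cs.AI | cs.RO

Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal

TL;DR: The paper proposes Prune-Then-Plan, a framework that stabilizes exploration in Embodied Question Answering (EQA) through step-level calibration, addressing the instability of large vision-language models (VLMs) in step-level exploration.

Details

Motivation: Large VLMs provide strong semantic priors for open-vocabulary reasoning, but in step-level exploration they often exhibit frontier oscillations, which lead to inefficient navigation and degraded answer quality. A method for calibrating step-level behavior is therefore needed.

Result: On the OpenEQA and EXPRESS-Bench datasets, the method achieves better scene coverage under equal exploration budgets, with relative improvements over baselines of up to 49% in SPL and 33% in LLM-Match.

Insight: Separating pruning from planning avoids the effects of VLM overconfidence and yields more stable exploration. The approach may extend to other tasks that require step-level decisions.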

Abstract: Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.
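
For readers unfamiliar with the statistical procedure the pruning step is modeled on, here is a minimal sketch of the standard Holm-Bonferroni step-down test. The abstract only says the pruning is "inspired" by it, so how frontier scores are turned into p-values, and what counts as a rejected hypothesis during pruning, are not specified here; the example p-values are made up.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Standard Holm-Bonferroni step-down procedure.

    Returns a list of booleans, True where the corresponding hypothesis is rejected.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens at every step: alpha/m, alpha/(m-1), ..., alpha.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # the first failure stops the procedure; the rest are retained
    return rejected


# Made-up p-values for four candidates.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
print(decisions)
```

The procedure controls the family-wise error rate while being less conservative than a plain Bonferroni correction, which is what makes it a reasonable template for conservative pruning decisions.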

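[37] One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer cs.CV

Haoyu Wu, Jingyi Xu, Qiaomu Miao, Dimitris Samaras, Hieu Le

TL;DR: The paper proposes Cross-Resolution Phase-Aligned Attention (CRPA), which fixes the attention collapse that linearly interpolated RoPE causes in mixed-resolution denoising with Diffusion Transformers, improving generation quality.

Details

Motivation: Conventional linear interpolation makes RoPE phases inconsistent in mixed-resolution denoising, which causes the attention mechanism to collapse and degrades generation quality.

Result: CRPA enables high-fidelity, efficient mixed-resolution generation and outperforms previous state-of-the-art methods.

Insight: Phase consistency is critical to the attention mechanism of Diffusion Transformers; simple linear interpolation breaks it.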

Abstract: We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle: many heads exhibit extremely sharp, periodic phase selectivity, so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To this end, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query's stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.
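
The index-map idea can be illustrated in one dimension. This is a simplified sketch under stated assumptions: a token's physical position is taken to be its grid index times its grid stride, a single RoPE frequency `theta` stands in for the full frequency bank, and the "native index" contrast is a deliberately naive strawman rather than the linear-interpolation baseline the paper analyzes. All function names are hypothetical.

```python
def index_on_query_stride(grid_index, token_stride, query_stride):
    """Express a token's position in units of the query's stride."""
    physical_position = grid_index * token_stride
    return physical_position / query_stride


def rope_phase_gap(query_pos, key_pos, theta=0.1):
    """Relative RoPE phase between a query and a key for one frequency."""
    return theta * (query_pos - key_pos)


# The query lives on the fine grid (stride 1) at index 8.
query_stride, query_index = 1, 8
q_pos = index_on_query_stride(query_index, query_stride, query_stride)

# Two keys at the same physical position 4: one on the fine grid, one on a coarse grid.
fine_key = index_on_query_stride(4, 1, query_stride)
coarse_key = index_on_query_stride(2, 2, query_stride)
aligned_gaps = (rope_phase_gap(q_pos, fine_key), rope_phase_gap(q_pos, coarse_key))

# Strawman: index each key on its own grid, so equal distances give different phases.
naive_gaps = (rope_phase_gap(8.0, 4.0), rope_phase_gap(8.0, 2.0))
print(aligned_gaps, naive_gaps)
```

With positions expressed on the query's stride, both keys sit at the same physical distance and receive the same phase gap; with native indices the coarse key appears farther away than it is.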

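[38] Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes cs.CV

Jihan Yao, Achin Kulshrestha, Nathalie Rauschmayr, Reed Roberts, Banghua Zhu

TL;DR: To address the reliability problems that OCR errors cause when vision-language models (VLMs) are used in safety-critical applications, the paper proposes Latent Representation Probing (LRP), which detects uncertainty from internal representations and improves the model's ability to abstain on Scene Text Visual Question Answering (STVQA) tasks.

Details

Motivation: OCR errors by VLMs in safety-critical applications such as traffic sign recognition can have severe consequences, so the model needs a reliable way to abstain when it is uncertain.

Result: On four benchmarks, LRP improves abstention accuracy by 7.6% over the best baselines, and its probes generalize across uncertainty sources and datasets.

Insight: Uncertainty signals emerge more strongly from the intermediate layers of VLMs than from the final layers, which points to a new direction for building more reliable AI systems.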

Abstract: As VLMs are deployed in safety-critical applications, their ability to abstain from answering when uncertain becomes crucial for reliability, especially in Scene Text Visual Question Answering (STVQA) tasks. For example, OCR errors like misreading “50 mph” as “60 mph” could cause severe traffic accidents. This leads us to ask: Can VLMs know when they can’t see? Existing abstention methods suggest pessimistic answers: they either rely on miscalibrated output probabilities or require semantic agreement unsuitable for OCR tasks. However, this failure may indicate we are looking in the wrong place: uncertainty signals could be hidden in VLMs’ internal representations. Building on this insight, we propose Latent Representation Probing (LRP): training lightweight probes on hidden states or attention patterns. We explore three probe designs: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single layer probes by majority vote. Experiments on four benchmarks across image and video modalities show LRP improves abstention accuracy by 7.6% over best baselines. Our analysis reveals: probes generalize across various uncertainty sources and datasets, and optimal signals emerge from intermediate rather than final layers. This establishes a principled framework for building deployment-ready AI systems by detecting confidence signals from internal states rather than unreliable outputs.
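
The third probe design, ensembling single-layer probes by majority vote, can be sketched as follows. This assumes each probe is a linear classifier over one layer's hidden state whose positive output means "the answer is likely wrong, abstain"; the probe form, the hand-set weights, and the toy hidden states are illustrative assumptions, since the abstract does not specify them.

```python
def probe_votes_abstain(hidden_state, weights, bias):
    """One linear probe on one layer: True means this layer votes to abstain."""
    score = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return score > 0.0


def majority_vote_abstain(hidden_by_layer, probes):
    """Abstain when more than half of the per-layer probes vote to abstain.

    hidden_by_layer: one hidden-state vector per probed layer.
    probes: one (weights, bias) pair per probed layer.
    """
    votes = [
        probe_votes_abstain(hidden, weights, bias)
        for hidden, (weights, bias) in zip(hidden_by_layer, probes)
    ]
    return sum(votes) > len(votes) / 2


# Three probed layers with toy 2-dimensional hidden states and hand-set probes.
hidden_by_layer = [[0.5, -1.0], [2.0, 0.1], [-0.3, 0.4]]
probes = [([1.0, 1.0], 0.0), ([1.0, 0.0], -1.0), ([0.0, 1.0], 0.0)]
abstain = majority_vote_abstain(hidden_by_layer, probes)
print(abstain)
```

Here the first layer votes to answer and the other two vote to abstain, so the ensemble abstains. In practice the probes would be trained on labeled correct and incorrect answers rather than set by hand.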

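[39] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding cs.CV

Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, Jong Chul Ye

TL;DR: ReDirector is a camera-controlled video retake generation method that corrects the spatiotemporal alignment of RoPE and introduces Rotary Camera Encoding (RoCE), enabling retakes of any length for dynamically captured videos.

Details

Motivation: Existing methods misalign spatiotemporal positions when using RoPE, which limits the flexibility and quality of video retakes.

Result: Experiments show significant improvements in camera controllability, geometric consistency, and video quality.

Insight: Introducing camera conditions and aligning spatiotemporal positions are key to improving video retake generation.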

Abstract: We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.

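[40] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation cs.CV

Haoqing Li, Jun Shi, Xianmeng Chen, Qiwei Jia, Rui Wang

TL;DR: The paper proposes BHD-RAG, a framework that combines multimodal retrieval-augmented generation with large language models to improve diagnostic accuracy for Birt-Hogg-Dube syndrome, addressing the scarce samples and low inter-class differentiation typical of rare-disease diagnosis.

Details

Motivation: Diagnosing Diffuse Cystic Lung Diseases (DCLDs) is hampered by scarce samples and small inter-class differences, and Multimodal Large Language Models (MLLMs) are prone to hallucination because they lack domain knowledge and referable radiological features. BHD-RAG is proposed in response.

Result: BHD-RAG is validated on a dataset covering four types of DCLDs, achieving superior accuracy and generating descriptions closely aligned with expert insights.

Insight: Combining domain knowledge with retrieval augmentation can mitigate MLLM hallucination in rare-disease diagnosis while improving interpretability and accuracy.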

Abstract: Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BHD) diagnosis via Computed Tomography (CT) imaging. While Multimodal Large Language Models (MLLMs) demonstrate diagnostic potential for such rare diseases, the absence of domain-specific knowledge and referable radiological features intensifies hallucination risks. To address this problem, we propose BHD-RAG, a multimodal retrieval-augmented generation framework that integrates DCLD-specific expertise and clinical precedents with MLLMs to improve BHD diagnostic accuracy. BHD-RAG employs: (1) a specialized agent generating imaging manifestation descriptions of CT images to construct a multimodal corpus of DCLD cases; (2) a cosine similarity-based retriever pinpointing relevant image-description pairs for query images; and (3) an MLLM synthesizing retrieved evidence with imaging data for diagnosis. BHD-RAG is validated on a dataset involving four types of DCLDs, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.
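
Component (2), the cosine similarity-based retriever, can be sketched in a few lines. This is a generic top-k retrieval example, not the authors' implementation: the 2-dimensional embeddings, the case descriptions, and the function names are placeholders, and a real system would obtain embeddings from an image encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, corpus, k=2):
    """Return the descriptions of the k corpus entries most similar to the query.

    corpus: list of (embedding, description) pairs.
    """
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    return [description for _, description in ranked[:k]]


# Placeholder corpus of (image embedding, imaging description) pairs.
corpus = [
    ([1.0, 0.1], "case A description"),
    ([0.0, 1.0], "case B description"),
    ([0.9, 0.5], "case C description"),
]
retrieved = retrieve_top_k([1.0, 0.0], corpus, k=2)
print(retrieved)
```

The retrieved descriptions would then be passed to the MLLM together with the query image as supporting evidence.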

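[41] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation cs.CV | cs.AI

Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu

TL;DR: The paper proposes Rectified SpaAttn, which corrects systematic biases in attention sparsity to improve the efficiency and quality of video generation; the implementation is released as open source.

Details

Motivation: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation causes high latency. Existing attention sparsity methods reduce computational cost but suffer severe performance degradation.

Result: The method achieves speedups of up to 3.33x on HunyuanVideo and 2.08x on Wan 2.1 while maintaining high generation quality.

Insight: Correcting the systematic biases of sparse attention is key; an implicit full-attention reference effectively improves sparse attention.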

Abstract: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at https://github.com/BienLuky/Rectified-SpaAttn .
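
The two biases can be reproduced numerically with a plain top-k style sparse softmax. The sketch below is not the paper's rectification method; it only shows that when softmax is renormalized over the kept (critical) tokens, each kept weight is inflated by the factor 1 / kept_mass, so its bias equals the sparse weight times (1 - kept_mass), while dropped tokens lose all of their weight. The scores and the kept set are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax(scores, keep):
    """Softmax restricted to the kept indices; dropped tokens get weight 0."""
    kept_weights = softmax([scores[i] for i in keep])
    out = [0.0] * len(scores)
    for i, w in zip(keep, kept_weights):
        out[i] = w
    return out


scores = [2.0, 1.0, 0.0, -1.0]   # made-up attention logits for one query
keep = [0, 1]                    # indices treated as critical tokens

full = softmax(scores)
sparse = sparse_softmax(scores, keep)
kept_mass = sum(full[i] for i in keep)          # full-attention mass on kept tokens
bias = [s - f for s, f in zip(sparse, full)]    # positive on kept, negative on dropped
print(full, sparse, kept_mass, bias)
```

For kept tokens `bias[i]` equals `sparse[i] * (1 - kept_mass)`, which matches the abstract's statement that the bias is proportional to the sparse attention weights.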

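[42] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward cs.CV | cs.CL

Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin

TL;DR: The paper asks whether understanding informs generation in unified multimodal models, introduces the UniSandbox evaluation framework, reveals a gap between understanding and generation, and shows that Chain-of-Thought (CoT) and self-training improve generation.

Details

Motivation: The paper studies the relationship between understanding and generation in unified multimodal models, aiming to fill a gap in existing research and to explore how the two are connected and how they can be improved.

Result: A significant gap exists between understanding and generation; explicit CoT in the understanding module effectively improves reasoning generation tasks; self-training can internalize this reasoning ability; and CoT helps retrieve newly learned knowledge in knowledge transfer tasks.

Insight: Reasoning generation depends on explicit reasoning; in knowledge transfer tasks, CoT helps retrieve new knowledge; future unified architectures should integrate understanding and generation more deeply.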

Abstract: Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox

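[43] 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models cs.CV

Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu

TL;DR: 4DWorldBench is a comprehensive framework for evaluating 3D/4D world generation models across four dimensions (Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency), with unified evaluation of multimodal inputs.

Details

Motivation: Existing benchmarks do not assess world models comprehensively and fall short particularly on cross-modal consistency and physical realism, so a unified standard framework is needed.

Result: Preliminary human studies show that adaptive tool selection agrees more closely with subjective human judgments, providing a reliable basis for model improvement.

Insight: Moving from visual generation to world generation requires multi-dimensional evaluation criteria; unified mapping across modalities and adaptive evaluation are key directions.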

Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from “visual generation” to “world generation.” Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

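[44] Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum cs.CV

Thomas M Metz, Matthew Q Hill, Alice J O’Toole

TL;DR: The paper proposes IMIC, a training method that jointly trains four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a unified embedding space without substantial catastrophic forgetting, and validates it on several foundation models.

Details

Motivation: Vision foundation models can perform generalized object classification in zero-shot mode, but fine-tuning them causes catastrophic forgetting. The authors aim to solve this through joint multi-task training while preserving the models' generalization.

Result: With IMIC, EVA-02 and CLIP perform comparably with domain experts and are more accurate than humans at multi-tasking. IMIC also largely preserves generalization to out-of-distribution data.

Insight: IMIC lets tasks share features while keeping them linearly separable. Fewer than 100 principal components computed from any one task suffice to perform the other tasks, indicating an efficient and versatile representation.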

Abstract: Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space – without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 PCs calculated from any one task could perform all other tasks with nearly zero performance degradation.

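[45] STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction cs.CV

Jiankuo Zhao, Xiangyu Zhu, Zidu Wang, Zhen Lei

TL;DR: STAvatar is a monocular 3D head reconstruction method based on 3D Gaussian Splatting that uses soft binding and temporal density control to overcome the limitations of existing methods in handling motion and occluded regions.

Details

Motivation: Existing methods rely on binding to mesh triangles and Linear Blend Skinning, which results in rigid motion and limited expressiveness, and they struggle with frequently occluded regions.

Result: The method achieves state-of-the-art performance on four benchmark datasets, especially in capturing fine details and reconstructing occluded regions.

Insight: Combining dynamic resampling in UV space with temporal density control can substantially improve the quality and expressiveness of 3D head reconstruction.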

Abstract: Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they lack specialized strategies to handle frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image-based and geometric priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to shape and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces a novel fused perceptual error as clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. The code will be publicly available.

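[46] Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks cs.CV

Xiangkai Ma, Han Zhang, Wenzhong Li, Sanglu Lu

TL;DR: The paper proposes TimeArtist, a temporal-visual conversion framework that is the first to achieve semantic-level alignment between time series fluctuations and visual concepts, with extensive experiments showing strong performance in image generation and zero-shot temporal tasks.

Details

Motivation: Existing multimodal models have made notable progress in aligning and generating content across text and image modalities, but using non-visual, continuous sequential signals such as time series for high-fidelity image generation remains largely unexplored. In addition, existing methods that convert series into "pseudo-images" for temporal forecasting fail to achieve semantic-level alignment.

Result: TimeArtist performs well on image generation metrics and outperforms baselines on zero-shot temporal tasks.

Insight: TimeArtist establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics and offering a new way to relate temporal data to visual content.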

Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential signals, such as time series, as conditioning for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into “pseudo-images” for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It pioneers a “warmup-align” paradigm: first, a dual-autoencoder and shared quantizer are trained with self-supervision on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images as style transfer. Extensive experiments show that TimeArtist achieves satisfactory performance in image generation metrics, while also attaining superior results in zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.

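[47] GigaWorld-0: World Models as Data Engine to Empower Embodied AI cs.CV | cs.RO

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang

TL;DR: The paper presents GigaWorld-0, a unified world model framework that uses large-scale video generation and 3D modeling to provide high-quality synthetic interaction data for Vision-Language-Action (VLA) learning.

Details

Motivation: Existing world models fall short in generation diversity and physical consistency. GigaWorld-0 aims to supply high-quality training data for AI systems by jointly optimizing video generation and 3D modeling.

Result: Experiments show that GigaWorld-0 generates high-quality, diverse data, and that VLA models trained on it (e.g., GigaBrain-0) perform strongly on real robot tasks, with significantly improved generalization and task success.

Insight: A unified world model framework can address the diversity and physical consistency problems of data generation, providing an efficient data engine for large-scale training of AI systems.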

Abstract: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

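[48] ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images cs.CV

Lei Ding, Tong Liu, Xuanguang Liu, Xiangyun Liu, Haitao Guo

TL;DR: ChessMamba is a structure-aware change detection framework for remote sensing images that uses state-space modeling to fuse multi-temporal features efficiently and localize changes accurately.

Details

Motivation: Change detection in multitemporal remote sensing imagery is challenged by heterogeneity and spatiotemporal misalignment. Existing methods (such as vision transformers or state-space models) typically disrupt local structural consistency, which obscures discriminative cues and makes change localization unreliable.

Result: ChessMamba substantially outperforms existing methods on binary change detection, semantic change detection, and multimodal building damage assessment.

Insight: Structure-aware serialization and fusion can mitigate spatiotemporal misalignment and improve change detection accuracy.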

Abstract: Change detection (CD) in multitemporal remote sensing imagery presents significant challenges for fine-grained recognition, owing to heterogeneity and spatiotemporal misalignment. However, existing methodologies based on vision transformers or state-space models typically disrupt local structural consistency during temporal serialization, obscuring discriminative cues under misalignment and hindering reliable change localization. To address this, we introduce ChessMamba, a structure-aware framework leveraging interleaved state-space modeling for robust CD with multi-temporal inputs. ChessMamba integrates a SpatialMamba encoder with a lightweight cross-source interaction module, featuring two key innovations: (i) Chessboard interleaving with snake scanning order, which serializes multi-temporal features into a unified sequence within a single forward pass, thereby shortening interaction paths and enabling direct comparison for accurate change localization; and (ii) Structure-aware fusion via multi-dilated convolutions, selectively capturing center-and-corner neighborhood contexts within each mono-temporal. Comprehensive evaluations on three CD tasks, including binary CD, semantic CD and multimodal building damage assessment, demonstrate that ChessMamba effectively fuses heterogeneous features and achieves substantial accuracy improvements over state-of-the-art methods. The relevant code will be available at: github.com/DingLei14/ChessMamba.

Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, Renyu Yang

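TL;DR: The paper proposes a frequency-decoupled cross-modal knowledge distillation method that separates low- and high-frequency features and aligns them separately, addressing the representation inconsistency that hinders cross-modal knowledge transfer.

Details

Motivation: Conventional knowledge distillation is less effective in cross-modal tasks such as vision-to-language distillation, mainly because inconsistent representations across modalities make knowledge transfer difficult.

Result: Experiments on multiple benchmark datasets show that the method substantially outperforms traditional knowledge distillation and other cross-modal distillation methods.

Insight: Low-frequency features are more consistent across modalities, whereas high-frequency features may carry modality-specific noise or detail, so the two should be treated differently.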

Abstract: Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities lead to difficult knowledge transfer. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observed that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.
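
The decoupling idea can be sketched on a 1-D feature vector: split student and teacher features into low- and high-frequency parts with a discrete Fourier transform, then penalize mismatch in the two bands with different strengths. This is a minimal sketch under stated assumptions. The cutoff, the band weights, and the use of mean squared error in both bands are illustrative stand-ins, since the abstract does not give the actual losses, and the scale consistency loss and shared classifier are omitted.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum):
    """Inverse DFT, keeping the real part (valid for conjugate-symmetric spectra)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_frequencies(x, cutoff):
    """Split x into low- and high-frequency parts that sum back to x."""
    n = len(x)
    spectrum = dft(x)
    # min(k, n - k) is the frequency of bin k, so the mask is symmetric.
    is_low = [min(k, n - k) <= cutoff for k in range(n)]
    low = idft_real([s if keep else 0j for s, keep in zip(spectrum, is_low)])
    high = idft_real([0j if keep else s for s, keep in zip(spectrum, is_low)])
    return low, high


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def decoupled_loss(student, teacher, cutoff=1, w_low=1.0, w_high=0.1):
    """Strong alignment on the low band, relaxed alignment on the high band."""
    s_low, s_high = split_frequencies(student, cutoff)
    t_low, t_high = split_frequencies(teacher, cutoff)
    return w_low * mse(s_low, t_low) + w_high * mse(s_high, t_high)


teacher = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
student = [0.8, 2.1, 0.3, -1.2, 2.5, 0.4, -1.5, 1.0]
low, high = split_frequencies(teacher, cutoff=1)
loss = decoupled_loss(student, teacher)
print(loss)
```

Setting `w_high` well below `w_low` is one simple way to express relaxed alignment; a real implementation would operate on batched feature maps with a fast FFT rather than this naive transform.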

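[50] VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering cs.CV

Yuyi Li, Daoyuan Chen, Zhen Wang, Yutong Lu, Yaliang Li

TL;DR: The paper presents VeriSciQA, a scientific visual question answering dataset whose QA pairs are automatically generated and verified by a Generate-then-Verify framework, filling a gap in open datasets.

Details

Motivation: Existing large vision-language models perform poorly on Scientific Visual Question Answering (SVQA), mainly because high-quality, large-scale public datasets are lacking. Existing synthetic datasets contain systematic errors that hurt model performance.

Result: VeriSciQA is challenging for open-source models: the leading open-source models reach 64% accuracy, behind a proprietary model at 82%. Models fine-tuned on it improve on SVQA benchmarks, with gains that scale with data size. Human evaluation validates the correctness of the dataset.

Insight: Generating high-quality data with a scalable Generate-then-Verify framework can substantially improve open-source models on SVQA and advance the open-source community.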

Abstract: Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck lies in the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs’ inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against figures along with auxiliary filters to eliminate erroneous pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types. VeriSciQA poses a challenging benchmark for open-source models, with a substantial accuracy gap between the leading open-source models (64%) and a proprietary model (82%). Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size and surpass models trained on existing datasets. Human evaluation further validates the superior correctness of VeriSciQA. Together, this evidence demonstrates that continued data expansion by our scalable framework can further advance SVQA capability in the open-source community.

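[51] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning cs.CV | cs.AI

Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji

TL;DR: Agent0-VL is a self-evolving vision-language agent that improves continually through tool-integrated reasoning, without external supervision.

Details

Motivation: Existing vision-language agents depend on human-annotated supervision, and purely text-based self-evaluation cannot verify complex visual reasoning steps. Agent0-VL addresses both problems by using tool integration for self-evolution and self-evaluation.

Result: On geometric problem solving and visual scientific analysis, performance improves by 12.5% over the base model.

Insight: Tool integration not only strengthens reasoning but also supports self-evaluation and self-repair, offering a new approach to learning vision-language tasks without external supervision.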

Abstract: Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.

[52] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition cs.CVPDF

Mingyu Zhao, Zhanfu Yang, Yang Zhou, Zhaoyang Xia, Can Jin

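TL;DR: This paper proposes a multimodal approach to continuous sign language recognition that combines 3D skeletal features with a handshape classifier to make boundary detection more robust. On the ASLLRP corpus it significantly outperforms previous work.

Details

Motivation: Boundary detection is a key challenge in continuous sign language recognition, yet existing methods often ignore handshape information. Improving robustness requires combining multimodal cues such as skeletal dynamics and handshape.

Result: The method achieves significant improvements on the ASLLRP corpus, supporting the effectiveness of the multimodal approach.

Insight: Combining handshape with skeletal dynamics can substantially improve sign boundary detection accuracy. The approach may carry over to other multimodal tasks that need temporal boundary detection.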

Abstract: This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.

[53] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance cs.CVPDF

Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella

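TL;DR: Motion Marionette is a zero-shot framework for rigid motion transfer that guides the transfer with an internal prior rather than external priors, resolving the trade-off between generalizability and temporal consistency.

Details

Motivation: Existing methods rely on external priors (geometric, generative, or simulation-based) to guide motion transfer, but these introduce auxiliary constraints that force a trade-off between generalizability and temporal consistency.

Result: Experiments show that the method generalizes across objects, produces temporally consistent videos, and supports controllable video generation.

Insight: Capturing spatial-temporal transformations with an internal prior avoids the limitations of external priors and improves both the flexibility and the quality of motion transfer.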

Abstract: We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.

[54] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving cs.CV | cs.ROPDF

Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen

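TL;DR: Reasoning-VLA is a fast, general vision-language-action reasoning model for autonomous driving. Learnable action queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel; combined with multi-dataset consolidation and reinforcement learning fine-tuning, the model achieves strong performance, strong generalization, and fast inference.

Details

Motivation: Existing vision-language-action (VLA) models for autonomous driving suffer from inefficient inference and weak generalization, particularly to novel vehicle configurations and driving scenarios.

Result: The model achieves state-of-the-art performance, strong generalization, and fast inference across multiple benchmarks.

Insight: Pairing learnable queries with reasoning-enhanced features is an effective way to generate actions, and consolidating multiple datasets together with reinforcement learning fine-tuning markedly improves generalization.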

Abstract: Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. The model is trained with both supervised learning and reinforcement learning fine-tuning. Extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and excellent inference speed.
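
The query initialization mentioned in the abstract ("Gaussian sampling from ground-truth trajectories") can be illustrated with a minimal sketch. It assumes trajectories are stored as an array of shape (N, T, 2) and fits an independent Gaussian per waypoint; the function name `init_action_queries`, the shapes, and the per-waypoint statistics are assumptions for illustration, not the paper's confirmed procedure.

```python
import numpy as np

def init_action_queries(trajectories, num_queries, seed=0):
    """Sample initial action queries from a Gaussian fitted to training trajectories.

    trajectories: array of shape (N, T, 2) holding N ground-truth paths of
    T (x, y) waypoints. The shapes and the per-waypoint Gaussian are assumptions.
    """
    rng = np.random.default_rng(seed)
    mean = trajectories.mean(axis=0)         # (T, 2) mean waypoint positions
    std = trajectories.std(axis=0) + 1e-6    # (T, 2), small floor avoids zero spread
    noise = rng.standard_normal((num_queries,) + mean.shape)
    return mean + std * noise                # (num_queries, T, 2)

# Toy data: 100 roughly straight paths of 8 waypoints each.
toy_rng = np.random.default_rng(1)
steps = toy_rng.normal(loc=[1.0, 0.0], scale=0.1, size=(100, 8, 2))
toy_trajectories = np.cumsum(steps, axis=1)

queries = init_action_queries(toy_trajectories, num_queries=16)
print(queries.shape)
```

In a full model these samples would only seed the learnable queries, which are then updated by training.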

[55] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects cs.CVPDF

Maryam Eftekharifar, Churun Zhang, Jialiang Wei, Xudong Cao, Hossein Heidari

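TL;DR: This paper proposes C-PGA, a framework for predicting photochemical conversion in complex 3D-printed objects. By modeling the coupling between visual features and non-visual physical parameters, it predicts dense volumetric physical properties from 3D visual data.

Details

Motivation: Conventional vision models struggle with the non-linear coupling between optical physics and material physics, so an architecture is needed that can dynamically adapt its visual perception to the physical context.

Result: The method enables precise control over the chemical conversion state and removes the need for traditional post-print measurements.

Insight: Explicitly modeling physical coupling can yield higher prediction accuracy in complex 3D vision tasks.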

Abstract: We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting dense, non-visual volumetric physical properties from 3D visual data. This approach leverages the largest-ever optically printed 3D specimen dataset, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation. Conventional vision models are ill-equipped for this task, as they lack an inductive bias for the coupled, non-linear interactions of optical physics (diffraction, absorption) and material physics (diffusion, convection) that govern the final chemical state. To address this, we propose Coupled Physics-Gated Adaptation (C-PGA), a novel multimodal fusion architecture. Unlike standard concatenation, C-PGA explicitly models physical coupling by using sparse geometrical and process parameters (e.g., surface transport, print layer height) as a Query to dynamically gate and adapt the dense visual features via feature-wise linear modulation (FiLM). This mechanism spatially modulates dual 3D visual streams (extracted by parallel 3D-CNNs processing raw projection stacks and their diffusion-diffraction corrected counterparts), allowing the model to recalibrate its visual perception based on the physical context. This approach offers a breakthrough in virtual chemical characterisation, eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state.
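
Feature-wise linear modulation (FiLM), which the abstract names as the gating mechanism, scales and shifts each feature channel with coefficients predicted from a conditioning vector. The sketch below shows only that generic operation on a single 3D feature volume; the linear projections, the shapes, and the name `film_gate` are illustrative assumptions and do not reproduce the C-PGA architecture.

```python
import numpy as np

def film_gate(visual, params, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM to a 3D feature volume.

    visual: (C, D, H, W) visual features.
    params: (P,) vector of process/geometry parameters (the conditioning input).
    w_gamma, w_beta: (C, P) projections; b_gamma, b_beta: (C,) biases.
    All names and shapes are assumptions for illustration.
    """
    gamma = w_gamma @ params + b_gamma   # (C,) per-channel scale
    beta = w_beta @ params + b_beta      # (C,) per-channel shift
    return gamma[:, None, None, None] * visual + beta[:, None, None, None]

rng = np.random.default_rng(0)
C, P = 4, 3
visual = rng.standard_normal((C, 2, 5, 5))
params = np.array([0.05, 1.2, 0.3])      # made-up parameter values
w_gamma = rng.standard_normal((C, P))
w_beta = rng.standard_normal((C, P))

out = film_gate(visual, params, w_gamma, np.ones(C), w_beta, np.zeros(C))
print(out.shape)
```

With zero projection weights, unit scale bias, and zero shift bias the operation reduces to the identity, which is a convenient sanity check for an implementation.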

[56] HybriDLA: Hybrid Generation for Document Layout Analysis cs.CVPDF

Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng

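TL;DR: HybriDLA is a unified generative framework that combines diffusion with autoregressive decoding to handle the complex layouts and varying element counts of modern documents, reaching 83.5% mean Average Precision (mAP).

Details

Motivation: Traditional document layout analysis relies on fixed queries or empirical priors and struggles with the complex layouts and varying element counts of modern documents.

Result: On the DocLayNet and M$^6$Doc benchmarks, HybriDLA sets a new state of the art, reaching 83.5% mAP.

Insight: Combining the diffusion and autoregressive paradigms of generative modeling can substantially improve layout analysis for complex documents.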

Abstract: Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.

[57] Intelligent Image Search Algorithms Fusing Visual Large Models cs.CVPDF

Kehan Wang, Tingqiong Cui, Yang Zhang, Yu Chen, Shifeng Wu

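TL;DR: DetVLM is a two-stage intelligent image search framework that combines object detection with visual large models (VLMs), substantially improving fine-grained retrieval and zero-shot search.

Details

Motivation: Traditional fine-grained image retrieval methods have clear limitations: handcrafted features lack robustness; deep learning detectors cannot perform state-specific retrieval or zero-shot search; and VLMs, despite their semantic and zero-shot capabilities, have poor spatial grounding and high computational cost. This paper aims to close these gaps.

Result: On a vehicle component dataset, DetVLM reaches 94.82% overall retrieval accuracy, 94.95% zero-shot search accuracy, and over 90% average accuracy on state search tasks.

Insight: Combining the efficiency of object detection with the semantic capability of VLMs markedly improves fine-grained retrieval while also supporting zero-shot search, which is promising for practical applications.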

Abstract: Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., “sun visor lowered”), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM’s inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., “driver wearing a mask”) without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%, significantly outperforming detection-only baselines. It also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.
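
The two-stage pipeline described above (a fast detector screens every image, and the VLM re-checks only what the detector missed) can be sketched as plain control flow. The callables `detect` and `vlm_verify` are hypothetical stand-ins for a YOLO detector and a VLM query; the toy dictionaries below exist only to make the sketch runnable and are not part of the paper.

```python
def two_stage_search(images, component, detect, vlm_verify):
    """Return ids of images judged to contain `component`.

    detect(image, component) -> bool      # fast detector pass (hypothetical)
    vlm_verify(image, component) -> bool  # slower VLM check (hypothetical)
    """
    hits = []
    for image_id, image in images.items():
        if detect(image, component):
            # Stage 1: the detector already found the component.
            hits.append(image_id)
        elif vlm_verify(image, component):
            # Stage 2: the VLM recovers a component the detector missed.
            hits.append(image_id)
    return hits

# Toy stand-ins: "det" is what the detector sees, "true" is what is really there.
images = {
    "a": {"det": {"visor"}, "true": {"visor"}},
    "b": {"det": set(), "true": {"visor"}},
    "c": {"det": set(), "true": set()},
}

def detect(image, component):
    return component in image["det"]

def vlm_verify(image, component):
    return component in image["true"]

result = two_stage_search(images, "visor", detect, vlm_verify)
print(result)
```

Because the expensive check runs only on detector misses, its cost scales with the number of misses rather than with the size of the collection.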

[58] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos cs.CVPDF

Youngseo Kim, Dohyun Kim, Geonhee Han, Paul Hongsuck Seo

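TL;DR: This paper reinterprets the self-attention maps of image diffusion models as semantic label propagation kernels, enabling zero-shot object tracking in videos. Using test-time optimization strategies and SAM-guided mask refinement, the authors build the DRIFT framework, which achieves state-of-the-art zero-shot performance on standard video object segmentation benchmarks.

Details

Motivation: Image diffusion models capture rich semantic structure beyond what image generation requires, but their potential for temporal propagation in videos has been underexplored. The authors reinterpret the self-attention mechanism in order to apply it to video object tracking.

Result: DRIFT achieves state-of-the-art zero-shot performance on standard video object segmentation benchmarks.

Insight: The semantic capabilities of image diffusion models are not limited to generation; they can be repurposed for video understanding and tracking.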

Abstract: Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
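
Label propagation with a row-normalized affinity kernel, the operation the abstract builds on, can be sketched in a few lines. This sketch substitutes a softmax over cosine similarities between per-pixel features for the diffusion model's cross-frame self-attention; the feature shapes, the temperature value, and the name `propagate_labels` are assumptions.

```python
import numpy as np

def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.1):
    """Propagate per-pixel labels from a reference frame to a target frame.

    feat_ref: (N, D) reference-frame features; feat_tgt: (M, D) target features.
    labels_ref: (N, K) one-hot (or soft) labels of the reference pixels.
    The cosine-similarity softmax stands in for an attention map (assumption).
    """
    ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    logits = tgt @ ref.T / temperature               # (M, N) affinities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    kernel = np.exp(logits)
    kernel /= kernel.sum(axis=1, keepdims=True)      # each row sums to 1
    return kernel @ labels_ref                       # (M, K) soft labels

feat_ref = np.array([[1.0, 0.0], [0.0, 1.0]])
labels_ref = np.eye(2)                               # pixel 0 -> class 0, pixel 1 -> class 1
feat_tgt = np.array([[0.9, 0.1], [0.2, 0.8]])

soft = propagate_labels(feat_ref, feat_tgt, labels_ref)
pred = soft.argmax(axis=1)
print(pred)
```

Each target pixel receives a convex combination of reference labels, so a valid label distribution on the reference frame stays valid after propagation.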

[59] Low-Resolution Editing is All You Need for High-Resolution Editing cs.CVPDF

Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han

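TL;DR: The paper proposes a new method for high-resolution image editing that achieves high-quality edits through patch-wise optimization and a detail transfer module.

Details

Motivation: High-resolution content creation is increasingly important in vision and graphics, but existing methods support only low resolutions (typically up to 1K). Meeting user needs requires a controllable high-resolution image editing mechanism.

Result: Experiments show that the method produces high-quality high-resolution edits, advancing high-resolution content creation.

Insight: High-resolution editing can be achieved by combining low-resolution patch-wise optimization with a detail-preserving strategy, avoiding the computational burden of processing high-resolution data directly.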

Abstract: High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, paving the way toward high-resolution content creation.

[60] Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting cs.CVPDF

Wen Zhang, Qin Ren, Wenjing Liu, Haibin Ling, Chenyu You

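TL;DR: The paper proposes SPROUT, a training-free and annotation-free method for nuclear instance segmentation that uses prototype-guided point prompts with the Segment Anything Model (SAM) to segment pathology images efficiently.

Details

Motivation: Nuclear instance segmentation still depends on dense supervision and computationally expensive fine-tuning, which limits scalability. The goal is a fully training-free solution.

Result: Across multiple histopathology benchmarks, SPROUT achieves performance competitive with supervised methods.

Insight: SPROUT demonstrates the potential of fully training-free segmentation and points toward scalable segmentation in pathology.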

Abstract: Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.
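
The final step in the abstract, turning foreground and background features into positive and negative point prompts, can be illustrated with a simplified sketch. It ranks patches by a cosine-similarity margin between a foreground and a background prototype; this margin is a plain stand-in for the paper's partial optimal transport alignment, and the names `point_prompts`, `proto_fg`, and `proto_bg` are hypothetical.

```python
import numpy as np

def point_prompts(features, coords, proto_fg, proto_bg, k=2):
    """Pick k positive and k negative point prompts by prototype similarity.

    features: (N, D) patch features; coords: (N, 2) patch centres as (x, y).
    proto_fg, proto_bg: (D,) foreground and background prototypes.
    The similarity margin replaces partial optimal transport (assumption).
    Choose k so that 2 * k <= N, otherwise the two sets overlap.
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    f = unit(features)
    margin = f @ unit(proto_fg) - f @ unit(proto_bg)   # (N,), high = foreground-like
    order = np.argsort(margin)
    negative = coords[order[:k]]                       # most background-like patches
    positive = coords[order[-k:]]                      # most foreground-like patches
    return positive, negative

features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
coords = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
positive, negative = point_prompts(
    features, coords, proto_fg=np.array([1.0, 0.0]), proto_bg=np.array([0.0, 1.0])
)
print(positive, negative)
```

The two coordinate sets would then be handed to a promptable segmenter as foreground and background clicks.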

[61] MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing cs.CV | cs.AIPDF

Changho Choi, Minho Kim, Jinkyu Kim

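TL;DR: MambaEye is a new visual encoder that uses causal sequential processing to be agnostic to input size, with linear complexity and the ability to adapt to arbitrary resolutions.

Details

Motivation: Conventional visual encoders are not input-size agnostic, whereas human vision is. MambaEye aims to close this gap.

Result: On ImageNet-1K classification, MambaEye performs well at high resolutions (such as $1536^2$) while maintaining linear time and memory complexity.

Insight: Strictly unidirectional processing and spatial-shift encoding are the key innovations that enable input-size-agnostic visual encoding.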

Abstract: Despite decades of progress, a truly input-size agnostic visual encoder (a fundamental characteristic of human vision) has remained elusive. We address this limitation by proposing MambaEye, a novel causal sequential encoder built on a pure Mamba2 backbone, which offers low complexity and causal processing. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$ on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.
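
The "relative move" idea in the abstract, encoding the spatial shift between consecutive patches rather than absolute positions, can be made concrete with a small sketch. It only computes the integer (row, column) shifts for a given scan order; how MambaEye maps these shifts to embeddings is not shown, and the helper names and the (0, 0) convention for the first patch are assumptions.

```python
def relative_moves(scan_order):
    """Return (d_row, d_col) shifts between consecutive patch positions.

    scan_order: list of (row, col) patch indices in the order they are visited.
    The first patch has no predecessor, so its move is (0, 0) by convention here.
    """
    moves = [(0, 0)]
    for (r0, c0), (r1, c1) in zip(scan_order, scan_order[1:]):
        moves.append((r1 - r0, c1 - c0))
    return moves

def raster_scan(rows, cols):
    """Row-major visiting order over a rows x cols patch grid."""
    return [(r, c) for r in range(rows) for c in range(cols)]

moves_small = relative_moves(raster_scan(2, 3))
moves_tall = relative_moves(raster_scan(5, 3))
print(moves_small)
print(set(moves_small) == set(moves_tall))
```

For a raster scan, adding rows introduces no new move types, which gives some intuition for why a shift-based encoding can extend to larger inputs; the line-wrap move still depends on the grid width.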

[62] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning cs.CVPDF

Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu

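TL;DR: HiCoGen is a reinforcement-learning-based framework for hierarchical compositional text-to-image generation. Its Chain of Synthesis (CoS) paradigm addresses the concept omission and poor compositionality that existing diffusion models show on complex prompts.

Details

Motivation: Existing diffusion models perform well on simple prompts, but on complex prompts involving multiple objects and hierarchical structures they often fail to follow instructions accurately, leading to concept omission, confusion, and poor compositionality.

Result: Experiments show that HiCoGen significantly outperforms existing methods in concept coverage and compositional accuracy.

Insight: 1. Hierarchical decomposition and iterative synthesis are an effective way to handle complex prompts; 2. The exploration capability of reinforcement learning can be improved by adjusting the stochasticity schedule; 3. A multi-level reward mechanism helps improve the overall quality of generated images.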

Abstract: Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
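
The abstract states that stochasticity should be concentrated in the early generation stages but does not give the schedule's functional form. The sketch below shows one possible shape, an exponential decay of a per-step noise scale; the exponential form, the constants, and the name `decaying_stochasticity` are assumptions chosen only to illustrate "high noise early, low noise late".

```python
import math

def decaying_stochasticity(num_steps, eta_start=1.0, decay=5.0):
    """Per-step noise scale that is largest at the first sampling step.

    The exponential form and the default constants are illustrative
    assumptions, not the schedule derived in the paper.
    """
    if num_steps == 1:
        return [eta_start]
    return [
        eta_start * math.exp(-decay * step / (num_steps - 1))
        for step in range(num_steps)
    ]

schedule = decaying_stochasticity(10)
print([round(eta, 3) for eta in schedule])
```

In a sampler that exposes a per-step stochasticity coefficient, values like these would replace a constant setting, so early steps explore widely while late steps remain nearly deterministic.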

[63] Boosting Reasoning in Large Multimodal Models via Activation Replay cs.CVPDF

Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan

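TL;DR: This paper proposes Activation Replay, a simple and effective method that improves the reasoning of post-trained multimodal models by modulating low-entropy activations, without expensive policy optimization.

Details

Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) is effective at improving the reasoning of Large Multimodal Models (LMMs), but the mechanisms behind it remain unclear. This paper explores how RLVR affects a model's input activations and uses the findings to improve reasoning.

Result: Experiments show that Activation Replay improves Pass@K and mitigates the narrower reasoning coverage caused by RLVR. The method performs well across a range of tasks.

Insight: Modulating low-entropy activations is an effective way to improve multimodal reasoning, and it requires no complex training procedure.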

Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel, simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates the narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Codes will be made publicly available.
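
The selection step implied by the abstract, identifying low-entropy input activations, can be sketched by computing the entropy of each token's projected vocabulary distribution and keeping the lowest-entropy positions. The toy logits, the `fraction` knob, and the function names are assumptions; the actual replay of base-model activations into the post-trained model is not shown.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) per token; logits: (T, V)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def low_entropy_positions(logits, fraction=0.25):
    """Sorted indices of the lowest-entropy tokens (fraction is a hypothetical knob)."""
    entropy = token_entropy(logits)
    k = max(1, int(round(fraction * len(entropy))))
    return np.sort(np.argsort(entropy)[:k])

# Toy logit-lens outputs for 4 tokens over a 3-word vocabulary.
logits = np.array([
    [10.0, 0.0, 0.0],   # sharply peaked: low entropy
    [0.0, 0.0, 0.0],    # uniform: entropy log(3)
    [5.0, 5.0, 0.0],    # two-way tie: medium entropy
    [0.0, 8.0, 0.0],    # peaked: low entropy
])
picked = low_entropy_positions(logits, fraction=0.5)
print(picked)
```

The selected positions mark where a replay-style intervention would act; the abstract's ablation contrasts this with replaying high-entropy positions instead.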

[64] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback cs.CV | cs.AIPDF

Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen

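TL;DR: EmoFeedback2 is a reinforcement-learning-based framework for continuous emotional image generation that uses a large vision-language model (LVLM) to provide rewards and textual feedback, improving the emotional continuity and fidelity of generated images.

Details

Motivation: Existing continuous emotional image generation methods lack emotional feedback from generated images and therefore cannot control emotional continuity well; their simple alignment between emotions and text prompts also leads to insufficient emotional fidelity.

Result: Experiments show that the method outperforms existing methods on the authors' custom dataset, generating high-quality images with accurate emotions.

Insight: The reasoning capability of an LVLM enables dynamic refinement of emotional generation, underscoring the importance of feedback for continuous emotion control.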

Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.

[65] SONIC: Spectral Optimization of Noise for Inpainting with Consistency cs.CVPDF

Seungyeon Baek, Erqun Dong, Shadan Namazifard, Mark J. Matthews, Kwang Moo Yi

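TL;DR: This paper proposes SONIC, a training-free method that improves inpainting with off-the-shelf text-to-image models by optimizing the initial noise in the spectral domain, with no need for a specialized inpainting model.

Details

Motivation: Guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, but in practice their effectiveness is limited, so specialized inpainting models are still needed. The paper aims to address this.

Result: SONIC outperforms the state of the art on a variety of inpainting tasks.

Insight: Optimizing the initial noise substantially improves the inpainting ability of generic models, and optimizing in the spectral domain is key to keeping the optimization stable and efficient.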

Abstract: We propose a novel training-free method for inpainting with off-the-shelf text-to-image models. While guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, in practice, their effectiveness is limited, leading to the necessity of specialized inpainting-specific models. In this work, we argue that the missing ingredient for training-free inpainting is the optimization (guidance) of the initial seed noise. We propose to optimize the initial seed noise to approximately match the unmasked parts of the data - with as few as a few tens of optimization steps. We then apply conventional training-free inpainting methods on top of our optimized initial seed noise. Critically, we propose two core ideas to effectively implement this idea: (i) to avoid the costly unrolling required to relate the initial noise and the generated outcome, we perform linear approximation; and (ii) to stabilize the optimization, we optimize the initial seed noise in the spectral domain. We demonstrate the effectiveness of our method on various inpainting tasks, outperforming the state of the art. Project page: https://ubc-vision.github.io/sonic/

[66] GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR cs.CVPDF

Farhaan Ebadulla, Chiraag Mudlpaur, Shreya Chaurasia, Gaurav BV

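TL;DR: GazeProphetV2 is a multimodal approach that combines gaze history, head movement, and scene content through gated fusion and cross-modal attention to improve gaze prediction in VR, reaching 93.1% validation accuracy.

Details

Motivation: Accurate gaze prediction in VR is important for rendering optimization and interaction design, but existing methods mostly depend on expensive eye-tracking hardware. This paper aims for more efficient gaze prediction by combining multimodal data.

Result: On a dataset of 5.3M gaze samples, the method reaches 93.1% validation accuracy and shows cross-scene robustness and temporal consistency.

Insight: 1. Multimodal fusion substantially improves prediction performance; 2. Head movement and scene content are important complements for gaze prediction; 3. The approach could replace expensive hardware and advance efficient VR systems.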

Abstract: Predicting gaze behavior in virtual reality environments remains a significant challenge with implications for rendering optimization and interface design. This paper introduces a multimodal approach to VR gaze prediction that combines temporal gaze patterns, head movement data, and visual scene information. By leveraging a gated fusion mechanism with cross-modal attention, the approach learns to adaptively weight gaze history, head movement, and scene content based on contextual relevance. Evaluations using a dataset spanning 22 VR scenes with 5.3M gaze samples demonstrate improvements in predictive accuracy when combining modalities compared to using individual data streams alone. The results indicate that integrating past gaze trajectories with head orientation and scene content enhances prediction accuracy across 1-3 future frames. Cross-scene generalization testing shows consistent performance with 93.1% validation accuracy and temporal consistency in predicted gaze trajectories. These findings contribute to understanding attention mechanisms in virtual environments while suggesting potential applications in rendering optimization, interaction design, and user experience evaluation. The approach represents a step toward more efficient virtual reality systems that can anticipate user attention patterns without requiring expensive eye tracking hardware.
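
The gated fusion mentioned in the abstract, adaptively weighting gaze history, head movement, and scene content, can be sketched as a softmax gate over three modality embeddings. The single linear gate, the embedding size, and the name `gated_fusion` are assumptions; the cross-modal attention component is omitted.

```python
import numpy as np

def gated_fusion(gaze, head, scene, w_gate):
    """Fuse three modality embeddings with softmax gate weights.

    gaze, head, scene: (D,) embeddings; w_gate: (3, 3 * D) gate projection.
    A single linear gate is an illustrative assumption.
    """
    stacked = np.stack([gaze, head, scene])      # (3, D)
    scores = w_gate @ stacked.reshape(-1)        # (3,) one score per modality
    scores = scores - scores.max()               # numerical stability
    gates = np.exp(scores) / np.exp(scores).sum()
    fused = gates @ stacked                      # (D,) weighted sum of modalities
    return fused, gates

rng = np.random.default_rng(0)
D = 4
gaze, head, scene = rng.standard_normal((3, D))
w_gate = rng.standard_normal((3, 3 * D))

fused, gates = gated_fusion(gaze, head, scene, w_gate)
print(gates.round(3), fused.shape)
```

Because the gates are computed from the inputs themselves, the weighting can change from sample to sample, which is what "based on contextual relevance" refers to in the abstract.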

Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song

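TL;DR: OmniRefiner is a reinforcement-learning-based local diffusion refinement framework that enhances fine-grained details of generated images under reference-image guidance, addressing the limitations of conventional methods in detail preservation and consistency.

Details

Motivation: VAE-based latent compression discards subtle texture information when refining generated images, causing identity and attribute cues to vanish; existing post-editing methods often produce unnatural results because of inconsistent lighting, texture, or shape.

Result: Experiments show that OmniRefiner significantly outperforms open-source and commercial models in reference alignment and fine-grained detail preservation.

Insight: Combining diffusion models with reinforcement learning can address detail loss and inconsistency in image refinement, offering a new approach to reference-guided image generation.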

Abstract: Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.

[68] CREward: A Type-Specific Creativity Reward Model cs.CVPDF

Jiyeon Han, Ali Mahdavi-Amiri, Hao Zhang, Haedong Jeong

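TL;DR: The paper proposes CREward, a type-specific creativity reward model that assesses creativity along three "axes" (geometry, material, and texture). It is trained on labels from large vision-language models (LVLMs), whose judgments align with human judgments, and is applied to creativity assessment, explainable creativity, and creative sample generation.

Details

Motivation: Treating creativity as a single dimension lacks discriminative power; the authors aim to quantify creativity at a finer grain with a type-specific assessment model.

Result: CREward effectively assesses creativity by type and proves useful in both generation and evaluation tasks.

Insight: Creativity is multidimensional, and LVLMs can serve as an alternative source of high-quality labels for training creativity models.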

Abstract: Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes" (geometry, material, and texture) to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

[69] On the Feasibility of Hijacking MLLMs’ Decision Chain via One Perturbation cs.CV | cs.AI | cs.CRPDF

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang

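TL;DR: The paper reveals a new threat: a single perturbation can hijack the entire decision chain of a multimodal large language model (MLLM) and steer it toward multiple targets. The authors propose Semantic-Aware Universal Perturbations (SAUPs) and an optimization algorithm; experiments show a 70% attack success rate.

Details

Motivation: Conventional adversarial attacks target a single decision, but real-world models often make sequences of decisions, where an isolated mistake is easily corrected while cascading errors can have severe consequences. This paper examines whether a single perturbation can manipulate the whole decision chain.

Result: SAUPs are validated on three MLLMs, reaching a 70% attack success rate when a single perturbation controls five distinct targets.

Insight: The results show that MLLMs are vulnerable to semantic-aware perturbations, an important warning for model security design.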

Abstract: Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake can be easily corrected, but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model’s outputs toward multiple, predefined outcomes, such as simultaneously misclassifying “non-motorized lane” signs as “motorized lane” and “pedestrian” as “plastic bag”. To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm, which searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability, achieving a 70% attack success rate when controlling five distinct targets using just an adversarial frame.

[70] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network cs.CV | cs.AIPDF

Yuanzhe Li, Steffen Müller

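TL;DR: The paper proposes a multimodal fusion network for predicting pedestrian crossing intention. It combines seven modality features from visual and motion branches, uses attention mechanisms to strengthen feature extraction and integration, and outperforms baseline methods on the JAAD dataset.

Details

Motivation: Predicting pedestrian crossing intention is crucial for the safe deployment of autonomous vehicles, but existing methods struggle with the diversity of pedestrian behavior and its dependence on multimodal context.

Result: Experiments on the JAAD dataset validate the network, which outperforms the baseline methods.

Insight: Combining multimodal fusion with attention mechanisms markedly improves the accuracy of pedestrian intention prediction, particularly in complex environments.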

Abstract: Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. A depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature interactions. To account for the varying importance of different modalities and frames, modality attention and temporal attention are designed to selectively emphasize informative modalities and effectively capture temporal dependencies. Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, achieving superior performance compared to the baseline methods.

Yuanzhe Li, Steffen Müller

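TL;DR: The paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for predicting pedestrian crossing intention. ACIT groups six visual and motion modalities into interaction pairs and combines a dual-path attention mechanism with a multi-modal feature fusion module, which markedly improves prediction accuracy.

Details

Motivation: Predicting pedestrian crossing intention is crucial for autonomous driving systems, but effectively extracting and integrating complementary information from multimodal data remains a major challenge.

Result: ACIT reaches accuracies of 70% and 89% on the JAADbeh and JAADall datasets, respectively, outperforming existing methods.

Insight: Complementary interaction between modalities is key to better pedestrian intention prediction, and attention mechanisms effectively capture the dynamic relationships between modalities.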

Abstract: Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian’s bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.

[72] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving cs.CV | cs.AIPDF

Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park

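TL;DR: The paper introduces WaymoQA, a multi-view visual question answering dataset for high-level reasoning in safety-critical autonomous driving scenarios. Using multi-view inputs, it frames Safety-Critical Reasoning as a two-stage process, and experiments confirm that the dataset improves the performance of existing MLLMs.

Details

Motivation: High-level reasoning in safety-critical scenarios is a major challenge for autonomous driving, and a single view cannot cope with complex scenes involving multiple risks. Multi-view inputs and a new reasoning approach are key to addressing this.

Result: Experiments show that existing MLLMs perform poorly in safety-critical scenarios, but their reasoning ability improves significantly after fine-tuning on WaymoQA.

Insight: Multi-view inputs and a safety-critical reasoning framework can improve safety in complex driving scenarios, and WaymoQA gives researchers an important tool.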

Abstract: Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.

[73] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention cs.CVPDF

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan

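TL;DR: The paper proposes Vision-Guided Attention (VGA), a method that reduces hallucinations in multimodal large language models (MLLMs) by using the semantic content of visual tokens to guide the model’s attention precisely to relevant regions.

Details

Motivation: When MLLMs process visual information, their attention has limited localization capability, which readily leads to hallucinations. Although the models can accurately extract semantics from visual tokens, they fail to fully exploit this during subsequent inference.

Result: VGA achieves state-of-the-art dehallucination performance on multiple hallucination benchmarks while adding only 4.36% latency overhead, and it is fully compatible with efficient attention implementations such as FlashAttention.

Insight: Explicit visual guidance plays a key role in the visual understanding of MLLMs and offers a new route to improving their performance.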

Abstract: Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model’s focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.

[74] MFM-point: Multi-scale Flow Matching for Point Cloud Generation cs.CV | cs.AI | cs.LGPDF

Petr Molodyk, Jaemoo Choi, David W. Romero, Ming-Yu Liu, Yongxin Chen

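TL;DR: MFM-Point is a multi-scale Flow Matching framework for point cloud generation that substantially improves the performance and scalability of point-based methods while preserving their simplicity and efficiency.

Details

Motivation: Existing point-based generation methods are cheap to train and algorithmically simple, but often underperform representation-based methods. MFM-Point is proposed to close this gap.

Result: MFM-Point is best in class among point-based methods and challenges the best representation-based methods, with strong results in multi-category and high-resolution generation.

Insight: Multi-scale generation and preserved geometric alignment are key to higher-quality point cloud generation, and they come at no additional training or inference overhead.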

Abstract: In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-Point demonstrates strong results in multi-category and high-resolution generation tasks.

[75] DeLightMono: Enhancing Self-Supervised Monocular Depth Estimation in Endoscopy by Decoupling Uneven Illumination cs.CVPDF

Mingyang Ou, Haojin Li, Yifeng Zhang, Ke Niu, Zhongxi Qiu

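TL;DR: DeLightMono is a novel self-supervised monocular depth estimation framework that uses illumination decoupling to counter the effect of uneven illumination in endoscopic images on depth estimation.

Details

Motivation: Self-supervised monocular depth estimation is a key task for endoscopic navigation systems, but its performance remains degraded by the uneven illumination of endoscopic images, especially in low-intensity regions. Existing low-light enhancement techniques fail to guide the depth network effectively, and solutions from other fields such as autonomous driving require well-lit images, making them unsuitable for endoscopy and increasing the data collection burden.

Result: Extensive comparisons and an ablation study on two public datasets verify the effectiveness of the proposed method.

Insight: Illumination decoupling effectively mitigates the impact of uneven illumination on depth estimation and offers a new direction for developing endoscopic navigation systems.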

Abstract: Self-supervised monocular depth estimation serves as a key task in the development of endoscopic navigation systems. However, performance degradation persists due to uneven illumination inherent in endoscopic images, particularly in low-intensity regions. Existing low-light enhancement techniques fail to effectively guide the depth network. Furthermore, solutions from other fields, like autonomous driving, require well-lit images, making them unsuitable and increasing data collection burdens. To this end, we present DeLight-Mono - a novel self-supervised monocular depth estimation framework with illumination decoupling. Specifically, endoscopic images are represented by a designed illumination-reflectance-depth model, and are decomposed with auxiliary networks. Moreover, a self-supervised joint-optimizing framework with novel losses leveraging the decoupled components is proposed to mitigate the effects of uneven illumination on depth estimation. The effectiveness of the proposed methods was rigorously verified through extensive comparisons and an ablation study performed on two public datasets.

[76] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images cs.CVPDF

Simon Damm, Jonas Ricker, Henning Petzka, Asja Fischer

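TL;DR: PRADA is a probability-ratio-based method for detecting and attributing autoregressively generated images. It analyzes the characteristics of a model’s probability ratio to achieve high-accuracy detection and source-model attribution.

Details

Motivation: Autoregressive image generation is advancing rapidly and producing increasingly realistic images, yet there are no methods dedicated to detecting such images. PRADA aims to fill this gap with reliable detection and source-model attribution.

Result: Experiments show that PRADA is highly effective against eight class-to-image and four text-to-image generation models, reliably detecting and attributing generated images.

Insight: PRADA shows that the probability-ratio characteristics of autoregressively generated images are model-specific, pointing to a new technical direction for generated-image detection.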

Abstract: Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, they allow for efficiently generating deceptively real-looking images, further increasing the need for reliable detection methods. However, to date there is a lack of work specifically targeting the detection of images generated by AR image generators. In this work, we present PRADA (Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images), a simple and interpretable approach that can reliably detect AR-generated images and attribute them to their respective source model. The key idea is to inspect the ratio of a model’s conditional and unconditional probability for the autoregressive token sequence representing a given image. Whenever an image is generated by a particular model, its probability ratio shows unique characteristics which are not present for images generated by other models or real images. We exploit these characteristics for threshold-based attribution and detection by calibrating a simple, model-specific score function. Our experimental evaluation shows that PRADA is highly effective against eight class-to-image and four text-to-image models.
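
The probability-ratio idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the mean log-ratio score, the midpoint calibration rule, and the assumption that a model's own images score higher are all hypothetical choices made for illustration, and the per-token log-probabilities are assumed to come from some AR image model that is not shown here.

```python
def probability_ratio_score(cond_logprobs, uncond_logprobs):
    """Mean per-token log ratio: log p(token | context, condition) - log p(token | context).

    Both inputs are lists of per-token log-probabilities for the same token
    sequence. This particular score is an illustrative assumption.
    """
    if len(cond_logprobs) != len(uncond_logprobs) or not cond_logprobs:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [c - u for c, u in zip(cond_logprobs, uncond_logprobs)]
    return sum(diffs) / len(diffs)


def calibrate_threshold(scores_own, scores_other):
    """Hypothetical calibration: midpoint between the two calibration-set means."""
    mean_own = sum(scores_own) / len(scores_own)
    mean_other = sum(scores_other) / len(scores_other)
    return (mean_own + mean_other) / 2


def is_attributed(score, threshold):
    """Attribute the image to the model when its score exceeds the threshold
    (assumes the model's own images score higher, which is an assumption)."""
    return score > threshold


# Toy numbers, invented purely to exercise the functions.
score = probability_ratio_score([-1.0, -2.0], [-2.0, -4.0])
threshold = calibrate_threshold([1.5, 2.5], [0.0, 1.0])
print(score, threshold, is_attributed(score, threshold))
```

A per-model threshold of this kind keeps the decision rule interpretable: each candidate source model gets its own score function and cut-off.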

[77] Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding cs.CVPDF

Jinghan Zhao, Yifei Huang, Feng Lu

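TL;DR: The paper proposes a new Task-Step-State (TSS) framework that learns state-grounded video representations to model procedural tasks. With a progressive pre-training strategy, the model better aligns abstract steps with observable video states and performs better on multiple downstream tasks.

Details

Motivation: Existing methods inject procedural semantics by aligning video content with task-level and step-level text descriptions, but these abstract “task” and “step” descriptions are hard to align robustly with the concrete details in video. The authors therefore introduce “states” as an intermediate layer that better connects abstract procedures to observable video content.

Result: On the COIN and CrossTask datasets, the method outperforms baseline models on task recognition, step recognition, and next step prediction. Ablation studies show that state supervision is the key driver of the gains.

Insight: States, as a visually-grounded semantic unit, bridge the gap between abstract procedures and concrete video content. Progressive pre-training enforces the intended hierarchy better than joint training.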

Abstract: Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, ‘task’ and ‘step’ descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce ‘states’, i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to ground representations in states while associating them with steps and high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pretraining strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.

[78] Explainable Visual Anomaly Detection via Concept Bottleneck Models cs.CV | cs.AIPDF

Arianna Stropeni, Valentina Zaccaria, Francesco Borsatti, Davide Dalle Pezze, Manuel Barusco

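TL;DR: The paper proposes CONVAD, an explainable visual anomaly detection method based on Concept Bottleneck Models (CBMs), which adds semantic explanations and addresses the limited user interpretability of traditional anomaly detection methods.

Details

Motivation: Traditional Visual Anomaly Detection (VAD) methods can provide visual explanations by highlighting anomalous regions, but these explanations lack direct semantic information, which limits how well users can understand and trust the detected anomalies.

Result: CONVAD performs comparably to classic VAD methods while providing richer, more understandable concept-driven explanations.

Insight: Bringing concept learning into VAD can markedly improve interpretability and user trust without sacrificing performance.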

Abstract: In recent years, Visual Anomaly Detection (VAD) has gained significant attention due to its ability to identify anomalous images using only normal images during training. Many VAD models work without supervision but are still able to provide visual explanations by highlighting the anomalous regions within an image. However, although these visual explanations can be helpful, they lack a direct and semantically meaningful interpretation for users. To address this limitation, we propose extending Concept Bottleneck Models (CBMs) to the VAD setting. By learning meaningful concepts, the network can provide human-interpretable descriptions of anomalies, offering a novel and more insightful way to explain them. Our contributions are threefold: (i) we develop a Concept Dataset to support research on CBMs for VAD; (ii) we improve the CBM architecture to generate both concept-based and visual explanations, bridging semantic and localization interpretability; and (iii) we introduce a pipeline for synthesizing artificial anomalies, preserving the VAD paradigm of minimizing dependence on rare anomalous samples. Our approach, Concept-Aware Visual Anomaly Detection (CONVAD), achieves performance comparable to classic VAD methods while providing richer, concept-driven explanations that enhance interpretability and trust in VAD systems.

[79] WPT: World-to-Policy Transfer via Online World Model Distillation cs.CVPDF

Guangfeng Jiang, Yueru Luo, Jun Liu, Yi Huang, Yiyao Zhu

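TL;DR: WPT is a World-to-Policy Transfer training paradigm that uses online distillation to transfer a world model’s knowledge into a lightweight policy, improving planning performance and real-time deployability.

Details

Motivation: Existing world-model approaches suffer from tight runtime coupling or depend on offline reward signals, which causes substantial inference overhead or hinders end-to-end optimization.

Result: WPT achieves SOTA on both open-loop and closed-loop benchmarks, with a 0.11 collision rate (open-loop), a 79.23 driving score (closed-loop), and up to 4.9x faster inference.

Insight: A world model’s knowledge can be transferred effectively to a lightweight policy through online distillation while retaining high performance and real-time operation.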

Abstract: Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent’s actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher’s reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that our WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and achieves a 79.23 driving score (closed-loop) surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9x faster inference, while retaining most of the gains.

[80] Multi Head Attention Enhanced Inception v3 for Cardiomegaly Detection cs.CVPDF

Abishek Karthik, Pandiyaraju V

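TL;DR: The paper combines Inception v3 with a multi-head attention mechanism for automatic detection of cardiomegaly in X-ray images, reporting high accuracy and clinically meaningful performance.

Details

Motivation: Demand for automatic detection of cardiovascular disease is growing, particularly for identifying structural abnormalities such as cardiomegaly. Combining deep learning with attention mechanisms can improve model sensitivity and accuracy.

Result: The model reaches an accuracy of 95.6, a precision of 95.2, and an AUC of 96.0, indicating high sensitivity and specificity.

Insight: Multi-head attention can automatically learn to focus on key regions, offering a new approach to medical image analysis, especially for tasks that require high sensitivity.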

Abstract: The healthcare industry has been revolutionized significantly by novel imaging technologies, not just in the diagnosis of cardiovascular diseases but also by the visualization of structural abnormalities like cardiomegaly. This article explains an integrated approach to the use of deep learning tools and attention mechanisms for automatic detection of cardiomegaly using X-ray images. The initiation of the project is grounded on a strong Data Collection phase and gathering the data of annotated X-ray images of various types. Then, while the Preprocessing module fine-tunes image quality, it is feasible to utilize the best out of the data quality in the proposed system. In our proposed system, the process is a CNN configuration leveraging the Inception v3 model as one of the key blocks. Besides, we also employ a multilayer attention mechanism to enhance the strength. The most important feature of the method is the multi-head attention mechanism that can learn features automatically. By exact selective focusing on only some regions of input, the model can thus identify cardiomegaly in a sensitive manner. Attention rating is calculated, duplicated, and applied to enhance representation of main data, and therefore there is a successful diagnosis. The Evaluation stage will be extremely strict and it will thoroughly evaluate the model based on such measures as accuracy and precision. This will validate that the model can identify cardiomegaly and will also show the clinical significance of this method. The model has accuracy of 95.6, precision of 95.2, recall of 96.2, sensitivity of 95.7, specificity of 96.1 and an Area Under the Curve (AUC) of 96.0, and their respective graphs are plotted for visualisation.

[81] LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening cs.CV | cs.AIPDF

Johannes Brandt, Maulik Chevli, Rickmer Braren, Georgios Kaissis, Philip Müller

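TL;DR: LungEvaty is an open-source, Transformer-based deep learning model that predicts 1-6 year lung cancer risk from a single LDCT scan, addressing the scalability and performance limitations of existing methods.

Details

Motivation: As low-dose CT (LDCT) screening becomes widespread, an efficient and scalable way to process whole-lung data is needed. Existing methods either rely on pixel-level annotations, which limits scalability, or analyze the lung in fragments, which weakens performance.

Result: LungEvaty matches state-of-the-art performance while being data-efficient and scalable, making it suitable for future research on longitudinal and multimodal lung cancer risk prediction.

Insight: Transformer architectures show promise in medical image analysis, especially for large-scale whole-lung data, and an anatomically informed attention mechanism can further improve performance.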

Abstract: Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.

[82] UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers cs.CVPDF

Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang

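TL;DR: UltraViCo is a training-free, plug-and-play method that suppresses attention for tokens beyond the training window, addressing the content repetition and quality degradation that video diffusion transformers show under length extrapolation.

Details

Motivation: Current video diffusion transformers struggle to generalize beyond their training length, producing repeated content and degraded quality. Prior methods tackle only repetition, through positional encodings, overlook quality degradation, and achieve limited extrapolation.

Result: At 4x extrapolation, UltraViCo improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method, and it applies seamlessly to controllable video synthesis and editing.

Insight: Attention dispersion is the root cause of poor length extrapolation, and adjusting attention weights directly addresses it without complex training or architectural changes.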

Abstract: Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines largely across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
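
To make the constant-decay idea in the abstract above concrete, the following is a minimal sketch for a single attention row. It is not the authors' code: it assumes that “beyond the training window” means a key whose distance from the query is at least `train_window`, and that the decay is applied to the unnormalized softmax weights before renormalizing. Both choices, and all names, are hypothetical.

```python
import math


def decayed_attention(logits, query_pos, train_window, decay):
    """Softmax over one query's key logits with a constant decay on far keys.

    Keys whose distance from the query is >= train_window have their
    unnormalized weight multiplied by `decay` (0 < decay <= 1), after which
    the row is renormalized to sum to 1.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    peak = max(logits)  # subtract the max for numerical stability
    weights = []
    for key_pos, logit in enumerate(logits):
        w = math.exp(logit - peak)
        if abs(key_pos - query_pos) >= train_window:
            w *= decay  # suppress keys outside the assumed training window
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


# Four keys with equal logits; the last two lie outside a window of 2.
row = decayed_attention([0.0, 0.0, 0.0, 0.0], query_pos=0, train_window=2, decay=0.5)
print(row)
```

Multiplying a weight by `decay` before normalization is equivalent to adding `log(decay)` to the corresponding logit, so under these assumptions the same effect could be expressed as an additive attention bias.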

[83] Vision-Language Models for Automated 3D PET/CT Report Generation cs.CVPDF

Wenpei Jiao, Kun Shang, Hui Li, Ke Yan, Jiajin Zhang

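TL;DR: PETRG-3D is an end-to-end 3D dual-branch framework for automated PET/CT report generation. It uses style-adaptive prompts to handle differences in reporting between hospitals and substantially improves performance on lymphoma datasets.

Details

Motivation: PET/CT is essential in oncology, but trained specialists are in short supply, so automated report generation is urgently needed. PETRG-3D targets the distinct challenges of functional and structural imaging.

Result: PETRG-3D substantially outperforms existing methods on natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All).

Insight: 3D dual-modality modeling and style-aware prompting are crucial for clinically useful PET/CT report generation; future work can focus on disease-aware reasoning and clinically reliable evaluation.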

Abstract: Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports w/ 245,509 paired PET/CT slices), and construct AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.

[84] Hybrid Convolution and Frequency State Space Network for Image Compression cs.CVPDF

Haodong Pan, Hao Wei, Yusong Wang, Nanning Zheng, Caigui Jiang

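TL;DR: HCFSSNet is a hybrid architecture that combines convolutional neural networks (CNNs) with a frequency state space model for learned image compression (LIC). By extracting local high-frequency information and modeling long-range low-frequency information, it achieves efficient bit allocation and competitive rate-distortion performance.

Details

Motivation: Existing Transformer and state space model (SSM) approaches to image compression can lose structural information or ignore frequency characteristics, while CNNs capture local high-frequency detail but lack long-range modeling.

Result: On Kodak, Tecnick, and CLIC, HCFSSNet reduces BD-rate over the VTM anchor by 18.06%, 24.56%, and 22.44%, respectively. Its rate-distortion performance is competitive with recent SSM-based codecs such as MambaIC while using significantly fewer parameters.

Insight: 1) A hybrid architecture can cover both local and global modeling; 2) adaptive frequency modulation and directional scanning improve bit allocation efficiency; 3) frequency-aware design makes compression more interpretable.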

Abstract: Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local high frequency details, whereas Transformers and SSMs provide strong long range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long range low frequency information. The VFSS block combines an Omni directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate distortion performance compared with recent SSM based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD rate over the VTM anchor by 18.06, 24.56 and 22.44 percent, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.

[85] Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data cs.CVPDF

Xin Hong, Ying Shi, Yinhao Li, Yen-Wei Chen

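TL;DR: The paper proposes a manifold-mapping-based method (R-TNAG) that combines a time-aware Neural ODE with an attention-based gated recurrent unit to predict Alzheimer’s disease (AD) progression from irregularly sampled longitudinal data.

Details

Motivation: Irregularly sampled longitudinal imaging data are hard to model in Euclidean space; the continuity and nonlinear geometric structure of the data need to be preserved.

Result: The method outperforms SOTA on AD prediction tasks and is robust to varying sequence lengths and missing-data rates.

Insight: Combining a manifold space with time-aware mechanisms markedly improves the modeling of irregular longitudinal data.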

Abstract: The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease progression. Most existing imaging-based disease prediction models operate in Euclidean space, which assumes a flat representation of data and fails to fully capture the intrinsic continuity and nonlinear geometric structure of irregularly sampled longitudinal images. To address the challenge of modeling Alzheimer’s disease (AD) progression from irregularly sampled longitudinal structural Magnetic Resonance Imaging (sMRI) data, we propose a Riemannian manifold mapping, a Time-aware manifold Neural ordinary differential equation, and an Attention-based riemannian Gated recurrent unit (R-TNAG) framework. Our approach first projects features extracted from high-dimensional sMRI into a manifold space to preserve the intrinsic geometry of disease progression. On this representation, a time-aware Neural Ordinary Differential Equation (TNODE) models the continuous evolution of latent states between observations, while an Attention-based Riemannian Gated Recurrent Unit (ARGRU) adaptively integrates historical and current information to handle irregular intervals. This joint design improves temporal consistency and yields robust AD trajectory prediction under irregular sampling. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art models in both disease status prediction and cognitive score regression. Ablation studies verify the contributions of each module, highlighting their complementary roles in enhancing predictive accuracy. Moreover, the model exhibits stable performance across varying sequence lengths and missing data rates, indicating strong temporal generalizability. Cross-dataset validation further confirms its robustness and applicability in diverse clinical settings.

[86] Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving cs.CV | cs.ROPDF

Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao

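TL;DR: MAP-World is a multi-modal motion planning framework for autonomous driving that couples masked action planning with a path-weighted world model. It avoids handcrafted anchors and reinforcement learning, achieving efficient multi-modal planning through diverse trajectory queries and a lightweight world model.

Details

Motivation: Existing planners rely on handcrafted anchors or reinforcement learning to select a single best trajectory mode, which discards information about alternative futures and complicates optimization. MAP-World aims to address this.

Result: On the NAVSIM benchmark, MAP-World matches anchor-based approaches and achieves SOTA among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.

Insight: MAP-World demonstrates the importance of diverse trajectory generation in multi-modal planning and shows that a lightweight world model is effective for efficient semantic prediction.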

Abstract: Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
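
The abstract above describes computing the semantic loss as an expectation over modes, weighted by trajectory probabilities, instead of using only a single selected path. A minimal sketch of that contrast follows. The function names and the toy numbers are hypothetical, and the per-mode losses are assumed to be already computed by a world model that is not shown.

```python
def expected_semantic_loss(mode_probs, mode_losses):
    """Probability-weighted average of per-mode losses.

    mode_probs are non-negative trajectory weights (renormalized here),
    mode_losses are the per-trajectory semantic losses.
    """
    if len(mode_probs) != len(mode_losses) or not mode_probs:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(mode_probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(p / total * loss for p, loss in zip(mode_probs, mode_losses))


def best_mode_loss(mode_probs, mode_losses):
    """Baseline for comparison: loss of the single most probable mode only."""
    best = max(range(len(mode_probs)), key=lambda i: mode_probs[i])
    return mode_losses[best]


# Toy values, invented for illustration.
probs = [0.5, 0.3, 0.2]
losses = [1.0, 2.0, 4.0]
print(expected_semantic_loss(probs, losses))  # uses every mode
print(best_mode_loss(probs, losses))          # ignores the two alternatives
```

In this toy case the expectation is sensitive to the high loss on the least likely mode, whereas the single-mode baseline never sees it, which is the information the abstract says is otherwise discarded.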

[87] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery cs.CV

Da Li, Ji-Ping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun

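TL;DR: SKEL-CF proposes a coarse-to-fine framework for estimating SKEL parameters, addressing the challenges of direct SKEL estimation such as limited training data and perspective ambiguity. By converting an SMPL dataset into a SKEL-aligned version and incorporating camera modeling, it substantially improves accuracy for human motion analysis.

Details

Motivation: Parametric 3D human models such as SMPL have advanced human pose and shape estimation, but their simplified kinematics limit biomechanical realism. The SKEL model addresses this by introducing an anatomically accurate skeleton, yet directly estimating its parameters still faces challenges such as limited training data and depth ambiguity.

Result: On the MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state of the art, HSMR (104.5 / 79.6).

Insight: Through its hierarchical design and data alignment strategy, SKEL-CF improves the accuracy of biomechanical skeleton estimation and offers a practical way to connect computer vision and biomechanics.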

Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
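
For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. PA-MPJPE is the same quantity computed after a Procrustes (similarity) alignment of the prediction to the ground truth. The sketch below covers only plain MPJPE. The joint coordinates are made-up toy values, not data from the paper.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding predicted and true joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy example with two joints whose errors are 3 and 4 units.
pred_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_joints = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print("MPJPE:", mpjpe(pred_joints, gt_joints))
```

Lower is better for both metrics, which is why 85.0 / 51.4 improves on 104.5 / 79.6.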

[88] Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs cs.CV

Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi

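TL;DR: This paper proposes Harmonious Parameter Adaptation (HPA) to address task forgetting and safety degradation in safety-aligned multimodal large language models (MLLMs) during continual visual instruction tuning (CVIT).

Details

Motivation: Existing CVIT studies largely overlook safety alignment, although real-world MLLMs need safety mechanisms to mitigate risks. Balancing task performance and safety during continual tuning is therefore a key challenge.

Result: Experiments on the CVIT benchmark and safety evaluation datasets show that HPA outperforms existing baselines, better maintaining high safety and reducing forgetting.

Insight: Safety-aligned MLLMs need task performance and safety to be handled separately during continual learning; parameter partitioning and orthogonality constraints are effective means of doing so.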

Abstract: While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.
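
The abstract does not spell out how the orthogonality constraint is imposed. One generic way to make an update orthogonal to a set of protected directions is to project those directions out of it, as sketched below. This is an illustrative assumption rather than HPA's actual procedure, and `protected` is a hypothetical stand-in for directions associated with earlier tasks or safety behavior.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: Sequence[Sequence[float]], eps: float = 1e-12) -> List[List[float]]:
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis: List[List[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update: Sequence[float], protected: Sequence[Sequence[float]]) -> List[float]:
    """Remove from `update` every component lying in the span of `protected`."""
    result = list(update)
    for b in orthonormal_basis(protected):
        c = dot(result, b)
        result = [r - c * bi for r, bi in zip(result, b)]
    return result


update = [1.0, 2.0, 3.0]
protected = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
constrained = project_out(update, protected)
print(constrained)
print([dot(constrained, p) for p in protected])  # each entry should be close to zero
```

The constrained update leaves the protected directions untouched to first order, which is the intuition behind using orthogonality to limit forgetting.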

[89] While recognizing actions, LMMs struggle to detect core interaction events cs.CV | cs.AI | q-bio.NC

Daniel Harari, Michael Sidorov, Liel David, Chen Shterental, Abrham Kahsay Gebreselasie

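TL;DR: This study examines how well large multi-modal models perceive interaction events in video. Although the models can describe actions and target objects, they perform poorly at locating the start and end frames of interaction events, indicating a lack of deep understanding of dynamic scenes.

Details

Motivation: Large models perform well on multi-modal tasks such as image and video understanding, but it is unclear whether they truly ground their semantic understanding in the visual input. The study tests their ability to localize interaction events such as object contact and release.

Result: The models can name target objects and actions and provide reasoning, but cannot accurately identify the start or end frame of an interaction or the location of the event.

Insight: The study indicates a gap between perception and semantic understanding of dynamic scenes in large models, which lack precise perception of physical interaction events.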

Abstract: Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached (‘contact’) or detached (‘release’). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.

[90] ADNet: A Large-Scale and Extensible Multi-Domain Benchmark for Anomaly Detection Across 380 Real-World Categories cs.CV

Hai Ling, Jia Guo, Zhulin Tao, Yunkang Cao, Donglin Di

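TL;DR: ADNet is a large-scale, multi-domain anomaly detection benchmark covering 380 real-world categories, designed to address the limitations of existing benchmarks.

Details

Motivation: Existing anomaly detection benchmarks such as MVTec-AD cover a limited range of categories, making it hard to evaluate cross-context generalization and scalability. ADNet fills this gap.

Result: Dinomaly-m achieves 83.2% I-AUROC and 93.1% P-AUROC on ADNet, outperforming existing methods.

Insight: The large-scale dataset exposes a scalability challenge, and a context-guided Mixture-of-Experts approach is an effective way to address it.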

Abstract: Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of categories, limiting the evaluation of cross-context generalization and scalability. We introduce ADNet, a large-scale, multi-domain benchmark comprising 380 categories aggregated from 49 publicly available datasets across Electronics, Industry, Agrifood, Infrastructure, and Medical domains. The benchmark includes a total of 196,294 RGB images, consisting of 116,192 normal samples for training and 80,102 test images, of which 60,311 are anomalous. All images are standardized with MVTec-style pixel-level annotations and structured text descriptions spanning both spatial and visual attributes, enabling multimodal anomaly detection tasks. Extensive experiments reveal a clear scalability challenge: existing state-of-the-art methods achieve 90.6% I-AUROC in one-for-one settings but drop to 78.5% when scaling to all 380 categories in a multi-class setting. To address this, we propose Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. It achieves 83.2% I-AUROC and 93.1% P-AUROC, demonstrating superior performance over existing approaches. ADNet is designed as a standardized and extensible benchmark, supporting the community in expanding anomaly detection datasets across diverse domains and providing a scalable foundation for future anomaly detection foundation models. Dataset: https://grainnet.github.io/ADNet
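
The dataset figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. Treating the 78.5% and 83.2% I-AUROC values as directly comparable is an assumption, since the abstract does not state that they come from the same evaluation protocol.

```python
# Counts quoted in the abstract.
total_images = 196_294
train_normal = 116_192
test_images = 80_102
test_anomalous = 60_311

# The training and test splits should add up to the reported total.
splits_consistent = (train_normal + test_images == total_images)

# Derived quantities that the abstract does not state explicitly.
test_normal = test_images - test_anomalous
anomalous_share = test_anomalous / test_images

# I-AUROC values from the abstract, in percent.
one_for_one = 90.6
multi_class = 78.5
dinomaly_m = 83.2
drop_points = round(one_for_one - multi_class, 1)
gain_points = round(dinomaly_m - multi_class, 1)

print("splits consistent:", splits_consistent)
print("normal test images:", test_normal)
print("anomalous share of test set:", round(anomalous_share, 3))
print("drop when scaling to 380 classes (points):", drop_points)
print("gain of Dinomaly-m over that figure (points):", gain_points)
```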

[91] Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware cs.CV

Federico Paredes-Valles, Yoshitaka Miyatani, Kirk Y. W. Scheper

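TL;DR: This paper presents a battery-powered, low-power wearable pupil tracking system based on an event-based vision sensor and neuromorphic hardware, the first real-time inference solution with complete on-device integration.

Details

Motivation: High-frequency, low-power pupil tracking on wearable devices is challenging, particularly achieving real-time inference within a low power budget. Conventional solutions cannot meet these requirements.

Result: The system was validated on a multi-user dataset and achieves binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye, demonstrated on a battery-powered wearable prototype.

Insight: The results indicate that end-to-end neuromorphic computing can support real-time, low-power pupil tracking for next-generation energy-efficient wearables.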

Abstract: Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.

[92] Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis cs.CV

Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel

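TL;DR: Exo2EgoSyn adapts the WAN 2.2 video generation model to synthesize egocentric (first-person) video from exocentric (third-person) video without training from scratch, improving cross-view video generation.

Details

Motivation: Existing foundation video generation models such as WAN 2.2 focus on same-view, text- or image-conditioned generation and cannot be applied directly to cross-view (Exo2Ego) video synthesis. The authors adapt an existing model to remove this limitation.

Result: Experiments on the ExoEgo4D dataset show that Exo2EgoSyn significantly improves Exo2Ego synthesis performance, validating the method.

Insight: Adapting an existing foundation model such as WAN 2.2, rather than training from scratch, enables high-fidelity cross-view video synthesis and suggests a route to scaling cross-view video generation.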

Abstract: Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.

[93] SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA cs.CV

Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang

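TL;DR: SFA is a new training-free framework designed for video text-based visual question answering (Video TextVQA). By scanning frames, focusing on key regions, and amplifying them, it guides the attention of a video large language model (Video-LLM) and substantially improves answer accuracy.

Details

Motivation: Video TextVQA requires accurately perceiving and understanding visual text in videos while integrating temporal and semantic context and filtering out irrelevant information. Existing methods struggle to handle these demands efficiently.

Result: SFA sets a new state of the art on several public Video TextVQA datasets, clearly outperforming previous methods.

Insight: By imitating the human process of answering questions, SFA shows how to guide model attention efficiently in complex video tasks, offering a new direction for future research.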

Abstract: The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM’s attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.

[94] GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering cs.CV

Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos

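TL;DR: GHR-VQA is a new video question answering framework that uses scene graphs and hierarchical relational reasoning, centered on human-object interactions, to substantially improve understanding of video content.

Details

Motivation: Traditional pixel-based video QA methods struggle to capture complex spatiotemporal dynamics and human-object interactions. The authors propose a more structured, interpretable approach to improve reasoning.

Result: The method achieves significant gains on the AGQA dataset, including a 7.3% improvement in object-relation reasoning.

Insight: Explicitly modeling human-object interactions and reasoning hierarchically captures the spatiotemporal dynamics of video more efficiently and improves both interpretability and performance.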

Abstract: We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
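
The video-level graph described above (one scene graph per frame, with human nodes across frames linked to a global root) can be sketched as a plain edge list. The node naming scheme, the `has_human` relation label, and the example triplets are hypothetical choices made for illustration; the paper's actual data structures may differ.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def build_video_graph(frames: List[List[Triplet]], human: str = "person") -> List[Triplet]:
    """Merge per-frame scene graphs into one graph rooted at a global node.

    Each frame's nodes get a frame suffix so the same object in different
    frames stays distinct, and the frame's human node is tied to "root".
    """
    edges: List[Triplet] = []
    for t, triplets in enumerate(frames):
        # Cross-frame link: the global root connects to this frame's human node.
        edges.append(("root", "has_human", f"{human}@{t}"))
        for subj, rel, obj in triplets:
            edges.append((f"{subj}@{t}", rel, f"{obj}@{t}"))
    return edges


frames = [
    [("person", "holding", "cup")],
    [("person", "drinking_from", "cup")],
]
for edge in build_video_graph(frames):
    print(edge)
```

Because every frame's human node hangs off the same root, a graph neural network passing messages through the root can relate interactions that occur in different frames.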

[95] OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation cs.CV | cs.AI

Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang

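TL;DR: OmniAlpha proposes a unified sequence-to-sequence framework for multi-task RGBA generation and editing. With the novel MSRoPE-BiL method and the AlphaLayers dataset, it substantially improves performance, especially on mask-free matting.

Details

Motivation: Existing generative models focus mainly on RGB synthesis, whereas RGBA processing calls for multi-task capability; current methods are either too specialized or confined to the RGB domain. OmniAlpha aims to fill this gap.

Result: OmniAlpha performs strongly across 21 tasks, with an 84.8% relative reduction in SAD on mask-free matting and over 90% of human preferences against baselines in layer-conditioned completion.

Insight: A unified multi-task model can improve RGBA processing through shared representation learning, opening a new direction for layer-aware generative systems.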

Abstract: Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
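
SAD (sum of absolute differences) is a standard matting error: the absolute per-pixel difference between the predicted and ground-truth alpha mattes, summed over the image. The sketch below shows the metric and how a relative reduction such as the quoted 84.8% is computed. The tiny mattes and the baseline value of 100 are invented for illustration; the abstract reports only the relative figure.

```python
from typing import Sequence


def sad(pred_alpha: Sequence[Sequence[float]], gt_alpha: Sequence[Sequence[float]]) -> float:
    """Sum of absolute differences between two alpha mattes of equal shape."""
    return sum(
        abs(p - g)
        for pred_row, gt_row in zip(pred_alpha, gt_alpha)
        for p, g in zip(pred_row, gt_row)
    )


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - new) / baseline


pred = [[0.0, 0.5], [1.0, 1.0]]
gt = [[0.0, 1.0], [1.0, 0.75]]
print("SAD:", sad(pred, gt))

# A hypothetical baseline SAD of 100 falling to 15.2 is an 84.8% relative reduction.
print("relative reduction:", relative_reduction(100.0, 15.2))
```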

[96] Text-guided Controllable Diffusion for Realistic Camouflage Images Generation cs.CV

Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin

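TL;DR: The paper proposes a controllable text-guided diffusion method (CT-CIG) for generating realistic and logically plausible camouflage images, using a vision-language model (VLM) and a lightweight controller to improve the camouflage effect.

Details

Motivation: Existing camouflage image generation methods often overlook the logical relationship between objects and backgrounds, producing unnatural results. The authors therefore combine text guidance with controllable diffusion to generate more realistic camouflage images.

Result: Experiments confirm the semantic alignment of the generated text prompts and show that CT-CIG produces photorealistic camouflage images, as demonstrated by CLIPScore evaluation and camouflage effectiveness assessment.

Insight: Combining text guidance with controllable diffusion markedly improves the naturalness and logical plausibility of camouflage images, while the lightweight modules help with scene fitness and capturing texture detail.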

Abstract: Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG’s ability to produce photorealistic camouflage images.

[97] Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder cs.CV

Juexin Zhang, Qifeng Zhong, Ying Weng, Ke Chen

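TL;DR: This paper proposes a contrastive learning-based ViT encoder for classifying glioblastoma subregions in pathology slides, which placed second in the BraTS-Path 2025 Challenge.

Details

Motivation: The high heterogeneity of glioblastoma complicates diagnosis and patient stratification, and deep learning offers a path toward objective, automated analysis of whole slide images.

Result: The model achieved an MCC of 0.7064 and an F1-score of 0.7676 on the validation set, and an MCC of 0.6509 and an F1-score of 0.5330 on the test set.

Insight: The ViT-based method provides a baseline for pathology analysis; future work needs to close the performance gap on unseen data.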

Abstract: The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model’s performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
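
The two reported metrics weigh errors differently, which helps explain why they can move apart between validation and test. The sketch below computes both from a binary confusion matrix. The challenge itself involves multiple subregion classes and the abstract does not say how F1 is averaged, so the binary form and the toy counts here are simplifying assumptions.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class; true negatives are ignored."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy counts, not taken from the paper.
tp, fp, fn, tn = 40, 10, 10, 40
print("MCC:", mcc(tp, fp, fn, tn))
print("F1: ", f1_score(tp, fp, fn))
```

MCC uses all four cells of the confusion matrix, while F1 ignores true negatives, so a change in class balance between splits can affect one more than the other.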

[98] V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs cs.CV

Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen

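TL;DR: This paper proposes V-Attack, which performs precise local semantic attacks by targeting the value features (V) in transformer attention blocks, overcoming the limited controllability of existing adversarial attacks.

Details

Motivation: Existing adversarial attacks on large vision-language models (LVLMs) struggle to precisely manipulate the semantics of specific concepts, mainly because of semantic entanglement in patch-token representations: global context dominates local features, preventing precise semantic manipulation.

Result: Experiments show that on LVLMs including LLaVA, InternVL, DeepseekVL, and GPT-4o, V-Attack improves the attack success rate by an average of 36% over existing methods.

Insight: Value features are the key to disentangling local semantics and provide a more controllable handle for adversarial attacks. This finding exposes a critical vulnerability in the semantic understanding of modern vision-language models.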

Abstract: Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V’s intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate a source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available at https://github.com/Summu77/V-Attack.

[99] Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation cs.CV | cs.AI | cs.LG

Daniel Kienzle, Katja Ludwig, Julian Lorenz, Shin’ichi Satoh, Rainer Lienhart

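TL;DR: This paper proposes a novel two-stage pipeline for precisely estimating the 3D trajectory and spin of a table tennis ball from monocular video, addressing the poor real-world generalization of existing methods.

Details

Motivation: Existing methods trained on synthetic data generalize poorly to the noisy, imperfect ball and table detections of the real world, mainly because real-world video lacks 3D trajectory and spin annotations.

Result: The proposed approach turns a proof-of-concept uplifting method into a practical, robust, high-performing end-to-end application for precise analysis of the 3D trajectory and spin of a table tennis ball.

Insight: Splitting the problem into stages and combining synthetic data with real-world data substantially improves robustness and practicality.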

Abstract: Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.

[100] Zoo3D: Zero-Shot 3D Object Detection at Scene Level cs.CV

Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich

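TL;DR: Zoo3D proposes a training-free 3D object detection framework that builds 3D bounding boxes by graph clustering of 2D instance masks and assigns semantic labels with an open-vocabulary module. Its zero-shot mode outperforms existing self-supervised methods.

Details

Motivation: Real-world environments require models that can recognize previously unseen objects, yet existing methods still depend on training scenes. Zoo3D aims to provide a fully training-free 3D detection framework.

Result: Zoo3D achieves state-of-the-art results on ScanNet200 and ARKitScenes, and its zero-shot mode outperforms existing self-supervised methods.

Insight: Training-free open-vocabulary methods have great potential for 3D understanding, and lifting 2D results to 3D is an effective path to zero-shot detection.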

Abstract: 3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .

[101] XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface cs.CV | cs.AI

Alexander C. Jenke, Gregor Just, Claas de Boer, Martin Wagner, Sebastian Bodenstedt

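TL;DR: The paper presents a lightweight ResNet18-based pipeline that automatically detects the camera activation state in the Da Vinci Xi surgical system's user interface, supporting downstream surgical data analysis tasks.

Details

Motivation: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback, and detecting camera activation provides valuable metadata for tasks such as tool tracking and skill assessment.

Result: The model achieves F1-scores between 0.993 and 1.000 on the binary detection task and correctly localizes the camera tile with no false multiple-camera detections.

Insight: The work provides a reliable preprocessing tool for surgical data analysis, and the open-source code and models help advance the field.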

Abstract: Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.

[102] The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation cs.CV

Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou

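TL;DR: This paper proposes Adv-GRPO, a new reinforcement learning framework that uses an adversarial reward to address reward hacking in image generation and uses vision foundation models such as DINO to provide rich visual rewards, substantially improving the quality and aesthetics of generated images.

Details

Motivation: Existing reinforcement learning methods rely on pre-trained preference models for scalar rewards, but these rewards do not accurately reflect human perception and are vulnerable to reward hacking, so a high score does not guarantee a high-quality image.

Result: In human evaluation, Adv-GRPO outperforms Flow-GRPO and SD3, with win rates of 70.0% for image quality and 72.4% for aesthetics.

Insight: The image itself can serve as a rich reward signal; combining vision foundation models with reference samples not only improves generation quality but also enables distribution transfer and flexible style customization.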

Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.

Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou

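TL;DR: This paper proposes MBCD, a unified collaborative distillation framework that uses adaptive modality dropout and a gradient consistency constraint to address the problems weight averaging causes in multi-modal domain generalization, achieving better cross-modal fusion and a flatter loss landscape.

Details

Motivation: In multi-modal domain generalization (MMDG), directly applying weight averaging (WA) biases the model toward faster-converging modalities and suppresses the contribution of the others, harming fusion and generalization.

Result: On MMDG benchmarks, MBCD outperforms existing methods, delivering higher accuracy and robustness.

Insight: Balancing optimization speed across modalities and strengthening cross-modal interaction substantially improves the generalization of multi-modal models.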

Abstract: Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA’s flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
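
As background for the WA-based teacher mentioned above, weight averaging in its simplest form keeps a running mean of the student's parameters across checkpoints. The sketch below assumes uniform averaging over flattened parameter lists. The abstract does not specify the averaging schedule MBCD uses, so this is a generic illustration only.

```python
from typing import List, Sequence


def update_average(avg: Sequence[float], new: Sequence[float], n_seen: int) -> List[float]:
    """Fold one more checkpoint into a running uniform average.

    `avg` is the mean of the first `n_seen` checkpoints and `new` is the
    next one. The result is the mean of `n_seen + 1` checkpoints.
    """
    return [a + (w - a) / (n_seen + 1) for a, w in zip(avg, new)]


# Three hypothetical checkpoints of a two-parameter model.
checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

teacher = list(checkpoints[0])
for n_seen, weights in enumerate(checkpoints[1:], start=1):
    teacher = update_average(teacher, weights, n_seen)

print("averaged teacher weights:", teacher)
```

The incremental form avoids storing every checkpoint, which matters when the averaged teacher is updated throughout training.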

[104] Advancing Image Classification with Discrete Diffusion Classification Modeling cs.CV

Omer Belhasin, Shelly Golan, Ran El-Yaniv, Michael Elad

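TL;DR: The paper proposes Discrete Diffusion Classification Modeling (DiDiCM), which models the posterior distribution of class labels through a diffusion process, improving image classification under high-uncertainty conditions.

Details

Motivation: Conventional classifiers perform poorly when input images are corrupted or training data are limited; DiDiCM aims to use diffusion models to improve classification accuracy in such settings.

Result: On ImageNet, DiDiCM achieves higher classification accuracy than baseline models, with larger gains on more challenging tasks.

Insight: Diffusion models show promise for classification under high uncertainty, and the flexible prediction modes offer practical value in real applications.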

Abstract: Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm .

[105] DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection cs.CVPDF

Amirhossein Khadivi Noghredeh, Abdollah Safari, Fatemeh Ziaeetabar, Firoozeh Haghighi

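TL;DR: The paper proposes a semi-supervised deep reinforcement learning framework for anomaly detection in industrial visual inspection. It combines a neural batch sampler with a composite reward, an autoencoder, and a predictor to detect subtle defects effectively.

Details

Motivation: Anomalous samples are scarce in industrial visual inspection, and existing unsupervised methods tend to overfit and struggle to detect subtle defects.

Result: On the MVTec AD dataset, the method outperforms existing approaches, with average improvements of 0.15 in F1_max and 0.06 in AUC, and a maximum F1_max gain of 0.37.

Insight: Adaptive selection through reinforcement learning, combined with semi-supervised learning, can substantially improve the accuracy and localization of anomaly detection.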

Abstract: Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.
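
The F1_max figure reported above can be made concrete with a short sketch. It assumes F1_max means the best F1 score over all decision thresholds on the anomaly scores, which is the usual reading in anomaly detection; the abstract does not define the metric, and the function name `f1_max` is illustrative.

```python
def f1_max(scores, labels):
    """Best F1 over thresholds; a sample is flagged when score >= threshold.

    Assumption: labels are 1 for anomalous and 0 for normal.
    """
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


example = f1_max([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

For these four samples the best threshold is 0.35, which flags both anomalies plus one normal sample and gives an F1 of 0.8.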

[106] VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs cs.CVPDF

Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng

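TL;DR: The paper introduces VKnowU, a benchmark that evaluates how well multimodal large language models (MLLMs) understand visual knowledge, and proposes VideoKnow+, a model that improves performance through reinforcement learning with a visual knowledge reward, addressing a gap in existing models.

Details

Motivation: Current MLLMs excel at object recognition but lack a high-level, vision-grounded understanding of the physical and social world, a capability that bridges perception and reasoning.

Result: VideoKnow+ improves by 3.7% on VKnowU and shows consistent gains on other benchmarks, but still falls short of human performance.

Insight: Visual knowledge is key for MLLMs to generalize more broadly; future work should focus more on modeling high-level semantic understanding.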

Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world’s underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric types. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

[107] DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion cs.CVPDF

Yinghui Li, Qianyu Zhou, Di Shao, Hao Yang, Ye Zhu

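TL;DR: The paper proposes DAPointMamba, a domain-adaptive framework for point cloud completion that builds on State Space Models (SSMs) to overcome the limited receptive fields and high computational complexity of conventional methods in domain adaptation.

Details

Motivation: Existing point cloud completion methods rely on CNNs or Transformers and therefore suffer from limited receptive fields or quadratic complexity in cross-domain tasks. The paper studies the adaptability of SSMs to point cloud completion and finds that applying them directly disrupts spatial topology and local geometric features and neglects the design of domain-agnostic representations.

Result: On synthetic and real-world datasets, DAPointMamba outperforms existing methods with lower computational complexity and inference latency.

Insight: SSMs are promising for point cloud completion, offering global receptive fields and efficient computation; cross-domain alignment requires a multi-level design that spans local geometry to global semantics.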

Abstract: Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffer from limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of State Space Models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC encounters several challenges: directly serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. Moreover, overlooking the design of domain-agnostic representation learning hinders adaptation performance. To address these issues, we propose a novel framework, DAPointMamba for DA PCC, that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.

Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang

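TL;DR: The paper proposes a training-free framework based on a multimodal chain of thought (MM-CoT) that generates physically consistent videos through iterative self-refinement, raising the Physics-IQ score from 56.31 to 62.38.

Details

Motivation: Although video generation models have made notable progress in visual quality, the videos they generate often violate real-world physical laws. The authors address this by introducing physics-aware guidance.

Result: On the PhyIQ benchmark, the method raises the Physics-IQ score from 56.31 to 62.38.

Insight: The work is a preliminary exploration of physics-consistent video generation and offers ideas for future research.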

Abstract: Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
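
The iterative self-refinement loop described in the abstract can be sketched as a generate, critique, rewrite cycle. The three callables below (`generate_video`, `find_violations`, `rewrite_prompt`) and the `max_rounds` cap are hypothetical stand-ins for the video model, the VLM critic, and the LLM rewriter; the abstract does not specify their interfaces or a stopping rule.

```python
def refine_prompt_iteratively(prompt, generate_video, find_violations,
                              rewrite_prompt, max_rounds=3):
    """Refine a prompt until the critic reports no physical inconsistencies.

    All callables are assumed interfaces, not the paper's actual API.
    """
    history = []
    video = generate_video(prompt)
    for _ in range(max_rounds):
        violations = find_violations(video)
        history.append((prompt, violations))
        if not violations:
            break  # the critic found nothing left to fix
        prompt = rewrite_prompt(prompt, violations)
        video = generate_video(prompt)
    return prompt, video, history


# Toy stand-ins so the loop can be run without any model.
final_prompt, final_video, log = refine_prompt_iteratively(
    "a ball drops",
    generate_video=lambda p: p,
    find_violations=lambda v: [] if "gravity" in v else ["ball floats"],
    rewrite_prompt=lambda p, issues: p + " obeying gravity",
)
```

With these toy stand-ins the loop stops after two critic calls, and `final_prompt` is `"a ball drops obeying gravity"`. No model weights are updated anywhere in the loop, which is what makes the approach training-free.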

[109] Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations cs.CVPDF

Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

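TL;DR: The paper proposes BTTF, an optimization framework for generating video counterfactual explanations (CFEs), addressing the lack of temporal coherence and physical plausibility when existing image CFE methods are applied to video.

Details

Motivation: Existing counterfactual explanation methods are designed mainly for image classifiers and cannot produce temporally coherent, smooth, and physically plausible video counterfactuals. The authors propose BTTF to address this limitation.

Result: Experiments on the Shape-Moving, MEAD, and NTU RGB+D video datasets show that BTTF generates valid, visually similar, and realistic counterfactual videos that reveal the classifier's decision-making mechanism.

Insight: Video counterfactual explanations must balance temporal coherence and physical plausibility; BTTF achieves this through its optimization strategy, offering a new approach to explaining video classifiers.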

Abstract: Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing CFE image-based methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier’s decision-making mechanism.

[110] CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation cs.CVPDF

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng

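TL;DR: CrossEarth-Gate is a parameter-efficient fine-tuning method for remote sensing semantic segmentation. A Fisher-information-guided dynamic selection mechanism operating over a multi-module toolbox handles multifaceted domain gaps, and the method achieves state-of-the-art performance on 16 cross-domain benchmarks.

Details

Motivation: Remote sensing data exhibit multifaceted domain gaps (such as spatial, semantic, and frequency shifts) that existing parameter-efficient fine-tuning methods cannot fully handle.

Result: The method achieves state-of-the-art performance on 16 cross-domain remote sensing semantic segmentation benchmarks.

Insight: A Fisher-information-guided dynamic selection mechanism can handle the multifaceted domain gaps in remote sensing data while improving adaptation efficiency and effectiveness.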

Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module’s importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.
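
One way to picture Fisher-guided module selection is a sketch that scores each candidate module by the empirical diagonal Fisher information (the mean squared per-sample gradient, summed over the module's parameters) and keeps the top-k. This estimator, the function names, and the toy gradients are assumptions; the abstract does not state how CrossEarth-Gate estimates the Fisher information or how many modules it activates.

```python
def fisher_importance(per_sample_grads):
    """Score modules by mean squared gradient norm across samples.

    per_sample_grads maps a module name to a list of per-sample gradient
    vectors (lists of floats). Empirical diagonal Fisher, assumed here.
    """
    scores = {}
    for module, grads in per_sample_grads.items():
        total = sum(g * g for vector in grads for g in vector)
        scores[module] = total / len(grads)
    return scores


def select_top_k(scores, k):
    """Return the k module names with the highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


toy_grads = {
    "spatial": [[1.0, 0.0], [1.0, 0.0]],
    "semantic": [[0.1, 0.1]],
    "frequency": [[2.0, 0.0]],
}
scores = fisher_importance(toy_grads)
active = select_top_k(scores, k=2)
```

With these toy gradients the frequency module scores 4.0 and the spatial module 1.0, so `active` is `["frequency", "spatial"]` and the semantic module stays frozen.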

[111] TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection cs.CVPDF

Han Guo, Chenyang Liu, Haotian Zhang, Bowen Chen, Zhengxia Zou

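TL;DR: TaCo is a spatio-temporal semantically consistent network that improves remote sensing change detection through a joint constraint, specifically targeting semantic inconsistency.

Details

Motivation: Existing remote sensing change detection methods rely mainly on mask supervision and place little constraint on temporal semantic transitions, leaving semantic inconsistencies unresolved.

Result: TaCo achieves SOTA performance on six public datasets without adding any computational overhead at inference.

Insight: Treating change as a semantic transition and introducing multimodal (textual) information can substantially improve the semantic consistency and performance of change detection.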

Abstract: Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides spatial localization but provides limited constraints on the temporal semantic transitions. Consequently, they often produce spatially coherent predictions while still suffering from unresolved semantic inconsistencies. To address this limitation, we propose TaCo, a spatio-temporal semantic consistent network, which enriches the existing mask-supervised framework with a spatio-temporal semantic joint constraint. TaCo conceptualizes change as a semantic transition between bi-temporal states, in which one temporal feature representation can be derived from the other via dedicated transition features. To realize this, we introduce a Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct the cross-temporal transition features. In addition, we propose a spatio-temporal semantic joint constraint consisting of bi-temporal reconstruct constraints and a transition constraint: the former enforces alignment between reconstructed and original features, while the latter enhances discrimination for changes. This design can yield substantial performance gains without introducing any additional computational overhead during inference. Extensive experiments on six public datasets, spanning both binary and semantic change detection tasks, demonstrate that TaCo consistently achieves SOTA performance.

[112] TReFT: Taming Rectified Flow Models For One-Step Image Translation cs.CVPDF

Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang

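TL;DR: TReFT is a new method that achieves one-step image translation by directly using the velocity predicted by a pretrained RF model as the output, removing the multi-step denoising required by existing methods and enabling real-time inference across multiple tasks.

Details

Motivation: Existing Rectified Flow (RF) models depend on multi-step denoising for image translation, which is computationally expensive and hard to run in real time. CycleGAN-Turbo achieves one-step translation with pretrained diffusion models, but applying it directly to RF models causes severe convergence problems.

Result: TReFT achieves performance comparable to state-of-the-art methods on multiple image translation datasets while supporting one-step inference, which substantially improves speed.

Insight: Near the end of the denoising process, the velocity predicted by a pretrained RF model converges to a vector pointing to the final clean image. This property provides the theoretical basis for one-step image translation.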

Abstract: Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.

[113] AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models cs.CVPDF

Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng

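TL;DR: The paper proposes the AD-R1 framework, which introduces an Impartial World Model and Counterfactual Synthesis to address the optimistic bias of RL in autonomous driving and substantially improve safety.

Details

Motivation: Conventional end-to-end autonomous driving models struggle with long-tail events and safety. RL is a promising remedy, but it has had little success because its world models are optimistically biased.

Result: Experiments show that the method significantly outperforms baseline models on the Risk Foreseeing Benchmark and reduces safety violations.

Insight: Teaching the model to predict dangerous events is key to safer autonomous driving; combining synthetic data with closed-loop RL can substantially improve the realism and reliability of RL policies.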

Abstract: End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to “dream” of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.

[114] 3D Motion Perception of Binocular Vision Target with PID-CNN cs.CV | cs.AIPDF

Shi Jiazhao, Pan Pan, Shi Haotian

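TL;DR: The paper proposes a PID-CNN-based network for 3D motion perception of binocular vision targets that predicts a target's 3D coordinates, velocity, and acceleration in real time. It analyzes neural network design principles from a PID perspective, builds a small 17-layer network, and shows on simulated datasets that prediction accuracy approaches the upper limit set by the input resolution.

Details

Motivation: The work aims to perceive the 3D motion of binocular vision targets in real time and to understand the nonlinear fitting ability of neural networks from the perspective of PID control theory.

Result: Experiments show that the network's prediction accuracy on simulated datasets is close to the theoretical upper limit set by the input image resolution.

Insight: The paper argues that high-dimensional convolution can improve computational efficiency and feature space utilization, and that PID information might be used to implement memory and attention mechanisms, suggesting directions for future research.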

Abstract: This article trains a network for perceiving the three-dimensional motion of a binocular vision target; the network provides real-time three-dimensional coordinates, velocity, and acceleration and has a basic spatiotemporal perception capability. We interpret the ability of neural networks to fit nonlinear problems from the perspective of PID, viewing a single-layer neural network as a second-order difference equation combined with a nonlinearity that describes a local problem. Multilayer networks gradually transform the raw representation into the desired representation through multiple such combinations. We analyse some reference principles for designing neural networks and design a relatively small PID convolutional neural network with 17 layers and 413 thousand parameters, implementing a simple but practical feature reuse method through concatenation and pooling. The network was trained and tested on simulated datasets of randomly moving balls, and the experimental results show that the prediction accuracy is close to the upper limit that the input image resolution can represent. We analyse the experimental results and errors, as well as the existing shortcomings and possible directions for improvement. Finally, we discuss the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization, as well as the potential advantages of using PID information to implement memory and attention mechanisms.
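
The link between difference equations and the predicted quantities can be illustrated with classical finite differences: a first difference of position approximates velocity and a second difference approximates acceleration. This is only a sketch of that relationship along one axis, not the paper's network; the function name and the fixed frame interval `dt` are assumptions.

```python
def finite_difference_motion(positions, dt):
    """Estimate velocity and acceleration from equally spaced positions.

    positions: coordinates along one axis, sampled every dt seconds.
    Returns (velocities, accelerations) from first and second differences.
    """
    velocities = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    accelerations = [
        (v2 - v1) / dt for v1, v2 in zip(velocities, velocities[1:])
    ]
    return velocities, accelerations


# Positions following x = t**2 at t = 0, 1, 2, 3 (constant acceleration 2).
vel, acc = finite_difference_motion([0.0, 1.0, 4.0, 9.0], dt=1.0)
```

For this trajectory the first differences are `[1.0, 3.0, 5.0]` and the second differences are `[2.0, 2.0]`, matching the constant acceleration of the underlying motion.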

[115] ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation cs.CVPDF

Onur Berk Tore, Ibrahim Samil Yalciner, Server Calap

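TL;DR: The paper proposes ShelfRectNet, a deep-learning method for single-view shelf image rectification that estimates a 4-point parameterized homography matrix, using a ConvNeXt backbone and normalized coordinate regression to improve performance. To address data scarcity, it introduces a new augmentation strategy, and it achieves a mean corner error of 1.298 pixels on the test set.

Details

Motivation: Single-view image rectification has practical value in domains such as retail, but data scarcity and viewpoint constraints limit conventional methods. The paper aims to address these problems.

Result: The mean corner error on the test set is 1.298 pixels, and the method is competitive with both classical and deep learning-based approaches.

Insight: Combining a modern backbone with normalized regression improves the stability of homography estimation; synthetic data augmentation can effectively mitigate data scarcity.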

Abstract: Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf monitoring and product alignment. In this paper, we present a deep learning framework that predicts a 4-point parameterized homography matrix to rectify shelf images captured from arbitrary angles. Our model leverages a ConvNeXt-based backbone for enhanced feature representation and adopts normalized coordinate regression for improved stability. To address data scarcity and promote generalization, we introduce a novel augmentation strategy by modeling and sampling synthetic homographies. Our method achieves a mean corner error of 1.298 pixels on the test set. When compared with both classical computer vision and deep learning-based approaches, our method demonstrates competitive performance in both accuracy and inference speed. Together, these results establish our approach as a robust and efficient solution for real-world single-view rectification. To encourage further research in this domain, we will make our dataset, ShelfRectSet, and code publicly available.
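
A 4-point parameterization and the mean corner error can be sketched as follows. The sketch assumes the common convention in which the network predicts where four reference corners land and the 3x3 homography is then recovered by solving the eight linear equations those correspondences give (with the last matrix entry fixed to 1). The abstract does not spell out ShelfRectNet's exact convention, and all function names here are illustrative.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def homography_from_corners(src, dst):
    """Recover H (with H[2][2] = 1) that maps four src corners onto dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear_system(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


def mean_corner_error(predicted, target):
    """Average Euclidean distance between predicted and true corners."""
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(target)


unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
moved = [(2.0, 3.0), (4.0, 3.0), (4.0, 5.0), (2.0, 5.0)]
H = homography_from_corners(unit_square, moved)
center = apply_homography(H, (0.5, 0.5))
```

In this toy case the target corners are the unit square scaled by 2 and shifted by (2, 3), so the recovered matrix is that similarity transform and `center` lands at approximately (3, 4). Parameterizing by corner positions keeps all eight regression targets on the same pixel scale, which is one reason this form is popular for learned homography estimation.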

[116] AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend cs.CVPDF

Hengyi Wang, Lourdes Agapito

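TL;DR: AMB3R is a multi-view feed-forward model for dense 3D reconstruction at metric scale that supports a range of 3D vision tasks.

Details

Motivation: Existing 3D reconstruction methods typically incur extra cost for task-specific optimization or test-time adjustment. AMB3R aims to provide a general, efficient feed-forward solution that extends to different tasks without further tuning.

Result: On multiple benchmarks, AMB3R outperforms existing methods in camera pose estimation, depth estimation, and 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods.

Insight: The effectiveness of the sparse volumetric representation and the generality of the feed-forward design suggest that a compact scene representation can support diverse 3D tasks without sacrificing performance.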

Abstract: We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.

[117] Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin cs.CV | cs.ROPDF

João Malheiro Silva, Andy Huynh, Tong Duy Son, Holger Caesar

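TL;DR: The paper proposes a 3D Gaussian Splatting reconstruction method based on multi-view images that combines semantic material masks with physics-based material properties to deliver high-fidelity 3D reconstruction and sensor simulation for digital twins, avoiding the complexity and calibration issues of conventional LiDAR-camera fusion.

Details

Motivation: Conventional LiDAR-camera fusion requires complex calibration for 3D reconstruction and handles materials such as glass poorly. The paper aims for more efficient and accurate digital twin reconstruction with a camera-only method that combines semantic and physical properties.

Result: The method is validated on an internal test dataset, using LiDAR as ground truth for reflectivity alongside image similarity metrics, and achieves sensor simulation fidelity comparable to LiDAR-camera fusion.

Insight: By combining semantic and physical properties, a camera-only method can replace the complex conventional LiDAR-camera fusion setup while maintaining high fidelity.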

Abstract: 3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.

[118] Thinking in 360°: Humanoid Visual Search in the Wild cs.CVPDF

Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang

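TL;DR: The paper proposes the humanoid visual search task, which uses 360° panoramic images to simulate real-world visual search behavior, and builds a new benchmark, H* Bench, to evaluate models. Experiments show that existing models perform poorly; post-training substantially improves performance, though path search remains especially challenging.

Details

Motivation: Humans search for visual information efficiently through coordinated control of the head and eyes, whereas existing visual search methods are limited to static images and ignore interaction with the real world. The paper aims to develop visual search agents as efficient as humans while bypassing hardware constraints.

Result: Post-training raises Qwen2.5-VL's success rate from 14.83% to 47.38% on object search and from 6.44% to 24.94% on path search, but path search remains markedly harder.

Insight: 1. Simulating real-world interaction makes the 360° visual search task challenging; 2. path search requires more sophisticated spatial commonsense; 3. post-training can substantially improve model performance, but considerable room for improvement remains.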

Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
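
The "over threefold" claim can be checked directly from the success rates quoted in the abstract. The helper name `fold_increase` is illustrative; the four numbers are taken from the text above.

```python
def fold_increase(before, after):
    """Ratio of the post-training success rate to the original one."""
    return after / before


object_search = fold_increase(14.83, 47.38)
path_search = fold_increase(6.44, 24.94)
```

The ratios come out at roughly 3.19 for object search and 3.87 for path search. Path search improves by the larger factor but still ends at the lower absolute success rate.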

[119] GS-Checker: Tampering Localization for 3D Gaussian Splatting cs.CVPDF

Haoliang Han, Ziyuan Luo, Jun Qi, Anderson Rocha, Renjie Wan

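TL;DR: GS-Checker is a new method for localizing tampered regions in 3D Gaussian Splatting (3DGS) models. It introduces a 3D tampering attribute and combines it with a 3D contrastive mechanism and a cyclic optimization strategy, localizing tampering without expensive 3D labels.

Details

Motivation: As 3DGS editing techniques advance, the risk of malicious tampering with 3D content grows. Guarding against this risk calls for a method that can precisely localize tampered regions.

Result: Experiments show that GS-Checker effectively localizes tampered regions in 3DGS models without relying on expensive 3D label data.

Insight: By combining a low-cost 3D tampering attribute with a contrastive mechanism, GS-Checker shows the potential for effective tampering detection without large amounts of annotated data.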

Abstract: Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious manipulation of 3D content. To avoid such malicious applications, localizing tampered regions becomes crucial. In this paper, we propose GS-Checker, a novel method for locating tampered areas in 3DGS models. Our approach integrates a 3D tampering attribute into the 3D Gaussian parameters to indicate whether the Gaussian has been tampered with. Additionally, we design a 3D contrastive mechanism by comparing the similarity of key attributes between 3D Gaussians to seek tampering cues at the 3D level. Furthermore, we introduce a cyclic optimization strategy to refine the 3D tampering attribute, enabling more accurate tampering localization. Notably, our approach does not require expensive 3D labels for supervision. Extensive experimental results demonstrate the effectiveness of our proposed method in locating tampered 3DGS areas.

[120] VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild cs.CVPDF

Xin Ming, Yuxuan Han, Tianyu Huang, Feng Xu

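TL;DR: VGGTFace is an automatic method built on the 3D foundation model VGGT that reconstructs topologically consistent facial geometry from multi-view images, addressing the limited generalization and expressiveness of existing methods.

Details

Motivation: Existing facial geometry reconstruction methods typically require manual intervention, generalize poorly, or are constrained by the limited expressiveness of 3D Morphable Models. VGGTFace aims to overcome these obstacles with an automated approach.

Result: Experiments show that VGGTFace completes a high-quality reconstruction from 16 views in only 10 seconds, performs strongly on benchmarks, and generalizes well.

Insight: Combining a 3D foundation model trained at scale with topology injection can substantially improve the quality and efficiency of facial geometry reconstruction.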

Abstract: Reconstructing topologically consistent facial geometry is crucial for the digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, i.e., VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.
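
A Laplacian energy on a mesh with known topology can be sketched as the sum of squared offsets between each vertex and the centroid of its neighbours. The sketch assumes a uniform (unweighted) graph Laplacian; the abstract does not say which Laplacian or weighting VGGTFace uses in its Bundle Adjustment objective, and the function name is illustrative.

```python
def laplacian_energy(vertices, edges):
    """Sum over vertices of ||v_i - mean(neighbours of v_i)||^2.

    vertices: list of (x, y, z) tuples; edges: list of (i, j) index pairs.
    Uniform weights are an assumption made for this sketch.
    """
    neighbours = {i: set() for i in range(len(vertices))}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    energy = 0.0
    for i, vertex in enumerate(vertices):
        if not neighbours[i]:
            continue  # isolated vertices contribute nothing
        count = len(neighbours[i])
        centroid = [
            sum(vertices[j][d] for j in neighbours[i]) / count
            for d in range(3)
        ]
        energy += sum((vertex[d] - centroid[d]) ** 2 for d in range(3))
    return energy


chain = [(0, 1), (1, 2)]
straight = laplacian_energy([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)], chain)
bent = laplacian_energy([(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)], chain)
```

On this three-vertex chain the straight configuration has energy 2.0 and the bent one 5.0, so minimizing the term pulls vertices toward locally smooth positions while a data term would keep them near the observed points.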

[121] A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control cs.CVPDF

Jiawei Lin, Guanlong Jiao, Jianjin Xu

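TL;DR: The paper proposes MultiID, a training-free method that performs multi-ID customized generation through attention adjustment and spatial control, addressing the copy-paste issue and weak text controllability of existing methods.

Details

Motivation: Multi-ID customized generation has drawn considerable attention in computer vision, but existing methods suffer from the copy-paste issue and weak text controllability. The authors aim to solve these problems without training.

Result: Experiments show that MultiID is comparable to or better than training-based multi-ID customization methods at resolving the copy-paste and text alignment problems.

Insight: Training-free methods can solve complex problems efficiently, and combining attention mechanisms with spatial control shows promise for multi-task generation.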

Abstract: Multi-ID customization is an interesting topic in computer vision and has attracted considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability. The generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since the existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable to or even better than that of training-based multi-ID customization methods.

[122] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts cs.CV

Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo

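TL;DR: MajutsuCity is a natural-language-driven framework for generating aesthetically adaptive 3D cities with controllable 3D assets and layouts, combining the creative flexibility of text-based generation with object-level editability.

Details

Motivation: Existing methods struggle to balance the creative flexibility of text-based generation with the object-level editability of explicit structural representations. MajutsuCity aims to close this gap.

Result: MajutsuCity reduces layout FID by 83.7% relative to CityDreamer and by 20.1% relative to CityCraft, and ranks first on all AQS and RDR scores.

Insight: MajutsuCity sets a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability, demonstrating the potential of language-driven 3D city generation.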

Abstract: Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must offer stylistic diversity, fine-grained detail, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect that our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be released at https://github.com/LongHZ140516/MajutsuCity.

[123] StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections cs.CV | cs.AI | cs.LG

Matvei Shelukhan, Timur Mamedov, Karina Kvanchiani

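TL;DR: StableTrack introduces a two-stage matching strategy and an improved distance metric (Bbox-Based Distance) for low-frequency detections, and integrates visual tracking with the Kalman filter, substantially improving multi-object tracking (MOT) performance, particularly under low-frequency conditions.

Details

Motivation: Current MOT methods mainly rely on high-frequency detections, and their performance degrades markedly when limited computing resources force detections to run at low frequency. StableTrack aims to stabilize tracking under low-frequency detection.

Result: At a detection rate of 1 Hz, HOTA on MOT17-val improves by 11.6%, while performance on the standard MOT17, MOT20, and DanceTrack benchmarks remains on par with the best existing methods.

Insight: Low-frequency detection is an important challenge in multi-object tracking. Improving the matching strategy and the distance metric can substantially increase tracking stability, especially in resource-constrained settings.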

Abstract: Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving an 11.6% HOTA improvement at 1 Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
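
A generic two-stage association loop may help picture what "two-stage matching" means in practice. The abstract does not define the Bbox-Based Distance or the matching algorithm, so the sketch below substitutes `1 - IoU` as the distance and a greedy assignment as the matcher; both, along with the thresholds `strict` and `relaxed`, are assumptions rather than StableTrack's actual design.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, track_ids, det_ids, max_dist):
    """Greedily pair tracks and detections, closest pairs first.

    Stand-in distance: 1 - IoU (NOT the paper's Bbox-Based Distance).
    """
    pairs = sorted(
        (1.0 - iou(tracks[t], dets[d]), t, d) for t in track_ids for d in det_ids
    )
    matches, used_t, used_d = [], set(), set()
    for dist, t, d in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    return matches


def two_stage_match(tracks, dets, strict=0.5, relaxed=0.9):
    """Stage 1 uses a strict threshold; stage 2 retries leftovers with a relaxed one."""
    all_t, all_d = list(range(len(tracks))), list(range(len(dets)))
    first = greedy_match(tracks, dets, all_t, all_d, strict)
    matched_t = {t for t, _ in first}
    matched_d = {d for _, d in first}
    left_t = [t for t in all_t if t not in matched_t]
    left_d = [d for d in all_d if d not in matched_d]
    second = greedy_match(tracks, dets, left_t, left_d, relaxed)
    return first + second


tracks = [(0, 0, 10, 10), (100, 100, 110, 110)]
dets = [(1, 1, 11, 11), (105, 105, 115, 115)]
matches = two_stage_match(tracks, dets)
```

In this toy input the second object has moved far between detections, so it fails the strict first stage and is only recovered by the relaxed second stage, which is the situation low-frequency detection makes common.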

[124] Block Cascading: Training Free Acceleration of Block-Causal Video Models cs.CV | cs.AI

Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song

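TL;DR: Block Cascading is a training-free parallelization method that substantially eases the speed-quality trade-off of block-causal video generation models. It starts generating multiple blocks in parallel from partially denoised context and achieves roughly 2x acceleration.

Details

Motivation: Block-causal video generation models face a stark speed-quality trade-off: small models are fast but lower in quality, while large models are higher in quality but slow, forcing users to choose between responsiveness and quality.

Result: With 5 GPUs, Block Cascading accelerates the 1.3B model from 16 to 30 FPS and the 14B model from 4.5 to 12.5 FPS, with no significant loss in generation quality.

Insight: Future blocks do not need a fully denoised current block as context. Partially denoised information is enough to start parallel generation, which suggests a new approach to efficient video inference.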

Abstract: Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
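
The scheduling idea can be made concrete with a toy step count. The sketch below assumes every block needs the same number of denoising steps, that a block may start once its predecessor has completed `start_after` steps, and that enough devices are available to run all overlapping blocks at once; the function names and these simplifications are illustrative, not taken from the paper.

```python
def sequential_steps(num_blocks, steps_per_block):
    """Wall-clock steps when each block waits for the previous one to finish."""
    return num_blocks * steps_per_block


def cascaded_steps(num_blocks, steps_per_block, start_after):
    """Wall-clock steps when block i starts after block i-1 finishes `start_after` steps.

    Assumes enough parallel workers for all overlapping blocks.
    """
    if not 1 <= start_after <= steps_per_block:
        raise ValueError("start_after must be between 1 and steps_per_block")
    last_block_start = (num_blocks - 1) * start_after
    return last_block_start + steps_per_block


def peak_parallel_blocks(num_blocks, steps_per_block, start_after):
    """Maximum number of blocks denoising at the same time."""
    overlapping = -(-steps_per_block // start_after)  # ceiling division
    return min(num_blocks, overlapping)


baseline = sequential_steps(5, 4)
cascade = cascaded_steps(5, 4, 1)
workers = peak_parallel_blocks(5, 4, 1)
```

Setting `start_after` equal to `steps_per_block` recovers the sequential pipeline, so the parameter interpolates between fully sequential and maximally overlapped generation.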

[125] BRIC: Bridging Kinematic Plans and Physical Control at Test Time cs.CV | cs.RO

Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim

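TL;DR: BRIC is a new test-time adaptation (TTA) framework that combines diffusion-based motion planning with reinforcement learning-based physics control to enable long-term human motion generation, resolving the discrepancies that arise during physical execution.

Details

Motivation: Diffusion models can generate diverse motions, but their outputs are often physically implausible, causing execution drift in simulation. BRIC aims to resolve this through test-time adaptation.

Result: BRIC achieves state-of-the-art performance across a variety of tasks, including motion composition, obstacle avoidance, and human-scene interaction.

Insight: BRIC's novelty lies in combining the planning ability of diffusion models with the execution ability of physics controllers, and in using test-time adaptation to address the challenges of long-term motion generation.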

Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

[126] Object-Centric Vision Token Pruning for Vision Language Models cs.CV | cs.AI

Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

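TL;DR: OC-VTP is a direct and guaranteed method for selecting the most representative vision tokens, improving the inference efficiency of vision language models (VLMs) while preserving accuracy. It relies on lightweight pre-training of a small object-centric vision token pruner and requires no fine-tuning of any model on any dataset.

Details

Motivation: Existing vision token pruning methods rely on indirect, non-guaranteed approaches that cannot ensure the most informative tokens are kept, even though vision tokens in VLMs are numerous, information-dispersed, and computationally costly. OC-VTP aims to address this directly.

Result: Across all vision token pruning ratios, OC-VTP helps mainstream VLMs preserve the highest inference accuracy, and its pruning also shows interesting interpretability.

Insight: The directness and reliability of OC-VTP offer a new approach to vision token pruning, and its lightweight, fine-tuning-free design makes it broadly applicable.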

Abstract: In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision token pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.
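
The selection criterion, keeping the tokens from which the originals can be reconstructed with the least error, can be illustrated with a simple greedy procedure. OC-VTP itself uses a pre-trained object-centric pruner whose details are not given in the abstract, so the sketch below is a stand-in: it reconstructs each token by its nearest kept token and adds tokens greedily. The names `reconstruction_error` and `greedy_select` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_error(tokens, kept):
    """Error when every token is approximated by its nearest kept token.

    Stand-in reconstruction rule (assumption), not the paper's pruner.
    """
    return sum(min(sq_dist(t, tokens[k]) for k in kept) for t in tokens)


def greedy_select(tokens, budget):
    """Greedily add the token that most reduces the reconstruction error."""
    kept = []
    for _ in range(budget):
        candidates = [i for i in range(len(tokens)) if i not in kept]
        best = min(candidates, key=lambda i: reconstruction_error(tokens, kept + [i]))
        kept.append(best)
    return kept


# Two tight clusters of tokens: indices 0-1 and indices 2-3.
tokens = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
kept = greedy_select(tokens, budget=2)
```

With a budget of two, the procedure keeps one token from each cluster, since a second token from the same cluster would barely reduce the error.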

[127] Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search cs.CV

Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li

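TL;DR: The paper proposes ZoomSearch, a training-free method for ultra-high-resolution remote sensing visual question answering (RS-VQA) that uses adaptive patch search and layout-aware reassembly to substantially improve accuracy and efficiency.

Details

Motivation: Ultra-high-resolution remote sensing imagery is difficult for existing models: full-image encoding exhausts resources, while resize-based preprocessing loses detail. A method is therefore needed that localizes the key regions before prediction.

Result: Built on LLaVA-ov, ZoomSearch improves the baseline by 26.3% on LRS-VQA and by 114.8% on MME-RealWorld-RS, and is 20% to 44% faster than prior search-based methods.

Insight: ZoomSearch shows that, for ultra-high-resolution imagery, dynamically localizing key regions is more effective than fixed preprocessing, and that substantial gains are possible without any additional training.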

Abstract: With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples ‘where to look’ from ‘how to answer’ for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20%~44% in speed.
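
A hierarchical search over image patches can be sketched as a quadtree descent that keeps only the most relevant quadrants at each level. In ZoomSearch the relevance signal comes from a model conditioned on the query; here `score` is a hypothetical stand-in function, and the fixed four-way split and `branches` parameter are assumptions for illustration.

```python
def zoom_search(region, score, min_size, branches=2):
    """Descend a quadtree, keeping the top `branches` quadrants per level.

    region:   (x, y, w, h) in pixels.
    score:    hypothetical relevance function, region -> float.
    min_size: stop splitting once a side is this small (must be >= 1).
    Returns the leaf regions that were reached.
    """
    x, y, w, h = region
    if w <= min_size or h <= min_size:
        return [region]
    hw, hh = w // 2, h // 2
    quadrants = [
        (x, y, hw, hh),
        (x + hw, y, w - hw, hh),
        (x, y + hh, hw, h - hh),
        (x + hw, y + hh, w - hw, h - hh),
    ]
    top = sorted(quadrants, key=score, reverse=True)[:branches]
    leaves = []
    for quadrant in top:
        leaves.extend(zoom_search(quadrant, score, min_size, branches))
    return leaves


TARGET = (700, 300)  # toy "answer-critical" pixel


def contains_target(region):
    """Toy relevance score: 1.0 if the region contains TARGET, else 0.0."""
    x, y, w, h = region
    return 1.0 if x <= TARGET[0] < x + w and y <= TARGET[1] < y + h else 0.0


leaves = zoom_search((0, 0, 1024, 1024), contains_target, min_size=128, branches=1)
```

Only the selected leaves would then be encoded at full resolution, which is how a search of this kind avoids spending tokens on irrelevant parts of a very large image.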

[128] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow cs.CV | cs.LG

Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang

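TL;DR: STARFlow-V is an end-to-end video generation model based on normalizing flows. By introducing a global-local architecture and flow-score matching, it addresses the spatiotemporal complexity and error accumulation of video generation.

Details

Motivation: Although normalizing flows have made progress in image generation, diffusion models still dominate video generation. With STARFlow-V, the authors explore the potential of normalizing flows for video generation, offering benefits such as end-to-end learning, causal prediction, and likelihood estimation.

Result: Experiments show that STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines, demonstrating the potential of normalizing flows for video generation.

Insight: Normalizing flows perform well in video generation, especially in autoregressive settings, which may open a new research direction for building world models.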

Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
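
The general principle behind Jacobi-style decoding, replacing a sequential autoregressive recurrence with repeated parallel sweeps that reach the same fixed point, can be shown on a toy recurrence. The linear update `x[t] = a * x[t-1] + z[t]` below is an assumption chosen for clarity; it is not STARFlow-V's flow, and the paper's video-aware scheme is not specified in the abstract.

```python
def sequential_decode(z, a=0.5):
    """Reference decoder: each x[t] depends on the previously computed x[t-1]."""
    x, prev = [], 0.0
    for z_t in z:
        prev = a * prev + z_t
        x.append(prev)
    return x


def jacobi_decode(z, a=0.5, tol=0.0):
    """Jacobi iteration: update every position at once from the previous sweep.

    Each sweep is parallelizable across t. After k sweeps the first k positions
    are exact, so at most len(z) sweeps are needed; a looser `tol` can stop earlier.
    Returns (sequence, number_of_sweeps).
    """
    x = [0.0] * len(z)
    for sweep in range(1, len(z) + 1):
        new_x = [a * (x[t - 1] if t > 0 else 0.0) + z[t] for t in range(len(z))]
        if max(abs(n - o) for n, o in zip(new_x, x)) <= tol:
            return new_x, sweep
        x = new_x
    return x, len(z)


z = [1.0, 2.0, 3.0, 4.0]
reference = sequential_decode(z)
parallel, sweeps = jacobi_decode(z)
```

The worst case needs as many sweeps as there are positions, so the practical benefit depends on the iteration converging in far fewer sweeps than the sequence length.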

[129] Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features cs.CV | cs.LG

Ben Hamscher, Arnold Brosch, Nicolas Binninger, Maksymilian Jan Dejna, Kira Maag

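TL;DR: The paper proposes a lightweight dance style classification framework that combines temporal-spatial descriptors inspired by Laban Movement Analysis with frequency-domain features to identify dance styles efficiently from motion data.

Details

Motivation: Dance is an important part of human culture, but many dance styles share similar movements and poses, which makes classification from motion data challenging.

Result: The method classifies different dance styles efficiently and robustly without requiring complex model architectures.

Insight: Interpretable motion features (temporal-spatial and frequency-domain) can effectively capture subtle differences between dance styles and are well suited to lightweight classification.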

Abstract: Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
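
The two feature families mentioned in the abstract, local joint dynamics and frequency-domain descriptors, can be sketched for a single joint coordinate. The code below uses finite differences for velocity and acceleration and a naive discrete Fourier transform in place of an FFT library call; the feature names and the choice of summary statistics are assumptions, not the authors' exact descriptors.

```python
import cmath
import math


def finite_difference(series, dt):
    """Difference quotient between consecutive samples."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]


def dft_magnitudes(series):
    """Magnitudes of the non-negative-frequency bins (naive O(n^2) DFT)."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))) / n
        for k in range(n // 2 + 1)
    ]


def motion_features(joint_track, fps):
    """Hypothetical per-joint feature summary from a 1-D coordinate track."""
    dt = 1.0 / fps
    velocity = finite_difference(joint_track, dt)
    acceleration = finite_difference(velocity, dt)
    spectrum = dft_magnitudes(joint_track)
    # Skip bin 0 (the mean position) when looking for the dominant rhythm.
    dominant_bin = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
    return {
        "mean_speed": sum(abs(v) for v in velocity) / len(velocity),
        "mean_abs_acceleration": sum(abs(a) for a in acceleration) / len(acceleration),
        "dominant_hz": dominant_bin * fps / len(joint_track),
    }


# A joint oscillating at 2 Hz, sampled at 16 frames per second for 2 seconds.
track = [math.sin(2 * math.pi * 2 * t / 16) for t in range(32)]
features = motion_features(track, fps=16)
```

Features like these stay interpretable: the dominant frequency corresponds directly to the tempo of a repeated movement, which is the property the abstract emphasizes.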

[130] Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification cs.CV | cs.LG

Akshit Pramod Anchan, Jewelith Thomas, Sritama Roy

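TL;DR: The paper proposes a modular deep learning framework for assistive perception tasks (gaze, affect, and speaker identification) and achieves high accuracy with independent CNN and LSTM modules.

Details

Motivation: To develop lightweight, efficient assistive perception technology that supports real-time multimodal integration on resource-constrained devices.

Result: The models reach accuracies of 93.0%, 97.8%, and 96.89% on eye state detection, facial expression recognition, and speaker identification, respectively.

Insight: A modular design can handle multimodal tasks efficiently, suits resource-constrained devices, and provides a foundation for future real-time integration.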

Abstract: Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like ‘Smart Eye.’ We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.

[131] AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs cs.CV

Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura

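TL;DR: AlignBench is a new benchmark for evaluating the fine-grained alignment ability of image-text alignment models such as CLIP, using synthetic image-caption pairs with detailed annotations to provide a more accurate performance indicator.

Details

Motivation: Existing benchmarks rely mainly on rule-based perturbations or short captions and cannot adequately measure fine-grained image-text alignment.

Result: The study finds that (i) CLIP-based models are nearly blind to fine-grained misalignment; (ii) detectors tend to over-score early sentences; and (iii) models show self-preference, which harms detection performance.

Insight: Vision-language models have significant shortcomings in fine-grained alignment, and both alignment evaluation methods and model design need further improvement.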

Abstract: Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.

[132] HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation cs.CV

Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou

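TL;DR: HBridge proposes an asymmetric H-shaped architecture that selectively bridges the intermediate layers of heterogeneous expert models, substantially improving the efficiency and quality of multimodal understanding and generation.

Details

Motivation: Existing unified models such as BAGEL and LMFusion adopt a symmetric Mixture-of-Transformers (MoT) design. Despite strong performance, their initialization and fusion remain suboptimal because of modality discrepancies. HBridge aims to address this.

Result: HBridge performs strongly on multiple benchmarks, establishing a new paradigm for unified multimodal generation.

Insight: Asymmetric design and layer-wise selective bridging are key to improving heterogeneous expert models, while semantic reconstruction tokens further strengthen cross-modal semantic alignment.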

Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
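
The layer-selection pattern can be written down as a simple mask over layer indices: shallow and deep layers stay decoupled, and only the middle band shares attention. The layer counts used below are hypothetical and were picked only to make the arithmetic easy to follow; the abstract does not state HBridge's actual depth or split points.

```python
def bridge_mask(num_layers, shallow, deep):
    """True for layers that share attention between the two experts.

    The first `shallow` and last `deep` layers are decoupled (hypothetical split).
    """
    if shallow < 0 or deep < 0 or shallow + deep > num_layers:
        raise ValueError("shallow + deep must fit within num_layers")
    return [shallow <= i < num_layers - deep for i in range(num_layers)]


def sharing_reduction(mask):
    """Fraction of layers that no longer share attention, relative to dense fusion."""
    return 1.0 - sum(mask) / len(mask)


mask = bridge_mask(num_layers=10, shallow=2, deep=2)
reduction = sharing_reduction(mask)
```

In this toy configuration four of ten layers are decoupled, so attention sharing drops by 40% relative to connecting every layer, which gives a feel for the scale of the reduction the abstract reports.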

[133] Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos cs.CV

Yayuan Li, Aadit Jain, Filippos Bellos, Jason J. Corso

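TL;DR: MATT is a framework for fine-grained analysis of human mistakes in egocentric video. It uses MisEngine to build datasets automatically and the MisFormer model to attribute mistakes along multiple dimensions.

Details

Motivation: Existing mistake understanding methods lack fine-grained output and cannot attribute a mistake concretely to the instruction text or the attempt video. MATT fills this gap.

Result: Experiments show that MisFormer outperforms existing video-language, temporal localization, hand-object interaction, and mistake detection models on multiple benchmarks.

Insight: Fine-grained mistake attribution requires combining semantic, temporal, and spatial information; automatic dataset construction can greatly increase data scale; and attention mechanisms perform well on multi-dimensional tasks.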

Abstract: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.

[134] New York Smells: A Large Multimodal Dataset for Olfaction cs.CV | cs.AI | cs.LG

Ege Ozguroglu, Junbang Liang, Ruoshi Liu, Mia Chiquier, Michael DeTienne

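TL;DR: The paper introduces New York Smells, a large multimodal olfaction dataset containing 7,000 pairs of images and olfactory signals covering 3,500 distinct objects, roughly 70 times more objects than existing olfactory datasets. Experiments show that visual data enables cross-modal olfactory representation learning and that the learned olfactory representations outperform traditional hand-crafted features.

Details

Motivation: Olfaction is central to how animals perceive the world, yet it remains largely inaccessible to machines. The main bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings.

Result: Experiments show that visual data supports effective cross-modal olfactory representation learning and that the learned representations outperform traditional hand-crafted features. The dataset provides a rich resource for olfaction research.

Insight: The visual modality can serve as a bridge for learning olfactory representations, and large-scale datasets collected in natural settings are essential for multimodal learning.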

Abstract: While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured "in the wild." Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70× more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
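
Cross-modal smell-to-image retrieval is commonly set up as nearest-neighbor search in a shared embedding space. The sketch below assumes that smell and image embeddings have already been produced by some learned encoders (not shown, and not described in the abstract) and ranks images by cosine similarity; the function names and toy vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(smell_embedding, image_embeddings, top_k=1):
    """Return indices of the `top_k` images most similar to the smell embedding."""
    ranked = sorted(
        range(len(image_embeddings)),
        key=lambda i: cosine(smell_embedding, image_embeddings[i]),
        reverse=True,
    )
    return ranked[:top_k]


smell = [1.0, 0.1]
images = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
best = retrieve(smell, images)
top_two = retrieve(smell, images, top_k=2)
```

Because cosine similarity ignores vector length, the second image is ranked first even though its embedding has a larger norm than the query.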

[135] Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning cs.CV | cs.AI

Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu

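TL;DR: Flash-DMD is a framework that combines efficient distillation with joint reinforcement learning to achieve high-fidelity few-step image generation, substantially reducing training cost while improving generation quality.

Details

Motivation: The iterative sampling process of diffusion models is computationally expensive, while conventional timestep distillation requires long training and tends to degrade image quality. In addition, fine-tuning with reinforcement learning (RL) is unstable.

Result: Flash-DMD reaches state-of-the-art generation quality in few-step sampling, outperforming existing methods on visual quality, human preference, and text-image alignment metrics.

Insight: A continuing distillation loss can effectively stabilize RL training and prevent reward hacking, enabling efficient model optimization.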

Abstract: Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is coming soon.
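
One plausible way to read the joint training scheme is as a single objective in which the distillation loss stays active while the RL term is optimized. The notation below is an assumption made for illustration (the abstract gives no formula): $\theta$ denotes the few-step generator's parameters and $\lambda$ a hypothetical weight balancing the two terms.

```latex
\mathcal{L}_{\text{joint}}(\theta)
  = \mathcal{L}_{\text{distill}}(\theta)
  + \lambda \, \mathcal{L}_{\text{RL}}(\theta),
  \qquad \lambda > 0
```

Under this reading, the distillation term acts as the regularizer described in the abstract: it keeps pulling the generator toward the teacher's distribution, which limits how far the RL term can push the model toward degenerate, reward-exploiting outputs.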

[136] PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding cs.CV

Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang

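TL;DR: PhysChoreo proposes a two-stage framework that generates physically controllable and realistic videos from a single image, outperforming existing methods.

Details

Motivation: Existing video generation models fall short in physical controllability and plausibility, and physics-driven simulation is needed to address this.

Result: Experiments show that PhysChoreo outperforms existing methods at generating videos with diverse behaviors and physical realism.

Insight: Staged physical modeling combined with temporal control is an effective route to highly realistic video generation.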

Abstract: While recent video generation models have achieved significant visual fidelity, they often suffer from a lack of explicit physical controllability and plausibility. To address this, some recent studies have attempted to guide video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.

[137] A Reason-then-Describe Instruction Interpreter for Controllable Video Generation cs.CV

Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu

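TL;DR: The paper proposes ReaDe, a universal, model-agnostic instruction interpreter that converts ambiguous user instructions into precise video generation specifications and uses a two-stage optimization to improve intent alignment in controllable video generation.

Details

Motivation: Diffusion Transformers have improved fidelity and temporal coherence in video generation, but they still understand and control ambiguous or complex user intent poorly, leading to a mismatch between intent and output.

Result: Experiments in single-condition and multi-condition scenarios show that ReaDe substantially improves instruction fidelity, caption accuracy, and video quality, and generalizes to reasoning-intensive and unseen inputs.

Insight: ReaDe offers a practical route to aligning controllable video generation precisely with user intent and highlights the importance of instruction parsing in generation tasks.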

Abstract: Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.

[138] DINO-Tok: Adapting DINO for Visual Tokenizers cs.CV

Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo

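TL;DR: DINO-Tok is a visual tokenizer adapted from DINO. By combining shallow and deep features it unifies hierarchical representations, addressing the difficulty existing tokenizers have in balancing semantic representation and reconstruction fidelity in high-dimensional latent spaces, and it introduces a global PCA reweighting mechanism to improve vector quantization.

Details

Motivation: Existing visual tokenizers are usually trained from scratch and struggle to balance semantic representation and reconstruction fidelity in high-dimensional latent spaces. Generative models need a more effective tokenization method.

Result: On ImageNet 256×256, DINO-Tok achieves state-of-the-art reconstruction performance (28.54 PSNR for autoencoding), significantly outperforming prior tokenizers and matching models trained on billion-scale data.

Insight: Adapting pretrained vision models such as DINO into tokenizers yields semantically aligned, high-fidelity latent representations and supports the development of next-generation generative models.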

Abstract: Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256×256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and performing comparably to models trained on billion-scale data (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, paving the way for next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
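
For readers less familiar with the metric behind the reported reconstruction numbers, PSNR is computed from the mean squared error between an image and its reconstruction. The sketch below uses the standard definition on flat lists of pixel values with an assumed peak value of 255; the paper's exact evaluation protocol (color space, value range, averaging over images) is not stated in the abstract.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


identical = psnr([10, 20, 30], [10, 20, 30])
off_by_one = psnr([0, 0, 0, 0], [1, 1, 1, 1])
```

Since PSNR is logarithmic in the error, a gain of about 3 dB corresponds to roughly halving the mean squared error.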

[139] VQ-VA World: Towards High-Quality Visual Question-Visual Answering cs.CV

Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu

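TL;DR: The paper proposes VQ-VA World, a data-centric framework for building high-quality Visual Question-Visual Answering (VQ-VA) models, which substantially improves the performance of open-source models.

Details

Motivation: VQ-VA ability is currently concentrated in proprietary systems such as NanoBanana and GPT-Image, while open-source models perform poorly. The paper aims to close this gap and give the open-source community a high-performing solution.

Result: On IntelligentBench, LightFusion trained with VQ-VA World data scores 53.06, far above prior open-source baselines (e.g., 7.78 for vanilla LightFusion), and significantly narrows the gap to proprietary systems (e.g., 81.67 for NanoBanana).

Insight: With high-quality data and large-scale training, open-source models can also improve substantially on VQ-VA, laying a foundation for future research.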

Abstract: This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question – an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.

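[140] The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment cs.CV

Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou

TL;DR: The paper proposes ImageCritic, a reference-guided post-editing method that corrects inconsistencies in generated images, using an attention alignment loss and a detail encoder for precise repair.

Details

Motivation: Existing customized generation methods still struggle to produce consistent fine-grained details, so a method for correcting inconsistencies in generated images is needed.

Result: Experiments show that ImageCritic effectively resolves detail inconsistencies across customized generation scenarios and significantly outperforms existing methods.

Insight: Combining attention alignment with a detail encoder helps generative models match the details of the reference image more closely, improving consistency.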

Abstract: Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model’s attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.

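[141] Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI cs.CV | cs.RO

Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang

TL;DR: Wanderland is a real-to-sim framework for open-world embodied AI that combines multi-sensor capture, high-quality reconstruction, and accurate geometry to address the limitations of existing 3DGS methods in visual navigation.

Details

Motivation: Reproducible closed-loop evaluation of embodied AI (e.g., visual navigation) remains a bottleneck, largely because high-fidelity simulation still shows large visual and geometric sim-to-real gaps.

Result: The study shows that image-only pipelines scale poorly in open environments, that geometry quality affects novel view synthesis, and that both factors undermine the reliability of navigation policy evaluation.

Insight: Geometry quality and multi-sensor data are critical for simulating and evaluating embodied AI; Wanderland provides a new foundation for open-world embodied AI research.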

Abstract: Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland’s rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.

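[142] ShapeGen: Towards High-Quality 3D Shape Synthesis cs.CV

Yangguang Li, Xianglong He, Zi-Xin Zou, Zexiang Liu, Wanli Ouyang

TL;DR: ShapeGen is a high-quality 3D shape synthesis method that significantly improves image-to-3D generation through better 3D representation and supervision, higher resolution, and the advantages of linear transformers.

Details

Motivation: Existing 3D shape generation methods lack intricate detail, over-smooth surfaces, and fragment thin-shell structures, which limits the artistic quality and practical usability of the generated assets.

Result: Experiments show that ShapeGen delivers a significant improvement in image-to-3D generation and sets a new state of the art.

Insight: The improvements reinforce one another, and ShapeGen shows the potential of combining resolution scaling with a linear transformer architecture for 3D generation.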

Abstract: Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.

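[143] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models cs.CV | cs.AI | cs.LG

Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi

TL;DR: The paper proposes two methods, MapReduce LoRA and RaTE, that avoid the alignment tax in multi-preference optimization and improve generative model performance.

Details

Motivation: In multi-objective optimization, conventional reinforcement learning methods often improve some preferences at the expense of others, the so-called alignment tax.

Result: Performance improves markedly across text-to-image, text-to-video, and language tasks, with gains of up to 67.1% on text-to-image metrics (OCR), 90.0% on text-to-video motion quality, and 136.7% on harmlessness in the language task.

Insight: Combining parallel training of preference-specific experts with reward-specific token embeddings effectively mitigates the alignment tax and yields better performance across generation tasks.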

Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
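
The train-in-parallel, then merge, then repeat loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "experts" are stand-in update functions rather than real LoRA adapters, and uniform averaging is only one possible merge rule, since the abstract does not specify how experts are merged.

```python
from typing import Callable, List

Vector = List[float]


def map_reduce_round(base: Vector,
                     expert_steps: List[Callable[[Vector], Vector]]) -> Vector:
    """One round: every expert proposes a delta from the same base (map),
    then the deltas are averaged and added to the base (reduce)."""
    deltas = [step(base) for step in expert_steps]  # map: independent per preference
    merged = [sum(d[i] for d in deltas) / len(deltas) for i in range(len(base))]
    return [b + m for b, m in zip(base, merged)]    # reduce: uniform average (assumed)


def toward(target: Vector, lr: float = 0.5) -> Callable[[Vector], Vector]:
    """Hypothetical expert: a step toward a preference-specific optimum."""
    return lambda w: [lr * (t - x) for t, x in zip(target, w)]


# Two conflicting preferences; iterating the round refines the shared base.
experts = [toward([1.0, 0.0]), toward([0.0, 1.0])]
base = [0.0, 0.0]
for _ in range(20):
    base = map_reduce_round(base, experts)
```

In this toy setting the shared weights settle near the midpoint of the two optima instead of drifting toward one of them, which is the compromise behaviour the merge step is meant to produce.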

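[144] iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation cs.CV

Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen

TL;DR: iMontage is a unified framework that repurposes a powerful video model into a versatile image generator, producing image sets with a wide dynamic range and natural transitions.

Details

Motivation: Existing video models excel at temporal coherence, but their dynamics are constrained by the continuous nature of their training data. Injecting the diversity of image data makes it possible to generate image sets with a wider dynamic range.

Result: iMontage performs strongly on several many-in-many-out tasks, maintaining cross-image contextual consistency and generating highly dynamic scenes that go beyond conventional scopes.

Insight: Exploiting the temporal coherence of video models while injecting the diversity of image data yields image generation that is both high quality and highly dynamic.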

Abstract: Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.

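[145] MotionV2V: Editing Motion in a Video cs.CV | cs.AI | cs.GR | cs.LG

Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa

TL;DR: The paper proposes MotionV2V, which achieves precise motion control by directly editing sparse trajectories extracted from a video, and demonstrates strong video editing capabilities.

Details

Motivation: Generative video models have made remarkable progress in fidelity and consistency, but applying them to video editing remains a complex challenge. The authors argue that precise motion control is a promising yet under-explored paradigm for video editing.

Result: In a four-way head-to-head user study, the model achieves over 65% preference against prior work.

Insight: Directly manipulating motion trajectories enables more flexible video editing, and training on motion counterfactuals may carry over to other video generation tasks.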

Abstract: While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a “motion edit” and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating “motion counterfactuals”, video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V

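[146] Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition cs.CV | cs.LG

Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, Min-Ling Zhang

TL;DR: The paper proposes CAPNET, a novel end-to-end framework that addresses class imbalance in long-tailed multi-label visual recognition by explicitly modeling label correlations from CLIP's text encoder. CAPNET combines a graph convolutional network with learnable soft prompts and significantly improves performance.

Details

Motivation: Long-tailed multi-label visual recognition suffers from severe class imbalance, which biases models toward head classes and hurts tail classes. Existing methods usually derive semantic relationships directly from the imbalanced dataset, so correlations for tail classes are unreliable because data is scarce. In addition, CLIP's zero-shot paradigm is optimized for single-label matching and is suboptimal for multi-label tasks.

Result: On benchmarks including VOC-LT, COCO-LT, and NUS-WIDE, CAPNET significantly outperforms existing methods, validating its effectiveness for long-tailed multi-label visual recognition.

Insight: 1. Explicitly modeling label correlations helps mitigate data scarcity; 2. Incorporating the priors of vision-language models improves multi-label performance; 3. Parameter-efficient fine-tuning and test-time ensembling are effective strategies for long-tailed problems.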

Abstract: Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP’s zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP’s textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.

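[147] Concept-Aware Batch Sampling Improves Language-Image Pretraining cs.CV | cs.LG

Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti

TL;DR: The paper proposes CABS, a dynamic, task-adaptive online data sampling framework that improves vision-language pretraining through concept-aware batch sampling and significantly boosts model performance.

Details

Motivation: Existing data curation methods are mostly static and concept-agnostic and can introduce additional biases. The work explores more flexible, task-oriented dynamic data curation.

Result: Across 28 benchmarks, CABS significantly improves CLIP/SigLIP models.

Insight: Dynamic, concept-aware data sampling is key to better vision-language pretraining, and the open-source CABS framework offers a practical way to tailor data to downstream tasks.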

Abstract: What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
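
To make the Diversity Maximization idea concrete, here is a minimal sketch that builds a batch greedily so that each pick adds as many not-yet-covered concepts as possible. The greedy coverage rule, the function name `diversity_batch`, and the toy concept annotations are assumptions for illustration; the abstract does not describe the actual CABS-DM selection procedure.

```python
from typing import Dict, List, Set


def diversity_batch(pool: Dict[str, Set[str]], batch_size: int) -> List[str]:
    """Greedily pick samples that add the most concepts not yet in the batch."""
    chosen: List[str] = []
    covered: Set[str] = set()
    remaining = dict(pool)
    while remaining and len(chosen) < batch_size:
        # Ties are broken by sample id order so the result is deterministic.
        best = max(sorted(remaining), key=lambda k: len(remaining[k] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Toy pool: sample id -> concepts annotated for that image-text pair.
pool = {
    "a": {"dog", "ball"},
    "b": {"dog"},
    "c": {"cat", "tree"},
    "d": {"ball"},
}
batch = diversity_batch(pool, batch_size=2)
```

A frequency-oriented variant could swap the scoring key for a count of objects per sample, again as an assumption about what "high object multiplicity" means in practice.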

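[148] Vision-Language Memory for Spatial Reasoning cs.CV

Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho

TL;DR: VLM$^2$ is a vision-language model with persistent memory for spatial reasoning from 2D video, addressing semantic-geometric misalignment and the lack of long-term memory for 3D representations.

Details

Motivation: Current vision-language models still fall short of human-level performance in video-based spatial reasoning, mainly because of semantic-geometric misalignment and the absence of persistent memory for 3D representations.

Result: Experiments show that VLM$^2$ achieves state-of-the-art performance among video-only models on multiple benchmarks, significantly advancing visual-spatial intelligence.

Insight: With persistent memory and a 3D-aware representation, VLM$^2$ achieves consistent spatial reasoning from 2D video, offering a new approach to long-horizon visual tasks.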

Abstract: Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
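
The dual-memory design can be sketched as a data structure: a sliding-window working memory plus a bounded episodic store. The class below is a toy illustration only. The `salience` score used to decide what is consolidated is a hypothetical stand-in, because the abstract does not say how critical information is selected.

```python
import heapq
from collections import deque
from typing import Any, List, Tuple


class DualMemory:
    """Sliding-window working memory plus a fixed-capacity episodic memory."""

    def __init__(self, window: int, episodic_capacity: int) -> None:
        self.working: deque = deque(maxlen=window)       # immediate context
        self.episodic: List[Tuple[float, int, Any]] = []  # min-heap on salience
        self.capacity = episodic_capacity
        self.step = 0

    def observe(self, item: Any, salience: float) -> None:
        self.working.append(item)  # the oldest frame falls out automatically
        entry = (salience, self.step, item)
        self.step += 1
        if len(self.episodic) < self.capacity:
            heapq.heappush(self.episodic, entry)
        elif entry > self.episodic[0]:
            heapq.heapreplace(self.episodic, entry)  # evict the least salient

    def context(self) -> Tuple[List[Any], List[Any]]:
        """Return (long-term items in temporal order, recent window)."""
        long_term = [item for _, _, item in sorted(self.episodic, key=lambda e: e[1])]
        return long_term, list(self.working)


memory = DualMemory(window=2, episodic_capacity=2)
for frame, score in zip(["f0", "f1", "f2", "f3", "f4"], [0.9, 0.1, 0.5, 0.2, 0.8]):
    memory.observe(frame, score)
long_term, recent = memory.context()
```

Both stores have a fixed size, so the context handed to the model stays bounded however long the video runs, which matches the fixed computational cost the abstract highlights.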

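[149] 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding cs.CV

Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li

TL;DR: The paper proposes injecting geometric consistency into multi-task learning (MTL) networks using cross-view correlations (i.e., a cost volume). A lightweight Cross-view Module (CvM) adds 3D-awareness and significantly improves dense scene understanding.

Details

Motivation: Current MTL methods mainly capture cross-task relations in 2D image space, producing features that lack 3D-awareness, which is vital for comprehensive scene understanding.

Result: Extensive experiments on NYUv2 and PASCAL-Context show that the method effectively improves existing MTL methods.

Insight: Introducing geometric consistency into MTL networks can significantly improve dense scene understanding, especially for tasks that require 3D-awareness.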

Abstract: This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.

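[150] Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization cs.CV

Tahira Kazimi, Connor Dunlop, Pinar Yanardag

TL;DR: The paper proposes DPP-GRPO, a new framework that addresses low output diversity in text-to-video (T2V) diffusion models. By imposing diminishing returns on redundant samples and providing groupwise feedback, it significantly increases the diversity of generated videos.

Details

Motivation: Existing T2V diffusion models achieve impressive quality and prompt alignment but produce low-diversity outputs when sampling multiple videos from a single prompt. The authors address this through policy optimization.

Result: Implemented on WAN and CogVideoX, the method shows significant diversity gains on benchmarks such as VBench and VideoScore and in human preference studies.

Insight: Explicitly rewarding diversity significantly improves the diversity of generated videos along multiple dimensions, and the method is plug-and-play and model-agnostic.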

Abstract: While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce an explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
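
A small numerical sketch can show how a determinantal score penalizes redundancy and how group-relative normalization turns rewards into advantages. This is an illustration under stated assumptions, not the paper's objective: the similarity kernel (dot products of unit-length embeddings plus a diagonal jitter), the leave-one-out marginal gain, and the 0.5 mixing weight are all choices made here for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def log_det(matrix: Matrix) -> float:
    """Log-determinant of a small positive-definite matrix (Gaussian elimination)."""
    a = [row[:] for row in matrix]
    total = 0.0
    for i in range(len(a)):
        pivot = a[i][i]
        if pivot <= 0.0:
            raise ValueError("kernel must be positive definite")
        total += math.log(pivot)
        for r in range(i + 1, len(a)):
            factor = a[r][i] / pivot
            for c in range(i, len(a)):
                a[r][c] -= factor * a[i][c]
    return total


def kernel(embeddings: Matrix, jitter: float = 0.1) -> Matrix:
    """Similarity kernel: dot products, with jitter on the diagonal for stability."""
    n = len(embeddings)
    return [[sum(x * y for x, y in zip(embeddings[i], embeddings[j]))
             + (jitter if i == j else 0.0) for j in range(n)] for i in range(n)]


def diversity_gains(embeddings: Matrix) -> List[float]:
    """Leave-one-out gain in log-det: low for samples that duplicate others."""
    full = kernel(embeddings)
    whole = log_det(full)
    gains = []
    for i in range(len(full)):
        rest = [[v for c, v in enumerate(row) if c != i]
                for r, row in enumerate(full) if r != i]
        gains.append(whole - log_det(rest))
    return gains


def group_relative(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: standardize rewards within the sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Two near-identical videos and one distinct video, all of equal quality.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
quality = [1.0, 1.0, 1.0]
gains = diversity_gains(embeddings)
rewards = [q + 0.5 * g for q, g in zip(quality, gains)]
advantages = group_relative(rewards)
```

With equal quality scores, only the distinct sample ends up with a positive advantage, which is the diminishing-returns behaviour the abstract attributes to the DPP term.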

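[151] LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight cs.CV

Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li

TL;DR: LocateAnything3D is a vision-language model (VLM) based method that casts 3D detection as next-token prediction via a Chain-of-Sight (CoS) sequence. It retains open-vocabulary and visual-prompting capability and achieves state-of-the-art results on the Omni3D benchmark.

Details

Motivation: Existing VLMs excel at 2D description and grounding, but multi-object 3D detection is still largely missing. To fill this gap, the authors propose a VLM-native 3D detection method.

Result: On Omni3D, LocateAnything3D reaches 49.89 AP_3D, 15.51 points above the previous best, and generalizes zero-shot to held-out categories.

Insight: Decomposing 3D detection into an ordered token prediction problem improves performance while preserving the model's generality and flexibility.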

Abstract: To act in the world, a model must name what it sees and know where it is in 3D. Today’s vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
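
The ordering described in the abstract (2D detections first, then 3D boxes near-to-far, each factored as center, dimensions, rotation) can be sketched as a serialization routine. The text format, the `Object3D` fields, and the choice to order the 2D detections near-to-far as well are assumptions for illustration, not the paper's actual token layout.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Object3D:
    label: str
    box2d: Tuple[int, int, int, int]      # x1, y1, x2, y2 in pixels
    center: Tuple[float, float, float]    # camera-frame coordinates in metres
    dims: Tuple[float, float, float]      # width, height, length
    yaw: float                            # rotation in radians


def distance(obj: Object3D) -> float:
    """Euclidean distance from the camera origin to the object center."""
    return math.sqrt(sum(c * c for c in obj.center))


def chain_of_sight(objects: List[Object3D]) -> List[str]:
    """Serialize targets: all 2D detections first, then 3D boxes near-to-far."""
    ordered = sorted(objects, key=distance)
    lines = [f"2d {o.label} {o.box2d}" for o in ordered]
    for o in ordered:
        # Within an object: center first, then dimensions, then rotation.
        lines.append(f"3d {o.label} center={o.center} dims={o.dims} yaw={o.yaw}")
    return lines


scene = [
    Object3D("car", (300, 200, 420, 260), (0.0, 0.0, 12.0), (1.8, 1.5, 4.2), 0.1),
    Object3D("person", (100, 180, 140, 300), (1.0, 0.0, 4.0), (0.6, 1.7, 0.4), 0.0),
]
sequence = chain_of_sight(scene)
```

Sequences like this could serve as training targets for an ordinary next-token decoder, which is why no specialized detection head is needed.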

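[152] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout cs.CV

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag

TL;DR: The paper proposes $\infty$-RoPE, a framework that addresses three core bottlenecks of current autoregressive video diffusion models. Its three components, Block-Relativistic RoPE, KV Flush, and RoPE Cut, enable infinite-horizon, action-controllable video generation with scene cuts.

Details

Motivation: Current autoregressive video diffusion models are limited in temporal horizon, in the responsiveness of action control, and in their ability to perform scene transitions, which hinders long video generation.

Result: Experiments show that $\infty$-RoPE surpasses existing autoregressive models in overall VBench scores.

Insight: Dynamic adjustments at inference time make more flexible video generation possible without additional training.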

Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model’s 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model’s maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
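
The index bookkeeping behind two of these components can be sketched in a few lines. This is a loose illustration of the idea as stated in the abstract: the function names, the integer frame indices, and the way older blocks are shifted are assumptions, and the sketch does not apply any rotary embedding.

```python
from typing import List, Sequence


def block_relative_positions(num_cached_blocks: int, block_len: int,
                             horizon: int) -> List[List[int]]:
    """Temporal indices per cached block, oldest first.

    The newest block always ends at the model's maximum frame horizon, and each
    older block is shifted back by one block length, so the relative offsets
    between blocks stay the same however long the rollout has been running.
    """
    positions = []
    for age in range(num_cached_blocks - 1, -1, -1):
        end = horizon - age * block_len
        positions.append(list(range(end - block_len, end)))
    return positions


def kv_flush(cache: Sequence[str]) -> List[str]:
    """Keep only the first entry (global sink) and the last generated frame."""
    if len(cache) <= 2:
        return list(cache)
    return [cache[0], cache[-1]]


positions = block_relative_positions(num_cached_blocks=2, block_len=3, horizon=21)
flushed = kv_flush(["sink", "frame_1", "frame_2", "frame_3"])
```

Because positions are recomputed relative to the newest block at every step, no frame ever needs an absolute index beyond the base model's horizon.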

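[153] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation cs.CV

Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan

TL;DR: RubricRL proposes a rubric-based reward design framework for text-to-image generation that dynamically constructs multi-dimensional visual criteria, making rewards more interpretable and flexible.

Details

Motivation: Existing reinforcement learning methods for text-to-image generation rely on fixed composite metrics or a single scalar reward, which limits interpretability and user control. RubricRL aims to address this.

Result: Experiments show that RubricRL improves prompt faithfulness, visual detail, and generalizability, while providing a flexible and extensible basis for reward design.

Insight: RubricRL's dynamic multi-dimensional reward mechanism not only improves model performance but also gives users a direct way to adjust what is rewarded, and it applies across text-to-image architectures.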

Abstract: Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt–a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism–tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
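
A rubric reward of this kind reduces to a weighted combination of per-criterion scores. The sketch below assumes the judge scores are already available as numbers in [0, 1] and that weights are normalized by their sum; the criterion names and weights are hypothetical, and the call to a multimodal judge is deliberately left out.

```python
from typing import Dict


def rubric_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion judge scores (each assumed in [0, 1])."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(weights[name] * scores[name] for name in weights) / total


# Hypothetical rubric for one prompt; weights would come from the
# prompt-adaptive weighting step, scores from the judge model.
scores = {"object_correctness": 1.0, "attribute_accuracy": 0.5, "ocr_fidelity": 0.0}
weights = {"object_correctness": 2.0, "attribute_accuracy": 1.0, "ocr_fidelity": 1.0}
reward = rubric_reward(scores, weights)

# A user who does not care about rendered text can simply zero that weight.
no_ocr = rubric_reward(scores, {**weights, "ocr_fidelity": 0.0})
```

Keeping the per-criterion scores separate until the final sum is what makes the signal inspectable: a low reward can be traced to the criterion that caused it.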

cs.CL

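[154] Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search cs.CL | cs.AI

Manil Shrestha, Edward Kim

TL;DR: The paper proposes two complementary hybrid algorithms that address efficiency and verifiability in multi-hop knowledge graph question answering: LLM-Guided Planning and Embedding-Guided Neural Search. The former predicts a relation sequence with a single LLM call; the latter avoids LLM calls entirely and achieves a speedup of over 100 times.

Details

Motivation: Multi-hop knowledge graph question answering faces a combinatorial explosion of reasoning paths. Existing methods rely on expensive LLM inference, and the generated answers lack verifiable grounding in structured knowledge.

Result: On MetaQA, LLM-Guided Planning achieves near-perfect accuracy (micro-F1 > 0.90), and Embedding-Guided Neural Search achieves a speedup of over 100 times with competitive accuracy.

Insight: Verifiable multi-hop reasoning does not require massive models, but rather architectural inductive biases that combine symbolic structure with learned representations.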

Abstract: Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
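
The planning variant can be illustrated by executing a predicted relation sequence hop by hop over the graph, so that every returned answer is an entity reachable in the knowledge graph. In this minimal sketch the relation sequence is hard-coded where an LLM call would supply it, and the graph layout, entity names, and relation names are assumptions for the example.

```python
from typing import Dict, List, Set, Tuple

# Knowledge graph as (head entity, relation) -> set of tail entities.
Graph = Dict[Tuple[str, str], Set[str]]


def execute_plan(graph: Graph, start: str, relations: List[str]) -> Set[str]:
    """Follow the relation sequence level by level (breadth-first expansion)."""
    frontier = {start}
    for relation in relations:
        frontier = {tail
                    for head in frontier
                    for tail in graph.get((head, relation), ())}
        if not frontier:  # dead end: the plan cannot be grounded in the graph
            break
    return frontier


graph: Graph = {
    ("Inception", "directed_by"): {"Christopher Nolan"},
    ("Christopher Nolan", "directed"): {"Inception", "Interstellar"},
}

# "Which films share a director with Inception?" as a two-hop plan; in the
# described method a single LLM call would predict this relation sequence.
answers = execute_plan(graph, "Inception", ["directed_by", "directed"])
```

Since the search only ever follows existing edges, an ungroundable plan yields an empty set instead of a plausible-sounding but unsupported answer.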

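[155] Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian cs.CL

Mobina Mehrazar, Mohammad Amin Yousefi, Parisa Abolfath Beygi, Behnam Bahrak

TL;DR: The paper studies the faithfulness of self-explanations generated by large language models (LLMs) in a low-resource language (Persian), focusing on emotion classification. LLMs classify well, but their explanations often disagree with human annotations, exposing limitations of current explanation methods and metrics.

Details

Motivation: The work evaluates how faithful LLM self-explanations are in low-resource languages, specifically for Persian emotion classification. The reliability of LLM explanations in multilingual settings has been under-studied, and the authors aim to fill this gap.

Result: LLMs achieve strong classification performance, but their explanations diverge from human annotations and agree more with each other than with human judgments. This suggests that LLM explanation methods still need improvement in multilingual settings.

Insight: The findings expose limitations of current LLM explanation methods, especially their reliability in low-resource languages. More robust approaches are needed to make LLMs trustworthy in multilingual settings.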

Abstract: Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.

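[156] What does it mean to understand language? cs.CL

Colton Casto, Anna Ivanova, Evelina Fedorenko, Nancy Kanwisher

TL;DR: The paper examines what it means to deeply understand language, proposing that processing in the brain's core language system is limited, so information must be exported to other brain regions to build rich mental models.

Details

Motivation: The aim is to reveal the cognitive and neural mechanisms of language understanding, stressing that understanding is more than extracting surface-level meaning and requires cooperation with broader brain regions that handle perception, motor representations, and mental models.

Result: The authors review existing evidence for the hypothesis that language understanding requires cooperation among multiple brain regions rather than the language system working alone.

Insight: The paper frames language understanding as a distributed neural process that integrates perceptual, motor, and mental-model systems.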

Abstract: Language understanding entails not just extracting the surface-level meaning of the linguistic input, but constructing rich mental models of the situation it describes. Here we propose that because processing within the brain’s core language system is fundamentally limited, deeply understanding language requires exporting information from the language system to other brain regions that compute perceptual and motor representations, construct mental models, and store our world knowledge and autobiographical memories. We review the existing evidence for this hypothesis, and argue that recent progress in cognitive neuroscience provides both the conceptual foundation and the methods to directly test it, thus opening up a new strategy to reveal what it means, cognitively and neurally, to understand language.

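[157] Gender Bias in Emotion Recognition by Large Language Models cs.CL | cs.CY

Maureen Herbert, Katie Sun, Angelica Lim, Yasaman Etesam

TL;DR: The paper investigates whether large language models (LLMs) exhibit gender bias in emotion recognition and proposes and evaluates several debiasing strategies, finding that training-based debiasing is more effective than inference-time prompt engineering.

Details

Motivation: As LLMs become widely used in daily life, ensuring their fairness becomes especially important. The study examines whether LLMs exhibit gender bias in emotional theory of mind tasks.

Result: Inference-time prompt engineering alone does not meaningfully reduce bias, whereas training-based debiasing is markedly more effective.

Insight: Bias in LLMs needs to be addressed at training time; inference-time adjustments alone are unlikely to achieve fairness.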

Abstract: The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, “How does this person feel?”. Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training-based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.

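[158] Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana cs.CL | cs.AI

Koena Ronny Mabokela, Tim Schlippe, Mpho Raborife, Turgay Celik

TL;DR: The paper proposes a language-independent sentiment labelling method based on sentiment-bearing emojis and words, aimed at reducing manual annotation effort, and evaluates it on tweets in English, Sepedi, and Setswana.

Details

Motivation: Many African languages are considered low-resource because they lack labelled data. Manual labelling is time-consuming and expensive, so efficient automatic methods are needed.

Result: Labelling accuracy is 66% for English, 69% for Sepedi, and 63% for Setswana, so on average only 34% of the automatically generated labels need correction.

Insight: Sentiment-bearing emojis and words can serve as effective signals for cross-lingual sentiment labelling and work well for low-resource languages.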

Abstract: Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason for that is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort as much as possible making the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in the languages English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
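
A distant-supervision labeller of this kind can be sketched as a vote over emoji and word polarities, followed by the arithmetic behind the 34% figure. The tiny emoji table, the English lexicon entries, and the simple sum-of-polarities rule are assumptions for illustration; the paper's actual resources and decision rule are not given in the abstract.

```python
from typing import Dict


def label_tweet(text: str, emoji_polarity: Dict[str, int],
                lexicon: Dict[str, int]) -> str:
    """Sum the polarity of every known emoji and word, then map to a class."""
    score = sum(emoji_polarity.get(ch, 0) for ch in text)           # emoji votes
    score += sum(lexicon.get(tok, 0) for tok in text.lower().split())  # word votes
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical resources; a real system would load per-language lexicons.
emoji_polarity = {"😊": 1, "😡": -1}
lexicon = {"great": 1, "terrible": -1}

labels = [label_tweet(t, emoji_polarity, lexicon)
          for t in ["great day 😊", "terrible service 😡", "meeting at noon"]]

# Share of labels left to correct, from the accuracies reported in the abstract.
accuracy = {"English": 66, "Sepedi": 69, "Setswana": 63}
to_correct = 100 - sum(accuracy.values()) / len(accuracy)
```

The mean accuracy over the three languages is 66%, which is where the reported 34% of labels left to correct comes from.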

[159] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction cs.CL | cs.AIPDF

Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner

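TL;DR: The paper systematically analyzes nine large language models (LLMs) on medical error detection and correction, comparing zero-shot prompting, static prompting, and retrieval-augmented dynamic prompting (RDP); RDP performs best at lowering the false-positive rate and raising recall.

Details

Motivation: Errors in clinical documentation can compromise patient safety, and how LLMs behave under different prompting strategies remains unclear, so their potential for medical error processing needs systematic evaluation.

Result: RDP performs best across all models: it lowers the false-positive rate by about 15%, raises recall in error sentence detection by 5-10%, and produces more accurate corrections.

Insight: Retrieval-augmented dynamic prompting improves the reliability and performance of LLMs in medical error processing, with a clear advantage on abbreviation-heavy and atypical errors.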

Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.

[160] AppSelectBench: Application-Level Tool Selection Benchmark cs.CLPDF

Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao

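TL;DR: The paper introduces AppSelectBench, a comprehensive benchmark for evaluating application selection in Computer Using Agents (CUAs), filling the gap left by existing benchmarks that focus mainly on fine-grained API selection.

Details

Motivation: CUAs are increasingly equipped with external tools for complex tasks, but existing benchmarks mainly evaluate fine-grained API selection and do not adequately test whether models can reason across and choose between applications.

Result: Experiments show that even the most capable current models still struggle with cross-application reasoning and consistent application choices.

Insight: Application selection is a core capability of CUAs, and AppSelectBench lays a foundation for studying and improving it.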

Abstract: Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://github.com/microsoft/appselectbench.

[161] $\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers cs.CL | cs.IRPDF

Xinyu Wang, Hanwei Wu, Qingchen Hu, Zhenghan Tai, Jingrui Tian

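TL;DR: The paper proposes the R2R framework, which combines dynamic expert routing with the two-stage training strategy EAG to address the generalization of decoder-only rerankers across domains while avoiding domain-specific overfitting.

Details

Motivation: In domains such as finance and law, generalist decoder-only rerankers lack domain-specific knowledge, while naive fine-tuning causes surface-form overfitting and catastrophic forgetting; R2R targets these problems.

Result: In experiments across legal, medical, and financial domains, R2R surpasses generalist and single-domain fine-tuned baselines and shows strong cross-domain robustness and model-agnosticism.

Insight: Masking surface cues forces the model to learn deeper domain-invariant features, while the dynamic routing mechanism provides efficient domain adaptation.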

Abstract: Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.

[162] Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test cs.CL | cs.AIPDF

Mihir Sahasrabudhe

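TL;DR: Using a synthetic benchmark, the paper studies directional optimization asymmetry in Transformers and finds that they show a significant directional optimization gap even without linguistic priors.

Details

Motivation: Transformers are theoretically reversal-invariant, yet directional failures such as the reversal curse appear in practice. Whether these failures stem from linguistic statistics or from the architecture itself is unclear, and the author uses a clean synthetic test to settle the question.

Result: Transformers show a significant optimization gap on inverse tasks (e.g., 1.16 nats at K=5); pre-trained initialization does not eliminate the gap, and LoRA hits a capacity bottleneck on high-entropy tasks.

Insight: The directional optimization gap is intrinsic to causal Transformer training and independent of linguistic statistics, which provides a basis for future mechanistic study.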

Abstract: Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a “reversal curse,” and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training, one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
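The notion of an entropy floor and of excess loss above it can be made concrete with a small sketch. It assumes that in the inverse task each target has K equally likely preimages, so the conditional entropy is ln K nats; the abstract does not spell out the construction, and the function names here are illustrative only.

```python
import math

def entropy_floor_nats(k):
    """Conditional entropy (in nats) of choosing uniformly among k preimages.

    Assumption: the inverse mapping is uniform over k candidates.
    """
    return math.log(k)

def excess_loss(observed_loss_nats, k):
    """Loss above the irreducible floor; 0 would mean the model is optimal."""
    return observed_loss_nats - entropy_floor_nats(k)

# Under the uniform assumption the floor at K=5 is ln 5, about 1.609 nats.
floor_k5 = entropy_floor_nats(5)
```

For the forward task the same formula with k = 1 gives a floor of zero, which matches the "zero conditional entropy" described in the abstract; the reported gap is excess loss measured relative to such floors.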

[163] Online-PVLM: Advancing Personalized VLMs with Online Concept Learning cs.CLPDF

Huiyu Bai, Runze Wang, Zhuoyun Du, Yiyang Zhao, Fengji Zhang

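TL;DR: The paper proposes Online-PVLM, a framework that uses hyperbolic representations for online concept learning at test time, removing the need to train a separate embedding for each new concept and improving scalability and efficiency. It also introduces the OP-Eval benchmark to validate the method.

Details

Motivation: Existing personalized Visual Language Models (VLMs) must train a separate embedding for each new concept, cannot adapt in real time at test time, and retrieve concept embeddings inefficiently at large scale.

Result: Experiments show that Online-PVLM achieves state-of-the-art performance in online concept learning.

Insight: Hyperbolic representations show promise for online learning and concept embedding generation, offering a new approach to real-time adaptation in personalized VLMs.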

Abstract: Personalized Visual Language Models (VLMs) are gaining increasing attention for their strong ability to support interactions aligned with user-specific concepts (e.g., identifying a user’s bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-scale scenarios, where efficient retrieval of concept embeddings is not achievable. To close this gap, we propose Online-PVLM, a framework for online concept learning by leveraging hyperbolic representations. Our approach establishes a training-free paradigm for generating concept embeddings at test time, making the use of personalized VLMs both scalable and efficient. In addition, we develop OP-Eval, a comprehensive and large-scale benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types, designed to rigorously assess online concept learning in realistic scenarios. Extensive experiments demonstrate the state-of-the-art performance of our proposed framework. Our source code and dataset will be made available.

[164] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model cs.CLPDF

Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia

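TL;DR: The paper proposes MTA, a Merge-then-Adapt framework for efficiently training personalized large language models, addressing high storage costs and data sparsity.

Details

Motivation: Personalized large language models typically fine-tune a separate module per user, which makes storage costs grow linearly with the number of users and yields poor performance for users with sparse data.

Result: Experiments on the LaMP benchmark show that MTA outperforms existing methods.

Insight: Dynamic fusion and stacking techniques can effectively address the storage and few-shot performance problems of personalized large language models.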

Abstract: Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.

[165] More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering cs.CLPDF

Duc Anh Vu, Thong Nguyen, Cong-Duy Nguyen, Viet Anh Nguyen, Anh Tuan Luu

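TL;DR: The paper proposes BiasPrompting, a new inference framework that guides large language models (LLMs) to generate and evaluate reasoning for every candidate answer option, significantly improving performance on multiple-choice question (MCQ) tasks.

Details

Motivation: Existing MCQ answering methods usually present answer options without contextual explanation or supporting reasoning, so models fail to explore all possible answers fully, which weakens their reasoning.

Result: On five widely used MCQ benchmarks, BiasPrompting significantly outperforms existing methods.

Insight: By guiding the model to reason over all options, BiasPrompting strengthens LLM reasoning, especially on complex and challenging questions.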

Abstract: With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of all possible answers, ultimately degrading the models’ reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two components: first, a reasoning generation stage, where the model is prompted to produce supportive reasonings for each answer option, and then, a reasoning-guided agreement stage, where the generated reasonings are synthesized to select the most plausible answer. Through comprehensive evaluations, BiasPrompting demonstrates significant improvements in five widely used multiple-choice question answering benchmarks. Our experiments showcase that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.

[166] REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance cs.CLPDF

Chuyi Kong, Gao Wei, Jing Ma, Hongzhan Lin, Zhiyuan Fan

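TL;DR: REFLEX is a self-refining paradigm for explainable fact-checking that disentangles truth into style and substance and uses the model's internal knowledge to improve accuracy and explanation quality, reaching state-of-the-art performance with only a small amount of training data.

Details

Motivation: Existing LLM-based fact-checking systems rely on external knowledge, incur high latency, and may hallucinate; the paper proposes a more efficient and reliable way to use internal knowledge.

Result: REFLEX achieves state-of-the-art performance on real-world datasets with only 465 training samples; models trained with explanatory objectives improve models trained without them by up to 7.57%.

Insight: Internal explanation signals both make reasoning more interpretable and improve fact-checking accuracy.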

Abstract: The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, which are crucial for real-time use. To address these challenges, we propose the REason-guided Fact-checking with Latent EXplanations (REFLEX) paradigm, a plug-and-play, self-refining approach that leverages the internal knowledge in the backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, REFLEX achieves state-of-the-art performance. Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.

[167] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models cs.CLPDF

Taewhoo Lee, Minju Song, Chanwoong Yoon, Jungwoo Park, Jaewoo Kang

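TL;DR: The paper studies analogical reasoning in large language models (LLMs) and finds that they can encode relational information between entities but struggle to apply it to new entities or to perform high-level reasoning. Patching hidden representations partly improves information transfer, but successful reasoning still requires structural alignment.

Details

Motivation: Analogical reasoning is central to human cognition, but it is unclear whether LLMs can encode and apply high-level relational concepts. The paper sets out to investigate this.

Result: LLMs encode relational information but perform poorly when applying it to new entities; patching helps to a certain extent; successful reasoning requires strong structural alignment.

Insight: LLMs show emerging capability in high-level reasoning but still fall short of human cognition, and interventions on hidden representations may improve performance.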

Abstract: Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. First, LLMs effectively encode the underlying relationships between analogous entities; both attributive and relational information propagate through mid-upper layers in correct cases, whereas reasoning failures reflect missing relational information within these layers. Second, unlike humans, LLMs often struggle not only when relational information is missing, but also when attempting to apply it to new entities. In such cases, strategically patching hidden representations at critical token positions can facilitate information transfer to a certain extent. Lastly, successful analogical reasoning in LLMs is marked by strong structural alignment between analogous situations, whereas failures often reflect degraded or misplaced alignment. Overall, our findings reveal that LLMs exhibit emerging but limited capabilities in encoding and applying high-level relational concepts, highlighting both parallels and gaps with human cognition.

[168] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali cs.CL | cs.AIPDF

Abdullah Al Sefat

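TL;DR: BengaliFig is a compact, richly annotated challenge dataset for Bengali that targets figurative and culturally grounded reasoning in a low-resource language and exposes the weaknesses of mainstream large language models on such tasks.

Details

Motivation: Large language models excel on multilingual benchmarks but have not been systematically evaluated on figurative and culturally grounded reasoning in low-resource languages. BengaliFig fills this gap.

Result: Mainstream large language models perform poorly on figurative and culturally specific reasoning, especially in low-resource cultural contexts.

Insight: The study underlines the importance of low-resource languages and culturally specific tasks for LLM evaluation and points toward more inclusive, heritage-aware NLP evaluation.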

Abstract: Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

[169] Adversarial Confusion Attack: Disrupting Multimodal Large Language Models cs.CLPDF

Jakub Hoscilowicz, Artur Janicki

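TL;DR: The Adversarial Confusion Attack is a new class of threats against multimodal large language models that uses adversarial images to disrupt model output so that it becomes incoherent or incorrect. The method uses a small ensemble of open-source MLLMs and shows strong transferability in both white-box and black-box settings.

Details

Motivation: Multimodal large language models (MLLMs) are increasingly deployed in practice, yet their security has not been studied in depth. The work aims to expose the potential vulnerability of MLLMs to adversarial attacks and proposes a systematic disruption method.

Result: Experiments show that a single adversarial image can disrupt all models in the ensemble in both the full-image and adversarial CAPTCHA settings, and that it transfers to unseen open-source and proprietary models.

Insight: The vulnerability of MLLMs to adversarial attacks should not be overlooked, and stronger defenses are needed.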

Abstract: We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.

[170] Latent Collaboration in Multi-Agent Systems cs.CL | cs.AI | cs.LGPDF

Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu

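TL;DR: LatentMAS is a training-free multi-agent collaboration framework in which agents collaborate directly in latent space, giving higher expressiveness and lossless information exchange and markedly improving efficiency and reasoning quality.

Details

Motivation: Existing LLM agents rely on text-based mediation for reasoning and communication, which limits efficiency and expressiveness. LatentMAS addresses this by letting agents collaborate directly in latent space.

Result: Across 9 benchmarks, LatentMAS achieves up to 14.6% higher accuracy than single-model and text-based multi-agent baselines, reduces output tokens by 70.8%-83.7%, and runs inference 4x-4.3x faster.

Insight: Latent-space collaboration improves reasoning quality while substantially cutting computation and communication costs, suggesting a new direction for multi-agent system design.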

Abstract: Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent’s internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

math.OC [Back]

[171] Optimization of Sums of Bivariate Functions: An Introduction to Relaxation-Based Methods for the Case of Finite Domains math.OC | cs.CV | stat.MLPDF

Nils Müller

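TL;DR: The paper studies the optimization of multivariate functions on finite domains, in particular functions that can be written as sums of bivariate functions. Using relaxations and entropy regularization, the author derives tractable optimization formulations and validates them experimentally.

Details

Motivation: Multivariate optimization is widely used in computer vision and machine learning, but direct optimization is usually computationally expensive. The paper aims to reduce this cost through decomposition into bivariate functions and relaxation methods.

Result: Experiments show that the proposed methods perform well on random functions, vertex coloring, and signal reconstruction problems, demonstrating broad applicability.

Insight: Bivariate decomposition and relaxation provide new tools for multivariate optimization, and are particularly effective on finite domains.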

Abstract: We study the optimization of functions with $n>2$ arguments that have a representation as a sum of several functions that have only $2$ of the $n$ arguments each, termed sums of bivariates, on finite domains. The complexity of optimizing sums of bivariates is shown to be NP-equivalent and it is shown that there exists free lunch in the optimization of sums of bivariates. Based on measure-valued extensions of the objective function, so-called relaxations, $\ell^2$-approximation, and entropy-regularization, we derive several tractable problem formulations solvable with linear programming, coordinate ascent as well as with closed-form solutions. The limits of applying tractable versions of such relaxations to sums of bivariates are investigated using general results for reconstructing measures from their bivariate marginals. Experiments in which the derived algorithms are applied to random functions, vertex coloring, and signal reconstruction problems provide insights into qualitatively different function classes that can be modeled as sums of bivariates.
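To see what a "sum of bivariates" on a finite domain looks like, the sketch below evaluates such an objective and maximizes it by exhaustive search, using vertex coloring of a triangle as the example. This is only a brute-force baseline for illustration, not one of the paper's relaxation-based algorithms; the names `sum_of_bivariates`, `brute_force_max`, and `differ` are made up here.

```python
from itertools import product

def sum_of_bivariates(x, terms):
    """Evaluate f(x) = sum over (i, j) of f_ij(x[i], x[j]).

    `terms` maps index pairs (i, j) to two-argument functions.
    """
    return sum(f(x[i], x[j]) for (i, j), f in terms.items())

def brute_force_max(n, domain, terms):
    """Exhaustively search domain^n; cost grows as len(domain) ** n."""
    return max(product(domain, repeat=n),
               key=lambda x: sum_of_bivariates(x, terms))

def differ(a, b):
    """Reward 1 when the two endpoints of an edge get different colors."""
    return 1 if a != b else 0

# Triangle graph with 3 colors: one bivariate term per edge.
triangle_terms = {(0, 1): differ, (1, 2): differ, (0, 2): differ}
best_coloring = brute_force_max(3, range(3), triangle_terms)
```

The exponential size of the search space is what makes tractable relaxations attractive for larger n.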

cs.CR [Back]

[172] SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models cs.CR | cs.AI | cs.CV | cs.LGPDF

Mohammed Talha Alam, Nada Saadi, Fahad Shamshad, Nils Lukas, Karthik Nandakumar

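TL;DR: SPQR is a standardized benchmark for evaluating safety alignment methods for text-to-image diffusion models, in particular whether safety persists after benign downstream fine-tuning. Through a single aggregate score, SPQR provides a standardized evaluation of safety, utility, and robustness.

Details

Motivation: Text-to-image diffusion models can generate unsafe, copyright-infringing, or private content, and the stability of existing safety alignment methods after benign fine-tuning has not been systematically evaluated.

Result: Existing safety methods frequently break down after benign fine-tuning, and the SPQR benchmark effectively identifies such failure cases.

Insight: Safety alignment must withstand subsequent fine-tuning; SPQR offers a practical tool for standardized evaluation of future safety alignment techniques.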

Abstract: Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. As true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness). SPQR is a single-scored metric that provides a standardized and reproducible framework to evaluate how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, by reporting a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for T2I safety alignment techniques.

eess.SP [Back]

[173] Redefining Radar Segmentation: Simultaneous Static-Moving Segmentation and Ego-Motion Estimation using Radar Point Clouds eess.SP | cs.CVPDF

Simin Zhu, Satish Ravindran, Alexander Yarovoy, Francesco Fioranelli

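TL;DR: The paper proposes a neural-network-based solution that simultaneously segments static and moving objects in radar point clouds and estimates the instantaneous velocity of the moving platform. The method uses multilayer perceptrons (MLPs) and recurrent neural networks (RNNs), needs no complex preprocessing, and is validated on the RadarScenes dataset.

Details

Motivation: Conventional radar segmentation research focuses on category labels for moving objects, but deciding whether an object is static or moving is a prerequisite for many tasks. In addition, the differences between radar and optical sensors make category labels less reliable.

Result: Experiments on RadarScenes show that the method performs well on both tasks and has broad application potential in other radar perception tasks.

Insight: 1. Extracting the required information directly from point clouds is feasible without complex preprocessing; 2. the radial velocity of static objects is correlated with platform motion and can be used to estimate platform velocity; 3. simple neural network building blocks (MLPs and RNNs) can solve complex tasks effectively.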

Abstract: Conventional radar segmentation research has typically focused on learning category labels for different moving objects. Although fundamental differences between radar and optical sensors lead to differences in the reliability of predicting accurate and consistent category labels, a review of common radar perception tasks in automotive reveals that determining whether an object is moving or static is a prerequisite for most tasks. To fill this gap, this study proposes a neural network based solution that can simultaneously segment static and moving objects from radar point clouds. Furthermore, since the measured radial velocity of static objects is correlated with the motion of the radar, this approach can also estimate the instantaneous 2D velocity of the moving platform or vehicle (ego motion). However, despite performing dual tasks, the proposed method employs very simple yet effective building blocks for feature extraction: multi layer perceptrons (MLPs) and recurrent neural networks (RNNs). In addition to being the first of its kind in the literature, the proposed method also demonstrates the feasibility of extracting the information required for the dual task directly from unprocessed point clouds, without the need for cloud aggregation, Doppler compensation, motion compensation, or any other intermediate signal processing steps. To measure its performance, this study introduces a set of novel evaluation metrics and tests the proposed method using a challenging real world radar dataset, RadarScenes. The results show that the proposed method not only performs well on the dual tasks, but also has broad application potential in other radar perception tasks.
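The link between the radial velocity of static points and ego motion can be illustrated with a classical least-squares fit. This is not the paper's neural approach; it is a textbook-style sketch that assumes the points are already known to be static, a sensor at the vehicle origin with no rotation, and the sign convention v_r = -(v_x cos θ + v_y sin θ). The function name is hypothetical.

```python
import math

def estimate_ego_velocity(azimuths, radial_velocities):
    """Least-squares 2D ego velocity from static detections.

    Assumed model: v_r = -(vx * cos(theta) + vy * sin(theta)).
    Solves the 2x2 normal equations in closed form.
    """
    a11 = sum(math.cos(t) ** 2 for t in azimuths)
    a12 = sum(math.cos(t) * math.sin(t) for t in azimuths)
    a22 = sum(math.sin(t) ** 2 for t in azimuths)
    b1 = -sum(v * math.cos(t) for t, v in zip(azimuths, radial_velocities))
    b2 = -sum(v * math.sin(t) for t, v in zip(azimuths, radial_velocities))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("azimuths do not span two directions")
    vx = (a22 * b1 - a12 * b2) / det
    vy = (a11 * b2 - a12 * b1) / det
    return vx, vy
```

In practice the static/moving split is itself unknown, which is why a method that solves segmentation and ego-motion estimation jointly, as the abstract proposes, is of interest.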

cs.LG [Back]

[174] BlockCert: Certified Blockwise Extraction of Transformer Mechanisms cs.LG | cs.AI | cs.CLPDF

Sandro Andric

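TL;DR: BlockCert is a framework for certified extraction of modular mechanisms from Transformer models, with a lightweight extension that supports certified local edits. Its certificates bound approximation error, record coverage, and hash the underlying artifacts, ultimately yielding a quantitative bound on global deviation.

Details

Motivation: Current interpretability and model editing methods rely mostly on informal evidence and ad-hoc experiments and lack quantitative guarantees on how far an extracted or edited model deviates from the original. BlockCert aims to fill this gap with verifiable extraction and editing.

Result: Validated on GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B, the framework achieves high coverage and small residual errors. In the TinyLlama setting, the fully stitched model matches the baseline perplexity within approximately 6e-5.

Insight: BlockCert shows that modular extraction with formal verification is feasible for real Transformer models, bridging mechanistic interpretability and formal reasoning about model behaviour.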

Abstract: Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within approximately 6e-5 on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.
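A generic Lipschitz telescoping bound of the kind the abstract alludes to can be written as follows. The notation ($f_i$, $g_i$, $\varepsilon_i$, $L_i$) is assumed here for illustration and is not taken from the paper; the theorem formalized in Lean 4 may be stated differently.

```latex
% Assumed notation: f_i are the original blocks, g_i their surrogates,
% F = f_n \circ \dots \circ f_1 and G = g_n \circ \dots \circ g_1.
% Assumptions: \|f_i(z) - g_i(z)\| \le \varepsilon_i on the inputs z reached
% by the surrogate pipeline, and each f_j is L_j-Lipschitz.
\[
  \|F(x) - G(x)\| \;\le\; \sum_{i=1}^{n} \varepsilon_i \prod_{j=i+1}^{n} L_j .
\]
% Proof idea: define hybrids
% h_i = f_n \circ \dots \circ f_{i+1} \circ g_i \circ \dots \circ g_1,
% so that h_0 = F and h_n = G. Then
% \|h_{i-1}(x) - h_i(x)\| \le \varepsilon_i \prod_{j>i} L_j,
% and the triangle inequality sums these n terms.
```

The bound shows why per-block certificates suffice: each local error is amplified only by the Lipschitz constants of the blocks that come after it.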

[175] Quantifying Modality Contributions via Disentangling Multimodal Representations cs.LG | cs.AI | cs.CLPDF

Padegal Amit, Omkar Mahesh Kashyap, Namitha Rayasam, Nidhi Shekhar, Surabhi Narayan

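TL;DR: The paper proposes a framework based on Partial Information Decomposition (PID) that quantifies each modality's contribution in multimodal models by decomposing the predictive information in embeddings into unique, redundant, and synergistic components, giving a clearer account of modality behavior.

Details

Motivation: Existing methods quantify contribution by the performance drop after removing a modality, but cannot tell whether the value comes from the modality's own information or from its interaction with other modalities; the problem is more pronounced in cross-attention architectures.

Result: The framework enables layer-level and dataset-level quantitative analysis of modality behavior that is clearer and more interpretable than outcome-based metrics.

Insight: Analysis at the representation level rather than the outcome level better separates a modality's inherent information from its interaction value, especially in complex multimodal architectures.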

Abstract: Quantifying modality contributions in multimodal models remains a challenge, as existing approaches conflate the notion of contribution itself. Prior work relies on accuracy-based approaches, interpreting performance drops after removing a modality as indicative of its influence. However, such outcome-driven metrics fail to distinguish whether a modality is inherently informative or whether its value arises only through interaction with other modalities. This distinction is particularly important in cross-attention architectures, where modalities influence each other’s representations. In this work, we propose a framework based on Partial Information Decomposition (PID) that quantifies modality contributions by decomposing predictive information in internal embeddings into unique, redundant, and synergistic components. To enable scalable, inference-only analysis, we develop an algorithm based on the Iterative Proportional Fitting Procedure (IPFP) that computes layer and dataset-level contributions without retraining. This provides a principled, representation-level view of multimodal behavior, offering clearer and more interpretable insights than outcome-based metrics.
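For readers unfamiliar with the Iterative Proportional Fitting Procedure, the classical two-dimensional version is sketched below: a positive table is alternately rescaled so that its row and column sums match prescribed marginals. How the paper embeds IPFP into the PID computation is not described in the abstract, so this shows only the generic procedure, with an illustrative function name.

```python
def ipfp(matrix, row_targets, col_targets, iterations=200):
    """Classical 2D iterative proportional fitting.

    Alternately rescales rows and columns of a strictly positive matrix
    until row sums match row_targets and column sums match col_targets.
    The two target vectors must have the same total.
    """
    m = [row[:] for row in matrix]  # work on a copy
    for _ in range(iterations):
        # Row step: scale each row to its target sum.
        for i, row in enumerate(m):
            s = sum(row)
            m[i] = [v * row_targets[i] / s for v in row]
        # Column step: scale each column to its target sum.
        for j in range(len(col_targets)):
            s = sum(row[j] for row in m)
            for row in m:
                row[j] *= col_targets[j] / s
    return m
```

Because every step is a simple rescaling, the procedure needs no gradient computation or retraining, which is consistent with the inference-only analysis the abstract describes.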

[176] Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits cs.LG | cs.AI | cs.CLPDF

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

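TL;DR: The paper proposes a singular-vector-based method for interpreting the internal computations of Transformer circuits at finer granularity, revealing superposed and independent subfunctions within a single attention head or multilayer perceptron (MLP).

Details

Motivation: Existing mechanistic interpretability methods usually treat attention heads and MLPs as indivisible units and overlook functional substructure inside them. A finer-grained analysis is needed to understand Transformer computation more deeply.

Result: Experiments confirm that Transformer computations are more distributed, structured, and compositional than previously assumed; activations of circuit nodes concentrate along specific low-rank directions.

Insight: Transformer computation is more complex and structured than previously assumed, and its functions can be described more precisely through low-rank subspaces, opening new directions for mechanistic interpretability.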

Abstract: Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.

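[177] Geometry of Decision Making in Language Models cs.LG | cs.AI | cs.CL

Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi

TL;DR: The paper studies the geometry of hidden representations in large language models (LLMs) during decision making. It finds that the pattern of intrinsic dimension across layers is tied to task decisions, revealing how the model projects inputs onto low-dimensional manifolds.

Details

Motivation: Although LLMs generalize strongly across many tasks, their internal decision-making mechanisms remain opaque. The paper aims to explain how LLMs work through a geometric lens, in particular intrinsic dimension (ID).

Result: A consistent ID pattern emerges: the model learns to project inputs onto structured, low-dimensional manifolds that support task decisions.

Insight: LLMs implicitly learn low-dimensional manifold structure aligned with the task, which provides a new geometric perspective on their generalization and reasoning.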

Abstract: Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of intrinsic dimension (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study with 28 open-weight transformer models, estimating ID across layers using multiple estimators while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.
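The abstract does not name the ID estimators it uses. As an illustration only, the sketch below implements one widely used estimator, TwoNN, in its maximum-likelihood form d = N / sum(log(r2 / r1)), where r1 and r2 are each point's distances to its first and second nearest neighbours. The synthetic `hidden` array is a hypothetical stand-in for one layer's hidden states, and the function name is an assumption.

```python
import numpy as np

def twonn_intrinsic_dimension(points):
    """TwoNN maximum-likelihood estimate: d = N / sum(log(r2 / r1))."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # (N, N) pairwise distances
    np.fill_diagonal(dist, np.inf)            # ignore self-distances
    nearest = np.sort(dist, axis=1)[:, :2]    # first and second neighbour distances
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                             # drop exact duplicates
    return keep.sum() / np.sum(np.log(r2[keep] / r1[keep]))

# Synthetic stand-in: a 2-D sheet linearly embedded in a 10-D space.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(600, 2))
embed = rng.normal(size=(2, 10))
hidden = latent @ embed
estimate = twonn_intrinsic_dimension(hidden)  # should be close to 2, not 10
```

Applying such an estimator to each layer's representations in turn would give a per-layer ID curve of the kind the abstract describes.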

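[178] Soft Adaptive Policy Optimization cs.LG | cs.AI | cs.CL

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu

TL;DR: The paper proposes Soft Adaptive Policy Optimization (SAPO), a policy optimization method for reinforcement learning of large language models that replaces conventional hard clipping with a soft gating mechanism, improving training stability and performance.

Details

Motivation: Existing group-based policy optimization methods (such as GSPO and GRPO) use hard clipping to handle the high variance of importance ratios, but this makes it difficult to maintain both stability and effective learning. A more flexible optimization method is therefore needed.

Result: Experiments show that SAPO achieves better training stability and higher Pass@1 on mathematical reasoning benchmarks, and yields performance gains across tasks and model sizes when used to train the Qwen3-VL model series.

Insight: Soft gating not only improves training stability but also uses samples more effectively, avoiding the information loss caused by hard clipping; it is a reliable strategy for RL optimization of LLMs.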

Abstract: Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance (a phenomenon exacerbated in Mixture-of-Experts models), leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
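To make the contrast between hard clipping and a smooth gate concrete, the sketch below compares the per-token gradient weight under a PPO-style clipped objective with the weight under a sigmoid-shaped surrogate. The abstract only states that the gate is smooth and temperature-controlled; the specific surrogate `(4 / tau) * sigmoid(tau * (r - 1))`, the temperature `tau=5.0`, and the clip width `eps=0.2` are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def hard_clip_weight(ratio, advantage, eps=0.2):
    """Gradient of min(r*A, clip(r, 1-eps, 1+eps)*A) w.r.t. r, divided by A: 1 or 0."""
    if advantage >= 0:
        return np.where(ratio <= 1.0 + eps, 1.0, 0.0)
    return np.where(ratio >= 1.0 - eps, 1.0, 0.0)

def soft_gate_weight(ratio, tau=5.0):
    """Gradient of the assumed surrogate (4/tau)*sigmoid(tau*(r-1)) w.r.t. r."""
    s = 1.0 / (1.0 + np.exp(-tau * (ratio - 1.0)))
    return 4.0 * s * (1.0 - s)                # equals 1 at r = 1, decays smoothly

ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.5, 2.0])
hard = hard_clip_weight(ratios, advantage=1.0)   # drops abruptly to 0 past 1 + eps
soft = soft_gate_weight(ratios)                  # attenuates gradually, never exactly 0
```

Under this assumed form, tokens far from the on-policy ratio of 1 still contribute a small gradient instead of being cut off entirely, which matches the "continuous trust region" described in the abstract.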

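[179] Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space cs.LG | cs.CV | stat.ML

Shivam Pal, Sakshi Varshney, Piyush Rai

TL;DR: The paper proposes learning a robust function through targeted Jacobian regularization in a disentangled latent space, addressing the out-of-distribution (OOD) generalization failures that arise when deep neural networks learn shortcuts (spurious correlations) from training data.

Details

Motivation: Deep neural networks readily learn shortcuts (spurious correlations) in training data, which leads to poor out-of-distribution (OOD) generalization. Existing methods typically learn robust representations by partitioning the latent space, but this approach is complex and hard to scale.

Result: The method achieves state-of-the-art OOD performance on shortcut learning benchmarks.

Insight: A disentangled latent space provides the basis for injecting targeted noise; the shortcut learning problem is addressed through functional invariance rather than representational robustness.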

Abstract: Deep neural networks are prone to learning shortcuts: spurious and easily learned correlations in training data that cause severe failures in out-of-distribution (OOD) generalization. A dominant line of work seeks robustness by learning a robust representation, often explicitly partitioning the latent space into core and spurious components; this approach can be complex, brittle, and difficult to scale. We take a different approach: instead of a robust representation, we learn a robust function. We present a simple and effective training method that renders the classifier functionally invariant to shortcut signals. Our method operates within a disentangled latent space, which is essential as it isolates spurious and core features into distinct dimensions. This separation enables the identification of candidate shortcut features by their strong correlation with the label, used as a proxy for semantic simplicity. The classifier is then desensitized to these features by injecting targeted, anisotropic latent noise during training. We analyze this as targeted Jacobian regularization, which forces the classifier to ignore spurious features and rely on more complex, core semantic signals. The result is state-of-the-art OOD performance on established shortcut learning benchmarks.
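The link between anisotropic latent noise and a Jacobian penalty can be seen in a toy linear case. The sketch below is an illustration under stated assumptions, not the paper's method: the noise scale is set proportional to each latent dimension's absolute label correlation (the `strength` parameter is hypothetical), the classifier is a linear least-squares model, and the expected effect of Gaussian noise with per-dimension standard deviation sigma_j is computed in closed form as the penalty N * sum_j sigma_j^2 * w_j^2. For a linear model the Jacobian with respect to the latent is just w, so this penalizes the Jacobian most along the noised dimensions.

```python
import numpy as np

def shortcut_noise_scale(z, y, strength=2.0):
    """Per-dimension noise std, proportional to |Pearson correlation| with the label."""
    zc = z - z.mean(axis=0)
    yc = y - y.mean()
    corr = (zc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(zc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return strength * np.abs(corr)

def expected_noisy_least_squares(z, y, sigma):
    """Least squares on z + eps, eps ~ N(0, diag(sigma^2)), taken in expectation."""
    n = z.shape[0]
    return np.linalg.solve(z.T @ z + n * np.diag(sigma ** 2), z.T @ y)

# Toy latent space: dimension 0 is a noisy core feature, dimension 1 a clean shortcut.
rng = np.random.default_rng(0)
n = 2000
y = rng.choice([-1.0, 1.0], size=n)
core = y + rng.normal(scale=1.0, size=n)
shortcut = y + rng.normal(scale=0.1, size=n)
z = np.stack([core, shortcut], axis=1)

sigma = shortcut_noise_scale(z, y)                       # larger on the shortcut dimension
w_plain = expected_noisy_least_squares(z, y, np.zeros(2))
w_robust = expected_noisy_least_squares(z, y, sigma)     # shifts weight toward the core feature
```

In this toy setting the unregularized solution relies almost entirely on the shortcut dimension, while the noised solution reduces that weight and increases the weight on the core feature.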

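[180] Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport cs.LG | cs.AI | cs.CV

Zecheng Pan, Zhikang Chen, Ding Li, Min Zhang, Sen Cui

TL;DR: OTMF is a new model merging framework based on optimal transport theory. It avoids feature-space distribution shift by selectively extracting task-agnostic components, supports continual incremental fusion, and achieves leading accuracy and efficiency.

Details

Motivation: Most existing model merging methods rely on parameter interpolation, which causes distribution shift in the feature space and damages task-specific knowledge. A more effective merging method is needed.

Result: On multiple vision and language benchmarks, OTMF reaches state-of-the-art accuracy and efficiency.

Insight: Optimal transport can effectively align the semantic geometry of task models, selective mask extraction helps retain task-agnostic knowledge, and the incremental fusion paradigm has practical value.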

Abstract: Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.

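[181] Learning Massively Multitask World Models for Continuous Control cs.LG | cs.CV | cs.RO

Nicklas Hansen, Hao Su, Xiaolong Wang

TL;DR: The paper presents Newt, a multitask world model trained on 200 diverse tasks with large-scale pretraining followed by lightweight reinforcement learning, showing better multitask performance and adaptability than baseline methods.

Details

Motivation: Existing reinforcement learning (RL) research mostly focuses on single-task or offline settings, which limits its generality. The paper explores how well online RL scales to multitask settings.

Result: Newt outperforms baselines in multitask performance and data efficiency, and exhibits strong open-loop control and rapid adaptation.

Insight: Combining large-scale pretraining with lightweight RL is an effective route to general-purpose control and may advance research on multitask RL.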

Abstract: General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.

cs.RO

Samarth Chopra, Jing Liang, Gershom Seneviratne, Yonghan Lee, Jaehoon Choi

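TL;DR: Splatblox is a real-time system that fuses semantically segmented RGB images and LiDAR point clouds using Gaussian Splatting to build a traversability-aware ESDF for outdoor robot navigation.

Details

Motivation: In complex outdoor environments, existing navigation methods struggle to distinguish traversable vegetation (such as tall grass) from rigid obstacles (such as trees), and they lack a joint encoding of geometry and semantics.

Result: In vegetation-rich outdoor environments, Splatblox achieves over 50% higher success rate, 40% fewer freezing incidents, 5% shorter paths, and up to 13% faster time to goal, and supports long-range missions of up to 100 meters.

Insight: A representation that jointly encodes geometry and semantics substantially improves navigation in complex outdoor environments, especially for distinguishing traversable vegetation from rigid obstacles.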

Abstract: We present Splatblox, a real-time system for autonomous navigation in outdoor environments with dense vegetation, irregular obstacles, and complex terrain. Our method fuses segmented RGB images and LiDAR point clouds using Gaussian Splatting to construct a traversability-aware Euclidean Signed Distance Field (ESDF) that jointly encodes geometry and semantics. Updated online, this field enables semantic reasoning to distinguish traversable vegetation (e.g., tall grass) from rigid obstacles (e.g., trees), while LiDAR ensures 360-degree geometric coverage for extended planning horizons. We validate Splatblox on a quadruped robot and demonstrate transfer to a wheeled platform. In field trials across vegetation-rich scenarios, it outperforms state-of-the-art methods with over 50% higher success rate, 40% fewer freezing incidents, 5% shorter paths, and up to 13% faster time to goal, while supporting long-range missions up to 100 meters. Experiment videos and more details can be found on our project page: https://splatblox.github.io

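[183] ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation cs.RO | cs.CV

Yuhan Wu, Tiantian Wei, Shuo Wang, ZhiChao Wang, Yanyong Zhang

TL;DR: The paper introduces the ArtiBench benchmark and the ArtiBrain framework for evaluating and addressing generalization in articulated object manipulation. ArtiBench covers multiple environments and tasks, while ArtiBrain combines high-level reasoning with adaptive low-level control to significantly improve generalization and robustness.

Details

Motivation: Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. A systematic approach is needed to evaluate and address these challenges.

Result: Experiments show that ArtiBrain significantly outperforms existing methods on ArtiBench, especially in cross-part, cross-instance, and long-horizon multi-task manipulation.

Insight: The Affordance Memory Bank is key to cross-scenario generalization, and the modular framework that combines high-level reasoning with low-level control offers a scalable solution.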

Abstract: Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided diffusion for precise and interpretable manipulation. An Affordance Memory Bank continually accumulates successful execution episodes and propagates part-level actionable affordances to unseen articulated parts and configurations. Extensive experiments on ArtiBench show that our ArtiBrain significantly outperforms state-of-the-art multimodal and diffusion-based methods in robustness and generalization. Code and dataset will be released upon acceptance.

cs.AI

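[184] Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs cs.AI | cs.CL | cs.CV

Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu

TL;DR: The paper presents VISTA-Gym, a scalable training environment for improving the multi-step visual interactive reasoning of vision-language models (VLMs), and uses reinforcement learning to train VISTA-R1, which significantly outperforms existing baselines.

Details

Motivation: Current VLMs perform well at image understanding but remain weak at multi-step visual interactive reasoning. A standardized approach is needed to strengthen their tool-integrated visual reasoning.

Result: Across 11 public reasoning-intensive VQA benchmarks, VISTA-R1-8B outperforms similarly sized baselines by 9.51%-18.72%.

Insight: VISTA-Gym provides an effective training platform for VLMs, demonstrates the importance of tool integration and multi-step reasoning, and indicates the potential of reinforcement learning in this direction.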

Abstract: While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to “think with images”, i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.

Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe

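TL;DR: CostNav is a new navigation benchmark focused on economic cost evaluation, exposing the gap between task success and economic viability in navigation research.

Details

Motivation: Existing navigation benchmarks mainly measure task success and overlook economic viability, which is critical for commercial deployment. CostNav fills this gap.

Result: The baseline achieves 43.0% SLA compliance but is not commercially viable: it loses $30.009 per run, with collision-induced maintenance accounting for 99.7% of per-run costs.

Insight: Collision avoidance is a key target for economic optimization, and CostNav lays the foundation for evaluating rule-based navigation, imitation learning, and cost-aware reinforcement learning.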

Abstract: Existing navigation benchmarks focus on task success metrics while overlooking economic viability, which is critical for commercial deployment of autonomous delivery robots. We introduce CostNav, a Micro-Navigation Economic Testbed that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0% SLA compliance but is not commercially viable: it yields a loss of $30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
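The "no finite break-even point" statement follows from simple arithmetic: a fixed investment can only be recovered if each run earns a positive margin. The sketch below is a hypothetical helper, not CostNav's cost model; the fixed cost of 10000.0 and the positive margin of 2.5 are made-up inputs, and only the per-run loss of 30.009 comes from the abstract.

```python
import math

def break_even_runs(fixed_cost, profit_per_run):
    """Runs needed to recover a fixed investment; infinite if a run does not earn money."""
    if profit_per_run <= 0:
        return math.inf
    return math.ceil(fixed_cost / profit_per_run)

# Reported baseline: each run loses $30.009, so no number of runs recovers the investment.
baseline = break_even_runs(fixed_cost=10000.0, profit_per_run=-30.009)

# Hypothetical profitable policy, for contrast only.
profitable = break_even_runs(fixed_cost=10000.0, profit_per_run=2.5)
```

Because maintenance is reported as 99.7% of per-run costs, reducing collisions is the lever most likely to move the per-run margin above zero.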

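[186] VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning cs.AI | cs.CV | cs.GR | cs.RO

Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li

TL;DR: VibraVerse is a large-scale geometry-acoustics alignment dataset that, together with the CLASP contrastive learning framework, enables physically consistent multimodal learning and addresses the lack of physical consistency in existing vision-language frameworks.

Details

Motivation: Existing multimodal learning frameworks focus mainly on vision and language, ignore the causal relationships among an object's geometry, material, vibration modes, and sound, and therefore lack physical consistency.

Result: Experiments show that models trained on VibraVerse achieve higher accuracy, interpretability, and generalization on multimodal tasks.

Insight: VibraVerse provides a benchmark for physically consistent and causally interpretable multimodal learning, advancing sound-guided perception and understanding of the physical world.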

Abstract: Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object’s geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young’s modulus, Poisson’s ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object’s physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
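The step from physical attributes to modal parameters amounts to solving a generalized eigenproblem K phi = omega^2 M phi for a stiffness matrix K and a mass matrix M. The sketch below is a toy stand-in, not the dataset's pipeline: instead of a volumetric mesh with density, Young's modulus, and Poisson's ratio, it assumes a hypothetical chain of equal lumped masses joined by identical springs with both ends fixed, so the result can be checked against the known closed form.

```python
import numpy as np

def modal_frequencies(stiffness, masses):
    """Solve K phi = omega^2 M phi for a diagonal (lumped) mass matrix M."""
    inv_sqrt_m = 1.0 / np.sqrt(masses)
    a = stiffness * inv_sqrt_m[:, None] * inv_sqrt_m[None, :]   # M^-1/2 K M^-1/2
    omega_sq, vecs = np.linalg.eigh(a)                          # ascending eigenvalues
    modes = vecs * inv_sqrt_m[:, None]                          # back to physical coordinates
    freqs_hz = np.sqrt(np.clip(omega_sq, 0.0, None)) / (2.0 * np.pi)
    return freqs_hz, modes

def chain_stiffness(n, k):
    """Stiffness matrix of n masses joined by springs of stiffness k, both ends fixed."""
    return k * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))

# Hypothetical toy object: 5 masses of 0.1 kg, springs of 1e4 N/m.
n, k, m = 5, 1.0e4, 0.1
freqs, modes = modal_frequencies(chain_stiffness(n, k), np.full(n, m))
```

For this chain the angular frequencies are 2 * sqrt(k / m) * sin(j * pi / (2 * (n + 1))) for j = 1..n, which the numerical result reproduces; an impact sound could then be synthesized as a sum of damped sinusoids at `freqs`.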

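[187] Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models cs.AI | cs.CV | cs.LG

Shamima Hossain

TL;DR: The paper proposes a knowledge-guided reasoning framework for vision-language models (VLMs) that performs multi-hop verification over structured knowledge graphs, substantially improving factual accuracy.

Details

Motivation: VLMs are strong generators but often produce factually incorrect output because they lack robust reasoning capabilities. Integrating external knowledge for reasoning has been studied extensively for text-only LLMs, whereas reasoning across modalities remains underexplored.

Result: Preliminary experiments on a curated dataset drawn from Google Landmarks v2, Conceptual Captions, and COCO Captions show that the method improves factual accuracy by approximately 31%.

Insight: The work surfaces key insights into VLM reasoning patterns and failure modes and shows that external knowledge can markedly improve multimodal models.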

Abstract: Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs that leverages structured knowledge graphs for multi-hop verification, using an image-captioning task to illustrate the framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based, and bullet-point-based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% in preliminary experiments on a curated dataset mixing Google Landmarks v2, Conceptual Captions, and COCO Captions, revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgeable multimodal systems.

cs.DC

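[188] QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation cs.DC | cs.CL

Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo

TL;DR: The paper proposes the Macro-Thinking Micro-Coding (MTMC) framework, a hierarchical design that addresses the correctness and efficiency problems of LLM-generated high-performance GPU kernels. It combines a reinforcement-learned strategy with high-level optimization guidance and significantly improves the accuracy and speed of generated kernels.

Details

Motivation: High-performance GPU kernels are essential for AI and scientific computing, but traditional development depends on expert experience and ports poorly. LLMs make automation possible, yet they face the twin challenges of correctness and efficiency when generating low-level code.

Result: On KernelBench and TritonBench, MTMC clearly outperforms existing methods in both accuracy and running time, reaching up to 7.3x speedup and near-100% accuracy.

Insight: The hierarchical design effectively addresses the difficulty LLMs have in exploring a complex optimization space, and shows the importance of high-level strategy guidance in automated code generation.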

Abstract: Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate the entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation codes. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs for high-performance GPU kernel generation. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and 3, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs, and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.

cs.MM

Xiangyu Zhao, Yaling Shen, Yiwen Jiang, Zimu Wang, Jiahe Liu

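TL;DR: The paper proposes a multimodal large language model framework for depression detection that augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level, improving modeling precision.

Details

Motivation: Depression is one of the most prevalent mental health disorders worldwide. Conventional text-centric LLMs cannot process the non-verbal cues in audio and visual modalities, which are critical for mental health assessment. The study proposes a multimodal LLM framework to fill this gap.

Result: Experiments on the DAIC-WoZ dataset show that the model outperforms both single-modality approaches and previous multimodal methods.

Insight: The framework is extensible to additional physiological signals and could be applied to clinical areas beyond mental health.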

Abstract: Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.

eess.IV

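[190] A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT eess.IV | cs.CV | cs.LG

Wanqi Wang, Chun Yang, Jianbo Shao, Yaokai Zhang, Xuehua Peng

TL;DR: The paper proposes a multi-stage deep learning framework with PKCP-MixUp data augmentation for automated diagnosis of pediatric liver tumors from multi-phase contrast-enhanced CT, addressing data scarcity and class imbalance.

Details

Motivation: Pediatric liver tumors are among the most common solid tumors in children. Pathological examination is the gold standard, but biopsy is invasive, risky, and costly. The paper aims to develop a non-invasive AI diagnostic method and fill the gap in deep learning (DL) diagnosis of pediatric liver tumors.

Result: The tumor detection model reaches mAP=0.871, benign-versus-malignant classification reaches AUC=0.989, and benign and malignant subtype classification reach AUC=0.915 and AUC=0.979, respectively.

Insight: The paper highlights the importance of PKCP-MixUp for data augmentation and provides actionable insights for designing multi-stage diagnostic frameworks.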

Abstract: Pediatric liver tumors are one of the most common solid tumors in pediatrics, with differentiation of benign or malignant status and pathological classification critical for clinical treatment. While pathological examination is the gold standard, the invasive biopsy has notable limitations: the highly vascular pediatric liver and fragile tumor tissue raise complication risks such as bleeding; additionally, young children with poor compliance require anesthesia for biopsy, increasing medical costs or psychological trauma. Although many efforts have been made to utilize AI in clinical settings, most researchers have overlooked its importance in pediatric liver tumors. To establish a non-invasive examination procedure, we developed a multi-stage deep learning (DL) framework for automated pediatric liver tumor diagnosis using multi-phase contrast-enhanced CT. Two retrospective and prospective cohorts were enrolled. We established a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. We also trained a tumor detection model to extract ROIs, and then set a two-stage diagnosis pipeline with three backbones with ROI-masked images. Our tumor detection model has achieved high performance (mAP=0.871), and the first stage classification model between benign and malignant tumors reached an excellent performance (AUC=0.989). Final diagnosis models also exhibited robustness, including benign subtype classification (AUC=0.915) and malignant subtype classification (AUC=0.979). We also conducted multi-level comparative analyses, such as ablation studies on data and training pipelines, as well as Shapley-Value and CAM interpretability analyses. This framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible pediatric liver tumor diagnosis.

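[191] The Selective Disk Bispectrum and Its Inversion, with Application to Multi-Reference Alignment eess.IV | cs.CV

Adele Myers, Nina Miolane

TL;DR: The paper proposes the selective disk bispectrum, a fast rotation-invariant image representation that overcomes the lack of invertibility and the high computational complexity of the disk bispectrum in image shape analysis.

Details

Motivation: Computer vision and shape analysis tasks often need to learn from an object's shape while ignoring its orientation. Existing rotation-invariant representations tend to lack invertibility or computational efficiency, which limits their use in practical learning tasks.

Result: Experiments show that the selective disk bispectrum recovers image shape efficiently and accurately, and makes multi-reference alignment feasible where the conventional disk bispectrum was intractable.

Insight: With selective computation and an invertible formulation, the disk bispectrum becomes a practical tool for rotation-invariant shape data and provides a theoretical basis for follow-up work.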

Abstract: In many computer vision and shape analysis tasks, practitioners are interested in learning from the shape of the object in an image, while disregarding the object’s orientation. To this end, it is valuable to define a rotation-invariant representation of images, retaining all information about that image, but disregarding the way an object is rotated in the frame. To be practical for learning tasks, this representation must be computationally efficient for large datasets and invertible, so the representation can be visualized in image space. To this end, we present the selective disk bispectrum: a fast, rotation-invariant representation for image shape analysis. While the translational bispectrum has long been used as a translation-invariant representation for 1-D and 2-D signals, its extension to 2-D (disk) rotational invariance on images has been hindered by the absence of an invertible formulation and its cubic complexity. In this work, we derive an explicit inverse for the disk bispectrum, which allows us to define a “selective” disk bispectrum, which only uses the minimal number of coefficients needed for faithful shape recovery. We show that this representation enables multi-reference alignment for rotated images, a task previously intractable for disk bispectrum methods. These results establish the disk bispectrum as a practical and theoretically grounded tool for learning on rotation-invariant shape data.
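The invariance property is easiest to see in the classical 1-D translational bispectrum that the abstract cites as the precursor. The sketch below implements that 1-D case only, B[k1, k2] = F[k1] * F[k2] * conj(F[k1 + k2]) with indices taken modulo the signal length; it is not the disk bispectrum or its selective variant, and the function name is an assumption.

```python
import numpy as np

def bispectrum_1d(signal):
    """Translational bispectrum B[k1, k2] = F[k1] * F[k2] * conj(F[(k1 + k2) mod N])."""
    f = np.fft.fft(signal)
    n = len(f)
    k = np.arange(n)
    return f[:, None] * f[None, :] * np.conj(f[(k[:, None] + k[None, :]) % n])

rng = np.random.default_rng(0)
x = rng.normal(size=32)
b_original = bispectrum_1d(x)
b_shifted = bispectrum_1d(np.roll(x, 7))      # cyclic shift of the same signal
```

A cyclic shift multiplies each Fourier coefficient by a phase, and those phases cancel in the triple product, so `b_original` and `b_shifted` agree even though the Fourier coefficients themselves differ. The full 1-D bispectrum stores N^2 coefficients, which is the kind of redundancy a selective variant aims to avoid.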

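[192] DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models eess.IV | cs.CV

Jun Jia, Hongyi Miao, Yingjie Zhou, Linhan Cao, Yanwei Jiang

TL;DR: The paper proposes DLADiff, a framework that protects facial privacy with a dual-layer defense against both fine-tuning and zero-shot customization of diffusion models.

Details

Motivation: As diffusion models advance, malicious actors can use fine-tuning or zero-shot methods to generate high-fidelity synthetic images, threatening facial privacy. Most existing defenses target fine-tuning and neglect zero-shot generation.

Result: Experiments show that DLADiff significantly outperforms existing methods in defending against both fine-tuning and zero-shot generation with diffusion models.

Insight: Combining the two defense layers provides more comprehensive privacy protection, and is particularly effective against emerging zero-shot attack techniques.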

Abstract: With the rapid advancement of diffusion models, a variety of fine-tuning methods have been developed, enabling high-fidelity image generation with high similarity to the target content using only 3 to 5 training images. More recently, zero-shot generation methods have emerged, capable of producing highly realistic outputs from a single reference image without altering model weights. However, technological advancements have also introduced significant risks to facial privacy. Malicious actors can exploit diffusion model customization with just a few or even one image of a person to create synthetic identities nearly identical to the original identity. Although research has begun to focus on defending against diffusion model customization, most existing defense methods target fine-tuning approaches and neglect zero-shot generation defenses. To address this issue, this paper proposes Dual-Layer Anti-Diffusion (DLADiff) to defend against both fine-tuning methods and zero-shot methods. DLADiff contains a dual-layer protective mechanism. The first layer provides effective protection against unauthorized fine-tuning by leveraging the proposed Dual-Surrogate Models (DSUR) mechanism and Alternating Dynamic Fine-Tuning (ADFT), which integrates adversarial training with the prior knowledge derived from pre-fine-tuned models. The second layer, though simple in design, demonstrates strong effectiveness in preventing image generation through zero-shot methods. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in defending against fine-tuning of diffusion models and achieves unprecedented performance in protecting against zero-shot generation.

Table of Contents

cs.CV

[1] Personalized Reward Modeling for Text-to-Image Generation cs.CV | cs.AI

[2] SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data cs.CV | cs.AI | cs.LG

[3] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks cs.CV | cs.AI | cs.MM

[4] Tracking and Segmenting Anything in Any Modality cs.CV | cs.AI | cs.MM

[5] The Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks cs.CV | eess.IV

[6] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning cs.CV

[7] Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning cs.CV | cs.AI | cs.IT | cs.LG

[8] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning cs.CV | cs.MA

[9] Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models cs.CV

[10] MapRF: Weakly Supervised Online HD Map Construction via NeRF-Guided Self-Training cs.CV

[11] Vidi2: Large Multimodal Models for Video Understanding and Creation cs.CV

[12] Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment cs.CV | cs.AI | cs.LG | eess.IV

[13] Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration cs.CV | cs.CL | cs.DL

[14] Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation cs.CV

[15] Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment cs.CV | cs.AI | cs.LG

[16] HunyuanOCR Technical Report cs.CV | cs.AI

[17] Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach cs.CV

[18] SkillSight: Efficient First-Person Skill Assessment with Gaze cs.CV

[19] On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction cs.CV | cs.AI

[20] Navigating Gigapixel Pathology Images with Large Multimodal Models cs.CV

[21] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization cs.CV

[22] OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis cs.CV

[23] INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models cs.CV

[24] IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants cs.CV | cs.AI | cs.HC | cs.RO

[25] CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation cs.CV

[26] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models cs.CV

[27] Rethinking Vision Transformer Depth via Structural Reparameterization cs.CVPDF

[28] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans cs.CVPDF

[29] Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools cs.CV | cs.AIPDF

[30] What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities cs.CVPDF

[31] Vision–Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation cs.CVPDF

[32] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception cs.CV | cs.AI | cs.CL | cs.LGPDF

[33] MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization cs.CV | cs.AI | cs.CL | cs.LG | cs.ROPDF

[34] Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation cs.CVPDF

[35] CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding cs.CV | cs.CLPDF

[36] Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering cs.CV | cs.AI | cs.ROPDF

[37] One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer cs.CVPDF

[38] Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes cs.CVPDF

[39] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding cs.CVPDF

[40] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation cs.CVPDF

[41] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation cs.CV | cs.AIPDF

[42] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward cs.CV | cs.CLPDF

[43] 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models cs.CVPDF

[44] Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum cs.CVPDF

[45] STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction cs.CVPDF

[46] Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks cs.CVPDF

[47] GigaWorld-0: World Models as Data Engine to Empower Embodied AI cs.CV | cs.ROPDF

[48] ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images cs.CVPDF

[49] Distilling Cross-Modal Knowledge via Feature Disentanglement cs.CV | cs.AIPDF

[50] VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering cs.CVPDF

[51] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning cs.CV | cs.AIPDF

[52] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition cs.CVPDF

[53] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance cs.CVPDF

[54] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving cs.CV | cs.ROPDF

[55] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects cs.CVPDF

[56] HybriDLA: Hybrid Generation for Document Layout Analysis cs.CVPDF

[57] Intelligent Image Search Algorithms Fusing Visual Large Models cs.CVPDF

[58] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos cs.CVPDF

[59] Low-Resolution Editing is All You Need for High-Resolution Editing cs.CVPDF

[60] Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting cs.CVPDF

[61] MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing cs.CV | cs.AIPDF

[62] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning cs.CVPDF

[63] Boosting Reasoning in Large Multimodal Models via Activation Replay cs.CVPDF

[64] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback cs.CV | cs.AIPDF

[65] SONIC: Spectral Optimization of Noise for Inpainting with Consistency cs.CVPDF

[66] GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR cs.CVPDF

[67] OmniRefiner: Reinforcement-Guided Local Diffusion Refinement cs.CVPDF

[68] CREward: A Type-Specific Creativity Reward Model cs.CVPDF

[69] On the Feasibility of Hijacking MLLMs’ Decision Chain via One Perturbation cs.CV | cs.AI | cs.CRPDF

[70] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network cs.CV | cs.AIPDF

[71] ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction cs.CVPDF

[72] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving cs.CV | cs.AIPDF

[73] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention cs.CVPDF

[74] MFM-point: Multi-scale Flow Matching for Point Cloud Generation cs.CV | cs.AI | cs.LGPDF

[75] DeLightMono: Enhancing Self-Supervised Monocular Depth Estimation in Endoscopy by Decoupling Uneven Illumination cs.CVPDF

[76] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images cs.CVPDF

[77] Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding cs.CVPDF

[78] Explainable Visual Anomaly Detection via Concept Bottleneck Models cs.CV | cs.AIPDF