Table of Contents

cs.CV

[1] AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance cs.CV | cs.AI | cs.RO

Tianling Xu, Shengzhe Gan, Leslie Gu, Yuelei Li, Fangneng Zhan

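TL;DR: AREA3D is an active 3D reconstruction agent that combines feed-forward 3D perception with vision-language guidance to select viewpoints efficiently without hand-crafted geometric heuristics, improving reconstruction quality.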


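Motivation: Existing active reconstruction methods rely on hand-crafted geometric heuristics, which lead to redundant observations and only limited gains in reconstruction quality. The authors address this by combining feed-forward 3D models with vision-language guidance.

Result: On scene-level and object-level benchmarks, AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime.

Insight: Decoupling uncertainty modeling from the reconstructor and combining it with semantic guidance is key to improving the efficiency and quality of active 3D reconstruction.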

Abstract: Active 3D reconstruction enables an agent to autonomously select viewpoints to efficiently obtain accurate and complete scene geometry, rather than passively reconstructing scenes from pre-collected images. However, existing active reconstruction methods often rely on hand-crafted geometric heuristics, which can lead to redundant observations without substantially improving reconstruction quality. To address this limitation, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D reconstruction models and vision-language guidance. Our framework decouples view-uncertainty modeling from the underlying feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. In addition, an integrated vision-language model provides high-level semantic guidance, encouraging informative and diverse viewpoints beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime. Code will be made available at: https://github.com/TianlingXu/AREA3D .


[2] ChromouVQA: Benchmarking Vision-Language Models under Chromatic Camouflaged Images cs.CV | cs.AI

Yunfei Zhang, Yizhuo He, Yuanxun Shao, Zhengtao Yao, Haoyan Xu

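TL;DR: ChromouVQA is a large-scale, multi-task benchmark built on chromatic camouflaged images for evaluating vision-language models on targets embedded in cluttered backgrounds.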


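Motivation: Vision-language models have advanced multimodal understanding but still struggle when targets are embedded in cluttered backgrounds. The authors use chromatic camouflaged images to test this ability.

Result: Experiments reveal large gaps between humans and vision-language models, especially under subtle chromatic contrast or disruptive geometric fills; the proposed contrastive recipe improves recovery of global shapes.

Insight: Chromatic camouflage and geometric distractors place higher demands on vision-language models, and contrastive learning helps improve model performance on such tasks.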

Abstract: Vision-Language Models (VLMs) have advanced multimodal understanding, yet still struggle when targets are embedded in cluttered backgrounds requiring figure-ground segregation. To address this, we introduce ChromouVQA, a large-scale, multi-task benchmark based on Ishihara-style chromatic camouflaged images. We extend classic dot plates with multiple fill geometries and vary chromatic separation, density, size, occlusion, and rotation, recording full metadata for reproducibility. The benchmark covers nine vision-question-answering tasks, including recognition, counting, comparison, and spatial reasoning. Evaluations of humans and VLMs reveal large gaps, especially under subtle chromatic contrast or disruptive geometric fills. We also propose a model-agnostic contrastive recipe aligning silhouettes with their camouflaged renderings, improving recovery of global shapes. ChromouVQA provides a compact, controlled benchmark for reproducible evaluation and extension. Code and dataset are available at https://github.com/Chromou-VQA-Benchmark/Chromou-VQA.


[3] Spatiotemporal Satellite Image Downscaling with Transfer Encoders and Autoregressive Generative Models cs.CV | cs.LG | stat.ML

Yang Xiang, Jingwen Zhong, Yige Yan, Petros Koutrakis, Eric Garshick

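TL;DR: The paper proposes a satellite image downscaling framework based on transfer learning and diffusion-based generation, combining a lightweight U-Net transfer encoder with a diffusion model to reconstruct high-resolution satellite images from low-resolution inputs.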


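Motivation: Existing satellite image downscaling methods face challenges of computational efficiency and physical consistency on long time series and large regions. The paper aims for a method that learns spatiotemporal representations efficiently while keeping the downscaled outputs physically consistent.

Result: Across seasonal regional splits, the model performs well (R² = 0.65 to 0.94), outperforming deterministic U-Nets, variational autoencoders, and other baselines; several evaluations confirm the physical consistency of the results.

Insight: Transfer learning combined with diffusion-based generation provides an efficient and physically consistent approach to downscaling long time series of low-resolution images, with significance for environmental monitoring.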

Abstract: We present a transfer-learning generative downscaling framework to reconstruct fine-resolution satellite images from coarse-scale inputs. Our approach combines a lightweight U-Net transfer encoder with a diffusion-based generative model. The simpler U-Net is first pretrained on a long time series of coarse-resolution data to learn spatiotemporal representations; its encoder is then frozen and transferred to a larger downscaling model as physically meaningful latent features. Our application uses NASA’s MERRA-2 reanalysis as the low-resolution source domain (50 km) and the GEOS-5 Nature Run (G5NR) as the high-resolution target (7 km). Our study area covered a large region of Asia, which was made computationally tractable by splitting it into two subregions and four seasons. A domain similarity analysis using Wasserstein distances confirmed minimal distributional shift between MERRA-2 and G5NR, validating the safety of parameter-frozen transfer. Across seasonal regional splits, our model achieved excellent performance (R² = 0.65 to 0.94), outperforming comparison models including deterministic U-Nets, variational autoencoders, and prior transfer learning baselines. Out-of-data evaluations using semivariograms, ACF/PACF, and lag-based RMSE/R² demonstrated that the predicted downscaled images preserved physically consistent spatial variability and temporal autocorrelation, enabling stable autoregressive reconstruction beyond the G5NR record. These results show that transfer-enhanced diffusion models provide a robust and physically coherent solution for downscaling a long time series of coarse-resolution images with limited training periods. This advancement has significant implications for improving environmental exposure assessment and long-term environmental monitoring.


[4] FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation cs.CV | cs.AI

Georges Le Bellier, Nicolas Audebert

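TL;DR: FlowEO is a generative unsupervised domain adaptation framework that uses flow matching to learn a semantics-preserving mapping from the source to the target domain, improving classification and semantic segmentation on Earth observation data.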


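Motivation: Earth observation data are heterogeneous: differences in sensors, geographic regions, acquisition times, and atmospheric conditions cause distribution shifts that limit the generalization of pretrained models. Unsupervised domain adaptation (UDA) is key to addressing this.

Result: FlowEO outperforms existing image translation methods across multiple test scenarios while achieving on-par or better perceptual image quality, showing the potential of flow matching for UDA in remote sensing.

Insight: Flow matching can preserve semantic information in domain adaptation and generate high-quality cross-domain images, offering a new technical route for remote sensing analysis.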

Abstract: The increasing availability of Earth observation data offers unprecedented opportunities for large-scale environmental monitoring and analysis. However, these datasets are inherently heterogeneous, stemming from diverse sensors, geographical regions, acquisition times, and atmospheric conditions. Distribution shifts between training and deployment domains severely limit the generalization of pretrained remote sensing models, making unsupervised domain adaptation (UDA) crucial for real-world applications. We introduce FlowEO, a novel framework that leverages generative models for image-space UDA in Earth observation. We leverage flow matching to learn a semantically preserving mapping that transports from the source to the target image distribution. This allows us to tackle challenging domain adaptation configurations for classification and semantic segmentation of Earth observation images. We conduct extensive experiments across four datasets covering adaptation scenarios such as SAR to optical translation and temporal and semantic shifts caused by natural disasters. Experimental results demonstrate that FlowEO outperforms existing image translation approaches for domain adaptation while achieving on-par or better perceptual image quality, highlighting the potential of flow-matching-based UDA for remote sensing.
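As a rough illustration of the flow-matching objective mentioned in the abstract, the sketch below builds one training pair under the common linear-interpolation formulation. This is a generic textbook form, not FlowEO's confirmed implementation; the function names and the toy arrays standing in for source and target images are hypothetical.

```python
import numpy as np

def flow_matching_pair(x_source, x_target, t):
    """Return the interpolated sample x_t and the regression target velocity.

    Assumes the linear path x_t = (1 - t) * x_source + t * x_target,
    whose time derivative is the constant velocity x_target - x_source.
    """
    x_t = (1.0 - t) * x_source + t * x_target
    velocity = x_target - x_source
    return x_t, velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

rng = np.random.default_rng(0)
source_image = rng.random((4, 4))  # stand-in for a source-domain image
target_image = rng.random((4, 4))  # stand-in for a target-domain image
x_t, target_velocity = flow_matching_pair(source_image, target_image, t=0.25)
```

A network trained to predict `target_velocity` from `x_t` and `t` can then be integrated from t = 0 to t = 1 to transport a source sample toward the target distribution.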


[5] Self-Improving VLM Judges Without Human Annotations cs.CV

Inna Wanyin Lin, Yushi Hu, Shuyue Stella Li, Scott Geng, Pang Wei Koh

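TL;DR: The paper proposes a self-training framework that needs no human annotations, improving the judging ability of vision-language models (VLMs) through self-synthesized data and iterative training; the resulting judge is markedly more accurate and often outperforms much larger models.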


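Motivation: Training VLM judges currently depends on large-scale human preference annotations, which are costly and quickly become obsolete. The authors therefore propose an annotation-free self-training method that keeps pace with rapid model iteration.

Result: Overall accuracy on VL-RewardBench rises from 0.38 to 0.51, often outperforming larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet.

Insight: Annotation-free self-training shows the potential of self-improving judge models that adapt to rapid VLM iteration, with particularly strong gains in the general, hallucination, and reasoning dimensions.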

Abstract: Effective judges of Vision-Language Models (VLMs) are crucial for model development. Current methods for training VLM judges mainly rely on large-scale human preference annotations. However, such an approach is costly, and the annotations easily become obsolete as models rapidly improve. In this work, we present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method is iterative and has three stages: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments for each pair, removing the ones that do not match our expected quality levels, and (3) train on correct judge answers and their reasoning traces. We evaluate the resulting judge on Multimodal RewardBench and VL-RewardBench across domains: correctness, preference, reasoning, safety, and visual question-answering. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on VL-RewardBench, often outperforming much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in general, hallucination, and reasoning dimensions. The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.


[6] TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows cs.CV

Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin

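TL;DR: TwinFlow is a simple yet effective framework for training one-step generative models that avoids both fixed pretrained teacher models and standard adversarial networks, making it suitable for large-scale, efficient models.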


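Motivation: Multi-step multimodal generative models such as diffusion and flow matching are inefficient at inference (40-100 function evaluations). Existing few-step methods such as distillation and adversarial training suffer from complex iterative procedures, performance degradation, or training instability.

Result: At 1-NFE, TwinFlow matches the original 100-NFE model on the GenEval and DPG-Bench benchmarks while reducing computational cost by 100×.

Insight: TwinFlow shows the potential for efficient few-step generation on large models without the complexity and instability of conventional methods.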

Abstract: Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B, transforming it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100× with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.


[7] Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning cs.CV | cs.AI

Wentao Wang, Chunyang Liu, Kehua Sheng, Bo Zhang, Yan Wang

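TL;DR: The paper proposes Semore, a framework that uses a VLM to enhance semantic and motion representations in visual reinforcement learning, with a dual-path backbone and a pretrained CLIP model for text-image alignment, thereby improving decision-making.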


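Motivation: Existing LLM-based RL methods focus on guiding the control policy and are limited by the representations of their backbone networks. Semore aims to extract richer semantic and motion representations using a VLM and a dual-path backbone.

Result: Experiments show that, with VLM guidance at the feature level, Semore is efficient and adaptive compared with state-of-the-art methods.

Insight: Combining a VLM with a dual-path backbone extracts and fuses semantic and motion representations more effectively, improving decision-making in visual reinforcement learning.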

Abstract: The growing exploration of Large Language Models (LLM) and Vision-Language Models (VLM) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on the guidance of control policy and encounter the challenge of limited representations of the backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which can simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows. Semore utilizes VLM with common-sense knowledge to retrieve key information from observations, while using the pre-trained CLIP to achieve the text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach to simultaneously guide the extraction of semantics and motion, while allowing them to interact spontaneously. Extensive experiments demonstrate that, under the guidance of VLM at the feature level, our method exhibits efficient and adaptive ability compared to state-of-the-art methods. All code is released.


[8] Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models cs.CV | cs.GR | cs.LG

Rowan Bradbury, Dazhi Zhong

TL;DR: The paper proposes the Pixel-Equivalent Latent Compositing (PELC) principle and realizes pixel-equivalent latent-space editing with DecFormer, substantially improving diffusion-model inpainting.


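Motivation: Latent inpainting in diffusion models usually relies on linear interpolation under a downsampled mask, which causes artifacts at mask seams and global color shifts. The authors propose PELC so that latent-space editing matches editing in pixel space.

Result: On the FLUX.1 family, DecFormer reduces error around mask edges by up to 53% and restores global color consistency, soft-mask support, and high-fidelity masking.

Insight: Latent-space editing should be equivalent to pixel-space editing, and because modern VAEs capture global context, simple linear interpolation cannot achieve this.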

Abstract: Latent inpainting in diffusion models still relies almost universally on linearly interpolating VAE latents under a downsampled mask. We propose a key principle for compositing image latents: Pixel-Equivalent Latent Compositing (PELC). An equivalent latent compositor should be the same as compositing in pixel space. This principle enables full-resolution mask control and true soft-edge alpha compositing, even though VAEs compress images 8x spatially. Modern VAEs capture global context beyond patch-aligned local structure, so linear latent blending cannot be pixel-equivalent: it produces large artifacts at mask seams and global degradation and color shifts. We introduce DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and an off-manifold residual correction to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning and adds only 0.07% of FLUX.1-Dev’s parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing, as we demonstrate on a complex color-correction task.
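To make the contrast in the abstract concrete, the sketch below shows pixel-space alpha compositing next to the baseline of linearly blending latents under a downsampled mask. It is a minimal illustration, not the paper's code: the function names, the average-pooling choice for downsampling, and the (C, H/8, W/8) latent layout are assumptions. PELC asks that decoding the composited latent match `pixel_alpha_composite` applied to the decoded images, which the linear baseline does not guarantee.

```python
import numpy as np

def pixel_alpha_composite(foreground, background, mask):
    """Pixel-space alpha compositing: mask=1 keeps foreground, mask=0 keeps background."""
    return mask * foreground + (1.0 - mask) * background

def downsample_mask(mask, factor=8):
    """Average-pool an (H, W) mask by `factor` to reach latent resolution."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def linear_latent_blend(latent_fg, latent_bg, mask, factor=8):
    """Baseline criticised in the abstract: blend latents under a downsampled mask."""
    m = downsample_mask(mask, factor)[None, :, :]  # broadcast over latent channels
    return m * latent_fg + (1.0 - m) * latent_bg
```

Note that the downsampled mask discards sub-block detail: any mask edge that falls inside an 8x8 block becomes a single fractional weight per latent position.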


[9] IE2Video: Adapting Pretrained Diffusion Models for Event-Based Video Reconstruction cs.CV

Dmitrii Torbunov, Onur Okuducu, Yi Huang, Odera Dim, Rebecca Coles

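TL;DR: The paper proposes a hybrid capture paradigm that combines sparse RGB keyframes with event camera data and reconstructs full RGB video offline to reduce power consumption. It introduces the IE2Video task and compares two architectural strategies, an autoregressive model and an approach based on a pretrained text-to-video diffusion model; the latter gives better perceptual quality.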


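Motivation: Conventional RGB cameras consume substantial power in continuous video monitoring, while event cameras are low-power but output asynchronous event streams. The paper combines the strengths of both for low-power video reconstruction.

Result: The diffusion model achieves 33% better perceptual quality than the autoregressive model (LPIPS: 0.283 vs 0.422) and is robust across datasets and sequence lengths.

Insight: Diffusion models show strong potential for event-driven video reconstruction, especially when pretrained models are combined with low-rank adaptation.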

Abstract: Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline – reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.
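The 33% figure is consistent with reading it as the relative reduction in LPIPS (a lower-is-better metric) from the two reported scores. The short check below assumes that interpretation; the helper name is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Relative reduction for a lower-is-better metric such as LPIPS."""
    return (baseline - improved) / baseline

# Reported LPIPS: autoregressive baseline 0.422, diffusion-based approach 0.283.
gain = relative_improvement(baseline=0.422, improved=0.283)
print(f"{gain:.1%}")  # 32.9%
```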


[10] Inferring Compositional 4D Scenes without Ever Seeing One cs.CV

Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel

TL;DR: COM4D reconstructs complete 4D scenes from monocular videos using spatial and temporal attention, without requiring 4D compositional training data.


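Motivation: Real-world scenes usually contain several static and dynamic objects, but existing methods often handle one object at a time or rely on category-specific shape models, giving inconsistent and limited scene configurations.

Result: COM4D achieves state-of-the-art results on existing 4D object and composed 3D reconstruction tasks.

Insight: Disentangling spatial and temporal learning removes the need for 4D compositional data and allows complex 4D scenes to be reconstructed directly from video.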

Abstract: Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.


[11] From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model cs.CV | cs.AI

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami

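TL;DR: The paper presents TAD, a benchmark focused on temporal understanding in autonomous driving (AD), and evaluates existing VLMs on it; to improve performance, it proposes two training-free solutions, Scene-CoT and TCogMap, which raise accuracy substantially.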


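Motivation: Temporal understanding in autonomous driving is an important but under-studied challenge, and existing benchmarks do not target the specific needs of AD. TAD aims to fill this gap and stimulate related research.

Result: Current SoTA models perform poorly on TAD, while Scene-CoT and TCogMap improve average accuracy by up to 17.72%.

Insight: 1. Existing VLMs still fall short on temporal understanding in AD; 2. mechanisms such as cognitive maps can significantly improve model performance. The benchmark and code are open source.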

Abstract: Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs’ ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT), and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.


[12] SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling cs.CV | cs.AI

Elisabetta Fedele, Francis Engelmann, Ian Huang, Or Litany, Marc Pollefeys

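TL;DR: SpaceControl is a training-free, test-time spatial control method that gives precise control over 3D generative models through geometric inputs (from coarse primitives to detailed meshes), balancing geometric fidelity and visual quality.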


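Motivation: Existing 3D generation methods mostly rely on text or image prompts, which lack geometric specificity: language is ambiguous and images are cumbersome to edit.

Result: Quantitative evaluation and user studies show that SpaceControl outperforms training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality.

Insight: An interactive interface supports online editing of superquadrics and direct conversion into textured 3D assets, making the method practical for creative workflows.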

Abstract: Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are cumbersome to edit. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern pre-trained generative models without requiring any additional training. A controllable parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets, facilitating practical deployment in creative workflows. Find our project page at https://spacecontrol3d.github.io/


[13] SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training cs.CV | cs.GR

Yang Zheng, Hao Tan, Kai Zhang, Peng Wang, Leonidas Guibas

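TL;DR: SplatPainter is an interactive method for editing 3D Gaussian assets that directly predicts updates to Gaussian attributes from 2D view inputs and uses Test-Time Training to enable an efficient, precise editing workflow.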


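Motivation: Existing 3D Gaussian editing methods (based on diffusion or optimization) are slow, destroy the original asset's identity, or lack fine-grained control, so an efficient interactive editing solution is needed.

Result: The method performs high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds.

Insight: Incorporating Test-Time Training gives 3D content authoring an efficient iterative workflow and extends the practicality and flexibility of 3D Gaussian editing.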

Abstract: The rise of 3D Gaussian Splatting has revolutionized photorealistic 3D asset creation, yet a critical gap remains for their interactive refinement and editing. Existing approaches based on diffusion or optimization are ill-suited for this task, as they are often prohibitively slow, destructive to the original asset’s identity, or lack the precision for fine-grained control. To address this, we introduce SplatPainter, a state-aware feedforward model that enables continuous editing of 3D Gaussian assets from user-provided 2D view(s). Our method directly predicts updates to the attributes of a compact, feature-rich Gaussian representation and leverages Test-Time Training to create a state-aware, iterative workflow. The versatility of our approach allows a single architecture to perform diverse tasks, including high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds, paving the way for fluid and intuitive 3D content authoring.


[14] Group Orthogonal Low-Rank Adaptation for RGB-T Tracking cs.CV

Zekai Shao, Yufan Hu, Jingyuan Liu, Bin Fan, Hongmin Liu

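TL;DR: The paper proposes the Group Orthogonal Low-Rank Adaptation (GOLA) framework, which uses structured parameter learning to exploit the rank space effectively, reducing redundancy and strengthening feature representation to improve RGB-T tracking.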


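Motivation: Existing low-rank adaptation methods show significant redundancy in the rank space, which limits the model's ability to learn diverse knowledge for the varied challenges of RGB-T tracking.

Result: GOLA significantly outperforms existing methods on four benchmark datasets, validating its effectiveness for RGB-T tracking.

Insight: Structured use of the rank space together with orthogonal constraints reduces parameter redundancy and improves model diversity and adaptability.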

Abstract: Parameter-efficient fine-tuning has emerged as a promising paradigm in RGB-T tracking, enabling downstream task adaptation by freezing pretrained parameters and fine-tuning only a small set of parameters. This set forms a rank space made up of multiple individual ranks, whose expressiveness directly shapes the model’s adaptability. However, quantitative analysis reveals low-rank adaptation exhibits significant redundancy in the rank space, with many ranks contributing almost no practical information. This hinders the model’s ability to learn more diverse knowledge to address the various challenges in RGB-T tracking. To address this issue, we propose the Group Orthogonal Low-Rank Adaptation (GOLA) framework for RGB-T tracking, which effectively leverages the rank space through structured parameter learning. Specifically, we adopt a rank decomposition partitioning strategy utilizing singular value decomposition to quantify rank importance, freeze crucial ranks to preserve the pretrained priors, and cluster the redundant ranks into groups to prepare for subsequent orthogonal constraints. We further design an inter-group orthogonal constraint strategy. This constraint enforces orthogonality between rank groups, compelling them to learn complementary features that target diverse challenges, thereby alleviating information redundancy. Experimental results demonstrate that GOLA effectively reduces parameter redundancy and enhances feature representation capabilities, significantly outperforming state-of-the-art methods across four benchmark datasets and validating its effectiveness in RGB-T tracking tasks.
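The two ingredients named in the abstract, SVD-based rank importance and an inter-group orthogonality constraint, can be sketched generically as follows. This is one plausible formulation under stated assumptions, not GOLA's actual loss: the function names, the use of singular values as the importance score, and the squared-Frobenius penalty are all illustrative choices.

```python
import numpy as np

def rank_importance(weight):
    """Singular values of a weight matrix, used here as a proxy for rank importance."""
    return np.linalg.svd(weight, compute_uv=False)

def inter_group_orthogonality_penalty(groups):
    """Sum of squared entries of the cross-group Gram matrices.

    Each group is an (r_g, d) array whose rows are rank directions. The penalty
    is zero exactly when every row of one group is orthogonal to every row of
    every other group.
    """
    penalty = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            cross = groups[i] @ groups[j].T  # (r_i, r_j) inner products
            penalty += float(np.sum(cross ** 2))
    return penalty
```

Adding such a penalty to a training loss pushes different groups toward complementary directions, which is the stated purpose of the constraint.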


[15] PoolNet: Deep Learning for 2D to 3D Video Process Validation cs.CV | cs.LG

Sanchit Kaul, Joseph Luna, Shray Arora

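TL;DR: PoolNet is a deep learning framework that validates whether scenes are suitable for 2D-to-3D video processing, identifying scenes fit for Structure-from-Motion (SfM) and significantly cutting the time cost of conventional algorithms.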


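Motivation: Conventional SfM processing is time-consuming and computationally expensive, and publicly available data are often unfit for processing because of inadequate camera pose variation, occluded scene elements, or noise. PoolNet addresses this with an efficient data validation method.

Result: Experiments show that PoolNet effectively separates scenes fit for SfM from unfit ones and substantially reduces processing time.

Insight: Deep learning can replace or assist time-consuming conventional SfM preprocessing, giving 3D reconstruction a more efficient pipeline.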

Abstract: Lifting Structure-from-Motion (SfM) information from sequential and non-sequential image data is a time-consuming and computationally expensive task. In addition to this, the majority of publicly available data is unfit for processing due to inadequate camera pose variation, obscuring scene elements, and noisy data. To solve this problem, we introduce PoolNet, a versatile deep learning framework for frame-level and scene-level validation of in-the-wild data. We demonstrate that our model successfully differentiates SfM ready scenes from those unfit for processing while significantly undercutting the amount of time state of the art algorithms take to obtain structure-from-motion data.


[16] ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration cs.CV

Yingjie Xia, Tao Liu, Jinglei Shi, Qingsong Xie, Heng Guo

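TL;DR: ShaRP is an improved attention-based pruning framework that uses segment-aware causal masking, positional debiasing, and token deduplication to accelerate inference in video large language models while keeping performance stable at high compression rates.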


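Motivation: Video large language models face a heavy computational load in the pre-filling stage because of the large number of visual tokens. Existing attention-based pruning performs poorly at shallow decoder layers, with significant degradation at high compression rates.

Result: ShaRP achieves competitive performance on multiple video understanding benchmarks, validating its effectiveness and stability at high compression rates.

Insight: The limits of shallow-layer pruning may stem from positional encoding bias and insufficient information interaction; targeted fixes can significantly improve pruning.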

Abstract: Video Large Language Models (VLLMs) face the challenge of high computational load during the pre-filling stage due to the processing of an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, trials at early decoder layers often result in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
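For readers unfamiliar with attention-based pruning, the sketch below shows only the generic baseline the abstract builds on: keep the visual tokens with the highest attention scores. It does not implement ShaRP's segment-aware causal masking, positional debiasing, or token deduplication, and the function name and score source are assumptions.

```python
import numpy as np

def prune_visual_tokens(tokens, attention_scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order.

    tokens: (N, D) array of visual token embeddings.
    attention_scores: (N,) array, e.g. attention received from text tokens.
    keep_ratio: fraction of tokens to retain (0 < keep_ratio <= 1).
    """
    num_keep = max(1, int(round(len(tokens) * keep_ratio)))
    top = np.argsort(attention_scores)[-num_keep:]  # indices of highest scores
    keep = np.sort(top)                             # restore temporal/spatial order
    return tokens[keep], keep
```

A keep ratio of 0.1 corresponds to a 90% compression rate; the abstract's argument is that scores taken from shallow layers are biased, so this naive selection degrades there.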


[17] LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models cs.CV

Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu

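TL;DR: The paper proposes LoC-Path, an efficient multimodal large language model framework that cuts computational cost by reducing redundant features while maintaining accuracy in pathology image diagnosis.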


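Motivation: Understanding pathology whole slide images (WSIs) is highly challenging because of their gigapixel scale and the extreme sparsity of diagnostically relevant regions. Existing methods rely on heavy slide-level encoders with high computational cost, whereas LoC-Path reduces redundancy to process task-relevant regions efficiently.

Result: Experiments show that LoC-Path significantly lowers computation and memory requirements while achieving performance comparable to existing state-of-the-art methods.

Insight: Pathology images contain many redundant features, and attending to a small number of task-relevant regions suffices for efficient diagnosis, which suggests a direction for designing efficient medical vision models.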

Abstract: Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.


[18] Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability cs.CV

Shizhan Liu, Xinran Deng, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu

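TL;DR: The paper studies latent spectral biasing for video VAEs. A statistical analysis identifies the importance of a low-frequency bias and a channel eigenspectrum dominated by a few modes, and two lightweight regularizers are proposed that significantly improve the convergence speed and reward of text-to-video generation.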


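Motivation: Existing video VAEs focus mainly on reconstruction fidelity and overlook how latent structure affects diffusion training, making training difficult and inefficient.

Result: Experiments show that SSVAE achieves a 3× convergence speedup in text-to-video generation and a 10% gain in video reward, outperforming existing open-source VAEs.

Insight: The spectral structure of the latent space is critical to diffusion training, and lightweight regularization can improve this structure and thereby generation efficiency and quality.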

Abstract: Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3× speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
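One simple way to quantify the first property, a spectrum biased toward low frequencies, is the share of spectral energy below a cutoff frequency. The sketch below does this for a single 2D latent slice; it is a spatial-only simplification of the spatio-temporal analysis described in the abstract, and the function name and the 0.25 cutoff are assumptions.

```python
import numpy as np

def low_frequency_energy_fraction(latent, cutoff=0.25):
    """Fraction of spectral energy at radial frequencies at or below `cutoff`.

    `latent` is a 2D array (one channel of one latent frame). Frequencies are
    in cycles per sample, so the per-axis Nyquist limit is 0.5.
    """
    spectrum = np.abs(np.fft.fft2(latent)) ** 2
    fy = np.fft.fftfreq(latent.shape[0])[:, None]
    fx = np.fft.fftfreq(latent.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(spectrum[radius <= cutoff].sum() / spectrum.sum())
```

A smooth latent gives a value near 1, while a latent dominated by pixel-level alternation gives a value near 0.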


[19] The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos cs.CVPDF

Zhuoyuan Wu, Xurui Yang, Jiahui Huang, Yue Wang, Jun Gao

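TL;DR: The paper proposes the Dynamic Prior, which uses the reasoning of vision-language models (VLMs) and the fine-grained segmentation of SAM2 to identify dynamic objects without task-specific training, improving the accuracy and robustness of 3D scene understanding.

Details

Motivation: Classical structure-from-motion pipelines struggle with 3D reconstruction when dynamic objects are present, and existing learning-based methods depend on large-scale motion segmentation datasets, which limits their performance.

Result: Experiments on synthetic and real-world videos show that the method achieves state-of-the-art motion segmentation and significantly improves structural 3D understanding.

Insight: Vision-language models and multimodal tools show promise for identifying dynamic objects in 3D scenes, suggesting new directions for unsupervised or weakly supervised methods.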

Abstract: Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure from motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. The Dynamic Prior can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that the Dynamic Prior not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.


[20] YOLO and SGBM Integration for Autonomous Tree Branch Detection and Depth Estimation in Radiata Pine Pruning Applications cs.CVPDF

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

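TL;DR: The paper presents a computer vision framework that combines YOLO object detection with SGBM stereo vision for autonomous drone-based pruning of radiata pine branches, removing the need for expensive LiDAR sensors.

Details

Motivation: Manual pruning of radiata pine poses safety risks at height and on difficult terrain, so a low-cost, efficient autonomous pruning solution is needed.

Result: YOLO outperforms Mask R-CNN on branch segmentation (82.0% mask mAP50-95); the system localizes branches accurately within a 2 m range, with processing times under one second per frame.

Insight: The work shows that efficient autonomous pruning is feasible with a stereo camera alone, offering an economically viable route to safer and more efficient commercial forestry.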

Abstract: Manual pruning of radiata pine trees poses significant safety risks due to extreme working heights and challenging terrain. This paper presents a computer vision framework that integrates YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision for autonomous drone-based pruning operations. Our system achieves precise branch detection and depth estimation using only stereo camera input, eliminating the need for expensive LiDAR sensors. Experimental evaluation demonstrates YOLO’s superior performance over Mask R-CNN, achieving 82.0% mask mAP50-95 for branch segmentation. The integrated system accurately localizes branches within a 2 m operational range, with processing times under one second per frame. These results establish the feasibility of cost-effective autonomous pruning systems that enhance worker safety and operational efficiency in commercial forestry.
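The depth step can be illustrated with the standard pinhole stereo relation Z = f * B / d, which is how a disparity map from a block matcher such as SGBM is usually converted to metric depth. The sketch below is not the authors' code: the function names, the median aggregation over a detected branch region, and the camera parameters in the usage note are assumptions made for illustration. Only the 2 m operational range comes from the abstract.

```python
import statistics


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert one stereo disparity (pixels) to depth (metres): Z = f * B / d."""
    if disparity_px <= 0:
        return None  # non-positive disparity means no valid stereo match
    return focal_px * baseline_m / disparity_px


def branch_depth(disparities, focal_px, baseline_m, max_range_m=2.0):
    """Median depth over the pixels of one detected branch (hypothetical helper).

    `disparities` stands for the disparity values inside a detection mask.
    Returns None when no pixel is valid or the branch lies beyond the range.
    """
    depths = []
    for d in disparities:
        z = disparity_to_depth(d, focal_px, baseline_m)
        if z is not None:
            depths.append(z)
    if not depths:
        return None
    z_med = statistics.median(depths)
    return z_med if z_med <= max_range_m else None
```

For example, with an assumed focal length of 700 px and a 0.1 m baseline, a disparity of 70 px corresponds to a depth of about 1 m. Taking the median rather than the mean is a design choice here, intended to damp the effect of mismatched pixels along thin branch edges.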


[21] Moving object detection from multi-depth images with an attention-enhanced CNN cs.CV | cs.AI | cs.LGPDF

Masato Shibukawa, Fumi Yoshida, Toshifumi Yanagisawa, Takashi Ito, Hirohisa Kurosaki

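TL;DR: The paper proposes a multi-input convolutional neural network with a convolutional block attention module for detecting moving solar-system objects in multi-depth images, markedly improving detection accuracy and reducing the need for manual verification.

Details

Motivation: Conventional moving-object detection relies on verification by eye, which is labor-intensive. The paper aims to use deep learning to reduce manual intervention and improve detection efficiency and accuracy.

Result: On a dataset of approximately 2,000 observational images, the model reaches nearly 99% accuracy with an AUC above 0.99 and reduces human workload by more than 99%.

Insight: Combining an attention mechanism with a multi-input architecture can markedly improve the robustness and efficiency of moving-object detection.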

Abstract: One of the greatest challenges for detecting moving objects in the solar system from wide-field survey data is determining whether a signal indicates a true object or is due to some other source, like noise. Object verification has relied heavily on human eyes, which usually results in significant labor costs. In order to address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an AUC (area under the curve) above 0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.


[22] Performance Evaluation of Deep Learning for Tree Branch Segmentation in Autonomous Forestry Systems cs.CVPDF

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

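TL;DR: The paper evaluates several deep learning methods for tree branch segmentation at three resolutions, finding that U-Net with a MiT-B4 backbone performs best at low resolution while PSPNet is the most efficient.

Details

Motivation: Autonomous forestry systems need fast, precise tree branch segmentation across varying pixel resolutions and operating conditions to ensure safe and efficient automated pruning and navigation.

Result: U-Net with MiT-B4 performs best at low resolution, while PSPNet is the most efficient option at the cost of accuracy.

Insight: At high resolution, accuracy and boundary quality become more important, whereas at low resolution efficiency and speed are the key trade-off.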

Abstract: UAV-based autonomous forestry operations require rapid and precise tree branch segmentation for safe navigation and automated pruning across varying pixel resolutions and operational conditions. We evaluate different deep learning methods at three resolutions (256x256, 512x512, 1024x1024) using the Urban Street Tree Dataset, employing standard metrics (IoU, Dice) and specialized measures including Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). Among 22 configurations tested, U-Net with MiT-B4 backbone achieves strong performance at 256x256. At 512x512, MiT-B4 leads in IoU, Dice, TS-IoU, and Boundary-F1. At 1024x1024, U-Net+MiT-B3 shows the best validation performance for IoU/Dice and precision, while U-Net++ excels in boundary quality. PSPNet provides the most efficient option (2.36/9.43/37.74 GFLOPs) with 25.7/19.6/11.8 percentage point IoU reductions compared to top performers at respective resolutions. These results establish multi-resolution benchmarks for accuracy-efficiency trade-offs in embedded forestry systems. Implementation is available at https://github.com/BennyLinntu/PerformanceTreeBranchSegmentation.
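The two standard metrics named in the abstract can be computed directly from binary masks. The following is a minimal sketch using the textbook definitions IoU = TP / (TP + FP + FN) and Dice = 2TP / (2TP + FP + FN); the function name and the nested-list mask format are assumptions, and the specialized TS-IoU and CPR measures are not reproduced because the abstract does not define them.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as equal-sized nested lists."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # branch pixel predicted and present
            elif p and not t:
                fp += 1  # predicted branch, actually background
            elif t and not p:
                fn += 1  # missed branch pixel
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated here as a perfect match
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice
```

Because Dice = 2 * IoU / (1 + IoU), the two metrics always rank a pair of predictions on the same image identically, which is why benchmarks for thin structures typically add boundary or connectivity measures alongside them.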


[23] ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction cs.CVPDF

Jiangtong Tan, Lin Liu, Jie Huanng, Xiaopeng Zhang, Qi Tian

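TL;DR: ParaUni proposes a unified multimodal model with parallel information interaction, improving generation quality through a Layer-wise Dynamic Adjustment Mechanism (LDAM) and reinforcement learning (RL).

Details

Motivation: Existing unified multimodal models struggle to balance sufficient information interaction with a flexible architecture, largely because of the representation gap between vision-language models (VLMs) and diffusion models.

Result: Experiments show that ParaUni substantially improves generation quality and shows potential for multi-reward optimization during the RL stage.

Insight: The hierarchical layers of a VLM respond unequally to different rewards, and a dynamic adjustment mechanism can exploit this to improve generation.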

Abstract: Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to balance sufficient interaction with flexible implementation because of the large representation gap between the two. Considering the abundant, hierarchical information in a VLM’s layers, from low-level details to high-level semantics, we propose ParaUni. It extracts features from various VLM layers in a parallel (Para) way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in a unified (Uni) multimodal model. Concretely, visual features from all VLM layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Building on this, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that aligns with the hierarchical properties of these layers to facilitate improvement on multiple rewards using RL. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages. Code is available at https://github.com/JosephTiTan/ParaUni.


[24] TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression cs.CVPDF

Cheng-Yuan Ho, He-Bi Yang, Jui-Chiu Chiang, Yu-Lun Liu, Wen-Hsiao Peng

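TL;DR: TED-4DGS proposes a temporally activated, embedding-based deformation scheme for compressing dynamic 3DGS, using a sparse anchor-based representation and a shared deformation bank to model and compress dynamic scenes efficiently.

Details

Motivation: Dynamic 3DGS (4DGS) performs well as a scene representation, but existing deformation schemes and compression strategies are inefficient and lack explicit temporal control and rate-distortion optimization.

Result: TED-4DGS achieves state-of-the-art rate-distortion performance on several real-world datasets and is among the first attempts at rate-distortion-optimized compression for dynamic 3DGS.

Insight: Combining sparse anchors with a shared deformation bank markedly improves the efficiency of dynamic 3DGS, while temporal activation and embedding-based deformation give finer control over scene dynamics.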

Abstract: Building on the success of 3D Gaussian Splatting (3DGS) in static 3D scene representation, its extension to dynamic scenes, commonly referred to as 4DGS or dynamic 3DGS, has attracted increasing attention. However, designing more compact and efficient deformation schemes together with rate-distortion-optimized compression strategies for dynamic 3DGS representations remains an underexplored area. Prior methods either rely on space-time 4DGS with overspecified, short-lived Gaussian primitives or on canonical 3DGS with deformation that lacks explicit temporal control. To address this, we present TED-4DGS, a temporally activated and embedding-based deformation scheme for rate-distortion-optimized 4DGS compression that unifies the strengths of both families. TED-4DGS is built on a sparse anchor-based 3DGS representation. Each canonical anchor is assigned learnable temporal-activation parameters to specify its appearance and disappearance transitions over time, while a lightweight per-anchor temporal embedding queries a shared deformation bank to produce anchor-specific deformation. For rate-distortion compression, we incorporate an implicit neural representation (INR)-based hyperprior to model anchor attribute distributions, along with a channel-wise autoregressive model to capture intra-anchor correlations. With these novel elements, our scheme achieves state-of-the-art rate-distortion performance on several real-world datasets. To the best of our knowledge, this work represents one of the first attempts to pursue a rate-distortion-optimized compression framework for dynamic 3DGS representations.


[25] University Building Recognition Dataset in Thailand for the mission-oriented IoT sensor system cs.CV | cs.AIPDF

Takara Taniguchi, Yudai Ueda, Atsuya Muramatsu, Kohki Hashimoto, Ryo Yagi

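TL;DR: The paper presents the Chulalongkorn University Building Recognition Dataset (CUBR), built specifically for Chulalongkorn University in Thailand, and shows that training in Wireless Ad Hoc Federated Learning (WAFL) scenarios achieves better accuracy than self-training.

Details

Motivation: As edge device performance improves, WAFL has become a candidate for collaborative learning at the edge, but mission-specific datasets are lacking. The goal is to build a dataset for university building recognition in Thailand and to validate WAFL on it.

Result: WAFL-ViT outperforms self-training on the CUBR dataset. The dataset has been publicly released.

Insight: Building mission-specific datasets is essential for edge learning performance, and WAFL shows promise in distributed learning scenarios.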

Abstract: Many industrial sectors already run machine learning in inference mode on edge devices. Training on edge devices is a promising future direction, given improvements in semiconductor performance. Wireless Ad Hoc Federated Learning (WAFL) has been proposed as a promising approach for collaborative learning with device-to-device communication among edges. In particular, WAFL with Vision Transformer (WAFL-ViT) has been tested on image recognition tasks with the UTokyo Building Recognition Dataset (UTBR). Since WAFL-ViT is a mission-oriented sensor system, it is essential to construct specific datasets for each mission. In our work, we have developed the Chulalongkorn University Building Recognition Dataset (CUBR), which is specialized for Chulalongkorn University as a case study in Thailand. Our results also demonstrate that training on WAFL scenarios achieves better accuracy than self-training scenarios. The dataset is available at https://github.com/jo2lxq/wafl/.


[26] EmoStyle: Emotion-Driven Image Stylization cs.CVPDF

Jingyuan Yang, Zihuan Bai, Hui Huang

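TL;DR: EmoStyle proposes an emotion-driven image stylization method that, by constructing a dataset and learning an emotion-aware style dictionary, produces stylized images that express a specified emotion.

Details

Motivation: Existing image stylization methods overlook the emotional impact of style, even though emotional expression is central to artistic creation, so a stylization method that combines content and emotion is needed.

Result: Experiments show that EmoStyle outperforms baselines in emotional expressiveness and content consistency, and that its style dictionary transfers to other generative tasks.

Insight: Coupling emotion with style can strengthen the expressive power of AI-generated art, and discretized emotion-style codes generalize across tasks.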

Abstract: Art has long been a profound medium for expressing emotions. While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles. To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content. We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion-style mapping. First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS. We then propose an Emotion-Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries. Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries. Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency. Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications. Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.


[27] UniFS: Unified Multi-Contrast MRI Reconstruction via Frequency-Spatial Fusion cs.CV | cs.AIPDF

Jialin Li, Yiwei Ren, Kai Pan, Dong Wei, Pujin Cheng

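TL;DR: UniFS proposes a unified multi-contrast MRI reconstruction method whose frequency-spatial fusion modules handle multiple k-space undersampling patterns without retraining.

Details

Motivation: Existing multi-contrast MR reconstruction methods require a separate model for each undersampling pattern, generalize poorly, and do not make full use of frequency information.

Result: Experiments on the BraTS and HCP datasets show that UniFS achieves state-of-the-art performance across multiple undersampling patterns and acceleration factors.

Insight: Frequency information is underused in multi-contrast MRI reconstruction; UniFS exploits complementary cross-modal frequency information through adaptive prompt learning.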

Abstract: Recently, Multi-Contrast MR Reconstruction (MCMR) has emerged as an active research topic that leverages high-quality auxiliary modalities to reconstruct undersampled target modalities of interest. However, existing methods often struggle to generalize across different k-space undersampling patterns, requiring the training of a separate model for each specific pattern, which limits their practical applicability. To address this challenge, we propose UniFS, a Unified Frequency-Spatial Fusion model designed to handle multiple k-space undersampling patterns for MCMR tasks without any need for retraining. UniFS integrates three key modules: a Cross-Modal Frequency Fusion module, an Adaptive Mask-Based Prompt Learning module, and a Dual-Branch Complementary Refinement module. These modules work together to extract domain-invariant features from diverse k-space undersampling patterns while dynamically adapting to their individual variations. Another limitation of existing MCMR methods is their tendency to focus solely on spatial information while neglecting frequency characteristics, or to extract only shallow frequency features, thus failing to fully leverage complementary cross-modal frequency information. To address this issue, UniFS introduces an adaptive prompt-guided frequency fusion module for k-space learning, significantly enhancing the model’s generalization performance. We evaluate our model on the BraTS and HCP datasets with various k-space undersampling patterns and acceleration factors, including previously unseen patterns, to comprehensively assess UniFS’s generalizability. Experimental results across multiple scenarios demonstrate that UniFS achieves state-of-the-art performance. Our code is available at https://github.com/LIKP0/UniFS.


[28] Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding cs.CV | cs.AI | cs.CLPDF

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong

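TL;DR: The paper proposes Active Video Perception (AVP), a framework that uses an iterative plan-observe-reflect process to extract compact, query-relevant evidence directly from pixels, substantially improving long video understanding while reducing computational cost.

Details

Motivation: Long video understanding (LVU) is hard because real-world queries often depend on sparse, temporally dispersed cues, while existing methods rely on query-agnostic captioners that waste computation and blur fine-grained information. Inspired by active perception theory, the authors argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the observations suffice to answer the query.

Result: Across five LVU benchmarks, AVP achieves the highest performance, outperforming the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.

Insight: By focusing on query-relevant evidence, the active perception framework addresses redundancy and noise in long video understanding and offers an efficient, scalable approach to agentic video understanding.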

Abstract: Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
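The plan-observe-reflect loop described in the abstract can be sketched as a small control loop. This is only an illustration of the control flow, not the AVP implementation: the function signature, the `max_rounds` cap, and the toy planner, observer, and reflector (which stand in for MLLM agents and read from a dictionary of captions rather than pixels) are all assumptions.

```python
def active_video_perception(query, plan, observe, reflect, max_rounds=5):
    """Run plan -> observe -> reflect until the reflector returns an answer.

    plan(query, evidence)    -> list of interactions to perform next
    observe(interactions)    -> list of (timestamp, observation) evidence items
    reflect(query, evidence) -> an answer, or None if evidence is insufficient
    """
    evidence = []
    for _ in range(max_rounds):
        interactions = plan(query, evidence)
        evidence.extend(observe(interactions))
        answer = reflect(query, evidence)
        if answer is not None:
            return answer, evidence  # halt as soon as evidence suffices
    return None, evidence  # round budget exhausted without an answer


# Toy stand-ins for the three agents (hypothetical, for illustration only).
toy_video = {0: "empty street", 60: "a dog runs", 120: "a red car parks"}


def toy_plan(query, evidence):
    seen = {t for t, _ in evidence}
    remaining = [t for t in sorted(toy_video) if t not in seen]
    return remaining[:1]  # inspect one new timestamp per round


def toy_observe(timestamps):
    return [(t, toy_video[t]) for t in timestamps]


def toy_reflect(query, evidence):
    for t, text in evidence:
        if "red car" in text:
            return f"at {t}s"
    return None


answer, evidence = active_video_perception(
    "When does the red car appear?", toy_plan, toy_observe, toy_reflect
)
```

The early return is the point of the design: the loop stops observing once the reflector judges the evidence sufficient, which is how an approach of this kind can avoid processing the whole video.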


[29] Concept-based Explainable Data Mining with VLM for 3D Detection cs.CVPDF

Mai Tsujimoto

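TL;DR: The paper proposes a cross-modal method that uses 2D vision-language models (VLMs) to mine rare objects from driving scenes in order to improve 3D object detection. It combines several techniques into an explainable pipeline that substantially reduces the annotation burden and improves detection of rare objects.

Details

Motivation: Rare-object detection from point clouds is challenging in autonomous driving systems. Vision-language models are strong at image understanding, but their potential for cross-modal data mining has not been fully explored.

Result: Experiments on the nuScenes dataset show that the method improves 3D detection performance, especially for challenging categories such as trailers and bicycles, while using only a fraction of the training data.

Insight: Concept-driven data mining can efficiently select high-value training samples, which is relevant to dataset curation for autonomous driving systems.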

Abstract: Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.


[30] WaterWave: Bridging Underwater Image Enhancement into Video Streams via Wavelet-based Temporal Consistency Field cs.CVPDF

Qi Zhu, Jingyi Zhang, Naishan Zheng, Wei Yu, Jinghao Zhang

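TL;DR: The paper proposes WaterWave, which enhances underwater video through a wavelet-based temporal consistency field, addressing the lack of temporal consistency in frame-by-frame single-image enhancement.

Details

Motivation: Paired underwater video data is hard to obtain, and existing methods usually apply a single-image enhancement model frame by frame, which lacks temporal consistency and yields unnatural video. A method is needed that preserves temporal consistency without paired data.

Result: Experiments show that WaterWave significantly improves the quality of videos produced by single-image enhancement, and in downstream tracking with UOSTrack and MAT it improves precision over the original video by 19.7% and 9.7%, respectively.

Insight: Viewing dynamic scenes from the temporal-frequency perspective makes it possible to use wavelet-domain properties to constrain temporal consistency, which matters for video enhancement.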

Abstract: Underwater video pairs are difficult to obtain because underwater imaging is complex. As a result, most existing underwater video enhancement methods apply a single-image enhancement model frame by frame, which naturally lacks temporal consistency. To relieve the problem, we rethink the temporal manifold inherent in natural videos and observe a temporal consistency prior in dynamic scenes from the local temporal frequency perspective. Building on this prior and without paired data, we propose an implicit representation for enhanced video signals, which is conducted in the wavelet-based temporal consistency field, WaterWave. Specifically, under the constraints of the prior, we progressively filter and attenuate the inconsistent components while preserving motion details and scenes, achieving a natural-flowing video. Furthermore, to represent temporal frequency bands more accurately, an underwater flow correction module is designed to rectify estimated flows considering the transmission in underwater scenes. Extensive experiments demonstrate that WaterWave significantly enhances the quality of videos generated using single-image underwater enhancements. Additionally, our method demonstrates high potential in downstream underwater tracking tasks, such as UOSTrack and MAT, outperforming the original video by a large margin, i.e., 19.7% and 9.7% in precision, respectively.


[31] Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding cs.CV | cs.AI | cs.CLPDF

Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li

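TL;DR: The paper studies zoom as an underexplored prior for GUI grounding, proposes the training-free ZoomClick method, which significantly improves model performance, and introduces GUIZoom-Bench, a benchmark for evaluating adaptability to zoom.

Details

Motivation: Existing GUI grounding methods rely on large-scale bounding box supervision yet still struggle with cross-platform generalization, complex layout analysis, and fine-grained element localization. The potential of zoom as a prior has not been fully exploited.

Result: ZoomClick significantly improves model performance; UI-Venus-72B reaches a 73.1% success rate on ScreenSpot-Pro, surpassing existing methods.

Insight: Zoom is an effective prior for GUI grounding; exploiting its properties can improve model performance and inform future research.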

Abstract: Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
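One way to picture iterative zooming is as a loop that repeatedly crops around the model's current guess and asks again at higher effective resolution. The sketch below is a guess at the general idea, not ZoomClick itself: the abstract names depth, shrink size, and minimal crop size but does not define them, so the parameter semantics here are assumptions, pre-zoom is not modeled, and `predict` is a hypothetical callable standing in for a grounding model.

```python
def zoom_click(predict, screen_w, screen_h, depth=3, shrink=0.5, min_crop=64):
    """Refine a click point by zooming into progressively smaller crops.

    predict(box) takes a crop (x0, y0, w, h) in screen pixels and returns the
    predicted target as (rx, ry), relative to that crop, each in [0, 1].
    Returns the final click point in full-screen pixel coordinates.
    """
    x0, y0, w, h = 0.0, 0.0, float(screen_w), float(screen_h)
    for _ in range(depth):
        rx, ry = predict((x0, y0, w, h))
        cx, cy = x0 + rx * w, y0 + ry * h  # current guess in screen coordinates
        new_w = min(w, max(w * shrink, min_crop))
        new_h = min(h, max(h * shrink, min_crop))
        if new_w == w and new_h == h:
            break  # crop cannot shrink any further
        # Centre the smaller crop on the guess, clamped to stay on screen.
        x0 = min(max(cx - new_w / 2, 0.0), screen_w - new_w)
        y0 = min(max(cy - new_h / 2, 0.0), screen_h - new_h)
        w, h = new_w, new_h
    rx, ry = predict((x0, y0, w, h))
    return x0 + rx * w, y0 + ry * h
```

With a real model, each call to `predict` would crop and upscale the screenshot before inference, so small interface elements occupy more pixels at each level. The loop issues at most `depth + 1` model calls.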


[32] Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning cs.CVPDF

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

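TL;DR: The paper introduces Know-Show, a new benchmark for evaluating spatio-temporal grounded reasoning in video-language models, and proposes GRAM to strengthen fine-grained grounding.

Details

Motivation: Video-language models have made great progress in multimodal understanding, but their reasoning remains weakly grounded in space and time. Know-Show aims to fill this gap by evaluating how well models reason while grounding their inferences in visual and temporal evidence.

Result: Experiments show that existing models struggle on fine-grained tasks such as hand-object interaction, and that GRAM strengthens their reasoning and grounding. Know-Show provides a unified standard for evaluating video-language models.

Insight: Video-language models need to tie their reasoning more closely to spatio-temporal evidence to be reliable; Know-Show offers guidance for building interpretable multimodal reasoning systems.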

Abstract: Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (Qwen, VideoLLaVA, GPT-4o, and Gemini, etc.) reveal that existing models struggle to “show what they know” and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at https://github.com/LUNAProject22/Know-Show.


[33] DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis cs.CV | cs.LGPDF

Yuhua Wen, Qifei Li, Yingying Zhou, Yingming Gao, Zhengqi Wen

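TL;DR: DashFusion proposes a dual-stream alignment and hierarchical bottleneck fusion framework for multimodal sentiment analysis, addressing the limitations of existing methods in alignment and fusion.

Details

Motivation: Alignment and fusion problems strongly affect multimodal sentiment analysis, and existing methods often handle them in isolation, which limits performance and efficiency.

Result: DashFusion achieves state-of-the-art performance on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets, with ablation studies confirming the effectiveness of the method.

Insight: Combining temporal and semantic alignment with a hierarchical bottleneck fusion strategy improves both the performance and the efficiency of multimodal sentiment analysis.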

Abstract: Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). First, a dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Second, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The code for our experiments is available at https://github.com/ultramarineX/DashFusion.


[34] VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation cs.CVPDF

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

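TL;DR: VOST-SGG proposes a one-stage spatio-temporal scene graph generation framework aided by vision-language models (VLMs), improving performance through a dual-source query initialization strategy and a multi-modal feature bank.

Details

Motivation: Existing one-stage spatio-temporal scene graph generation models use semantically uninformed queries and rely on unimodal visual features; VOST-SGG brings in the common-sense reasoning of VLMs to address both.

Result: VOST-SGG achieves state-of-the-art performance on the Action Genome dataset, validating the effectiveness of VLM-aided priors and multi-modal features.

Insight: Semantic priors and multi-modal features from VLMs can markedly improve spatio-temporal scene graph generation, especially for complex relationship reasoning.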

Abstract: Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce the dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at https://github.com/LUNAProject22/VOST.


[35] See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors cs.CV | cs.AIPDF

Kunyi Yang, Qingyu Wang, Cheng Yuan, Yutong Ban

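TL;DR: The paper proposes DepSeg, a training-free method for laparoscopic scene segmentation that combines monocular depth priors with pretrained vision foundation models, using depth-guided prompts and template matching.

Details

Motivation: Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery, but dense annotation is costly. The authors aim to use depth information and pretrained models to reduce dependence on annotation.

Result: On the CholecSeg8k dataset, DepSeg clearly improves over a direct SAM2 auto-segmentation baseline (35.9% vs. 14.7% mIoU) and remains competitive when using only 10-20% of the object templates.

Insight: Depth is an effective geometric prior; combined with pretrained models, it can greatly reduce reliance on annotated data for surgical scene segmentation.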

Abstract: Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and proposes depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled pretrained visual feature and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 auto segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10–20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
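The final classification step, matching a pooled mask feature against a template bank, can be sketched as a nearest-template lookup. The abstract does not state the similarity measure, so cosine similarity is an assumption here, as are the function names, the dictionary layout of the bank, and the class labels in the usage note.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def classify_mask(feature, template_bank):
    """Assign a mask feature to the class of its most similar template.

    template_bank maps a class label to a list of template feature vectors.
    Returns (best_label, best_score); best_label is None for an empty bank.
    """
    best_label, best_score = None, -1.0
    for label, templates in template_bank.items():
        score = max(
            (cosine_similarity(feature, t) for t in templates), default=-1.0
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

For instance, with a toy bank `{"liver": [[1, 0, 0]], "gallbladder": [[0, 1, 0]]}`, the feature `[0.9, 0.1, 0.0]` is assigned to `"liver"`. Keeping several templates per class and taking the maximum lets a class cover more than one appearance, and dropping templates from the lists mimics using only a fraction of the bank.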


[36] Ideal Observer for Segmentation of Dead Leaves Images cs.CV | math.ST | stat.MEPDF

Swantje Mahncke, Malte Ott

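TL;DR: The paper derives a Bayesian ideal observer for segmenting images generated by "dead leaves" models, with the aim of providing a theoretical upper bound on the segmentation performance of humans and vision algorithms.

Details

Motivation: In the human visual environment, objects are distributed in space and occlude one another. Dead leaves models mimic this occlusion by layering objects on top of each other. The paper aims to derive an ideal observer against which segmentation performance on such images can be assessed.

Result: The approach yields a principled upper bound on segmentation performance for a limited set of pixels, serving as a benchmark for humans and vision algorithms.

Insight: The dead leaves model and its ideal observer give a theoretical basis for studying segmentation decisions, and the analysis identifies the factors that limit practical computation.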

Abstract: The human visual environment is comprised of different surfaces that are distributed in space. The parts of a scene that are visible at any one time are governed by the occlusion of overlapping objects. In this work we consider “dead leaves” models, which replicate these occlusions when generating images by layering objects on top of each other. A dead leaves model is a generative model comprised of distributions for object position, shape, color and texture. An image is generated from a dead leaves model by sampling objects (“leaves”) from these distributions until a stopping criterion is reached, usually when the image is fully covered or until a given number of leaves was sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels based on independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations for the computation of the posterior probability as well as describe factors that determine the feasibility of practically applying this computation. The dead leaves image model and the associated ideal observer can be applied to study segmentation decisions in a limited number of pixels, providing a principled upper-bound on performance, to which humans and vision algorithms could be compared.
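The generative process described in the abstract can be sketched in a few lines. This is a simplified illustration, not the authors' model: leaves are restricted to discs with uniformly sampled centre and radius, color and texture are omitted (only a leaf-index label map is produced), and each newly sampled leaf is placed beneath the earlier ones, so it fills only uncovered pixels. That ordering is an assumption chosen so that stopping at full coverage is well defined.

```python
import random


def dead_leaves(width, height, r_min=2.0, r_max=6.0, max_leaves=10_000, seed=0):
    """Sample a dead leaves label map: each pixel holds the index of the
    visible leaf, or -1 if no leaf covers it when sampling stops."""
    rng = random.Random(seed)
    labels = [[-1] * width for _ in range(height)]
    uncovered = width * height
    for leaf_id in range(max_leaves):
        if uncovered == 0:
            break  # stopping criterion: image fully covered
        # Centres may fall slightly outside the image so borders get covered.
        cx = rng.uniform(-r_max, width + r_max)
        cy = rng.uniform(-r_max, height + r_max)
        r = rng.uniform(r_min, r_max)
        y_lo, y_hi = max(0, int(cy - r)), min(height, int(cy + r) + 1)
        x_lo, x_hi = max(0, int(cx - r)), min(width, int(cx + r) + 1)
        for y in range(y_lo, y_hi):
            for x in range(x_lo, x_hi):
                inside = (x - cx) ** 2 + (y - cy) ** 2 <= r * r
                if inside and labels[y][x] == -1:
                    labels[y][x] = leaf_id  # earlier leaves occlude this one
                    uncovered -= 1
    return labels
```

The label map is exactly the ground-truth partition that a segmentation observer has to infer: two pixels belong to the same segment when they carry the same leaf index.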


[37] Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models cs.CV | cs.AIPDF

Weijue Bu, Guan Yuan, Guixian Zhang

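TL;DR: The paper proposes CG-VLM, a training-free, inference-time framework that turns game-theoretic interpretability into decoding control, mitigating hallucination in vision-language models and achieving SOTA results on multiple benchmarks.

Details

Motivation: Large vision-language models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors and causes object hallucinations. Existing methods either intervene only at the output logits or rely on heuristic suppression, and they lack principled grounding.

Result: On the POPE and CHAIR benchmarks, CG-VLM achieves the best results across models such as InstructBLIP while preserving general capabilities.

Insight: Token-level sensing enables precise, context-aware intervention without compromising the model’s foundational knowledge.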

Abstract: Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.
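
For readers unfamiliar with Harsanyi interactions, the underlying quantity is the Harsanyi dividend of a coalition S under a value function v, I(S) = Σ_{T ⊆ S} (−1)^{|S|−|T|} v(T). The sketch below computes it for a toy two-player game. How CG-VLM defines the players and the value function is not stated in the abstract, so the names `vision` and `text` and the numbers in `toy_values` are purely hypothetical.

```python
from itertools import combinations

def harsanyi_dividend(coalition, value):
    """Harsanyi dividend I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T)."""
    members = tuple(coalition)
    total = 0.0
    for k in range(len(members) + 1):
        sign = (-1) ** (len(members) - k)
        for subset in combinations(members, k):
            total += sign * value(frozenset(subset))
    return total

# Hypothetical value function: model "score" when only some inputs are present.
toy_values = {
    frozenset(): 0.0,
    frozenset({"vision"}): 1.0,
    frozenset({"text"}): 2.0,
    frozenset({"vision", "text"}): 5.0,
}

synergy = harsanyi_dividend({"vision", "text"}, toy_values.__getitem__)
print(synergy)  # 5 - 1 - 2 + 0
```

A positive dividend for the joint coalition means the two inputs contribute more together than the sum of their separate effects, which is one way to read "vision-text synergy".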


[38] 2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency cs.CV | cs.AIPDF

Xingxi Yin, Yicheng Li, Gong Yan, Chenglin Li, Jian Zhao

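TL;DR: The paper introduces 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset focused on decoupled control of identity consistency and transient attributes. Through a Human-in-the-Loop pipeline and a quality-gating mechanism, the dataset achieves large-scale identity consistency and high-quality visual narrative generation.

Details

Motivation: Current controllable visual storytelling datasets do not sufficiently disentangle identity consistency from transient attributes (such as pose and expression), which limits reliable sequential synthesis. The authors aim to address this with high-quality annotation and a decoupled control scheme.

Result: Experiments show that models fine-tuned on the dataset achieve performance close to that of closed-source models on visual narrative generation.

Insight: High-quality dataset design and decoupled control are key to controllable visual storytelling; the Human-in-the-Loop pipeline and quality gating can significantly improve data quality.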

Abstract: Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters appearing across 10,000 illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a Human-in-the-Loop pipeline (HiL) that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A decoupled control scheme separates persistent identity from transient attributes (pose and expression), while a Quality-Gated loop integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.


[39] ProPhy: Progressive Physical Alignment for Dynamic World Simulation cs.CVPDF

Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng

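TL;DR: ProPhy proposes a progressive physical alignment framework that extracts physical priors through a two-stage Mixture-of-Physics-Experts mechanism, combining semantic and fine-grained dynamic information to significantly improve the physical consistency of video generation for dynamic world simulation.

Details

Motivation: Existing video generation methods lack physical consistency when handling large-scale or complex dynamics, mainly because they respond isotropically to physical prompts and neglect fine-grained alignment between generated content and localized physical cues.

Result: On physics-aware video generation benchmarks, ProPhy produces videos that are more realistic, dynamic, and physically coherent than those of existing methods.

Insight: By extracting physical cues in stages and aligning them explicitly, ProPhy offers an effective video generation framework for dynamic world simulation, particularly suited to expressing complex physical phenomena.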

Abstract: Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.


[40] Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer cs.CVPDF

Rong Wang, Wei Mao, Changsheng Lu, Hongdong Li

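TL;DR: The paper proposes a skinning-free method that independently estimates low-frequency posed shape and high-frequency wrinkles to generate high-quality 3D garment animation, and improves visual quality through 2D image transfer and multimodal fusion.

Details

Motivation: Existing methods rely on linear blend skinning to produce the low-frequency posed shape and regress only the high-frequency wrinkles, but the lack of explicit skinning supervision often leads to misaligned shapes and failure to recover wrinkles.

Result: The method significantly improves animation quality on various garment types and recovers finer wrinkles than existing methods.

Insight: Decoupling low- and high-frequency information and leveraging 2D image techniques makes high-quality 3D garment animation possible without skinning.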

Abstract: We present a novel method for generating 3D garment deformations from given body poses, which is key to a wide range of applications, including virtual try-on and extended reality. To simplify the cloth dynamics, existing methods mostly rely on linear blend skinning to obtain the low-frequency posed garment shape and only regress high-frequency wrinkles. However, due to the lack of explicit skinning supervision, such skinning-based approaches often produce misaligned shapes when posing the garment, which corrupts the high-frequency signals and prevents the recovery of high-fidelity wrinkles. To tackle this issue, we propose a skinning-free approach that independently estimates the posed (i) vertex position for the low-frequency posed garment shape, and (ii) vertex normal for high-frequency local wrinkle details. In this way, each frequency modality can be effectively decoupled and directly supervised by the geometry of the deformed garment. To further improve the visual quality of animation, we propose to encode both vertex attributes as rendered texture images, so that 3D garment deformation can be equivalently achieved via 2D image transfer. This enables us to leverage powerful pretrained image models to recover fine-grained visual details in wrinkles, while maintaining superior scalability for garments of diverse topologies without relying on manual UV partition. Finally, we propose a multimodal fusion to incorporate constraints from both frequency modalities and robustly recover deformed 3D garments from transferred images. Extensive experiments show that our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.


[41] NormalView: sensor-agnostic tree species classification from backpack and aerial lidar data using geometric projections cs.CVPDF

Juho Korkeala, Jesse Muhojoki, Josef Taher, Klaara Salolahti, Matti Hyyppä

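TL;DR: The paper proposes NormalView, which converts point cloud data into 2D images via geometric projections and feeds them to YOLOv11 for sensor-agnostic tree species classification, achieving strong performance on high-density MLS and ALS data.

Details

Motivation: Laser scanning is highly valuable for assessing forest environments, but a sensor-agnostic method for accurate tree species classification has been lacking. A general method that exploits geometric information is therefore needed.

Result: Overall accuracy is 95.5% on the MLS data and 91.8% on the ALS data. Adding multispectral intensity information further improves classification performance.

Insight: Projection-based methods combined with state-of-the-art image classification networks can achieve sensor-agnostic, high-performance classification, and multispectral information benefits classification.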

Abstract: Laser scanning has proven to be an invaluable tool in assessing the decomposition of forest environments. Mobile laser scanning (MLS) has been shown to be highly promising for extremely accurate, tree-level inventory. In this study, we present NormalView, a sensor-agnostic projection-based deep learning method for classifying tree species from point cloud data. NormalView embeds local geometric information into two-dimensional projections, in the form of normal vector estimates, and uses the projections as inputs to an image classification network, YOLOv11. In addition, we inspected the effect of multispectral radiometric intensity information on classification performance. We trained and tested our model on high-density MLS data (7 species, ~5000 pts/m^2), as well as high-density airborne laser scanning (ALS) data (9 species, >1000 pts/m^2). On the MLS data, NormalView achieves an overall accuracy (macro-average accuracy) of 95.5% (94.8%), and 91.8% (79.1%) on the ALS data. We found that having intensity information from multiple scanners provides benefits in tree species classification, and the best model on the multispectral ALS dataset was a model using intensity information from all three channels of the multispectral ALS. This study demonstrates that projection-based methods, when enhanced with geometric information and coupled with state-of-the-art image classification backbones, can achieve exceptional results. Crucially, these methods are sensor-agnostic, relying only on geometric information. Additionally, we publicly release the MLS dataset used in the study.
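
The gap between overall and macro-average accuracy on the ALS data (91.8% vs. 79.1%) is typical of imbalanced class distributions. The sketch below shows how the two metrics can diverge. It assumes that macro-average accuracy means the unweighted mean of per-class recall, which is a common definition but is not confirmed by the abstract; the species labels and predictions are made up for illustration.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_average_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed definition)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Made-up example: a frequent class predicted well, a rare class predicted poorly.
y_true = ["pine"] * 8 + ["birch"] * 2
y_pred = ["pine"] * 8 + ["pine", "birch"]

oa = overall_accuracy(y_true, y_pred)
ma = macro_average_accuracy(y_true, y_pred)
print(oa, ma)
```

Here a single error on the rare class barely moves overall accuracy but halves that class's recall, pulling the macro average down.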


[42] DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model cs.CVPDF

Pasquale De Marinis, Pieter M. Blok, Uzay Kaymak, Rogier Brussee, Gennaro Vessio

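TL;DR: DistillFSS embeds few-shot knowledge directly into the parameters of a lightweight segmentation model through teacher-student distillation, removing the need for support images at test time, enabling fast and efficient inference, and performing well on cross-domain few-shot semantic segmentation.

Details

Motivation: Cross-domain few-shot semantic segmentation (CD-FSS) faces distribution shifts, disjoint label spaces, and scarce support samples, which make conventional methods unreliable and computationally demanding at test time. DistillFSS aims to embed support-set knowledge directly via distillation, improving efficiency and generalization.

Result: Experiments show that DistillFSS performs well in multi-class and multi-shot scenarios, with computational efficiency substantially better than existing methods.

Insight: Embedding few-shot knowledge directly into model parameters is an efficient approach, especially for cross-domain tasks, and can significantly reduce test-time computation.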

Abstract: Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce, making standard episodic methods unreliable and computationally demanding at test time. To address these constraints, we propose DistillFSS, a framework that embeds support-set knowledge directly into a model’s parameters through a teacher–student distillation process. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher-driven specialization. Combined with fine-tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD-FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains. The code is available at https://github.com/pasqualedem/DistillFSS.


[43] Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective cs.CVPDF

Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang

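TL;DR: This paper proposes a self-supervised method that uses camera metadata (EXIF tags) to detect AI-generated images, significantly improving cross-model generalization and robustness.

Details

Motivation: The rapid proliferation of AI-generated images challenges multimedia forensics, and existing detectors often depend on assumptions about the internals of specific generative models, which limits their general applicability.

Result: Experiments show that the method substantially outperforms existing techniques across a variety of generative models, with strong generalization and robustness to image perturbations.

Insight: Camera metadata (EXIF tags) provides an effective signal for distinguishing real photographs from AI-generated images, offering a new direction for multimedia forensics.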

Abstract: The proliferation of AI-generated imagery poses escalating challenges for multimedia forensics, yet many existing detectors depend on assumptions about the internals of specific generative models, limiting their cross-model applicability. We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata – specifically exchangeable image file format (EXIF) tags – to learn features intrinsic to digital photography. Our pretext task trains a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags (e.g., camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (e.g., focal length and aperture value). Using these EXIF-induced features, we first perform one-class detection by modeling the distribution of photographic images with a Gaussian mixture model and flagging low-likelihood samples as AI-generated. We then extend to binary detection that treats the learned extractor as a strong regularizer for a classifier of the same architecture, operating on high-frequency residuals from spatially scrambled patches. Extensive experiments across various generative models demonstrate that our EXIF-induced detectors substantially advance the state of the art, delivering strong generalization to in-the-wild samples and robustness to common benign image perturbations.


[44] LeAD-M3D: Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection cs.CVPDF

Johannes Meier, Jonathan Michel, Oussema Dhaouadi, Yung-Hsu Yang, Christoph Reich

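TL;DR: The paper proposes LeAD-M3D, which achieves real-time monocular 3D detection with both high accuracy and high efficiency through asymmetric distillation, 3D-aware matching, and confidence-gated inference.

Details

Motivation: Monocular 3D detection faces depth ambiguity, viewpoint shifts, and high computational cost, and existing approaches either rely on extra modalities or sacrifice efficiency. The authors aim for a solution that achieves high accuracy and real-time speed without LiDAR or geometric priors.

Result: LeAD-M3D achieves SOTA accuracy on KITTI and Waymo and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods.

Insight: High fidelity and real-time efficiency are simultaneously attainable in monocular 3D detection, without relying on LiDAR or geometric assumptions.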

Abstract: Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy. We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions. Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Our results demonstrate that high fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable - without LiDAR, stereo, or geometric assumptions.


[45] InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem cs.CV | cs.AI | cs.LGPDF

Yeobin Hong, Suhyeon Lee, Hyungjin Chung, Jong Chul Ye

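TL;DR: InverseCrafter is an efficient video recapture method that reformulates the 4D generation task as an inpainting problem in latent space, avoiding the high computational cost of conventional methods.

Details

Motivation: Conventional controllable 4D video generation methods based on pre-trained video diffusion models (VDMs) are computationally expensive and prone to forgetting the model’s original generative priors, so a more efficient solution is needed.

Result: The method achieves novel view generation comparable to existing methods in camera control tasks, and performs well on general-purpose video inpainting and editing.

Insight: Recasting the problem as inpainting in latent space is an efficient and general solution that avoids the computational bottleneck of conventional methods.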

Abstract: Recent approaches to controllable 4D video generation often rely on fine-tuning pre-trained Video Diffusion Models (VDMs). This dominant paradigm is computationally expensive, requiring large-scale datasets and architectural modifications, and frequently suffers from catastrophic forgetting of the model’s original generative priors. Here, we propose InverseCrafter, an efficient inpainting inverse solver that reformulates the 4D generation task as an inpainting problem solved in the latent space. The core of our method is a principled mechanism to encode the pixel space degradation operator into a continuous, multi-channel latent mask, thereby bypassing the costly bottleneck of repeated VAE operations and backpropagation. InverseCrafter not only achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, but also excels at general-purpose video inpainting with editing. Code is available at https://github.com/yeobinhong/InverseCrafter.


[46] Physics-Informed Graph Neural Network with Frequency-Aware Learning for Optical Aberration Correction cs.CV | physics.opticsPDF

Yong En Kok, Bowen Deng, Alexander Bentley, Andrew J. Parkes, Michael G. Somekh

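TL;DR: ZRNet is a physics-informed framework that jointly performs Zernike coefficient prediction and optical image restoration, using a graph neural network and a frequency-domain alignment loss to significantly improve the correction of microscopy images.

Details

Motivation: Existing methods usually address only mild aberrations on limited sample types and ignore the physics of optical wavefront distortion. ZRNet targets complex, large-amplitude aberrations and uses physical principles to improve the accuracy and generalization of the correction.

Result: On the CytoImageNet dataset, ZRNet achieves SOTA performance in both image restoration and Zernike coefficient prediction.

Insight: Combining physical principles with learning algorithms can significantly improve optical aberration correction, especially for complex, large-amplitude aberrations.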

Abstract: Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. Code is available at https://github.com/janetkok/ZRNet.


[47] OWL: Unsupervised 3D Object Detection by Occupancy Guided Warm-up and Large Model Priors Reasoning cs.CVPDF

Xusheng Guo, Wanfa Zhang, Shijia Zhao, Qiming Xia, Xiaolong Xie

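TL;DR: OWL proposes an unsupervised 3D object detection method that uses occupancy-guided warm-up and large-model priors reasoning to address incorrect initial pseudo-labels and the difficulty of filtering them, significantly improving performance.

Details

Motivation: Unsupervised 3D object detection can reduce the high annotation costs in autonomous driving, but existing methods are easily misled by incorrect pseudo-labels early in training and lack effective mechanisms for filtering and refining them.

Result: On the Waymo and KITTI datasets, OWL outperforms existing unsupervised methods by over 15.0% mAP.

Insight: Addressing the poor initial quality of pseudo-labels and incorporating prior knowledge from large models are key to improving unsupervised 3D object detection.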

Abstract: Unsupervised 3D object detection leverages heuristic algorithms to discover potential objects, offering a promising route to reduce annotation costs in autonomous driving. Existing approaches mainly generate pseudo-labels and refine them through self-training iterations. However, these pseudo-labels are often incorrect at the beginning of training, which misleads the optimization process. Moreover, effectively filtering and refining them remains a critical challenge. In this paper, we propose OWL for unsupervised 3D object detection by occupancy guided warm-up and large-model priors reasoning. OWL first employs an Occupancy Guided Warm-up (OGW) strategy to initialize the backbone weights with spatial perception capabilities, mitigating the interference of incorrect pseudo-labels on network convergence. Furthermore, OWL introduces an Instance-Cued Reasoning (ICR) module that leverages the prior knowledge of large models to assess pseudo-label quality, enabling precise filtering and refinement. Finally, we design a Weight-adapted Self-training (WAS) strategy to dynamically re-weight pseudo-labels, improving the performance through self-training. Extensive experiments on Waymo Open Dataset (WOD) and KITTI demonstrate that OWL outperforms state-of-the-art unsupervised methods by over 15.0% mAP, revealing the effectiveness of our method.


[48] Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision cs.CVPDF

Lennart Maack, Julia-Kristin Graß, Lisa-Marie Toscha, Nathaniel Melling, Alexander Schlaefer

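TL;DR: The paper proposes a privacy-preserving framework that distills knowledge from general-purpose large language models (LLMs) to train an efficient, locally deployable vision-language model (VLM), improving its ability to identify and explain anatomical landmarks in surgical scenes.

Details

Motivation: Current VLMs fall short in domain-specific surgical scene understanding (such as identifying anatomical landmarks), and local deployment is needed to avoid leaking patient data to large external VLMs.

Result: The fine-tuned VLM shows substantially increased surgical domain knowledge, validating the framework as data-efficient and privacy-conforming.

Insight: Combining expert knowledge with the generalization ability of LLMs can improve domain-specific models while preserving privacy.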

Abstract: Recently, Vision Large Language Models (VLMs) have demonstrated high potential in computer-aided diagnosis and decision-support. However, current VLMs show deficits in domain-specific surgical scene understanding, such as identifying and explaining anatomical landmarks during Complete Mesocolic Excision. Additionally, there is a need for locally deployable models to avoid patient data leakage to large VLMs hosted outside the clinic. We propose a privacy-preserving framework to distill knowledge from large, general-purpose LLMs into an efficient, local VLM. We generate an expert-supervised dataset by prompting a teacher LLM without sensitive images, using only textual context and binary segmentation masks for spatial information. This dataset is used for Supervised Fine-Tuning (SFT) and subsequent Direct Preference Optimization (DPO) of the locally deployable VLM. Our evaluation confirms that fine-tuning VLMs with our generated datasets increases their surgical domain knowledge by a large margin compared to the respective base VLMs. Overall, this work validates a data-efficient and privacy-conforming way to train a surgical domain optimized, locally deployable VLM for surgical scene understanding.


[49] USV: Unified Sparsification for Accelerating Video Diffusion Models cs.CVPDF

Xinjian Wu, Hongmei Wang, Yuan Zhou, Qinglin Lu

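TL;DR: USV proposes a unified sparsification framework that accelerates video diffusion models by jointly optimizing sparsification of the model’s internal computation and of its sampling process, enabling efficient, high-quality video generation.

Details

Motivation: Video diffusion models carry redundancy in the quadratic complexity of global spatio-temporal attention and in the computational overhead of long iterative denoising trajectories. Existing accelerators typically target a single dimension, with limited gains.

Result: Experiments show that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration while maintaining high visual fidelity.

Insight: Unified, dynamic sparsification is a practical path toward efficient, high-quality video generation, and multi-dimensional co-design can significantly improve acceleration.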

Abstract: The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators – such as sparse attention and step-distilled samplers – typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model’s internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.


[50] FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators cs.CV | cs.GRPDF

Ruochen Chen, Thuy Tran, Shaifali Parashar

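TL;DR: FNOpt is a self-supervised cloth simulation framework that uses meta-optimization with a Fourier neural operator (FNO) to obtain a resolution-agnostic neural optimizer, adapting to different mesh resolutions and motion patterns without retraining.

Details

Motivation: Existing neural simulators rely on extensive ground truth data or sacrifice detail, and generalize poorly across resolutions and motion patterns. FNOpt aims to address these issues and provide more efficient and reliable cloth simulation.

Result: On a cloth simulation dataset, FNOpt shows higher accuracy and robustness in out-of-distribution settings than other learning-based methods.

Insight: FNOpt demonstrates the potential of meta-optimization with Fourier neural operators for cloth simulation, reducing the dependence on curated data and improving cross-resolution reliability.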

Abstract: We present FNOpt, a self-supervised cloth simulation framework that formulates time integration as an optimization problem and trains a resolution-agnostic neural optimizer parameterized by a Fourier neural operator (FNO). Prior neural simulators often rely on extensive ground truth data or sacrifice fine-scale detail, and generalize poorly across resolutions and motion patterns. In contrast, FNOpt learns to simulate physically plausible cloth dynamics and achieves stable and accurate rollouts across diverse mesh resolutions and motion patterns without retraining. Trained only on a coarse grid with physics-based losses, FNOpt generalizes to finer resolutions, capturing fine-scale wrinkles and preserving rollout stability. Extensive evaluations on a benchmark cloth simulation dataset demonstrate that FNOpt outperforms prior learning-based approaches in out-of-distribution settings in both accuracy and robustness. These results position FNO-based meta-optimization as a compelling alternative to previous neural simulators for cloth, thus reducing the need for curated data and improving cross-resolution reliability.


[51] Bring Your Dreams to Life: Continual Text-to-Video Customization cs.CVPDF

Jiahua Dong, Xudong Wang, Wenqi Liang, Zongyan Han, Meng Cao

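TL;DR: This paper proposes a novel Continual Customized Video Diffusion (CCVD) model that addresses forgetting and concept neglect in continual text-to-video generation by introducing a concept-specific attribute retention module and a task-aware concept aggregation strategy, significantly improving generation quality.

Details

Motivation: Existing customized text-to-video generation methods assume that personalized concepts are static and cannot handle concepts that expand over time, and they are prone to forgetting and concept neglect during continual learning. A model that can continually learn new concepts and generate diverse videos is therefore needed.

Result: Experiments show that CCVD outperforms existing methods across diverse text-to-video generation tasks and effectively addresses forgetting and concept neglect.

Insight: Handling multi-concept generation through continual learning is an important direction for video generation, and layer-specific, region-guided strategies can significantly improve generation quality.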

Abstract: Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Besides, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG models. The code is available at https://github.com/JiahuaDong/CCVD.


[52] Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling cs.CV | cs.AIPDF

Saurav Jha, M. Jehanzeb Mirza, Wei Lin, Shiqi Yang, Sarath Chandar

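TL;DR: The paper examines the limitations of vision-language models (VLMs) on spatial reasoning tasks and the use of test-time scaling with world models. It proposes ViSA, a verification framework based on spatial assertions, which helps on some benchmarks but also exposes an information bottleneck in current world models.

Details

Motivation: Existing vision-language models perform poorly on spatial reasoning tasks that require multi-view understanding and embodied perspective shifts, and the effectiveness of test-time scaling methods calls for systematic analysis.

Result: ViSA improves spatial reasoning on the SAT-Real benchmark but yields limited gains on MMSI-Bench, pointing to an information bottleneck in the world models.

Insight: Test-time verifiers need more reliable reward signals and more balanced exploration, and current world models still need a breakthrough in supporting fine-grained reasoning.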

Abstract: Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney’s verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
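
The notion of answer entropy used in the analysis above can be illustrated with a short sketch. It assumes entropy is taken over the empirical distribution of answers sampled from the model for one question; the abstract does not specify how the paper computes it, and the sample answers are made up.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Made-up samples for one multiple-choice question.
before = ["A", "A", "B", "B"]   # model is split between two options
after = ["A", "A", "A", "A"]    # model always gives the same option

print(answer_entropy(before), answer_entropy(after))
```

Lower entropy only means the model is more consistent, not that it is more accurate, which is why a verifier that reduces entropy no better than random scoring provides little evidence of useful guidance.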


[53] Phase-OTDR Event Detection Using Image-Based Data Transformation and Deep Learning cs.CV | cs.AIPDF

Muhammet Cagri Yeke, Samil Sirin, Kivilcim Yuksel, Abdurrahman Gumus

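TL;DR: This study proposes a novel approach that transforms 1D Phase-OTDR data into grayscale images (using techniques such as Gramian Angular Difference Field), combines them into a multi-channel RGB representation, and applies transfer learning models to classify optical fiber events with high accuracy (above 98%).

Details

Motivation: Phase-OTDR systems face challenges of data complexity and analysis efficiency in optical fiber event detection. The study aims to improve classification performance through image transformation and deep learning.

Result: The method achieves high classification accuracies (98.84% and 98.24%), and 5-fold cross-validation confirms the reliability of the models (test accuracies of 99.07% and 98.68%).

Insight: Image transformation significantly improves the analysis and classification of fiber optic data, demonstrating the potential of deep learning in fiber optic monitoring systems.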

Abstract: This study focuses on event detection in optical fibers, specifically classifying six events using the Phase-OTDR system. A novel approach is introduced to enhance Phase-OTDR data analysis by transforming 1D data into grayscale images through techniques such as Gramian Angular Difference Field, Gramian Angular Summation Field, and Recurrence Plot. These grayscale images are combined into a multi-channel RGB representation, enabling more robust and adaptable analysis using transfer learning models. The proposed methodology achieves high classification accuracies of 98.84% and 98.24% with the EfficientNetB0 and DenseNet121 models, respectively. A 5-fold cross-validation process confirms the reliability of these models, with test accuracy rates of 99.07% and 98.68%. Using a publicly available Phase-OTDR dataset, the study demonstrates an efficient approach to understanding optical fiber events while reducing dataset size and improving analysis efficiency. The results highlight the transformative potential of image-based analysis in interpreting complex fiber optic sensing data, offering significant advancements in the accuracy and reliability of fiber optic monitoring systems. The codes and the corresponding image-based dataset are made publicly available on GitHub to support further research: https://github.com/miralab-ai/Phase-OTDR-event-detection.
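
The three transformations named above have standard definitions, sketched below in plain Python. After rescaling the series to [-1, 1] and setting φ_i = arccos(x_i), the Gramian Angular Summation Field is cos(φ_i + φ_j) and the Gramian Angular Difference Field is sin(φ_i − φ_j). The recurrence plot here is the simplest thresholded variant without time-delay embedding. The channel order, the threshold `eps`, and the function names are assumptions for illustration, not details taken from the paper or its repository.

```python
import math

def rescale(series):
    """Linearly rescale a series to [-1, 1] (constant series map to 0)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [max(-1.0, min(1.0, (2 * x - hi - lo) / (hi - lo))) for x in series]

def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices for a 1D series."""
    phi = [math.acos(x) for x in rescale(series)]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf

def recurrence_plot(series, eps):
    """Binary recurrence matrix: 1 where two samples differ by at most eps."""
    return [[1.0 if abs(a - b) <= eps else 0.0 for b in series] for a in series]

def to_three_channels(series, eps=0.5):
    """Stack GASF, GADF and recurrence plot as per-pixel (c0, c1, c2) tuples."""
    gasf, gadf = gramian_angular_fields(series)
    rp = recurrence_plot(series, eps)
    n = len(series)
    return [[(gasf[i][j], gadf[i][j], rp[i][j]) for j in range(n)] for i in range(n)]

image = to_three_channels([0.0, 1.0, 2.0])
```

A series of length n yields an n × n image. The Gramian channels lie in [-1, 1], so they would still need to be mapped to the 0–255 range before being passed to an ImageNet-pretrained network such as EfficientNetB0 or DenseNet121.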


[54] VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack cs.CVPDF

Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu

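TL;DR: The paper proposes Visual Reasoning Sequential Attack (VRSA), which decomposes harmful text into several related sub-images to gradually steer multimodal large language models (MLLMs) into producing harmful content. The method comprises adaptive scene refinement, semantic coherent completion, and text-image consistency alignment. Experiments show that VRSA achieves a higher attack success rate on both open-source and closed-source MLLMs (such as GPT-4o and Claude-4.5-Sonnet).

Details

Motivation: Multimodal large language models (MLLMs) are widely used for their strong cross-modal understanding and generation capabilities, but they are also more vulnerable to jailbreak attacks. Previous attacks focused on reasoning safety risks in the text modality and overlooked similar threats in the visual modality. VRSA aims to fully evaluate the potential safety risks in visual reasoning tasks.

Result: Experiments show that VRSA achieves a higher attack success rate than existing methods on both open-source and closed-source MLLMs, such as GPT-4o and Claude-4.5-Sonnet.

Insight: Safety risks in the visual modality should not be overlooked; VRSA exposes potential vulnerabilities of MLLMs in visual reasoning tasks. Future safety research needs to cover attack and defense strategies for both the text and visual modalities.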

Abstract: Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, more modalities bring more vulnerabilities that can be exploited by jailbreak attacks, which induce MLLMs to output harmful content. Because MLLMs have strong reasoning ability, previous jailbreak attacks have explored reasoning safety risks in the text modality, while similar threats have been largely overlooked in the visual modality. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated images, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency. A series of experiments demonstrates that VRSA achieves a higher attack success rate than state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.


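[55] Underwater Image Reconstruction Using a Swin Transformer-Based Generator and PatchGAN Discriminator cs.CV

Md. Mahbub Hasan Akash, Aria Tasnim Mridula, Sheekar Banerjee, Ishtiak Al Mamoon

TL;DR: The paper proposes a new underwater image reconstruction method based on a Swin Transformer generator and a PatchGAN discriminator, reporting better color correction, contrast enhancement, and haze reduction than existing methods.

Details

Motivation: Underwater images are severely degraded (color distortion, low contrast, and haze) by wavelength-dependent absorption and scattering. Traditional methods and CNN-based models struggle with these problems because of limited receptive fields and an inability to model global dependencies.

Result: Quantitative results show a PSNR of 24.76 dB and an SSIM of 0.89, outperforming existing methods. Visual results confirm effective color balance restoration, contrast improvement, and haze reduction. An ablation study supports the advantage of the Swin Transformer design over convolutional alternatives.

Insight: The Swin Transformer effectively models global dependencies in images, while adversarial training in the GAN framework helps preserve high-frequency detail, giving a robust solution for underwater image reconstruction.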

Abstract: Underwater imaging is essential for marine exploration, environmental monitoring, and infrastructure inspection. However, water causes severe image degradation through wavelength-dependent absorption and scattering, resulting in color distortion, low contrast, and haze effects. Traditional reconstruction methods and convolutional neural network-based approaches often fail to adequately address these challenges due to limited receptive fields and an inability to model global dependencies. This paper presents a novel deep learning framework that integrates a Swin Transformer architecture within a generative adversarial network (GAN) for underwater image reconstruction. Our generator employs a U-Net structure with Swin Transformer blocks to capture both local features and long-range dependencies crucial for color correction across entire images. A PatchGAN discriminator provides adversarial training to ensure high-frequency detail preservation. We train and evaluate our model on the EUVP dataset, which contains paired underwater images of varying quality. Quantitative results demonstrate state-of-the-art performance with a PSNR of 24.76 dB and an SSIM of 0.89, representing significant improvements over existing methods. Visual results show effective color balance restoration, contrast improvement, and haze reduction. An ablation study confirms the superiority of our Swin Transformer design over convolutional alternatives. The proposed method offers robust underwater image reconstruction suitable for various marine applications.
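For readers unfamiliar with the PSNR figure quoted above, the metric can be computed as in the sketch below. It assumes the standard definition, PSNR = 10 log10(MAX^2 / MSE), and an 8-bit intensity range (`max_value=255`); the abstract does not state which range or implementation the authors used.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy check: a constant error of 10% of the peak value gives 20 dB.
clean = np.zeros((4, 4))
noisy = np.full((4, 4), 25.5)
print(round(psnr(clean, noisy), 6))
```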


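[56] SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations cs.CV

Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong

TL;DR: SCAIL is an in-context-learning approach built on 3D-consistent pose representations. With a novel 3D pose representation and a full-context pose injection mechanism, it improves the structural fidelity and temporal consistency of character animation and approaches studio-grade quality.

Details

Motivation: Existing character animation methods often fail to preserve structural fidelity and temporal consistency in in-the-wild scenarios involving complex motion and cross-identity animation, which limits their use in real studio production.

Result: Experiments show that SCAIL achieves state-of-the-art performance on character animation and improves the reliability and realism of the results.

Insight: 3D pose representations and in-context learning are key to improving character animation quality; making full use of context matters most for complex motion and cross-identity scenarios.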

Abstract: Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.


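[57] LPD: Learnable Prototypes with Diversity Regularization for Weakly Supervised Histopathology Segmentation cs.CV

Khang Le, Anh Mai Vu, Thi Kim Trang Vo, Ha Thach, Ngoc Bui Lam Quang

TL;DR: The paper proposes LPD, a one-stage weakly supervised segmentation method based on learnable prototypes with diversity regularization, which improves coverage of intra-class heterogeneity in histopathology images and achieves state-of-the-art performance.

Details

Motivation: Weakly supervised semantic segmentation (WSSS) in histopathology is hindered by inter-class homogeneity and intra-class heterogeneity. Class activation maps (CAMs) based on global pooling highlight only the most distinctive regions, leading to incomplete segmentation. Existing two-stage methods are costly and limited in effectiveness.

Result: Outperforms prior methods in mIoU and mDice on BCSS-WSSS, producing segmentation maps with sharper boundaries and fewer mislabels.

Insight: Learnable prototypes cover more diverse intra-class regions than clustering-based prototypes, offering a more robust approach to weakly supervised segmentation.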

Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology reduces pixel-level labeling by learning from image-level labels, but it is hindered by inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage (global pooling-based class activation maps whose activations highlight only the most distinctive areas and miss nearby class regions). Recent works address these challenges by constructing a clustering prototype bank and then refining masks in a separate stage; however, such two-stage pipelines are costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning, limiting their effectiveness and efficiency. We propose a cluster-free, one-stage learnable-prototype framework with diversity regularization to enhance morphological intra-class heterogeneity coverage. Our approach achieves state-of-the-art (SOTA) performance on BCSS-WSSS, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels, and activation heatmaps further reveal that, compared with clustering-based prototypes, our learnable prototypes cover more diverse and complementary regions within each class, providing consistent qualitative evidence for their effectiveness.


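[58] World Models That Know When They Don’t Know: Controllable Video Generation with Calibrated Uncertainty cs.CV | cs.AI | cs.RO

Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar

TL;DR: The paper proposes C3, a method for quantifying uncertainty in controllable video generation, aimed at the hallucinations that existing models can produce. Through calibrated dense confidence estimation, it provides high-resolution pixel-level uncertainty heatmaps.

Details

Motivation: Although recent controllable video generation models have made notable progress in high-fidelity video synthesis, they often hallucinate, generating frames that are misaligned with physical reality. Existing models cannot assess or express their confidence, which impedes hallucination mitigation.

Result: Experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations show that C3 provides calibrated uncertainty estimates within the training distribution and also enables effective out-of-distribution detection.

Insight: C3's latent-space uncertainty estimation and its mapping to pixel space offer an efficient and intuitive way to assess the confidence of video generation models, improving the trustworthiness and practicality of controllable video generation.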

Abstract: Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model’s uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in the RGB space for intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.
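The abstract relies on strictly proper scoring rules without naming one. As a hedged illustration of the concept only, the sketch below uses the Brier score, a well-known strictly proper rule, and shows that the expected score is minimized when the reported confidence equals the true probability. Whether C3 uses the Brier score is not stated in the abstract; the function names and the value `p = 0.7` are hypothetical.

```python
import numpy as np

def brier_score(confidence, correct):
    """Mean squared gap between reported confidence and the 0/1 outcome."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((confidence - correct) ** 2))

def expected_brier(q, p):
    """Expected Brier score when reporting q while the event has probability p."""
    return p * (q - 1.0) ** 2 + (1.0 - p) * q ** 2

# For a region that is correct 70% of the time (hypothetical value),
# scan candidate confidences and find the one with the lowest expected score.
p = 0.7
grid = np.linspace(0.0, 1.0, 101)
best_q = float(grid[np.argmin(expected_brier(grid, p))])
print(best_q)
```

Because the minimum sits at the true probability, a model trained to minimize such a rule has no incentive to report over- or under-confident values, which is the property calibration training depends on.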


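[59] Synset Signset Germany: a Synthetic Dataset for German Traffic Sign Recognition cs.CV | cs.RO

Anne Sielemann, Lena Loercher, Max-Lion Schumacher, Stefan Wolf, Masoud Roschani

TL;DR: The paper presents a synthesis pipeline and the resulting synthetic German traffic sign dataset, Synset Signset Germany. It combines the advantages of data-driven and analytical modeling: GAN-based texture generation and physically correct lighting yield a realistic, parameter-controllable dataset suited to explainable AI (XAI) and robustness testing.

Details

Motivation: Existing traffic sign datasets lack coverage of newly published signs, and real data is hard to collect efficiently. Synthetic data can fill the gap, but traditional methods fall short in realism and parameter control.

Result: In realism tests the dataset outperforms the existing synthetic dataset CATERED, and it supports checking model robustness and explainability on the GTSRB benchmark.

Insight: Combining data-driven and analytical modeling in synthetic data generation produces realistic data while supporting parameterized studies, providing a new tool for XAI and robustness testing.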

Abstract: In this paper, we present a synthesis pipeline and dataset for training / testing data in the task of traffic sign recognition that combines the advantages of data-driven and analytical modeling: GAN-based texture generation enables data-driven dirt and wear artifacts, rendering unique and realistic traffic sign surfaces, while the analytical scene modulation achieves physically correct lighting and allows detailed parameterization. In particular, the latter opens up applications in the context of explainable AI (XAI) and robustness tests due to the possibility of evaluating the sensitivity to parameter changes, which we demonstrate with experiments. Our resulting synthetic traffic sign recognition dataset Synset Signset Germany contains a total of 105500 images of 211 different German traffic sign classes, including newly published (2020) and thus comparatively rare traffic signs. In addition to a mask and a segmentation image, we also provide extensive metadata including the stochastically selected environment and imaging effect parameters for each image. We evaluate the degree of realism of Synset Signset Germany on the real-world German Traffic Sign Recognition Benchmark (GTSRB) and in comparison to CATERED, a state-of-the-art synthetic traffic sign recognition dataset.


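[60] AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement cs.CV | cs.AI

Munsif Ali, Najmul Hassan, Lucia Ventura, Davide Di Bari, Simonepietro Canese

TL;DR: The paper proposes AQUA-Net, an underwater image enhancement model that adds auxiliary branches operating in the frequency and illumination domains, restoring color balance and contrast while keeping model complexity low.

Details

Motivation: Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering, while existing deep learning models are computationally expensive and hard to deploy in real time.

Result: Experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies confirm the effectiveness of the branch design.

Insight: The frequency and illumination branches are complementary; optimizing them jointly improves the visibility and color representation of underwater images while preserving efficiency and generalization.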

Abstract: Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. Simultaneously, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder-decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.


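[61] EditThinker: Unlocking Iterative Reasoning for Any Image Editor cs.CV

Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia

TL;DR: The paper proposes the EditThinker framework, which improves the instruction-following capability of image editing models through an iterative Think-while-Edit cycle and substantially outperforms existing methods.

Details

Motivation: Existing instruction-based image editing methods achieve high aesthetic quality, but their single-turn success rate at following instructions is limited by inherent stochasticity and a lack of deliberation.

Result: On four benchmarks, EditThinker improves the instruction-following capability of image editing models by a large margin over existing methods.

Insight: Combining iterative reasoning with reinforcement learning effectively improves instruction execution in image editing models; simulating the human process of thinking while editing is key.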

Abstract: Instruction-based image editing has emerged as a prominent research area which, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that lets editing models ‘think’ while they edit, simulating the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until the result is satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align EditThinker’s thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.


cs.CL

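[62] Fine-Tuning BERT for Domain-Specific Question Answering: Toward Educational NLP Resources at University Scale cs.CL | cs.AI

Aurélie Montfrond

TL;DR: The paper studies how to fine-tune BERT for domain-specific question answering (such as university course information queries) and shows that fine-tuning on even a small dataset improves model performance.

Details

Motivation: Prior work on scientific question answering has focused mainly on chatbot-style systems, with little exploration of foundation models for domain-specific reasoning. The paper aims to fill this gap and test how well BERT adapts to the educational domain.

Result: Results show that even modest fine-tuning of BERT on a small dataset improves hypothesis framing and knowledge extraction.

Insight: Fine-tuning foundation models for specific educational tasks is feasible, and the approach could be scaled toward university-specific question answering models.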

Abstract: Prior work on scientific question answering has largely emphasized chatbot-style systems, with limited exploration of fine-tuning foundation models for domain-specific reasoning. In this study, we developed a chatbot for the University of Limerick’s Department of Electronic and Computer Engineering to provide course information to students. A custom dataset of 1,203 question-answer pairs in SQuAD format was constructed using the university book of modules, supplemented with manually and synthetically generated entries. We fine-tuned BERT (Devlin et al., 2019) using PyTorch and evaluated performance with Exact Match and F1 scores. Results show that even modest fine-tuning improves hypothesis framing and knowledge extraction, demonstrating the feasibility of adapting foundation models to educational domains. While domain-specific BERT variants such as BioBERT and SciBERT exist for biomedical and scientific literature, no foundation model has yet been tailored to university course materials. Our work addresses this gap by showing that fine-tuning BERT with academic QA pairs yields effective results, highlighting the potential to scale towards the first domain-specific QA model for universities and enabling autonomous educational knowledge systems.


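[63] Decoding the Black Box: Discerning AI Rhetorics About and Through Poetic Prompting cs.CL | cs.CY

P. D. Edgar, Alia Hall

TL;DR: The paper explores the potential of Poetry Prompt Patterns as a prompt engineering tool and analyzes the descriptive and evaluative behavior of large language models (LLMs) under such prompts.

Details

Motivation: The study examines how creative text prompts, such as poetic prompts, can reveal the algorithmic tendencies and biases of LLMs while adding a new tool to the prompt engineer's toolbox.

Result: LLMs show creativity and adaptability under Poetry Prompt Patterns, but also reveal a willingness to rewrite original works and possible biases.

Insight: Poetry Prompt Patterns are not only an effective prompt engineering method but also reveal how LLMs behave when handling creative content.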

Abstract: Prompt engineering has emerged as a useful way of studying the algorithmic tendencies and biases of large language models. Meanwhile, creatives and academics have leveraged LLMs to develop creative works and explore the boundaries of their writing capabilities through text generation and code. This study suggests that creative text prompting, specifically Poetry Prompt Patterns, may be a useful addition to the toolbox of the prompt engineer, and outlines the process by which this approach may be taken. Then, the paper uses poetic prompts to assess three models' descriptions and evaluations of a renowned poet and to test the consequences of the models' willingness to adapt or rewrite original creative works for presumed audiences.


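[64] To Think or Not to Think: The Hidden Cost of Meta-Training with Excessive CoT Examples cs.CL | cs.AI | cs.LG

Vignesh Kothapalli, Ata Fatahibaarzi, Hamed Firooz, Maziar Sanjabi

TL;DR: The paper shows that including too many chain-of-thought (CoT) examples during meta-training of large language models (LLMs) can degrade performance, and proposes CoT-Recipe, a method for modulating the mix of CoT and non-CoT examples that substantially improves accuracy on novel tasks.

Details

Motivation: Although CoT prompting unlocks strong reasoning capabilities in LLMs, the authors find that excessive inclusion of CoT examples during meta-training degrades performance when CoT supervision is limited. To address this, they study how adjusting the ratio of CoT to non-CoT examples affects performance.

Result: CoT-Recipe increases the accuracy of transformers on novel tasks by up to 300%, and yields accuracy gains of up to 130% on pretrained LLMs (Qwen2.5 series).

Insight: CoT examples help reasoning, but over-reliance on them can hurt performance. Modulating the ratio of CoT to non-CoT examples is key to improving generalization to novel tasks.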

Abstract: Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy.


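[65] LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning cs.CL | cs.AI | cs.LG

Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna

TL;DR: LYNX is an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions, cutting inference compute across a range of tasks and model sizes.

Details

Motivation: Large reasoning models perform well on complex tasks but often "overthink", wasting compute and sometimes reducing accuracy. Existing early-exit methods rely on extra sampling, auxiliary verifier models, or lack formal guarantees; LYNX aims to address these limitations.

Result: On tasks such as GSM8K and MATH-500, LYNX reduces token usage by roughly 40-70% while matching or improving baseline accuracy, and it transfers zero-shot to the non-mathematical benchmark CommonsenseQA.

Insight: LYNX shows that hidden-state probes transfer across tasks and that dynamic exits are efficient, suggesting a new approach to real-time optimization of reasoning.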

Abstract: Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often “overthink”: continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model’s own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., “hmm”, “wait”) during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy–efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40–65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35–60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.
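The split conformal step mentioned in the abstract can be sketched generically as below. This is a minimal illustration of split conformal calibration under an exchangeability assumption, not LYNX's actual procedure: how the probe scores are produced and which calibration examples count as premature ("unsafe") exits are assumptions here, and the function names are hypothetical.

```python
import math
import numpy as np

def conformal_threshold(unsafe_scores, alpha):
    """Exit threshold calibrated on probe scores from cases where exiting was wrong.

    Under exchangeability, a new unsafe score exceeds the returned threshold
    with probability at most alpha, which bounds the premature-exit rate.
    """
    scores = np.sort(np.asarray(unsafe_scores, dtype=float))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # conformal rank
    if k > n:
        return float("inf")  # too few calibration points: never exit early
    return float(scores[k - 1])

def should_exit(score, threshold):
    """Stop reasoning only when the probe score clears the calibrated threshold."""
    return score > threshold

# Hypothetical calibration scores collected at reasoning cue tokens.
calibration = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
threshold = conformal_threshold(calibration, alpha=0.5)
print(threshold, should_exit(6.0, threshold))
```

Lowering `alpha` raises the threshold, trading token savings for a tighter bound on premature exits; this is the user-tunable knob the abstract refers to.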


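[66] Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection Against LLM-Generated Threats cs.CL | cs.LG

Sadat Shahriar, Navid Ayoobi, Arjun Mukherjee, Mostafa Musharrat, Sai Vishnu Vamsi

TL;DR: The paper studies Pink Slime Journalism (low-quality, auto-generated news articles), proposes detection based on linguistic features, and shows that large language models (LLMs) threaten existing detection systems. The authors also design a robust learning framework that resists adversarial attacks.

Details

Motivation: Local news, a source of reliable information, is threatened by Pink Slime Journalism. These low-quality, auto-generated articles mimic legitimate reporting and are hard to tell apart from it. Traditional detection methods degrade substantially when articles are modified by LLMs.

Result: Experiments show that LLMs can reduce the F1-score of existing detection systems by up to 40%, while the new framework improves performance by up to 27%.

Insight: LLMs are not only generation tools but can also be used for adversarial attacks; news detection systems need greater robustness and adaptability.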

Abstract: The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, and show an improvement of up to 27%.


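[67] Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change cs.CL

Ananth Hariharan, David Mortensen

TL;DR: The paper proposes a hybrid neural-symbolic method for analyzing diachronic change in Sanskrit, challenges the assumption that language change is simplification, and addresses data scarcity through weak supervision and a confidence-weighted ensemble.

Details

Motivation: The study aims to quantify the complexity of diachronic change in Sanskrit, a morphologically rich, low-resource language, and to challenge the traditional view that linguistic change amounts to simplification.

Result: Morphological complexity is redistributed rather than reduced: verbal features show cyclical patterns of decline, while compounding and new philosophical terminology expand markedly. System confidence correlates strongly with accuracy (Pearson r = 0.92).

Insight: Sanskrit's complexity changes dynamically rather than simply decreasing; the work also shows the potential of weak supervision and hybrid methods for low-resource language research.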

Abstract: This study demonstrates how hybrid neural-symbolic methods can yield significant new insights into the evolution of a morphologically rich, low-resource language. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit, demonstrating how weakly-supervised hybrid methods can yield new insights into the evolution of morphologically rich, low-resource languages. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit’s overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.
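The calibration figure quoted above (ECE = 0.043) refers to expected calibration error. A minimal sketch of the common equal-width-bin estimator follows; the number of bins and the binning scheme are assumptions, since the abstract does not say how the authors computed it.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    bin_index = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Toy example: four predictions at 0.9 confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```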


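[68] ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering cs.CL | cs.AI | cs.IR | cs.LG

Daeyong Kwon, SeungHeon Doh, Juhan Nam

TL;DR: The paper introduces ArtistMus, an artist-centric music question answering benchmark, and MusWikiDB, a database of music-related Wikipedia passages, for systematically evaluating retrieval-augmented generation (RAG) on music question answering. Experiments show that RAG markedly improves factual accuracy, with large gains for open-source models.

Details

Motivation: Large language models (LLMs) perform poorly on music-related reasoning because music knowledge is sparse in pretraining data. Existing resources offer little support for music question answering (MQA) grounded in artist metadata or historical context.

Result: RAG markedly improves open-source models (for example, Qwen3 8B rises from 35.0 to 91.8), approaching proprietary model performance; RAG-style fine-tuning further improves factual recall and contextual reasoning; MusWikiDB gives faster and more accurate retrieval than a general-purpose Wikipedia corpus.

Insight: Music question answering needs both structured metadata and contextual information; retrieval-augmented generation can compensate for LLMs' gaps in domain knowledge and suggests a path for research in culturally rich domains.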

Abstract: Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.


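[69] Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment cs.CL | cs.AI

Panatchakorn Anantaprayoon, Nataliia Babina, Jad Tarifi, Nima Asgharbeygi

TL;DR: The paper proposes Dynamic Alignment, a framework that aims to go beyond conventional alignment norms by introducing Collective Agency (CA) as an open-ended, unified alignment value and achieving scalable alignment through self-improvement.

Details

Motivation: As AI systems progress toward AGI and ASI, traditional alignment based on human feedback is limited in resources and scalability, and its value systems may not be comprehensive enough. The paper explores a more holistic alignment objective and a scalable, self-improving alignment method.

Result: Experiments show that the approach successfully aligns the model to CA while preserving general NLP capabilities.

Insight: Dynamic Alignment offers a scalable, self-improving path for value alignment of future intelligent systems, moving past the limits of conventional static alignment methods.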

Abstract: Large Language Models (LLMs) are typically aligned with human values using preference data or predefined principles such as helpfulness, honesty, and harmlessness. However, as AI systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), such value systems may become insufficient. In addition, human feedback-based alignment remains resource-intensive and difficult to scale. While AI-feedback-based self-improving alignment methods have been explored as a scalable alternative, they have largely remained constrained to conventional alignment values. In this work, we explore both a more holistic alignment objective and a scalable, self-improving alignment approach. Aiming to transcend conventional alignment norms, we introduce Collective Agency (CA), a unified and open-ended alignment value that encourages integrated agentic capabilities. We also propose Dynamic Alignment, an alignment framework that enables an LLM to iteratively align itself. Dynamic Alignment comprises two key components: (1) automated training dataset generation with LLMs, and (2) a self-rewarding mechanism, where the policy model evaluates its own output candidates and assigns rewards for GRPO-based learning. Experimental results demonstrate that our approach successfully aligns the model to CA while preserving general NLP capabilities.
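The GRPO-based learning step takes a group of candidate outputs for one prompt and turns their rewards into group-relative advantages. The sketch below shows only that normalization, as commonly defined for GRPO; the reward values are hypothetical stand-ins for the self-assigned scores, and the paper's reward design and training loop are not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical self-assigned rewards for three candidate outputs of one prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Candidates scored above the group mean receive positive advantages and are reinforced; when every candidate gets the same reward, all advantages are zero and the group contributes no learning signal.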


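[70] Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches cs.CL

Namu Park, Farzad Ahmed, Zhaoyi Sun, Kevin Lybarger, Ethan Breinhorst

TL;DR: The paper studies how large language models (LLMs) and traditional supervised methods can automatically identify incidentalomas requiring follow-up, using an anatomy-aware prompting strategy. LLMs given anatomical information outperform the supervised methods and approach human expert performance.

Details

Motivation: Current document-level classification systems are limited in identifying incidentalomas that require follow-up because they cannot detect findings at the lesion level. The study examines whether LLMs can do better with an improved inference strategy (lesion tagging and anatomy-aware prompting).

Result: The anatomy-informed GPT-OSS-20b model performs best (macro-F1: 0.79), surpassing all supervised baselines (maximum macro-F1: 0.70); a majority-vote ensemble of the top systems further improves the macro-F1 to 0.90.

Insight: 1. Introducing anatomical information significantly improves LLM performance on this medical task. 2. Ensembling can compensate for the weaknesses of individual models, at a cost in efficiency. 3. In medical settings that demand high precision and expertise, careful prompt engineering is essential.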

Abstract: Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20b). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p < 0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions. Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
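The majority-vote ensemble and class-specific F1 reported above can be illustrated with the toy sketch below. The labels and the three systems are invented for illustration and do not come from the paper; an odd number of systems with binary labels is assumed so that ties cannot occur.

```python
from collections import Counter

def majority_vote(predictions_per_system):
    """Combine aligned per-lesion label lists (one list per system) by majority."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions_per_system)]

def f1_for_class(y_true, y_pred, positive=1):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical lesion-level labels: 1 = incidentaloma requiring follow-up.
truth = [1, 0, 1, 0]
system_a = [1, 0, 1, 0]
system_b = [1, 1, 0, 0]
system_c = [0, 1, 1, 0]
voted = majority_vote([system_a, system_b, system_c])
print(voted, f1_for_class(truth, voted))
```

Macro-F1 would then be the unweighted mean of `f1_for_class` over the classes of interest.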


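[71] Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems cs.CL

Aurprita Mahmood, Sabrin alam, Neloy kumer Sagor, Md. Abdul Hadi, Md. Sehab Al Islam

TL;DR: This paper studies Tree-of-Thought (ToT) reasoning for Bengali math word problems. Compared with Chain-of-Thought (CoT), ToT performs better on multi-step reasoning, with the clearest gains in medium-to-large models.

Details

Motivation: Math word problems (MWPs) require both language understanding and multi-step numerical reasoning, but the linear structure of CoT tends to propagate errors, which limits its effectiveness. The study examines how ToT performs in a low-resource language, Bengali.

Result: CoT raises average accuracy from 78% (standard prompting) to 83%, and ToT adds up to 5 percentage points, reaching 88% with GPT-OSS-120B. The gains are most pronounced in medium-to-large models and smaller for small ones.

Insight: The structured reasoning of ToT is more reliable than CoT and applies to multilingual and low-resource settings, pointing to better reasoning strategies for multilingual NLP.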

Abstract: Mathematical Word Problems (MWPs) are among the most challenging tasks in natural language processing because they require both linguistic understanding and multi-step numerical reasoning. While Chain-of-Thought (CoT) prompting has shown promise, its linear structure often propagates errors, limiting overall effectiveness. To address this limitation, we present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Owing to computational and token-cost constraints, we evaluate a curated set of 100 representative problems across multiple large language models (LLMs), including GPT-OSS and LLaMA variants, under standard prompting, CoT, and ToT strategies. Our results show that CoT improves baseline accuracy from 78% (standard prompting) to 83% on average, while ToT further increases performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. These improvements highlight that ToT is particularly effective in medium-to-large-scale models but may offer less advantage for smaller ones. Overall, our findings establish ToT as a robust framework for solving mathematical problems in low-resource languages such as Bengali. More broadly, this study shows that structured reasoning methods like ToT can provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.
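
The difference between a single linear chain and a tree of candidate thoughts can be sketched with a toy search. This is an illustration only, not the paper's implementation: `propose` and `score` are hypothetical stand-ins for the LLM calls that would generate and evaluate intermediate thoughts, and the arithmetic puzzle (reach 10 from 1 in three steps) is made up.

```python
def tree_of_thought(start, propose, score, depth, beam_width):
    """Breadth-first search over partial solutions, keeping the best `beam_width` per step.

    With beam_width=1 this degenerates into a greedy linear chain, which is used
    here as a rough analogue of chain-of-thought reasoning.
    """
    frontier = [(start, [])]
    for _ in range(depth):
        candidates = []
        for state, path in frontier:
            for label, next_state in propose(state):
                candidates.append((next_state, path + [label]))
        # Keep only the most promising partial solutions.
        candidates.sort(key=lambda c: score(c[0]), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]

def propose(value):
    """Hypothetical thought generator: two possible next steps."""
    return [("+3", value + 3), ("*2", value * 2)]

def score(value):
    """Hypothetical evaluator: closer to the target 10 is better."""
    return -abs(10 - value)

chain_result = tree_of_thought(1, propose, score, depth=3, beam_width=1)
tree_result = tree_of_thought(1, propose, score, depth=3, beam_width=2)
```

The greedy chain commits to the locally best second step and ends at 11, while the tree keeps an alternative branch alive and reaches the target 10. This mirrors, in miniature, how an early error in a linear chain carries through to the final answer.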


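[72] A Greek Government Decisions Dataset for Public-Sector Analysis and Insight cs.CL

Giorgos Antoniou, Giorgos Filandrianos, Aggelos Vlachos, Giorgos Stamou, Lampros Kollimenos

TL;DR: This paper introduces an open dataset of Greek government decisions: 1 million decisions with high-quality text extracted from the Diavgeia platform. The dataset supports a retrieval-augmented generation (RAG) task and can serve as pre-training data for language models in the legal and governmental domains.

Details

Motivation: The paper aims to improve public-sector transparency and access to information through an open, machine-readable dataset of government decisions, while also providing high-quality training material for language models.

Result: A baseline RAG system shows the potential of retrieving and reasoning over government documents, and the dataset can serve as pre-training or fine-tuning material for legal and governmental language models.

Insight: Large-scale public-sector datasets support transparency and advanced information access, and open new opportunities for domain adaptation and knowledge-grounded generation.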

Abstract: We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, featuring high-quality raw text extracted from PDFs. It is released with raw extracted text in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs) respectively, including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.


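[73] Grounded Multilingual Medical Reasoning for Question Answering with Large Language Models cs.CL | cs.AI

Pietro Ferrazzi, Aitor Soroa, Rodrigo Agerri

TL;DR: This paper proposes a retrieval-augmented generation method for producing multilingual medical reasoning traces, with the goal of improving large language models on medical question answering.

Details

Motivation: Existing approaches are largely English-focused and rely on distillation from general-purpose LLMs, which raises concerns about the reliability of their medical knowledge.

Result: On medical QA benchmarks, the reasoning traces improve performance with both few-shot in-context learning and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs.

Insight: Multilingual reasoning traces grounded in medical knowledge can make models more transparent and safer, supporting the development of multilingual clinical decision-support tools.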

Abstract: Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and primarily rely on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces grounded in factual medical knowledge. We produce 500k traces in English, Italian, and Spanish, using a retrieval-augmented generation approach over medical information from Wikipedia. The traces are generated to solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of safer, more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.


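[74] Interleaved Latent Visual Reasoning with Selective Perceptual Modeling cs.CL | cs.CV

Shuai Dong, Siyuan Wang, Xingyu Liu, Zhongyu Wei

TL;DR: ILVR is an interleaved latent visual reasoning framework that combines dynamic state evolution with precise perceptual modeling, significantly improving the reasoning ability of multimodal large language models.

Details

Motivation: Existing visual reasoning methods either incur high computational cost or model perception inadequately. A framework is needed that supports both dynamic reasoning and precise perception.

Result: ILVR significantly outperforms existing approaches on multiple multimodal reasoning benchmarks.

Insight: Combining selective perceptual modeling with dynamic latent representations resolves the tension between computational cost and modeling precision in multimodal reasoning.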

Abstract: Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.


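[75] MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation cs.CL

Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung

TL;DR: This paper presents MedTutor-R1, a personalized medical teaching system built on multi-agent simulation. The multi-agent pedagogical simulator ClinEdu generates teaching data, which is used to train a Socratic tutor for one-to-many instruction. The tutor outperforms its base model by over 20% in average pedagogical score.

Details

Motivation: Demand for clinical training in medical education is high while expert instruction is scarce. Existing research focuses mainly on one-on-one knowledge instruction and overlooks collaborative reasoning, a skill developed through teamwork.

Result: MedTutor-R1 outperforms the base model by over 20% in average pedagogical score, is comparable to o3, and adapts well to a varying number of students.

Insight: Combining a multi-agent simulator with Socratic teaching can make medical education more personalized and adaptive, with particular potential for building collaborative reasoning skills.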

Abstract: The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.


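[76] M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG cs.CL | cs.AI | cs.CV

David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata

TL;DR: M4-RAG is a massive-scale multilingual, multi-cultural, multimodal RAG benchmark covering 42 languages and 56 regional dialects and registers, with over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. The study finds that RAG helps smaller models but not larger ones.

Details

Motivation: Vision-language models (VLMs) perform well on VQA but are constrained by static training data. RAG mitigates this limitation, yet multilingual multimodal RAG remains underexplored. M4-RAG is proposed to fill that gap.

Result: RAG consistently benefits smaller VLMs but fails to scale to larger VLMs and often degrades their performance, exposing a mismatch between model size and current retrieval effectiveness.

Insight: 1) Multilingual multimodal RAG deserves more research attention; 2) matching retrieval effectiveness to model size is key to future RAG systems; 3) M4-RAG provides a foundation for next-generation cross-lingual, cross-modal RAG systems.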

Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.


eess.AS

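[77] SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model eess.AS | cs.AI | cs.CL | cs.CV | cs.MM | cs.SD

Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding

TL;DR: SyncVoice improves video dubbing with a vision-augmented pretrained TTS model, addressing speech naturalness and audio-visual synchronization and extending to cross-lingual scenarios.

Details

Motivation: Existing video dubbing methods fall short in speech naturalness and audio-visual synchronization and are limited to monolingual settings. SyncVoice aims to address these challenges.

Result: Experiments show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance.

Insight: Fine-tuning a TTS model on audio-visual data is central to audio-visual consistency, and the Dual Speaker Encoder mitigates inter-language interference in cross-lingual synthesis.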

Abstract: Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.


econ.TH

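[78] Vague Knowledge: Information without Transitivity and Partitions econ.TH | cs.CL | math.LO | q-fin.GN

Kerry Xiao

TL;DR: This paper relaxes the transitivity and partition-structure assumptions in economic models of information and formalizes the concept of vague knowledge, showing why it matters in the real world.

Details

Motivation: Standard economic models assume that information is transitive and partitions the state space, whereas real-world knowledge is often vague. The paper studies the properties of vague knowledge and how it can be communicated.

Result: Vague knowledge does not partition the state space but remains informative by distinguishing some states from others, and it can only be faithfully expressed through vague communication.

Insight: Vague knowledge provides microfoundations for the prevalence of natural language communication and qualitative reasoning in the real world.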

Abstract: I relax the standard assumptions of transitivity and partition structure in economic models of information to formalize vague knowledge: non-transitive indistinguishability over states. I show that vague knowledge, while failing to partition the state space, remains informative by distinguishing some states from others. Moreover, it can only be faithfully expressed through vague communication with blurred boundaries. My results provide microfoundations for the prevalence of natural language communication and qualitative reasoning in the real world, where knowledge is often vague.
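
A small example may make non-transitive indistinguishability concrete. The tolerance relation below (two states are indistinguishable when they differ by at most 1) is an illustrative assumption and not the paper's formal model; it only shows how such a relation can fail to partition the state space while still separating some states.

```python
from itertools import product

states = [0, 1, 2, 3, 4]

def indistinguishable(a, b, tolerance=1):
    """Hypothetical relation: states within `tolerance` of each other look the same."""
    return abs(a - b) <= tolerance

def is_transitive(states, rel):
    """Check whether rel(a, b) and rel(b, c) always imply rel(a, c)."""
    return all(
        rel(a, c)
        for a, b, c in product(states, repeat=3)
        if rel(a, b) and rel(b, c)
    )

def information_sets(states, rel):
    """For each state, the set of states the agent cannot tell apart from it."""
    return {s: frozenset(t for t in states if rel(s, t)) for s in states}

def forms_partition(info_sets):
    """A partition requires any two information sets to be identical or disjoint."""
    cells = set(info_sets.values())
    return all(a == b or a.isdisjoint(b) for a in cells for b in cells)

info = information_sets(states, indistinguishable)
transitive = is_transitive(states, indistinguishable)
partition = forms_partition(info)
still_informative = not indistinguishable(0, 4)
```

Here 0 looks like 1 and 1 looks like 2, yet 0 and 2 can be told apart, so the information sets overlap without coinciding. The relation is still informative because distant states such as 0 and 4 are distinguished.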


cs.AI

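[79] To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis cs.AI | cs.CL

Federico Bianchi, Yongchan Kwon, Zachary Izzo, Linjun Zhang, James Zou

TL;DR: This paper develops a Paper Correctness Checker based on GPT-5 to systematically identify objective mistakes in papers published at top AI conferences and journals. It finds that the average number of mistakes per paper has grown over the years and shows the potential of LLMs for detecting and correcting them.

Details

Motivation: As AI research accelerates and pressure on the peer-review system grows, mistakes in published papers can go undetected and propagate, affecting follow-up studies and reproducibility.

Result: 1) The average number of mistakes per paper has increased over time (for example, by 55.3% from NeurIPS 2021 to NeurIPS 2025); 2) the AI Checker identifies mistakes with a precision of 83.2%; 3) the checker proposes correct fixes for 75.8% of the identified mistakes.

Insight: Frontier LLMs can effectively detect and correct objective mistakes in papers, which helps improve the quality and reproducibility of the literature and eases the load on peer review.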

Abstract: How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.
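
The headline figures in the abstract follow from simple arithmetic, reproduced in the short sketch below. The inputs are the numbers quoted in the abstract. The ICLR and TMLR percentage increases are not stated there and are derived here only for comparison.

```python
def percent_increase(old, new):
    """Relative increase from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old

# Precision: confirmed mistakes divided by all flagged potential mistakes.
confirmed, reviewed = 263, 316
precision_pct = 100.0 * confirmed / reviewed

# Average mistakes per paper, first and last reported year for each venue.
neurips_pct = percent_increase(3.8, 5.9)  # stated in the abstract as 55.3%
iclr_pct = percent_increase(4.1, 5.2)     # derived here, not stated in the abstract
tmlr_pct = percent_increase(5.0, 5.5)     # derived here, not stated in the abstract
```

Rounded to one decimal place, these give 83.2% precision and a 55.3% increase for NeurIPS, matching the abstract. The derived increases for ICLR and TMLR come out at roughly 26.8% and 10.0%.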


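[80] Multimodal Oncology Agent for IDH1 Mutation Prediction in Low-Grade Glioma cs.AI | cs.CV

Hafsa Akebli, Adam Shephard, Vincenzo Della Mea, Nasir Rajpoot

TL;DR: This paper proposes a Multimodal Oncology Agent (MOA) that combines histology with clinical and genomic data, drawing on external biomedical resources, to predict IDH1 mutations in low-grade glioma. It outperforms the baselines.

Details

Motivation: Predicting IDH1 mutations in low-grade glioma matters for prognosis and treatment, and the limits of single-modality methods motivate a multimodal approach.

Result: On the TCGA-LGG cohort, MOA fused with histology features reaches an F1-score of 0.912, exceeding the clinical baseline (0.798), the histology baseline (0.894), and the fused histology-clinical baseline (0.897).

Insight: Multimodal fusion together with external biomedical resources can improve the accuracy of tumor mutation prediction and compensate for the limits of any single modality.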

Abstract: Low-grade gliomas frequently present IDH1 mutations that define clinically distinct subgroups with specific prognostic and therapeutic implications. This work introduces a Multimodal Oncology Agent (MOA) integrating a histology tool based on the TITAN foundation model for IDH1 mutation prediction in low-grade glioma, combined with reasoning over structured clinical and genomic inputs through PubMed, Google Search, and OncoKB. MOA reports were quantitatively evaluated on 488 patients from the TCGA-LGG cohort against clinical and histology baselines. MOA without the histology tool outperformed the clinical baseline, achieving an F1-score of 0.826 compared to 0.798. When fused with histology features, MOA reached the highest performance with an F1-score of 0.912, exceeding both the histology baseline at 0.894 and the fused histology-clinical baseline at 0.897. These results demonstrate that the proposed agent captures complementary mutation-relevant information enriched through external biomedical sources, enabling accurate IDH1 mutation prediction.


cs.IR

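[81] RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering cs.IR | cs.AI | cs.CL

Rongyang Zhang, Yuqing Huang, Chengqiang Lu, Qimeng Wang, Yan Gao

TL;DR: This paper presents RAG-IGBench, a comprehensive benchmark for evaluating Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG), that is, interleaved image-text generation in open-domain question answering. RAG-IG combines MLLMs with retrieval to produce coherent multimodal content, and the benchmark introduces new evaluation metrics.

Details

Motivation: In real-world scenarios, visually enhanced responses to user queries can considerably aid understanding and memory, yet interleaved image-text generation is under-evaluated and lacks suitable metrics.

Result: Experiments reveal both the capabilities and the limitations of current MLLMs. The new metrics correlate highly with human assessments, and models fine-tuned on the RAG-IGBench training set improve across multiple benchmarks.

Insight: RAG-IGBench offers a more complete evaluation framework for multimodal generation and underlines the role of retrieval in improving content quality.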

Abstract: In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content. Distinct from previous datasets, RAG-IGBench draws on the latest publicly available content from social platforms and introduces innovative evaluation metrics that measure the quality of text and images, as well as their consistency. Through extensive experiments with state-of-the-art MLLMs (both open-source and proprietary) on RAG-IGBench, we provide an in-depth analysis examining the capabilities and limitations of these models. Additionally, we validate our evaluation metrics by demonstrating their high correlation with human assessments. Models fine-tuned on RAG-IGBench’s training set exhibit improved performance across multiple benchmarks, confirming both the quality and practical utility of our dataset. Our benchmark is available at https://github.com/USTC-StarTeam/RAG-IGBench.


cs.HC

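[82] EXR: An Interactive Immersive EHR Visualization in Extended Reality cs.HC | cs.CV | cs.LG | cs.MM

Benoit Marteau, Shaun Q. Y. Tan, Jieru Li, Andrew Hornback, Yishan Zhong

TL;DR: This paper presents an Extended Reality (XR) platform for immersive, interactive visualization of Electronic Health Records (EHRs). The system goes beyond conventional 2D interfaces by visualizing structured and unstructured patient data in a shared 3D environment that supports intuitive exploration and real-time collaboration.

Details

Motivation: Conventional EHR systems rely mainly on 2D interfaces, which limits intuitive exploration and interaction. XR enables more natural data exploration and collaboration and could underpin next-generation clinical decision-support tools.

Result: The platform's capabilities are demonstrated on synthetic EHR datasets and CT-derived spine models processed through an AI-powered segmentation pipeline, showing immersive visualization of and interaction with EHR data.

Insight: XR is a promising direction for medical data visualization, especially where complex data exploration and collaboration are needed.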

Abstract: This paper presents the design and implementation of an Extended Reality (XR) platform for immersive, interactive visualization of Electronic Health Records (EHRs). The system extends beyond conventional 2D interfaces by visualizing both structured and unstructured patient data into a shared 3D environment, enabling intuitive exploration and real-time collaboration. The modular infrastructure integrates FHIR-based EHR data with volumetric medical imaging and AI-generated segmentation, ensuring interoperability with modern healthcare systems. The platform’s capabilities are demonstrated using synthetic EHR datasets and computed tomography (CT)-derived spine models processed through an AI-powered segmentation pipeline. This work suggests that such integrated XR solutions could form the foundation for next-generation clinical decision-support tools, where advanced data infrastructures are directly accessible in an interactive and spatially rich environment.


cs.SE

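[83] Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures cs.SE | cs.AI | cs.CL | cs.IR

Amirkia Rafiei Oskooei, S. Selcan Yukcu, Mehmet Cevheri Bozoglan, Mehmet S. Aktas

TL;DR: This paper reframes bug localization in multi-repository microservice architectures as a natural language task by summarizing codebases in natural language, and significantly outperforms conventional retrieval methods.

Details

Motivation: Bug localization in multi-repository microservice architectures is hampered by the semantic gap between bug reports and code, LLM context limits, and the need to first pick the correct repository.

Result: On an industrial system with 46 repositories, the method achieves Pass@10 of 0.82 and MRR of 0.50, outperforming retrieval baselines and agentic RAG systems such as GitHub Copilot and Cursor.

Insight: Engineered natural language representations can be more effective than raw source code, and they yield an interpretable search path that makes enterprise AI tools more transparent.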

Abstract: Bug localization in multi-repository microservice architectures is challenging due to the semantic gap between natural language bug reports and code, LLM context limitations, and the need to first identify the correct repository. We propose reframing this as a natural language reasoning task by transforming codebases into hierarchical NL summaries and performing NL-to-NL search instead of cross-modal retrieval. Our approach builds context-aware summaries at file, directory, and repository levels, then uses a two-phase search: first routing bug reports to relevant repositories, then performing top-down localization within those repositories. Evaluated on DNext, an industrial system with 46 repositories and 1.1M lines of code, our method achieves Pass@10 of 0.82 and MRR of 0.50, significantly outperforming retrieval baselines and agentic RAG systems like GitHub Copilot and Cursor. This work demonstrates that engineered natural language representations can be more effective than raw source code for scalable bug localization, providing an interpretable repository -> directory -> file search path, which is vital for building trust in enterprise AI tools by providing essential transparency.
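
For readers unfamiliar with the two reported metrics, the sketch below computes Pass@k and MRR on toy rankings. The definitions are the common ones and are assumed here: Pass@k is the fraction of bug reports with at least one correct file in the top k, and MRR is the mean reciprocal rank of the first correct file. The abstract does not spell out its exact formulation, and the file names are made up.

```python
def pass_at_k(ranked_lists, relevant_sets, k):
    """Fraction of queries with at least one relevant item among the top k results."""
    hits = sum(
        any(item in relevant for item in ranked[:k])
        for ranked, relevant in zip(ranked_lists, relevant_sets)
    )
    return hits / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item; a query with no hit contributes 0."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Toy data: three bug reports, each with a ranked list of candidate files.
ranked_lists = [
    ["a.py", "b.py", "c.py"],  # correct file at rank 1
    ["x.py", "y.py", "z.py"],  # correct file at rank 3
    ["m.py", "n.py"],          # correct file never retrieved
]
relevant_sets = [{"a.py"}, {"z.py"}, {"q.py"}]

p_at_2 = pass_at_k(ranked_lists, relevant_sets, k=2)
p_at_3 = pass_at_k(ranked_lists, relevant_sets, k=3)
mrr = mean_reciprocal_rank(ranked_lists, relevant_sets)
```

Under these definitions, a Pass@10 of 0.82 would mean that for 82% of bug reports a correct file appears in the top ten. An MRR of 0.50 corresponds to the first correct file sitting at about rank 2 when measured by the harmonic mean of ranks.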


cs.LG

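[84] Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning cs.LG | cs.CL

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin

TL;DR: This paper proposes Entropy Ratio Clipping (ERC), a soft global constraint that addresses training instability in reinforcement learning, including fluctuating policy entropy and unstable gradients. ERC quantifies the relative change in policy exploration, stabilizes policy updates, and improves performance across multiple benchmarks.

Details

Motivation: Post-training of large language models relies on reinforcement learning to improve capability and alignment quality. However, the off-policy training paradigm introduces distribution shift that pushes the policy beyond the trust region, causing instabilities such as fluctuating policy entropy and unstable gradients. PPO-Clip mitigates this through importance clipping but overlooks the global distributional shift of actions.

Result: Experiments across multiple benchmarks show that ERC consistently improves performance when integrated into DAPO and GPPO.

Insight: By measuring the change in policy exploration with a global metric, ERC offers a new way to tackle training instability in reinforcement learning. Its bidirectional constraint regulates probability shifts of un-sampled actions that PPO-Clip cannot reach, and it plugs into existing algorithms.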

Abstract: Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-Clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.
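
The entropy-ratio idea can be sketched as follows. This is one plausible reading of the abstract and not the authors' implementation: the per-token granularity, the band `[low, high]`, and the choice to drop out-of-band tokens from the update are all assumptions, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution over actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_ratio_mask(new_dists, old_dists, low=0.95, high=1.05):
    """Flag which positions keep their gradient under a two-sided entropy-ratio bound.

    new_dists / old_dists: full action distributions of the current and previous
    policy at each position. Because the ratio uses the whole distribution, it
    also reacts to probability shifts on actions that were never sampled.
    """
    keep, ratios = [], []
    for new, old in zip(new_dists, old_dists):
        ratio = entropy(new) / max(entropy(old), 1e-12)  # guard against zero entropy
        ratios.append(ratio)
        keep.append(low <= ratio <= high)
    return keep, ratios

# Toy example with a two-action policy at three positions.
old_dists = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
new_dists = [[0.5, 0.5], [0.9, 0.1], [0.52, 0.48]]

keep, ratios = entropy_ratio_mask(new_dists, old_dists)
```

In the toy example the second position collapses from a uniform distribution to a confident one, so its entropy ratio falls to about 0.47 and it is excluded, while the small drift at the third position stays inside the band. A symmetric upper bound would likewise exclude updates that sharply increase entropy.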


cs.RO

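[85] Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation cs.RO | cs.CV

Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller

TL;DR: This paper proposes an efficient instance-centric, query-centric scene representation for behavior modeling in multi-agent driving simulation. It models interactions with local coordinate frames and a symmetric context encoder, and uses Adversarial Inverse Reinforcement Learning to make the model more robust and realistic.

Details

Motivation: Scalable multi-agent driving simulation needs behavior models that are both realistic and computationally efficient. Existing methods tend to fall short in computational efficiency or interaction modeling.

Result: Experiments show that the approach scales efficiently with the number of tokens, significantly reduces training and inference times, and outperforms several agent-centric baselines in positional accuracy and robustness.

Insight: Representing each scene element in its own local coordinate frame, together with a symmetric encoder design, is key to efficient multi-agent simulation, and the adaptive reward transformation balances robustness and realism during training.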

Abstract: Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
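
The viewpoint invariance of an instance-centric representation can be illustrated with relative poses between local frames. The sketch below assumes planar poses `(x, y, heading)` and is not the paper's encoder; it only shows that the relative pose between two elements is unchanged when the whole scene is shifted and rotated. The function names are hypothetical.

```python
import math

def relative_pose(frame_i, frame_j):
    """Pose of frame_j expressed in the local coordinate frame of frame_i.

    Each frame is (x, y, heading) in a shared global frame, heading in radians.
    """
    xi, yi, heading_i = frame_i
    xj, yj, heading_j = frame_j
    dx, dy = xj - xi, yj - yi
    cos_h, sin_h = math.cos(heading_i), math.sin(heading_i)
    # Rotate the global offset by -heading_i into frame_i's axes.
    rel_x = cos_h * dx + sin_h * dy
    rel_y = -sin_h * dx + cos_h * dy
    # Wrap the heading difference into (-pi, pi].
    diff = heading_j - heading_i
    rel_heading = math.atan2(math.sin(diff), math.cos(diff))
    return rel_x, rel_y, rel_heading

def move_scene(frame, shift_x, shift_y, rotation):
    """Apply one global rigid transform (rotate, then translate) to a frame."""
    x, y, heading = frame
    c, s = math.cos(rotation), math.sin(rotation)
    return (c * x - s * y + shift_x, s * x + c * y + shift_y, heading + rotation)

# Toy scene: a vehicle heading "north" and a lane segment 3 m ahead of it.
vehicle = (1.0, 2.0, math.pi / 2)
lane = (1.0, 5.0, math.pi / 2)

before = relative_pose(vehicle, lane)
after = relative_pose(
    move_scene(vehicle, 10.0, -4.0, 0.7),
    move_scene(lane, 10.0, -4.0, 0.7),
)
```

Both calls return approximately `(3.0, 0.0, 0.0)`, so an encoder that sees only such relative quantities does not depend on the choice of global viewpoint. That independence is also what would allow static map tokens to be computed once and reused across simulation steps.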


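[86] SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models cs.RO | cs.CV

Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao

TL;DR: SIMPACT combines simulation with vision-language models (VLMs) to add physical reasoning without any additional training, improving fine-grained robotic manipulation.

Details

Motivation: Existing VLMs lack a grounded understanding of physical dynamics because they are trained on static internet-scale data. This limits their use in robotic manipulation tasks that require physical reasoning.

Result: On five real-world manipulation tasks that require fine-grained physical reasoning, SIMPACT outperforms existing general-purpose robotic manipulation models.

Insight: Embedding physical understanding through simulation at test time is an effective route to more general embodied intelligence and requires no additional training.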

Abstract: Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at https://simpact-bot.github.io