cs.CV [Total: 85]
cs.CL [Total: 17]
cs.RO [Total: 4]
cs.AI [Total: 2]
cs.LG [Total: 2]
physics.med-ph [Total: 1]
cs.CY [Total: 2]
eess.IV [Total: 3]
cs.SD [Total: 1]
cs.HC [Total: 1]

cs.CV

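[1] The persistence of painting styles cs.CV

Reetikaa Reddy Munnangi, Barbara Giunti

TL;DR: This paper proposes a topological data analysis approach based on persistent homology (PH) to objectively distinguish the painting styles of different artists and to tell an artist's works apart from AI-generated images in the same style.

Details

Motivation: Traditional identification of artistic style relies on subjective visual intuition and experience, whereas mathematical tools can offer a more objective and structured view of art.

Result: PH can distinguish artists from different artistic currents as well as artists within the same current, and can reliably identify AI-generated artwork.

Insight: Topological methods provide a new tool for the quantitative analysis of artistic style, showing that styles are persistent and distinguishable.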

Abstract: Art is a deeply personal and expressive medium, where each artist brings their own style, technique, and cultural background into their work. Traditionally, identifying artistic styles has been the job of art historians or critics, relying on visual intuition and experience. However, with the advancement of mathematical tools, we can explore art through a more structured lens. In this work, we show how persistent homology (PH), a method from topological data analysis, provides objective and interpretable insights into artistic styles. We show how PH can, with statistical certainty, differentiate between artists, both from different artistic currents and from the same one, and distinguish images of an artist from an AI-generated image in the artist’s style.
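To make the idea of persistent homology concrete, here is a minimal sketch of 0-dimensional sublevel-set persistence on a small grayscale grid. It is not the authors' pipeline: the function name `sublevel_h0_persistence`, the 4-neighbour connectivity, and the use of raw pixel intensity as the filtration value are all assumptions made for illustration.

```python
import math


def sublevel_h0_persistence(image):
    """Return sorted (birth, death) pairs of 0-dim sublevel-set persistence.

    `image` is a list of rows of numbers. Pixels enter the filtration in
    order of increasing value; connected components are tracked with
    union-find over 4-neighbours (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    parent = {}  # union-find parent pointers, keyed by (row, col)
    birth = {}   # birth value, meaningful at component roots

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    pairs = []
    order = sorted((image[r][c], r, c) for r in range(rows) for c in range(cols))
    for value, r, c in order:
        p = (r, c)
        parent[p] = p
        birth[p] = value
        for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if q not in parent:
                continue  # neighbour is off-grid or not yet in the sublevel set
            a, b = find(p), find(q)
            if a == b:
                continue
            # Elder rule: the component born later dies at the current value.
            young, old = (a, b) if birth[a] > birth[b] else (b, a)
            if value > birth[young]:
                pairs.append((birth[young], value))  # skip zero-persistence pairs
            parent[young] = old
    # The component containing the global minimum never dies.
    pairs.append((order[0][0], math.inf))
    return sorted(pairs)


if __name__ == "__main__":
    # Two dark basins (values 0 and 1) separated by a bright ridge (value 5).
    print(sublevel_h0_persistence([[0, 5, 1]]))
```

Each finite pair records a dark region that appears at its birth value and merges into an older region at its death value. Comparing such diagrams across paintings, and any statistical testing on top of them, is outside the scope of this sketch.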

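[2] Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions cs.CV | eess.IV

Takuya Igaue, Catia Correia-Caeiro, Akito Yoshida, Takako Miyabe-Nishiwaki, Ryusuke Hayashi

TL;DR: This paper proposes a motion transfer-enhanced StyleGAN2 method for generating diverse macaque facial expressions. Data augmentation, sample selection, and loss function refinement address the shortage of training data, and the method outperforms models trained only on the original still images in both generation and image editing.

Details

Motivation: Because training data for animal facial expressions is limited in quantity and variation, generating diverse macaque facial expressions is challenging. The paper aims to improve generation with a generative adversarial network (GAN).

Result: The proposed method generates diverse macaque facial expressions, outperforms models trained only on the original still images, and performs well in style-based editing.

Insight: Motion transfer can effectively mitigate data scarcity, and StyleGAN's latent space can disentangle expression and motion components, providing a new tool for research on animal faces.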

Abstract: Generating animal faces using generative AI techniques is challenging because the available training images are limited both in quantity and variation, particularly for facial expressions across individuals. In this study, we focus on macaque monkeys, widely studied in systems neuroscience and evolutionary research, and propose a method to generate their facial expressions using a style-based generative image model (i.e., StyleGAN2). To address data limitations, we implemented: 1) data augmentation by synthesizing new facial expression images using motion transfer to animate still images with computer graphics, 2) sample selection based on the latent representation of macaque faces from an initially trained StyleGAN2 model to ensure variation and uniform sampling in the training dataset, and 3) loss function refinement to ensure the accurate reproduction of subtle movements, such as eye movements. Our results demonstrate that the proposed method enables the generation of diverse facial expressions for multiple macaque individuals, outperforming models trained solely on original still images. Additionally, we show that our model is effective for style-based image editing, where specific style parameters correspond to distinct facial movements. These findings underscore the model’s potential for disentangling motion components as style parameters, providing a valuable tool for research on macaque facial expressions.

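[3] PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation cs.CV | cs.AI

Ting Pan, Ye Wang, Peiguang Jing, Rui Ma, Zili Yi

TL;DR: The paper presents PairHuman, the first large-scale benchmark dataset designed for high-quality dual-person portrait generation, with more than 100K images and rich metadata, together with the DHumanDiff baseline. Experiments show that the method generates highly customized dual-person portraits of superior visual quality.

Details

Motivation: Personalized dual-person portrait customization has broad applications, such as preserving emotional memories and planning wedding photography, but the lack of a high-quality benchmark dataset has held the field back.

Result: Experiments show that the PairHuman dataset and the DHumanDiff method produce highly customized dual-person portraits with superior visual quality.

Insight: High-quality datasets and purpose-built baselines are essential for improving the customization and visual quality of dual-person portrait generation.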

Abstract: Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, which is a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and balances personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.

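[4] A Machine Learning-Driven Solution for Denoising Inertial Confinement Fusion Images cs.CV | cs.AI

Asya Y. Akkus, Bradley T. Wolfe, Pinghan Chu, Chengkun Huang, Chris S. Campbell

TL;DR: The paper proposes a mixed Gaussian-Poisson denoising method for inertial confinement fusion (ICF) neutron images, based on an unsupervised autoencoder with a CDF 97 wavelet transform, and shows advantages over traditional non-machine-learning methods.

Details

Motivation: Neutron images are essential for analyzing ICF events but are often corrupted by coexisting Gaussian and Poisson noise, and traditional denoising methods are of limited effectiveness. Recent progress in synthetic data opens new opportunities for machine learning-based denoising.

Result: The network successfully denoises neutron images, with lower reconstruction error and better edge preservation than traditional methods such as BM3D.

Insight: Machine learning shows promise for ICF image denoising; combining it with a wavelet transform handles mixed noise more effectively and offers a path toward three-dimensional reconstruction analysis.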

Abstract: Neutron imaging is important in optimizing analysis of inertial confinement fusion (ICF) events such as those at the National Ignition Facility (NIF) and improving current and future ICF platforms. However, images of neutron sources are often degraded by various types of noise. Most commonly, Gaussian and Poisson noise coexist within one image, obscuring fine details and blurring edges. These noise types often overlap, making them difficult to distinguish and remove using conventional filtering and thresholding methods. As a result, noise removal techniques that preserve image fidelity are important for analyzing and interpreting images of a neutron source. Current solutions include a combination of filtering and thresholding methodologies. In the past, machine learning approaches were rarely implemented due to a lack of ground truth neutron imaging data for ICF processes. However, recent advances in synthetic data production, particularly in the fusion imaging field, have opened opportunities to investigate new denoising procedures using both supervised and unsupervised machine learning methods. In this study, we implement an unsupervised autoencoder with a Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space for mixed Gaussian-Poisson denoising. The network successfully denoises neutron imaging data. Additionally, it demonstrates lower reconstruction error and superior edge preservation metrics when benchmarked with data generated by a forward model and compared to non-ML-based filtering mechanisms such as Block-matching and 3D filtering (BM3D). This approach presents a promising advancement in neutron image noise reduction and three-dimensional reconstruction analysis of ICF experiments.
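For orientation, mixed Gaussian-Poisson corruption is commonly modelled as a signal-dependent Poisson term plus additive Gaussian noise. The form below is the textbook parameterization, not one taken from the paper; the clean intensity x, the gain α, and the Gaussian standard deviation σ are assumed symbols.

```latex
y = \alpha z + n, \qquad
z \sim \mathrm{Poisson}\!\left(\frac{x}{\alpha}\right), \qquad
n \sim \mathcal{N}(0, \sigma^{2}), \qquad
\mathbb{E}[y] = x, \qquad
\mathrm{Var}[y] = \alpha x + \sigma^{2}.
```

Because the variance grows with the signal x, a single global threshold or filter strength cannot suit both bright and dark regions, which is one way to see why the two noise types are hard to separate with conventional filtering.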

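[5] SAM 3: Segment Anything with Concepts cs.CV | cs.AI

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu

TL;DR: SAM 3 is a unified model that detects, segments, and tracks objects in images and videos from concept prompts. In Promptable Concept Segmentation (PCS), the model takes short noun phrases or image exemplars as input and returns segmentation masks and unique identities. The dataset contains 4M unique concept labels, and the model combines an image-level detector with a memory-based video tracker, substantially improving on the accuracy of existing systems.

Details

Motivation: Existing segmentation models fall short on diverse concept prompts; a more flexible solution is needed that segments and tracks objects from natural language or visual exemplars.

Result: SAM 3 doubles the accuracy of existing systems on image and video PCS and improves on previous SAM capabilities in visual segmentation tasks.

Insight: Prompts that combine natural language and visual exemplars significantly improve the flexibility and accuracy of segmentation models; decoupling recognition from localization boosts detection accuracy.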

Abstract: We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

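[6] SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge cs.CV | cs.AI | cs.LG

Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah

TL;DR: SaFeR-CLIP is a safety fine-tuning framework for vision-language models such as CLIP. By redirecting unsafe concepts to their semantically closest safe alternatives, it avoids the performance drop caused by conventional methods while maintaining safety.

Details

Motivation: Existing safety fine-tuning methods usually sacrifice generalization performance because they force unsafe concepts toward a single predefined safe target, disrupting the model's semantic structure.

Result: The method recovers up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety.

Insight: Respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.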

Abstract: Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model’s learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.
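The proximity-aware idea, picking the semantically closest safe alternative rather than one fixed target, can be sketched as a nearest-neighbour lookup under cosine similarity. This is only an illustration of target selection, not the SaFeR-CLIP training procedure; the function names, the concept labels, and the two-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_safe_target(unsafe_vec, safe_bank):
    """Return the name of the safe concept whose embedding is closest.

    `safe_bank` maps a concept name to its embedding (hypothetical data;
    in practice these would come from a text or image encoder).
    """
    return max(safe_bank, key=lambda name: cosine(unsafe_vec, safe_bank[name]))


if __name__ == "__main__":
    # Toy 2-D embeddings, invented for illustration only.
    safe_bank = {"a person in a coat": [0.9, 0.2], "a bowl of fruit": [0.0, 1.0]}
    unsafe_vec = [1.0, 0.1]
    print(nearest_safe_target(unsafe_vec, safe_bank))
```

Choosing the closest safe embedding keeps the required shift in representation space small, which matches the abstract's stated goal of minimal intervention.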

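[7] SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG cs.CV

Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas

TL;DR: The paper proposes SVG360, a three-stage framework that generates multi-view SVGs with geometric and color consistency from a single SVG input.

Details

Motivation: SVGs are central to modern design workflows, yet generating multi-view consistent SVGs from a single-view input remains underexplored.

Result: The generated SVGs are consistent in geometry and color across views, contain fewer redundant paths, and retain fine structural details.

Insight: The work combines generative modeling with structured vector representations, providing a scalable route to multi-view SVG generation from a single input.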

Abstract: Scalable Vector Graphics (SVGs) are central to modern design workflows, offering scaling without distortion and precise editability. However, for single object SVGs, generating multi-view consistent SVGs from a single-view input remains underexplored. We present a three stage framework that produces multi-view SVGs with geometric and color consistency from a single SVG input. First, the rasterized input is lifted to a 3D representation and rendered under target camera poses, producing multi-view images of the object. Next, we extend the temporal memory mechanism of Segment Anything 2 (SAM2) to the spatial domain, constructing a spatial memory bank that establishes part level correspondences across neighboring views, yielding cleaner and more consistent vector paths and color assignments without retraining. Finally, during the raster to vector conversion, we perform path consolidation and structural optimization to reduce redundancy while preserving boundaries and semantics. The resulting SVGs exhibit strong geometric and color consistency across views, significantly reduce redundant paths, and retain fine structural details. This work bridges generative modeling and structured vector representation, providing a scalable route to single input, object level multi-view SVG generation and supporting applications such as asset creation and semantic vector editing.

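[8] Mesh RAG: Retrieval Augmentation for Autoregressive Mesh Generation cs.CV | cs.AI

Xiatao Sun, Chen Liang, Qian Wang, Daniel Rakita

TL;DR: Mesh RAG is a novel, training-free framework for autoregressive mesh generation that uses retrieval augmentation to improve generation quality, speed up generation, and support incremental editing.

Details

Motivation: Handcrafting 3D meshes is time-consuming and hard to scale; existing autoregressive methods face a quality-speed trade-off and struggle to support incremental editing.

Result: Mesh RAG significantly improves mesh quality, accelerates generation, supports incremental editing, and applies to a variety of autoregressive mesh generation models.

Insight: Retrieval augmentation can decouple generation from its strict sequential dependency, pointing to a new way to improve the efficiency and quality of 3D content generation.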

Abstract: 3D meshes are a critical building block for applications ranging from industrial design and gaming to simulation and robotics. Traditionally, meshes are crafted manually by artists, a process that is time-intensive and difficult to scale. To automate and accelerate this asset creation, autoregressive models have emerged as a powerful paradigm for artistic mesh generation. However, current methods to enhance quality typically rely on larger models or longer sequences that result in longer generation time, and their inherent sequential nature imposes a severe quality-speed trade-off. This sequential dependency also significantly complicates incremental editing. To overcome these limitations, we propose Mesh RAG, a novel, training-free, plug-and-play framework for autoregressive mesh generation models. Inspired by RAG for language models, our approach augments the generation process by leveraging point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components. This retrieval-based approach decouples generation from its strict sequential dependency, facilitating efficient and parallelizable inference. We demonstrate the wide applicability of Mesh RAG across various foundational autoregressive mesh generation models, showing it significantly enhances mesh quality, accelerates generation speed compared to sequential part prediction, and enables incremental editing, all without model retraining.

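[9] WorldGen: From Text to Traversable and Interactive 3D Worlds cs.CV | cs.AI

Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou

TL;DR: WorldGen is a system that automatically generates large-scale, interactive 3D worlds from text prompts, combining several techniques to produce coherent, navigable environments.

Details

Motivation: Creating 3D worlds usually requires manual modeling or specialized skills, which limits creativity and efficiency. WorldGen aims to remove these barriers through automation and modularity so that non-experts can design virtual worlds.

Result: WorldGen generates 3D worlds that are geometrically consistent, visually rich, and efficient to render in real time, suitable for games, simulation, and social environments.

Insight: The approach shows the potential of combining multiple techniques in 3D generation and advances the use of generative AI for building virtual worlds.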

Abstract: We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

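[10] Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation cs.CV

Xizhe Xue, Xiao Xiang Zhu

TL;DR: The paper presents REO-Instruct, the first unified benchmark dataset for both descriptive and regression tasks in Earth Observation (EO). It links multimodal perception to measurable biophysical variables and exposes the difficulty current vision language models have with numeric reasoning.

Details

Motivation: Existing EO datasets focus on semantic understanding tasks such as classification or captioning and lack benchmarks that align multimodal perception with measurable biophysical variables. REO-Instruct fills this gap and aims to advance scientific vision language models.

Result: Current generic vision language models struggle markedly with numeric reasoning, highlighting a key challenge for scientific vision language models.

Insight: REO-Instruct provides a standardized foundation for developing next-generation geospatial models capable of both description and scientific inference.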

Abstract: Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in a forest ecological scenario (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at https://github.com/zhu-xlab/REO-Instruct.

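[11] BOP-ASK: Object-Interaction Reasoning for Vision-Language Models cs.CV | cs.RO

Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy

TL;DR: BOP-ASK is a new large-scale dataset for training and evaluating vision language models (VLMs) on fine-grained object-interaction tasks such as 3D localization, physical compatibility, and path planning, filling a gap in current spatial reasoning evaluation.

Details

Motivation: Existing VLMs score well on spatial reasoning benchmarks, but those benchmarks overlook fine-grained understanding of object interactions (3D localization, physical compatibility, and so on), which limits real-world applicability.

Result: Models trained on BOP-ASK outperform baselines on fine-grained spatial reasoning tasks such as precise 3D localization and exhibit emergent capabilities.

Insight: Fine-grained object-interaction data can significantly improve the practical capabilities of VLMs and exposes the limitations of current models in spatial reasoning.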

Abstract: Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high level relationships (‘left of,’ ‘behind’, etc.) but ignore fine-grained spatial understanding needed for real world applications: precise 3D localization, physical compatibility between objects, object affordances and multi step spatial planning. In this work, we present BOP-ASK, a novel large scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets from which we derive fine grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.

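[12] Parts-Mamba: Augmenting Joint Context with Part-Level Scanning for Occluded Human Skeleton cs.CV

Tianyi Shen, Huijuan Xu, Nilesh Ahuja, Omesh Tickoo, Philip Shin

TL;DR: Parts-Mamba is a hybrid GCN-Mamba model that uses part-level scanning to improve action recognition on occluded skeletons, significantly boosting performance.

Details

Motivation: In real-world scenarios, skeleton data is often incomplete because of occlusion or poor communication quality, and existing GCN models degrade because local context is missing.

Result: On the NTU RGB+D 60 and 120 datasets, accuracy under occlusion improves by up to 12.9%.

Insight: Capturing part-level context and fusing information from distant joints are key to improving action recognition on occluded skeletons.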

Abstract: Skeleton action recognition involves recognizing human action from human skeletons. The use of graph convolutional networks (GCNs) has driven major advances in this recognition task. In real-world scenarios, the captured skeletons are not always perfect or complete because of occlusions of parts of the human body or poor communication quality, leading to missing parts in skeletons or videos with missing frames. In the presence of such non-idealities, existing GCN models perform poorly due to missing local context. To address this limitation, we propose Parts-Mamba, a hybrid GCN-Mamba model designed to enhance the ability to capture and maintain contextual information from distant joints. The proposed Parts-Mamba model effectively captures part-specific information through its parts-specific scanning feature and preserves non-neighboring joint context via a parts-body fusion module. Our proposed model is evaluated on the NTU RGB+D 60 and NTU RGB+D 120 datasets under different occlusion settings, achieving up to 12.9% improvement in accuracy.

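[13] The Joint Gromov Wasserstein Objective for Multiple Object Matching cs.CV | q-bio.BM

Aryan Tajmir Riahi, Khanh Dao Duc

TL;DR: The paper introduces the Joint Gromov-Wasserstein (JGW) objective, which extends the traditional GW framework to match collections of objects simultaneously, lifting the original restriction to pairwise matching between single objects.

Details

Motivation: The traditional Gromov-Wasserstein (GW) distance supports only pairwise matching between single objects, but many applications require multiple-to-one or multiple-to-multiple matching, so a more flexible matching method is needed.

Result: JGW shows higher accuracy and computational efficiency on partial matching tasks, and its effectiveness is validated on synthetic and real-world datasets, including geometric shapes and biomolecular complexes.

Insight: JGW offers a new approach to complex matching problems, with promising applications in computer graphics and structural biology.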

Abstract: The Gromov-Wasserstein (GW) distance serves as a powerful tool for matching objects in metric spaces. However, its traditional formulation is constrained to pairwise matching between single objects, limiting its utility in scenarios and applications requiring multiple-to-one or multiple-to-multiple object matching. In this paper, we introduce the Joint Gromov-Wasserstein (JGW) objective and extend the original framework of GW to enable simultaneous matching between collections of objects. Our formulation provides a non-negative dissimilarity measure that identifies partially isomorphic distributions of mm-spaces, with point sampling convergence. We also show that the objective can be formulated and solved for point cloud object representations by adapting traditional algorithms in Optimal Transport, including entropic regularization. Our benchmarking with other variants of GW for partial matching indicates superior performance in accuracy and computational efficiency of our method, while experiments on both synthetic and real-world datasets show its effectiveness for multiple shape matching, including geometric shapes and biomolecular complexes, suggesting promising applications for solving complex matching problems across diverse domains, including computer graphics and structural biology.
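As background for the pairwise case that JGW extends, the sketch below evaluates the standard squared-loss GW cost of a given coupling between two finite metric spaces. It does not implement the joint objective or any solver; the function name `gw_cost` and the toy distance matrices are assumptions for illustration.

```python
def gw_cost(dx, dy, coupling):
    """Squared-loss Gromov-Wasserstein cost of a fixed coupling.

    dx: n x n distance matrix of the first space
    dy: m x m distance matrix of the second space
    coupling: n x m matrix of transported mass
    Computes sum over i, j, k, l of
    (dx[i][k] - dy[j][l])**2 * coupling[i][j] * coupling[k][l].
    """
    n, m = len(dx), len(dy)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0:
                continue  # no mass moved from i to j, term vanishes
            for k in range(n):
                for l in range(m):
                    diff = dx[i][k] - dy[j][l]
                    total += diff * diff * coupling[i][j] * coupling[k][l]
    return total


if __name__ == "__main__":
    two_points = [[0.0, 1.0], [1.0, 0.0]]
    stretched = [[0.0, 2.0], [2.0, 0.0]]
    identity = [[0.5, 0.0], [0.0, 0.5]]  # uniform mass, point i matched to point i
    print(gw_cost(two_points, two_points, identity))  # isometric spaces
    print(gw_cost(two_points, stretched, identity))   # distances disagree
```

The GW distance itself is the minimum of this cost over all couplings with the prescribed marginals; the quadruple loop here is O(n²m²) and is meant only to make the objective explicit.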

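[14] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment cs.CV | cs.LG

Loukas Sfountouris, Giannis Daras, Paris Giampouras

TL;DR: The paper applies representation alignment (REPA) to inverse problems, aligning diffusion or flow-based generative models with a pretrained self-supervised visual encoder such as DINOv2, which substantially improves reconstruction quality and efficiency.

Details

Motivation: In inverse problems, pretrained generative models are often used as priors, but reconstruction quality and convergence speed leave room for improvement. The inductive bias of representation alignment can improve performance.

Result: On super-resolution, box inpainting, Gaussian deblurring, and motion deblurring, REPA consistently improves reconstruction quality while reducing the number of discretization steps required.

Insight: Representation alignment improves the convergence and sample quality of generative models, and aligning with approximate target features also improves reconstruction in inverse problems, which underlines its role in perceptual fidelity.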

Abstract: Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model’s internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.

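[15] Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery cs.CV

Tao Yan, Hao Huang, Yiwei Lu, Zeyu Wang, Ke Xu

TL;DR: The paper proposes NFGlassNet, a glass surface detection method based on reflection dynamics in flash/no-flash imagery. It uses a Reflection Contrast Mining Module and a Reflection Guided Attention Module for accurate detection, and builds a dataset of 3.3K no-flash and flash image pairs. Experiments show it outperforms existing methods.

Details

Motivation: Glass surfaces are colorless, transparent, and lack distinctive features. Existing methods rely mainly on boundary or reflection cues and do not fully exploit the dynamic reflection properties of the glass itself, which limits detection accuracy.

Result: NFGlassNet outperforms existing methods on glass surface detection.

Insight: Differences in reflection dynamics on glass surfaces are a new detection cue, and fusing flash and no-flash image information can significantly improve detection.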

Abstract: Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, while standing on the darker side and applying a flash towards the brighter side, distinct reflections will appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing features from reflection and glass surface for accurate glass surface detection. To train our network, we also construct a dataset consisting of 3.3K no-flash and flash image pairs captured from various scenes with corresponding ground truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code, model, and dataset will be available upon acceptance of the manuscript.

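[16] R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios cs.CV

Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu

TL;DR: R-AVST is a dataset for fine-grained spatio-temporal reasoning in complex audio-visual scenes. The paper also defines three core reasoning tasks and proposes AVST-Zero, a reinforcement learning-based model, advancing multimodal video understanding.

Details

Motivation: Current research on multimodal large language models (MLLMs) for video focuses on simple scenarios and fails to reflect the complexity of real-world audio-visual events.

Result: R-AVST contains over 5K untrimmed videos and more than 8K high-quality question-answer pairs; AVST-Zero performs competitively with existing models.

Insight: R-AVST provides a benchmark for real-world audio-visual reasoning, and AVST-Zero, which requires no intermediate supervision, offers a new perspective for future work.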

Abstract: Recently, rapid advancements have been made in multimodal large language models (MLLMs), especially in video understanding tasks. However, current research focuses on simple video scenarios, failing to reflect the complex and diverse nature of real-world audio-visual events in videos. To bridge this gap, we first introduce R-AVST, a dataset for audio-visual reasoning featuring fine-grained spatio-temporal annotations. In constructing this, we design a pipeline consisting of LLM-based key object extraction, automatic spatial annotation and manual quality inspection, resulting in over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. Building on this dataset, we define three core tasks for spatio-temporal reasoning in audio-visual scenes and generate more than 8K high-quality, evenly distributed question-answer pairs to effectively benchmark model performance. To further enhance reasoning, we propose AVST-Zero, a reinforcement learning-based model that avoids intermediate supervision, directly optimizing behavior via carefully designed multi-dimensional rewards. Extensive experiments validate the effectiveness of our R-AVST in advancing audio-visual spatio-temporal reasoning, upon which AVST-Zero demonstrates competitive performance compared to existing models. To the best of our knowledge, R-AVST is the first dataset designed for real-world audio-visual spatio-temporal reasoning, and AVST-Zero offers a novel perspective for tackling future challenges in this domain.

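[17] Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models cs.CV

Hao-Chien Hsueh, Chi-En Yen, Wen-Hsiao Peng, Ching-Chun Huang

TL;DR: The paper proposes Warm Diffusion, a Blur-Noise Mixture Diffusion Model that combines noise and blur. By controlling the two jointly, it addresses the limitations of noise-only (hot) diffusion and blur-only (cold) diffusion and improves image generation.

Details

Motivation: Existing diffusion models fall into two main classes: noise-only (hot) diffusion and blur-only (cold) diffusion. The former ignores the correlation between high-frequency and low-frequency image content, so early generation steps behave randomly; the latter neglects the role of noise in shaping the data manifold, which degrades performance. The paper combines the strengths of both in a hybrid approach.

Result: Across several benchmarks, Warm Diffusion generates better results than noise-only or blur-only diffusion models, validating its effectiveness.

Insight: 1. Combining noise and blur makes better use of the correlation between high and low image frequencies; 2. spectral analysis offers a new perspective on diffusion model design; 3. the Blur-to-Noise Ratio (BNR) provides a tool for balancing noise and blur.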

Abstract: Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types. While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms: hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structures, leading to random behaviors in the early steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness) in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths, we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD), to control blurring and noise jointly. Our divide-and-conquer strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes. We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.
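The difference between hot, cold, and mixed degradation can be illustrated on a 1-D signal. This is a toy forward corruption only, under assumed choices (Gaussian blur kernel, clamped borders, independent blur and noise strengths); the paper's actual schedules, its score model, and its definition of the Blur-to-Noise Ratio are not reproduced here, and all function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel; sigma <= 0 means no blur."""
    if sigma <= 0:
        return [1.0]
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve with a Gaussian kernel, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for offset, weight in enumerate(kernel):
            j = min(max(i + offset - radius, 0), n - 1)
            acc += weight * signal[j]
        out.append(acc)
    return out


def degrade(signal, blur_sigma, noise_std, rng):
    """Blur first, then add Gaussian noise.

    blur_sigma = 0 gives noise-only ("hot") corruption,
    noise_std = 0 gives blur-only ("cold") corruption,
    both positive gives a blur-noise mixture.
    """
    return [x + rng.gauss(0.0, noise_std) for x in blur(signal, blur_sigma)]


if __name__ == "__main__":
    rng = random.Random(0)
    step_edge = [0.0] * 8 + [1.0] * 8
    print(degrade(step_edge, blur_sigma=1.5, noise_std=0.0, rng=rng))  # cold
    print(degrade(step_edge, blur_sigma=0.0, noise_std=0.3, rng=rng))  # hot
    print(degrade(step_edge, blur_sigma=1.5, noise_std=0.3, rng=rng))  # mixture
```

In the blur-only case the low-frequency shape of the step survives while the sharp edge is smoothed away, which is the kind of correlation between coarse structure and fine detail that the abstract says noise-only diffusion does not exploit.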

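[18] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content cs.CV

Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma

TL;DR: Q-REAL is a fine-grained evaluation dataset for the realism and plausibility of AI-generated images, containing 3,088 images with annotations, and it provides Q-Real Bench to evaluate the judgment and reasoning abilities of multimodal large language models (MLLMs).

Details

Motivation: Existing quality assessment methods for AI-generated content are too coarse: they provide a single score and cannot give targeted guidance for model optimization. Realism and plausibility are two key dimensions of image generation and need fine-grained evaluation to improve generative performance.

Result: Experiments demonstrate the high quality and significance of the Q-REAL dataset and the comprehensiveness of the benchmark; the fine-tuning framework improves MLLM capabilities.

Insight: Fine-grained evaluation helps improve realism and plausibility in generative models; MLLMs show promise for this kind of evaluation and can be further improved by fine-tuning.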

Abstract: Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.

[19] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation cs.CV

Chi Zhang, Jiepeng Wang, Youming Wang, Yuanzhi Liang, Xiaoyan Yang

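TL;DR: UniModel is a unified generative model that supports both visual understanding and visual generation within a single pixel-to-pixel diffusion framework, unifying the model, the tasks, and the representations.

Details

Motivation: Conventional multimodal learning suffers from modality discrepancies and separated tasks. UniModel aims to address both with a fully visual representation and a unified multi-task framework.

Result: Experiments show that UniModel performs strongly on text-to-image generation and image-to-text understanding, exhibiting cross-modal alignment and controllability (e.g., cycle-consistent image-caption-image loops).

Insight: By unifying model, tasks, and representations in a single visual space, UniModel offers a promising paradigm for general-purpose multimodal intelligence.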

Abstract: We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.

[20] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features cs.CV

Jingyi Xu, Meisong Zheng, Ying Chen, Minglang Qiao, Xin Deng

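TL;DR: This paper proposes DGAF-VSR, a new diffusion-model-based video super-resolution method that performs alignment and compensation in the feature domain. It addresses the problems caused by inaccurate alignment and insufficient compensation in existing methods and markedly improves perceptual quality, fidelity, and temporal consistency.

Details

Motivation: Existing diffusion-based video super-resolution methods achieve excellent perceptual quality but fall short in alignment and compensation, leading to error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity. The paper aims to resolve these issues by improving feature alignment and compensation.

Result: DGAF-VSR performs well on both synthetic and real-world datasets, with gains in perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).

Insight: 1. Alignment and compensation in the feature domain capture spatio-temporal correlations more effectively; 2. Warping at an upscaled resolution better preserves high-frequency information; 3. Dense temporal guidance is crucial for improving video super-resolution.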

Abstract: Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).
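To put the reported 0.20 dB PSNR gain in perspective, the following sketch converts a PSNR difference into the implied change in mean squared error. It assumes the standard single-image definition PSNR = 10·log10(peak² / MSE); the baseline MSE value is hypothetical, and dataset-level PSNR is usually an average over frames, so the conversion is only indicative.

```python
import math


def psnr(mse, peak=1.0):
    """PSNR in dB for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(peak * peak / mse)


def mse_ratio_for_gain(gain_db):
    """MSE_new / MSE_old implied by a PSNR gain of `gain_db` decibels."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(0.20)      # roughly 0.955
baseline_mse = 1e-3                   # hypothetical baseline (30 dB at peak 1.0)
improved_mse = baseline_mse * ratio
gain = psnr(improved_mse) - psnr(baseline_mse)
```

Under these assumptions a 0.20 dB gain corresponds to roughly a 4.5% reduction in MSE.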

[21] Shape-preserving Tooth Segmentation from CBCT Images Using Deep Learning with Semantic and Shape Awareness cs.CV

Zongrui Ji, Zhiming Cui, Na Li, Qianhan Zheng, Miaojing Shi

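TL;DR: This paper proposes a deep learning framework that combines semantic and shape awareness for shape-preserving tooth segmentation in CBCT images, significantly improving segmentation results.

Details

Motivation: Tooth segmentation in CBCT images suffers from shape distortion when adjacent teeth adhere to each other, which harms the accuracy of digital dentistry.

Result: Experiments on internal and external datasets show that the method significantly outperforms existing methods.

Insight: Explicitly modeling shape constraints effectively mitigates shape distortion in medical image segmentation.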

Abstract: Background: Accurate tooth segmentation from cone beam computed tomography (CBCT) images is crucial for digital dentistry but remains challenging in cases of interdental adhesions, which cause severe anatomical shape distortion. Methods: To address this, we propose a deep learning framework that integrates semantic and shape awareness for shape-preserving segmentation. Our method introduces a target-tooth-centroid prompted multi-label learning strategy to model semantic relationships between teeth, reducing shape ambiguity. Additionally, a tooth-shape-aware learning mechanism explicitly enforces morphological constraints to preserve boundary integrity. These components are unified via multi-task learning, jointly optimizing segmentation and shape preservation. Results: Extensive evaluations on internal and external datasets demonstrate that our approach significantly outperforms existing methods. Conclusions: Our approach effectively mitigates shape distortions and provides anatomically faithful tooth boundaries.

[22] OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios cs.CV | cs.AI

Hong Gao, Jingyu Wu, Xiangkai Xu, Kangni Xie, Yunchen Zhang

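TL;DR: OmniGround is a comprehensive spatio-temporal video grounding (STVG) benchmark focused on multi-category, complex-query tasks in complex real-world scenes. With a high-quality annotation pipeline and a systematic evaluation framework, it exposes the shortcomings of existing models. The authors also propose PG-TAF, a training-free two-stage framework that significantly improves performance.

Details

Motivation: STVG in the real world must handle diverse objects and complex queries, yet existing benchmarks are limited in scope, leaving models with category bias, oversimplified reasoning, and poor linguistic robustness.

Result: PG-TAF achieves improvements of 25.6% and 35.6% (in m_tIoU and m_vIoU) on OmniGround, with consistent gains across four benchmarks.

Insight: Model performance drops markedly in complex scenes, especially on small or occluded objects and intricate spatial relations. PG-TAF addresses this by decomposing the task.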

Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. Despite recent advances in Multimodal Large Language Models, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness. To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement annotation pipeline that combines multi-directional tracking with intelligent error correction for high-quality labels. We further introduce DeepSTG, a systematic evaluation framework quantifying dataset quality across four complementary dimensions beyond superficial statistics. Evaluations reveal an average performance drop of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these findings, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.
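The abstract reports m_tIoU and m_vIoU without defining them. The sketch below uses the definitions commonly found in the STVG literature, which is an assumption here: tIoU is the intersection over union of predicted and ground-truth frame sets, and vIoU sums per-frame box IoU over the shared frames and divides by the size of the temporal union. The "m_" prefix then denotes the mean over test samples. Tubes are represented as hypothetical `{frame_index: (x1, y1, x2, y2)}` dictionaries.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def t_iou(gt, pred):
    """Temporal IoU between the frame sets of two tubes."""
    union = set(gt) | set(pred)
    return len(set(gt) & set(pred)) / len(union) if union else 0.0


def v_iou(gt, pred):
    """Per-frame box IoU summed over shared frames, divided by the temporal union."""
    union = set(gt) | set(pred)
    if not union:
        return 0.0
    shared = set(gt) & set(pred)
    return sum(box_iou(gt[t], pred[t]) for t in shared) / len(union)


# Ground truth covers frames 0-3; the prediction covers frames 2-5 with a half-height box.
gt_tube = {t: (0.0, 0.0, 10.0, 10.0) for t in range(0, 4)}
pred_tube = {t: (0.0, 0.0, 10.0, 5.0) for t in range(2, 6)}
```

For this example the tubes share 2 of 6 frames, so tIoU is 1/3, and each shared frame has box IoU 0.5, so vIoU is 1/6.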

[23] MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models cs.CV | cs.CR

Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu

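TL;DR: The paper proposes MultiPriv, the first benchmark to systematically evaluate individual-level privacy reasoning in vision-language models (VLMs). It fills the gap left by existing privacy evaluations, which focus only on attribute perception, and reveals significant privacy-reasoning risks in VLMs.

Details

Motivation: Modern VLMs demonstrate sophisticated reasoning, but privacy risk assessment remains at the level of attribute perception and cannot address the more dangerous threat of individual-level privacy reasoning.

Result: The findings are: (1) VLMs carry significant, unmeasured privacy-reasoning risks; (2) perception metrics fail to predict reasoning risks; (3) existing safety alignment is ineffective against reasoning-based attacks.

Insight: MultiPriv exposes systemic vulnerabilities in VLMs, showing the need for privacy-preserving models and providing a framework for developing them.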

Abstract: Modern Vision-Language Models (VLMs) demonstrate sophisticated reasoning, escalating privacy risks beyond simple attribute perception to individual-level linkage. Current privacy benchmarks are structurally insufficient for this new threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM’s ability to infer and link distributed information to construct individual profiles. To address this critical gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a novel, bilingual multimodal dataset to support it. The dataset uniquely features a core component of synthetic individual profiles where identifiers (e.g., faces, names) are meticulously linked to sensitive attributes. This design enables nine challenging tasks evaluating the full PPR spectrum, from attribute detection to cross-image re-identification and chained inference. We conduct a large-scale evaluation of over 50 foundational and commercial VLMs. Our analysis reveals: (1) Many VLMs possess significant, unmeasured reasoning-based privacy risks. (2) Perception-level metrics are poor predictors of these reasoning risks, revealing a critical evaluation gap. (3) Existing safety alignments are inconsistent and ineffective against such reasoning-based attacks. MultiPriv exposes systemic vulnerabilities and provides the necessary framework for developing robust, privacy-preserving VLMs.

[24] Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction cs.CV

Baoqing Li, Yuanyuan Liu, Congcong Liu, Qingyong Zhu, Jing Cheng

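TL;DR: This paper proposes an implicit neural representation (INR) framework that jointly models dynamic MRI image content and the motion field, coupling the two through the optical flow equation. It recovers temporally coherent images and motion fields simultaneously without pre-estimating the motion field. Experiments demonstrate its advantages on dynamic cardiac MRI.

Details

Motivation: Dynamic MRI reconstruction quality degrades because of limited sampling and motion artifacts. Conventional methods rely on pre-estimated optical flow, which is inaccurate under undersampling. The paper addresses this by jointly modeling image content and the motion field with INRs.

Result: On dynamic cardiac MRI datasets, the method outperforms existing methods in reconstruction quality, motion estimation accuracy, and temporal consistency.

Insight: Implicit joint modeling combined with physical constraints is an effective route to better dynamic MRI reconstruction.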

Abstract: Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.
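The coupling described in the abstract can be written compactly. The following is a sketch under assumed notation, not the authors' exact formulation: $I_\theta(x, y, t)$ denotes the image INR, $(u_\phi, v_\phi)$ the flow INR, $\mathcal{A}$ the undersampled k-space encoding operator, $\mathbf{y}$ the acquired measurements, and $\lambda$ a hypothetical regularization weight.

```latex
% Optical flow (brightness constancy) equation
\frac{\partial I}{\partial t}
  + u\,\frac{\partial I}{\partial x}
  + v\,\frac{\partial I}{\partial y} = 0

% Assumed joint objective: k-space data consistency plus flow regularization
\mathcal{L}(\theta, \phi) =
  \left\| \mathcal{A}\, I_\theta - \mathbf{y} \right\|_2^2
  + \lambda \left\|
      \frac{\partial I_\theta}{\partial t}
      + u_\phi\,\frac{\partial I_\theta}{\partial x}
      + v_\phi\,\frac{\partial I_\theta}{\partial y}
    \right\|_2^2
```

Because both fields are continuous functions of $(x, y, t)$, the partial derivatives in the second term can in principle be obtained by automatic differentiation of the INRs rather than finite differences.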

[25] FingerCap: Fine-grained Finger-level Hand Motion Captioning cs.CV

Xin Shen, Rui Zhu, Lei Shen, Xinyu Wang, Kaihao Zhang

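TL;DR: The paper proposes the FingerCap task and the FingerCap-40K dataset, and augments video multimodal large language models (Video-MLLMs) with FiGOP to generate fine-grained, finger-level descriptions of hand motion.

Details

Motivation: Fine-grained understanding of hand motion is essential for visual perception and multimodal communication. Existing methods struggle to capture high-frequency finger dynamics because of temporal sparsity.

Result: Experiments show that FiGOP significantly improves performance, outperforming existing open- and closed-source Video-MLLMs.

Insight: Temporal sparsity is the bottleneck for finger-level motion understanding, and keypoint information effectively complements RGB features.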

Abstract: Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
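The grouping that FiGOP performs, pairing each RGB keyframe with the hand keypoints of the frames that follow it until the next keyframe, can be sketched as an index-partitioning step. The function name, the fixed `keyframe_stride`, and the dictionary layout are hypothetical; the paper's actual sampling scheme and keypoint format are not specified in the abstract.

```python
def build_figops(num_frames, keyframe_stride):
    """Partition frame indices into (RGB keyframe, following keypoint frames) groups."""
    if keyframe_stride < 1:
        raise ValueError("keyframe_stride must be at least 1")
    groups = []
    for start in range(0, num_frames, keyframe_stride):
        end = min(start + keyframe_stride, num_frames)
        groups.append({
            "rgb_keyframe": start,                           # frame kept as an RGB image
            "keypoint_frames": list(range(start + 1, end)),  # frames kept as keypoints only
        })
    return groups


groups = build_figops(num_frames=10, keyframe_stride=4)
```

With 10 frames and a stride of 4, only frames 0, 4, and 8 are kept as RGB, while the remaining seven frames contribute keypoints, which is how dense motion cues can be retained without raising RGB density.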

[26] Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling cs.CV

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

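TL;DR: The paper proposes a method for point-supervised facial expression spotting (P-FES). Gaussian-based instance-adaptive intensity modeling (GIM) generates soft pseudo-labels, which mitigates the confusion between neutral frames and expression frames of varying intensity, and a two-branch framework handles expression proposal and classification.

Details

Motivation: Conventional methods rely on time-consuming temporal boundary annotations, whereas point-supervised learning needs only a single timestamp per instance, greatly reducing annotation cost. The paper aims to achieve effective expression spotting under point supervision.

Result: Experiments on the SAMM-LV, CAS(ME)^2, and CAS(ME)^3 datasets demonstrate the effectiveness of the proposed framework for expression spotting under point supervision.

Insight: Modeling the instance-level intensity distribution with a Gaussian captures expression dynamics more accurately, while the two-branch design balances the needs of proposal generation and classification.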

Abstract: Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. Furthermore, we introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)$^2$, and CAS(ME)$^3$ datasets demonstrate the effectiveness of our proposed framework.
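The last step of GIM, turning a pseudo-apex frame and an estimated duration into soft pseudo-labels, can be sketched as follows. This assumes an unnormalized Gaussian that peaks at 1.0 on the apex, and it assumes the estimated duration spans about ±3 standard deviations; both choices are illustrative, and the apex detection and duration estimation steps are not shown.

```python
import math


def gaussian_soft_labels(num_frames, apex, duration):
    """Soft intensity label per frame, peaking at 1.0 on the pseudo-apex frame."""
    # Assumption: the instance duration covers roughly 6 standard deviations.
    sigma = max(duration / 6.0, 1e-6)
    return [
        math.exp(-((t - apex) ** 2) / (2.0 * sigma ** 2))
        for t in range(num_frames)
    ]


labels = gaussian_soft_labels(num_frames=11, apex=5, duration=6)
```

Frames near the apex receive labels close to 1, while frames far from it decay smoothly toward 0 instead of being forced into a hard expression/neutral split.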

[27] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models cs.CV | cs.LG | eess.IV

Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang

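TL;DR: This paper proposes Neighbor GRPO, a new alignment algorithm that generates diverse candidate trajectories by perturbing the initial noise conditions of the ODE. It bypasses the need for SDEs entirely and improves training efficiency and generation quality.

Details

Motivation: Existing SDE-based GRPO methods introduce stochasticity, which leads to inefficient credit assignment and incompatibility with high-order solvers. The paper aims to solve these problems while keeping the advantages of deterministic ODE sampling.

Result: Experiments show that Neighbor GRPO significantly outperforms SDE-based GRPO in training cost, convergence speed, and generation quality.

Insight: The paper reveals the contrastive-learning mechanism underlying SDE-based GRPO and, by removing the dependence on SDEs, demonstrates the advantages of deterministic ODE sampling in efficiency and compatibility with high-order solvers.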

Abstract: Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
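One plausible reading of a "softmax distance-based surrogate policy" combined with group-relative advantages is sketched below. This is an illustration of the general idea only, not the authors' formulation: the squared-Euclidean distance, the `temperature` parameter, and all function names are assumptions, and the candidates stand in for outputs obtained from perturbed initial noise.

```python
import math


def group_advantages(rewards):
    """GRPO-style advantages: rewards standardized within the group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


def softmax_distance_policy(sample, neighbors, temperature=1.0):
    """Probability of each neighbor, decreasing with its squared distance to `sample`."""
    logits = [
        -sum((n - s) ** 2 for n, s in zip(neighbor, sample)) / temperature
        for neighbor in neighbors
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surrogate_objective(probs, advantages):
    """Advantage-weighted log-probability, the usual policy-gradient surrogate."""
    return sum(a * math.log(p) for p, a in zip(probs, advantages))


sample = [0.0, 0.0]
neighbors = [[0.1, 0.0], [1.0, 1.0], [-2.0, 0.0]]  # hypothetical candidate outputs
probs = softmax_distance_policy(sample, neighbors)
advantages = group_advantages([1.0, 2.0, 3.0])     # hypothetical rewards
objective = surrogate_objective(probs, advantages)
```

Maximizing such an objective pulls the sample toward high-advantage neighbors and away from low-advantage ones, which matches the contrastive interpretation given in the abstract.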

[28] MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis cs.CV

Di Luo, Shuhui Yang, Mingxin Yang, Jiawei Lu, Yixuan Tang

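TL;DR: MatPedia proposes a unified joint RGB-PBR representation and uses a video diffusion architecture for high-quality multi-task material synthesis, significantly improving generation quality and diversity.

Details

Motivation: Current physically-based rendering (PBR) material synthesis methods lack a unified representation; task-specific pipelines are fragmented and cannot exploit large-scale RGB image data. MatPedia aims to close this gap.

Result: MatPedia achieves high-quality material synthesis at 1024×1024 resolution and substantially surpasses existing methods in quality and diversity.

Insight: By unifying the RGB and PBR representations, MatPedia demonstrates the potential of multi-task learning on large-scale hybrid data and provides a general foundation model for material synthesis.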

Abstract: Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and an inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks (text-to-material generation, image-to-material generation, and intrinsic decomposition) within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native $1024\times1024$ synthesis that substantially surpasses existing approaches in both quality and diversity.

Hsuan Yuan, Shao-Yu Weng, I-Hsuan Lo, Wei-Chen Chiu, Yu-Syuan Xu

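TL;DR: The paper proposes a Dual Branch Degradation Extractor Network that predicts unsupervised degradation embeddings for blur and noise, addressing the unknown degradation model in blind super-resolution and achieving SOTA performance.

Details

Motivation: Conventional super-resolution methods perform well under fixed degradation, but in practice the degradation is often unknown and complex (e.g., blur and noise together), which degrades performance. The paper targets the blind super-resolution problem.

Result: The method performs strongly on multiple blind super-resolution benchmarks, surpassing existing methods and showing strong generalization.

Insight: 1. Blur and noise affect super-resolution differently, so modeling them separately is more effective; 2. Treating the degradation extractor as a regularizer improves robustness to degradation uncertainty.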

Abstract: Previous methods have demonstrated remarkable performance in single image super-resolution (SISR) tasks with known and fixed degradation (e.g., bicubic downsampling). However, when the actual degradation deviates from these assumptions, these methods may experience significant declines in performance. In this paper, we propose a Dual Branch Degradation Extractor Network to address the blind SR problem. While some blind SR methods assume noise-free degradation and others do not explicitly consider the presence of noise in the degradation model, our approach predicts two unsupervised degradation embeddings that represent blurry and noisy information. The SR network can then be adapted to blur embedding and noise embedding in distinct ways. Furthermore, we treat the degradation extractor as a regularizer to capitalize on differences between SR and HR images. Extensive experiments on several benchmarks demonstrate our method achieves SOTA performance in the blind SR problem.

[30] Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices cs.CV | cs.LG

Jigyasa Gupta, Soumya Goyal, Anil Kumar, Ishan Jindal

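TL;DR: The paper proposes a method for synthesizing cooked food images and monitoring cooking progress in real time on edge devices. With a new dataset and an efficient generative model, it improves image realism and the feasibility of edge deployment.

Details

Motivation: Existing image generation methods either produce unrealistic cooked food images or consume too many compute resources for real-time processing on edge devices.

Result: The model clearly outperforms baselines in FID (a 30% improvement on the authors' dataset and 60% on public datasets).

Insight: Domain-specific metrics such as CIS can improve the realism and practicality of models in generative tasks.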

Abstract: Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient recipe- and cooking-state-guided generator that synthesizes realistic food images conditioned on a raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific Culinary Image Similarity (CIS) metric, which serves both as a training loss and a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30% improvement on our dataset; 60% on public datasets).

[31] The Finer the Better: Towards Granular-aware Open-set Domain Generalization cs.CV | cs.AI

Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen

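TL;DR: The paper proposes the Semantic-enhanced CLIP (SeeCLIP) framework, which uses fine-grained semantic enhancement to resolve the dilemma in open-set domain generalization between structural risk on known classes and open-space risk from unknown classes, significantly improving performance.

Details

Motivation: Open-set domain generalization (OSDG) faces the dual challenge of domain shift and novel object categories in practice. Existing methods (e.g., those based on CLIP) tend to be over-confident on "hard unknowns" that are visually similar to known classes.

Result: Across five benchmarks, SeeCLIP improves accuracy by 3% and H-score by 5% over existing methods.

Insight: Fine-grained semantic enhancement is the key; synthesized hard negatives force the model to learn finer decision boundaries.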

Abstract: Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into the dilemma between structural risk of known classes and open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
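The abstract reports H-score without defining it. In the open-set literature it is commonly the harmonic mean of known-class accuracy and unknown-class accuracy, and the sketch below assumes that definition; the paper may compute it differently. The example accuracies are hypothetical.

```python
def h_score(known_acc, unknown_acc):
    """Harmonic mean of known-class and unknown-class accuracy (assumed definition)."""
    if known_acc + unknown_acc == 0:
        return 0.0
    return 2.0 * known_acc * unknown_acc / (known_acc + unknown_acc)


balanced = h_score(0.8, 0.8)       # both accuracies equal
overconfident = h_score(0.9, 0.5)  # strong on known classes, weak at rejecting unknowns
collapsed = h_score(0.9, 0.0)      # never rejects an unknown
```

Because the harmonic mean is dominated by the smaller term, a model that is over-confident on hard unknowns is penalized even when its known-class accuracy is high, which is why the metric suits the trade-off the paper describes.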

[32] DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction cs.CV

Jonathan Skaza, Parsa Madinei, Ziqi Wen, Miguel Eckstein

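TL;DR: The paper proposes DReX, a model that predicts image complexity by fusing self-supervised and convolutional representations without any language information, and performs strongly on multiple benchmarks.

Details

Motivation: Visual complexity prediction is an important problem. Existing methods mostly rely on multimodal models, but it is unclear whether language information is necessary. The paper explores the potential of a vision-only approach.

Result: DReX reaches Pearson r = 0.9581 on the IC9600 benchmark with about 21.5× fewer learnable parameters, and is robust across multiple metrics and datasets.

Insight: Visual features alone are sufficient to predict human-perceived image complexity, and self-supervised transformers and supervised CNNs offer complementary strengths.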

Abstract: Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods–including those trained on multimodal image-text data–while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
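The four evaluation metrics named in the abstract have standard definitions, sketched here in plain Python. The predicted and human complexity scores are made-up illustrative values, not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


predicted = [0.10, 0.40, 0.35, 0.80]  # hypothetical model outputs
human = [0.20, 0.50, 0.30, 0.90]      # hypothetical human complexity ratings
```

In this toy example the ordering of the predictions matches the human ratings exactly, so the Spearman correlation is 1 even though the absolute errors (MAE 0.0875) are nonzero, which is why correlation and error metrics are reported together.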

[33] DepthFocus: Controllable Depth Estimation for See-Through Scenes cs.CV

Junhong Min, Jimin Kim, Cheol-Hui Min, Minwook Kim, Youngpil Jeon

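TL;DR: DepthFocus is a steerable Vision Transformer that recasts stereo depth estimation as an intent-driven control task. It adapts its computation to focus on a target depth, enabling selective perception in complex scenes.

Details

Motivation: Depth in the real world is often multi-layered; transmissive materials in particular create layered ambiguities. Existing passive models cannot adapt to this, whereas humans actively shift focus to perceive a desired depth.

Result: DepthFocus performs strongly on both single-depth and multi-depth datasets, reaching SOTA on benchmarks such as BOOSTER, and generalizes well to unseen see-through scenes.

Insight: Turning depth estimation from a static task into a dynamic, intent-driven one is closer to human perception and suggests a new direction for active and adaptive 3D perception.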

Abstract: Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.

[34] VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions cs.CV

Qianyi Shao, Yuanfan Zhang, Renxiang Xiao, Liang Hu

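TL;DR: The paper proposes MVLR, a unified model that combines a vision-language model (VLM) with an Implicit Memory Bank (IMB) to improve image restoration under adverse weather. The model uses VLM-generated priors together with degradation patterns stored in the IMB, improving restoration accuracy and efficiency.

Details

Motivation: Image restoration under adverse weather is critical for autonomous driving and outdoor robots, but existing methods perform poorly across multiple weather conditions and degradation levels.

Result: On four adverse-weather benchmarks, MVLR outperforms single-branch and Mixture-of-Experts baselines in PSNR and SSIM, balancing efficiency and accuracy.

Insight: Combining a VLM with an IMB offers a new approach to image restoration under adverse weather: priors from the language model strengthen degradation modeling while the model stays lightweight and suitable for real-time use.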

Abstract: Reliable visual perception under adverse weather conditions, such as rain, haze, snow, or a mixture of them, is desirable yet challenging for autonomous driving and outdoor robots. In this paper, we propose a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that restores images from different degradation levels under various weather conditions. MVLR couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB). The VLM performs chain-of-thought inference to encode weather degradation priors and the IMB stores continuous latent representations of degradation patterns. The VLM-generated priors query the IMB to retrieve fine-grained degradation prototypes. These prototypes are then adaptively fused with multi-scale visual features via dynamic cross-attention mechanisms, enhancing restoration accuracy while maintaining computational efficiency. Extensive experiments on four severe-weather benchmarks show that MVLR surpasses single-branch and Mixture-of-Experts baselines in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These results indicate that MVLR offers a practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.

[35] Vision Language Models are Confused Tourists cs.CV | cs.CL

Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto

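TL;DR: The paper introduces ConfusedTourist, a new suite for evaluating the stability of vision-language models (VLMs) under perturbed geographical and cultural cues, revealing systematic failures when multiple cultural cues co-occur.

Details

Motivation: Current evaluations of cultural diversity in VLMs often overlook scenes where multiple cultural cues co-occur, which limits the models' use in real multicultural societies.

Result: Experiments show that VLM accuracy drops sharply, especially under generation-based perturbations, and that attention is systematically drawn toward distracting cues.

Insight: Mixing cultural concepts substantially degrades VLM performance; more culturally robust multimodal understanding is urgently needed.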

Abstract: Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs’ stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.

[36] FLUID: Training-Free Face De-identification via Latent Identity Substitution cs.CV | cs.AI

Jinhyeong Park, Shaheryar Muhammad, Seangmin Lee, Jong Taek Lee, Soon Ki Jung

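TL;DR: FLUID is a training-free face de-identification framework that substitutes identity information directly in the latent space of a pretrained diffusion model. Drawing on substitution mechanisms from chemistry, it performs efficient semantic-displacement editing.

Details

Motivation: Existing face de-identification methods require extensive training or complex optimization. FLUID aims to provide an efficient training-free alternative that keeps other attributes intact.

Result: On CelebA-HQ and FFHQ, FLUID outperforms existing methods in identity suppression and attribute preservation, with better qualitative and quantitative results.

Insight: FLUID's novelty lies in bringing chemical substitution mechanisms into latent-space editing. It demonstrates the expressive power of pretrained models' latent spaces and suggests a new direction for training-free editing tasks.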

Abstract: We present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a training-free framework that directly substitutes identity in the latent space of pretrained diffusion models. Inspired by substitution mechanisms in chemistry, we reinterpret identity editing as semantic displacement in the latent h-space of a pretrained unconditional diffusion model. Our framework discovers identity-editing directions through optimization guided by novel reagent losses, which supervise for attribute preservation and identity suppression. We further propose both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experimental results on CelebA-HQ and FFHQ demonstrate that FLUID achieves a superior trade-off between identity suppression and attribute preservation, outperforming state-of-the-art de-identification methods in both qualitative and quantitative metrics.

[37] Parameter-Free Neural Lens Blur Rendering for High-Fidelity Composites cs.CV | cs.AI | cs.GR | eess.IVPDF

Lingyan Ruan, Bin Chen, Taehyun Rhee

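TL;DR: The paper proposes a neural method that needs no camera parameters: it estimates the circle-of-confusion (CoC) map directly from RGB images for high-quality mixed reality compositing.

Details

Motivation: Existing methods rely on camera parameters and scene depth to render lens blur, but this information is often hard to obtain, which limits their generality and accessibility.

Result: Experiments show that the method outperforms existing techniques in both qualitative and quantitative evaluations, achieving high-fidelity mixed reality compositing.

Insight: RGB images alone contain enough information to infer complex lens blur effects, without relying on conventional camera parameters.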

Abstract: Consistent and natural camera lens blur is important for seamlessly blending 3D virtual objects into photographed real scenes. Since lens blur typically varies with scene depth, the placement of virtual objects and their corresponding blur levels significantly affect the visual fidelity of mixed reality compositions. Existing pipelines often rely on camera parameters (e.g., focal length, focus distance, aperture size) and scene depth to compute the circle of confusion (CoC) for realistic lens blur rendering. However, such information is often unavailable to ordinary users, limiting the accessibility and generalizability of these methods. In this work, we propose a novel compositing approach that directly estimates the CoC map from RGB images, bypassing the need for scene depth or camera metadata. The CoC values for virtual objects are inferred through a linear relationship between the signed CoC map and depth, and realistic lens blur is rendered using a neural reblurring network. Our method provides a flexible and practical solution for real-world applications. Experimental results demonstrate that our method achieves high-fidelity compositing with realistic defocus effects, outperforming state-of-the-art techniques in both qualitative and quantitative evaluations.
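
As background on the circle of confusion mentioned above, here is a minimal sketch of the textbook thin-lens formula. It is not the paper's estimator, which predicts the CoC map from RGB without these parameters. The function name `signed_coc` and the sample lens values are illustrative assumptions. Under this model the signed CoC is affine in inverse depth, which may be the kind of linear relationship the abstract refers to.

```python
def signed_coc(depth, focus_dist, focal_len, aperture):
    """Signed circle-of-confusion diameter under the thin-lens model.

    All lengths share one unit. The result is negative in front of the
    focus plane, zero on it, and positive behind it.
    """
    scale = aperture * focal_len / (focus_dist - focal_len)
    return scale * (1.0 - focus_dist / depth)


# Assumed example lens: 50 mm focal length, 25 mm aperture, focused at 2 m.
FOCAL_LEN, APERTURE, FOCUS_DIST = 0.05, 0.025, 2.0

# Depths chosen so that inverse depth is evenly spaced (0.25, 0.5, 0.75 per metre).
depths = [4.0, 2.0, 4.0 / 3.0]
cocs = [signed_coc(d, FOCUS_DIST, FOCAL_LEN, APERTURE) for d in depths]
```

Because the CoC values in `cocs` are evenly spaced when inverse depth is evenly spaced, two known (inverse depth, CoC) pairs would be enough to fit the line and read off the blur for a virtual object at any other depth.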

[38] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis cs.CV | cs.AI | cs.MMPDF

Linfeng Dong, Yuchen Yang, Hao Wu, Wei Wang, Yuenan Hou, Zhihang Zhong

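TL;DR: RacketVision is a multi-racket-sport dataset and benchmark that provides large-scale, fine-grained racket pose and ball position annotations for table tennis, tennis, and badminton, supporting research on complex human-object interaction.

Details

Motivation: Existing datasets lack fine-grained racket pose annotations, which limits research on human-object interaction in sports analysis. RacketVision fills this gap.

Result: A CrossAttention mechanism yields trajectory prediction results that surpass strong unimodal baselines.

Insight: Naively concatenating racket pose features degrades performance; cross-attention is the key to multimodal fusion.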

Abstract: We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision

[39] PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning cs.CVPDF

Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang

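TL;DR: PathAgent is an agent framework based on a large language model (LLM) that emulates a pathologist's stepwise reasoning to analyze whole-slide images transparently and interpretably.

Details

Motivation: Existing computational pipelines lack an explicit reasoning trajectory, which makes their predictions opaque and hard to justify. PathAgent aims to close this gap by emulating the stepwise analysis of human experts.

Result: PathAgent shows strong zero-shot generalization on five datasets, surpassing task-specific baselines.

Insight: PathAgent demonstrates the potential of LLMs to provide transparent assistance in clinical diagnosis.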

Abstract: Analyzing whole-slide images (WSIs) requires an iterative, evidence-driven reasoning process that parallels how pathologists dynamically zoom, refocus, and self-correct while collecting evidence. However, existing computational pipelines often lack this explicit reasoning trajectory, resulting in inherently opaque and unjustifiable predictions. To bridge this gap, we present PathAgent, a training-free, large language model (LLM)-based agent framework that emulates the reflective, stepwise analytical approach of human experts. PathAgent can autonomously explore a WSI, iteratively and precisely locating significant micro-regions using the Navigator module, extracting morphological visual cues using the Perceptor, and integrating these findings into continuously evolving natural language trajectories in the Executor. The entire sequence of observations and decisions forms an explicit chain-of-thought, yielding fully interpretable predictions. Evaluated across five challenging datasets, PathAgent exhibits strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question-answering tasks. Moreover, a collaborative evaluation with human pathologists confirms PathAgent’s promise as a transparent and clinically grounded diagnostic assistant.

[40] OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding cs.CV | cs.AIPDF

Teng Fu, Mengyang Zhao, Ke Niu, Kaixin Peng, Bin Li

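TL;DR: The paper proposes OmniPT, a framework that combines a large vision-language model (LVLM) with a reinforcement-learning training strategy to perform pedestrian tracking with high-level semantic understanding. It performs strongly on tracking benchmarks.

Details

Motivation: Although LVLMs excel at image-level tasks (such as visual question answering and image captioning), they still lag on instance-level tasks (such as visual grounding and object detection). Pedestrian tracking tasks emphasize high-level semantic understanding, which is exactly where LVLMs excel.

Result: OmniPT outperforms prior methods on multiple tracking benchmarks.

Insight: Combining an LVLM with reinforcement learning lets high-level semantic understanding carry over to instance-level tasks such as pedestrian tracking, which suggests an approach for similar tasks.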

Abstract: Large vision-language models (LVLMs) have been shown to perform excellently in image-level tasks such as VQA and captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference, and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task as a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training pipeline consisting of RL, mid-training, SFT, and RL phases. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output a fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model’s tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks, and the results demonstrate that the proposed method performs better than previous methods.

[41] RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion cs.CVPDF

Bhanu Pratap Paregi, Vaibhav Kumar

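TL;DR: RL-AD-Net uses reinforcement learning to adjust point cloud completions in latent space. An RL agent and a lightweight PointNN selector improve geometric consistency, and the method applies to many completion networks without retraining.

Details

Motivation: Existing point cloud completion methods do well on global shape but fall short on local geometric consistency, which calls for a lightweight and general refinement method.

Result: Experiments on ShapeNetCore-2048 show that RL-AD-Net consistently improves completion quality under both random cropping and training-style cropping.

Insight: The dynamic, unsupervised nature of reinforcement learning suits it to post-hoc refinement of point cloud completions, and the lightweight design makes it broadly applicable.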

Abstract: Recent point cloud completion models, including transformer-based, denoising-based, and other state-of-the-art approaches, generate globally plausible shapes from partial inputs but often leave local geometric inconsistencies. We propose RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder. The autoencoder encodes completions into compact global feature vectors (GFVs), which are selectively adjusted by an RL agent to improve geometric fidelity. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output, retaining the better reconstruction. When ground truth is available, both Chamfer Distance and geometric consistency metrics guide refinement. Training is performed separately per category, since the unsupervised and dynamic nature of RL makes convergence across highly diverse categories challenging. Nevertheless, the framework can be extended to multi-category refinement in future work. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonably well under their training-style cropping, they struggle in random cropping scenarios. In contrast, RL-AD-Net consistently delivers improvements across both settings, highlighting the effectiveness of RL-guided ensemble refinement. The approach is lightweight, modular, and model-agnostic, making it applicable to a wide range of completion networks without requiring retraining.
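
The "retain the better reconstruction" step above can be illustrated with a minimal sketch. It uses Chamfer Distance to a ground-truth cloud as the score, which corresponds only to the case where ground truth is available. The paper's PointNN selector scores geometric consistency without ground truth and is not reproduced here. The Chamfer variant shown (mean squared nearest-neighbour distance in each direction, summed) is one common convention and an assumption, as are the names `chamfer_distance` and `select_better`.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty lists of points."""
    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(
            min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
            for s in src
        ) / len(src)
    return one_way(a, b) + one_way(b, a)


def select_better(original, refined, score):
    """Return whichever completion has the lower score (lower is better)."""
    return refined if score(refined) < score(original) else original


# Toy data: the refined cloud sits closer to the ground truth than the original.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
original = [(0.0, 0.0, 0.3), (1.0, 0.0, 0.3), (0.0, 1.0, 0.3)]
refined = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (0.0, 1.0, 0.1)]

kept = select_better(original, refined, lambda pc: chamfer_distance(pc, ground_truth))
```

Keeping the original whenever refinement does not lower the score means the selector can never make a completion worse under the chosen score, which is the robustness property the abstract attributes to the selection step.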

[42] Spanning Tree Autoregressive Visual Generation cs.CV | cs.AIPDF

Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You

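TL;DR: STAR is a spanning-tree-based image modeling approach that incorporates visual priors (such as center bias and locality). It maintains sampling performance while offering flexible sequence orders that support image editing at inference time.

Details

Motivation: Exposing conventional autoregressive models to randomly permuted sequence orders in image generation either degrades performance or limits flexibility in choosing the sequence order. STAR aims to resolve this by incorporating image priors.

Result: STAR maintains sampling performance while supporting a flexible choice of sequence order, which makes it suitable for image editing tasks.

Insight: By combining a structured randomization strategy with visual priors, STAR shows that sequence order can be optimized without changing the model architecture.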

Abstract: We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation for bidirectional context either suffer from a decline in performance or compromise the flexibility in sequence order choice at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled in a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, allowing us to efficiently construct a spanning tree whose traversal order ensures that the connected partial observation of the image appears as a prefix in the sequence through rejection sampling. Through the tailored yet structured randomized strategy compared to random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance without any significant changes to the model architecture widely adopted in the language AR modeling.
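
The idea of ordering patches by traversing a random spanning tree of the patch lattice can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how the uniform spanning trees are sampled, so Wilson's algorithm is used here as one standard choice, the centre patch is taken as the BFS root as a nod to center bias, and the rejection-sampling step for editing prefixes is omitted. All function names are hypothetical.

```python
import random
from collections import deque


def neighbors(cell, h, w):
    """Yield the 4-connected lattice neighbours of a patch position."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            yield (nr, nc)


def sample_uniform_spanning_tree(h, w, rng):
    """Wilson's algorithm: loop-erased random walks give a uniform spanning tree."""
    cells = [(r, c) for r in range(h) for c in range(w)]
    in_tree = {cells[0]}
    adj = {cell: [] for cell in cells}
    for start in cells:
        if start in in_tree:
            continue
        # Walk until the tree is hit; overwriting next_step erases loops.
        next_step, cur = {}, start
        while cur not in in_tree:
            next_step[cur] = rng.choice(list(neighbors(cur, h, w)))
            cur = next_step[cur]
        # Attach the loop-erased path to the tree.
        cur = start
        while cur not in in_tree:
            nxt = next_step[cur]
            adj[cur].append(nxt)
            adj[nxt].append(cur)
            in_tree.add(cur)
            cur = nxt
    return adj


def bfs_order(adj, root):
    """Breadth-first traversal order of the tree starting from root."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        for nb in adj[cell]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order


tree = sample_uniform_spanning_tree(4, 4, random.Random(0))
order = bfs_order(tree, (2, 2))
```

Because every patch after the root enters the order only after its tree parent, each prefix of `order` is a connected region of the lattice, which is the locality property a purely random permutation lacks.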

[43] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting cs.CVPDF

Di Wu, Liu Liu, Xueyu Yuan, Qiaoyu Jun, Wenxiao Chen

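TL;DR: SPAGS is a method based on planar Gaussian splatting for reconstructing articulated objects from sparse views of a single state. It achieves high-fidelity reconstruction from only a few RGB images.

Details

Motivation: Existing reconstruction methods typically require multi-stage, multi-view inputs, which are costly. SPAGS aims to reduce input requirements while improving reconstruction quality.

Result: SPAGS achieves higher-fidelity part-level surface reconstruction than existing methods on both synthetic and real-world data.

Insight: Planar Gaussian splatting enables efficient reconstruction from sparse views, and incorporating segmentation information further improves part-level reconstruction accuracy.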

Abstract: Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address these limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. Then we compress 3D Gaussians into planar Gaussians to facilitate accurate estimation of normal and depth. The planar Gaussians are optimized in a coarse-to-fine manner through depth smooth regularization and few-shot diffusion. Moreover, we introduce a part segmentation probability for each Gaussian primitive and update these probabilities by back-projecting part segmentation masks of renderings. Extensive experimental results demonstrate that our method achieves higher-fidelity part-level surface reconstruction on both synthetic and real-world data than existing methods. Code will be made publicly available.

[44] Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models cs.CVPDF

He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan

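TL;DR: The paper proposes ReCoVAD, a framework that uses sparse reasoning to cut computational cost, combining a fast Reflex pathway with a fine-grained Conscious pathway for efficient video anomaly detection.

Details

Motivation: Existing video anomaly detection methods built on pre-trained models usually rely on dense frame processing, which is computationally expensive. Inspired by the human nervous system, the authors ask whether sparse reasoning is enough.

Result: ReCoVAD achieves state-of-the-art training-free performance on UCF-Crime and XD-Violence while processing only 28.55% and 16.04%, respectively, of the frames used by previous methods.

Insight: Sparse reasoning is sufficient for large-model-based video anomaly detection, and the dual-pathway design balances computational efficiency and detection accuracy.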

Abstract: Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.

[45] ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better cs.CVPDF

Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng

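TL;DR: ChainV dynamically integrates visual hints into the multimodal reasoning process, significantly improving reasoning accuracy and efficiency, especially on tasks that require multi-step symbolic reasoning.

Details

Motivation: Existing multimodal reasoning models show redundant self-reflection when generating long reasoning chains, and conventional CoT compression methods yield limited gains on multimodal tasks because they rely on static visual references.

Result: On MathVista (within MIMO-VL-RL), ChainV improves reasoning accuracy by 2.3% while reducing inference latency by 51.4% and shortening output token length by 24.5%.

Insight: Dynamically introduced visual hints and reliability assessment are key to efficient multimodal reasoning, and the Bernoulli stochastic process further strengthens the robustness of reasoning.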

Abstract: Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a $2.3\%$ improvement on MathVista within MIMO-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.

[46] PEGS: Physics-Event Enhanced Large Spatiotemporal Motion Reconstruction via 3D Gaussian Splatting cs.CVPDF

Yijun Xu, Jingrui Zhang, Hongyi Liu, Yuhan Chen, Yuanyang Wang

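TL;DR: PEGS integrates physical priors and event-stream enhancement into 3D Gaussian splatting to reconstruct motion over large spatiotemporal scales. It proposes a triple-level supervision scheme and a motion-aware annealing strategy, and it outperforms mainstream dynamic methods in experiments.

Details

Motivation: Reconstructing rigid motion over large spatiotemporal scales is challenged by limited modeling paradigms, severe motion blur, and insufficient physical consistency. PEGS aims to address these problems.

Result: PEGS outperforms mainstream dynamic methods at reconstructing motion over large spatiotemporal scales.

Insight: Physical priors and event-stream enhancement markedly improve the quality and consistency of motion reconstruction.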

Abstract: Reconstruction of rigid motion over large spatiotemporal scales remains a challenging task due to limitations in modeling paradigms, severe motion blur, and insufficient physical consistency. In this work, we propose PEGS, a framework that integrates Physical priors with Event stream enhancement within a 3D Gaussian Splatting pipeline to perform deblurred target-focused modeling and motion recovery. We introduce a cohesive triple-level supervision scheme that enforces physical plausibility via an acceleration constraint, leverages event streams for high-temporal resolution guidance, and employs a Kalman regularizer to fuse multi-source observations. Furthermore, we design a motion-aware simulated annealing strategy that adaptively schedules the training process based on real-time kinematic states. We also contribute the first RGB-Event paired dataset targeting natural, fast rigid motion across diverse scenarios. Experiments show PEGS’s superior performance in reconstructing motion over large spatiotemporal scales compared to mainstream dynamic methods.
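
The abstract does not give the form of the acceleration constraint, so the following is only one plausible reading, sketched for illustration: estimate acceleration from a recovered trajectory by second-order finite differences and penalize its gap from an expected acceleration vector. The function name `acceleration_penalty`, the fixed time step, and the choice of gravity as the target are all assumptions.

```python
def acceleration_penalty(positions, dt, target_accel):
    """Mean squared gap between finite-difference acceleration and a target.

    positions: list of at least three (x, y, z) samples spaced dt apart.
    target_accel: expected (ax, ay, az), e.g. gravity for a free-falling body.
    """
    total, count = 0.0, 0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        for p0, p1, p2, a in zip(prev, cur, nxt, target_accel):
            # Central second difference approximates acceleration at `cur`.
            accel = (p2 - 2.0 * p1 + p0) / (dt * dt)
            total += (accel - a) ** 2
            count += 1
    return total / count


DT = 0.1
GRAVITY = (0.0, 0.0, -9.8)

# A ballistic arc obeys the target acceleration; a straight line does not.
times = [i * DT for i in range(10)]
ballistic = [(1.0 * t, 0.0, 2.0 * t - 4.9 * t * t) for t in times]
straight = [(1.0 * t, 0.0, 0.0) for t in times]

ballistic_penalty = acceleration_penalty(ballistic, DT, GRAVITY)
straight_penalty = acceleration_penalty(straight, DT, GRAVITY)
```

In this toy setup the ballistic arc incurs essentially no penalty while the straight line is penalized heavily, which is how such a term would steer a reconstruction toward physically plausible motion.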

[47] Planning with Sketch-Guided Verification for Physics-Aware Video Generation cs.CV | cs.AI | cs.CLPDF

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei

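TL;DR: The paper proposes SketchVerify, a framework that uses sketch-based verification to improve motion planning for video generation. It increases physical plausibility and motion consistency while avoiding expensive repeated computation.

Details

Motivation: Existing video generation methods mostly rely on simple single-shot planning or costly iterative refinement, and they struggle to produce dynamically coherent, physically plausible motion trajectories. SketchVerify aims to overcome these limitations through sketch verification and efficient planning.

Result: On WorldModelBench and PhyWorldBench, SketchVerify significantly improves motion quality, physical realism, and long-term consistency while being more computationally efficient.

Insight: Scaling up the number of candidate trajectories consistently improves performance, and the efficient verifier design is key.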

Abstract: Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
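
The test-time sampling and verification loop described above can be sketched in a few lines. This is a toy illustration, not the paper's system: in the paper the candidates are motion plans rendered as video sketches and the scorer is a vision-language verifier, whereas here `propose_fall` and `verify_fall` are hypothetical stand-ins that guess and check the acceleration of a dropped object.

```python
import random


def plan_with_verification(propose, verify, num_candidates, threshold, max_rounds, rng):
    """Sample candidate plans, score each one, and stop once the best clears the threshold."""
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for _ in range(num_candidates):
            plan = propose(rng)
            score = verify(plan)
            if score > best_score:
                best_plan, best_score = plan, score
        if best_score >= threshold:
            break
    return best_plan, best_score


def propose_fall(rng):
    """Hypothetical planner: heights of a dropped object under a guessed acceleration."""
    accel = rng.uniform(-20.0, 5.0)
    return [10.0 + 0.5 * accel * (0.1 * i) ** 2 for i in range(5)]


def verify_fall(heights, dt=0.1, gravity=-9.8):
    """Hypothetical verifier: scores closer to zero mean more physically plausible."""
    accel = (heights[2] - 2.0 * heights[1] + heights[0]) / (dt * dt)
    return -abs(accel - gravity)


# Same seed, one round each, so the first 4 candidates are shared by both runs.
few_plan, few_score = plan_with_verification(
    propose_fall, verify_fall, 4, float("inf"), 1, random.Random(0))
many_plan, many_score = plan_with_verification(
    propose_fall, verify_fall, 32, float("inf"), 1, random.Random(0))
```

With a shared seed the larger candidate pool contains the smaller one, so its best score can only match or improve. This mirrors, in toy form, the ablation finding that more trajectory candidates help, and it shows why a cheap verifier matters: the loop calls it once per candidate.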

[48] A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs cs.CVPDF

Jiaxun Fang, Li Chen

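TL;DR: The paper proposes a multi-stage optimization framework for efficiently deploying learned image compression (LIC) models on FPGAs. It addresses quantization-induced performance degradation and uses hardware-aware optimization to substantially improve computational efficiency.

Details

Motivation: LIC performs very well with floating-point models, but deployment on resource-constrained FPGAs faces quantization-induced degradation and high computational complexity. The paper aims to bridge the gap between high-performance models and efficient hardware implementations.

Result: DRAQ substantially improves the 8-bit model, cutting the BD-rate overhead from 30% to 6.3%. The hardware-aware optimizations reduce computational complexity by over 20% with negligible impact on RD performance.

Insight: The paper shows how quantization, mixed precision, and pruning can deliver efficient hardware deployment while preserving model performance, providing a systematic framework for the challenges of LIC on FPGAs.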

Abstract: Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations. First, we address the fundamental problem of quantization-induced performance degradation. We propose a Dynamic Range-Aware Quantization (DRAQ) method that uses statistically-calibrated activation clipping and a novel weight regularization scheme to counteract the effects of extreme data outliers and large dynamic ranges, successfully creating a high-fidelity 8-bit integer model. Second, building on this robust foundation, we introduce two hardware-aware optimization techniques tailored for FPGAs. A progressive mixed-precision search algorithm exploits FPGA flexibility to assign optimal, non-uniform bit-widths to each layer, minimizing complexity while preserving performance. Concurrently, a channel pruning method, adapted to work with the Generalized Divisive Normalization (GDN) layers common in LIC, removes model redundancy by eliminating inactive channels. Our comprehensive experiments show that the foundational DRAQ method reduces the BD-rate overhead of a GDN-based model from $30\%$ to $6.3\%$. The subsequent hardware-aware optimizations further reduce computational complexity by over $20\%$ with negligible impact on RD performance, yielding a final model that is both state-of-the-art in efficiency and superior in quality to existing FPGA-based LIC implementations.
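
To illustrate why calibrated activation clipping helps 8-bit quantization when outliers are present, here is a minimal sketch. It is not DRAQ itself: the abstract says only that clipping is "statistically-calibrated", so the percentile rule below is an assumed stand-in, the weight regularization scheme is omitted, and the function names and toy activations are hypothetical.

```python
import math


def clip_threshold(values, pct):
    """Nearest-rank pct-th percentile of |values|, used as the clipping range."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct * len(mags) / 100))
    return mags[rank - 1]


def quantize_int8(values, clip):
    """Symmetric uniform quantization to integers in [-127, 127]."""
    scale = clip / 127.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def mean_abs_error(values, clip):
    """Average reconstruction error after quantizing and dequantizing."""
    q, scale = quantize_int8(values, clip)
    return sum(abs(v - x * scale) for v, x in zip(values, q)) / len(values)


# Toy activations: 99 well-behaved values in [-0.98, 0.98] plus one extreme outlier.
inliers = [(i - 49) / 50 for i in range(99)]
activations = inliers + [50.0]

naive_clip = max(abs(v) for v in activations)      # range set by the outlier
calibrated_clip = clip_threshold(activations, 99)  # range set by the bulk of the data

naive_error = mean_abs_error(inliers, naive_clip)
calibrated_error = mean_abs_error(inliers, calibrated_clip)
```

Letting the single outlier set the range spends almost all 255 levels on values that never occur, so typical activations collapse onto a handful of levels. Clipping at the calibrated threshold saturates the outlier but reconstructs the bulk of the data far more accurately.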

[49] Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation cs.CVPDF

Luc Bouteille, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen, Lukas Heine

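TL;DR: The paper proposes CC-DiceCE, a new instance-wise loss function that improves medical image segmentation of small lesions, especially detection (recall), though it may add a few false positives.

Details

Motivation: Conventional loss functions such as Dice perform poorly on small lesions because small-volume lesions contribute little to the overall loss, which leads to missed detections.

Result: CC-DiceCE achieves higher detection than blob loss and the conventional DiceCE baseline with no clear drop in segmentation performance, but with slightly more false positives.

Insight: Instance-wise loss functions matter for small-lesion segmentation, and CC-DiceCE strikes a good balance between recall and false positives.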

Abstract: Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, albeit at the cost of slightly more false positives. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.
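
The under-weighting of small lesions described above can be made concrete with a toy 1D example. This is a metric-style sketch, not the differentiable CC-DiceCE loss: it splits the image into regions by nearest ground-truth component, which is a simplified stand-in for per-lesion evaluation, and all function names are hypothetical.

```python
def components_1d(mask):
    """Return (start, end) pairs, end exclusive, for each run of 1s."""
    runs, start = [], None
    for i, v in enumerate(mask):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(mask)))
    return runs


def dice(pred, gt):
    """Dice overlap between two binary masks of equal length."""
    inter = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0


def per_component_dice(pred, gt):
    """Mean Dice over regions, one region per ground-truth lesion (needs >= 1 lesion)."""
    runs = components_1d(gt)

    def dist(i, run):
        s, e = run
        return 0 if s <= i < e else (s - i if i < s else i - e + 1)

    # Assign every index to its nearest ground-truth component.
    owner = [min(range(len(runs)), key=lambda k: dist(i, runs[k])) for i in range(len(gt))]
    scores = []
    for k in range(len(runs)):
        idx = [i for i in range(len(gt)) if owner[i] == k]
        scores.append(dice([pred[i] for i in idx], [gt[i] for i in idx]))
    return sum(scores) / len(scores)


# One large lesion (20 voxels) and one small lesion (2 voxels); the small one is missed.
gt = [1] * 20 + [0] * 10 + [1] * 2 + [0] * 8
pred = [1] * 20 + [0] * 20

global_score = dice(pred, gt)
instance_score = per_component_dice(pred, gt)
```

Here the global Dice stays above 0.95 even though one of the two lesions is missed entirely, while the per-lesion average falls to 0.5. A loss built on the per-lesion view therefore gives the small lesion the same weight as the large one.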

[50] A lightweight detector for real-time detection of remote sensing images cs.CV | cs.AIPDF

Qianyi Wang, Guoqiang Ren

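TL;DR: DMG-YOLO is a lightweight real-time detector targeting small objects in remote sensing images. It uses dual-branch feature extraction and multi-scale feature fusion modules for efficient and accurate detection.

Details

Motivation: Remote sensing images contain many small objects, and detection must balance accuracy with efficiency. Existing methods struggle to meet real-time requirements, which motivates DMG-YOLO.

Result: On the VisDrone2019 and NWPU VHR-10 datasets, DMG-YOLO achieves competitive results in mAP, model size, and other metrics.

Insight: Combining the strengths of convolutions and transformers captures both local features and global context, which suits small object detection in remote sensing images.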

Abstract: Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.

[51] UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network cs.CVPDF

Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Ching-Chun Huang

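TL;DR: UI-Styler is a style transfer framework for ultrasound images that uses class-aware prompts and a texture-matching mechanism to address cross-device domain shift, significantly improving the reuse of black-box inference networks.

Details

Motivation: Ultrasound image appearance differs across acquisition devices, causing domain shift that limits the reuse of fixed black-box inference models. Existing unpaired image translation methods lack class-level semantic alignment, which hurts diagnostic accuracy.

Result: On cross-device tasks, UI-Styler consistently outperforms existing methods in distribution distance and on downstream tasks (classification and segmentation).

Insight: Combining class-aware prompts with texture matching effectively addresses domain shift in ultrasound images while respecting the constraints of a black-box inference model.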

Abstract: The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused inference-blackbox setting. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.

[52] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle cs.CV | cs.LGPDF

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel

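TL;DR: FireScope is a reasoning-to-generation framework based on a vision-language model (VLM) for predicting continuous wildfire risk maps. By combining visual, climatic, and geographic data, it improves cross-continental generalization and interpretability.

Details

Motivation: Existing wildfire risk prediction methods lack causal reasoning and multimodal understanding, so they generalize unreliably. FireScope addresses this with a large-scale dataset and a reasoning framework.

Result: When trained on the USA and tested on Europe, FireScope shows substantial performance gains, and experts judge its reasoning traces to be faithful and semantically meaningful.

Insight: Language-based reasoning can improve generalization in visual generation tasks while making models more interpretable, offering a new approach to spatial modeling.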

Abstract: Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

[53] Investigating self-supervised representations for audio-visual deepfake detection cs.CV | cs.LG | cs.SDPDF

Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata

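TL;DR: The paper examines the potential of self-supervised representations for audio-visual deepfake detection. It finds that they do well in detection effectiveness, interpretability of encoded information, and cross-modal complementarity, but generalize poorly across datasets.

Details

Motivation: Self-supervised representations excel at vision and speech tasks, but their use in audio-visual deepfake detection has not been studied thoroughly. The paper aims to fill this gap.

Result: Most self-supervised features capture deepfake-relevant information, and this information is highly complementary, but generalization across datasets is poor.

Insight: Self-supervised features learn meaningful patterns, but because of dataset characteristics, cross-domain robustness remains a challenge.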

Abstract: Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.

[54] Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition cs.CV | cs.CYPDF

Aditya Mishra, Akshay Agarwal, Haroon Lone

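TL;DR: The paper proposes LENS-Net, a multimodal framework for nighttime traffic sign recognition, and INTSD, a large-scale nighttime traffic sign dataset. LENS-Net combines an adaptive image enhancement detector with a multimodal classifier and significantly improves recognition performance at night.

Details

Motivation: Existing traffic sign recognition methods perform poorly at night, mainly because of visual noise under low illumination and the lack of public large-scale nighttime datasets.

Result: LENS-Net surpasses existing methods on the INTSD dataset, and ablation studies confirm the effectiveness of its key modules.

Insight: Multimodal fusion and graph-based reasoning help improve semantic understanding and recognition consistency under low illumination.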

Abstract: Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains challenging due to visual noise and scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary multimodal cues effectively. To overcome these limitations, firstly, we introduce INTSD, a large-scale dataset comprising street-level night-time images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for night-time sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Secondly, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net are publicly available for research.

[55] PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention cs.CVPDF

Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang

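TL;DR: PostCam proposes a camera-controllable video generation framework that fuses 6-DoF camera poses and 2D rendered video frames through a query-shared cross-attention module, enabling more precise and flexible editing of camera trajectories.

Details

Motivation: Existing video recapture methods use suboptimal camera motion injection strategies, so the generated videos fall short in camera control precision and in preserving visual detail. PostCam aims to address this.

Result: On real-world and synthetic datasets, PostCam outperforms existing methods by over 20% in camera control precision and view consistency, and achieves the highest video generation quality.

Insight: Fusing multimodal control signals (pose and visual) can markedly improve control precision and visual quality in video generation; the two-stage training strategy effectively balances control learning and visual fidelity.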

Abstract: We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods suffer from suboptimal camera motion injection strategies; such suboptimal designs not only limit camera control precision but also result in generated videos that fail to preserve fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module. It integrates two distinct forms of control signals: the 6-DoF camera poses and the 2D rendered video frames. By fusing them into a unified representation within a shared feature space, our model can extract underlying motion cues, which enhances both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/

[56] VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation cs.CVPDF

Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

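TL;DR: VLA-4D augments vision-language-action models with 4D (spatial plus temporal) awareness to achieve spatiotemporally coherent robotic manipulation, addressing the weak temporal coherence of earlier methods.

Details

Motivation: Existing vision-language-action (VLA) models lack embedded temporal information, so they struggle to produce coherent action control in manipulation tasks that require spatiotemporal consistency.

Result: Experiments verify the superiority of VLA-4D on robotic manipulation tasks, achieving spatiotemporally coherent control.

Insight: Explicitly embedding temporal information is key to spatiotemporal coherence in robotic manipulation, and cross-modal fusion and alignment are essential for such tasks.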

Abstract: Vision-language-action (VLA) models show potential for general robotic tasks, but remain challenging in spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable the spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially-smooth and temporally-coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments have been conducted to verify the superiority of our method across different tasks of robotic manipulation.
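As a rough illustration of the "4D embedding" idea in the abstract above (a 1D time coordinate attached to a 3D position), the sketch below encodes an (x, y, z, t) tuple with fixed sinusoidal features. The function name, the number of frequency bands, and the choice of sinusoids are assumptions made for illustration; the abstract does not say how VLA-4D actually builds its embeddings.

```python
import math

def sinusoidal_4d_embedding(x, y, z, t, num_bands=4):
    """Encode a 3D position plus a timestamp as one flat feature vector.

    Each of the four coordinates is expanded into sin/cos pairs at
    geometrically spaced frequencies (1, 2, 4, ...). This is a generic
    positional encoding, not the embedding used in the paper.
    """
    features = []
    for coord in (x, y, z, t):
        for band in range(num_bands):
            freq = 2.0 ** band
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features

# Hypothetical sample: a point in space observed at time t = 2.0
emb = sinusoidal_4d_embedding(0.1, -0.3, 0.5, t=2.0)
print(len(emb))  # 4 coordinates * 4 bands * 2 functions = 32 values
```

Two observations of the same position at different times get different vectors, which is the property a downstream cross-attention layer would need in order to reason about temporal order.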

[57] SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors cs.CV | cs.ROPDF

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab

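TL;DR: SING3R-SLAM is a submap-based indoor monocular Gaussian SLAM framework that refines scene geometry and camera poses through a global Gaussian representation, addressing drift and redundancy in SLAM.

Details

Motivation: When integrated into SLAM, existing dense 3D reconstruction methods suffer from drift and redundant point maps, which limit efficiency and downstream tasks such as novel view synthesis.

Result: Experiments show that SING3R-SLAM achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering, improving tracking by over 12% while remaining memory-efficient.

Insight: A global Gaussian representation not only improves geometric consistency but also makes local tracking more robust, providing an efficient solution for multiple downstream tasks.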

Abstract: Recent advances in dense 3D reconstruction enable the accurate capture of local geometry; however, integrating them into SLAM is challenging due to drift and redundant point maps, which limit efficiency and downstream tasks, such as novel view synthesis. To address these issues, we propose SING3R-SLAM, a globally consistent and compact Gaussian-based dense RGB SLAM framework. The key idea is to combine locally consistent 3D reconstructions with a unified global Gaussian representation that jointly refines scene geometry and camera poses, enabling efficient and versatile 3D mapping for multiple downstream applications. SING3R-SLAM first builds locally consistent submaps through our lightweight tracking and reconstruction module, and then progressively aligns and fuses them into a global Gaussian map that enforces cross-view geometric consistency. This global map, in turn, provides feedback to correct local drift and enhance the robustness of tracking. Extensive experiments demonstrate that SING3R-SLAM achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering, resulting in over 12% improvement in tracking and producing finer, more detailed geometry, all while maintaining a compact and memory-efficient global representation on real-world datasets.

Cris Claessens, Christiaan Viviers, Giacomo D’Amicantonio, Egor Bondarev, Fons van der Sommen

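TL;DR: SPECTRE is a transformer-based foundation model for 3D medical imaging that extracts general-purpose CT representations through self-supervised and cross-modal pretraining, addresses the unique challenges of volumetric CT, and performs strongly on multiple benchmarks.

Details

Motivation: Volumetric CT poses unique challenges (extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision), which call for a scalable way to extract general-purpose representations.

Result: Across multiple CT benchmarks, SPECTRE outperforms prior CT foundation models in both zero-shot and fine-tuned settings.

Insight: High-performing foundation models can be trained on openly available data alone, without relying on private data.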

Abstract: We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.

[59] Dual-domain Adaptation Networks for Realistic Image Super-resolution cs.CVPDF

Chaowei Fang, Bolin Fu, De Cheng, Lechao Cheng, Guanbin Li

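TL;DR: The paper proposes Dual-domain Adaptation Networks, which efficiently adapt pre-trained image super-resolution models from synthetic to real-world datasets by combining spatial-domain and frequency-domain adaptation strategies, significantly improving super-resolution performance.

Details

Motivation: Real-world image super-resolution faces complex degradation patterns and limited real data, which makes it hard for existing methods to learn basic image features. To address this, the authors exploit the prior knowledge of pre-trained models and propose an efficient adaptation method.

Result: On public benchmarks including RealSR, D2CRealSR, and DRealSR, the method outperforms existing state-of-the-art models.

Insight: Combining spatial-domain and frequency-domain adaptation makes better use of the prior knowledge in pre-trained models while reducing the need for real-world data, offering a new approach to real-world super-resolution.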

Abstract: Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. Pre-trained SR models from large-scale synthetic datasets offer valuable prior knowledge, which can improve generalization, speed up training, and reduce the need for extensive real-world data in realistic SR tasks. In this paper, we introduce a novel approach, Dual-domain Adaptation Networks, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets. To achieve this target, we first set up a spatial-domain adaptation strategy through selectively updating parameters of pre-trained models and employing the low-rank adaptation technique to adjust frozen parameters. Recognizing that image super-resolution involves recovering high-frequency components, we further integrate a frequency domain adaptation branch into the adapted model, which combines the spectral data of the input and the spatial-domain backbone’s intermediate features to infer HR frequency maps, enhancing the SR result. Experimental evaluations on public realistic image SR benchmarks, including RealSR, D2CRealSR, and DRealSR, demonstrate the superiority of our proposed method over existing state-of-the-art models. Codes are available at: https://github.com/dummerchen/DAN.
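The low-rank adaptation technique mentioned in the abstract above can be sketched in a few lines: the pre-trained weight matrix stays frozen and only a small low-rank update is learned on top of it. The helper names, the toy shapes, and the scale factor are illustrative assumptions; this is the generic low-rank adaptation formulation, not code from the DAN repository.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(frozen, down, up, scale=1.0):
    """Effective weight W + scale * (up @ down), with W left untouched.

    frozen: (out x in) pre-trained weights, never updated
    down:   (rank x in) trainable projection
    up:     (out x rank) trainable projection
    """
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen, delta)]

# Toy 2x3 layer adapted with a rank-1 update (all values hypothetical)
frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
down = [[0.5, 0.0, -0.5]]
up = [[1.0], [2.0]]
print(lora_weight(frozen, down, up))
```

With rank 1, the update adds 3 + 2 = 5 trainable values instead of the 6 in the full matrix; the saving grows quickly with layer size.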

[60] QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy cs.CV | cs.ROPDF

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand

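TL;DR: QueryOcc proposes a query-based self-supervised framework that learns continuous 3D semantic occupancy directly from independent 4D spatio-temporal queries sampled across adjacent frames, without relying on 2D rendering consistency or discretized voxel grids, significantly improving semantic scene understanding.

Details

Motivation: Current self-supervised methods either rely on 2D rendering consistency, where 3D structure is learned only implicitly, or on discretized voxel grids, which limit spatial precision and scalability. QueryOcc aims to overcome these limitations by learning directly from 4D spatio-temporal queries.

Result: On the self-supervised Occ3D-nuScenes benchmark, QueryOcc improves semantic RayIoU by 26% over previous camera-based methods while running at 11.6 FPS.

Insight: Direct 4D query supervision is an efficient form of self-supervised learning that substantially improves 3D semantic occupancy accuracy while maintaining real-time performance.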

Abstract: Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
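The "contractive scene representation" in the abstract above keeps near-field coordinates intact and squeezes far-away space into a bounded region. A minimal sketch of one such mapping follows; it uses the well-known contraction from unbounded neural-field work (identity inside a radius, 2 - 1/r outside), which is an assumption here, since the abstract does not give QueryOcc's exact formula. The function name and `inner_radius` parameter are hypothetical.

```python
import math

def contract(point, inner_radius=1.0):
    """Map an unbounded 3D point into a ball of radius 2 * inner_radius.

    Points inside the inner radius are returned unchanged (near-field
    detail is preserved). Points outside keep their direction, but their
    normalized distance r is compressed to 2 - 1/r, so infinity maps to
    the outer shell.
    """
    norm = math.sqrt(sum(c * c for c in point))
    r = norm / inner_radius
    if r <= 1.0:
        return tuple(point)
    scale = (2.0 - 1.0 / r) / r
    return tuple(c * scale for c in point)

print(contract((0.5, 0.0, 0.0)))     # unchanged: inside the inner radius
print(contract((2.0, 0.0, 0.0)))     # (1.5, 0.0, 0.0)
print(contract((1000.0, 0.0, 0.0)))  # just under 2.0 along x
```

Because the output is always bounded, a fixed-size representation can cover arbitrarily distant regions, which is one way to read "reasoning under constant memory".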

[61] Equivariant-Aware Structured Pruning for Efficient Edge Deployment: A Comprehensive Framework with Adaptive Fine-Tuning cs.CV | cs.LGPDF

Mohammed Alnemari

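TL;DR: The paper proposes a framework that combines group equivariant convolutional neural networks (G-CNNs) with equivariant-aware structured pruning to produce compact models that are invariant to geometric transformations for resource-constrained environments. It uses dynamic INT8 quantization and adaptive fine-tuning for efficient compression and accuracy recovery.

Details

Motivation: Deep learning models face compute constraints when deployed on edge devices, especially for tasks that need invariance to geometric transformations (such as satellite image analysis). Existing pruning methods can break a model's equivariance, so a framework is needed that preserves equivariance while staying efficient.

Result: Experiments on EuroSAT, CIFAR-10, and Rotated MNIST show a 29.3% parameter reduction, with adaptive fine-tuning recovering much of the lost accuracy.

Insight: Structured pruning that preserves equivariance can compress G-CNNs efficiently, and it is particularly relevant to geometric vision tasks such as satellite image analysis. Adaptive fine-tuning is key to recovering performance after pruning.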

Abstract: This paper presents a novel framework combining group equivariant convolutional neural networks (G-CNNs) with equivariant-aware structured pruning to produce compact, transformation-invariant models for resource-constrained environments. Equivariance to rotations is achieved through the C4 cyclic group via the e2cnn library, enabling consistent performance under geometric transformations while reducing computational overhead. Our approach introduces structured pruning that preserves equivariant properties by analyzing e2cnn layer structure and applying neuron-level pruning to fully connected components. To mitigate accuracy degradation, we implement adaptive fine-tuning that automatically triggers when the accuracy drop exceeds 2%, using early stopping and learning rate scheduling for efficient recovery. The framework includes dynamic INT8 quantization and a comprehensive pipeline encompassing training, knowledge distillation, structured pruning, fine-tuning, and quantization. We evaluate our method on satellite imagery (EuroSAT) and standard benchmarks (CIFAR-10, Rotated MNIST), demonstrating effectiveness across diverse domains. Experimental results show 29.3% parameter reduction with significant accuracy recovery, demonstrating that structured pruning of equivariant networks achieves substantial compression while maintaining geometric robustness. Our pipeline provides a reproducible framework for optimizing equivariant models, bridging the gap between group-theoretic network design and practical deployment constraints, with particular relevance to satellite imagery analysis and geometric vision tasks.
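Two steps from the abstract above lend themselves to a small sketch: neuron-level pruning of a fully connected layer, and the rule that triggers fine-tuning when accuracy drops by more than 2%. The ranking criterion (L1 norm of each neuron's weights), the function names, and the reading of "2%" as two absolute percentage points are all assumptions; the abstract does not specify them.

```python
def prune_neurons(weights, keep_ratio):
    """Structured pruning of a fully connected layer.

    `weights` holds one row per output neuron. Neurons are ranked by the
    L1 norm of their row (an assumed criterion) and the weakest are
    removed whole, so the layer becomes physically smaller.
    Returns the pruned rows and the indices of the neurons that were kept.
    """
    num_keep = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)),
                    key=lambda i: sum(abs(w) for w in weights[i]),
                    reverse=True)
    kept = sorted(ranked[:num_keep])  # preserve original neuron order
    return [weights[i] for i in kept], kept

def needs_fine_tuning(acc_before, acc_after, max_drop=0.02):
    """True when pruning cost more accuracy than `max_drop` (absolute)."""
    return (acc_before - acc_after) > max_drop

# Toy layer with 4 neurons; accuracies below are made-up numbers
layer = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
pruned, kept = prune_neurons(layer, keep_ratio=0.5)
print(kept)                           # indices of surviving neurons
print(needs_fine_tuning(0.95, 0.92))  # drop of 0.03 exceeds the threshold
```

Removing whole neurons, rather than zeroing individual weights, is what makes the pruning "structured": the resulting layer is a smaller dense matrix that needs no sparse kernels at inference time.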

[62] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats cs.CV | cs.AIPDF

Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang

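TL;DR: The paper proposes a unified intervention framework that analyzes how three pathways in LVLMs (image-to-input-text, image-to-output-text, and text-to-text) interact to produce hallucinations, and proposes targeted methods to reduce them.

Details

Motivation: LVLMs perform well across many tasks but still hallucinate, and the causes are complex. The paper aims to provide a more comprehensive solution by analyzing the influence of the different pathways.

Result: Across multiple benchmarks, the method consistently reduces hallucinations under different alignment formats.

Insight: Hallucinations in LVLMs do not arise from a single pathway but from the interplay of several, and which pathways the model relies on varies with the alignment format.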

Abstract: Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer’s causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.

[63] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback cs.CV | cs.IRPDF

Bulat Khaertdinov, Mirela Popa, Nava Tintarev

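TL;DR: The paper proposes relevance feedback for text-to-image retrieval, improving the retrieval performance of vision-language models with four strategies (PRF, GRF, AFS, and explicit feedback); AFS in particular is more robust in iterative, multi-turn retrieval.

Details

Motivation: Vision-language models (VLMs) perform well at visual search with natural language queries, but improving them usually requires fine-tuning or larger models. The paper aims to improve retrieval through relevance feedback without depending on fine-tuning.

Result: On Flickr30k and COCO, GRF, AFS, and explicit feedback improve MRR@5 by 3-5% for smaller VLMs and 1-3% for larger ones; AFS is more robust in multi-turn retrieval.

Insight: Relevance feedback is model-agnostic, works with VLMs of different sizes, and opens up opportunities for interactive visual search.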

Abstract: Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
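Classical pseudo-relevance feedback, as described in the abstract above, refines the query embedding using the top-ranked results. The sketch below shows one common form of that idea, a Rocchio-style update, along with the MRR@k metric the abstract reports. The weights `alpha` and `beta`, the toy embeddings, and all function names are assumptions for illustration; the paper's revised PRF may differ.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_rank(query, gallery):
    """Return gallery indices sorted by cosine similarity to the query."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

def pseudo_relevance_feedback(query, gallery, top_k=2, alpha=1.0, beta=0.5):
    """Rocchio-style update: pull the query toward the mean of its top-k hits."""
    top = cosine_rank(query, gallery)[:top_k]
    dim = len(query)
    centroid = [sum(gallery[i][d] for i in top) / len(top) for d in range(dim)]
    q, c = normalize(query), normalize(centroid)
    return normalize([alpha * q[d] + beta * c[d] for d in range(dim)])

def mrr_at_k(rankings, targets, k=5):
    """Mean reciprocal rank of each target within the top-k of its ranking."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        top = ranking[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(rankings)

gallery = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # toy image embeddings
query = [1.0, 0.2]                               # toy text embedding
refined = pseudo_relevance_feedback(query, gallery)
print(cosine_rank(refined, gallery))
```

Since no user labels are involved, the update can also reinforce a wrong first-round ranking, which is the query-drift risk the abstract says AFS and explicit feedback mitigate.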

[64] Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation cs.CV | cs.AI | cs.CYPDF

Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu

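TL;DR: The paper reveals a cultural gap in images generated by multilingual text-to-image (T2I) models and proposes a method for probing culture-sensitive neurons together with two alignment strategies to improve cross-lingual cultural consistency.

Details

Motivation: Current multilingual T2I models often produce culturally neutral or English-biased results, indicating that they fail to sufficiently activate culture-related representations when handling multilingual prompts.

Result: Experiments on CultureBench show that the proposed methods significantly improve cultural consistency while preserving fidelity and diversity.

Insight: The cultural gap in T2I models stems not from missing cultural knowledge but from insufficient activation of culture-related representations. Targeted neuron activation and layer updates can effectively address it.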

Abstract: Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without backbone fine-tuning; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.

[65] MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning cs.CVPDF

Wenrui Zhang, Xinggang Wang, Bin Feng, Wenyu Liu

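TL;DR: MolSight is a comprehensive learning framework for optical chemical structure recognition (OCSR) that uses three training stages (SMILES pretraining, multi-granularity learning, and reinforcement learning) to address the difficulty existing systems have in recognizing stereochemical information.

Details

Motivation: In modern chemical informatics, OCSR is essential for automatically extracting chemical structures from scientific literature, patents, and educational materials, but existing systems struggle to recognize stereochemical information and need improvement.

Result: Experiments show that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.

Insight: Even with a relatively compact model, the GRPO algorithm can further improve performance, indicating the potential of reinforcement learning for chemical structure recognition.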

Abstract: Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks, specifically chemical bond classification and atom localization, contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight’s relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model’s performance on stereomolecular. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.

[66] SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion cs.CVPDF

Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He

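TL;DR: The paper proposes SpatialGeo, which strengthens the spatial reasoning of multimodal large language models through geometry-semantics fusion, addressing the limitations of existing models in three-dimensional spatial reasoning.

Details

Motivation: Existing MLLMs have limited spatial reasoning ability, mainly because the embeddings of their vision encoders (e.g., CLIP) lose spatial information and capture only instance-level semantic features.

Result: Experiments show that SpatialGeo improves accuracy on SpatialRGPT-Bench by at least 8.0% while cutting inference memory cost by approximately 50%.

Insight: Fusing geometric and semantic features effectively improves a model's understanding of spatial information, offering a new way to improve MLLMs on complex visual tasks.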

Abstract: Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using a pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.

[67] MuM: Multi-View Masked Image Modeling for 3D Vision cs.CV | cs.AI | cs.LGPDF

David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman

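TL;DR: MuM improves feature learning for 3D vision through multi-view masked image modeling (MAE-style), simplifying and outperforming the CroCo design and performing strongly on downstream tasks.

Details

Motivation: Although self-supervised learning excels at semantic understanding, few efforts are optimized for geometric reasoning (3D vision). MuM builds on the CroCo design and focuses on learning features for 3D vision.

Result: MuM outperforms DINOv3 and CroCo v2 on tasks including 3D reconstruction, dense image matching, and relative pose estimation.

Insight: Uniform masking across views and a lightweight design are key to better 3D vision feature learning, and outperform more complex multi-view fusion approaches.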

Abstract: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
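To make "uniformly masking all views" concrete, the sketch below draws an independent random mask with the same ratio for every view of a scene, so no view is kept as a fully visible reference. The 75% ratio, the patch count, and the function name are assumptions for illustration; the abstract does not state MuM's masking hyperparameters.

```python
import random

def mask_views(num_views, patches_per_view, mask_ratio=0.75, seed=0):
    """Pick the same number of masked patches, at random, in every view.

    Returns one sorted list of masked patch indices per view. All views
    are treated alike; none is left unmasked.
    """
    rng = random.Random(seed)
    num_masked = int(patches_per_view * mask_ratio)
    return [sorted(rng.sample(range(patches_per_view), num_masked))
            for _ in range(num_views)]

# Hypothetical setup: 3 views of one scene, 16 patches each
masks = mask_views(num_views=3, patches_per_view=16)
for view_id, masked in enumerate(masks):
    print(view_id, len(masked))  # every view hides 12 of its 16 patches
```

Treating every view identically is what lets the scheme extend to an arbitrary number of views without designating source and target images.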

[68] Refracting Reality: Generating Images with Realistic Transparent Objects cs.CVPDF

Yue Yin, Enze Tao, Dylan Campbell

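TL;DR: The paper proposes a method that applies Snell's law of refraction during image generation to synchronize pixels inside and outside a transparent object, producing images of transparent objects that better obey the laws of optics.

Details

Motivation: Existing generative models handle the refraction, reflection, absorption, and scattering of transparent objects poorly and do not accurately follow the laws of optics. The paper targets the generation of refraction in transparent objects.

Result: Experiments show that the method generates images of transparent objects that are more optically plausible, significantly outperforming existing methods.

Insight: Incorporating optical laws (such as the law of refraction) and multi-view synchronization can effectively improve the physical plausibility of generative models, especially when rendering transparent objects.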

Abstract: Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object’s boundary with those outside by warping and merging the pixels using Snell’s Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image – a panorama centered at the object – using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
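The warping step in the abstract above rests on Snell's law, which fixes where a ray bends when it crosses a surface. A minimal sketch of the standard vector form follows; the function name and argument conventions are assumptions, and the paper's warping and merging pipeline is not reproduced here.

```python
import math

def refract(incident, normal, n1, n2):
    """Refract a unit direction at a surface using Snell's law.

    `incident` is a unit vector pointing toward the surface and `normal`
    is the unit surface normal pointing back against the incoming ray.
    n1 and n2 are the refractive indices before and after the surface.
    Returns the refracted unit direction, or None on total internal
    reflection.
    """
    eta = n1 / n2
    cos_i = -sum(i * n for i, n in zip(incident, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # sin^2 of refracted angle
    if sin2_t > 1.0:
        return None
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * i + (eta * cos_i - cos_t) * n
                 for i, n in zip(incident, normal))

# A ray hitting glass (index 1.5, an example value) from air at 45 degrees
s = math.sqrt(0.5)
bent = refract((s, 0.0, -s), (0.0, 0.0, 1.0), n1=1.0, n2=1.5)
print(bent)  # x-component shrinks: the ray bends toward the normal
```

Following the refracted ray to the background surface it hits gives the pixel whose color should appear through the object, which is the constraint the abstract says generative models fail to respect.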

[69] Loomis Painter: Reconstructing the Painting Process cs.CVPDF

Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren

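TL;DR: The paper proposes Loomis Painter, a unified framework for multi-media painting process generation that uses a semantics-driven style control mechanism and cross-medium style augmentation to address the weaknesses of existing generative models in cross-media consistency and temporal coherence.

Details

Motivation: Existing painting tutorial videos lack interactivity and personalization, while generative models generalize poorly across media and lack temporal coherence, so they cannot faithfully reproduce human creative workflows.

Result: The method achieves strong results in cross-media consistency, temporal coherence, and final-image fidelity.

Insight: Semantics-driven style control and cross-medium augmentation bring texture evolution and style transfer closer to the human creative process, and the PDP curve provides a new tool for quantifying the artistic workflow.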

Abstract: Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.

[70] DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture cs.CVPDF

Xiangteng He, Shunsuke Sakai, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa

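TL;DR: DSeq-JEPA is a self-supervised learning method that combines the Joint-Embedding Predictive Architecture (JEPA) with sequential reasoning, improving visual representation learning through discriminative sequential prediction.

Details

Motivation: Existing I-JEPA methods predict all regions uniformly and independently, with no explicit modeling of region importance or order. Inspired by human selective attention, the authors propose sequential prediction.

Result: DSeq-JEPA outperforms I-JEPA variants on image classification, fine-grained visual categorization, detection and segmentation, and low-level reasoning tasks.

Insight: Self-supervised learning that combines predictive learning with sequential reasoning captures discriminative features for vision tasks more effectively.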

Abstract: Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues – a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
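The two steps in the abstract above, ranking regions by saliency and then predicting them one at a time in that order, can be sketched as a simple schedule. The function name, the toy saliency scores, and the choice to let the context grow with each predicted region are assumptions for illustration; the abstract does not describe the exact conditioning.

```python
def prediction_schedule(saliency, num_primary=1):
    """Turn per-region saliency scores into an ordered list of steps.

    The `num_primary` most salient regions form the initial context.
    Each later region becomes a prediction target in descending saliency
    order, conditioned on every region that came before it.
    Returns a list of (context_indices, target_index) pairs.
    """
    order = sorted(range(len(saliency)),
                   key=lambda i: saliency[i], reverse=True)
    return [(order[:pos], order[pos])
            for pos in range(num_primary, len(order))]

# Hypothetical saliency scores for four image regions
steps = prediction_schedule([0.1, 0.9, 0.4, 0.7])
for context, target in steps:
    print(context, "->", target)
```

This mirrors next-token prediction in language models, with the saliency ranking playing the role of the sequence order.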

[71] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification cs.CVPDF

Taixi Chen, Jingyun Chen, Nancy Guo

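TL;DR: The paper proposes a Unified Attention-Mamba (UAM) backbone for tumor cell classification within a multimodal framework. By flexibly combining Attention and Mamba modules, it improves cell-level classification and image segmentation and achieves state-of-the-art results on public datasets.

Details

Motivation: Existing studies focus mainly on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. There is also no backbone designed specifically for radiomics data. The authors therefore propose a unified multimodal framework to address these problems.

Result: Experiments show that UAM surpasses the best existing models on both cell classification and tumor segmentation, raising classification accuracy from 74% to 78% and segmentation precision from 75% to 80%.

Insight: Through its unified architecture, UAM demonstrates flexibility and effectiveness on multimodal tasks and provides a new foundation for radiomics-driven cancer diagnosis.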

Abstract: Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.

[72] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation cs.CVPDF

Seamie Hayes, Reenu Mohandas, Tim Brophy, Alexandre Boulch, Ganesh Sistu

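TL;DR: SuperQuadricOcc is a superquadric-based method for real-time self-supervised occupancy estimation. A multi-layer Gaussian approximation makes training and inference efficient, substantially reducing memory use and increasing speed.

Details

Motivation: Gaussian representations in self-supervised occupancy estimation require a lot of memory and are unsuitable for real-time inference, whereas superquadrics, with their diverse shape set, can reduce both the primitive count and memory requirements. However, the lack of a superquadric rasterizer has hindered their use in this task.

Result: On the Occ3D dataset, SuperQuadricOcc improves mIoU by 5.9%, reduces memory footprint by 75%, and speeds up inference by 124%.

Insight: Superquadrics can serve as an efficient scene representation, and combining them with a Gaussian approximation substantially improves computational efficiency and memory requirements.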

Abstract: Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75% reduction in memory footprint, 124% faster inference, and a 5.9% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be released as open source.
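To make the "diverse shape set" of superquadrics concrete, the sketch below evaluates the standard textbook inside-outside function of a superquadric. The abstract does not give the paper's parameterization, so the function name, argument names, and the axis-aligned, origin-centered form are assumptions for illustration only.

```python
def superquadric_inside_outside(x, y, z, a1, a2, a3, e1, e2):
    """Evaluate the standard superquadric inside-outside function at a point.

    a1, a2, a3 are the axis scales; e1, e2 are the shape exponents.
    The value is < 1 inside the surface, 1 on it, and > 1 outside.
    """
    # abs() keeps fractional powers well-defined for negative coordinates.
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    z_term = abs(z / a3) ** (2.0 / e1)
    return xy_term + z_term


# With e1 = e2 = 1 the shape is an ellipsoid (here a unit sphere).
on_surface = superquadric_inside_outside(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0)
inside = superquadric_inside_outside(0.2, 0.1, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0)
# Small exponents give a box-like shape: this point lies outside the unit
# sphere but inside the rounded box with the same scales.
box_corner = superquadric_inside_outside(0.8, 0.8, 0.8, 1.0, 1.0, 1.0, 0.2, 0.2)
```

Changing only the two exponents moves the same primitive between sphere-like and box-like shapes, which is one way to see why fewer primitives may suffice for a scene.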

[73] SVRecon: Sparse Voxel Rasterization for Surface Reconstruction cs.CVPDF

Seunghun Oh, Jaesung Choe, Dongjae Lee, Daeun Lee, Seunghoon Jeong

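TL;DR: SVRecon is a new sparse voxel rasterization framework that incorporates an SDF for high-fidelity surface reconstruction, using geometric initialization and a spatial smoothness loss to avoid local minima.

Details

Motivation: Because sparse voxels are spatially decoupled and have sharp boundaries, conventional sparse voxel methods tend to get stuck in local minima during optimization, and the natural smoothness of an SDF is hard to preserve across independently parameterized sparse voxels.

Result: The method shows high reconstruction accuracy and fast convergence across a range of benchmarks.

Insight: Combining sparse voxels with an SDF improves reconstruction quality and optimization efficiency; geometric initialization and the smoothness loss are the key ingredients.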

Abstract: We extend the recently proposed sparse voxel rasterization paradigm to the task of high-fidelity surface reconstruction by integrating a Signed Distance Function (SDF); we call the resulting method SVRecon. Unlike 3D Gaussians, sparse voxels are spatially disentangled from their neighbors and have sharp boundaries, which makes them prone to local minima during optimization. Although SDF values provide a naturally smooth and continuous geometric field, preserving this smoothness across independently parameterized sparse voxels is nontrivial. To address this challenge, we promote coherent and smooth voxel-wise structure through (1) robust geometric initialization using a visual geometry model and (2) a spatial smoothness loss that enforces coherent relationships across parent-child and sibling voxel groups. Extensive experiments across various benchmarks show that our method achieves strong reconstruction accuracy while converging consistently fast. The code will be made public.

[74] MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration cs.CVPDF

Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei

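TL;DR: MorphSeek is a fine-grained, latent representation-level policy optimization paradigm for deformable image registration. Using a Gaussian policy head and Group Relative Policy Optimization, it achieves notable gains on 3D registration benchmarks.

Details

Motivation: Deformable image registration (DIR) remains challenging because of the high-dimensional deformation space and the lack of voxel-level supervision. Existing reinforcement learning methods typically project this space into low-dimensional representations, which limits their ability to capture spatially variant deformations.

Result: On three benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek clearly outperforms baselines on Dice while keeping parameter and latency overhead low.

Insight: MorphSeek offers a general representation-level policy learning paradigm that supports spatially coherent, data-efficient deformation optimization and suits visual alignment tasks in high-dimensional settings.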

Abstract: Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.
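The multi-trajectory sampling mentioned above relies on scoring each sampled trajectory relative to its group. A minimal sketch of that group-relative normalization follows; it assumes one scalar reward per trajectory (for instance a Dice score), uses the population standard deviation, and the function name and `eps` parameter are hypothetical. It is not the paper's implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for three trajectories sampled for the same image pair.
advantages = group_relative_advantages([0.70, 0.80, 0.90])
```

Trajectories that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.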

[75] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment cs.CVPDF

Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu

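TL;DR: The paper proposes the MCMoE framework, which uses a Mixture of Experts to dynamically generate missing modalities and learn joint cross-modal representations, addressing the performance drop caused by missing modalities in multimodal Action Quality Assessment (AQA).

Details

Motivation: In multimodal AQA, missing modalities (for example, unavailable sensor data) cause existing models to fail or degrade severely. The paper aims to handle missing modalities and improve robustness and performance.

Result: On three public AQA benchmarks, MCMoE achieves state-of-the-art performance in both complete and incomplete multimodal learning.

Insight: Missing modalities can be mitigated through dynamic generation and fusion of modality knowledge; the mixture-of-experts mechanism offers clear advantages in multimodal learning.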

Abstract: Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.

[76] Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required? cs.CV | cs.AIPDF

Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev

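TL;DR: MoE-ViT is a sparse mixture-of-experts architecture that targets efficient attention for multi-channel images, avoiding the computational cost of modeling all channel interactions.

Details

Motivation: In multi-channel images (such as cell painting or satellite imagery), each channel carries different information. Existing methods tokenize channels independently, so attention cost grows quadratically and becomes expensive. The key question is therefore whether all channel interactions need to be modeled.

Result: On the JUMP-CP and So2Sat datasets, MoE-ViT achieves substantial efficiency gains and, in some cases, better performance.

Insight: In multi-channel imaging tasks, not all channel interactions are necessary; a sparse selection mechanism can cut computation while preserving model performance.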

Abstract: Vision Transformers (ViTs) have become the backbone of vision foundation models, yet their optimization for multi-channel domains, such as cell painting or satellite imagery, remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block: channel-wise comparisons lead to a quadratic growth in attention, resulting in excessive FLOPs and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: “Is it necessary to model all channel interactions?” Inspired by the philosophy of Sparse Mixture-of-Experts (MoE), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in ViTs, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets (JUMP-CP and So2Sat) demonstrate that MoE-ViT achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.
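The routing idea can be sketched as a top-k selection over per-channel router scores, followed by a rough count of attention pairs. The scores below are made up, the function names are hypothetical, and the pair count assumes every patch keeps the same number of channel tokens, which is a simplification of this sketch rather than a claim about the paper.

```python
def route_channels(router_scores, k):
    """Return the indices of the k highest-scoring channels for one patch."""
    ranked = sorted(range(len(router_scores)), key=lambda c: (-router_scores[c], c))
    return sorted(ranked[:k])


def attention_pairs(num_patches, channels_per_patch):
    """Count token pairs in full self-attention when each (patch, channel) is a token."""
    tokens = num_patches * channels_per_patch
    return tokens * tokens


# Made-up router scores for one patch of a 5-channel image.
selected = route_channels([0.1, 2.0, -0.5, 1.2, 0.3], k=2)

# Keeping 2 of 8 channels per patch shrinks the pair count by (8 / 2) ** 2.
dense_pairs = attention_pairs(num_patches=196, channels_per_patch=8)
sparse_pairs = attention_pairs(num_patches=196, channels_per_patch=2)
```

Because attention cost scales with the square of the token count, dropping channels per patch reduces the pair count quadratically in this simplified model.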

[77] REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing cs.CV | cs.AIPDF

Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir

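TL;DR: The paper introduces REMSA, a large language model (LLM)-based agent that automatically selects remote sensing foundation models (RSFMs) from natural language queries, and the RSFM Database (RS-FMD), which together address the difficulty of model selection.

Details

Motivation: RSFMs are widely used, but scattered documentation, heterogeneous formats, and varied deployment constraints make it hard to choose a suitable model.

Result: Evaluated on 900 configurations, REMSA outperforms baselines such as naive agents, dense retrieval, and unstructured RAG-based LLMs.

Insight: An agent that relies entirely on publicly available metadata can handle the complexity of RSFM selection without accessing private or sensitive data.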

Abstract: Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.

[78] MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models cs.CVPDF

Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz

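TL;DR: The paper proposes MMT-ARD, a multimodal multi-teacher adversarial distillation framework for improving the adversarial robustness of vision-language models. With a dual-teacher knowledge fusion architecture and a dynamic weight allocation strategy, it strengthens robustness while preserving clean features.

Details

Motivation: As vision-language models (VLMs) are increasingly deployed in safety-critical applications, their adversarial robustness has become an important concern. Traditional single-teacher knowledge distillation suffers from limited knowledge diversity, slow convergence, and difficulty balancing robustness and accuracy.

Result: On ImageNet and zero-shot benchmarks, the ViT-B-32 model gains 4.32% in robust accuracy and 3.5% in zero-shot accuracy, with a 2.3x increase in training efficiency.

Insight: Through multi-teacher collaboration and dynamic weighting, MMT-ARD improves model performance under adversarial attack and offers a scalable approach to robustness for large multimodal models.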

Abstract: Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our code is available at https://github.com/itsnotacie/MMT-ARD.

[79] Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift cs.CVPDF

Björn Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet

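TL;DR: The paper studies how to use vision foundation models (VFMs) in unsupervised domain adaptation to improve semantic segmentation of lidar point clouds. By improving multimodal distillation, it achieves state-of-the-art results under domain shift.

Details

Motivation: Lidar semantic segmentation networks trained on one type of lidar generalize poorly to other lidars. The paper aims to address this domain shift through multimodal distillation and improve generalization.

Result: The proposed method achieves state-of-the-art performance in four widely recognized, challenging settings.

Insight: The study shows the importance of keeping the pretrained backbone frozen and the effectiveness of training an MLP head for cross-domain generalization.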

Abstract: Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximize the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.

[80] Counterfactual World Models via Digital Twin-conditioned Video Diffusion cs.CVPDF

Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

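TL;DR: The paper proposes CWMDT, a new counterfactual world model framework that uses digital twins and a video diffusion model to predict how an environment evolves under interventions, addressing the inability of traditional models to modify specific scene properties.

Details

Motivation: Traditional world models predict from factual observations and do not support counterfactual queries, such as how a scene changes after an object is removed. CWMDT aims to fill this gap by enabling selective interventions on scene properties.

Result: CWMDT achieves state-of-the-art performance on two benchmarks, supporting the claim that digital twin representations provide strong control signals for world models.

Insight: Digital twins and structured text representations show the potential of combining explicit reasoning with generative models in video prediction.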

Abstract: World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as “what would happen if this object was removed?”, is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.

[81] An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI cs.CVPDF

Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee

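TL;DR: The paper proposes a computer-vision-based deep learning method that estimates human spine age from more than 18,000 MRI series, and examines how the gap between spine age and actual age (the spine age gap, SAG) relates to degenerative spine conditions and lifestyle factors.

Details

Motivation: Spine health is crucial to quality of life, but the spine is prone to degenerative changes with age. An efficient quantitative method for assessing spine aging and its relationship to health and lifestyle has been lacking.

Result: SAG is significantly associated with conditions such as disc bulges, osteophytes, spinal stenosis, and fractures, as well as lifestyle factors such as smoking and physically demanding work, and may serve as a biomarker of spine health.

Insight: The study provides a new tool for assessing spine aging and identifies specific links between spine health and lifestyle, offering a scientific basis for prevention and early intervention in degenerative spine disease.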

Abstract: The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.
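The spine age gap itself is a simple difference, sketched below. The sign convention (predicted age minus chronological age, so a positive gap means the spine looks older than the subject) is an assumption of this sketch, since the abstract only says "difference"; the ages are made up and the function name is hypothetical.

```python
def spine_age_gap(predicted_age, chronological_age):
    """Return the spine age gap in years (positive: spine appears older)."""
    return predicted_age - chronological_age


# Made-up (predicted, chronological) ages for two subjects.
subjects = [(52.0, 47.5), (38.0, 41.0)]
gaps = [spine_age_gap(pred, actual) for pred, actual in subjects]
```

Associations with conditions or lifestyle factors would then be tested on these per-subject gaps rather than on the raw predicted ages.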

[82] Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models cs.CVPDF

Mark Endo, Serena Yeung-Levy

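TL;DR: The paper studies perception and reasoning bottlenecks in small multimodal models, finding that downscaling the LLM affects visual capabilities more than reasoning. To address this, the authors propose visual extraction tuning and step-by-step reasoning, combined as the Extract+Think approach.

Details

Motivation: Although large multimodal models have made notable progress in visual understanding and reasoning, practical applications call for smaller, more efficient models. Downscaling the LLM, however, has a disproportionate effect on multimodal capabilities, especially visual ones.

Result: The effect of LLM downscaling on visual capabilities can even exceed its effect on reasoning; Extract+Think performs well at small model scales and sets a new standard for efficiency and performance.

Insight: Preserving visual capability is critical for small multimodal models; extracting key visual details and then reasoning step by step mitigates the performance loss from downscaling.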

Abstract: Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.

[83] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination cs.CVPDF

Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi

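TL;DR: Video-R4 is a new video reasoning method that understands text-rich videos through iterative visual rumination, markedly improving its ability to capture fine-grained evidence.

Details

Motivation: Existing video QA models usually rely on single-pass perception over fixed frames, which leads to hallucinations or missed fine-grained text. Inspired by how humans repeatedly inspect key regions, Video-R4 iteratively selects frames, zooms into key regions, and updates its reasoning state.

Result: Video-R4 achieves state-of-the-art results on M4-ViteVQA and generalizes well to multi-page document QA, slides QA, and generic video QA.

Insight: Iterative visual rumination is an effective paradigm for pixel-grounded multimodal reasoning, particularly for text-rich video tasks that require repeated inspection.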

Abstract: Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.

[84] EvDiff: High Quality Video with an Event Camera cs.CVPDF

Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma

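TL;DR: EvDiff is an event-based diffusion model that uses a surrogate training framework to generate high-quality color video from monochromatic event streams, substantially improving fidelity and realism.

Details

Motivation: Event cameras are neuromorphic sensors with high temporal resolution and high dynamic range, but reconstructing high-quality video from events is highly ill-posed. Existing methods typically use deterministic mappings, which yield perceptually inferior results.

Result: Experiments on real-world datasets show that EvDiff outperforms existing methods on both pixel-level and perceptual metrics.

Insight: Combining a surrogate training framework with a diffusion model offers a new direction for event-camera video generation and shows the potential of leveraging large-scale image datasets.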

Abstract: As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.

[85] Native 3D Editing with Full Attention cs.CV | cs.GRPDF

Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang

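TL;DR: The paper proposes a novel native 3D editing framework that manipulates 3D representations directly in a single forward pass, addressing the slow speed and inconsistent geometry of existing methods.

Details

Motivation: Among existing 3D editing methods, optimization-based approaches are slow, while feed-forward approaches based on multi-view 2D editing suffer from inconsistent geometry and degraded visual quality. An efficient and consistent solution is needed.

Result: Experiments show that 3D token concatenation is more efficient and performs better, and that the method outperforms existing 2D-lifting approaches in generation quality, 3D consistency, and instruction fidelity.

Insight: Feed-forward methods that operate directly on 3D representations can balance efficiency and consistency, and 3D token concatenation is the key technique for achieving this.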

Abstract: Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.

cs.CL [Back]

[86] Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search cs.CL | cs.AIPDF

Dong Liu, Yanxuan Yu

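TL;DR: The paper proposes Semantic Pyramid Indexing (SPI), a multi-resolution vector indexing framework that improves the efficiency of retrieval-augmented generation (RAG) systems by adjusting resolution dynamically to balance retrieval speed and semantic relevance.

Details

Motivation: Existing vector database retrieval systems typically use single-resolution index structures that cannot adapt to the varying semantic granularity of user queries, resulting in a poor trade-off between retrieval speed and semantic relevance.

Result: Across several RAG tasks, SPI achieves up to a 5.7x retrieval speedup and a 1.8x memory efficiency gain, and improves end-to-end QA F1 by up to 2.5 points.

Insight: Multi-resolution indexing can substantially improve retrieval efficiency and semantic coverage, and its compatibility with existing infrastructure makes it easy to deploy in production.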

Abstract: Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse-to-fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework’s compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at https://github.com/FastLM/SPI_VecDB.
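Coarse-to-fine retrieval can be illustrated with a generic two-level search: score cluster centroids first, then rank only the documents in the best cluster. This is a sketch under stated assumptions, not SPI's algorithm; it omits the per-query resolution classifier, and the data layout, function names, and toy embeddings are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def coarse_to_fine_search(query, clusters, top_k=1):
    """Pick the best cluster by centroid score, then rank its members.

    clusters is a list of (centroid, members) pairs, where members is a
    list of (doc_id, embedding) pairs.
    """
    # Coarse level: compare the query against one centroid per cluster.
    _, members = max(clusters, key=lambda c: dot(query, c[0]))
    # Fine level: score only the documents inside the chosen cluster.
    ranked = sorted(members, key=lambda m: dot(query, m[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Toy two-cluster index with 2-D embeddings.
clusters = [
    ([1.0, 0.0], [("a", [0.9, 0.1]), ("b", [0.8, 0.3])]),
    ([0.0, 1.0], [("c", [0.1, 0.9]), ("d", [0.2, 0.7])]),
]
result = coarse_to_fine_search([0.1, 1.0], clusters, top_k=2)
```

The speedup in such schemes comes from skipping the fine-level scoring for every cluster that loses at the coarse level, at the risk of missing relevant documents stored elsewhere.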

[87] Bench360: Benchmarking Local LLM Inference from 360° cs.CL | cs.AI | cs.LG | cs.PFPDF

Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst

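TL;DR: The paper introduces Bench360, a framework for benchmarking local large language model (LLM) inference from all angles, helping users find the best trade-off among many configurations.

Details

Motivation: Running LLMs locally involves many configuration choices, and there is no unified benchmarking framework that balances functional and non-functional requirements. Users must evaluate many configurations by hand, which is inefficient.

Result: Experiments cover four common LLM tasks, three hardware platforms, and four inference engines. They reveal trade-offs between task performance and system efficiency and show that there is no single best configuration for local inference.

Insight: The best configuration for local LLM inference depends heavily on the scenario and requirements; Bench360 gives users a practical tool for choosing a suitable configuration efficiently.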

Abstract: Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration – balancing functional and non-functional requirements – requires substantial manual effort. While several benchmarks target LLM inference, they are designed for narrow evaluation goals and not user-focused. They fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 – Benchmarking Local LLM Inference from 360°. Bench360 allows users to easily define their own custom tasks along with datasets and relevant task-specific metrics and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch, and server). Bench360 tracks a wide range of metrics, including (1) system metrics – such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time) – and (2) task-specific metrics such as ROUGE, F1 score, or accuracy. We demonstrate Bench360 on four common LLM tasks – General Knowledge & Reasoning, QA, Summarization, and Text-to-SQL – across three hardware platforms and four state-of-the-art inference engines. Our results reveal several interesting trade-offs between task performance and system-level efficiency, highlighting the differences in inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.
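The "overwhelming number of configuration choices" is a Cartesian product of the benchmark's axes, which a short sketch makes concrete. The model, engine, and quantization names below are placeholders, and `configuration_grid` is a hypothetical helper, not part of Bench360's interface.

```python
from itertools import product


def configuration_grid(models, engines, quantizations, scenarios):
    """Enumerate every combination of the four benchmark axes."""
    return [
        {"model": m, "engine": e, "quantization": q, "scenario": s}
        for m, e, q, s in product(models, engines, quantizations, scenarios)
    ]


# Placeholder names; only the scenario labels come from the abstract.
grid = configuration_grid(
    models=["model-a", "model-b"],
    engines=["engine-x", "engine-y", "engine-z"],
    quantizations=["fp16", "int4"],
    scenarios=["single_stream", "batch", "server"],
)
num_configs = len(grid)
```

Even this small grid yields 2 × 3 × 2 × 3 = 36 runs per task and hardware platform, which is why evaluating configurations by hand does not scale.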

[88] Reproducibility Report: Test-Time Training on Nearest Neighbors for Large Language Models cs.CLPDF

Boyang Zhou, Johan Lindqvist, Lindsey Li

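TL;DR: This paper reproduces a study on fine-tuning large language models at test time on nearest-neighbor sequences, confirming that the method significantly reduces perplexity and bits-per-byte, especially on structured or specialized datasets. It also introduces a memory-efficient retrieval implementation and extends the scope of the original study.

Details

Motivation: To verify how broadly applicable and effective nearest-neighbor test-time training is for large language models, particularly across models of different sizes and domains.

Result: Test-time training significantly reduces perplexity and bits-per-byte, especially on structured datasets. Smaller models can approach the performance of larger ones with this method.

Insight: Models not pretrained on similar data benefit more, and the memory-efficient retrieval implementation substantially lowers resource requirements, offering a practical route to large-scale retrieval-augmented adaptation.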

Abstract: We reproduce the central claims of Test-Time Training on Nearest Neighbors for Large Language Models (Hardt and Sun, 2024), which proposes adapting a language model at inference time by fine-tuning on retrieved nearest-neighbor sequences. Using pretrained RoBERTa embeddings indexed with Faiss, we retrieve 20 neighbors per test input and apply one gradient update per neighbor across GPT-2 (117M, 774M), GPT-Neo (1.3B), and R1-Distilled-Qwen2.5-1.5B. Our experiments confirm that test-time training significantly reduces perplexity and bits-per-byte metrics across diverse domains from The Pile, with the largest improvements in structured or specialized datasets such as GitHub and EuroParl. We further validate that models not pretrained on The Pile benefit more from this adaptation than models already trained on similar data, allowing smaller models to approach the performance of larger ones. Due to infrastructure limitations, we introduce a memory-efficient retrieval implementation that loads only required line offsets rather than entire files, reducing RAM requirements from over 128 GB per server to 32 GB. We also extend the original study by evaluating R1-Distilled-Qwen2.5-1.5B, showing that test-time training yields consistent gains even for modern reasoning-optimized architectures. Overall, our results support the robustness and generality of nearest-neighbor test-time training while highlighting practical considerations for reproducing large-scale retrieval-augmented adaptation.
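
The memory-saving idea of loading line offsets rather than whole files can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy corpus are hypothetical, and a real system would persist the offset table instead of rebuilding it.

```python
import os
import tempfile


def build_line_offsets(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets


def read_lines(path, offsets, wanted):
    """Fetch only the requested line numbers by seeking to their offsets."""
    results = {}
    with open(path, "rb") as handle:
        for index in wanted:
            handle.seek(offsets[index])
            results[index] = handle.readline().decode("utf-8").rstrip("\n")
    return results


def demo():
    # Hypothetical toy corpus standing in for one shard of retrieval data.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        for i in range(5):
            tmp.write(f"sequence {i}\n".encode("utf-8"))
        path = tmp.name
    try:
        offsets = build_line_offsets(path)
        # Only lines 3 and 1 are read; the rest of the file stays on disk.
        return read_lines(path, offsets, [3, 1])
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Only the integer offset table lives in memory, so RAM use grows with the number of lines rather than with the size of the text.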

[89] Improving Latent Reasoning in LLMs via Soft Concept Mixing cs.CLPDF

Kang Wang, Xiangyu Duan, Tianyi Du

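TL;DR: The paper proposes Soft Concept Mixing (SCM), a training method that improves the latent reasoning ability of large language models (LLMs) by mixing soft concept vectors into the model's hidden states.

Details

Motivation: LLMs usually reason by generating discrete tokens, which may limit their expressive power. SCM aims to close the gap between the soft concepts used in reasoning and the discrete tokens used in training.

Result: Experiments show that SCM improves LLM reasoning performance while keeping training dynamics stable.

Insight: Directly exposing the model to soft concept representations during training is an effective way to improve LLM reasoning.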

Abstract: Unlike human reasoning in abstract conceptual spaces, large language models (LLMs) typically reason by generating discrete tokens, which potentially limits their expressive power. The recent work Soft Thinking has shown that LLMs’ latent reasoning via soft concepts is a promising direction, but LLMs are trained on discrete tokens. To reduce this gap between the soft concepts in reasoning and the discrete tokens in training, we propose Soft Concept Mixing (SCM), a soft-concept-aware training scheme that directly exposes the model to soft representations during training. Specifically, SCM constructs a soft concept vector by forming a probability-weighted average of embeddings. Then, this vector is mixed into the model’s hidden states, which embody rich contextual information. Finally, the entire latent reasoning process is optimized with Reinforcement Learning (RL). Experiments on five reasoning benchmarks demonstrate that SCM improves the reasoning performance of LLMs, and simultaneously maintains a stable training dynamic.
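
The "probability-weighted average of embeddings" can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the abstract does not say how the vector is mixed into the hidden state, so the linear interpolation and its weight `alpha` below are hypothetical, as are the toy embeddings.

```python
import math


def softmax(logits):
    """Turn next-token logits into a probability distribution."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_concept_vector(probs, embeddings):
    """Probability-weighted average of token embeddings (one row per token)."""
    dim = len(embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, embeddings))
        for d in range(dim)
    ]


def mix_into_hidden(hidden, concept, alpha=0.5):
    """Assumed mixing rule: linear interpolation with a hypothetical weight."""
    return [(1 - alpha) * h + alpha * c for h, c in zip(hidden, concept)]


if __name__ == "__main__":
    # Toy vocabulary of two tokens with 2-dimensional embeddings.
    embeddings = [[1.0, 0.0], [0.0, 1.0]]
    probs = softmax([0.0, 0.0])          # uniform over the two tokens
    concept = soft_concept_vector(probs, embeddings)
    print(concept, mix_into_hidden([1.0, 1.0], concept))
```

With a one-hot distribution the soft concept vector reduces to an ordinary token embedding, which is the discrete case the model sees in standard training.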

[90] Deep Improvement Supervision cs.CL | cs.AI | cs.LGPDF

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

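TL;DR: The paper studies how to further improve the efficiency of small looped architectures such as Tiny Recursive Models (TRMs) with minimal changes, and proposes a new training scheme that significantly improves training efficiency while maintaining quality comparable to standard TRMs.

Details

Motivation: Recent work has shown that small looped architectures such as TRMs can outperform large language models (LLMs) on complex reasoning tasks. How to further improve the efficiency of these methods remains a core question.

Result: The method reduces the number of forward passes by 18x and eliminates halting mechanisms, while reaching 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.

Insight: With efficient design, small looped architectures can match or even exceed larger models, and the gain in training efficiency is the key advance.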

Abstract: Recently, it was shown that small, looped architectures, such as Tiny Recursive Models (TRMs), can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and implicit policy improvement algorithm. Building on these insights, we propose a novel training scheme that provides a target for each loop during training. We demonstrate that our approach significantly enhances training efficiency. Our method reduces the total number of forward passes by 18x and eliminates halting mechanisms, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.

[91] Do Vision-Language Models Understand Visual Persuasiveness? cs.CL | cs.CVPDF

Gyuwon Park

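TL;DR: The paper examines whether vision-language models (VLMs) understand visual persuasiveness, building a dataset and introducing a taxonomy of Visual Persuasive Factors (VPFs) for the analysis. High-level semantic alignment is the strongest predictor of human judgment, whereas VLMs discriminate poorly on low- and mid-level features; among reasoning interventions, object-grounded rationales work best.

Details

Motivation: VLMs have recently made notable progress in multimodal reasoning and understanding, but it remains unclear whether they truly understand visual persuasion, that is, how visual cues shape human attitudes and decisions.

Result: VLMs show a recall-oriented bias (over-predicting high persuasiveness) and weak discriminative power for low- and mid-level features, while high-level semantic alignment is the strongest predictor of human judgment. Concise, object-grounded rationales significantly improve precision and F1.

Insight: The core limitation of VLMs lies not in recognizing persuasive objects but in linking them to communicative intent. Concise, object-grounded rationales are an effective strategy for improving models' understanding of persuasiveness.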

Abstract: Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet, whether these models truly grasp visual persuasion (how visual cues shape human attitudes and decisions) remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs’ core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.

[92] Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments cs.CLPDF

Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech

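TL;DR: The paper proposes AnalyticScore, an interpretable automated scoring framework for large-scale educational assessments. Guided by four principles (FGTI) and using LLM-based feature extraction, it achieves both high accuracy and interpretability.

Details

Motivation: Current AI-driven automated scoring systems lack transparency and interpretability in large-scale educational assessments and struggle to meet stakeholders' needs.

Result: AnalyticScore outperforms many uninterpretable methods in accuracy and comes within 0.06 QWK of the uninterpretable SOTA.

Insight: The LLM-based featurization aligns well with the behavior of human annotators, supporting the framework's interpretability and practicality.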

Abstract: AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability – Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) – targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
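
The accuracy gap above is reported in QWK (quadratic weighted kappa), an agreement measure for ordinal scores. The sketch below implements the standard definition in plain Python; it is not taken from the AnalyticScore code, and the example ratings are invented.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """QWK for integer labels in range(num_labels); 1.0 is perfect agreement."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [
        sum(observed[i][j] for i in range(num_labels))
        for j in range(num_labels)
    ]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            # Quadratic penalty grows with the distance between the scores.
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        return 1.0  # both raters used a single identical label
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    human = [0, 1, 2, 1]   # invented human scores
    model = [0, 1, 2, 1]   # invented model scores
    print(quadratic_weighted_kappa(human, model, num_labels=3))
```

Because the penalty is quadratic, a prediction that is off by two score points costs four times as much as one that is off by one, which suits ordinal rubrics.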

[93] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design cs.CL | cs.AI | cs.DCPDF

Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli

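TL;DR: The paper reports the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware (MI300X GPUs with Pollara interconnect), offers practical guidance on systems and model design, and shows competitive performance for the ZAYA1-base model.

Details

Motivation: To validate the maturity and optimization of AMD hardware (MI300X GPUs and Pollara interconnect) and its software stack for large-scale pretraining, filling a gap in the literature.

Result: ZAYA1-base (760M active, 8.3B total parameters) outperforms models such as Llama-3-8B and OLMoE on reasoning, mathematics, and coding tasks, and is comparable to Qwen3-4B and Gemma3-12B.

Insight: AMD hardware and its ecosystem can now support competitive large-scale pretraining, providing a useful reference for future model design and hardware optimization.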

Abstract: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model, ZAYA1 (760M active, 8.3B total parameters MoE), which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.

[94] Parrot: Persuasion and Agreement Robustness Rating of Output Truth – A Sycophancy Robustness Benchmark for LLMs cs.CL | cs.AI | cs.CE | cs.LGPDF

Yusuf Çelebi, Mahmoud El Hussieni, Özay Ezerceli

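TL;DR: PARROT is a framework for evaluating how much the accuracy of large language models (LLMs) degrades under pressure from authority and persuasion; it reveals marked differences in how models behave under social pressure.

Details

Motivation: To assess sycophancy, the excessive conformity of LLMs when confronted with authoritatively asserted misinformation, in order to support safe real-world deployment.

Result: Advanced models such as GPT-5 hold up well (follow rates ≤ 11%), while older or smaller models such as GPT-4 show severe epistemic collapse (follow rate of 80%).

Insight: The study shows that fragility varies by knowledge domain and argues that resistance to social pressure should rank alongside accuracy and safety as a primary objective for model deployment.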

Abstract: This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that occurs in large language models (LLMs) under social pressure exerted through authority and persuasion, i.e., the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low “follow rates” ($\leq 11\%$, GPT-5: 4%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of “resistance to overfitting pressure” should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.

[95] Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables cs.CL | cs.AI | cs.CVPDF

Anshul Singh, Rohan Chaudhary, Gagneet Singh, Abhay Kumary

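TL;DR: The paper analyzes how the performance of existing VLMs degrades in multilingual and visually noisy settings, introduces MirageTVQA, a new benchmark with multilingual and visually noisy tables, and identifies two main failure modes.

Details

Motivation: Existing VLM benchmarks such as WikiTableQuestions and FinQA are mostly monolingual (English) and present tables in a clean format, so they do not reflect the multilingual, visually noisy nature of real-world data.

Result: Leading VLMs lose more than 35% in performance under visual noise and show an English-first bias in multilingual settings.

Insight: For real-world use, VLMs need further work to handle multilingual input and visual noise; MirageTVQA provides a benchmark for future research.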

Abstract: The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLM models for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.

Benjamin White, Anastasia Shimorina

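TL;DR: This paper proposes a hybrid approach to predicting social media user behavior that models common and rare actions separately. It performs well on a Bluesky dataset and won first place in the SocialSim challenge.

Details

Motivation: Existing methods focus mainly on common actions such as liking and retweeting, while the prediction of rare but significant behaviors has received little study. This paper aims to fill that gap with a more comprehensive approach to user behavior prediction.

Result: The average macro F1-score is 0.64 for common action prediction and 0.56 for rare action classification. The approach ranked first in the SocialSim challenge.

Insight: Social media behavior prediction calls for modeling strategies tailored to the action type; handling common and rare actions separately improves overall results.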

Abstract: Understanding and predicting user behavior on social media platforms is crucial for content recommendation and platform design. While existing approaches focus primarily on common actions like retweeting and liking, the prediction of rare but significant behaviors remains largely unexplored. This paper presents a hybrid methodology for social media user behavior prediction that addresses both frequent and infrequent actions across a diverse action vocabulary. We evaluate our approach on a large-scale Bluesky dataset containing 6.4 million conversation threads spanning 12 distinct user actions across 25 persona clusters. Our methodology combines four complementary approaches: (i) a lookup database system based on historical response patterns; (ii) persona-specific LightGBM models with engineered temporal and semantic features for common actions; (iii) a specialized hybrid neural architecture fusing textual and temporal representations for rare action classification; and (iv) generation of text replies. Our persona-specific models achieve an average macro F1-score of 0.64 for common action prediction, while our rare action classifier achieves 0.56 macro F1-score across 10 rare actions. These results demonstrate that effective social media behavior prediction requires tailored modeling strategies recognizing fundamental differences between action types. Our approach achieved first place in the SocialSim: Social-Media Based Personas challenge organized at the Social Simulation with LLMs workshop at COLM 2025.
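
Macro F1 averages per-class F1 scores with equal weight, which is why it is a natural choice when rare actions matter. The sketch below uses the standard definition; the action labels and predictions are invented for illustration and are not drawn from the Bluesky dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Invented example: a classifier that always predicts the common action.
    truth = ["like", "like", "like", "block"]
    preds = ["like", "like", "like", "like"]
    accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
    print(accuracy, macro_f1(truth, preds))
```

In this toy case accuracy is 0.75 while macro F1 is only 3/7, because the missed rare class contributes an F1 of zero to the average.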

[97] Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation cs.CLPDF

Marii Ojastu, Hele-Andra Kuulmets, Aleksei Dorkin, Marika Borovikova, Dage Särg

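TL;DR: The paper presents a localized Estonian translation of the WinoGrande test set with a performance analysis, compares the effect of human and machine translation on LLM performance, and examines the role of prompt engineering in translation quality.

Details

Motivation: To assess how human and machine translation affect LLM performance on a multilingual commonsense reasoning task, and to explore whether prompt engineering can improve machine translation.

Result: (1) Model performance on the human-translated dataset is slightly lower than on the original English dataset; (2) performance on machine-translated data is notably worse; (3) prompt engineering offers limited improvement in translation quality or model accuracy.

Insight: Reliable evaluation of LLMs' language competency and reasoning requires language specialists to take part in dataset translation and localization.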

Abstract: In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open source models on the human translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.

Koena Ronny Mabokela, Tim Schlippe, Matthias Wölfel

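TL;DR: The study examines the zero-shot performance of large language models (LLMs) for sentiment analysis of social media posts in South African languages (English, Sepedi, and Setswana) to detect social challenges. Fusing the outputs of different LLMs brings the classification error rate below 1%.

Details

Motivation: Sentiment analysis helps reveal how multilingual communities feel about social issues, but LLMs have not yet been studied for South African languages. The study aims to fill this gap and give government departments a reliable tool for detecting social challenges.

Result: LLM performance differs considerably across models, languages, and topics. Fusing LLM outputs reduces sentiment classification errors to below 1%, providing high reliability.

Insight: Zero-shot LLMs show promise for multilingual sentiment analysis, but model fusion is needed to offset the weaknesses of individual models. The study opens a new route to cross-lingual detection of social challenges.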

Abstract: Sentiment analysis can aid in understanding people’s opinions and emotions on social issues. In multilingual communities, sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large language models (LLMs) have become available to the general public, and initial analyses have shown that they exhibit remarkable zero-shot sentiment analysis abilities in English. However, no work has investigated leveraging LLMs for sentiment analysis on social media posts in South African languages to detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, Llama 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 most emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are large differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable information about sentiment analysis to detect social challenges and draw conclusions about possible needs for actions on specific topics and within different language groups.

[99] Don’t Learn, Ground: A Case for Natural Language Inference with Visual Grounding cs.CLPDF

Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno

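TL;DR: The paper proposes a zero-shot NLI method that strengthens language understanding through the visual modality: it generates visual representations of premises and performs inference against textual hypotheses. Results show high accuracy and robustness.

Details

Motivation: Current NLI is affected by textual biases and surface heuristics; the work seeks a more robust approach to language understanding through the visual modality.

Result: The method achieves high accuracy without task-specific fine-tuning and is robust to textual biases and surface heuristics.

Insight: The visual modality can serve as an effective meaning representation, offering a new direction for language understanding.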

Abstract: We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.

[100] Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training cs.CL | cs.AIPDF

Yesheng Liu, Hao Li, Haiyu Xu, Baoqi Pei, Jiahao Wang

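TL;DR: The paper proposes ReVeL, a framework that converts multiple-choice questions (MCQA) into open-form questions (OpenQA) for more reliable evaluation and training, addressing the problem that answer options can leak exploitable signals.

Details

Motivation: Multiple-choice evaluation and training (MCQA) can leak answer signals, which makes evaluation unreliable and encourages guessing; a more open but still verifiable framework for evaluation and training is needed.

Result: (1) Models trained with ReVeL match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points; (2) ReVeL reveals up to 20 percentage points of score inflation in MCQA benchmarks; (3) it reduces evaluation cost and latency.

Insight: Open-form questions better reflect a model's real capabilities, and ReVeL offers a workable way to address the limitations of MCQA.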

Abstract: Multiple-choice question answering (MCQA) has been a popular format for the evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types and applies different rewriting and verification schemes to each. When applied to RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.

[101] SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation cs.CL | cs.AI | cs.CVPDF

Shrikant Kendre, Austin Xu, Honglu Zhou, Michael Ryoo, Shafiq Joty

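TL;DR: This paper introduces SMILE (Semantic Metric Integrating Lexical Exactness), a new QA evaluation method that combines sentence-level and keyword-level semantic understanding with simple keyword matching, addressing the shortcomings of traditional metrics.

Details

Motivation: Traditional metrics such as ROUGE, METEOR, and EM rely heavily on n-gram lexical similarity and miss deeper semantic understanding. BERTScore and MoverScore use contextual embeddings to address this, but they lack flexibility and ignore lexical similarity. LLM-based evaluators are costly and suffer from bias and inconsistency.

Result: On text, image, and video QA tasks, SMILE correlates highly with human judgments and is computationally lightweight.

Insight: SMILE bridges the gap between lexical and semantic evaluation, providing an efficient and comprehensive evaluation method.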

Abstract: Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.

[102] Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards cs.CL | cs.AI | cs.LGPDF

Zhen Wang, Zhifeng Gao, Guolin Ke

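TL;DR: The paper proposes MR-RLVR, a new self-supervised reinforcement learning method that extracts signals from intermediate reasoning through masking and reordering tasks, improving performance in settings where only outcomes are verifiable.

Details

Motivation: RLVR scales poorly in mathematical reasoning, especially theorem proving, because intermediate reasoning is crucial while final answers are difficult to verify directly and reliably. Meanwhile, conventional supervised fine-tuning tends to produce rote memorization rather than long chains of reasoning.

Result: On AIME24, AIME25, AMC23, and MATH500, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8.

Insight: Process-aware self-supervised signals can improve RLVR in outcome-only verifiable settings, indicating that making use of intermediate reasoning is key.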

Abstract: Test-time scaling has been shown to substantially improve large language models’ (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR’s scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT’s self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via “masked-then-fill” and “step reordering” to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR’s scalability and performance in only outcome-verifiable settings.
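
To make the reported numbers easier to read, the sketch below shows one common way to estimate Pass@k and how a relative gain is computed. The abstract does not say which Pass@k estimator the authors use; the unbiased estimator here is an assumption, and all sample counts and scores are invented.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct.

    Equals the probability that at least one of k samples drawn without
    replacement from the n is correct.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_gain(baseline, improved):
    """Relative (not absolute) improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Invented example: 8 samples for one problem, 2 of them correct.
    print(pass_at_k(n=8, c=2, k=1))
    print(pass_at_k(n=8, c=2, k=8))
    # Invented scores: 0.20 -> 0.22 is a 10% relative gain, 2 points absolute.
    print(relative_gain(0.20, 0.22))
```

A relative gain is measured against the baseline score, so a figure such as +9.86% Pass@1 does not mean an increase of 9.86 percentage points.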

cs.RO [Back]

[103] Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM cs.RO | cs.CL | cs.CV | cs.SD | eess.ASPDF

Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh Jha

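TL;DR: The paper proposes a long-context Q-Former integrated with a multimodal LLM for generating robot action confirmations and action plans, improving the handling of dependencies across long task sequences.

Details

Motivation: Existing methods focus mainly on clip-level processing and overlook dependencies between actions in long task sequences, so an approach that integrates long-context information is needed to improve robot confirmation generation and action planning.

Result: Experiments on the YouCook2 dataset show that the long-context Q-Former improves confirmation generation and action planning, supporting the value of long-context integration.

Insight: Integrating long-context information is essential for robot action planning, especially in multi-step tasks; text conditioning can make information transfer more efficient and accurate.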

Abstract: Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on the robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, the current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach to feed text embeddings directly into the LLM decoder to mitigate the high abstraction of the information in text by Q-former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves the confirmation and action planning by integrating VideoLLaMA3.

Shanshan Li, Da Huang, Yu He, Yanwei Fu, Yu-Gang Jiang

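TL;DR: This paper introduces TP-MDDN, a benchmark that addresses the inability of traditional demand-driven navigation to handle multiple demands and task preferences, and achieves efficient multi-demand navigation with AWMSystem and MASMap.

Details

Motivation: In the real world, navigation often involves multiple demands and task preferences, whereas traditional methods handle only one demand at a time.

Result: Experiments show that the approach outperforms state-of-the-art baselines in perception accuracy and navigation robustness.

Insight: Multi-demand navigation needs to combine task decomposition, goal selection, and status monitoring; efficient spatial memory and real-time error handling are also key.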

Abstract: In daily life, people often move through spaces to find objects that meet their needs, posing a key challenge in embodied AI. Traditional Demand-Driven Navigation (DDN) handles one need at a time but does not reflect the complexity of real-world tasks involving multiple needs and personal choices. To bridge this gap, we introduce Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN), a new benchmark for long-horizon navigation involving multiple sub-demands with explicit task preferences. To solve TP-MDDN, we propose AWMSystem, an autonomous decision-making system composed of three key modules: BreakLLM (instruction decomposition), LocateLLM (goal selection), and StatusMLLM (task monitoring). For spatial memory, we design MASMap, which combines 3D point cloud accumulation with 2D semantic mapping for accurate and efficient environmental understanding. Our Dual-Tempo action generation framework integrates zero-shot planning with policy-based fine control, and is further supported by an Adaptive Error Corrector that handles failure cases in real time. Experiments demonstrate that our approach outperforms state-of-the-art baselines in both perception accuracy and navigation robustness.

[105] METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model cs.RO | cs.CVPDF

Yankai Fu, Ning Chen, Junkai Zhao, Shaozhe Shan, Guocai Yao

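TL;DR: METIS is a vision-language-action (VLA) model trained on multi-source egocentric datasets; by integrating human and robot data with a motion-aware dynamics representation, it substantially improves the performance and generalization of dexterous manipulation.

Details

Motivation: Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open problem, especially for dexterous manipulation. Existing approaches are limited by the scarcity of large-scale action-annotated data and by the visual gap between humans and robots.

Result: METIS achieves the highest average success rate on six real-world tasks and shows superior generalization and robustness in out-of-distribution scenarios.

Insight: Combining multi-source data with motion-aware dynamics is key to better dexterous manipulation. METIS demonstrates the potential of a unified reasoning-and-acting framework for generalist robot models.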

Abstract: Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To eliminate these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon them, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate in six real-world tasks. Experimental results also highlight its superior generalization and robustness to out-of-distribution scenarios. These findings emphasize METIS as a promising step toward a generalist model for dexterous manipulation.

Yifan Li, Lichi Li, Anh Dao, Xinyu Zhou, Yicheng Qiao

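TL;DR: This paper presents IndustryNav, the first dynamic industrial navigation benchmark, which evaluates the active spatial reasoning of Visual Large Language Models (VLLMs) in dynamic environments, with a particular focus on safe behavior and distance estimation.

Details

Motivation: Existing benchmarks mostly focus on static household environments and overlook the demands of dynamic, complex industrial scenes. A new evaluation framework is needed to advance embodied intelligence.

Result: Experiments show that closed-source models hold a consistent advantage, but all agents still show notable deficiencies in path planning, collision avoidance, and active exploration.

Insight: Safe behavior and active exploration in dynamic environments are key research directions for practical embodied agents.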

Abstract: While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the “collision rate” and “warning rate” metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance, and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environments.

cs.AI [Back]

[107] Cognitive BASIC: An In-Model Interpreted Reasoning Language for LLMs cs.AI | cs.CLPDF

Oliver Kramer

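TL;DR: The paper proposes Cognitive BASIC, a reasoning language for LLMs that uses BASIC-style program steps to produce explicit, stepwise reasoning traces, improving the transparency and interpretability of language models.

Details

Motivation: LLMs perform well on reasoning tasks, but their reasoning is often a black box that lacks transparency. To improve interpretability, the author proposes a simple BASIC-style language for structuring LLM reasoning.

Result: Experiments show that all three LLMs tested can execute Cognitive BASIC programs, with strong but not uniform performance on knowledge extraction, conflict detection, and reasoning tasks.

Insight: Cognitive BASIC shows that LLMs can simulate program behavior through a structured language, offering a new route to model transparency and interpretability.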

Abstract: Cognitive BASIC is a minimal, BASIC-style prompting language and in-model interpreter that structures large language model (LLM) reasoning into explicit, stepwise execution traces. Inspired by the simplicity of retro BASIC, we repurpose numbered lines and simple commands as an interpretable cognitive control layer. Modern LLMs can reliably simulate such short programs, enabling transparent multi-step reasoning inside the model. A natural-language interpreter file specifies command semantics, memory updates, and logging behavior. Our mental-model interpreter extracts declarative and procedural knowledge, detects contradictions, and produces resolutions when necessary. A comparison across three LLMs on a benchmark of knowledge extraction, conflict detection, and reasoning tasks shows that all models can execute Cognitive BASIC programs, with overall strong but not uniform performance.

[108] Fantastic Bugs and Where to Find Them in AI Benchmarks cs.AI | cs.CL | cs.LGPDF

Sang Truong, Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang

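TL;DR: The paper proposes a framework for systematic revision of AI benchmarks. It flags potentially invalid questions through statistical analysis of response patterns, then combines expert review with an LLM-judge first pass to improve benchmark reliability efficiently.

Details

Motivation: AI benchmarks contain invalid questions that undermine their reliability, and manually identifying and correcting them among thousands of questions is infeasible, so a systematic method is needed.

Result: Across nine benchmarks, the method guides expert review to identify problematic questions with up to 84% precision, and an LLM-judge first pass further reduces human effort.

Insight: Combining statistical analysis with automated tools can improve benchmark quality and reliability while reducing manual workload.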

Abstract: Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
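
The abstract does not name the item statistics it uses, so the following is only a minimal sketch of the general idea, assuming one classic statistic that follows from a unidimensional construct: the item-rest correlation. If the mean score summarizes ability, a valid question should be answered correctly more often by models that score higher on the remaining questions, so a negative correlation marks the question for expert review. The function names, the toy response matrix, and the threshold of 0 are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def item_rest_correlations(responses: np.ndarray) -> np.ndarray:
    """Correlate each item with the total score on all *other* items.

    responses: (n_models, n_items) matrix of 0/1 correctness values.
    Returns one correlation per item (NaN where it is undefined).
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    total = responses.sum(axis=1)
    corrs = np.full(n_items, np.nan)
    for j in range(n_items):
        item = responses[:, j]
        rest = total - item  # score excluding the item itself
        if item.std() == 0 or rest.std() == 0:
            continue  # constant column: correlation is undefined
        corrs[j] = np.corrcoef(item, rest)[0, 1]
    return corrs


def flag_items(responses: np.ndarray, threshold: float = 0.0) -> list[int]:
    """Return indices of items whose item-rest correlation is below threshold."""
    corrs = item_rest_correlations(responses)
    return [j for j, r in enumerate(corrs) if not np.isnan(r) and r < threshold]


# Toy data (hypothetical): 6 models ordered weakest to strongest, 5 questions.
# The last question is answered correctly only by the weakest models.
toy = np.array([
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
])
flagged = flag_items(toy)
print(flagged)
```

On this toy matrix only the last question is flagged. In practice the flagged set would be a shortlist for human or LLM-judge review, not an automatic verdict.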

cs.LG [Back]

[109] Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach cs.LG | cs.AI | cs.CVPDF

Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao

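TL;DR: FlashCache is a frequency-domain-guided, Outlier-KV-aware framework for multimodal KV cache compression. Using low-frequency energy analysis and dynamic budget allocation, it speeds up decoding and reduces memory usage while maintaining task performance.

Details

Motivation: In multimodal large language models, the KV cache grows with visual input length and causes substantial inference overhead. Existing compression methods rely mainly on attention scores, which ignores the contribution of value vectors and is incompatible with efficient attention kernels such as FlashAttention.

Result: Across multiple MLLMs and benchmarks, FlashCache outperforms state-of-the-art multimodal KV compression methods, achieving up to 1.69 times faster decoding with 80% lower KV memory usage while maintaining task performance.

Insight: The frequency-domain distribution of KV matrices shows that energy concentrates in low frequencies. Outlier KVs are critical to performance, and dynamic budget allocation improves cache utilization.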

Abstract: Multimodal large language models suffer from substantial inference overhead since the multimodal KV Cache grows proportionally with the visual input length. Existing multimodal KV Cache compression methods mostly rely on attention scores to reduce cache size, which makes them incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV Cache compression from the perspective of the KV matrices’ distribution. First, we observe that the frequency-domain energy of multimodal KV matrices is predominantly concentrated in low frequencies and extract this principal energy via a low-pass filter. Further, we find that removing KV pairs that deviate substantially from this principal energy, which we define as Outlier KVs, leads to a pronounced performance drop. Considering that Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain-guided, Outlier-KV-aware KV Cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, a Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV Cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimodal KV compression methods, achieving up to 1.69 times faster decoding with 80% lower KV memory usage while maintaining task performance.
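
To make the low-pass-and-deviation idea concrete, here is a minimal NumPy sketch under stated assumptions. It filters along the token axis with a real FFT, treats the low-pass reconstruction as the "principal" component, and keeps the tokens whose rows deviate most from it. The filter axis, the cutoff `keep`, the L2 deviation score, and the fixed `budget` are all assumptions for illustration. The abstract does not specify these details, and this is not the FlashCache implementation.

```python
import numpy as np


def low_pass(x: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the `keep` lowest frequencies along the token axis (axis 0)."""
    spec = np.fft.rfft(x, axis=0)
    spec[keep:] = 0.0  # zero out all higher-frequency bins
    return np.fft.irfft(spec, n=x.shape[0], axis=0)


def outlier_indices(kv: np.ndarray, keep: int, budget: int) -> np.ndarray:
    """Return the `budget` token indices that deviate most from the low-pass part."""
    principal = low_pass(kv, keep)
    deviation = np.linalg.norm(kv - principal, axis=1)  # one score per token
    top = np.argsort(deviation)[-budget:]
    return np.sort(top)


# Synthetic stand-in for one head's key (or value) matrix: 64 tokens, 8 channels.
rng = np.random.default_rng(0)
tokens, dim = 64, 8
t = np.arange(tokens)[:, None]
kv = np.sin(2 * np.pi * t / tokens) * np.ones((1, dim))  # smooth low-frequency base
kv = kv + 0.01 * rng.standard_normal((tokens, dim))      # small noise
kv[[10, 40]] += 5.0                                      # two planted outlier tokens

kept = outlier_indices(kv, keep=4, budget=2)
print(kept)
```

The two planted tokens are the ones retained. Because the score depends only on the key and value matrices, a scheme of this kind needs no attention scores, which is the property the abstract ties to compatibility with kernels such as FlashAttention.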

[110] Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation cs.LG | cs.CV | cs.CYPDF

Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki

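TL;DR: This paper adapts Kaplan scaling laws to predict the GPU energy consumption of diffusion models from computational complexity (FLOPs), focusing on the energy use and environmental impact of image generation.

Details

Motivation: The widespread use of diffusion models for image generation has sharply increased computational demand, raising concerns about energy consumption and environmental impact, yet principled methods for predicting energy consumption across model configurations and hardware setups are lacking.

Result: The proposed energy scaling law achieves high predictive accuracy within individual hardware architectures (R-squared > 0.9) and generalizes well to unseen model-hardware combinations.

Insight: The study supports the compute-bound nature of diffusion inference and provides a foundation for sustainable AI deployment planning and carbon footprint estimation.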

Abstract: The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
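
As a rough illustration of what fitting an energy scaling law to FLOPs can look like, the sketch below assumes a simple power law, energy = alpha * FLOPs^beta, fitted by least squares in log-log space, and reports R-squared on the log scale. The functional form, the synthetic data, and every constant are assumptions made for this example. The abstract does not give the authors' exact parameterization or measurements.

```python
import numpy as np


def fit_power_law(flops: np.ndarray, energy: np.ndarray) -> tuple[float, float, float]:
    """Fit log(energy) = log(alpha) + beta * log(flops); return (alpha, beta, R^2)."""
    x, y = np.log(flops), np.log(energy)
    beta, log_alpha = np.polyfit(x, y, 1)  # slope first, then intercept
    pred = log_alpha + beta * x
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(np.exp(log_alpha)), float(beta), float(1.0 - ss_res / ss_tot)


# Synthetic measurements (hypothetical): 20 inference configurations.
rng = np.random.default_rng(0)
flops = np.logspace(12, 15, 20)                      # FLOPs per generated image
noise = np.exp(0.05 * rng.standard_normal(20))       # multiplicative measurement noise
energy = 2e-11 * flops ** 0.95 * noise               # energy in joules

alpha, beta, r2 = fit_power_law(flops, energy)
print(f"beta={beta:.3f}, R^2={r2:.4f}")

# Predict energy for an unseen configuration with 5e14 FLOPs.
predicted = alpha * (5e14) ** beta
```

An exponent close to 1 would be consistent with the compute-bound behavior the abstract describes, since energy would then grow roughly in proportion to FLOPs.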

physics.med-ph [Back]

[111] Exploring the added value of pretherapeutic MR descriptors in predicting breast cancer pathologic complete response to neoadjuvant chemotherapy physics.med-ph | cs.CVPDF

Caroline Malhaire, Fatine Selhane, Marie-Judith Saint-Martin, Vincent Cockenpot, Pia Akl

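TL;DR: This study examines the value of pretreatment MRI descriptors for predicting pathological complete response (pCR) of breast cancer (BC) to neoadjuvant chemotherapy (NAC). Non-spiculated margins and unifocality are independent predictors of pCR and improve model performance.

Details

Motivation: The study aims to evaluate the association between pretreatment MRI features and pCR to NAC in BC patients, in order to improve models that predict treatment response.

Result: Non-spiculated margins and unifocality were significantly associated with pCR. Adding MRI features to clinicobiological variables increased the sensitivity, specificity, and precision of random forest classifiers.

Insight: A multimodal approach integrating MRI features with clinicobiological variables may help develop more accurate predictive models and inform the choice of treatment strategy.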

Abstract: Objectives: To evaluate the association between pretreatment MRI descriptors and breast cancer (BC) pathological complete response (pCR) to neoadjuvant chemotherapy (NAC). Materials & Methods: Patients with BC treated by NAC with a breast MRI between 2016 and 2020 were included in this retrospective observational single-center study. MR studies were described using the standardized BI-RADS and breast edema score on T2-weighted MRI. Univariable and multivariable logistic regression analyses were performed to assess the association of variables with pCR according to residual cancer burden. Random forest classifiers were trained to predict pCR on a random split including 70% of the database and were validated on the remaining cases. Results: Among 129 BC cases, 59 (46%) achieved pCR after NAC: luminal (n=7/37, 19%), triple negative (TN) (n=30/55, 55%), and HER2+ (n=22/37, 59%). Clinical and biological items associated with pCR were BC subtype (p<0.001), T stage 0/I/II (p=0.008), higher Ki67 (p=0.005) and higher tumor-infiltrating lymphocytes levels (p=0.016). Univariate analysis showed that the following MRI features were significantly associated with pCR: oval or round shape (p=0.047), unifocality (p=0.026), non-spiculated margins (p=0.018), no associated non-mass enhancement (NME) (p=0.024) and a lower MRI size (p=0.031). Unifocality and non-spiculated margins remained independently associated with pCR at multivariable analysis. Adding significant MRI features to clinicobiological variables in random forest classifiers significantly increased sensitivity (0.67 versus 0.62), specificity (0.69 versus 0.67) and precision (0.71 versus 0.67) for pCR prediction. Conclusion: Non-spiculated margins and unifocality are independently associated with pCR and can increase model performance in predicting BC response to NAC.
Clinical Relevance Statement: A multimodal approach integrating pretreatment MRI features with clinicobiological predictors, including TILs, could be employed to develop machine learning models for identifying patients at risk of non-response. This may enable consideration of alternative therapeutic strategies to optimize treatment outcomes.
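
The reported subgroup rates can be checked with simple arithmetic, and the three classifier metrics have standard definitions. The sketch below recomputes the pCR percentages from the counts in the abstract and shows how sensitivity, specificity, and precision follow from a confusion matrix. The confusion-matrix counts in the final call are hypothetical, chosen only to exercise the function. The abstract does not report the validation-set counts.

```python
def rate(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * numerator / denominator)


# pCR counts per subtype as (achieved pCR, total), taken from the abstract.
subtypes = {"luminal": (7, 37), "triple negative": (30, 55), "HER2+": (22, 37)}
total_pcr = sum(n for n, _ in subtypes.values())
total_n = sum(d for _, d in subtypes.values())

for name, (n, d) in subtypes.items():
    print(f"{name}: {n}/{d} = {rate(n, d)}%")
print(f"overall: {total_pcr}/{total_n} = {rate(total_pcr, total_n)}%")


def classifier_metrics(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float, float]:
    """Return (sensitivity, specificity, precision) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true pCR cases that were predicted
    specificity = tn / (tn + fp)  # share of non-pCR cases correctly rejected
    precision = tp / (tp + fp)    # share of predicted pCR cases that were correct
    return sensitivity, specificity, precision


# Hypothetical counts for illustration only (not from the study).
sens, spec, prec = classifier_metrics(tp=12, fp=6, tn=14, fn=6)
```

The subtype counts sum to the 59 responders and 129 cases given in the abstract, and the rounded rates match the reported 19%, 55%, 59%, and 46%.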

cs.CY [Back]

[112] OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists cs.CY | cs.CE | cs.CLPDF

Chenyang Shao, Dehao Huang, Yu Li, Keyu Zhao, Weiquan Lin

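TL;DR: OmniScientist is a framework that encodes the mechanisms of human research into the AI scientific workflow. It supports end-to-end automation of the research process and simulates the infrastructure of the human scientific system, fostering co-evolution of AI and human scientists.

Details

Motivation: Existing AI Scientist systems model scientific discovery as a standalone search or optimization problem and overlook the social, collaborative nature of research. OmniScientist aims to fill this gap by simulating human research infrastructure to enable deep collaboration between AI and humans.

Result: OmniScientist automates the full research workflow and provides infrastructure that supports deep interaction and collaborative innovation between AI agents and human scientists.

Insight: The future of scientific research depends on a collaborative ecosystem of AI and humans, for which OmniScientist offers a technical framework and infrastructure support.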

Abstract: With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as “AI Scientists.” However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and collaborative endeavor. Real-world science relies on a complex scientific infrastructure composed of collaborative mechanisms, contribution attribution, peer review, and structured scientific knowledge networks. Due to the lack of modeling for these critical dimensions, current systems struggle to establish a genuine research ecosystem or interact deeply with the human scientific community. To bridge this gap, we introduce OmniScientist, a framework that explicitly encodes the underlying mechanisms of human research into the AI scientific workflow. OmniScientist not only achieves end-to-end automation across data foundation, literature review, research ideation, experiment automation, scientific writing, and peer review, but also provides comprehensive infrastructural support by simulating the human scientific system, comprising: (1) a structured knowledge system built upon citation networks and conceptual correlations; (2) a collaborative research protocol (OSP), which enables seamless multi-agent collaboration and human researcher participation; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. This infrastructure empowers agents to not only comprehend and leverage human knowledge systems but also to collaborate and co-evolve, fostering a sustainable and scalable innovation ecosystem.
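
The abstract says ScienceArena ranks entries with Elo ratings derived from blind pairwise votes but gives no parameters. As a minimal sketch, the code below applies the standard Elo update to a list of (winner, loser) votes. The starting rating of 1000, the K-factor of 32, and the agent names are assumptions for illustration and are not taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise vote in place; the winner gains what the loser loses."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical entries and votes, each vote recorded as (winner, loser).
ratings = {"agent_a": 1000.0, "agent_b": 1000.0}
votes = [("agent_a", "agent_b"), ("agent_a", "agent_b"), ("agent_b", "agent_a")]
for winner, loser in votes:
    update(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking, ratings)
```

Because each update is zero-sum, the total rating mass stays constant, and an upset (a lower-rated entry winning) moves ratings more than an expected result does.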

[113] Cross-cultural value alignment frameworks for responsible AI governance: Evidence from China-West comparative analysis cs.CY | cs.CLPDF

Haijiang Liu, Jinguang Gu, Xun Wu, Daniel Hershcovich, Qiaoling Xiao

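TL;DR: This paper presents a multi-layered auditing platform that evaluates cross-cultural value alignment in China-origin and Western-origin large language models (LLMs), revealing universal challenges and divergent regional development trajectories.

Details

Motivation: As LLMs increasingly influence high-stakes decision-making worldwide, ensuring their alignment with diverse cultural values has become a critical governance challenge.

Result: The models show unstable value systems, under-representation of younger demographics, and a non-linear relationship between model scale and alignment quality. China-origin models emphasize multilingual data integration, while Western models show more architectural experimentation but persistent U.S.-centric biases.

Insight: Mistral-series architectures outperform LLaMA3-series in cross-cultural alignment, and full-parameter fine-tuning on diverse datasets preserves cultural variation better than RLHF.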

Abstract: As Large Language Models (LLMs) increasingly influence high-stakes decision-making across global contexts, ensuring their alignment with diverse cultural values has become a critical governance challenge. This study presents a Multi-Layered Auditing Platform for Responsible AI that systematically evaluates cross-cultural value alignment in China-origin and Western-origin LLMs through four integrated methodologies: Ethical Dilemma Corpus for assessing temporal stability, Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. Our comparative analysis of 20+ leading models, such as Qwen, GPT-4o, Claude, LLaMA, and DeepSeek, reveals universal challenges (fundamental instability in value systems, systematic under-representation of younger demographics, and non-linear relationships between model scale and alignment quality) alongside divergent regional development trajectories. While China-origin models increasingly emphasize multilingual data integration for context-specific optimization, Western models demonstrate greater architectural experimentation but persistent U.S.-centric biases. Neither paradigm achieves robust cross-cultural generalization. We establish that Mistral-series architectures significantly outperform LLaMA3-series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets surpasses Reinforcement Learning from Human Feedback in preserving cultural variation…

eess.IV [Back]

[114] MRI Super-Resolution with Deep Learning: A Comprehensive Survey eess.IV | cs.AI | cs.CV | eess.SPPDF

Mohammad Khateri, Serge Vasylechko, Morteza Ghahremani, Liam Timms, Deniz Kocanaogullari

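TL;DR: This paper is a comprehensive survey of deep learning for MRI super-resolution (SR), covering theory, methods, datasets, and evaluation metrics, and proposing a systematic taxonomy.

Details

Motivation: High-resolution MRI is crucial in clinical and research settings but is costly and technically constrained. SR offers a computational route around these limits by generating high-resolution images from low-resolution scans.

Result: The survey summarizes progress, open challenges, and future directions in MRI SR and provides a collection of open-access resources.

Insight: Deep learning has broad potential in MRI SR, but challenges specific to clinical and research contexts remain to be addressed.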

Abstract: High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.

Qi Jiang, Xiaolong Qian, Yao Gao, Lei Sun, Kailun Yang

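TL;DR: OmniLens++ addresses data scaling and prior guidance in blind lens aberration correction through large-scale LensLib pre-training and a latent PSF representation, improving generalization.

Details

Motivation: Existing blind lens aberration correction methods are limited in generalization by the difficulty of scaling data and the absence of priors characterizing optical degradation.

Result: Experiments on real-world lenses and synthetic LensLib show that OmniLens++ achieves state-of-the-art generalization in blind aberration correction.

Insight: Large-scale LensLib pre-training gives stronger generalization for optical degradation correction, and LPR further taps the potential of large-scale data.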

Abstract: Emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib’s PSFs, which is assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.

[116] Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal eess.IV | cs.CV | physics.opticsPDF

Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang

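TL;DR: This paper proposes VeilGen, a generative model, and DeVeiler, a restoration network, to simulate and remove lens veiling glare, improving the imaging quality of compact optical systems.

Details

Motivation: The imaging performance of compact optical systems is often degraded by veiling glare caused by non-ideal optical surfaces and coatings. Conventional scattering models fail to fit this spatially varying, depth-independent phenomenon, so data-driven glare removal models lack high-quality paired data.

Result: Experiments on compact optical systems show that the method outperforms existing approaches in restoration quality and physical fidelity.

Insight: Learning latent transmission and glare maps models veiling glare more accurately and guides effective restoration.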

Abstract: Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems, including single-lens and metalens designs, is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatially varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These results suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler.
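
For context on the "conventional scattering models (e.g., for dehazing)" mentioned above, the standard atmospheric scattering model used in the dehazing literature is usually written as follows. This is the textbook form, given here only as background. The abstract does not state the learned scattering model that VeilGen and DeVeiler use, and the symbols below are the conventional ones rather than the authors' notation.

```latex
% I: observed image, J: clean scene radiance, A: global atmospheric light,
% t: transmission, \beta: scattering coefficient, d: scene depth at pixel x
I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta\, d(x)}
```

In this form the transmission is tied to scene depth and the added light term is a single global constant, which is consistent with the abstract's remark that such models do not fit a degradation that is spatially varying and depth-independent.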

cs.SD [Back]

[117] MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core cs.SD | cs.AI | cs.CL | cs.MMPDF

Callie C. Liao, Duoduo Liao, Ellie L. Zhang

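TL;DR: MusicAIR is a multimodal AI music generation framework built on an algorithm-driven core, which reduces reliance on large datasets and mitigates copyright risk. It generates music from lyrics, text, and images. Its output follows music theory and rhythmic conventions and scores higher than human composers on key confidence.

Details

Motivation: Neural music generation models rely on large datasets, which raises copyright concerns and incurs high computational cost. MusicAIR aims to address these issues with an algorithm-driven approach.

Result: The system achieves an average key confidence of 85%, compared with 79% for human composers, and aligns closely with established music theory standards.

Insight: MusicAIR's novelty lies in its algorithm-driven core and multimodal generation, which offer a more efficient, lower-cost approach to music generation and broaden the use of AI in music composition and education.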

Abstract: Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to automatically derive musical features, creating a complete, coherent melodic score solely from the lyrics. The MusicAIR framework facilitates music generation from lyrics, text, and images. The generated score adheres to established principles of music theory, lyrical structure, and rhythmic conventions. We developed Generate AI Music (GenAIM), a web tool using MusicAIR for lyric-to-song, text-to-music, and image-to-music generation. In our experiments, we evaluated AI-generated music scores produced by the system using both standard music metrics and innovative analysis that compares these compositions with original works. The system achieves an average key confidence of 85%, outperforming human composers at 79%, and aligns closely with established music theory standards, demonstrating its ability to generate diverse, human-like compositions. As a co-pilot tool, GenAIM can serve as a reliable music composition assistant and a possible educational composition tutor while simultaneously lowering the entry barrier for all aspiring musicians, which is innovative and significantly contributes to AI for music generation.

cs.HC [Back]

[118] Generative Augmented Reality: Paradigms, Technologies, and Future Applications cs.HC | cs.AI | cs.CVPDF

Chen Liang, Jiawen Zheng, Yufeng Zeng, Yi Tan, Hengye Lyu

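TL;DR: This paper proposes Generative Augmented Reality (GAR), a new AR paradigm that replaces the conventional multi-stage AR engine with a unified generative backbone, which jointly encodes environmental sensing, virtual content, and interaction signals to drive continuous video generation.

Details

Motivation: Conventional AR relies on multi-stage modular processing, which limits further gains in realism, interactivity, and immersion. GAR aims to redefine the core AR paradigm with generative methods to deliver higher-fidelity experiences.

Result: GAR is expected to deliver AR experiences with greater realism, interactivity, and immersion, and lays a foundation for future AR research.

Insight: GAR represents a paradigm shift in AR that places generative AI at its core, while raising new challenges in computational efficiency, content ecosystems, and ethics.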

Abstract: This paper introduces Generative Augmented Reality (GAR) as a next-generation paradigm that reframes augmentation as a process of world re-synthesis rather than world composition by a conventional AR engine. GAR replaces the conventional AR engine’s multi-stage modules with a unified generative backbone, where environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for continuous video generation. We formalize the computational correspondence between AR and GAR, survey the technical foundations that make real-time generative augmentation feasible, and outline prospective applications that leverage its unified inference model. We envision GAR as a future AR paradigm that delivers high-fidelity experiences in terms of realism, interactivity, and immersion, while eliciting new research challenges on technologies, content ecosystems, and the ethical and societal implications.

Table of Contents

cs.CV [Back]

[1] The persistence of painting styles cs.CVPDF

[2] Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions cs.CV | eess.IVPDF

[3] PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation cs.CV | cs.AIPDF

[4] A Machine Learning-Driven Solution for Denoising Inertial Confinement Fusion Images cs.CV | cs.AIPDF

[5] SAM 3: Segment Anything with Concepts cs.CV | cs.AIPDF

[6] SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge cs.CV | cs.AI | cs.LGPDF

[7] SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG cs.CVPDF

[8] Mesh RAG: Retrieval Augmentation for Autoregressive Mesh Generation cs.CV | cs.AIPDF

[9] WorldGen: From Text to Traversable and Interactive 3D Worlds cs.CV | cs.AIPDF

[10] Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation cs.CVPDF

[11] BOP-ASK: Object-Interaction Reasoning for Vision-Language Models cs.CV | cs.ROPDF

[12] Parts-Mamba: Augmenting Joint Context with Part-Level Scanning for Occluded Human Skeleton cs.CVPDF

[13] The Joint Gromov Wasserstein Objective for Multiple Object Matching cs.CV | q-bio.BMPDF

[14] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment cs.CV | cs.LGPDF

[15] Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery cs.CVPDF

[16] R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios cs.CVPDF

[17] Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models cs.CVPDF

[18] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content cs.CVPDF

[19] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation cs.CVPDF

[20] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features cs.CVPDF

[21] Shape-preserving Tooth Segmentation from CBCT Images Using Deep Learning with Semantic and Shape Awareness cs.CVPDF

[22] OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios cs.CV | cs.AIPDF

[23] MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models cs.CV | cs.CRPDF

[24] Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction cs.CVPDF

[25] FingerCap: Fine-grained Finger-level Hand Motion Captioning cs.CVPDF

[26] Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling cs.CVPDF

[27] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models cs.CV | cs.LG | eess.IVPDF

[28] MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis cs.CVPDF

[29] Two Heads Better than One: Dual Degradation Representation for Blind Super-Resolution cs.CVPDF

[30] Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices cs.CV | cs.LGPDF

[31] The Finer the Better: Towards Granular-aware Open-set Domain Generalization cs.CV | cs.AIPDF

[32] DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction cs.CVPDF

[33] DepthFocus: Controllable Depth Estimation for See-Through Scenes cs.CVPDF

[34] VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions cs.CVPDF

[35] Vision Language Models are Confused Tourists cs.CV | cs.CLPDF

[36] FLUID: Training-Free Face De-identification via Latent Identity Substitution cs.CV | cs.AIPDF

[37] Parameter-Free Neural Lens Blur Rendering for High-Fidelity Composites cs.CV | cs.AI | cs.GR | eess.IVPDF

[38] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis cs.CV | cs.AI | cs.MMPDF

[39] PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning cs.CV

[40] OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding cs.CV | cs.AI

[41] RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion cs.CV

[42] Spanning Tree Autoregressive Visual Generation cs.CV | cs.AI

[43] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting cs.CV

[44] Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models cs.CV

[45] ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better cs.CV

[46] PEGS: Physics-Event Enhanced Large Spatiotemporal Motion Reconstruction via 3D Gaussian Splatting cs.CV

[47] Planning with Sketch-Guided Verification for Physics-Aware Video Generation cs.CV | cs.AI | cs.CL

[48] A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs cs.CV

[49] Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation cs.CV

[50] A lightweight detector for real-time detection of remote sensing images cs.CV | cs.AI

[51] UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network cs.CV

[52] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle cs.CV | cs.LG

[53] Investigating self-supervised representations for audio-visual deepfake detection cs.CV | cs.LG | cs.SD

[54] Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition cs.CV | cs.CY

[55] PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention cs.CV

[56] VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation cs.CV

[57] SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors cs.CV | cs.RO

[58] Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers cs.CV

[59] Dual-domain Adaptation Networks for Realistic Image Super-resolution cs.CV

[60] QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy cs.CV | cs.RO

[61] Equivariant-Aware Structured Pruning for Efficient Edge Deployment: A Comprehensive Framework with Adaptive Fine-Tuning cs.CV | cs.LG

[62] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats cs.CV | cs.AI

[63] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback cs.CV | cs.IR

[64] Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation cs.CV | cs.AI | cs.CY

[65] MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning cs.CV

[66] SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion cs.CV

[67] MuM: Multi-View Masked Image Modeling for 3D Vision cs.CV | cs.AI | cs.LG

[68] Refracting Reality: Generating Images with Realistic Transparent Objects cs.CV

[69] Loomis Painter: Reconstructing the Painting Process cs.CV

[70] DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture cs.CV

[71] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification cs.CV

[72] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation cs.CV

[73] SVRecon: Sparse Voxel Rasterization for Surface Reconstruction cs.CV

[74] MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration cs.CV

[75] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment cs.CV

[76] Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required? cs.CV | cs.AI

[77] REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing cs.CV | cs.AI

[78] MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models cs.CV