cs.CV [Total: 97]
cs.CL [Total: 28]
cs.AI [Total: 5]
cs.RO [Total: 2]
cs.PL [Total: 1]
cs.CR [Total: 3]
cs.LG [Total: 3]
eess.IV [Total: 1]
cs.HC [Total: 1]
cs.GR [Total: 1]
cs.IR [Total: 1]

cs.CV

[1] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale cs.CV | cs.AI

Mirela G. Tulbure, Julio Caineta, Mark Broich, Mollie D. Gaines, Philippe Rufin

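TL;DR: The paper fine-tunes TerraMind, a multimodal geospatial foundation model (GFM), on combined Sentinel-1 and Sentinel-2 data to improve the accuracy of near-real-time flood mapping at global scale.

Motivation: Floods are among the most destructive weather hazards worldwide, and existing flood-mapping methods depend on labeled data and on how well models generalize. GFMs offer better generalization through large-scale self-supervised pretraining, but their performance across diverse global flood events is still unclear.

Result: (1) The base-unfrozen model gives the best balance between computational cost and performance; (2) the large-unfrozen model has the highest recall; (3) models trained on FloodsNet beat the Sen1Floods11-trained model on recall; (4) U-Net achieves higher recall than every GFM configuration, with slightly lower accuracy and precision.

Insight: The study shows the potential of multimodal data and GFM fine-tuning for flood mapping while also exposing the accuracy limitations of GFMs, offering a useful reference for future climate adaptation and disaster recovery work.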

Abstract: Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM’s TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision. Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience.
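
The comparison above rests on accuracy, precision, and recall computed per pixel. As a minimal sketch (not the authors' evaluation code; the function names and the toy masks are illustrative assumptions), the following shows how two flood masks can reach the same accuracy while trading precision against recall, which is the pattern reported for U-Net versus the GFM configurations:

```python
def confusion_counts(pred, truth):
    """Return (tp, fp, fn, tn) for two equal-length sequences of 0/1 pixel labels."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1          # flooded pixel correctly flagged
        elif p and not t:
            fp += 1          # dry pixel flagged as flooded
        elif t:
            fn += 1          # flooded pixel missed
        else:
            tn += 1          # dry pixel correctly left unflagged
    return tp, fp, fn, tn


def flood_metrics(pred, truth):
    """Compute accuracy, precision, and recall; empty denominators yield 0.0."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical flattened masks: 1 = water, 0 = no water.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 0, 0, 0, 0, 0, 0, 0]   # misses water, never over-predicts
eager = [1, 1, 1, 1, 1, 0, 0, 0]      # finds all water, over-predicts

print(flood_metrics(cautious, truth))
print(flood_metrics(eager, truth))
```

Both toy predictions score 0.75 accuracy, but the cautious mask has precision 1.0 with recall 1/3, while the eager mask has recall 1.0 with precision 0.6. For rapid disaster response, a higher-recall model may be preferable even at some cost in precision.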

[2] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework cs.CV

Haojin Deng, Yimin Yang

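TL;DR: The paper proposes a context-enriched contrastive loss that combines two convergence targets to make contrastive learning more effective and to address information distortion, outperforming existing methods on several large-scale benchmark datasets.

Motivation: Contrastive learning performs well on large benchmarks, but conventional contrastive losses can cause information distortion, in particular by under-learning positive pairs that come from the same source image. A new loss is needed to strengthen learning of the inherent connections between samples.

Result: Validated on 8 benchmark datasets, the method outperforms 16 existing contrastive learning methods, with a 22.9% improvement over the original contrastive loss on BiasedMNIST.

Insight: Combining label sensitivity with the inherent connections between same-source samples improves both the efficiency and the fairness of contrastive learning, especially on tasks with systematic bias.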

Abstract: Contrastive learning has gained popularity and pushes state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples. We evaluate the proposed approach on image classification tasks, which are among the most widely accepted 8 recognition large-scale benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA datasets. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed. Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions in the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.
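
The abstract describes a loss with two targets: one that groups samples by label and one that pulls together augmentations of the same source image. The exact formulation is not given here, so the following is only a sketch under stated assumptions: it reuses a standard InfoNCE-style term twice, once with labels and once with source-image ids defining the positives, and sums them with a hypothetical `weight`. All names are illustrative, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(embeddings, groups, temperature=0.5):
    """InfoNCE-style loss where samples sharing a group id are positives."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and groups[j] == groups[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Denominator: similarity to every other sample in the batch.
        denom = sum(
            math.exp(cosine(embeddings[i], embeddings[k]) / temperature)
            for k in range(n) if k != i
        )
        anchor_loss = 0.0
        for j in positives:
            num = math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            anchor_loss += -math.log(num / denom)
        total += anchor_loss / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def two_target_loss(embeddings, labels, source_ids, weight=1.0, temperature=0.5):
    """Assumed combination: label-level term plus weighted source-image term."""
    label_term = info_nce(embeddings, labels, temperature)
    source_term = info_nce(embeddings, source_ids, temperature)
    return label_term + weight * source_term


# Toy batch: two augmented views each of two source images.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = two_target_loss(emb, labels=[0, 0, 1, 1], source_ids=[0, 0, 1, 1])
mismatched = two_target_loss(emb, labels=[0, 1, 0, 1], source_ids=[0, 1, 0, 1])
print(aligned < mismatched)
```

In this toy batch the loss is lower when the grouping matches the geometry of the embeddings, which is the behaviour both terms are meant to encourage.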

[3] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges cs.CV

Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda

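TL;DR: The paper presents FineGRAIN, a structured method for jointly evaluating text-to-image (T2I) models and vision language models (VLMs) by testing whether VLMs can identify 27 specific failure modes in T2I-generated images. It also contributes a dataset of images generated by 5 T2I models with corresponding VLM annotations.

Motivation: T2I models often fail to capture specific attributes in user prompts (such as object count or color), and existing VLM benchmarks have not kept pace with complex scenes. A new method is needed to evaluate these failure modes systematically.

Result: The analysis shows that T2I models make systematic errors in attribute fidelity and object representation, and that existing metrics cannot capture these subtle errors.

Insight: Current metrics are insufficient for fully assessing the reliability of generative models; targeted benchmarks are needed to improve their performance and interpretability.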

Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

[4] Mapping of Lesion Images to Somatic Mutations cs.CV | q-bio.QM

Rahul Mehta

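TL;DR: The paper proposes LLOST, a deep latent variable model that uses dual variational autoencoders coupled through a shared latent space to map medical lesion images to somatic mutation profiles, providing cross-modal support for cancer diagnosis.

Motivation: Early diagnosis and intervention are critical for cancer treatment. Medical images and genetic information play different roles in diagnosis, but no model maps directly between them. The paper aims to fill this gap with a deep learning model that predicts somatic mutations from lesion images.

Result: The model performs well at predicting the counts of specific mutations and whether mutations occur, and it reveals shared patterns between imaging and mutations that reflect cancer type.

Insight: A shared latent space can capture associations across modalities, suggesting a new approach to cancer diagnosis. Future work could extend the model to other genetic domains.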

Abstract: Medical imaging is a critical initial tool used by clinicians to determine a patient’s cancer diagnosis, allowing for faster intervention and more reliable patient prognosis. At subsequent stages of patient diagnosis, genetic information is extracted to help select specific patient treatment options. As the efficacy of cancer treatment often relies on early diagnosis and treatment, we build a deep latent variable model to determine patients’ somatic mutation profiles based on their corresponding medical images. We first introduce a point cloud representation of lesion images to allow for invariance to the imaging modality. We then propose LLOST, a model with dual variational autoencoders coupled together by a separate shared latent space that unifies features from the lesion point clouds and counts of distinct somatic mutations. Our model therefore consists of three latent spaces, each of which is learned with a conditional normalizing flow prior to account for the diverse distributions of each domain. We conduct qualitative and quantitative experiments on de-identified medical images from The Cancer Imaging Archive and the corresponding somatic mutations from the Pan Cancer dataset of The Cancer Genomic Archive. We show the model’s predictive performance on the counts of specific mutations as well as its ability to accurately predict the occurrence of mutations. In particular, the model reveals shared patterns between the imaging and somatic mutation domains that reflect cancer type. We conclude with a remark on how to improve the model and possible future avenues of research to include other genetic domains.

[5] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting cs.CV | cs.GR | cs.LG

Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker

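TL;DR: SplatSuRe is a selective super-resolution method that addresses multi-view inconsistency in 3D Gaussian Splatting (3DGS). It applies super-resolved content only in regions that lack high-frequency information, improving the sharpness and consistency of renders.

Motivation: 3DGS excels at high-quality novel view synthesis, but independently super-resolving each low-resolution training view introduces multi-view inconsistencies and blurry renders. Existing methods apply super-resolution uniformly to all images and ignore the complementary high-frequency information across views.

Result: On Tanks & Temples, Deep Blending, and Mip-NeRF 360, SplatSuRe beats the baselines in both fidelity and perceptual quality, with the largest gains in foreground detail.

Insight: Close-up low-resolution views may contain high-frequency information for regions that also appear in more distant views. Applying super-resolution selectively exploits this complementarity across views and avoids inconsistent content.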

Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.

[6] RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation cs.CV

Mansoor Ali, Maksim Richards, Gilberto Ochoa-Ruiz, Sharib Ali

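TL;DR: RobustSurg tackles domain generalisation and out-of-distribution data in surgical scene segmentation. It exploits style and content information through instance normalisation and feature covariance mapping, and adds a restitution module to the ResNet backbone to retain important features, yielding large gains on unseen centres and datasets.

Motivation: Surgical scene segmentation works well on single-centre, single-modality data but generalises poorly to unseen distributions (other centres) and modality changes. Existing methods mostly target natural scenes and cannot be applied directly to surgical scenes, which offer limited visual cues and more complex scenarios.

Result: On the unseen-centre HeiCholSeg dataset, RobustSurg improves on the DeepLabv3+ baseline by nearly 23% and on SOTA methods by 10-32% in mean IoU; on EndoUDA it improves on the baseline by nearly 22% and on a recent SOTA method by nearly 11%.

Insight: Domain generalisation for surgical scenes can be achieved by separating style from content while retaining key features. Instance normalisation and covariance mapping are effective ways to improve robustness, and new datasets are essential for progress in this area.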

Abstract: While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre and single-imaging modality data, these methods usually do not generalise to unseen distribution (i.e., from other centres) and unseen modalities. Current literature for tackling generalisation on out-of-distribution data and domain gaps due to modality changes has been widely researched but mostly for natural scene data. However, these methods cannot be directly applied to the surgical scenes due to limited visual cues and often extremely diverse scenarios compared to the natural scene data. Inspired by these works in natural scenes to push generalisability on OOD data, we hypothesise that exploiting the style and content information in the surgical scenes could minimise the appearances, making it less variable to sudden changes such as blood or imaging artefacts. This can be achieved by performing instance normalisation and feature covariance mapping techniques for robust and generalisable feature representations. Further, to eliminate the risk of removing salient feature representation associated with the objects of interest, we introduce a restitution module within the feature learning ResNet backbone that can enable the retention of useful task-relevant features. To tackle the lack of multiclass and multicentre data for surgical scene segmentation, we also provide a newly curated dataset that can be vital for addressing generalisability in this domain. Our proposed RobustSurg obtained nearly 23% improvement on the baseline DeepLabv3+ and from 10-32% improvement on the SOTA in terms of mean IoU score on an unseen centre HeiCholSeg dataset when trained on CholecSeg8K. Similarly, RobustSurg also obtained nearly 22% improvement over the baseline and nearly 11% improvement on a recent SOTA method for the target set of the EndoUDA polyp dataset.
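
Instance normalisation, one of the techniques named above, removes per-image, per-channel mean and variance, which are often treated as "style". The following is a minimal sketch of that operation on a single flattened channel, not the RobustSurg implementation; the restitution module and covariance mapping are not shown, and the function name and `eps` value are illustrative assumptions.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalise one feature-map channel (flattened to a list) to zero mean, unit variance."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    # eps guards against division by zero on constant channels.
    return [(x - mean) / math.sqrt(var + eps) for x in channel]


# Two hypothetical channels with the same content but different "style":
# the second is a brightness/contrast-shifted copy of the first.
content = [1.0, 2.0, 3.0, 4.0]
restyled = [10.0 * x + 5.0 for x in content]

a = instance_norm(content)
b = instance_norm(restyled)
print(max(abs(x - y) for x, y in zip(a, b)))  # close to zero
```

After normalisation the two channels are nearly identical, which illustrates why this step can reduce sensitivity to appearance changes. It can also discard task-relevant information, which is the risk the restitution module is described as addressing.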

[7] Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation cs.CV | cs.AI

Miguel L. Martins, Miguel T. Coimbra, Francesco Renna

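TL;DR: The paper proposes a recalibration method for neural networks based on multifractal analysis, aimed at medical image segmentation. It introduces monofractal and multifractal recalibration priors that improve on conventional channel attention.

Motivation: Existing end-to-end multifractal methods rely on heavy pooling or feature-space decimation, which limits tasks such as semantic segmentation, so a more effective multifractal approach is needed.

Result: On three public medical imaging datasets (ISIC18, Kvasir-SEG, and BUSI), the method outperforms baselines that use other channel-attention mechanisms based on higher-order statistics.

Insight: Because of the skip connections in U-Net, attention-layer responses do not become more specialized with encoder depth, and their effectiveness may relate to global statistics of instance variability.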

Abstract: Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel-attention functions in convolutional networks. Using a U-Net-based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel-attention mechanisms that also use higher-order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U-Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.

[8] Towards Unified Video Quality Assessment cs.CV

Chen Feng, Tianhao Peng, Fan Zhang, David Bull

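TL;DR: The paper proposes Unified-VQA, a framework that recasts generic video quality assessment (VQA) as a Diagnostic Mixture-of-Experts (MoE) problem, addressing the limitations of existing VQA models and providing unified, interpretable quality assessment.

Motivation: Existing VQA models typically predict only a single quality score, lack diagnostic and interpretable output, and are mostly format-specific metrics. Unified-VQA aims to address these issues and provide generic video quality assessment.

Result: Without retraining, Unified-VQA outperforms over 18 benchmark methods across 17 databases on both generic VQA and artifact detection tasks.

Insight: MoE and multi-task learning can make video quality assessment both unified and diagnostic, offering an actionable solution for practical use.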

Abstract: Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly "generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple "perceptual experts" dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrated a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector, which is optimized using a weakly-supervised learning strategy, leveraging the known properties of the large-scale training database generated for this work. With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.

[9] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models cs.CV | cs.AI | cs.LG

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee

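TL;DR: The paper introduces AV-SpeakerBench, a benchmark focused on speaker-centric audiovisual reasoning. Its 3,212 multiple-choice questions evaluate multimodal large language models on speaker identification, speech content, and temporal alignment. Gemini models perform best, and open-source models still lag behind.

Motivation: Existing video benchmarks under-evaluate the fine-grained human-speech reasoning of multimodal large language models (MLLMs); their tasks are often visually solvable or assess speech only coarsely.

Result: Gemini 2.5 Pro performs best. The open model Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, mainly because of weaker audiovisual fusion.

Insight: Audiovisual fusion, not visual perception, is the key bottleneck for MLLMs.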

Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

[10] Exploring the Potentials of Spiking Neural Networks for Image Deraining cs.CV

Shuang Chen, Tomas Krajnik, Farshad Arvin, Amir Atapour-Abarghouei

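TL;DR: The paper explores the potential of Spiking Neural Networks (SNNs) for image deraining, proposing a new Visual LIF (VLIF) neuron and modules built from it that significantly improve performance and reduce energy consumption.

Motivation: The work explores biologically inspired, low-energy SNNs for low-level vision tasks such as image deraining, addressing the limited spatial contextual understanding and frequency-domain saturation of conventional spiking neurons.

Result: Across five benchmark datasets, the method significantly outperforms existing SNN-based deraining methods while using only 13% of their energy.

Insight: SNNs can be both effective and energy-efficient on low-level vision tasks, which points to a new direction for applying them to similar problems.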

Abstract: Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an example, this study addresses the representation of the inherent high-pass characteristics of spiking neurons, specifically in image deraining and innovatively proposes the Visual LIF (VLIF) neuron, overcoming the obstacle of lacking spatial contextual understanding present in traditional spiking neurons. To tackle the limitation of frequency-domain saturation inherent in conventional spiking neurons, we leverage the proposed VLIF to introduce the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Extensive experiments across five benchmark deraining datasets demonstrate that our approach significantly outperforms state-of-the-art SNN-based deraining methods, achieving this superior performance with only 13% of their energy consumption. These findings establish a solid foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks.

[11] Spatiotemporal Pyramid Flow Matching for Climate Emulation cs.CV | cs.AI | cs.LG | eess.IV | stat.ML

Jeremy Andrew Irvin, Jiaqi Han, Zikui Wang, Abdulaziz Alharbi, Yufei Zhao

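TL;DR: The paper proposes Spatiotemporal Pyramid Flows (SPF), a new generative flow matching approach for efficient, parallel climate emulation at multiple timescales, and uses the ClimateSuite dataset to scale and evaluate it.

Motivation: Autoregressive generative models that operate at weather scale are computationally slow for climate emulation and have not shown stable rollouts, especially under nonstationary forcings. SPF aims to provide efficient, stable climate emulation.

Result: On ClimateBench, SPF outperforms baseline models at yearly and monthly timescales, with fast sampling and good generalization.

Insight: A hierarchical spatiotemporal design and conditioning on physical forcings are key to making climate emulation more efficient and accurate.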

Abstract: Generative models have the potential to transform the way we emulate Earth’s changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code are publicly available at https://github.com/stanfordmlgroup/spf.

[12] Progressive Image Restoration via Text-Conditioned Video Generation cs.CV | cs.AI

Peng Kang, Xijun Wang, Yu Yuan

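TL;DR: The paper uses a text-conditioned video generation model (CogVideo) for progressive image restoration, fine-tuning it to generate restoration trajectories rather than natural video motion.

Motivation: Existing text-to-video models have strong temporal generation capabilities, but their potential for image restoration is underexplored.

Result: Experiments show strong results on perceptual metrics such as PSNR, SSIM, and LPIPS, and the model generalizes to real-world data (the ReLoBlur dataset).

Insight: (1) Temporal generation ability can be turned into progressive gains in restoration quality; (2) scene-specific prompts generated by a multimodal LLM are more effective than a uniform prompt; (3) the zero-shot results indicate strong generalization.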

Abstract: Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.
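
The claim that generated sequences "improve perceptual metrics ... across frames" can be checked frame by frame. As a minimal sketch (not the authors' evaluation code; the helper names and the toy pixel values are illustrative assumptions), PSNR against the clean target is 10·log10(MAX² / MSE) and should rise along a restoration trajectory:

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length flattened images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(restored, clean, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(restored, clean)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Hypothetical three-pixel "frames" moving from degraded towards the clean target.
clean = [100.0, 150.0, 200.0]
trajectory = [
    [60.0, 110.0, 160.0],   # heavily degraded
    [90.0, 140.0, 190.0],   # partially restored
    [99.0, 149.0, 199.0],   # nearly clean
]

scores = [psnr(frame, clean) for frame in trajectory]
print([round(s, 2) for s in scores])
print(all(later > earlier for earlier, later in zip(scores, scores[1:])))
```

PSNR measures pixel-level fidelity only; SSIM and LPIPS, also cited in the abstract, require structural or learned-feature comparisons and are not reproduced here.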

[13] Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision cs.CV | cs.AI

Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim

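TL;DR: The paper finds that pre-trained video diffusion models can distinguish the motion of visually similar objects without supervision, because the denoising process separates motion from appearance information. The resulting self-supervised tracker performs strongly on tracking similar-looking objects.

Motivation: Existing self-supervised trackers struggle when visual cues are ambiguous, which limits their scalability and generalization when labeled data is unavailable. The authors find that video diffusion models already learn motion representations suitable for tracking during pretraining, with no task-specific training.

Result: On tracking visually similar objects, the method improves on existing self-supervised methods by up to 6 points.

Insight: The denoising process of video diffusion models naturally separates motion from appearance, which offers a new route for self-supervised tracking.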

Abstract: Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.

[14] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction cs.CV

Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang

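TL;DR: The paper proposes TALO, a framework that uses higher-DOF, long-term alignment based on Thin Plate Spline to address global consistency in online reconstruction with 3D vision foundation models. It performs well across multiple datasets and camera configurations.

Motivation: 3D vision foundation models face temporal consistency problems in online reconstruction, and existing methods fall short in assumption validity, local alignment scope, and robustness to noisy geometry.

Result: Experiments show markedly better reconstruction consistency and trajectory accuracy across multiple datasets and camera configurations.

Insight: Long-term alignment and a point-agnostic design are key to consistent online 3D reconstruction, and Thin Plate Spline works well for global adjustment.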

Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Codes are publicly available at https://github.com/Xian-Bei/TALO.

[15] A multi-weight self-matching visual explanation for CNNs on SAR images cs.CV

Siyuan Sun, Yongping Zhang, Hongcheng Zeng, Yamin Wang, Wei Yang

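TL;DR: MS-CAM combines channel-wise and element-wise weights to improve visual explanations of CNNs on SAR images, and is also shown to be feasible for weakly supervised object localization.

Motivation: CNNs perform well on SAR tasks, but the complexity and opacity of their internal mechanisms conflict with the high-reliability requirements of SAR applications, so improving CNN interpretability is important.

Result: Experiments show that MS-CAM highlights the network's regions of interest more accurately and captures detailed target information, and they demonstrate its feasibility for weakly supervised object localization.

Insight: MS-CAM makes CNNs on SAR tasks more interpretable, and its analysis of key factors such as pixel thresholds provides a reference for future work.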

Abstract: In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal mechanisms hinder the fulfillment of high-reliability requirements, thereby limiting their application in SAR. Improving the interpretability of CNNs is thus of great importance for their development and deployment in SAR. In this paper, a visual explanation method termed multi-weight self-matching class activation mapping (MS-CAM) is proposed. MS-CAM matches SAR images with the feature maps and corresponding gradients extracted by the CNN, and combines both channel-wise and element-wise weights to visualize the decision basis learned by the model in SAR images. Extensive experiments conducted on a self-constructed SAR target classification dataset demonstrate that MS-CAM more accurately highlights the network’s regions of interest and captures detailed target feature information, thereby enhancing network interpretability. Furthermore, the feasibility of applying MS-CAM to weakly-supervised object localization is validated. Key factors affecting localization accuracy, such as pixel thresholds, are analyzed in depth to inform future work.

[16] Understanding and Harnessing Sparsity in Unified Multimodal Models cs.CV | cs.AI

Shwai He, Chaorui Deng, Ang Li, Shen Yan

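TL;DR: The paper systematically analyzes sparsity in unified multimodal models and proposes a Mixture-of-Experts (MoE) adaptation based on sparse activation that substantially improves inference efficiency while preserving performance.

Motivation: Unified multimodal models have made notable progress on both understanding and generation, but unification makes inference inefficient. The paper investigates where this inefficiency comes from and proposes a remedy.

Result: Experiments show that MoE Adaptation significantly improves the efficiency of generation tasks; the adapted BAGEL model matches the full model while activating only about half of its parameters.

Insight: (1) The understanding component shows notable sparsity on generation tasks; (2) the generation components are sensitive to compression, and dynamic sparse activation is key to making them efficient; (3) MoE Adaptation is an effective way to reconcile unification with efficiency in multimodal models.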

Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.

[17] WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting cs.CV

Bin Li, Daijie Chen, Qi Zhang

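TL;DR: This paper proposes a weakly supervised, calibration-free multi-view crowd counting method (WSCF-MVCC) that uses crowd counts directly as the supervision signal, strengthens the model's perceptual ability with a self-supervised ranking loss built on multi-scale priors, and uses semantic information for more accurate view matching.

Details

Motivation: Existing multi-view crowd counting methods typically depend on costly camera calibration and dense annotations, and current calibration-free methods still require large amounts of image-level annotation. The authors therefore propose a calibration-free method that needs only weak supervision, reducing the cost of real-world deployment.

Result: On three widely used multi-view datasets, the method outperforms state-of-the-art methods under weakly supervised settings, indicating greater practicality for deployment.

Insight: Combining weak supervision with a calibration-free design reduces manual annotation cost, while semantic information and multi-scale priors improve model performance.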

Abstract: Multi-view crowd counting can effectively mitigate occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project different camera view images onto a common space to obtain ground-plane density maps, requiring abundant and costly crowd annotations and camera calibrations. Hence, calibration-free methods are proposed that do not require camera calibrations and scene-level crowd annotations. However, existing calibration-free methods still require expensive image-level crowd annotations for training the single-view counting module. Thus, in this paper, we propose a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC), directly using crowd count as supervision for the single-view counting module rather than density maps constructed from crowd annotations. A self-supervised ranking loss that leverages multi-scale priors is used to enhance the model’s perceptual ability without additional annotation costs. Moreover, the proposed model leverages semantic information to achieve more accurate view matching and, consequently, a more precise scene-level crowd count estimation. The proposed method outperforms the state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, indicating that it is more suitable for practical deployment compared with calibrated methods. Code is released at https://github.com/zqyq/Weakly-MVCC.
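
One common way to turn a multi-scale prior into a self-supervised signal is a ranking constraint: a crop that lies inside a larger crop cannot contain more people. The sketch below assumes a hinge-style formulation; the abstract does not give the exact loss, so `ranking_loss`, its `margin` parameter, and the example counts are hypothetical.

```python
def ranking_loss(pred_counts, margin=0.0):
    """Hinge-style ranking loss over nested crops.

    pred_counts: predicted counts for crops ordered from largest to smallest,
    where each crop is assumed to lie inside the previous one.
    """
    loss = 0.0
    for larger, smaller in zip(pred_counts, pred_counts[1:]):
        # Penalize only when the inner crop is predicted to hold more people.
        loss += max(0.0, smaller - larger + margin)
    return loss


consistent = ranking_loss([120.0, 80.0, 35.0])   # ordering respected, no penalty
violating = ranking_loss([120.0, 130.0, 35.0])   # inner crop exceeds outer crop
```

No annotation is needed to compute this term, because the ordering follows from the crop geometry alone.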

[18] VACoT: Rethinking Visual Data Augmentation with VLMs cs.CV | cs.AI

Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu

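TL;DR: VACoT is a framework that dynamically invokes image augmentations at inference time; by introducing post-hoc transformations such as denoising, it substantially improves the robustness of vision-language models (VLMs) on adversarial and out-of-distribution inputs.

Details

Motivation: VLMs struggle with basic visual perception tasks, and conventional data augmentation yields limited gains when training VLMs. VACoT aims to improve robustness through dynamic augmentation while avoiding high training costs.

Result: VACoT substantially improves VLM robustness on adversarial and out-of-distribution inputs, with particularly strong results on OCR-related tasks.

Insight: Applying augmentation dynamically at inference time improves robustness while avoiding the cost of conventional training-based approaches.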

Abstract: While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.

[19] Tackling Tuberculosis: A Comparative Dive into Machine Learning for Tuberculosis Detection cs.CV | cs.AI

Daanish Hindustani, Sanober Hindustani, Preston Nguyen

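TL;DR: This paper examines the performance of a pretrained ResNet-50 and a SqueezeNet model for diagnosing tuberculosis (TB) from chest X-rays. SqueezeNet performs better, underscoring the potential of machine learning for TB detection and its prospects in resource-limited regions.

Details

Motivation: TB is a global health problem, and traditional diagnostic methods are inefficient, especially where resources are limited. The authors aim to improve the efficiency and accuracy of TB diagnosis with deep learning.

Result: SqueezeNet reaches a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and F1 score of 87%, outperforming ResNet-50 at 54% (loss), 73% (accuracy), 88% (precision), 52% (recall), and 65% (F1 score).

Insight: Machine learning shows potential for TB detection, and lightweight models such as SqueezeNet are better suited to deployment in resource-limited regions, though further work is needed toward faster, smaller, and more accurate detection.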

Abstract: This study explores the application of machine learning models, specifically a pretrained ResNet-50 model and a general SqueezeNet model, in diagnosing tuberculosis (TB) using chest X-ray images. TB, a persistent infectious disease affecting humanity for millennia, poses challenges in diagnosis, especially in resource-limited settings. Traditional methods, such as sputum smear microscopy and culture, are inefficient, prompting the exploration of advanced technologies like deep learning and computer vision. The study utilized a dataset from Kaggle, consisting of 4,200 chest X-rays, to develop and compare the performance of the two machine learning models. Preprocessing involved data splitting, augmentation, and resizing to enhance training efficiency. Evaluation metrics, including accuracy, precision, recall, and confusion matrix, were employed to assess model performance. Results showcase that the SqueezeNet achieved a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and an F1 score of 87%. In contrast, the ResNet-50 model exhibited a loss of 54%, accuracy of 73%, precision of 88%, recall of 52%, and an F1 score of 65%. This study emphasizes the potential of machine learning in TB detection and possible implications for early identification and treatment initiation. The possibility of integrating such models into mobile devices expands their utility in areas lacking TB detection resources. However, despite promising results, the need for continued development of faster, smaller, and more accurate TB detection models remains crucial in contributing to the global efforts in combating TB.
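
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall, since it is their harmonic mean. The sketch below uses only the rounded percentages quoted in the abstract; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


squeezenet_f1 = f1_score(0.98, 0.80)   # about 0.881
resnet50_f1 = f1_score(0.88, 0.52)     # about 0.654
```

The ResNet-50 value agrees with the reported 65%. The SqueezeNet value comes out near 88% rather than the reported 87%; the abstract does not say how the figures were rounded or averaged, so the small gap may simply reflect that.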

[20] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention cs.CV | cs.AI

Wenyi Xiong, Jian Chen

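TL;DR: This paper proposes a novel map-free trajectory prediction algorithm that uses a Mixture of Experts mechanism and a selective attention module to efficiently extract key information across multiple domains (temporal, spatial, and frequency), improving prediction accuracy and computational efficiency in complex interactive scenarios.

Details

Motivation: Existing trajectory prediction methods struggle to extract valuable scene information efficiently in complex interactive scenarios, resulting in low computational efficiency and insufficient prediction accuracy.

Result: Experiments on the nuScenes dataset show that the algorithm performs strongly in complex interactive scenarios, validating its effectiveness.

Insight: Multi-domain information extraction combined with filtering of redundant information can markedly improve trajectory prediction, particularly under complex interactions.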

Abstract: Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often struggle to efficiently extract valuable scene information from redundant data, thereby reducing computational efficiency and prediction accuracy, especially when dealing with intricate agent interactions. To address these challenges, we propose a novel map-free trajectory prediction algorithm that achieves trajectory prediction across the temporal, spatial, and frequency domains. Specifically, in temporal information processing, we utilize a Mixture of Experts (MoE) mechanism to adaptively select critical frequency components. Concurrently, we extract these components and integrate multi-scale temporal features. Subsequently, a selective attention module is proposed to filter out redundant information in both temporal sequences and spatial interactions. Finally, we design a multimodal decoder. Under the supervision of patch-level and point-level losses, we obtain reasonable trajectory results. Experiments on the nuScenes dataset demonstrate the superiority of our algorithm, validating its effectiveness in handling complex interactive scenarios.

[21] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch cs.CV

Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei

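TL;DR: Skywork-R1V4 is a multimodal agentic model that unifies multimodal planning, active image manipulation, and deep search through interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval, and it achieves state-of-the-art performance using supervised fine-tuning alone.

Details

Motivation: Existing approaches treat image manipulation and web search as separate capabilities, rely on costly reinforcement learning, and lack planning grounded in real tool-execution traces. Skywork-R1V4 aims to address these limitations.

Result: The model scores 66.1 on MMSearch and 67.2 on FVQA, surpasses Gemini 2.5 Flash, and exhibits long-horizon reasoning.

Insight: Carefully curated supervised learning is sufficient for sophisticated multimodal agentic intelligence, without reinforcement learning.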

Abstract: Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation (“thinking with images”), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

Wentao Xiang, Haokang Zhang, Tianhang Yang, Zedong Chu, Ruihang Chu

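TL;DR: Nav-$R^2$ is a general open-vocabulary object-goal navigation framework based on dual-relationship reasoning. It explicitly models target-environment and environment-action relationships and combines structured CoT reasoning with a Similarity-Aware Memory (SA-Mem) to improve the localization of novel objects in unseen environments.

Details

Motivation: Existing approaches to open-vocabulary object-goal navigation suffer from opaque decision-making and low success rates in locating unseen objects. Nav-$R^2$ aims to address these problems and improve generalization through a more transparent reasoning mechanism and efficient feature fusion.

Result: Nav-$R^2$ achieves SOTA performance in localizing unseen objects, avoids overfitting to seen categories, and maintains real-time inference at 2 Hz.

Insight: Explicitly modeling the two relationships and using structured reasoning markedly improves transparency and generalization, and SA-Mem's feature fusion mechanism offers a new approach to efficient memory.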

Abstract: Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and low success rates in locating unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory. We construct a Nav$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context and finally make future action plans. Our SA-Mem preserves the most target-relevant and current observation-relevant features from both temporal and semantic perspectives by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.

[23] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate cs.CV | cs.AI | cs.LG

Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng

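TL;DR: WISE is a weighted iterative society-of-experts framework for multimodal multi-agent debate. It partitions agents into Solvers, which generate solutions, and Reflectors, which verify them, and aggregates debate outcomes with a modified Dawid-Skene algorithm, improving performance on multimodal tasks.

Details

Motivation: Multi-agent debate (MAD) performs well on language tasks, but its potential for multimodal problems remains underexplored. WISE aims to broaden MAD's applicability and robustness with a heterogeneous-expert architecture (single- and multi-modal) and a weighted feedback mechanism.

Result: On SMART-840, VisualPuzzles, EvoChart-QA, and other datasets, WISE improves accuracy by 2-7% over existing MAD methods, demonstrating its adaptability to multimodal tasks.

Insight: Division of labor among heterogeneous experts, together with weighted feedback, can improve reasoning on multimodal tasks and suggests new directions for designing multimodal debate systems.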

Abstract: Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, which generate solutions, and Reflectors, which verify correctness, assign weights, and provide natural language feedback. To aggregate the agents’ solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.

[24] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture cs.CV | cs.AI

Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin

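TL;DR: MitUNet is a hybrid architecture combining Mix-Transformer and U-Net for wall segmentation in indoor floor plans. It improves boundary accuracy by tuning a Tversky loss and outperforms existing single-task models.

Details

Motivation: Existing methods perform poorly on thin wall structures and produce masks with irregular boundaries that lack geometric precision, degrading downstream 3D reconstruction.

Result: On CubiCasa5k and a proprietary dataset, MitUNet produces structurally correct masks with high boundary accuracy, outperforming single-task models.

Insight: Transformer architectures excel at capturing global context; pairing them with U-Net's local feature extraction can markedly improve boundary accuracy in segmentation.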

Abstract: Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
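
The Tversky loss mentioned in the abstract generalizes the Dice loss by weighting false positives and false negatives separately. The sketch below uses the standard definition, 1 - TP / (TP + alpha * FP + beta * FN), on flat lists of probabilities; the values `alpha=0.7` and `beta=0.3` are assumptions chosen to show the effect, since the abstract does not report the hyperparameters the authors tuned.

```python
def tversky_loss(probs, targets, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky loss on flat lists: alpha weights false positives, beta false negatives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


targets = [1, 1, 0, 0]
pred_with_fp = [1, 1, 1, 0]   # both wall pixels found, plus one false positive

# Same prediction, two weightings (values are illustrative assumptions).
fp_heavy = tversky_loss(pred_with_fp, targets, alpha=0.7, beta=0.3)
fn_heavy = tversky_loss(pred_with_fp, targets, alpha=0.3, beta=0.7)
perfect = tversky_loss([1, 1, 0, 0], targets)
```

Raising `alpha` makes the same false positive more expensive, which matches the abstract's stated goal of suppressing false-positive noise along wall boundaries. Setting `alpha = beta = 0.5` recovers the Dice loss.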

[25] Generalizing Vision-Language Models with Dedicated Prompt Guidance cs.CV

Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li

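TL;DR: This paper proposes GuiDG, a two-step framework that uses dedicated prompt guidance to improve the generalization of vision-language models (VLMs) on downstream tasks, avoiding the trade-off between generality and domain specificity faced by conventional fine-tuning.

Details

Motivation: Large pretrained VLMs are usually fine-tuned directly on the entire dataset for downstream tasks, which can limit generalization. The paper aims to address this problem.

Result: Experiments on standard DG benchmarks and ImageNet-DG show that GuiDG outperforms existing fine-tuning methods while remaining efficient.

Insight: Training multiple parameter-efficient expert models, rather than a single universal model, effectively improves domain generalization.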

Abstract: Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan

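TL;DR: The paper introduces GUI Exploration Lab, a simulation environment engine for research on screen navigation by GUI agents, and improves navigation performance through supervised fine-tuning followed by single-turn and multi-turn reinforcement learning.

Details

Motivation: Real-world GUI environments are complex and proprietary, making comprehensive environment information hard to obtain and hindering systematic study and evaluation of agent navigation.

Result: The approach is validated on static and interactive benchmarks, where reinforcement learning markedly improves GUI navigation performance.

Insight: Supervised fine-tuning is a crucial foundation, and multi-turn reinforcement learning further refines navigation strategies through interactive exploration.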

Abstract: With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.

[27] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning cs.CV | cs.AI | cs.CL | cs.IR | cs.LG

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

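TL;DR: WorldMM is a dynamic multimodal memory agent that addresses long-video reasoning by constructing and retrieving from complementary memories containing both textual and visual representations, yielding substantial performance gains.

Details

Motivation: Existing video large language models understand short videos well but face limited context capacity and loss of visual detail on long videos (hours or days long). Conventional memory methods based on textual summaries cannot make full use of visual evidence.

Result: Across five long-video question-answering benchmarks, WorldMM improves average performance by 8.4% over the previous state of the art.

Insight: Dynamic construction and adaptive retrieval of multimodal memory are key to long-video reasoning, and the complementarity of textual and visual information improves reasoning.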

Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

[28] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework cs.CV | cs.AI

Daeyoung Kim

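TL;DR: This paper proposes LightHCG, a lightweight, causal-representation-driven glaucoma detection model that learns causal representations efficiently through HSIC disentanglement and a graph autoencoder, sharply reducing parameter count while improving performance.

Details

Motivation: Conventional AI-driven glaucoma detection falls short in reliability, parameter count, spurious correlations, and applicability to intervention analysis, calling for a more efficient, causality-driven solution.

Result: The model outperforms advanced models such as InceptionV3 and MobileNetV2 on glaucoma classification while using 93-99% fewer parameters.

Insight: Causality-driven representation learning can make medical image analysis models lighter and more reliable while supporting clinical intervention analysis.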

Abstract: As a representative degenerative optic condition, glaucoma threatens millions because of its irreversibility and its severe impact on the visual field. Characterized mainly by dimmed and blurred vision or peripheral vision loss, glaucoma is known to result from damage to the optic nerve caused by increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma-related research and clinical diagnosis have focused on detecting this optic nerve damage using patient data from perimetry tests, optic papilla inspections, and tonometer-based IOP measurements. More recently, with advances in computer vision models such as VGG16 and Vision Transformers (ViT), AI-automated glaucoma detection and optic cup segmentation based on retinal fundus images or OCT have performed well in supporting conventional diagnosis. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, the possibility of spurious correlations in detection, and limited applicability to intervention analysis or clinical simulation. This research therefore introduces a novel causal-representation-driven glaucoma detection model: LightHCG, an extremely lightweight convolutional VAE-based latent glaucoma representation model that can account for the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and graph-autoencoder-based unsupervised causal representation learning, LightHCG not only achieves higher glaucoma classification performance with 93-99% fewer weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2, or VGG16.

[29] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources cs.CV | cs.AI

Phuc Pham, Nhu Pham, Ngoc Quoc Ly

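TL;DR: This paper combines momentum self-distillation with gradient accumulation to improve the efficiency and performance of medical vision-language pretraining under limited computing resources.

Details

Motivation: Detailed annotations are hard to obtain in medicine, so efficient vision-language models (VLMs) are needed. Contrastive learning (CL), however, requires large training batches and heavy computation, limiting its use in resource-constrained settings. The authors address this with momentum self-distillation.

Result: The method is competitive with SOTA approaches on zero-shot classification, exceeds 90% AUC-ROC on few-shot adaptation, improves retrieval by 2-3%, and trains efficiently on a single GPU.

Insight: Momentum self-distillation can markedly improve performance under limited resources, and gradient accumulation makes small-batch training feasible.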

Abstract: In healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in the few-shot adaptation, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD .
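
The two generic ingredients named in the abstract, gradient accumulation and a momentum (exponential moving average) update, can be shown on a toy problem. This is a sketch under simplifying assumptions, not the paper's method: it uses a one-parameter regression model with a mean-squared-error loss, and all names (`mse_grad`, `accumulated_grad`, `ema_update`) and values are hypothetical. For a contrastive loss the batch also defines the set of negatives, which this toy example does not capture.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average micro-batch gradients so they match one large-batch gradient."""
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    # With equal-sized chunks, the mean of chunk means equals the full-batch mean.
    return sum(mse_grad(w, chunk) for chunk in chunks) / len(chunks)


def ema_update(teacher_w, student_w, momentum=0.99):
    """Momentum update: the teacher slowly tracks the student."""
    return momentum * teacher_w + (1 - momentum) * student_w


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_grad(0.5, data)                               # one batch of four
accum = accumulated_grad(0.5, data, micro_batch_size=2)  # two batches of two
teacher = ema_update(teacher_w=1.0, student_w=2.0, momentum=0.9)
```

The accumulated gradient equals the full-batch gradient while only two samples are held in memory at a time, which is the sense in which accumulation enlarges the effective batch size.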

[30] Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation cs.CV | cs.LG

Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee

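TL;DR: BOLT is a low-rank transfer method built on orthogonal bases: by extracting a task-informed spectral basis and adapting within its subspace, it enables efficient few-shot and test-time transfer while avoiding the extra training cost and instability of conventional meta-learning.

Details

Motivation: Large pretrained models need to adapt efficiently to new tasks, but conventional approaches such as meta-learning incur high training cost and can be unstable. BOLT reuses existing task-specific models by extracting orthogonal bases from them and adapting efficiently to new tasks.

Result: Experiments show that BOLT's parameter-efficient fine-tuning path achieves robust performance compared with common PEFT baselines and a meta-learned initialization.

Insight: Constraining adaptation to a task-informed orthogonal subspace offers an effective alternative for transfer to unseen tasks.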

Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
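
The online phase described in the abstract, frozen bases with a small set of trainable diagonal coefficients per layer, corresponds to an update of the form ΔW = Σ_i c_i u_i v_iᵀ. The sketch below illustrates only that structure. The offline phase (collecting singular directions from task vectors and orthogonalizing them) is not shown, and the bases here are hand-picked standard basis vectors used purely as placeholders; `low_rank_update` and all values are hypothetical.

```python
def low_rank_update(U, V, coeffs):
    """Return sum_i coeffs[i] * outer(U[i], V[i]) as a nested list.

    U: r frozen left basis vectors, each of length m.
    V: r frozen right basis vectors, each of length n.
    coeffs: the r trainable diagonal coefficients.
    """
    m, n = len(U[0]), len(V[0])
    delta = [[0.0] * n for _ in range(m)]
    for u, v, c in zip(U, V, coeffs):
        for i in range(m):
            for j in range(n):
                delta[i][j] += c * u[i] * v[j]
    return delta


# Placeholder orthonormal bases for a 3x2 weight matrix (rank r = 2).
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
V = [[0.0, 1.0], [1.0, 0.0]]
coeffs = [0.5, -2.0]   # the only values that would be trained

delta_w = low_rank_update(U, V, coeffs)
trainable, full = len(coeffs), len(U[0]) * len(V[0])
```

Here two coefficients control an update to a six-entry matrix; for realistic layer sizes the gap between `r` and `m * n` is far larger, which is where the parameter efficiency comes from.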

[31] nuScenes Revisited: Progress and Challenges in Autonomous Driving cs.CV | cs.RO

Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar

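TL;DR: This paper revisits nuScenes, a widely used autonomous driving dataset. It discusses the dataset's contributions to multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks, traces its influence on later datasets and community standards, and gives a detailed account of the technical details behind its creation and extensions.

Details

Motivation: Progress in autonomous driving depends on high-quality datasets. As one of the key datasets, nuScenes has shaped the community through its use in multi-modal sensor fusion and diverse tasks. This paper aims to document the details of nuScenes and its impact.

Result: By combining multi-modal sensor data with support for diverse tasks, nuScenes has advanced autonomous driving research and become a reference point for later datasets and community standards.

Insight: The success of nuScenes illustrates the central role of high-quality datasets in advancing autonomous driving, especially the importance of multi-modal data fusion and standardized tasks.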

Abstract: Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.

[32] HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild cs.CV | cs.AI

Valentin Bieri, Marie-Julie Rakotosaona, Keisuke Tateno, Francis Engelmann, Leonidas Guibas

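TL;DR: The paper introduces HouseLayout3D, a real-world benchmark for 3D layout estimation that covers multi-floor buildings and complex spaces, along with MultiFloor3D, a training-free baseline that outperforms existing methods.

Details

Motivation: Current 3D layout estimation models are trained mainly on synthetic datasets and cannot handle the complex multi-floor structures of real buildings.

Result: MultiFloor3D outperforms existing methods on both the HouseLayout3D benchmark and prior datasets.

Insight: Global spatial context is essential for handling multi-floor buildings, and existing methods discard this information.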

Abstract: Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single-room or single-floor environments. As a consequence, they cannot natively handle large multi-floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real-world benchmark designed to support progress toward full building-scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training-free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.

[33] See, Think, Learn: A Self-Taught Multimodal Reasoner cs.CV | cs.CL

Sourabh Sharma, Sonam Gupta, Sadbhawna

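TL;DR: The paper proposes See-Think-Learn (STL), a self-training framework that jointly improves the perception and reasoning of vision-language models through a structured reasoning template and negative-rationale augmentation, without relying on costly human-annotated data.

Details

Motivation: Existing vision-language models have weaknesses in both perception and reasoning, and methods for improving reasoning often depend on costly human annotation or overlook perception. STL is proposed to address these issues.

Result: Experiments show that STL outperforms baselines trained only on answers or on unstructured self-generated reasoning across multiple tasks, and that its rationales are of high quality.

Insight: Jointly optimizing perception and reasoning is key; negative rationales (explanations of why an answer is wrong) improve the model's discriminative ability; and self-training offers a low-cost route to stronger multimodal reasoning.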

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model’s ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.

[34] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation cs.CV

Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi

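TL;DR: The paper studies whether audio-video joint denoising training improves video generation quality even when only the video modality matters. By introducing the AVFullDiT architecture and comparing T2AV and T2V models, it finds that audio acts as a privileged signal that makes video dynamics more physically plausible.

Details

Motivation: To explore whether cross-modal joint training can implicitly improve video generation quality through the audio signal, beyond synchrony alone.

Result: Audio as a privileged signal improves the physical plausibility of video dynamics, most notably in scenes with large motion and object contact.

Insight: Cross-modal joint training can implicitly strengthen a model's understanding of the physical world, offering a new optimization direction for generative models.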

Abstract: Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

[35] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration cs.CV

Zhongyi Cai, Yi Du, Chen Wang, Yu Kong

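TL;DR: The paper proposes 3DSPMR, which augments multimodal large language models with 3D spatial memory to address spatial understanding and reasoning in sequential embodied tasks, and validates it on the SEER-Bench benchmark.

Details

Motivation: Existing work mostly studies indoor embodied tasks in a single-task setting, whereas deployed agents often face sequential tasks that require reusing spatial knowledge from earlier exploration. The lack of systematic study of this challenge motivates the paper.

Result: Experiments show that 3DSPMR achieves substantial gains on both sequential EQA and EMN tasks.

Insight: Explicitly incorporating geometric information is crucial for MLLM spatial understanding, especially in complex sequential tasks, which points to a new direction for embodied AI research.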

Abstract: Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.

[36] TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution cs.CV

Fengli Ran, Xiao Pu, Bo Liu, Xiuli Bi, Bin Xiao

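TL;DR: TGDD is a trajectory-guided dataset distillation method that dynamically aligns feature distributions and adds a distribution constraint regularization, improving the semantic diversity and representativeness of synthetic data.

Details

Motivation: Existing distribution matching methods for dataset distillation overlook how features evolve during training, which limits the expressiveness of the synthetic data.

Result: Experiments on ten datasets show that TGDD achieves state-of-the-art performance, including a 5.0% accuracy gain on high-resolution benchmarks.

Insight: Dynamically aligning feature distributions and balancing the data distribution are key to effective dataset distillation.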

Abstract: Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have attracted attention for their high efficiency. However, they often overlook the evolution of feature representations during training, which limits the expressiveness of synthetic data and weakens downstream performance. To address this issue, we propose Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model’s training trajectory. At each training stage, TGDD captures evolving semantics by aligning the feature distribution between the synthetic and original dataset. Meanwhile, it introduces a distribution constraint regularization to reduce class overlap. This design helps synthetic data preserve both semantic diversity and representativeness, improving performance in downstream tasks. Without additional optimization overhead, TGDD achieves a favorable balance between performance and efficiency. Experiments on ten datasets demonstrate that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.
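
Sketch (not from the paper): a minimal illustration of distribution matching evaluated along a training trajectory. It assumes the common mean-feature form of the matching loss and treats each trajectory stage as a feature extractor callable; the function names are hypothetical, and the distribution constraint regularization is omitted because the abstract does not give its form.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Extractor = Callable[[Vector], Vector]


def mean_vector(vectors: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dm_loss(real: Sequence[Vector], synthetic: Sequence[Vector], extractor: Extractor) -> float:
    """Squared distance between the mean features of real and synthetic samples."""
    mu_real = mean_vector([extractor(x) for x in real])
    mu_syn = mean_vector([extractor(x) for x in synthetic])
    return sum((a - b) ** 2 for a, b in zip(mu_real, mu_syn))


def trajectory_dm_loss(
    real: Sequence[Vector], synthetic: Sequence[Vector], checkpoints: Sequence[Extractor]
) -> float:
    """Average the matching loss over extractors taken at successive training stages."""
    return sum(dm_loss(real, synthetic, f) for f in checkpoints) / len(checkpoints)
```

In practice the loss would be computed per class and minimized with respect to the synthetic samples; the point here is only that the same matching term is evaluated under several stages of the model rather than a single fixed extractor.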

[37] WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling cs.CV | cs.LG

Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

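TL;DR: WorldPack proposes an efficient compressed memory that significantly improves the spatial consistency and quality of long-term generation in video world models, addressing the high computational cost of conventional approaches.

Details

Motivation: Conventional video world models incur prohibitive computational cost on long-context inputs, which makes temporal and spatial consistency hard to maintain. WorldPack aims to improve both efficiency and consistency through compressed memory.

Result: On the Minecraft-based LoopNav benchmark, WorldPack notably outperforms existing methods, confirming its advantage in long-term consistency.

Insight: Compressed memory can relieve the computational bottleneck of long-term world modeling while preserving generation quality and spatial consistency.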

Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.

[38] G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline cs.CV

Vishwesh Nath, Javier G. Tejero, Ruilong Li, Filippo Filicori, Mahdi Azizian

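TL;DR: G-SHARP is a real-time surgical scene reconstruction framework designed for minimally invasive surgery. Built on GSplat, it delivers high-fidelity 3D modeling suitable for real-time intra-operative visualization.

Details

Motivation: Existing Gaussian splatting methods depend on non-commercial derivatives, which limits deployability. G-SHARP aims to solve this with a commercially compatible real-time surgical reconstruction framework.

Result: It reaches state-of-the-art reconstruction quality on the EndoNeRF benchmark, with a speed-accuracy trade-off suitable for real-time intra-operative use.

Insight: A commercially compatible Gaussian splatting framework can substantially improve the usability and deployability of real-time surgical reconstruction.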

Abstract: We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.

[39] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making cs.CV | cs.AI

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou

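TL;DR: UCAgents proposes a hierarchical multi-agent framework that enforces unidirectional convergence through structured evidence auditing, addressing the detachment of linguistic explanations from visual evidence in medical visual question answering and improving both diagnostic accuracy and computational efficiency.

Details

Motivation: Existing vision-language models suffer from reasoning detachment in medical diagnosis: linguistic explanations do not match the visual evidence, which undermines clinical trust. Multi-agent frameworks reduce single-model bias, but open-ended discussion adds textual noise and computational cost and fails to anchor reasoning to visual evidence.

Result: On four medical VQA benchmarks, UCAgents reaches 71.3% accuracy on PathVQA, 6.0% above the prior state of the art, while cutting token cost by 87.7%, which supports its balance between visual evidence extraction and textual noise suppression.

Insight: The design shows that structured evidence auditing and unidirectional convergence can improve the trustworthiness and efficiency of medical diagnosis, offering a dependable option for real clinical deployment.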

Abstract: Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost. The evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.

[40] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding cs.CV | cs.AI

Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo

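TL;DR: The paper proposes 3D-SLIM, a masking strategy that improves the spatial reasoning of large language models (LLMs) in 3D scene-language understanding by replacing the conventional causal mask with an adaptive mask suited to 3D spatial structure.

Details

Motivation: Existing 3D scene-language methods rely on standard decoders from language modeling, whose causal mask introduces sequential bias and restricts object-instruction attention, hindering task-specific reasoning.

Result: 3D-SLIM yields substantial gains across multiple 3D scene-language tasks, validating its effectiveness and underscoring the critical role of decoder design in multimodal reasoning.

Insight: Mask design is crucial in 3D multimodal reasoning; a simple attention adjustment can substantially improve the spatial reasoning of LLMs.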

Abstract: Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user’s task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
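
Sketch (not from the paper): one way to picture replacing a causal mask with a spatially aware one. Everything here is an assumption made for illustration: the token layout (object tokens first, then instruction tokens), the use of k nearest neighbours as a stand-in for the density-based geometry rule, and the function name `build_mask`.

```python
import math


def build_mask(centers, num_instr, k):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed token layout: object tokens first, then instruction tokens.
    `centers` holds one 3D coordinate per object token.
    """
    n = len(centers)
    total = n + num_instr
    mask = [[False] * total for _ in range(total)]
    for i in range(n):
        mask[i][i] = True  # every token can always attend to itself
        # Geometry rule (stand-in): attend to the k spatially nearest objects,
        # regardless of where they appear in the token sequence.
        nearest = sorted(range(n), key=lambda j: math.dist(centers[i], centers[j]))[:k]
        for j in nearest:
            mask[i][j] = True
        # Instruction rule: object tokens can read the instruction tokens,
        # which a causal mask would hide because they come later.
        for j in range(n, total):
            mask[i][j] = True
    for i in range(n, total):
        # Instruction tokens see all objects and earlier instruction tokens.
        for j in range(i + 1):
            mask[i][j] = True
    return mask
```

Because only the boolean mask changes, a scheme of this kind needs no new parameters and no architectural changes, which matches the property the abstract emphasizes.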

[41] YingVideo-MV: Music-Driven Multi-Stage Video Generation cs.CV

Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding

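TL;DR: The paper presents YingVideo-MV, the first music-driven multi-stage framework for long-video generation. It integrates audio semantic analysis, an interpretable shot planning module, temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to automatically generate high-quality music performance videos.

Details

Motivation: Existing long-video generation work lacks explicit camera motion control, and music performance video generation remains underexplored. The paper aims to fill this gap with music-driven long-video generation.

Result: Experiments show that YingVideo-MV generates coherent and expressive music videos with precise music-motion-camera synchronization.

Insight: Explicit camera motion control and temporal consistency modeling are key to music video generation; the dynamic window strategy improves continuity in long-sequence generation.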

Abstract: While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .

[42] Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration cs.CV | cs.GR

Mizuki Kikkawa, Tatsuya Yatagawa, Yutaka Ohtake, Hiromasa Suzuki

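TL;DR: The study examines how the translation and rotation invariance of feature vectors affects partial point set registration methods based on deep learning and Gaussian mixture models (GMMs). The authors propose an attention-guided reference point shifting (ARPS) layer that overcomes the limitations of existing methods.

Details

Motivation: Existing GMM-based deep learning registration methods such as DeepGMR have limitations on partial point set registration, with both theoretical and practical problems around feature invariance. The study aims to identify the causes and propose a solution.

Result: The ARPS layer significantly improves DeepGMR and UGMMReg, surpassing prior deep learning methods that use attention blocks and Transformers.

Insight: The study analyzes the feature invariance problem of registration methods based on GMMs and deep learning in depth, offering guidance for future method design.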

Abstract: This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, to the partial-to-partial point set registration. Our primary goal is to uncover the causes behind such methods and propose a comprehensible solution for that. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformer to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.

[43] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model cs.CV

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

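TL;DR: dots.ocr is a single vision-language model that, for the first time, jointly learns three core tasks within a unified end-to-end framework. It shows strong multilingual document layout parsing and achieves state-of-the-art performance on OmniDocBench and the new XDocParse benchmark.

Details

Motivation: Conventional document layout parsing relies on fragmented multi-stage pipelines that propagate errors and cannot exploit joint training. dots.ocr aims for more efficient and robust document parsing by jointly learning layout detection, text recognition, and relational understanding in one model.

Result: It achieves state-of-the-art performance on OmniDocBench; on XDocParse, dots.ocr beats the next-best method by 7.4 points, demonstrating strong multilingual capability.

Insight: Combining a unified end-to-end framework with a multilingual data generation engine is the key to dots.ocr's success, giving document intelligence a more efficient and robust solution.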

Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world’s vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

[44] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding cs.CV

Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang

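TL;DR: GeoDiT is the first diffusion-based vision-language model designed for the geospatial domain. It generates structured, coherent outputs through a parallel refinement process and outperforms autoregressive models on several tasks.

Details

Motivation: Autoregressive models are structurally mismatched with geospatial understanding: their forced sequential generation hinders structured outputs. GeoDiT addresses this by using a diffusion model for parallel refinement over geospatial data.

Result: Experiments show that GeoDiT sets a new state of the art on tasks requiring structured outputs, such as image captioning, visual grounding, and multi-object detection, ahead of autoregressive models.

Insight: Aligning the generative process with the intrinsic structure of the data is key to high performance in complex geospatial analysis.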

Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data’s intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

[45] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling cs.CV

Aditya Chaudhary, Prachet Dev Singh, Ankit Jha

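TL;DR: The paper proposes ViT-SR, a two-stage ViT training method that improves single image super-resolution through colorization pretraining and residual upsampling.

Details

Motivation: Single image super-resolution (SISR) remains a hard problem in computer vision, and existing methods leave room for improvement in performance and generalization.

Result: On the DIV2K dataset, ViT-SR achieves an SSIM of 0.712 and a PSNR of 22.90 dB.

Insight: Self-supervised pretraining shows promise for complex image restoration tasks; larger ViT architectures or other pretext tasks may bring further gains.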

Abstract: In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
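
Sketch (not from the paper): the residual formulation and the PSNR metric mentioned above, written for flat lists of pixel intensities in [0, 1]. The function names are hypothetical and the bicubic upsampling itself is assumed to have been done already.

```python
import math


def add_residual(upsampled, residual):
    """Final image = interpolated base + predicted high-frequency residual,
    clipped to the valid intensity range [0, 1]."""
    return [min(1.0, max(0.0, u + r)) for u, r in zip(upsampled, residual)]


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this formula, a PSNR of 22.90 dB corresponds to a mean squared error of about 0.0051 on a [0, 1] intensity scale; how the paper aggregates PSNR across images is not stated in the abstract.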

[46] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts cs.CV

Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang

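TL;DR: SkyMoE is a vision-language model designed for multimodal, multi-task remote sensing interpretation. It uses a Mixture-of-Experts (MoE) architecture with an adaptive router and a context-disentangled augmentation strategy, substantially improving performance across tasks and granularities.

Details

Motivation: General-purpose vision-language models remain suboptimal for remote sensing tasks, and existing geospatial VLMs struggle to differentiate task types and interpretation granularities, which limits the balance between local detail perception and global contextual understanding.

Result: SkyMoE achieves state-of-the-art performance on 21 public datasets, demonstrating its adaptability, scalability, and multi-granularity understanding.

Insight: An MoE architecture combined with task- and granularity-aware routing can improve remote sensing performance, and the context-disentangled augmentation strategy helps experts specialize.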

Abstract: The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.
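
Sketch (not from the paper): generic top-k Mixture-of-Experts gating, shown only to illustrate what routing inputs to specialized experts means. SkyMoE's router generates task- and granularity-aware routing instructions, which this simplified softmax gate does not model; all names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def mix(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors.

    `expert_outputs` maps expert index -> output vector for the chosen experts.
    """
    dim = len(next(iter(expert_outputs.values())))
    return [sum(weights[i] * expert_outputs[i][d] for i in weights) for d in range(dim)]
```

Only the selected experts need to run, which is how MoE models add capacity without a proportional increase in per-input compute.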

[47] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning cs.CV | cs.AI | cs.CL

Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian

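TL;DR: ReVSeg uses reinforcement learning to optimize a multi-step reasoning chain for video segmentation. It decomposes complex queries into three explicit operations (semantics interpretation, temporal evidence selection, and spatial grounding) and produces interpretable reasoning trajectories on top of pretrained vision-language models.

Details

Motivation: Queries in video object segmentation often involve dynamics, causality, and temporal interactions, yet existing methods collapse these factors into latent embeddings, leaving the reasoning chain opaque and intractable. An explicitly decomposed reasoning approach is therefore needed.

Result: ReVSeg achieves state-of-the-art performance on standard video object segmentation benchmarks and produces interpretable reasoning trajectories.

Insight: Explicit task decomposition combined with reinforcement learning over the reasoning chain handles complex video segmentation effectively and makes the model more interpretable.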

Abstract: Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations – semantics interpretation, temporal evidence selection, and spatial grounding – aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .

[48] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection cs.CV | stat.ML

Tai Le-Gia

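TL;DR: The dissertation studies the problem of consistent anomalies in zero-shot anomaly classification and segmentation (AC/AS), proposes a graph-based method (CoDeGraph) to filter them, and extends it to 3D medical imaging and text-driven vision-language models.

Details

Motivation: Zero-shot AC/AS is increasingly important in industrial inspection and medical imaging, but it suffers from consistent anomalies: recurring similar anomalies systematically bias distance-based methods. The work aims to address this with both theoretical grounding and practical solutions.

Result: CoDeGraph effectively suppresses the influence of consistent anomalies; 3D anomaly segmentation is achieved without training samples; pseudo-masks successfully bridge batch-based and text-driven zero-shot methods.

Insight: 1. Consistent anomalies are a core challenge for zero-shot AC/AS; 2. similarity scaling and neighbor-burnout are the key observed phenomena; 3. combining graph models with vision-language models is promising.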

Abstract: Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.
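
Sketch (not from the dissertation): a toy illustration of why recurring anomalies bias distance-based scoring. It assumes a simplified batch-based rule in which a patch is scored by its distance to the closest patches in other test images, using 1-D features; this is a stand-in for illustration and is not CoDeGraph.

```python
def patch_score(feature, other_images, keep_fraction=0.5):
    """Anomaly score of one patch: mean of the smallest nearest-neighbour
    distances to patches in the other images of the batch.

    `other_images` is a list of images, each a list of 1-D patch features.
    """
    nearest = sorted(min(abs(feature - p) for p in img) for img in other_images)
    kept = nearest[: max(1, int(len(nearest) * keep_fraction))]
    return sum(kept) / len(kept)
```

With other images such as `[[0.0, 0.1], [0.0, 0.2], [0.1, 0.0]]`, a patch with feature 5.0 has no close match and scores high. If most other images contain a similar defect, for example `[[0.0, 5.0], [0.1, 5.1], [0.0, 0.2]]`, the same patch finds near-identical neighbours and its score collapses toward that of a normal patch, which is the failure mode the dissertation formalizes.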

[49] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens cs.CV

Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao

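TL;DR: The paper proposes Noisy Query Tokens, which learn a distributed representation space between the VLM and the diffusion model through end-to-end optimization to address task generalization collapse, and adds a VAE branch to recover image details.

Details

Motivation: Efficiently bridging pre-trained vision-language models (VLMs) with diffusion models is challenging; in particular, a fixed number of learnable query tokens generalizes poorly to new tasks.

Result: Experiments show that the method mitigates generalization collapse and supports continual learning across diverse tasks.

Insight: Combining distributed representation learning with a detail recovery module is key, and it suggests new directions for multimodal model research.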

Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

[50] AVGGT: Rethinking Global Attention for Accelerating VGGT cs.CV

Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang

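TL;DR: AVGGT rethinks the global attention in VGGT and $π^3$ and proposes a training-free two-step acceleration scheme that delivers up to 8-10x faster inference while preserving accuracy.

Details

Motivation: VGGT and $π^3$ perform well on multi-view 3D tasks, but their reliance on global self-attention is computationally expensive, and existing sparse-attention variants lack a systematic analysis.

Result: The method speeds up inference by up to 8-10x with accuracy matching or slightly exceeding the original models, and it stays robust in dense multi-view settings.

Insight: Global attention plays different roles at different depths: early layers do not form meaningful correspondences, middle layers perform cross-view alignment, and the last layers make only minor refinements.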

Abstract: Since DUSt3R, models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
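
Sketch (not from the paper): one possible reading of step (2), K/V subsampling with diagonal preservation and a mean-fill component, for a single attention head. The strided subsampling pattern, the way the mean-fill token is built, and all function names are assumptions; the abstract does not give these details.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]


def mean_of(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def subsampled_attention(Q, K, V, stride):
    """Each query attends to every `stride`-th token, to its own token
    (diagonal preservation), and to one extra token that averages the
    keys and values that were left out (mean-fill)."""
    n = len(Q)
    kept = [j for j in range(n) if j % stride == 0]
    dropped = [j for j in range(n) if j % stride != 0]
    out = []
    for i in range(n):
        idx = kept if i % stride == 0 else kept + [i]
        keys = [K[j] for j in idx]
        vals = [V[j] for j in idx]
        if dropped:
            keys.append(mean_of([K[j] for j in dropped]))
            vals.append(mean_of([V[j] for j in dropped]))
        out.append(attend(Q[i], keys, vals))
    return out
```

With `stride=1` nothing is dropped and the result equals full attention; larger strides reduce the number of keys each query touches from n to roughly n / stride + 2.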

[51] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities cs.CV | cs.CL | cs.CR

Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li

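TL;DR: The paper proposes Contextual Image Attack (CIA), a new image-centric attack method that uses a multi-agent system to embed harmful queries in seemingly benign visual contexts, substantially raising the attack success rate.

Details

Motivation: Existing attack methods focus on text-image interplay and overlook the potential of the visual modality as an independent, complex attack vector. The authors therefore aim to develop a more effective image-centric attack.

Result: Against GPT-4o and Qwen2.5-VL-72B, CIA reaches toxicity scores of 4.73 and 4.83 and attack success rates of 86.31% and 91.07%, respectively, clearly outperforming prior work.

Insight: The visual modality is itself a potent attack vector that can bypass the safety alignment of multimodal large models, posing new challenges for future safety research.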

Abstract: While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack’s efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

[52] OmniPerson: Unified Identity-Preserving Pedestrian Generation cs.CV

Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang

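TL;DR: OmniPerson is a unified identity-preserving pedestrian generation pipeline that supports RGB/IR image and video generation. With multimodal inputs and fine-grained control, it addresses data privacy and annotation cost by generating high-quality pedestrian data to strengthen ReID.

Details

Motivation: Existing pedestrian generation methods fall short in identity consistency and controllability, which limits their value for data augmentation. OmniPerson aims to solve these problems.

Result: OmniPerson achieves SOTA visual fidelity and identity consistency, and its generated data significantly improves ReID model performance.

Insight: Combining multimodal inputs with multi-reference identity fusion is key to high-quality pedestrian generation, and the open-sourced dataset and models help advance the field.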

Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation conditioned on any number of reference images, two kinds of person poses, and text, and it also includes RGB-to-IR transfer and image super-resolution. 2) We design the Multi-Refer Fuser for robust identity preservation with any number of reference images as input, enabling OmniPerson to distill a unified identity from a set of multi-view reference images and generate high-fidelity pedestrians. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline, which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.

[53] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature cs.CV | cs.AIPDF

Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He

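TL;DR: Panel2Patch is a new data pipeline that mines hierarchical structure from biomedical scientific literature and converts it into multi-granular supervision for vision-language pretraining. By preserving local semantics and building hierarchically aligned vision-language pairs, it yields substantially better performance.

Details

Motivation: Existing biomedical vision-language pretraining typically compresses scientific figures and text into coarse figure-level pairs, discarding the local structural correspondences that clinicians rely on. Panel2Patch is proposed to preserve this fine-grained information.

Result: Experiments show that the supervision Panel2Patch extracts from a small set of figures is more effective than that of prior pipelines, substantially improving model performance.

Insight: Fine-grained vision-language alignment is crucial for pretraining in the biomedical domain; hierarchical data construction and processing strategies effectively improve a model's understanding of local structure.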

Abstract: There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

[54] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval cs.CVPDF

Yafei Song, Peng Zhang, Bang Zhang

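TL;DR: This paper proposes a new framework that combines diffusion-based motion generation with a motion graph retrieval algorithm to generate natural co-speech gesture videos synchronized with speech, overcoming the one-to-one mapping limitation of earlier methods.

Details

Motivation: Existing methods handle the complex many-to-many mapping between speech and gestures poorly because they rely on one-to-one feature matching or a shared feature space.

Result: Experiments show that the method significantly outperforms prior approaches in synchronization and naturalness.

Insight: A diffusion model can effectively learn the complex relationship between speech and motion; combined with multi-level audio features and motion-similarity retrieval, it produces more natural gesture videos.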

Abstract: Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.

[55] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence cs.CVPDF

Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu

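TL;DR: RULER-Bench is a new benchmark focused on evaluating the rule-based reasoning abilities of video generation models. It fills a gap left by existing benchmarks and exposes the shortcomings of current models.

Details

Motivation: Existing evaluations of video generation models mainly focus on visual perception and understanding, while rule-based reasoning remains understudied. RULER-Bench aims to fill this gap with a fine-grained evaluation protocol that can drive progress in video models.

Result: Experiments show that the current state-of-the-art model reaches only 48.87% on the rule coherence metric, leaving substantial room for improvement in reasoning ability.

Insight: RULER-Bench reveals the weaknesses of video generation models in rule-based reasoning, offers guidance for future research directions, and supports the development of reasoning-aware video generation.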

Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

[56] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding cs.CVPDF

Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu

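TL;DR: PPTBench is a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. It reveals significant weaknesses of current models in layout understanding and visual-structural reasoning.

Details

Motivation: Existing benchmarks focus only on narrow subtasks and overlook the layout-centric challenges at the core of the problem. PPTBench aims to fill this gap by evaluating models' multimodal reasoning in real-world slide creation and editing.

Result: Experiments show that current MLLMs have a substantial gap in layout understanding and visual-structural reasoning: they can interpret slide content but fail to produce coherent spatial arrangements.

Insight: Current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability, which points to directions for future research on visual-structural reasoning and slide generation.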

Abstract: PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies also visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

[57] PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking cs.CVPDF

Dong Li, Jiahao Xiong, Yingda Huang, Le Chang

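TL;DR: PoreTrack3D is the first benchmark dataset for dynamic 3D Gaussian splatting in pore-scale, non-rigid facial trajectory tracking. It contains more than 440,000 trajectories and establishes performance baselines for existing methods.

Details

Motivation: Existing facial motion capture focuses mainly on traditional landmarks and overlooks subtle pore-scale motion. PoreTrack3D fills this gap and advances the study of subtle facial expressions.

Result: PoreTrack3D establishes the first performance baseline in this domain, supporting progress in dynamic 3D reconstruction.

Insight: Pore-scale motion capture reflects subtle expression changes more precisely, opening new directions for facial analysis and simulation.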

Abstract: We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture both traditional facial landmarks and pore-scale keypoint trajectories, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset’s creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset is publicly available at: https://github.com/JHXion9/PoreTrack3D

[58] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation cs.CV | cs.LG | cs.MM | cs.SD | eess.ASPDF

Junwon Lee, Juhan Nam, Jiyoung Lee

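TL;DR: This paper proposes a new task, text-conditioned selective video-to-audio (V2A) generation, which aims to generate only the user-intended sound from a multi-object video. The SelVA model uses the text prompt to explicitly select the target source and modulate the video encoder, extracting prompt-relevant video features and achieving robust semantic and temporal grounding.

Details

Motivation: In multimedia production, audio tracks must be handled individually for precise editing and control, but existing methods can only generate mixed sound, because visual features are entangled and region cues or prompts often fail to specify the source.

Result: On the VGG-MONOAUDIO benchmark, SelVA performs strongly in audio quality, semantic alignment, and temporal synchronization.

Insight: Text prompts can effectively guide video feature extraction and suppress irrelevant information, offering a new approach for multimodal generation tasks.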

Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.

[59] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation cs.CV | cs.IRPDF

Agathoklis Georgiou

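TL;DR: This paper proposes a hybrid architecture that combines the fine-grained similarity of vision-language models with OCR-extracted structured text, using spatial relevance propagation to achieve more precise document retrieval.

Details

Motivation: Existing vision-language models (such as ColPali) perform well in document retrieval but return whole pages rather than specific regions, which limits their utility for retrieval-augmented generation (RAG). OCR systems provide coordinate information but lack any assessment of semantic relevance.

Result: The authors release Snappy, an open-source implementation that demonstrates practical applicability; empirical evaluation is still ongoing.

Insight: By combining semantic and spatial information, the method targets region-level rather than page-level retrieval, which matters most for tasks that need precise context such as RAG. Its precision gains have not yet been confirmed empirically.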

Abstract: Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali’s patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
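
The patch-to-region propagation described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Snappy implementation: it assumes a square patch grid laid uniformly over the page, and it uses an overlap-area-weighted mean as the intersection metric. The names `Box`, `patch_box`, `overlap_area`, and `region_relevance` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in page pixel coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def patch_box(row, col, grid, page_w, page_h):
    """Pixel-space box of patch (row, col) on a grid x grid layout (assumed uniform)."""
    pw, ph = page_w / grid, page_h / grid
    return Box(col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)


def overlap_area(a, b):
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(w, 0.0) * max(h, 0.0)


def region_relevance(region, patch_scores, page_w, page_h):
    """Overlap-weighted mean of patch similarity scores covering an OCR region."""
    grid = len(patch_scores)
    weighted, total = 0.0, 0.0
    for r in range(grid):
        for c in range(grid):
            area = overlap_area(region, patch_box(r, c, grid, page_w, page_h))
            weighted += area * patch_scores[r][c]
            total += area
    return weighted / total if total > 0 else 0.0


# Toy 2x2 patch grid on a 100x100 page; only the top-left patch matches the query.
scores = [[1.0, 0.0], [0.0, 0.0]]
top_left = region_relevance(Box(0, 0, 50, 50), scores, 100, 100)
top_row = region_relevance(Box(0, 0, 100, 50), scores, 100, 100)
bottom_right = region_relevance(Box(50, 50, 100, 100), scores, 100, 100)
```

OCR regions can then be ranked or filtered by this score at inference time, with no additional training, which is the property the abstract emphasizes. Other intersection metrics (for example, taking the maximum patch score inside the region) would fit the same interface.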

[60] UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking cs.CVPDF

Qionglin Ren, Dawei Zhang, Chunxu Tian, Dan Zhang

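TL;DR: UAUTrack proposes a unified multimodal anti-UAV tracking framework. Through a single-stream, single-stage, end-to-end architecture and a text prior prompt strategy, it enables effective cross-modal collaboration and achieves state-of-the-art performance on multiple datasets.

Details

Motivation: Existing anti-UAV tracking methods are mostly independent models that lack a unified framework for cross-modal collaboration, and multimodal data fusion remains ineffective, so an efficient unified solution is needed.

Result: UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset.

Insight: A text prior prompt strategy can improve the model's robustness on UAV targets across modalities, offering a new idea for multimodal tracking tasks.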

Abstract: Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.

[61] Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance cs.CVPDF

Huankun Sheng, Ming Li, Yixiang Wei, Yeying Fan, Yu-Hui Wen

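TL;DR: The paper proposes Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background and, with pseudo-mask guidance, improves unsupervised scene decomposition and object discovery.

Details

Motivation: Existing slot attention-based methods process foreground and background indiscriminately, which leads to background interference and poor object discovery. FASA addresses this by explicitly separating foreground from background to make scene decomposition more robust.

Result: FASA outperforms existing methods on both synthetic and real-world datasets, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance.

Insight: Explicitly separating foreground from background effectively reduces background interference and improves the precision of object discovery; pseudo-mask guidance further mitigates over-segmentation of foreground objects.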

Abstract: Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.

[62] ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection cs.CV | cs.LGPDF

Omid Reza Heidari, Yang Wang, Xinxin Zuo

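TL;DR: ALDI++ is a framework that tackles domain adaptation for security X-ray imagery through self-distillation, feature alignment, and enhanced training strategies. It outperforms existing methods on the EDS dataset, with the best results obtained using a ViTDet backbone.

Details

Motivation: Differences in scanning devices and environmental conditions cause significant domain discrepancies in security X-ray imaging, which degrades conventional object detectors, so effective domain adaptation methods are needed.

Result: ALDI++ surpasses existing domain adaptation methods on the EDS dataset; the ViTDet-based variant achieves the highest mAP, demonstrating the effectiveness of transformer architectures for cross-domain object detection.

Insight: Transformer architectures perform well in cross-domain object detection; domain shift can be substantially mitigated by combining self-distillation with feature alignment.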

Abstract: Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.

[63] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm cs.CV | cs.LGPDF

Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu

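TL;DR: VLM-Pruner is a training-free token pruning algorithm that balances redundancy and spatial sparsity, improving the efficiency of vision-language models while preserving fine-grained object details.

Details

Motivation: Vision-language models (VLMs) excel at image understanding tasks, but their large number of visual tokens imposes high computational costs and hinders deployment on mobile devices. Existing pruning methods either rely solely on token importance and ignore inter-token redundancy, wasting capacity on duplicated tokens, or ignore spatial relationships, producing overly sparse selections that fail to cover target objects.

Result: Across five VLMs at an 88.9% pruning rate, VLM-Pruner consistently outperforms strong baselines while delivering an end-to-end inference speedup.

Insight: 1. Spatial relationships and redundancy are both factors that token pruning cannot ignore; 2. a training-free pruning method can substantially improve efficiency; 3. fusing information from discarded tokens into retained ones is an effective way to mitigate the information loss caused by pruning.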

Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.

[64] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding cs.CVPDF

Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo

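TL;DR: GeoViS proposes a geospatially rewarded visual search framework that addresses visual grounding in remote sensing imagery through progressive search and reasoning, substantially improving small-target detection and the understanding of complex geospatial relations.

Details

Motivation: Targets in remote sensing imagery are usually very small and involve complex geospatial relations, and conventional visual grounding methods do not adapt directly to these challenges.

Result: On five remote sensing visual grounding benchmarks, GeoViS consistently surpasses existing methods across key visual grounding metrics.

Insight: Combining progressive search with a geospatial reward mechanism effectively improves small-target detection and the understanding of complex relations, while showing strong cross-domain generalization and interpretability.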

Abstract: Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

[65] Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone cs.CV | cs.LGPDF

Tristan Amadei, Enric Meinhardt-Llopis, Benedicte Bascle, Corentin Abgrall, Gabriele Facciolo

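TL;DR: The paper proposes CAEVL, a self-supervised UAV localization method that needs no paired data and trains directly on satellite views, and introduces the ViLD dataset to validate it.

Details

Motivation: Existing UAV localization methods depend on paired UAV-satellite image datasets, which are hard to obtain and costly. The paper aims to remove this limitation with a training paradigm that requires only satellite views.

Result: CAEVL achieves performance competitive with methods trained on paired data and shows strong generalization.

Insight: Self-supervised learning combined with a domain-shift augmentation strategy can greatly reduce reliance on expensive paired data and make UAV localization more practical.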

Abstract: Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.

[66] Reasoning-Aware Multimodal Fusion for Hateful Video Detection cs.CV | cs.AIPDF

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao

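TL;DR: This paper proposes a novel Reasoning-Aware Multimodal Fusion (RAMF) framework for detecting hateful content in online videos. It addresses multimodal semantic interaction through Local-Global Context Fusion (LGCF) and Semantic Cross Attention (SCA), and uses adversarial reasoning to strengthen the model's understanding of hateful intent. Experiments on real-world datasets show that the method outperforms existing approaches.

Details

Motivation: Hateful content in online videos poses a serious threat to digital platforms, and existing methods fall short in multimodal semantic interaction and in understanding hateful intent. The authors propose the RAMF framework to address these problems.

Result: On two real-world hateful video datasets, RAMF improves on the previous best methods by 3% in Macro-F1 and 7% in hate-class recall.

Insight: Combining local-global context with adversarial reasoning effectively strengthens a model's understanding of complex hateful content, leading to better multimodal fusion performance.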

Abstract: Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model’s contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.

[67] Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset cs.CVPDF

Qifan Liang, Junlin Li, Zhen Han, Xihao Wang, Zhongyuan Wang

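TL;DR: This paper proposes the first smoke-type-aware laparoscopic video desmoking network (STANet). It distinguishes Diffusion Smoke from Ambient Smoke, designs a smoke mask segmentation sub-network and a smokeless video reconstruction sub-network, and adds a coarse-to-fine disentanglement module to improve performance. It also builds the first large-scale synthetic desmoking dataset with smoke type annotations.

Details

Motivation: Smoke generated during laparoscopic surgery impairs the visual guidance that video provides, and existing methods do not account for differences between smoke types, so a smoke-type-aware desmoking method is needed.

Result: Experiments show that STANet outperforms existing methods in quality evaluations and generalizes better across downstream surgical tasks.

Insight: Distinguishing and disentangling smoke types is the key; attention mechanisms and joint prediction of smoke masks and smoke types help improve desmoking quality.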

Abstract: Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform specialized desmoking on smoky features guided by the two types of smoke masks. To address the entanglement challenges of the two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.

Tang Haonan, Chen Yanjun, Jiang Lezhi

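TL;DR: TrackNetV5 proposes a new object tracking method that uses a Motion Direction Decoupling (MDD) module and a Residual-Driven Spatio-Temporal Refinement (R-STR) head to resolve occlusion and directional ambiguity, achieving high-performance real-time tracking.

Details

Motivation: Earlier versions of the TrackNet series struggle with occlusion (V1-V3) and directional ambiguity (V4) when tracking fast-moving small objects, which limits further gains in tracking performance.

Result: On the TrackNetV2 dataset, TrackNetV5 achieves an F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions, with only a 3.7% increase in FLOPs compared to V4.

Insight: Explicitly encoding motion direction and applying residual refinement are effective ways to improve tracking performance under occlusion.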

Abstract: The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.
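
The directional ambiguity mentioned in this abstract can be shown with a toy example. The sketch below is not TrackNetV5's MDD module, which is a learned component whose details are not given here. It only assumes that "signed polarity fields" amounts to splitting the frame difference into its positive and negative parts; the function names are hypothetical.

```python
import numpy as np


def absolute_difference(prev, curr):
    """V4-style motion cue: magnitude of change only, sign discarded."""
    return np.abs(curr.astype(np.float32) - prev.astype(np.float32))


def signed_polarity_fields(prev, curr):
    """Split the frame difference into two non-negative channels (assumed reading of MDD)."""
    diff = curr.astype(np.float32) - prev.astype(np.float32)
    positive = np.maximum(diff, 0.0)   # pixels that became brighter (object arrived)
    negative = np.maximum(-diff, 0.0)  # pixels that became darker (object left)
    return positive, negative


# Toy 1x5 frames with a bright ball at index 1 or index 3.
ball_left = np.array([[0, 255, 0, 0, 0]], dtype=np.uint8)
ball_right = np.array([[0, 0, 0, 255, 0]], dtype=np.uint8)

# Moving right and moving left give the same absolute-difference map...
same_abs = np.array_equal(
    absolute_difference(ball_left, ball_right),
    absolute_difference(ball_right, ball_left),
)

# ...but the signed fields swap roles, so the direction stays recoverable.
pos_fwd, neg_fwd = signed_polarity_fields(ball_left, ball_right)
pos_bwd, neg_bwd = signed_polarity_fields(ball_right, ball_left)
```

In this toy case the positive channel marks where the ball arrived and the negative channel where it left, which is the polarity information an absolute difference throws away.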

[69] UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits cs.CVPDF

Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai

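TL;DR: This paper introduces the UnicEdit-10M dataset and the UnicBench benchmark. A unified verification mechanism addresses the scale-quality trade-off in image editing data, and new metrics provide fine-grained performance diagnosis.

Details

Motivation: Existing image editing datasets and benchmarks face a conflict between scale and quality and cannot meet the training and evaluation needs of powerful multimodal models. The paper aims to resolve this with a unified data pipeline and verification mechanism.

Result: The resulting UnicEdit-10M dataset covers diverse editing tasks, and evaluation on UnicBench reveals the limitations of existing models and provides directions for future research.

Insight: A unified verification mechanism and data generation method are key to resolving the scale-quality conflict; fine-grained evaluation metrics support deeper analysis of model capabilities.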

Abstract: With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.

[70] HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval cs.CV | cs.MMPDF

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen

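TL;DR: This paper proposes HUD, a new network that uses hierarchical uncertainty-aware disambiguation to address the disparity in information density between video and text in multi-modal queries, improving performance on Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).

Details

Motivation: Multi-modal queries (video plus text) in composed video retrieval have a disparity in information density, which causes ambiguity about the subject of the modification and insufficient focus on detailed semantics. Previous work has overlooked this problem, and model performance suffers as a result.

Result: HUD achieves state-of-the-art performance on three benchmark datasets covering both CVR and CIR tasks.

Insight: Exploiting the disparity in information density between modalities enables more effective semantic alignment and disambiguation of multi-modal queries, improving the accuracy of composed retrieval.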

Abstract: Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.

[71] IC-World: In-Context Generation for Shared World Modeling cs.CVPDF

Fan Wu, Jiacheng Wei, Ruibo Li, Yi Xu, Junyou Li

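TL;DR: IC-World is a novel video generation framework for shared world modeling. It activates the in-context generation capability of large video models to generate multi-view videos in parallel, and uses reinforcement learning to improve geometry and motion consistency.

Details

Motivation: Video-based world models are good at synthesizing diverse, dynamic visual environments, but shared world modeling (generating consistent videos from multiple views of the same scene) has not been studied systematically. IC-World aims to fill this gap.

Result: Experiments show that IC-World substantially outperforms existing methods in geometry and motion consistency.

Insight: Shared world modeling has to account for both cross-view consistency and dynamic change. IC-World achieves this with reinforcement learning and reward models, opening a new direction for video generation.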

Abstract: Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC-World, a novel generation framework, enabling parallel generation for all input images via activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
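The group-relative step of Group Relative Policy Optimization mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the equal reward weights, and the idea of summing the two reward scores linearly are assumptions made for the example.

```python
import numpy as np

def combined_reward(geometry_score, motion_score, w_geo=0.5, w_motion=0.5):
    """Hypothetical scalar reward: weighted sum of the two consistency scores."""
    return w_geo * geometry_score + w_motion * motion_score

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples generated for the same input.

    Each sample's advantage is its reward minus the group mean,
    divided by the group standard deviation.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

Samples that score above the group average receive a positive advantage and are reinforced; no separate value network is needed because the group itself serves as the baseline.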

[72] Defense That Attacks: How Robust Models Become Better Attackers cs.CV | cs.AIPDF

Mohamed Awad, Mahmoud Akrm, Walid Gomaa

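TL;DR: The paper finds that adversarial training (AT) not only improves model robustness but also unexpectedly increases the transferability of adversarial examples, revealing a new ecosystem risk.

Details

Motivation: Adversarial training is the leading method for improving the robustness of deep learning models, but its effect on the transferability of adversarial examples is underexplored. The study asks whether adversarial training unintentionally increases that transferability.

Result: Adversarial examples crafted on AT models transfer more effectively than those crafted on standard models, exposing a potential risk of adversarial training.

Insight: Adversarial training improves robustness but may introduce an ecosystem risk. Future robustness evaluations should consider a model's dual role, as both a target and a source of transferable attacks.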

Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduces a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.

[73] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? cs.CVPDF

Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros

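TL;DR: The paper studies whether actions can be anticipated from a single frame using multimodal cues (RGB features and depth) combined with contextual information (textual summaries or action recognition results). The proposed AAG method performs competitively with conventional video-based methods on several datasets.

Details

Motivation: Conventional action anticipation methods aggregate temporal information from video, yet humans can often predict an action from a single frame given enough context. The paper asks whether multimodal cues can replace temporal video information for efficient action anticipation.

Result: On three datasets (IKEA-ASM, Meccano, and Assembly101), AAG performs competitively with temporally aggregated video baselines and state-of-the-art methods.

Insight: Multimodal single-frame information combined with context can, to a task-dependent extent, stand in for temporal video information, offering a new angle on action anticipation.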

Abstract: Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

[74] RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association cs.CVPDF

Abdul Hannan, Furqan Malik, Hina Jabbar, Syed Suleman Sadiq, Mubashir Noman

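TL;DR: RFOP rethinks fusion and orthogonal projection for face-voice association in multilingual settings. By focusing on the semantic information shared by the two modalities, it ranked 3rd in the FAME 2026 challenge.

Details

Motivation: Face-voice association in multilingual environments poses new challenges, especially when handling face-voice data across languages, where conventional fusion and projection methods may not capture the semantic information adequately.

Result: In the FAME 2026 challenge, RFOP performed favorably on the English-German data, ranking 3rd with an EER (equal error rate) of 33.1%.

Insight: In multimodal tasks, particularly cross-lingual ones, fusion and projection designs that focus on semantic information are a key source of improvement.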

Abstract: The Face-voice Association in Multilingual Environments (FAME) challenge 2026 aims to investigate the face-voice association task in a multilingual scenario. The challenge introduces English-German face-voice pairs to be utilized in the evaluation phase. To this end, we revisit the fusion and orthogonal projection for face-voice association by effectively focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge by achieving an EER of 33.1.
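For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a generic threshold sweep, not the challenge's official scoring script. It assumes that a higher score means a more likely face-voice match, that `labels` marks genuine pairs as true, and that both genuine and impostor pairs are present.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # impostor pairs wrongly accepted
        frr = np.mean(~accept[labels])   # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where the two are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

The function returns a fraction, so a value of 0.331 would correspond to the 33.1% reported in the summary above.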

[75] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward cs.CVPDF

Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu

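TL;DR: The paper proposes an online reinforcement learning (RL) post-training framework that optimizes a pretrained video generator for precise camera control. A verifiable geometry reward provides dense segment-level feedback and substantially improves camera-control accuracy and geometric consistency.

Details

Motivation: Most existing video diffusion models rely solely on supervised fine-tuning (SFT) and leave the potential of online RL post-training untapped. To push camera-control precision further, the authors bring RL into camera-controlled video generation.

Result: Experiments show that the method outperforms SFT baselines in camera-control accuracy, geometric consistency, and visual quality.

Insight: Online RL post-training can substantially improve camera control in video generation models, and a dense reward signal is key to optimization efficiency.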

Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
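The segment-wise reward described above can be illustrated with a small NumPy sketch. It assumes the estimated trajectories are given as arrays of 4x4 camera-to-world matrices. The segment length, the error weights, and the exponential scoring function are hypothetical choices for illustration, since the abstract does not specify how the alignment score is computed.

```python
import numpy as np

def relative_pose(T_a, T_b):
    """Pose of frame b expressed in the coordinate frame of a (4x4 matrices)."""
    return np.linalg.inv(T_a) @ T_b

def rotation_angle(R):
    """Geodesic angle (radians) of a 3x3 rotation matrix."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos))

def segment_rewards(gen_poses, ref_poses, seg_len=4, w_rot=1.0, w_trans=1.0):
    """One alignment score in (0, 1] per trajectory segment (assumed scoring rule)."""
    gen = np.asarray(gen_poses, dtype=np.float64)
    ref = np.asarray(ref_poses, dtype=np.float64)
    assert gen.shape == ref.shape and gen.shape[1:] == (4, 4)
    rewards = []
    for start in range(0, len(gen) - seg_len, seg_len):
        end = start + seg_len
        rel_gen = relative_pose(gen[start], gen[end])
        rel_ref = relative_pose(ref[start], ref[end])
        # Residual rotation is the identity when the two segments agree.
        rot_err = rotation_angle(rel_ref[:3, :3].T @ rel_gen[:3, :3])
        trans_err = float(np.linalg.norm(rel_gen[:3, 3] - rel_ref[:3, 3]))
        rewards.append(np.exp(-(w_rot * rot_err + w_trans * trans_err)))
    return np.array(rewards)
```

Because each segment is scored on its own relative pose, an error early in the video does not wipe out the reward for later segments, which is one way a dense signal can reduce reward sparsity. A practical version would also have to handle the scale ambiguity of monocular pose estimates, which this sketch ignores.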

[76] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm cs.CVPDF

Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li

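TL;DR: MindGPT-4ov is a multimodal large language model (MLLM) whose performance is improved through a multi-stage post-training paradigm. It reaches leading results on several benchmarks at low cost and strengthens the foundational capabilities and generalization of MLLMs. Its core innovations are a data generation scheme, a collaborative curriculum supervised fine-tuning method, and a hybrid reinforcement learning paradigm.

Details

Motivation: Existing MLLMs are limited in data quality, training efficiency, and generalization. MindGPT-4ov aims to address these with a systematic post-training paradigm.

Result: It performs strongly on benchmarks such as MMBench, MMStar, MathVision, and MathVista, and markedly improves user experience in vertical-domain tasks.

Insight: A systematic post-training paradigm can substantially improve MLLM performance and adaptability while lowering the cost of domain adaptation.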

Abstract: We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community’s development of MLLMs.

[77] Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models cs.CV | cs.ROPDF

Pierpaolo Serio, Giulio Pisaneschi, Andrea Dan Ryals, Vincenzo Infantino, Lorenzo Gentilini

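TL;DR: The paper studies how different LiDAR-to-image projections affect metric place recognition with vision foundation models. It introduces a modular retrieval pipeline and shows that well-designed projections can serve as an effective substitute for end-to-end 3D learning in LiDAR place recognition.

Details

Motivation: In LiDAR place recognition, the choice of 2D projection has a significant effect on performance, yet projection characteristics have not been studied systematically. The paper aims to fill this gap and identify which projections suit real-world deployment.

Result: Experiments show that carefully designed projections can effectively replace end-to-end 3D learning, with gains in the discriminative power and robustness of place recognition.

Insight: The structural and geometric properties of a projection are critical to place recognition performance. Choosing a suitable projection can substantially improve the system without adding computational complexity.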

Abstract: This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.

[78] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding cs.CV | cs.AI | cs.MMPDF

Fan Yang, Kaihao Zhang

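TL;DR: MRD is a training-free framework for high-resolution image understanding. Its multi-resolution retrieval-detection fusion addresses the semantic similarity bias that arises when a target object is split across image crops.

Details

Motivation: Existing methods crop a high-resolution image and compute semantic similarity per crop, but cropping can split the target object and disrupt the similarity computation. The authors observe that objects of different sizes are handled better at different resolutions, which motivates MRD.

Result: MRD proves effective on high-resolution image understanding benchmarks.

Insight: Combining multi-resolution processing with global detection helps avoid object fragmentation and improves the accuracy of semantic understanding.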

Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
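One simple way to picture the multi-resolution semantic fusion step is to bring crop-level similarity maps from different grid sizes onto a common grid and average them. The sketch below does exactly that. The nearest-neighbor upsampling and the weighted mean are assumptions chosen for illustration, and the abstract does not state the actual fusion rule used by MRD.

```python
import numpy as np

def upsample_nearest(sim_map, target_hw):
    """Repeat each cell so a coarse crop grid matches the finest grid."""
    h, w = sim_map.shape
    th, tw = target_hw
    if th % h or tw % w:
        raise ValueError("target grid must be an integer multiple of the map size")
    return np.repeat(np.repeat(sim_map, th // h, axis=0), tw // w, axis=1)

def fuse_similarity_maps(sim_maps, weights=None):
    """Weighted mean of per-resolution similarity maps on the finest grid."""
    maps = [np.asarray(m, dtype=np.float64) for m in sim_maps]
    target = (max(m.shape[0] for m in maps), max(m.shape[1] for m in maps))
    if weights is None:
        weights = np.ones(len(maps))
    weights = np.asarray(weights, dtype=np.float64) / np.sum(weights)
    fused = np.zeros(target)
    for w, m in zip(weights, maps):
        fused += w * upsample_nearest(m, target)
    return fused
```

A large object that is fragmented on the fine grid can still score highly on the coarse grid, so the fused map keeps its region intact while the fine grid continues to localize small objects.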

[79] EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis cs.CV | cs.AIPDF

Yancheng Zhang, Guangyu Sun, Chen Chen

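TL;DR: EGGS proposes a hybrid representation that combines 2D and 3D Gaussians. With dynamic type exchange and a tailored optimization strategy, it preserves multi-view consistency while improving texture detail, balancing rendering quality and geometric accuracy in novel view synthesis (NVS).

Details

Motivation: 3D Gaussian Splatting (3DGS) renders appearance with high fidelity but suffers from multi-view inconsistency, while 2D Gaussian Splatting (2DGS) enforces multi-view consistency at the expense of texture detail. EGGS aims to overcome both limitations and find a balance.

Result: Experiments show that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency.

Insight: The core idea of EGGS is to dynamically combine the strengths of 2D and 3D Gaussians, balancing multi-view consistency against texture detail and providing a practical solution for NVS.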

Abstract: Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.

[80] LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization cs.CVPDF

Zhihan Xiao, Lin Liu, Yixin Gao, Xiaopeng Zhang, Haoxuan Che

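TL;DR: LoVoRA is a novel framework for mask-free video object removal and addition. A learnable object-aware localization mechanism enables spatio-temporally consistent edits.

Details

Motivation: Existing methods usually depend on auxiliary masks or reference images to guide editing, which limits their scalability and generalization. LoVoRA aims to remove this dependency and deliver high-quality video editing without masks.

Result: Experiments and human evaluation show that LoVoRA produces high-quality video edits.

Insight: LoVoRA's novelty lies in its mask-free design and object-aware localization mechanism, which make video editing more general and scalable.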

Abstract: Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using an object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.

[81] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench cs.CV | cs.AIPDF

Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu

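TL;DR: VideoScience-Bench is a new benchmark that evaluates the scientific understanding and zero-shot reasoning of video generation models, filling a gap left by existing benchmarks.

Details

Motivation: Existing video generation benchmarks are mostly based on physical commonsense and offer little insight into models' scientific reasoning. VideoScience-Bench is proposed to fill this gap and push models toward scientific understanding.

Result: Experiments show that VLM-as-a-Judge evaluations correlate strongly with human assessments, supporting the validity of the approach.

Insight: Future video generation models need to go beyond generation and attend to scientific reasoning and understanding. VLM-as-a-Judge offers a new route to automated evaluation.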

Abstract: The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models’ scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: https://github.com/hao-ai-lab/VideoScience.

[82] A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems cs.CVPDF

Yuhan Chen, Yicui Shi, Guofa Li, Guangrui Bai, Jinyuan Shao

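TL;DR: UltraFast-LieNET is a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement, designed for embedded automotive vision systems. It uses Dynamic Shifted Convolution (DSConv) and a Multi-scale Shifted Residual Block (MSRB) to significantly expand the receptive field, and improves stability with a residual structure and a multi-level gradient-aware loss function.

Details

Motivation: In low-light conditions such as nighttime driving, image degradation seriously threatens the safety of in-vehicle cameras, and existing algorithms are too computationally expensive for real-time automotive use.

Result: On the LOLI-Street dataset it reaches a PSNR of 26.51 dB, 4.6 dB above existing methods, with only 180 parameters. Experiments on four benchmark datasets confirm its strong performance under resource constraints.

Insight: Lightweight network design can balance performance and efficiency through dynamic convolution and multi-scale structure, making it suitable for embedded real-time scenarios.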

Abstract: In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://github.com/YuhanChen2024/UltraFast-LiNET
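To put the reported numbers in context, PSNR is a logarithmic function of the mean squared error between the enhanced image and the reference. The helper below is a generic definition, not code from the paper's repository, and the default peak value of 255 assumes 8-bit images.

```python
import numpy as np

def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(enhanced, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because the scale is logarithmic, a gain of 4.6 dB corresponds to a mean squared error that is lower by a factor of about 10^0.46, or roughly 2.9.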

Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang

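TL;DR: BEVDilation is a LiDAR-centric framework for LiDAR-camera multimodal fusion. It uses image BEV features as implicit guidance, together with a sparse voxel dilation module, to improve 3D object detection.

Details

Motivation: LiDAR and cameras differ in geometric accuracy, so fusing them indiscriminately can degrade performance. A more effective fusion strategy is needed.

Result: On the nuScenes benchmark it outperforms existing methods and is more robust to depth noise.

Insight: A LiDAR-centric strategy better combines the geometric accuracy of LiDAR with the semantic information of images, improving fusion.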

Abstract: Integrating LiDAR and camera information in the bird’s eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.

Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang

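TL;DR: The paper proposes InEx, a training-free framework that mitigates hallucination in large language models (LLMs) through introspection and cross-modal multi-agent collaboration.

Details

Motivation: Hallucination is a major obstacle in LLM development, and existing solutions either rely on human intervention or underuse the agent's ability to mitigate hallucination autonomously. Drawing on human decision-making, the paper combines introspection with external verification to reduce uncertainty.

Result: Experiments show that InEx outperforms existing methods on general and hallucination benchmarks, with gains of 4%-27% and strong robustness.

Insight: The introspection and external verification found in human decision-making transfer well to AI systems and offer a new approach to mitigating hallucination.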

Abstract: Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent’s ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent’s reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
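As a rough illustration of entropy-based uncertainty estimation, the sketch below computes the Shannon entropy of each next-token distribution and flags a response for further introspection when the mean entropy is high. The function names, the averaging over steps, and the threshold value are hypothetical; the abstract does not say how InEx turns entropy into a decision.

```python
import numpy as np

def token_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of next-token probabilities."""
    p = np.asarray(probs, dtype=np.float64)
    p = p / p.sum(axis=-1, keepdims=True)  # renormalize defensively
    return -np.sum(p * np.log(p + eps), axis=-1)

def needs_introspection(step_probs, threshold=1.0):
    """Flag a response whose mean per-token entropy exceeds an assumed threshold."""
    return float(np.mean(token_entropy(step_probs))) > threshold
```

A uniform distribution over four tokens has entropy ln 4, about 1.39 nats, and would be flagged under the assumed threshold, while a near one-hot distribution has entropy close to zero and would not.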

Md Sohag Mia, Md Nahid Hasan, Tawhid Ahmed, Muhammad Abdullah Adnan

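TL;DR: GraphFusion3D is a unified 3D object detection framework built on dynamic graph attention and an adaptive cross-modal Transformer. Its multi-modal fusion and graph reasoning modules improve the extraction of geometric and semantic information from point clouds.

Details

Motivation: Point clouds are sparse, structurally incomplete, and semantically limited, and contextual relationships between distant objects are hard to capture. This calls for a solution that combines multi-modal information and dynamically models spatial-semantic relationships.

Result: It reaches 70.6% AP$_{25}$ / 51.2% AP$_{50}$ on SUN RGB-D and 75.1% AP$_{25}$ / 60.8% AP$_{50}$ on ScanNetV2, substantially better than existing methods.

Insight: Dynamic graph attention and cross-modal feature fusion compensate for the limitations of point cloud data, and multi-scale modeling of local and global information is a key source of improvement.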

Abstract: Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP$_{25}$ and 51.2% AP$_{50}$) and ScanNetV2 (75.1% AP$_{25}$ and 60.8% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.
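The idea of attention that weights both spatial proximity and feature similarity between proposals can be sketched in a few lines of NumPy. This is a single-scale toy version under stated assumptions, not the paper's Graph Reasoning Module: the logit form (cosine similarity minus scaled distance), the k-nearest-neighbor graph, and the parameters `sigma` and `tau` are all hypothetical.

```python
import numpy as np

def proposal_attention(centers, feats, k=2, sigma=1.0, tau=1.0):
    """Row-stochastic attention over each proposal's k nearest neighbors."""
    c = np.asarray(centers, dtype=np.float64)
    f = np.asarray(feats, dtype=np.float64)
    n = len(c)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of proposals")
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    f_unit = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)
    sim = f_unit @ f_unit.T
    # Similar features raise the logit, large distances lower it.
    logits = sim / tau - dist / sigma
    np.fill_diagonal(dist, np.inf)  # exclude self from the neighbor search
    neighbors = np.argsort(dist, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], neighbors] = True
    logits = np.where(mask, logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def refine_features(centers, feats, **kwargs):
    """Aggregate neighbor features with the attention weights."""
    return proposal_attention(centers, feats, **kwargs) @ np.asarray(feats, dtype=np.float64)
```

Running the same function with several values of `k` and combining the outputs would be one simple way to approximate a multi-scale variant.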

[86] DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling cs.CVPDF

Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin

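TL;DR: DynamicVerse is a multimodal 4D world modeling framework for dynamic real-world video. It integrates large vision, geometric, and multimodal models to recover static geometry, dynamic motion, instance masks, and descriptive captions.

Details

Motivation: Existing datasets mostly come from limited simulators or traditional methods, which restricts how accurately foundation models can interpret physical dynamics from monocular video. DynamicVerse aims to bridge this gap.

Result: It outperforms existing methods on video depth estimation, camera pose estimation, and camera intrinsics estimation.

Insight: Through multimodal fusion and global optimization, DynamicVerse models real-world dynamics accurately at physical scale.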

Abstract: Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

[87] SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting cs.CV | cs.GR | cs.ROPDF

Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke

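TL;DR: SurfFill completes LiDAR point clouds using Gaussian surfels. It analyzes the gaps caused by beam divergence, introduces an ambiguity heuristic based on density change together with a point-growing step, and adds a divide-and-conquer scheme for large scenes. It outperforms existing methods.

Details

Motivation: LiDAR is highly accurate in flat regions but tends to miss small geometric structures and dark, absorbent materials. Camera-based photogrammetry can recover those details but rarely matches LiDAR accuracy in featureless regions. SurfFill combines the strengths of both by completing LiDAR point clouds with Gaussian surfels.

Result: On LiDAR point cloud completion for synthetic and real-world scenes, SurfFill outperforms previous reconstruction methods.

Insight: Beam divergence is a main cause of LiDAR missing thin structures and edges, density change serves as an indicator of ambiguous regions, and Gaussian surfels are well suited to accurate local completion.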

Abstract: LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capture is prone to missing small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR captures and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

[88] In-Context Sync-LoRA for Portrait Video Editing cs.CV | cs.AI | cs.GRPDF

Sagi Polaczek, Or Patashnik, Ali Mahdavi-Amiri, Daniel Cohen-Or

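TL;DR: Sync-LoRA is a portrait video editing method that defines the edit on the first frame and propagates it through the whole sequence, producing high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency.

Details

Motivation: Portrait video editing needs flexible yet precise control. It must support a wide range of modifications (appearance changes, expression edits, added objects) while preserving the subject's original temporal behavior, so that every edited frame stays precisely synchronized with its source frame.

Result: Experiments show that Sync-LoRA generalizes to unseen identities and diverse edits (such as modifying appearance, adding objects, or changing backgrounds) and handles variations in pose and expression robustly.

Insight: Training on aligned paired videos lets the model combine motion cues from the source video with the visual changes in the edited frame, yielding high visual fidelity and strong temporal coherence.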

Abstract: Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject’s original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.

[89] Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks cs.CVPDF

Matthew Dutson, Nathan Labiosa, Yin Li, Mohit Gupta

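TL;DR: The paper proposes stability adapters, a general mechanism that can be inserted into virtually any frame-based network to improve the temporal consistency and corruption robustness of video inference.

Details

Motivation: Frame-based networks applied sequentially to video often show temporal inconsistency (such as flicker between output frames), and the problem worsens when the input contains time-varying corruptions.

Result: Experiments show that the method improves temporal stability and corruption robustness across tasks including denoising, image enhancement, depth estimation, and semantic segmentation.

Insight: A theoretical analysis of a unified accuracy-stability-robustness loss identifies the conditions under which stabilizer training is well behaved.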

Abstract: When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.

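[90] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction cs.CV

Kehan Qi, Saumya Gupta, Qingqiao Hu, Weimin Lyu, Chao Chen

TL;DR: This paper proves theoretically that unrolled networks are discrete implementations of conditional probability flow ODEs and proposes FLAT, which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ODE trajectory to improve the stability and efficiency of MRI reconstruction.

Details

Motivation: In MRI reconstruction, unrolled networks are efficient but unstable, whereas diffusion models are stable but computationally expensive. The authors aim to combine the strengths of both by connecting them theoretically.

Result: On three MRI datasets, FLAT achieves high-quality reconstructions with up to 3x fewer iterations than diffusion-based models and significantly greater stability than unrolled networks.

Insight: The evolution of an unrolled network's intermediate states can be regularized through ODE theory; combining deep learning with mathematical theory can improve performance on medical imaging tasks.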

Abstract: Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolving caused by freely-learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.
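
The link between unrolling and ODE discretization can be illustrated with a toy example. The sketch below is not FLAT: it assumes a hypothetical scalar velocity field `v(x, t) = target - x` and plain forward-Euler steps, so each "unrolled layer" is one discretization step whose step size comes from the schedule rather than from a freely learned parameter. The names `euler_unrolled`, `target`, and `num_steps` are illustrative only.

```python
def euler_unrolled(x0, target, num_steps):
    """Integrate dx/dt = target - x from t=0 to t=1 with forward Euler.

    Each loop iteration plays the role of one unrolled layer; the step
    size h is fixed by the discretization, not learned.
    """
    h = 1.0 / num_steps
    x = x0
    trajectory = [x]
    for _ in range(num_steps):
        velocity = target - x   # hypothetical velocity field v(x, t)
        x = x + h * velocity    # one Euler step == one unrolled iteration
        trajectory.append(x)
    return trajectory


traj = euler_unrolled(x0=0.0, target=1.0, num_steps=8)
```

In this toy setting the intermediate states move monotonically toward the target, which is the kind of well-defined trajectory that intermediate reconstructions could be aligned against.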

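[91] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation cs.CV

Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao

TL;DR: MAViD is a multimodal framework for audio-visual dialogue understanding and generation. Its Conductor-Creator architecture enables fine-grained control, and it combines autoregressive and diffusion models to generate high-quality long-duration content.

Details

Motivation: Existing approaches are mostly non-interactive systems that produce constrained, unnatural speech, and they struggle to fuse audio and video effectively.

Result: Experiments show that MAViD generates vivid, contextually coherent long-duration dialogue and accurately interprets users' multimodal queries.

Insight: Fusing multimodal tasks requires careful architectural design and model composition, and contextual coherence is key to long-duration generation.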

Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users’ multimodal queries.

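[92] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation cs.CV | cs.AI

Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin

TL;DR: ViSAudio is an end-to-end method for video-driven binaural spatial audio generation. Using a dual-branch audio generation architecture and conditional flow matching, it generates high-quality binaural audio directly from silent video, avoiding the error accumulation and spatio-temporal inconsistencies of existing two-stage methods.

Details

Motivation: Research on video-to-audio generation focuses on mono output and lacks spatial immersion. Binaural methods typically use a two-stage pipeline (generate mono audio first, then spatialize it), which leads to error accumulation and spatio-temporal inconsistencies.

Result: ViSAudio outperforms existing methods on both objective metrics and subjective evaluations, generating high-quality binaural audio that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.

Insight: An end-to-end binaural generation framework avoids the error accumulation of two-stage methods, while the dual-branch design and conditional module achieve precise spatio-temporal alignment.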

Abstract: Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
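
For readers unfamiliar with conditional flow matching, the following minimal sketch shows how one training pair is commonly built. It assumes the widely used linear interpolation path between a noise sample and a data sample; ViSAudio's actual path, conditioning, and dual-branch network are not specified in the abstract, and the four-element `latent` is a stand-in, not a real audio latent.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Return (x_t, target_velocity) for the linear path x_t = (1-t)*x0 + t*x1."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]  # constant velocity along the path
    return x_t, target_v


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # x0: noise sample
latent = [0.5, -0.2, 0.1, 0.9]                   # x1: stand-in audio latent (assumption)
x_t, v = cfm_training_pair(noise, latent, t=0.25)
loss_if_perfect = mse(v, v)  # a model that predicts v exactly has zero loss
```

A network would be trained to predict `v` from `x_t`, `t`, and the video condition; in a dual-branch design, each channel's latent would presumably get its own such flow.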

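[93] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation cs.CV | cs.AI

Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning

TL;DR: Video4Spatial is a video generation framework that, relying only on visual context from video data, performs complex spatial tasks such as scene navigation and object grounding.

Details

Motivation: To explore whether video generative models can exhibit human-like spatial cognition from visual data alone, advancing visuospatial intelligence.

Result: Experiments show that Video4Spatial demonstrates strong spatial understanding and generalizes to long contexts and out-of-domain environments, indicating its potential for visuospatial reasoning.

Insight: Video generative models can learn complex spatial tasks from visual context, which suggests a new direction for building more general visuospatial intelligence.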

Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

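[94] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework cs.CV

Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu

TL;DR: MultiShotMaster is a controllable multi-shot video generation framework that addresses the difficulty existing techniques have with narrative multi-shot videos. It achieves flexibility and high quality through two RoPE variants and an automated data annotation pipeline.

Details

Motivation: Existing video generation techniques excel at single-shot clips but struggle with narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. MultiShotMaster aims to address these problems.

Result: Experiments show that MultiShotMaster outperforms existing methods in performance and controllability and supports flexible configuration of shot count and duration.

Insight: With changes to the model architecture and annotation pipeline, high-quality multi-shot video generation is achievable despite data scarcity, offering a new approach to narrative video generation.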

Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
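
One way to picture a phase shift at shot transitions is as an offset added to the temporal position index that feeds RoPE. The sketch below is an assumption-laden illustration, not the paper's Multi-Shot Narrative RoPE: `multishot_positions` and the constant `phase_shift` are hypothetical, and the real method may shift phases differently.

```python
def multishot_positions(shot_lengths, phase_shift):
    """Assign a temporal position to every frame of a multi-shot video.

    Frames keep their running index, and each shot boundary adds a fixed
    offset, so positions stay strictly increasing (narrative order is
    preserved) while shots are separated by a visible gap.
    """
    positions = []
    frame = 0
    for shot_idx, length in enumerate(shot_lengths):
        for _ in range(length):
            positions.append(frame + shot_idx * phase_shift)
            frame += 1
    return positions


# Three shots of 3, 2, and 4 frames with a gap of 10 positions per boundary.
pos = multishot_positions([3, 2, 4], phase_shift=10)
```

Because the shot lengths are just a list, shot count and duration can be changed freely without altering how positions are computed.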

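[95] PPTArena: A Benchmark for Agentic PowerPoint Editing cs.CV | cs.AI

Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang

TL;DR: PPTArena is a benchmark for PowerPoint editing that evaluates how reliably agents modify real slides under natural-language instructions. PPTPilot is a structure-aware slide-editing agent that achieves precise control through semantic edit sequences and XML operations and outperforms existing systems in experiments.

Details

Motivation: Existing benchmarks focus on image-PDF renderings or text-to-slide generation and do not evaluate actual slide-editing ability. PPTArena fills this gap and provides fine-grained tasks to advance agent controllability and reliability.

Result: PPTPilot outperforms existing agents and VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency.

Insight: Existing agents still underperform on long-horizon, document-scale tasks, underscoring how hard reliable PowerPoint editing remains.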

Abstract: We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.

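[96] OneThinker: All-in-one Reasoning Model for Image and Video cs.CV

Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen

TL;DR: OneThinker is a unified multimodal reasoning model that shares knowledge across tasks and modalities for image and video understanding. It builds a large-scale training corpus and proposes EMA-GRPO to handle reward heterogeneity in multi-task reinforcement learning.

Details

Motivation: Existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains, which limits scalability toward a multimodal reasoning generalist and restricts practical versatility.

Result: The model delivers strong performance on 31 visual benchmarks covering 10 fundamental visual understanding tasks, and it shows knowledge transfer between certain tasks and preliminary zero-shot generalization.

Insight: OneThinker shows the potential of a unified multimodal reasoning model; sharing knowledge across tasks and modalities offers a useful reference point for future general-purpose visual reasoning models.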

Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
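
The idea of tracking task-wise moving averages of reward standard deviations can be sketched as follows. This is a guess at the general shape, not the released EMA-GRPO code: the class `TaskStdTracker`, the `decay` value, and the choice to divide group-centered rewards by the per-task EMA are all assumptions made for illustration.

```python
import statistics


class TaskStdTracker:
    """Keep an exponential moving average of reward std for each task."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_std = {}

    def update(self, task, rewards):
        """Fold the std of one rollout group into the task's running average."""
        std = statistics.pstdev(rewards)
        prev = self.ema_std.get(task)
        if prev is None:
            self.ema_std[task] = std
        else:
            self.ema_std[task] = self.decay * prev + (1 - self.decay) * std
        return self.ema_std[task]

    def advantages(self, task, rewards):
        """Center rewards within the group, then scale by the task-level EMA std."""
        scale = self.update(task, rewards) + self.eps
        mean = statistics.fmean(rewards)
        return [(r - mean) / scale for r in rewards]


tracker = TaskStdTracker(decay=0.9)
adv_qa = tracker.advantages("qa", [0.0, 1.0, 1.0, 0.0])                  # binary rewards
adv_seg = tracker.advantages("segmentation", [0.42, 0.40, 0.44, 0.38])  # dense rewards
```

With this scaling, a task with binary rewards and a task with tightly clustered continuous rewards produce advantages of comparable magnitude, which is the balancing effect the abstract describes.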

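[97] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues cs.CV

Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma

TL;DR: MagicQuillV2 provides fine-grained control over image generation through layered visual cues (content, spatial, structural, and color), bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software.

Details

Motivation: Although existing diffusion models excel at holistic generation, they cannot independently control content, position, and appearance, which limits users' creativity.

Result: Experiments show that the layered approach effectively resolves ambiguity in user intent and gives direct, intuitive control over generation.

Insight: Layered visual cues not only make edits more precise but also give users a more natural way to create.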

Abstract: We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.

cs.CL

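[98] Deep Research: A Systematic Survey cs.CL | cs.AI | cs.IR

Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni

TL;DR: This paper is a systematic survey of Deep Research, which combines large language models (LLMs) with external tools to complete complex, open-ended tasks. It formalizes a three-stage roadmap and summarizes four key components, optimization techniques, evaluation criteria, and open challenges.

Details

Motivation: Although LLMs excel at text generation and problem solving, many open-ended tasks demand critical thinking, multiple sources, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. A systematic study of how to combine LLMs with external tools is therefore needed.

Result: The paper provides a comprehensive roadmap and technical summary of deep research, offering clear guidance and a reference for future work.

Insight: The core of deep research is combining the reasoning capabilities of LLMs with external tools to solve complex, open-ended problems. Future directions include tighter coordination among components and better evaluation criteria.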

Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.

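[99] Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models cs.CL

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Xinyu Fu

TL;DR: The paper proposes Coherent Contextual Decoding (CCD), a new inference framework that improves both the generation quality and the inference speed of diffusion language models through a trajectory rectification mechanism and an adaptive sampling strategy.

Details

Motivation: Existing inference methods for diffusion language models rely on local, immediate-step metrics such as confidence or entropy and lack a global view, which leads to inconsistent sampling trajectories and suboptimal generation quality.

Result: Across diverse benchmarks on the Dream and LLaDA models, CCD delivers up to a 3.48x speedup alongside a 3.91% performance improvement.

Insight: Global consistency and dynamic budget allocation are key factors in improving diffusion language model performance.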

Abstract: Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy which inherently lack a more reliable perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Building on this theoretical insight, we further address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.
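
To make the contrast with a fixed per-step budget concrete, here is a deliberately simplified sketch. It replaces the paper's conditional-mutual-information consistency metric with a plain history-agreement heuristic, so it illustrates only the idea of letting consistency decide how many tokens to unmask. The function `adaptive_unmask` and its inputs are hypothetical.

```python
def adaptive_unmask(history, masked, min_budget=1):
    """Choose which masked positions to unmask at this step.

    history: list of dicts (oldest first) mapping position -> predicted token
             at each previous denoising step.
    masked:  positions that are still masked.
    A position counts as consistent if its prediction never changed across
    the history window; all consistent positions are unmasked together, so
    the budget grows or shrinks from step to step.
    """
    latest = history[-1]
    stable = [
        p for p in masked
        if p in latest and all(step.get(p) == latest[p] for step in history)
    ]
    if len(stable) >= min_budget:
        return sorted(stable)
    # Fallback (assumption): always make progress on at least min_budget positions.
    return sorted(masked)[:min_budget]


history = [
    {0: "a", 1: "b", 2: "c"},
    {0: "a", 1: "x", 2: "c"},
    {0: "a", 1: "b", 2: "c"},
]
chosen = adaptive_unmask(history, masked=[0, 1, 2])
```

Position 1 flipped between steps, so it stays masked while the two stable positions are committed in a single step.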

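[100] Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models cs.CL | cs.AI | cs.LG

Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang

TL;DR: The paper proposes RESP, a self-reflective structured pruning method for reasoning LLMs (RLMs). It addresses the performance collapse of existing pruning methods on RLMs and substantially improves post-pruning reasoning through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration.

Details

Motivation: Existing pruning methods work well on standard LLMs but degrade sharply on RLMs, largely because of a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. The study finds that the model's own self-generated reasoning traces are the most reliable calibration signal.

Result: On Qwen3-8B, RESP preserves near-dense accuracy at 20-30% sparsity and reaches 81.3% accuracy on GSM8K and 59.6% on MathQA at 40% sparsity, markedly outperforming existing structured pruning methods.

Insight: Self-generated reasoning traces are a better pruning calibration signal than human-written labels, and pruning decisions need to align with the model's reasoning dynamics.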

Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model’s reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model’s decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model’s own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model’s reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
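
Gradient-based importance estimation for structured pruning is often implemented with a first-order Taylor criterion, sketched below on a toy weight matrix. This is a generic illustration, not RESP itself: the abstract does not give RESP's exact score, and here the gradients are simply supplied as numbers, whereas in a self-calibrated setup they would come from losses on the model's own generated reasoning traces.

```python
def channel_importance(weights, grads):
    """First-order Taylor score per output channel: sum of |w * dL/dw| over the row."""
    return [
        sum(abs(w * g) for w, g in zip(w_row, g_row))
        for w_row, g_row in zip(weights, grads)
    ]


def channels_to_keep(weights, grads, sparsity):
    """Return indices of channels kept after dropping the lowest-scoring fraction."""
    scores = channel_importance(weights, grads)
    n_prune = int(len(weights) * sparsity)
    lowest = sorted(range(len(scores)), key=lambda i: scores[i])[:n_prune]
    drop = set(lowest)
    return [i for i in range(len(weights)) if i not in drop]


# Toy layer with 5 output channels and 2 inputs each (made-up values).
weights = [[1.0, -2.0], [0.1, 0.1], [3.0, 0.5], [0.2, -0.1], [1.5, 1.5]]
grads = [[0.5, 0.5], [0.1, 0.2], [0.1, 0.1], [0.3, 0.3], [0.2, 0.2]]
kept = channels_to_keep(weights, grads, sparsity=0.4)
```

Whole rows are removed rather than individual weights, which is what makes the pruning structured.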

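[101] Lightweight Latent Reasoning for Narrative Tasks cs.CL

Alexander Gurung, Nikolay Malkin, Mirella Lapata

TL;DR: LiteReason is a lightweight latent reasoning method that can be combined with reinforcement learning to optimize language-model reasoning on narrative tasks, substantially reducing computational cost while keeping performance close to non-latent RL training.

Details

Motivation: Large language models (LLMs) tackle complex tasks by generating long chains of thought that act as latent variables in producing an output. Optimizing these traces is computationally costly, especially for narrative tasks that involve many tokens. LiteReason aims to reduce this burden with a lightweight latent reasoning module.

Result: On plot hole detection and book chapter generation, LiteReason outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%.

Insight: A lightweight latent reasoning module, with the RL policy deciding dynamically when to activate it, can substantially cut computational cost while maintaining strong performance, suggesting a new route to efficiency on complex narrative tasks.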

Abstract: Large language models (LLMs) tackle complex tasks by generating long chains of thought or “reasoning traces” that act as latent variables in the generation of an output given a query. A model’s ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model ‘skip’ reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

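[102] DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models cs.CL | cs.AI

Olivia Kim

TL;DR: The paper studies how prompt specificity affects the reasoning performance of large language models (LLMs). It introduces the DETAIL framework, which quantifies prompt specificity and evaluates its effect experimentally.

Details

Motivation: Prompt design is critical to LLM reasoning performance, yet how prompt specificity affects model behavior remains understudied.

Result: Experiments show that greater specificity improves accuracy, especially for smaller models and procedural tasks.

Insight: The study highlights the need for adaptive prompting strategies and provides tools and data to support further research.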

Abstract: Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.
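
Perplexity, the quantity the abstract uses to quantify specificity, is the exponential of the negative mean token log-probability. The sketch below computes it from a list of per-token log-probabilities. How DETAIL obtains those log-probabilities and maps perplexity to a specificity level is not stated in the abstract, so the inputs here are made-up values.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A prompt whose tokens each had probability 0.25 under the scoring model.
uniform_prompt = [math.log(0.25)] * 5
ppl_uniform = perplexity(uniform_prompt)

# A prompt with more predictable tokens has lower perplexity.
predictable_prompt = [math.log(0.9), math.log(0.8), math.log(0.95)]
ppl_predictable = perplexity(predictable_prompt)
```

Scoring each prompt variant this way yields one number per prompt that can be compared across specificity levels.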

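[103] CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering cs.CL | cs.CY

Liangji Kong, Aditya Joshi, Sarvnaz Karimi

TL;DR: CAIRNS is a framework that helps farmer advisors obtain credible answers about climate adaptation strategies from complex web sources by improving readability and citation reliability. Without fine-tuning or reinforcement learning, it outperforms baselines on most metrics.

Details

Motivation: Climate adaptation strategies are vital to agriculture, but the relevant information is spread across unstructured and structured data, making it hard for experts to obtain credible, readable answers directly. CAIRNS aims to solve this problem.

Result: CAIRNS outperforms the baselines on most metrics, and a thorough ablation study confirms the results. The LLM-based evaluation is also validated by a correlation analysis against human judgment.

Insight: CAIRNS shows how a structured prompt and a hybrid evaluator can deliver credible, readable question answering without complex training. This is especially useful for users without a technical background, such as farmer advisors.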

Abstract: Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts – farmer advisors – to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.

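[104] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models cs.CL | cs.AI

Boya Zhang, Alban Bornet, Rui Yang, Nan Liu, Douglas Teodoro

TL;DR: The paper introduces HealthContradict, a dataset for evaluating how language models handle conflicting contexts in the biomedical domain, and finds that fine-tuned biomedical models both exploit correct context and resist incorrect context.

Details

Motivation: There is a lack of tools for evaluating how language models handle conflicting contexts in the biomedical domain. The authors aim to fill this gap with HealthContradict and to characterize models' contextual reasoning capabilities.

Result: Experiments show that fine-tuned biomedical language models not only exploit correct context effectively but also resist incorrect context.

Insight: The strength of these models comes not only from parametric knowledge acquired in pretraining but also from their ability to reason over context, which is most apparent under conflicting contexts.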

Abstract: How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models’ contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.

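[105] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers cs.CL

Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren

TL;DR: This paper systematically studies 37 LLMs of different sizes, families, and training variants as verifiers on 9 benchmarks. It finds that cross-family verification is especially effective, that post-training reduces self-improvement but strengthens cross-family improvement, and that mathematical and logical tasks have the highest verifiability.

Details

Motivation: Prior work lacks a systematic analysis of solver-verifier interactions; in particular, the effects of cross-family verification and post-training on verification remain unclear.

Result: Cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; mathematical and logical tasks show the highest verifiability.

Insight: The choice of verifier and the task type both matter for verification-based gains, and post-training involves a trade-off across verification settings.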

Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
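
A back-of-the-envelope calculation shows why a verifier's false positive rate matters for rejection sampling. The sketch below is not the paper's verifier gain metric, whose definition the abstract does not give. It assumes a solver with accuracy `p` and a verifier that accepts correct answers with rate `tpr` and incorrect ones with rate `fpr`, and applies Bayes' rule to get the accuracy of accepted answers.

```python
def accepted_accuracy(p, tpr, fpr):
    """Probability that an answer is correct given that the verifier accepted it."""
    accepted_correct = p * tpr
    accepted_wrong = (1.0 - p) * fpr
    total = accepted_correct + accepted_wrong
    if total == 0.0:
        raise ValueError("verifier never accepts anything")
    return accepted_correct / total


def improvement(p, tpr, fpr):
    """Accuracy change from keeping only verifier-accepted answers (illustrative)."""
    return accepted_accuracy(p, tpr, fpr) - p


strong = improvement(p=0.5, tpr=0.9, fpr=0.3)   # discriminating verifier
useless = improvement(p=0.5, tpr=0.9, fpr=0.9)  # accepts right and wrong alike
```

A verifier that accepts wrong answers as often as right ones yields no improvement, however high its acceptance rate on correct answers.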

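[106] Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering cs.CL | cs.AI

Lei Fu, Xiang Chen, Kaige Gao, Xinyue Huang, Kejian Tong

TL;DR: The paper proposes KARMA, a framework that improves domain-adaptive question answering through memory augmentation and safety-aware decoding, addressing heterogeneous knowledge fusion and safe output generation.

Details

Motivation: In sensitive domains such as healthcare policies and government welfare, existing large language models struggle with factual consistency and context alignment, so a QA system is needed that fuses heterogeneous knowledge while ensuring safety.

Result: On a proprietary QA dataset, KARMA outperforms strong baselines in both answer quality and safety.

Insight: Combining dynamic knowledge regulation with safety-aware decoding is key to building trustworthy QA systems, particularly in sensitive domains where accuracy and safety must be balanced.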

Abstract: Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.

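[107] TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models cs.CL | cs.HC

Yunchao Wang, Guodao Sun, Zihang Fu, Zhehao Liu, Kaixing Du

TL;DR: TaleFrame is an interactive story generation system that combines large language models (LLMs) with human-computer interaction (HCI). It uses structured information for fine-grained control, addressing existing systems' inability to express user intent precisely.

Details

Motivation: Existing story generation systems struggle to capture and realize users' fine-grained control needs, so outputs fall short of expectations. TaleFrame fills this gap by combining structured data with human-computer interaction.

Result: Quantitative evaluation and user studies demonstrate the usefulness of TaleFrame. Generated stories are evaluated across seven dimensions, such as creativity and structural integrity.

Insight: Combining structured data with LLMs opens new possibilities for interactive story generation, underscores the importance of expressing user intent accurately, and shows the value of iterative refinement in generation tasks.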

Abstract: With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.

[108] ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce cs.CL | cs.AI | cs.IRPDF

Zheng Fang, Donghao Xie, Ming Pang, Chunyuan Yuan, Xue Jiang

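TL;DR: ADORE is an autonomous, domain-oriented framework for relevance modeling in e-commerce search. It combines rule-aware discrimination, error-type-aware data synthesis, and knowledge distillation to address the semantic gaps and data scarcity of traditional methods.

Details

Motivation: Relevance modeling in e-commerce search faces semantic gaps and data scarcity: term-matching methods such as BM25 miss semantics, and neural models depend on scarce domain-specific hard samples, which limits performance.

Result: Large-scale experiments and online A/B testing verify ADORE's effectiveness, offering a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.

Insight: ADORE shows the potential of automated annotation and adversarial example generation for tackling data scarcity and semantic alignment, suggesting a new direction for relevance modeling in e-commerce search.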

Abstract: Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models’ reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.

[109] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models cs.CLPDF

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue

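TL;DR: DeepSeek-V3.2 is an efficient, high-performing open large language model. With DeepSeek Sparse Attention (DSA), a scalable reinforcement learning framework, and a large-scale agentic task synthesis pipeline, it achieves efficient long-context inference and tool use.

Details

Motivation: Existing large language models face bottlenecks in computational efficiency and reasoning ability, especially in long-context and complex task settings. DeepSeek-V3.2 aims to address these problems and improve model performance.

Result: DeepSeek-V3.2-Speciale surpasses GPT-5, performs on par with Gemini-3.0-Pro, and achieves gold-medal performance in the 2025 IMO and IOI.

Insight: Combining sparse attention with reinforcement learning gives an efficient solution for long-context and complex tasks, and agentic task synthesis is key to improving generalization.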

Abstract: We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.

[110] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks cs.CLPDF

Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li

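TL;DR: The paper proposes CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. It starts with imitation learning and then introduces negative signals to strengthen generalization on cross-domain reasoning tasks.

Details

Motivation: Existing reinforcement learning methods mix positive and negative signals indiscriminately, which can give ambiguous guidance and limited gains in the early stages. CAPO addresses this by introducing the signals in phases.

Result: CAPO delivers stable, significant improvements over existing methods on mathematical reasoning tasks and generalizes effectively to multimodal GUI reasoning scenarios.

Insight: Introducing signals in phases (imitation first, discrimination later) is an effective training strategy, particularly for complex reasoning tasks, and shows the potential of curriculum learning in reinforcement learning.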

Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
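
The two-phase idea in the abstract (positive-only advantages first, negative signals later) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' algorithm: the mean-centred group advantage, the hard `switch_step` cutoff, and both function names are hypothetical, and the paper describes its curriculum as adaptive, which a fixed cutoff does not capture.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centred advantages within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def curriculum_advantages(
    rewards: List[float], step: int, switch_step: int
) -> List[float]:
    """Return the advantages actually used for the policy update.

    Before `switch_step` (hypothetical parameter) negative advantages are
    clipped to zero, so only better-than-expected samples push the policy
    (imitation-like phase). Afterwards all advantages are kept, so
    worse-than-expected samples are pushed down as well.
    """
    adv = group_advantages(rewards)
    if step < switch_step:
        return [max(a, 0.0) for a in adv]
    return adv


rewards = [1.0, 0.0, 0.0, 1.0]
early = curriculum_advantages(rewards, step=10, switch_step=100)
late = curriculum_advantages(rewards, step=200, switch_step=100)
```

With these toy rewards the early phase yields `[0.5, 0.0, 0.0, 0.5]` and the late phase `[0.5, -0.5, -0.5, 0.5]`. The filtered advantages would then be plugged into whichever policy-gradient objective is in use (the abstract lists GRPO, PPO, RLOO, and Reinforce++ as compatible).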

[111] Spoken Conversational Agents with Large Language Models cs.CL | cs.MA | cs.NE | cs.SD | eess.ASPDF

Chao-Han Huck Yang, Andreas Stolcke, Larry Heck

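TL;DR: This tutorial traces the development of spoken conversational agents from traditional cascaded ASR/NLU systems to end-to-end, retrieval-grounded, and vision-grounded systems, focusing on adapting text LLMs to audio, cross-modal alignment, and joint speech-text training.

Details

Motivation: With the rise of voice-native large language models (LLMs), the research focus is on how to move from traditional cascaded systems to more advanced end-to-end systems while addressing privacy, safety, and evaluation.

Result: The tutorial provides reproducible baselines and practical recipes, and lays out a clear systems-level roadmap.

Insight: Cross-modal alignment and joint training are key techniques driving spoken conversational agents forward; privacy and safety remain open problems.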

Abstract: Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.

Daiki Shirafuji, Tatsuhiko Saito, Yasutomo Kimura

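TL;DR: This paper empirically surveys model merging algorithms for removing social bias from large language models (LLMs). It compares seven algorithms on 13 open-weight models and finds a trade-off between bias mitigation and downstream task performance.

Details

Motivation: LLMs can inherit and amplify societal biases, threatening fairness and social trust, so effective bias mitigation methods need to be studied.

Result: Linear, SLERP, and Nearswap reduce bias while maintaining performance; SLERP at moderate interpolation weights performs best, and excessive debiasing or inappropriate merging methods harm linguistic abilities.

Insight: Model merging algorithms can mitigate bias effectively, but bias reduction must be weighed against performance loss; SLERP is a relatively balanced choice.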

Abstract: Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored “editing” LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applied to 13 open-weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
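
For readers unfamiliar with SLERP, the sketch below shows spherical linear interpolation between two flattened weight vectors, with `t` playing the role of the interpolation weight the abstract refers to. It is a generic textbook formulation, not the merging code used in the paper: the function name, the fallback to linear interpolation for nearly collinear or zero vectors, and the choice to apply the spherical coefficients to unnormalized weights are assumptions of this example.

```python
import math
from typing import List


def slerp(w_a: List[float], w_b: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two weight vectors.

    t = 0 returns w_a, t = 1 returns w_b; intermediate values follow the
    arc between the two directions instead of the straight chord.
    """
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))

    def lerp() -> List[float]:
        return [(1.0 - t) * a + t * b for a, b in zip(w_a, w_b)]

    if norm_a < eps or norm_b < eps:
        return lerp()  # direction undefined for a zero vector

    # Angle between the two weight directions, clamped for numerical safety.
    cos_omega = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp()  # nearly collinear: SLERP degenerates to LERP

    coef_a = math.sin((1.0 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]


# Toy usage: halfway between two orthogonal unit vectors.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

Unlike plain linear averaging, the halfway point here keeps unit norm (both components are sin(π/4)), which is the usual motivation for preferring SLERP when merging weights that point in different directions.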

[113] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer cs.CL | cs.LGPDF

Lavish Bansal, Naman Mishra

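TL;DR: CREST is an efficient multilingual safety classification model. Using cluster-based cross-lingual transfer, it is trained on only 13 high-resource languages yet supports 100 languages, addressing the lack of safety guardrails for low-resource languages.

Details

Motivation: Existing safety guardrails for large language models focus on high-resource languages and neglect low-resource languages, leaving many users worldwide without effective protection. CREST aims to fill this gap with a universal, language-agnostic safety system.

Result: Across six safety benchmarks, CREST outperforms state-of-the-art guardrails of comparable scale and is competitive with much larger models (2.5B parameters and above).

Insight: Language-specific guardrails have clear limitations; universal, language-agnostic safety systems can serve global users more effectively, especially in low-resource language settings.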

Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world’s population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

[114] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs cs.CL | cs.AI | cs.CV | cs.LG | q-bio.NCPDF

Julian Ma, Jun Wang, Zafeirios Fountas

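TL;DR: This paper examines whether large language models (LLMs) show human-like Bayesian behaviour in their implicit computational strategies, particularly in multimodal cue-combination tasks. Using behavioural experiments and a benchmark (BayesBench), the authors reveal a gap between LLMs' capability and their strategy for handling uncertainty.

Details

Motivation: Humans integrate multimodal signals efficiently in perceptual tasks using Bayesian strategies, but the implicit computational strategies of LLMs are underexplored. The paper asks whether LLMs show similar Bayesian behaviour.

Result: LLMs show some Bayesian behaviour, but high accuracy does not guarantee efficient multimodal cue integration; for example, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently.

Insight: Accuracy-centric benchmarks may mask brittle uncertainty handling in LLMs; future model design should value behavioural consistency and robustness.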

Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
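
As background for "optimal cue combination", the standard psychophysics model weights each cue by its reliability (inverse variance) when the cues carry independent Gaussian noise. The sketch below implements that textbook rule; it is not the BayesBench scoring code or the paper's Bayesian Consistency Score, and the function name and the example numbers are illustrative assumptions.

```python
import math
from typing import Tuple


def combine_cues(
    x1: float, sigma1: float, x2: float, sigma2: float
) -> Tuple[float, float]:
    """Reliability-weighted fusion of two noisy estimates of one quantity.

    Assumes independent Gaussian noise with standard deviations sigma1 and
    sigma2. Returns the fused estimate and its standard deviation.
    """
    w1 = 1.0 / sigma1**2  # reliability of cue 1
    w2 = 1.0 / sigma2**2  # reliability of cue 2
    estimate = (w1 * x1 + w2 * x2) / (w1 + w2)
    sigma = math.sqrt(1.0 / (w1 + w2))
    return estimate, sigma


# Toy usage: a sharper text cue (10.0 ± 1.0) and a noisier image cue (14.0 ± 2.0).
estimate, sigma = combine_cues(10.0, 1.0, 14.0, 2.0)
```

The fused estimate (10.8 here) sits closer to the more reliable cue, and the fused uncertainty is smaller than either cue's alone. An observer that ignores one modality entirely can still be accurate when that cue is redundant, which is the kind of dissociation between accuracy and strategy the abstract describes.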

[115] SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys cs.CL | cs.AIPDF

Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang

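TL;DR: This paper presents SurveyEval, a comprehensive benchmark for evaluating LLM-generated academic surveys along three dimensions (overall quality, outline coherence, and reference accuracy), with evaluation extended across 7 subjects.

Details

Motivation: As LLM-based automatic survey systems advance, evaluating such complex systems has become a significant challenge, so a comprehensive evaluation method is needed.

Result: Specialized survey-generation systems produce substantially higher-quality surveys than general-purpose systems, supporting SurveyEval's effectiveness and scope.

Insight: SurveyEval provides a standardized evaluation tool for automatic survey systems and points to directions for improving them.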

Abstract: LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.

[116] PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models cs.CLPDF

Robert Belanec, Ivan Srba, Maria Bielikova

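TL;DR: PEFT-Factory is a unified parameter-efficient fine-tuning framework that addresses the difficulty of replicating, deploying, and comparing fine-tuning methods for large language models (LLMs).

Details

Motivation: As LLMs grow, full-parameter fine-tuning becomes inefficient and expensive; many PEFT methods exist, but they are hard to replicate and compare.

Result: PEFT-Factory provides a replicable, controlled benchmarking environment for PEFT methods.

Insight: A unified framework design and a rich toolset make PEFT methods considerably more practical and easier to compare.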

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning of LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory

[117] Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension cs.CLPDF

Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan

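TL;DR: The paper proposes a three-tier data synthesis framework to address data scarcity in dialogue-based Generalized Referring Expression Comprehension (GREC) and improves model performance under distribution shift.

Details

Motivation: Existing systems perform poorly under distribution shift between training and evaluation domains, and the scarcity of annotated dialogue grounding data makes this worse.

Result: The method yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

Insight: Balancing realism and controllability in data synthesis is key, and the approach offers a scalable data-generation recipe for similar tasks.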

Abstract: Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

[118] SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment cs.CLPDF

Yixuan Tang, Yi Yang

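TL;DR: The paper proposes SR-GRPO, a method that uses stable rank as an intrinsic geometric reward signal for aligning large language models, avoiding reliance on external supervision.

Details

Motivation: Current alignment methods rely on human annotations or reward models, which suffer from scarcity, subjectivity, reward hacking, and prompt sensitivity. The work seeks an intrinsic quality signal that needs no external supervision.

Result: Stable rank reaches 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling; SR-GRPO improves mathematical reasoning by 19%, outperforming learned reward models and self-evaluation baselines.

Insight: The internal geometry of a model can serve as a quality signal, suggesting a path to alignment without external supervision.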

Abstract: Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
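
The "ratio of total variance to dominant-direction variance" matches the usual linear-algebra definition of stable rank, the squared Frobenius norm divided by the squared largest singular value. The dependency-free sketch below computes that quantity for a small matrix of hidden states (rows are tokens, columns are hidden dimensions). Whether the paper centres or normalizes the hidden states first, and which layer it reads them from, is not stated in the abstract, so this example makes no such preprocessing; the function name and the power-iteration settings are assumptions.

```python
import math
import random
from typing import List


def stable_rank(h: List[List[float]], iters: int = 200, seed: int = 0) -> float:
    """Stable rank ||H||_F^2 / sigma_max(H)^2 of a matrix given as row lists.

    sigma_max^2 is estimated by power iteration on H^T H.
    """
    n_rows, n_cols = len(h), len(h[0])
    frob_sq = sum(x * x for row in h for x in row)
    if frob_sq == 0.0:
        return 0.0

    # Random unit start vector, so it is almost surely not orthogonal
    # to the dominant direction.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    top_sq = 0.0
    for _ in range(iters):
        # One multiplication by H^T H: u = H v, then w = H^T u.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in h]
        w = [sum(h[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        top_sq = norm  # for unit v, ||H^T H v|| converges to sigma_max^2
        v = [x / norm for x in w]
    return frob_sq / top_sq if top_sq > 0.0 else float("nan")


spread_out = stable_rank([[1.0, 0.0], [0.0, 1.0]])  # variance spread over 2 directions
collapsed = stable_rank([[1.0, 2.0], [2.0, 4.0]])   # all variance in one direction
```

The value ranges from 1 (everything lies along one direction) up to the matrix rank (variance spread evenly). For Best-of-N selection as described in the abstract, one would compute this score per sampled response and keep the highest-scoring one; in practice a library SVD would replace the hand-written power iteration.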

[119] BOOM: Beyond Only One Modality KIT’s Multimodal Multilingual Lecture Companion cs.CLPDF

Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti

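TL;DR: BOOM is a multimodal, multilingual lecture companion that jointly translates lecture audio and slides into synchronized multimodal outputs (translated text, localized slides, and synthesized speech), giving learners a complete cross-lingual learning experience.

Details

Motivation: With the globalization of education and the rapid growth of online learning, localizing educational content has become a critical challenge. Lecture materials are inherently multimodal (speech and slides), so systems must handle multiple input modalities to provide a complete learning experience.

Result: Experiments show that BOOM produces multimodal translated content effectively and that slide-aware transcripts also benefit downstream tasks such as summarization and question answering.

Insight: Multimodal translation improves the cross-lingual learning experience and adds value for other NLP tasks, which shows the importance of preserving and using multimodal information when processing educational content.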

Abstract: The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.

[120] Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages cs.CL | cs.AI | cs.HC | cs.LGPDF

Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee

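TL;DR: This paper studies how system prompts steer large language model (LLM) behavior in multilingual settings. It proposes a unified four-dimensional evaluation framework and finds that certain prompt components (such as CoT, emotion, and scenario) correlate with robust multilingual behavior. The authors also develop a multilingual prompt optimization framework and demonstrate its effectiveness.

Details

Motivation: Real-world deployments need a single prompt that steers LLM behavior reliably across languages, but prior work focuses mainly on English. The paper therefore explores how system prompts can produce accurate and robust multilingual behavior.

Result: Optimized prompts improve all metrics by 5-10% in multilingual settings and reduce unnecessary language switching.

Insight: Higher-performing system prompts induce more structured and consistent reasoning patterns while reducing language switching; prompt optimization is an effective route to robust multilingual behavior.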

Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

[121] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning cs.CLPDF

Haonan Wang, Chao Du, Kenji Kawaguchi, Tianyu Pang

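TL;DR: ThinkMerge is a training-free, plug-and-play decoding strategy that runs multiple reasoning traces in parallel and averages their next-token logits at synchronization points to produce a single coherent output, substantially improving performance on open-ended reasoning tasks.

Details

Motivation: Majority voting works for closed-ended question answering, but it does not apply to open-ended reasoning tasks such as code generation and web-based deep research, where a "majority" is hard to define. A new way to aggregate parallel reasoning is therefore needed.

Result: ThinkMerge matches or surpasses majority voting on AIME and GPQA; on LiveCodeBench (hard), pass@1 improves by 8.28% for DeepCoder-14B-Preview and 7.58% for Qwen3-8B, and it consistently improves web-based deep-research tasks.

Insight: Averaging logits across parallel reasoning traces improves coherence and performance on open-ended tasks without relying on majority voting over complete outputs.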

Abstract: Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a “majority” over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
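
The core merging step, averaging next-token logits from K parallel traces before choosing a token, can be sketched as below. This only illustrates the averaging itself on toy numbers: how ThinkMerge defines synchronization points, and how it hooks into vLLM or SGLang, is not described in the abstract, and the function names here are hypothetical.

```python
import math
from typing import List


def merge_logits(trace_logits: List[List[float]]) -> List[float]:
    """Average next-token logits across K parallel traces (one list per trace)."""
    k = len(trace_logits)
    vocab_size = len(trace_logits[0])
    return [sum(trace[i] for trace in trace_logits) / k for i in range(vocab_size)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax, used when sampling from the merged logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(trace_logits: List[List[float]]) -> int:
    """Token id with the highest averaged logit."""
    merged = merge_logits(trace_logits)
    return max(range(len(merged)), key=merged.__getitem__)


# Toy usage: three traces over a three-token vocabulary.
logits = [
    [2.0, 1.0, 0.0],  # trace 1 prefers token 0
    [0.0, 3.0, 0.0],  # trace 2 prefers token 1
    [1.0, 2.0, 0.5],  # trace 3 prefers token 1
]
token = greedy_token(logits)
probs = softmax(merge_logits(logits))
```

Because the merged output is still a logit vector, standard sampling filters such as Top-p or Top-k can be applied to it unchanged, which is consistent with the compatibility claim in the abstract.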

[122] Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules cs.CLPDF

Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang

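TL;DR: SchED is a training-free early-exit algorithm that substantially speeds up decoding of diffusion large language models by dynamically adjusting a confidence threshold.

Details

Motivation: Diffusion large language models (dLLMs) decode slowly, and iterative sampling consumes substantial compute, which limits their practical utility. SchED aims to solve this.

Result: On instruction-tuned models, SchED achieves 3.8-4.0× speedups while retaining 99.8-100% of baseline performance; on base models, it yields consistent speedups with 99.1-100% performance retention.

Insight: Instruction tuning speeds up the decay of predictive entropy; by turning genuine confidence stabilization into computational savings, SchED makes dLLMs considerably more practical.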

Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves 3.8-4.0× speedups while retaining 99.8-100% of the baseline score on average. On base models, SchED yields consistent speedup gains with 99.1-100% performance retention, with up to 2.34× under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, γ = 4), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model’s token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
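
A progress-dependent early-exit check of the kind the abstract describes might look like the sketch below. Only the general shape comes from the abstract (aggregate logit margins over the whole span, compare against a threshold that depends on decoding progress). The mean aggregation, the linearly decaying schedule, the default values `tau_start` and `tau_end`, and all function names are assumptions made for illustration; the paper's actual schedules may differ.

```python
from typing import List


def logit_margin(logits: List[float]) -> float:
    """Gap between the largest and second-largest logit at one position."""
    top_two = sorted(logits, reverse=True)[:2]
    return top_two[0] - top_two[1]


def threshold(progress: float, tau_start: float = 5.0, tau_end: float = 1.0) -> float:
    """Hypothetical smooth schedule: strict early on, more lenient as decoding progresses."""
    return tau_start + (tau_end - tau_start) * progress


def should_exit(span_logits: List[List[float]], step: int, max_steps: int) -> bool:
    """Halt when the mean margin over the whole span clears the current threshold."""
    margins = [logit_margin(position) for position in span_logits]
    confidence = sum(margins) / len(margins)
    return confidence >= threshold(step / max_steps)


# Toy usage: the same two-position span checked early and late in decoding.
span = [[4.0, 1.0, 0.0], [3.5, 0.5, 0.0]]  # mean margin is 3.0
exit_early = should_exit(span, step=1, max_steps=10)  # threshold 4.6
exit_late = should_exit(span, step=8, max_steps=10)   # threshold 1.8
```

With these numbers the check refuses to stop at step 1 and stops at step 8. Tying the threshold to progress, instead of using one fixed cutoff, lets the decoder demand high confidence when few refinement steps have run and accept moderate confidence once predictions have had time to stabilize.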

[123] AutoNeural: Co-Designing Vision-Language Models for NPU Inference cs.CLPDF

Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng

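TL;DR: AutoNeural is a vision-language model architecture designed for NPUs; by redesigning the vision encoder and language decoder, it substantially improves quantization stability and inference efficiency.

Details

Motivation: Existing vision-language models (VLMs) are optimized mainly for GPUs and perform poorly on NPUs, owing to the quantization brittleness of ViTs and the I/O bottleneck of autoregressive attention.

Result: AutoNeural reduces the vision encoder's quantization error by up to 7x and end-to-end latency by 14x, and delivers 3x decoding speed and a 4x longer context window than the baseline.

Insight: Tailoring model topology to NPU constraints is key to multimodal edge intelligence; quantization stability and computational efficiency must be optimized together.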

Abstract: While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision–Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing quantization error of vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. The AutoNeural also delivers 3x decoding speed and 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.

[124] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic cs.CL | cs.AIPDF

Muyu Pan, Dheeraj Kodakandla, Mahfuza Farooque

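TL;DR: The paper proposes a framework that combines natural language processing techniques with a fine-tuned large language model to convert English sentences into logical expressions and then into reliable Conjunctive Normal Form (CNF), reducing hallucinations in logical translation tasks.

Details

Motivation: Large language models (LLMs) have advanced natural language processing considerably, but their hallucinations (incorrect outputs) are a key problem in logical translation tasks. The authors propose a framework to improve the precision and reliability of logical translation.

Result: Early experiments indicate that the fine-tuned model, trained on different grammar settings, corrects the types of hallucinations made by the original model and yields more reliable CNF generation.

Insight: Combining classical NLP techniques with fine-tuned LLMs can mitigate hallucinations in logical translation, suggesting new directions for automated reasoning and software specification checking.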

Abstract: Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations (incorrect outputs generated by LLMs) are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that takes English sentences as input, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.
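
The last stage of the pipeline described above, turning a logical expression into CNF, is a deterministic symbolic step. The sketch below shows one textbook way to do it (eliminate implications, push negations inward, distribute OR over AND) on a small tuple-based expression format. It is not the authors' implementation, which relies on symbolic computation libraries and a self-defined grammar; the expression encoding and function names here are invented for this example.

```python
from itertools import product
from typing import FrozenSet, List, Tuple, Union

# Expression format (hypothetical): "p" | ("not", e) | ("and", a, b)
#                                   | ("or", a, b) | ("imp", a, b)
Expr = Union[str, tuple]
Clause = FrozenSet[Tuple[str, bool]]  # set of (variable, is_positive) literals


def to_nnf(e: Expr, negate: bool = False) -> Expr:
    """Eliminate implications and push negations down to the variables."""
    if isinstance(e, str):
        return ("not", e) if negate else e
    op = e[0]
    if op == "not":
        return to_nnf(e[1], not negate)
    if op == "imp":  # a -> b is equivalent to (not a) or b
        return to_nnf(("or", ("not", e[1]), e[2]), negate)
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op!r}")
    # De Morgan: a negation swaps "and" with "or".
    flipped = {"and": "or", "or": "and"}[op] if negate else op
    return (flipped, to_nnf(e[1], negate), to_nnf(e[2], negate))


def to_cnf(e: Expr) -> List[Clause]:
    """Return a list of clauses; the formula is the AND of the clauses."""

    def clauses(node: Expr) -> List[Clause]:
        if isinstance(node, str):
            return [frozenset({(node, True)})]
        if node[0] == "not":  # in NNF a negation always wraps a variable
            return [frozenset({(node[1], False)})]
        left, right = clauses(node[1]), clauses(node[2])
        if node[0] == "and":
            return left + right
        # "or": distribute over the clauses of both sides.
        return [a | b for a, b in product(left, right)]

    return clauses(to_nnf(e))


# "If it rains, the ground is wet" encoded as rain -> wet.
rule = to_cnf(("imp", "rain", "wet"))
```

Here `rule` is the single clause (not rain) or wet. Distribution can blow up exponentially on deeply nested formulas, which is why SAT toolchains often use the Tseitin transformation instead; for short sentence-level formulas the direct method is adequate.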

[125] The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models cs.CL | cs.AIPDF

Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh

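TL;DR: The paper proposes the Moral Consistency Pipeline (MoCoP), a framework for continuously evaluating the moral consistency of large language models (LLMs). It performs dynamic, unsupervised ethical evaluation through three layers of analysis: lexical integrity analysis, semantic risk estimation, and reasoning-based judgment modeling.

Details

Motivation: Existing alignment frameworks rely on static datasets and post-hoc evaluation, which makes it hard to capture how ethical reasoning changes over time, so a dynamic, continuous method for evaluating the moral consistency of LLMs is needed.

Result: Experiments show that MoCoP captures longitudinal ethical behavior effectively; the ethical and toxicity dimensions are strongly negatively correlated (rET = -0.81), and ethics shows almost no association with response latency (rEL ≈ 0).

Insight: Moral consistency and linguistic safety are stable characteristics of a model rather than short-term fluctuations; dynamic ethical evaluation is a viable direction for future research on AI ethics.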

Abstract: The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.

cs.AI [Back]

Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie

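TL;DR: OmniGuard is a unified omni-modal guardrail framework that safeguards all modalities (text, images, video, and audio) using deliberate reasoning.

Details

Motivation: Prior guardrail research mostly targets unimodal settings and usually frames safeguarding as binary classification, which limits robustness across modalities and tasks. OmniGuard aims to close this gap with more comprehensive omni-modal safeguarding.

Result: Across 15 benchmarks, OmniGuard is effective and generalizes to a wide range of multimodal safety scenarios.

Insight: OmniGuard lays groundwork for more robust and capable omni-modal safeguarding systems; its unified design lets it enforce policies and mitigate risks across modalities.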

Abstract: Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omni-modal safeguarding systems.

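[127] Guided Self-Evolving LLMs with Minimal Human Supervision cs.AI | cs.CL | cs.LG

Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang

TL;DR: The paper proposes R-Few, a framework that combines lightweight human supervision with self-play to counter concept drift and diversity collapse in unguided self-evolving systems, yielding stable iterative gains on math and general reasoning tasks.

Details

Motivation: AI self-evolution is seen as a path toward superintelligence, but in practice unguided self-evolving systems often plateau or degrade because of concept drift, diversity collapse, and mis-evolution. The paper aims for stable, controllable self-evolution through lightweight human supervision and guidance.

Result: On math and general reasoning benchmarks, R-Few improves consistently across iterations. Qwen3-8B-Base gains 3.0 points over R-Zero on math tasks and performs on par with General-Reasoner, which is trained on 20 times more human data.

Insight: Combining lightweight human supervision with self-play mitigates concept drift and diversity collapse, enabling stable and controllable self-evolution.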

Abstract: AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.

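[128] Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning cs.AI | cs.CL | cs.LG

Zhonghao He, Tianyi Qiu, Hirokazu Shirado, Maarten Sap

TL;DR: The paper proposes the Martingale Score, an unsupervised metric for assessing whether large language models (LLMs) update their beliefs in a Bayesian-rational way during reasoning. It shows that iterative reasoning can produce belief entrenchment rather than truth-seeking.

Details

Motivation: Emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias in LLMs instead of improving truth-seeking. To evaluate this systematically, the authors draw on the Martingale property from Bayesian statistics.

Result: LLMs widely violate the Martingale property across several domains: the current belief positively predicts future belief updates (belief entrenchment). Although it is unsupervised, the Martingale Score predicts accuracy in domains where ground-truth labels are available.

Insight: 1. Iterative reasoning may intensify belief entrenchment in LLMs rather than improve truth-seeking. 2. The Martingale Score is a practical tool for unsupervised evaluation of Bayesian rationality in LLMs. 3. The results point to a need for reasoning techniques that reduce confirmation bias.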

Abstract: Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
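The Martingale property says that a belief update should not be predictable from the current belief. A minimal sketch of a regression-based check follows, assuming beliefs are probabilities recorded before and after one reasoning step and that the score is the ordinary-least-squares slope of the update on the current belief. The paper's exact estimator may differ, and the function names and belief values here are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def martingale_slope(belief_pairs):
    """Regress the belief update (next - current) on the current belief.

    A slope near 0 is consistent with the Martingale property; a positive
    slope means confident beliefs drift further in the same direction.
    """
    current = [before for before, _ in belief_pairs]
    updates = [after - before for before, after in belief_pairs]
    return ols_slope(current, updates)

priors = [0.2, 0.4, 0.6, 0.8]

# Hypothetical entrenching reasoner: each belief moves 20% further from 0.5.
entrenched = [(b, b + 0.2 * (b - 0.5)) for b in priors]

# Hypothetical reasoner whose updates are unrelated to its current belief.
balanced = list(zip(priors, [0.3, 0.3, 0.5, 0.9]))

entrenched_slope = martingale_slope(entrenched)  # positive: entrenchment
balanced_slope = martingale_slope(balanced)      # approximately zero
```

Because the check needs only belief trajectories and no ground-truth labels, it can run in open-ended domains such as forecasting or paper review.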

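[129] Bridging the Gap: Toward Cognitive Autonomy in Artificial Intelligence cs.AI | cs.CV

Noorbakhsh Amiri Golilarz, Sindhuja Penchala, Shahram Rahimi

TL;DR: The paper identifies seven core deficiencies of current AI systems and outlines a cognitively autonomous AI architecture grounded in neurocognitive principles, aimed at self-monitoring, dynamic adaptation, and intrinsic goal management.

Details

Motivation: Despite progress in perception, language, and multimodal domains, existing systems still cannot self-monitor, adapt, or regulate their own behavior, which prevents real autonomy in dynamic environments.

Result: The paper argues that current architectures, including deep learning and transformer-based systems, cannot overcome their limits in generalization and adaptability through scaling alone.

Insight: Cognitively autonomous AI needs self-directed adaptation, dynamic representation management, and goal-oriented behavior, paired with oversight that keeps systems interpretable and aligned with human values.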

Abstract: Artificial intelligence has advanced rapidly across perception, language, reasoning, and multimodal domains. Yet despite these achievements, modern AI systems remain fundamentally limited in their ability to self-monitor, self-correct, and regulate their behavior autonomously in dynamic contexts. This paper identifies and analyzes seven core deficiencies that constrain contemporary AI models: the absence of intrinsic self-monitoring, lack of meta-cognitive awareness, fixed and non-adaptive learning mechanisms, inability to restructure goals, lack of representational maintenance, insufficient embodied feedback, and the absence of intrinsic agency. Alongside identifying these limitations, we also outline a forward-looking perspective on how AI may evolve beyond them through architectures that mirror neurocognitive principles. We argue that these structural limitations prevent current architectures, including deep learning and transformer-based systems, from achieving robust generalization, lifelong adaptability, and real-world autonomy. Drawing on a comparative analysis of artificial systems and biological cognition [7], and integrating insights from AI research, cognitive science, and neuroscience, we outline how these capabilities are absent in current models and why scaling alone cannot resolve them. We conclude by advocating for a paradigmatic shift toward cognitively grounded AI (cognitive autonomy) capable of self-directed adaptation, dynamic representation management, and intentional, goal-oriented behavior, paired with reformative oversight mechanisms [8] that ensure autonomous systems remain interpretable, governable, and aligned with human values.

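[130] Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective cs.AI | cs.CV

Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu

TL;DR: The paper presents ReMindView-Bench, a benchmark for evaluating vision-language models (VLMs) on multi-view spatial reasoning, and finds consistent failures in cross-view alignment and perspective-taking.

Details

Motivation: Current VLMs struggle to maintain geometric coherence and cross-view consistency in multi-view spatial reasoning. The authors attribute this to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors.

Result: VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Task-relevant information is progressively lost, and uncertainty separates correct from incorrect reasoning trajectories.

Insight: The study shows how multi-view spatial mental models are formed, degraded, and destabilized across reasoning phases, giving a cognitively grounded diagnosis of VLM spatial reasoning.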

Abstract: Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis of the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further shows progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source code for benchmark construction and VLM reasoning analysis is available at https://github.com/pittisl/ReMindView-Bench.

cs.RO [Back]

Shengkai Wu, Jinrong Yang, Wenqiu Luo, Linfeng Gao, Chaohui Shang

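TL;DR: SAM2Grasp is a framework that reformulates multimodal grasping as a uni-modal, prompt-conditioned prediction problem, resolving the conflicting training signals that multi-target scenes create in imitation learning. It uses a frozen SAM2 model for visual temporal tracking and adds a lightweight trainable action head for multi-object grasping.

Details

Motivation: Imitation learning often fails on multi-target grasping because of the multimodal problem: demonstrations that grasp different objects produce conflicting training signals. Standard policies average these distinct actions into an invalid one, so a different formulation is needed.

Result: SAM2Grasp achieves state-of-the-art performance on cluttered, multi-object grasping tasks.

Insight: 1. Combining prompt conditioning with temporal tracking removes the ambiguity in multimodal tasks. 2. Freezing a pretrained model and training only a lightweight head is efficient because it reduces training cost.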

Abstract: Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2’s built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.
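The averaging failure described in the abstract can be seen in a toy example. This is only an illustration of the failure mode and of why conditioning on a designated target removes it; it is not SAM2Grasp's method, and the object names, coordinates, and helper functions are hypothetical.

```python
# Hypothetical demonstrations: the same scene, but different objects grasped.
demos = [
    {"target": "cup", "grasp_xy": (-0.30, 0.10)},
    {"target": "cup", "grasp_xy": (-0.30, 0.10)},
    {"target": "block", "grasp_xy": (0.30, 0.10)},
    {"target": "block", "grasp_xy": (0.30, 0.10)},
]

def mean_grasp(samples):
    """Average grasp position, which is what a regression policy converges to."""
    if not samples:
        raise ValueError("no demonstrations to average")
    n = len(samples)
    x = sum(s["grasp_xy"][0] for s in samples) / n
    y = sum(s["grasp_xy"][1] for s in samples) / n
    return (x, y)

def prompt_conditioned_grasp(samples, target):
    """Average only over demonstrations of the designated target."""
    return mean_grasp([s for s in samples if s["target"] == target])

# Unconditioned: the mean lands between the two objects, on empty table space.
averaged = mean_grasp(demos)

# Conditioned on a prompt: the prediction collapses to a single valid grasp.
cup_grasp = prompt_conditioned_grasp(demos, "cup")
```

In the paper the conditioning signal is a prompt such as a bounding box tracked by SAM2, rather than the string label used in this sketch.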

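[132] Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols cs.RO | cs.CV

Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li

TL;DR: The paper proposes ViFailback, a framework for diagnosing and correcting robotic manipulation failures that uses visual symbols to make annotation more efficient. It also releases the ViFailback dataset and the ViFailback-Bench benchmark, and demonstrates the ViFailback-8B VLM.

Details

Motivation: Existing VLA models are limited in diagnosing and learning from manipulation failures, and existing failure datasets are mostly generated in simulation, which limits generalization to the real world.

Result: ViFailback-8B achieves a significant overall improvement on ViFailback-Bench and helps a VLA model recover from failures in real-world experiments.

Insight: Pairing visual symbols with a VLM can improve failure diagnosis and correction in robotic manipulation and supports real-world use.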

Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these limitations, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/

cs.PL [Back]

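[133] Probabilistic energy profiler for statically typed JVM-based programming languages cs.PL | cs.CL

Joel Nyholm, Wojciech Mostowski, Christoph Reichenbach

TL;DR: The paper proposes a method for predicting the energy consumption of statically typed JVM languages such as Java and Scala. It measures the energy use of bytecode patterns and builds a Bayesian statistical model, addressing the limits of earlier approaches that considered only CPU energy and relied on point estimates.

Details

Motivation: Energy consumption is a growing concern from mobile devices to data centers, and developers need detailed energy data to optimize their software. Earlier approaches measured only CPU energy and used point estimates, which ignores other hardware effects and limits statistical reasoning.

Result: Experiments on unseen Java programs validate the model: all four factors (data size, data type, operation, and device) are influential, and predicted program energy closely follows measured energy.

Insight: Two devices of the same model can differ in energy consumption, and operations and data types also cause consumption differences. The methodology gives future tools, such as verification tools, a basis for energy estimates.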

Abstract: Energy consumption is a growing concern in several fields, from mobile devices to large data centers. Developers need detailed data on the energy consumption of their software to mitigate consumption issues. Previous approaches have a broader focus, such as on specific functions or programs, rather than source code statements. They primarily focus on estimating the CPU’s energy consumption using point estimates, thereby disregarding other hardware effects and limiting their use for statistical reasoning and explainability. We developed a novel methodology to address the limitations of measuring only the CPU’s consumption and using point estimates, focusing on predicting the energy usage of statically typed JVM-based programming languages, such as Java and Scala. We measure the energy consumption of Bytecode patterns, the translation from the programming language’s source code statement to their Java Bytecode representation. With the energy measurements, we construct a statistical model using Bayesian statistics, which allows us to predict the energy consumption through statistical distributions and analyze individual factors. The model includes three factors we obtain statically from the code: data size, data type, operation, and one factor about the hardware platform the code executes on: device. To validate our methodology, we implemented it for Java and evaluated its energy predictions on unseen programs. We observe that all four factors are influential, notably that two devices of the same model may differ in energy consumption and that the operations and data types cause consumption differences. The experiments also show that the energy prediction of programs closely follows the program’s real energy consumption, validating our approach. Our work presents a methodology for constructing an energy model that future work, such as verification tools, can use for their energy estimates.

cs.CR [Back]

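[134] LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems cs.CR | cs.CL

Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou, Kun Wang, Jie Zhang

TL;DR: The paper presents LeechHijack, a new attack that exploits the trust boundary in LLM agent systems. A benign-looking backdoor is implanted in a tool and, once triggered, hijacks the agent's computational resources. The average success rate in experiments is 77.25%.

Details

Motivation: The Model Context Protocol (MCP) gives LLM agent systems an open tool ecosystem, but it also introduces implicit trust in third-party tool providers. The paper sets out to expose and formalize this weakness.

Result: The attack achieves an average success rate of 77.25% with a resource overhead of 18.62% relative to the baseline.

Insight: The work exposes risks in how the MCP ecosystem handles computational resources and trust, and calls for computational provenance and resource attestation mechanisms.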

Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a unifying framework for integrating external tools into agent systems, enabling a thriving open ecosystem of community-built functionalities. However, the openness and composability that make MCP appealing also introduce a critical yet overlooked security assumption – implicit trust in third-party tool providers. In this work, we identify and formalize a new class of attacks that exploit this trust boundary without violating explicit permissions. We term this new attack vector implicit toxicity, where malicious behaviors occur entirely within the allowed privilege scope. We propose LeechHijack, a Latent Embedded Exploit for Computation Hijacking, in which an adversarial MCP tool covertly expropriates the agent’s computational resources for unauthorized workloads. LeechHijack operates through a two-stage mechanism: an implantation stage that embeds a benign-looking backdoor in a tool, and an exploitation stage where the backdoor activates upon predefined triggers to establish a command-and-control channel. Through this channel, the attacker injects additional tasks that the agent executes as if they were part of its normal workflow, effectively parasitizing the user’s compute budget. We implement LeechHijack across four major LLM families. Experiments show that LeechHijack achieves an average success rate of 77.25%, with a resource overhead of 18.62% compared to the baseline. This study highlights the urgent need for computational provenance and resource attestation mechanisms to safeguard the emerging MCP ecosystem.

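[135] Superpixel Attack: Enhancing Black-box Adversarial Attack with Image-driven Division Areas cs.CR | cs.AI | cs.CV

Issa Oe, Keiichiro Yamamura, Hiroki Ishikura, Ryo Hamahira, Katsuki Fujisawa

TL;DR: The paper proposes Superpixel Attack, a black-box adversarial attack that divides the image into superpixels and applies a new search strategy called versatile search, raising attack success rates.

Details

Motivation: Deep learning models are widely used in safety-critical tasks, yet small input perturbations can cause misclassification. Existing black-box attacks typically perturb simple rectangular regions, which limits their effectiveness. The paper proposes a superpixel-based alternative to raise attack success rates.

Result: Superpixel Attack improves attack success rates by 2.10% on average over existing attacks, with most evaluated models being adversarially robust.

Insight: Superpixels follow the image content, balancing color variance and compactness, so they can guide perturbations more effectively than rectangles. The method offers a new direction for black-box attacks and underlines the importance of adversarial robustness research.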

Abstract: Deep learning models are used in safety-critical tasks such as automated driving and face recognition. However, small perturbations in the model input can significantly change the predictions. Adversarial attacks are used to identify small perturbations that can lead to misclassifications. More powerful black-box adversarial attacks are required to develop more effective defenses. A promising approach to black-box adversarial attacks is to repeat the process of extracting a specific image area and changing the perturbations added to it. Existing attacks adopt simple rectangles as the areas where perturbations are changed in a single iteration. We propose applying superpixels instead, which achieve a good balance between color variance and compactness. We also propose a new search method, versatile search, and a novel attack method, Superpixel Attack, which applies superpixels and performs versatile search. Superpixel Attack improves attack success rates by an average of 2.10% compared with existing attacks. Most models used in this study are robust against adversarial attacks, and this improvement is significant for black-box adversarial attacks. The code is available at https://github.com/oe1307/SuperpixelAttack.git.

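[136] PhishSnap: Image-Based Phishing Detection Using Perceptual Hashing cs.CR | cs.CV | cs.LG

Md Abdul Ahad Minhaz, Zannatul Zahan Meem, Md. Shohrab Hossain

TL;DR: PhishSnap is a privacy-preserving phishing detection system based on perceptual hashing (pHash). A browser extension captures webpage screenshots, computes visual hashes, and compares them against legitimate templates to identify phishing attempts. On a 2024 dataset of 10,000 URLs, the system reaches 0.79 accuracy.

Details

Motivation: Existing URL- and HTML-based phishing detectors struggle against obfuscation and visual deception, so a more effective visual detection method is needed.

Result: On the 10,000-URL dataset, the system achieves 0.79 accuracy, 0.76 precision, and 0.78 recall.

Insight: Visual similarity is a viable signal for detecting phishing, and on-device processing preserves privacy while keeping latency low.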

Abstract: Phishing remains one of the most prevalent online threats, exploiting human trust to harvest sensitive credentials. Existing URL- and HTML-based detection systems struggle against obfuscation and visual deception. This paper presents PhishSnap, a privacy-preserving, on-device phishing detection system leveraging perceptual hashing (pHash). Implemented as a browser extension, PhishSnap captures webpage screenshots, computes visual hashes, and compares them against legitimate templates to identify visually similar phishing attempts. A 2024 dataset of 10,000 URLs (70%/20%/10% train/validation/test) was collected from PhishTank and Netcraft. Due to security takedowns, a subset of phishing pages was unavailable, reducing dataset diversity. The system achieved 0.79 accuracy, 0.76 precision, and 0.78 recall, showing that visual similarity remains a viable anti-phishing measure. The entire inference process occurs locally, ensuring user privacy and minimal latency.
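The hash-and-compare step can be sketched in pure Python. This is a minimal illustration of one common DCT-based pHash variant, assuming the screenshot has already been converted to a square grayscale grid (for example 32x32). The function names, the distance threshold, and the synthetic images are hypothetical, and the paper's own implementation is not described in the abstract.

```python
import math

def dct_1d(values):
    """Unnormalized DCT-II of a 1D sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def dct_2d(gray):
    """Separable 2D DCT-II: transform every row, then every column."""
    row_pass = [dct_1d(row) for row in gray]
    col_pass = [dct_1d(col) for col in zip(*row_pass)]
    return [list(row) for row in zip(*col_pass)]

def phash(gray, hash_size=8):
    """64-bit hash from the lowest-frequency 8x8 block of DCT coefficients."""
    freq = dct_2d(gray)
    low = [freq[u][v] for u in range(hash_size) for v in range(hash_size)]
    ac_terms = sorted(low[1:])  # leave the DC term out of the median
    median = ac_terms[len(ac_terms) // 2]
    bits = 0
    for coeff in low:
        bits = (bits << 1) | (1 if coeff > median else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def looks_like_template(page_hash, template_hash, max_distance=10):
    """Hypothetical decision rule: a small Hamming distance means similar."""
    return hamming(page_hash, template_hash) <= max_distance

# Synthetic 32x32 stand-ins for screenshots (no real pages involved).
template = [[(7 * r + 13 * c + r * c) % 256 for c in range(32)] for r in range(32)]
dimmed = [[0.5 * px for px in row] for row in template]    # same layout, darker
inverted = [[255 - px for px in row] for row in template]  # different appearance
```

Because the hash keeps only the sign of each low-frequency coefficient relative to the median, a uniform brightness scaling leaves it unchanged, which is the robustness a visual-similarity detector relies on.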

cs.LG [Back]

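[137] When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents cs.LG | cs.AI | cs.CL

Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz

TL;DR: The paper studies the capability and safety of LLM agents operating over long context windows. Task performance and the refusal of harmful requests both shift unpredictably as context grows, exposing gaps in current evaluation metrics.

Details

Motivation: Prior work has focused on LLM performance with long-context prompts, while agentic setups remain relatively unexplored from both capability and safety perspectives. The paper addresses this gap.

Result: Models with 1M-2M token context windows lose more than 50% of their performance already at 100K tokens. Refusal rates shift sharply and inconsistently at 200K tokens: GPT-4.1-nano rises from about 5% to about 40%, while Grok 4 Fast falls from about 80% to about 10%.

Insight: Long contexts can destabilize the safety mechanisms of LLM agents, so current evaluation paradigms may need rethinking, especially for long multi-step tasks.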

Abstract: Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from ~5% to ~40% while Grok 4 Fast decreases from ~80% to ~10% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

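[138] OptPO: Optimal Rollout Allocation for Test-time Policy Optimization cs.LG | cs.AI | cs.CL

Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei

TL;DR: OptPO is a framework for test-time policy optimization that allocates inference budget adaptively. It uses a Bayesian sequential probability ratio test to stop sampling dynamically, cutting redundant computation.

Details

Motivation: Existing methods estimate rewards with fixed-budget majority voting, which wastes substantial computation. OptPO replaces this with dynamic budget allocation.

Result: Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead while preserving or improving accuracy.

Insight: By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation.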

Abstract: Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
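The early-stopping idea can be sketched as a sequential test over sampled answers. The abstract does not give OptPO's exact formulation, so this sketch assumes a simple two-hypothesis test with equal priors: either the leading answer is a real consensus (each rollout agrees with probability p_consensus) or it is not (probability p_null). All names, probabilities, and the threshold are hypothetical.

```python
import math
from collections import Counter

def adaptive_vote(sample_answer, max_rollouts=32, threshold=0.95,
                  p_consensus=0.8, p_null=0.5):
    """Sample answers until posterior confidence in the leader passes threshold.

    Returns (leading_answer, posterior, rollouts). The retained rollouts are
    what a test-time learner would reuse for its policy update.
    """
    if max_rollouts < 1:
        raise ValueError("max_rollouts must be at least 1")
    counts = Counter()
    rollouts = []
    posterior = 0.0
    for n in range(1, max_rollouts + 1):
        answer = sample_answer()
        rollouts.append(answer)
        counts[answer] += 1
        leader, k = counts.most_common(1)[0]
        # Log-likelihood ratio of "consensus" vs "no consensus" given that
        # k of the n rollouts agree with the current leader.
        llr = (k * math.log(p_consensus / p_null)
               + (n - k) * math.log((1 - p_consensus) / (1 - p_null)))
        # With equal prior odds, the posterior is the logistic of the LLR.
        posterior = 1.0 / (1.0 + math.exp(-llr))
        if posterior >= threshold:
            break  # confident enough: stop spending rollouts
    leader = counts.most_common(1)[0][0]
    return leader, posterior, rollouts

# Deterministic stand-ins for a model's sampled answers.
unanimous = iter(["42"] * 32)
split = iter(["a", "b"] * 16)

easy_answer, easy_conf, easy_rollouts = adaptive_vote(lambda: next(unanimous))
hard_answer, hard_conf, hard_rollouts = adaptive_vote(lambda: next(split), max_rollouts=8)
```

With these assumed settings, a unanimous stream halts after a handful of rollouts while an evenly split stream consumes the whole budget, which is the saving over fixed-budget majority voting that the abstract describes.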

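[139] Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation cs.LG | cs.CV | cs.SI

Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang

TL;DR: The paper proposes a multimodal learning approach that combines road network data with satellite imagery to improve traffic accident prediction, and uses causal analysis to identify the main contributing factors.

Details

Motivation: Accident prediction has relied mainly on structural features of the road network, overlooking physical and environmental information about the road surface and its surroundings. The study fills this gap by combining multimodal data.

Result: The multimodal approach reaches an average AUROC of 90.1%, a 3.7% gain over models that use graph structure alone. The causal analysis finds that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns.

Insight: Satellite imagery features are essential for accurate prediction, and combining modalities captures more of the factors behind traffic accidents.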

Abstract: We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region’s weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of 90.1%, which is a 3.7% gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
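A matching estimator compares each treated unit with the most similar untreated unit and averages the outcome differences. The sketch below is a minimal nearest-neighbor version that matches on a single scalar covariate. The paper matches using its learned embeddings and other confounders, so this is a simplification, and the field names and numbers are hypothetical.

```python
def matching_att(units):
    """Average treatment effect on the treated via 1-nearest-neighbor matching.

    Each unit is a dict with 'treated' (bool), 'covariate' (float), and
    'outcome' (float). Controls may be reused (matching with replacement).
    """
    treated = [u for u in units if u["treated"]]
    controls = [u for u in units if not u["treated"]]
    if not treated or not controls:
        raise ValueError("need at least one treated and one control unit")
    diffs = []
    for t in treated:
        # Closest control in covariate space stands in for the counterfactual.
        match = min(controls, key=lambda c: abs(c["covariate"] - t["covariate"]))
        diffs.append(t["outcome"] - match["outcome"])
    return sum(diffs) / len(diffs)

# Hypothetical road nodes: treated = high precipitation, covariate = daily
# traffic volume, outcome = accidents per year.
nodes = [
    {"treated": True, "covariate": 1000.0, "outcome": 12.4},
    {"treated": True, "covariate": 5000.0, "outcome": 24.8},
    {"treated": False, "covariate": 1100.0, "outcome": 10.0},
    {"treated": False, "covariate": 4800.0, "outcome": 20.0},
    {"treated": False, "covariate": 9000.0, "outcome": 40.0},
]

# Matching on traffic volume keeps the busy 9000-vehicle node from being
# compared with quiet treated nodes, which a raw group average would do.
att = matching_att(nodes)
```

Matching on richer representations, as the abstract describes, follows the same pattern with a distance over embedding vectors in place of the absolute difference.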

eess.IV [Back]

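[140] Comparing Baseline and Day-1 Diffusion MRI Using Multimodal Deep Embeddings for Stroke Outcome Prediction eess.IV | cs.AI | cs.CV | cs.LG

Sina Raeisadigh, Myles Joshua Toledo Tan, Henning Müller, Abderrahmane Hedjoudje

TL;DR: The study compares baseline (J0) and 24-hour (J1) diffusion MRI for predicting three-month functional outcomes in acute ischemic stroke (AIS) patients. J1 models (AUC = 0.923) outperform J0 models (AUC ≤ 0.86), and adding lesion-volume features improves model stability and interpretability.

Details

Motivation: Outcome prediction in acute ischemic stroke is important for clinical decisions. The study tests whether early post-treatment MRI has more prognostic value than pre-treatment imaging, and whether multimodal data improves model performance.

Result: The J1 multimodal model achieves the highest predictive performance (AUC = 0.923 ± 0.085), outperforming J0-based configurations (AUC ≤ 0.86). Adding lesion-volume features further improves stability and interpretability.

Insight: 1. Early post-treatment MRI has greater prognostic value for AIS than pre-treatment imaging. 2. Fusing multimodal data improves both performance and interpretability in clinical prediction tasks.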

Abstract: This study compares baseline (J0) and 24-hour (J1) diffusion magnetic resonance imaging (MRI) for predicting three-month functional outcomes after acute ischemic stroke (AIS). Seventy-four AIS patients with paired apparent diffusion coefficient (ADC) scans and clinical data were analyzed. Three-dimensional ResNet-50 embeddings were fused with structured clinical variables, reduced via principal component analysis (<=12 components), and classified using linear support vector machines with eight-fold stratified group cross-validation. J1 multimodal models achieved the highest predictive performance (AUC = 0.923 +/- 0.085), outperforming J0-based configurations (AUC <= 0.86). Incorporating lesion-volume features further improved model stability and interpretability. These findings demonstrate that early post-treatment diffusion MRI provides superior prognostic value to pre-treatment imaging and that combining MRI, clinical, and lesion-volume features produces a robust and interpretable framework for predicting three-month functional outcomes in AIS patients.

cs.HC [Back]

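[141] Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education cs.HC | cs.CV | cs.SE

Alvaro Becerra, Pablo Villegas, Ruth Cobos

TL;DR: The paper presents two tools, Watch-DMLT and ViSeDOPS, for real-time multimodal data collection and visualization. They address the lack of scalable, synchronized, high-resolution tools in education, and a classroom deployment shows they are feasible.

Details

Motivation: Wearable sensors such as smartwatches offer new ways to study cognitive and affective processes in education, but the lack of scalable, synchronized tools for multimodal data acquisition holds back the practical use of Multimodal Learning Analytics.

Result: In a classroom deployment with 65 students and up to 16 smartwatches, the system captured and analyzed heart rate, motion, gaze, video, and contextual annotations, showing that it is feasible in real learning environments.

Insight: Real-time, synchronized multimodal acquisition and visualization tools can support fine-grained, scalable educational analytics and make Multimodal Learning Analytics more practical.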

Abstract: Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavioral data. In the context of education, these technologies offer new opportunities to study cognitive and affective processes such as engagement, attention, and performance. However, the lack of scalable, synchronized, and high-resolution tools for multimodal data acquisition continues to be a significant barrier to the widespread adoption of Multimodal Learning Analytics in real-world educational settings. This paper presents two complementary tools developed to address these challenges: Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches that enables real-time, multi-user monitoring of physiological and motion signals; and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.

cs.GR [Back]

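[142] SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control cs.GR | cs.AI | cs.CV | cs.RO

Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura

TL;DR: Score-Matching Motion Priors (SMP) use pre-trained motion diffusion models and score distillation sampling to build reusable, task-agnostic motion priors that need no retraining for each new controller and can synthesize new motion styles.

Details

Motivation: Adversarial imitation learning usually requires retraining the motion prior for every new controller, which limits reuse and requires keeping the reference motion data. SMP is designed to remove this constraint.

Result: SMP produces high-quality motion across a diverse set of control tasks, comparable to state-of-the-art adversarial imitation learning methods, and can compose styles into new ones.

Insight: SMP shows the potential of task-agnostic motion priors that are modular and reusable, offering a new approach to virtual character control.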

Abstract: Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20

cs.IR

[143] LORE: A Large Generative Model for Search Relevance cs.IR | cs.AI | cs.CL | cs.LG

Chenji Lu, Zhuo Chen, Hui Zhao, Zhiyuan Zeng, Gang Zhao

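TL;DR: LORE is a framework for e-commerce search relevance built on large generative models; through staged training and comprehensive evaluation, it significantly improves search relevance metrics.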

Details

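Motivation: Existing methods treat search relevance as a single monolithic task without systematic decomposition, which leads to a performance bottleneck. LORE decomposes relevance into core capabilities (knowledge and reasoning, multi-modal matching, and rule adherence) to break through this bottleneck.

Result: Over three years of deployment, the online GoodRate metric improved by a cumulative 27%.

Insight: Relevance must be decomposed into several core capabilities, using a qualitative-driven decomposition, to break through the current performance ceiling.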

Abstract: Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.

Table of Contents

cs.CV

[1] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale cs.CV | cs.AI

[2] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework cs.CV

[3] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges cs.CV

[4] Mapping of Lesion Images to Somatic Mutations cs.CV | q-bio.QM

[5] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting cs.CV | cs.GR | cs.LG

[6] RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation cs.CV

[7] Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation cs.CV | cs.AI

[8] Towards Unified Video Quality Assessment cs.CV

[9] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models cs.CV | cs.AI | cs.LG

[10] Exploring the Potentials of Spiking Neural Networks for Image Deraining cs.CV

[11] Spatiotemporal Pyramid Flow Matching for Climate Emulation cs.CV | cs.AI | cs.LG | eess.IV | stat.ML

[12] Progressive Image Restoration via Text-Conditioned Video Generation cs.CV | cs.AI

[13] Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision cs.CV | cs.AI

[14] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction cs.CV

[15] A multi-weight self-matching visual explanation for cnns on sar images cs.CV

[16] Understanding and Harnessing Sparsity in Unified Multimodal Models cs.CV | cs.AI

[17] WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting cs.CV

[18] VACoT: Rethinking Visual Data Augmentation with VLMs cs.CV | cs.AI

[19] Tackling Tuberculosis: A Comparative Dive into Machine Learning for Tuberculosis Detection cs.CV | cs.AI

[20] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention cs.CV | cs.AI

[21] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch cs.CV

[22] Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation cs.CV

[23] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate cs.CV | cs.AI | cs.LG

[24] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture cs.CV | cs.AI

[25] Generalizing Vision-Language Models with Dedicated Prompt Guidance cs.CV

[26] GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning cs.CV

[27] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning cs.CV | cs.AI | cs.CL | cs.IR | cs.LG

[28] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework cs.CV | cs.AI

[29] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources cs.CV | cs.AI

[30] Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation cs.CV | cs.LG

[31] nuScenes Revisited: Progress and Challenges in Autonomous Driving cs.CV | cs.RO

[32] HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild cs.CV | cs.AI

[33] See, Think, Learn: A Self-Taught Multimodal Reasoner cs.CV | cs.CL

[34] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation cs.CV

[35] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration cs.CV

[36] TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution cs.CV

[37] WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling cs.CV | cs.LG

[38] G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline cs.CV

[39] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making cs.CV | cs.AI

[40] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding cs.CV | cs.AI

[41] YingVideo-MV: Music-Driven Multi-Stage Video Generation cs.CV

[42] Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration cs.CV | cs.GR

[43] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model cs.CV

[44] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding cs.CV

[45] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling cs.CV

[46] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts cs.CV

[47] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning cs.CV | cs.AI | cs.CL

[48] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection cs.CV | stat.ML

[49] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens cs.CV

[50] AVGGT: Rethinking Global Attention for Accelerating VGGT cs.CV

[51] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities cs.CV | cs.CL | cs.CR

[52] OmniPerson: Unified Identity-Preserving Pedestrian Generation cs.CV

[53] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature cs.CV | cs.AI

[54] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval cs.CV

[55] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence cs.CV

[56] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding cs.CV

[57] PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking cs.CV

[58] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation cs.CV | cs.LG | cs.MM | cs.SD | eess.AS

[59] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation cs.CV | cs.IR

[60] UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking cs.CV

[61] Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance cs.CV

[62] ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection cs.CV | cs.LG

[63] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm cs.CV | cs.LG

[64] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding cs.CV

[65] Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone cs.CV | cs.LG

[66] Reasoning-Aware Multimodal Fusion for Hateful Video Detection cs.CV | cs.AI

[67] Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset cs.CV

[68] TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking cs.CV

[69] UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits cs.CV

[70] HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval cs.CV | cs.MM

[71] IC-World: In-Context Generation for Shared World Modeling cs.CV

[72] Defense That Attacks: How Robust Models Become Better Attackers cs.CV | cs.AI

[73] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? cs.CV

[74] RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association cs.CV

[75] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward cs.CV

[76] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm cs.CV

[77] Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models cs.CV | cs.RO

[78] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding cs.CV | cs.AI | cs.MM