cs.CV [Total: 81]
cs.CL [Total: 35]
eess.AS [Total: 1]
cs.CR [Total: 2]
cs.LG [Total: 9]
eess.IV [Total: 5]
cs.IR [Total: 1]
cs.SD [Total: 2]
cs.AI [Total: 3]
cs.GR [Total: 3]
cs.RO [Total: 4]

cs.CV

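[1] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations cs.CV | cs.AI

Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani

TL;DR: This paper proposes a novel inference-time defense strategy that uses synthetic clinical demonstrations to enhance the safety of medical vision-language models (Med-VLMs) while avoiding the performance degradation caused by over-defense.

Details

Motivation: Generative Med-VLMs face security vulnerabilities when generating complex textual information such as diagnostic reports. They must reject harmful queries (e.g., instructions for insurance fraud) without rejecting benign clinical queries through over-defense.

Result: Experiments show that the synthetic demonstration strategy improves model safety without significantly compromising performance, and that the mixed demonstration strategy balances safety and performance under a few-shot budget.

Insight: Synthetic demonstrations are an effective and flexible way to enhance safety. Increasing the demonstration budget alleviates over-defense, and the mixed demonstration strategy offers a workable solution when resources are constrained.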

Abstract: Generative medical vision-language models (Med-VLMs) are primarily designed to generate complex textual information (e.g., diagnostic reports) from multimodal inputs including vision modality (e.g., medical images) and language modality (e.g., clinical queries). However, their security vulnerabilities remain underexplored. Med-VLMs should be capable of rejecting harmful queries, such as "Provide detailed instructions for using this CT scan for insurance fraud". At the same time, addressing security concerns introduces the risk of over-defense, where safety-enhancing mechanisms may degrade general performance, causing Med-VLMs to reject benign clinical queries. In this paper, we propose a novel inference-time defense strategy to mitigate harmful queries, enabling defense against visual and textual jailbreak attacks. Using diverse medical imaging datasets collected from nine modalities, we demonstrate that our defense strategy based on synthetic clinical demonstrations enhances model safety without significantly compromising performance. Additionally, we find that increasing the demonstration budget alleviates the over-defense issue. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance under few-shot demonstration budget constraints.

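[2] Segment Any Architectural Facades (SAAF): An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance cs.CV | cs.AI

Peilin Li, Jun Yin, Jing Zhong, Ran Luo, Pengyu Zeng

TL;DR: SAAF is an automatic segmentation model for building facades, walls, and windows based on multimodal semantic guidance. It incorporates natural language processing and an end-to-end training framework to improve automation and robustness, and experiments show it outperforms existing methods across diverse datasets.

Details

Motivation: In the digital development of architecture, automatic segmentation of walls and windows is a key step in improving the efficiency of building information models and computer-aided design.

Result: On multiple facade datasets, SAAF outperforms existing methods on the mIoU metric, improving the accuracy and generalization of the segmentation task.

Insight: Applying multimodal learning to the architectural field offers new ideas and technical paths for architectural computer vision.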

Abstract: In the context of the digital development of architecture, the automatic segmentation of walls and windows is a key step in improving the efficiency of building information models and computer-aided design. This study proposes an automatic segmentation model for building facade walls and windows based on multimodal semantic guidance, called Segment Any Architectural Facades (SAAF). First, SAAF has a multimodal semantic collaborative feature extraction mechanism. By combining natural language processing technology, it can fuse the semantic information in text descriptions with image features, enhancing the semantic understanding of building facade components. Second, we developed an end-to-end training framework that enables the model to autonomously learn the mapping relationship from text descriptions to image segmentation, reducing the influence of manual intervention on the segmentation results and improving the automation and robustness of the model. Finally, we conducted extensive experiments on multiple facade datasets. The segmentation results of SAAF outperformed existing methods in the mIoU metric, indicating that the SAAF model can maintain high-precision segmentation ability when faced with diverse datasets. Our model has made certain progress in improving the accuracy and generalization ability of the wall and window segmentation task. It is expected to provide a reference for the development of architectural computer vision technology and also explore new ideas and technical paths for the application of multimodal learning in the architectural field.

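[3] VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks cs.CV | cs.AI

Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi

TL;DR: This paper presents VersaVid-R1, which uses two new datasets, DarkEventInfer and MixVidQA, together with reinforcement learning to extend the Reason-Then-Respond paradigm to video understanding and reasoning, significantly outperforming existing models on multiple tasks.

Details

Motivation: Multimodal large language models have successfully applied the Reason-Then-Respond paradigm to image reasoning, but video reasoning remains underdeveloped because high-quality data and effective training methods are scarce. This paper aims to fill that gap.

Result: VersaVid-R1 significantly outperforms existing models on general video understanding, cognitive reasoning, and captioning tasks.

Insight: Purpose-built datasets combined with reinforcement learning can effectively improve video understanding and reasoning, offering a new approach to multimodal video tasks.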

Abstract: Recent advancements in multimodal large language models have successfully extended the Reason-Then-Respond paradigm to image-based reasoning, yet video-based reasoning remains an underdeveloped frontier, primarily due to the scarcity of high-quality reasoning-oriented data and effective training methodologies. To bridge this gap, we introduce DarkEventInfer and MixVidQA, two novel datasets specifically designed to stimulate the model’s advanced video understanding and reasoning abilities. DarkEventInfer presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues. MixVidQA, on the other hand, presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. Leveraging these carefully curated training samples together with reinforcement learning guided by diverse reward functions, we develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm capable of handling multiple-choice and open-ended question answering, as well as video captioning tasks. Extensive experiments demonstrate that VersaVid-R1 significantly outperforms existing models across a broad spectrum of benchmarks, covering video general understanding, cognitive reasoning, and captioning tasks.

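[4] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation cs.CV | cs.AI | cs.CL

Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Richeng Xuan

TL;DR: FlagEvalMM is an open-source multimodal evaluation framework for comprehensively assessing models on vision-language tasks, with support for flexible resource allocation and task extension.

Details

Motivation: Existing evaluation tools for multimodal models often lack flexibility and efficiency. FlagEvalMM aims to address these problems and provide more comprehensive evaluation.

Result: Experiments show that FlagEvalMM evaluates models efficiently and accurately, revealing their strengths and weaknesses.

Insight: The framework gives multimodal research a standardized, efficient evaluation tool that facilitates model comparison and improvement.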

Abstract: We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.

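[5] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models cs.CV | cs.AI | cs.LG

Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang

TL;DR: AVA-Bench is a benchmark of atomic visual abilities for vision foundation models (VFMs). By disentangling 14 atomic visual abilities (such as localization and depth estimation), it addresses the data mismatch and ability entanglement problems of conventional evaluation.

Details

Motivation: Conventional evaluation (e.g., VQA benchmarks) has two blind spots, data mismatch and the coupling of multiple abilities, so it cannot pinpoint the specific weaknesses of a VFM.

Result: Experiments show that even a small language model (0.5B) yields VFM rankings similar to those from a larger one (7B) while cutting GPU hours by 8x.

Insight: AVA-Bench provides a transparent, efficient framework for precisely evaluating and improving VFMs, and may help drive the development of the next generation of models.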

Abstract: The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM’s visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) – foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive “ability fingerprints,” turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

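[6] BakuFlow: A Streamlining Semi-Automatic Label Generation Tool cs.CV | cs.AI

Jerry Lin, Partick P. W. Chen

TL;DR: BakuFlow is a semi-automatic image labeling tool that combines manual correction, data augmentation, label propagation, and automatic labeling modules to significantly improve labeling efficiency and user experience.

Details

Motivation: Manually labeling large-scale datasets is time-consuming and error-prone, and existing tools such as LabelImg still require annotators to label each image by hand, so a more efficient solution is needed.

Result: The tool substantially reduces labeling workload, is particularly suited to video data and dynamic datasets, and improves practical efficiency in computer vision tasks.

Insight: By combining the strengths of manual and automatic labeling, semi-automatic tools can relieve the bottleneck of large-scale data labeling, especially for the changing demands of industrial scenarios.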

Abstract: Accurate data labeling (or annotation) is still a bottleneck in computer vision, especially for large-scale tasks where manual labeling is time-consuming and error-prone. While tools like LabelImg can handle the labeling task, some of them still require annotators to manually label each image. In this paper, we introduce BakuFlow, a streamlining semi-automatic label generation tool. Key features include (1) a live adjustable magnifier for pixel-precise manual corrections, improving user experience; (2) an interactive data augmentation module to diversify training datasets; (3) label propagation for rapidly copying labeled objects between consecutive frames, greatly accelerating annotation of video data; and (4) an automatic labeling module powered by a modified YOLOE framework. Unlike the original YOLOE, our extension supports adding new object classes and any number of visual prompts per class during annotation, enabling flexible and scalable labeling for dynamic, real-world datasets. These innovations make BakuFlow especially effective for object detection and tracking, substantially reducing labeling workload and improving efficiency in practical computer vision and industrial scenarios.

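[7] Bias Analysis in Unconditional Image Generative Models cs.CV | cs.LG

Xiaofeng Zhang, Michelle Lin, Simon Lacoste-Julien, Aaron Courville, Yash Goyal

TL;DR: The paper analyzes how bias arises in unconditional image generative models, defining the bias of an attribute as the difference between the probability of its presence in the observed distribution and its expected proportion in an ideal reference distribution. Experiments show that the detected attribute shifts are small but markedly sensitive to the attribute classifier used.

Details

Motivation: To study the mechanisms of bias in unconditional image generative models, characterize attribute shifts between the training and generated distributions, and examine how much the evaluation framework depends on the classifier.

Result: Attribute shifts are small, but shift detection is sensitive to the classifier, especially when attribute values lie on a spectrum rather than being binary.

Insight: The study points to the need for more representative labeling, a deeper understanding of the limitations of evaluation frameworks, and recognition of the socially complex nature of attributes.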

Abstract: The widespread adoption of generative AI models has raised growing concerns about representational harm and potential discriminatory outcomes. Yet, despite growing literature on this topic, the mechanisms by which bias emerges - especially in unconditional generation - remain disentangled. We define the bias of an attribute as the difference between the probability of its presence in the observed distribution and its expected proportion in an ideal reference distribution. In our analysis, we train a set of unconditional image generative models and adopt a commonly used bias evaluation framework to study bias shift between training and generated distributions. Our experiments reveal that the detected attribute shifts are small. We find that the attribute shifts are sensitive to the attribute classifier used to label generated images in the evaluation framework, particularly when its decision boundaries fall in high-density regions. Our empirical analysis indicates that this classifier sensitivity is often observed in attribute values that lie on a spectrum, as opposed to exhibiting a binary nature. This highlights the need for more representative labeling practices, understanding the shortcomings through greater scrutiny of evaluation frameworks, and recognizing the socially complex nature of attributes when evaluating bias.
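
The bias definition in this abstract can be illustrated with a minimal sketch. It assumes attribute labels have already been produced by some classifier; the function names, the "smiling"/"neutral" labels, and the 0.5 reference proportion are hypothetical and are not taken from the paper.

```python
from collections import Counter


def attribute_bias(labels, attribute, reference_proportion):
    """Observed frequency of `attribute` minus its proportion in the reference distribution."""
    if not labels:
        raise ValueError("labels must be non-empty")
    observed = Counter(labels)[attribute] / len(labels)
    return observed - reference_proportion


def bias_shift(train_labels, generated_labels, attribute, reference_proportion):
    """Change in bias from the training distribution to the generated distribution."""
    return (attribute_bias(generated_labels, attribute, reference_proportion)
            - attribute_bias(train_labels, attribute, reference_proportion))


# Hypothetical classifier outputs for 100 training and 100 generated images.
train_labels = ["smiling"] * 48 + ["neutral"] * 52
generated_labels = ["smiling"] * 51 + ["neutral"] * 49

print(attribute_bias(train_labels, "smiling", 0.5))
print(bias_shift(train_labels, generated_labels, "smiling", 0.5))
```

In this formulation the reference proportion cancels out of the shift, so the shift depends only on the labels assigned to the two sample sets. That is one way to see why the measured shift is sensitive to the choice of attribute classifier.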

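[8] CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation cs.CV | cs.CL

Arnav Yayavaram, Siddharth Yayavaram, Simran Khanuja, Michael Saxon, Graham Neubig

TL;DR: CAIRe is a novel evaluation metric that measures the cultural relevance of images, addressing the problem of measuring cross-cultural bias in text-to-image models.

Details

Motivation: As text-to-image models become widespread, ensuring equitable performance across diverse cultural contexts is critical, yet the inability to reliably measure cross-cultural bias has stalled progress.

Result: On a manually curated dataset, CAIRe surpasses all baselines by 28% F1 points. On two further datasets of culturally universal concepts, it achieves Pearson correlations of 0.56 and 0.66 with human ratings.

Insight: CAIRe's retrieval-augmented evaluation offers an effective tool for quantifying cross-cultural bias and lays groundwork for future research.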

Abstract: As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 28% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson’s correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.

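[9] Seedance 1.0: Exploring the Boundaries of Video Generation Models cs.CV

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang

TL;DR: Seedance 1.0 is a high-performance video generation foundation model that achieves high-quality, fast, prompt-following video generation through multi-source data curation, efficient architecture design, and optimized training.

Details

Motivation: Current video generation models struggle to balance prompt following, motion plausibility, and visual quality at the same time. Seedance 1.0 aims to address this.

Result: Generating a 5-second video at 1080p takes only 41.4 seconds (on an NVIDIA L20), with high quality, spatiotemporal fluidity, and precise instruction following.

Insight: Data curation and architecture design are critical to the performance of video generation models, and jointly learning multiple tasks can further improve a model's generality and efficiency.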

Abstract: Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation having superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.

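[10] Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models cs.CV

Sungwon Hwang, Hyojin Jang, Kinam Kim, Minho Park, Jaegul Choo

TL;DR: The paper proposes CREPA, a new regularization technique for fine-tuning video diffusion models (VDMs) that aligns a frame's hidden states with external features across frames, significantly improving visual fidelity and cross-frame semantic consistency.

Details

Motivation: Existing VDM fine-tuning falls short in preserving semantic consistency across frames, and existing representation alignment methods such as REPA were designed for image diffusion models and do not transfer directly to video tasks.

Result: CREPA is validated on large-scale VDMs including CogVideoX-5B and Hunyuan Video, where it improves both visual fidelity and cross-frame semantic consistency.

Insight: Cross-frame feature alignment is key to fine-tuning VDMs well, especially with parameter-efficient fine-tuning, where it notably improves the quality and consistency of generated videos.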

Abstract: Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, its internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
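
The idea of aligning a frame's hidden state with external features from neighboring frames can be sketched as a toy regularizer. This is an illustrative sketch under stated assumptions, not the paper's loss: the use of cosine similarity, the `window` and `neighbor_weight` parameters, and the assumption that hidden states have already been projected to the external feature dimension are all hypothetical choices made here.

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def cross_frame_alignment_loss(hidden, external, window=1, neighbor_weight=0.5):
    """Negative mean alignment between per-frame hidden states and external features.

    hidden, external: lists of per-frame vectors with the same dimension.
    Each frame is aligned with its own external feature and, with a smaller
    (assumed) weight, with the external features of frames within `window`.
    """
    num_frames = len(hidden)
    total = 0.0
    for f in range(num_frames):
        score = cosine(hidden[f], external[f])  # same-frame term
        lo, hi = max(0, f - window), min(num_frames, f + window + 1)
        for k in range(lo, hi):
            if k != f:  # neighboring-frame terms
                score += neighbor_weight * cosine(hidden[f], external[k])
        total += score
    return -total / num_frames


# Three frames whose hidden states already match the external features.
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
print(cross_frame_alignment_loss(frames, frames))
```

Setting `neighbor_weight=0` reduces the sketch to a per-frame alignment term, which corresponds to the straightforward per-frame adaptation that the abstract describes as suboptimal for cross-frame consistency.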

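[11] PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies cs.CV | cs.LG

Mojtaba Nafez, Amirhossein Koochakian, Arad Maleki, Jafar Habibi, Mohammad Hossein Rohban

TL;DR: PatchGuard is an adversarially robust anomaly detection (AD) and anomaly localization (AL) method built on Vision Transformers (ViT) and pseudo anomalies. It uses pseudo anomalies with localization masks to improve robustness to adversarial attacks and significantly outperforms existing methods.

Details

Motivation: Current AD and AL methods are trained only on normal samples and are therefore vulnerable to adversarial attacks, which limits their use in high-reliability fields such as medical imaging and industrial monitoring. PatchGuard aims to address this.

Result: In adversarial settings, PatchGuard improves AD and AL performance by 53.2% and 68.5% respectively, while remaining competitive in non-adversarial settings.

Insight: Combining pseudo anomalies with ViTs and an optimized adversarial training objective is an effective route to adversarially robust AD and AL.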

Abstract: Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow it by providing theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of 53.2% in AD and 68.5% in AL, while also maintaining competitive accuracy in non-adversarial settings. The code repository is available at https://github.com/rohban-lab/PatchGuard.

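[12] UFM: A Simple Path towards Unified Dense Correspondence with Flow cs.CV | cs.LG | cs.RO

Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen

TL;DR: UFM is a unified dense correspondence model that uses a simple transformer architecture to directly regress (u,v) flow. It is easier to train and more accurate than conventional methods, and is the first to show that unified training can outperform specialized approaches.

Details

Motivation: Dense correspondence has traditionally been handled separately for wide-baseline scenarios and optical flow estimation, even though both share the same goal. UFM addresses this through unified training.

Result: UFM is 28% more accurate than the state-of-the-art flow method (Unimatch), and has 62% less error while running 6.7x faster than a dense wide-baseline matcher (RoMa), demonstrating for the first time the advantage of unified training across both domains.

Insight: Unified training can outperform task-specific methods, opening new directions for multi-modal, long-range, and real-time correspondence tasks.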

Abstract: Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow & Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows compared to the typical coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also having 62% less error and being 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.

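[13] Lightweight Object Detection Using Quantized YOLOv4-Tiny for Emergency Response in Aerial Imagery cs.CV | cs.LG

Sindhu Boddu, Arindam Mukherjee

TL;DR: The paper proposes a lightweight object detection approach that quantizes YOLOv4-Tiny for aerial imagery in emergency response, substantially reducing model size and improving inference speed.

Details

Motivation: Public datasets lack drone-view imagery of emergency scenes, which has limited research in this area. The paper aims to provide an efficient, lightweight object detection solution for emergency response.

Result: After quantization, model size drops from 22.5 MB to 6.4 MB and inference speed improves by 44%, with detection performance (mAP and F1 score) comparable to YOLOv5-small.

Insight: INT8 quantization works well for lightweight models and is especially suited to deployment on edge devices, offering an efficient solution for real-time emergency detection.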

Abstract: This paper presents a lightweight and energy-efficient object detection solution for aerial imagery captured during emergency response situations. We focus on deploying the YOLOv4-Tiny model, a compact convolutional neural network, optimized through post-training quantization to INT8 precision. The model is trained on a custom-curated aerial emergency dataset, consisting of 10,820 annotated images covering critical emergency scenarios. Unlike prior works that rely on publicly available datasets, we created this dataset ourselves due to the lack of publicly available drone-view emergency imagery, making the dataset itself a key contribution of this work. The quantized model is evaluated against YOLOv5-small across multiple metrics, including mean Average Precision (mAP), F1 score, inference time, and model size. Experimental results demonstrate that the quantized YOLOv4-Tiny achieves comparable detection performance while reducing the model size from 22.5 MB to 6.4 MB and improving inference speed by 44%. With a 71% reduction in model size and a 44% increase in inference speed, the quantized YOLOv4-Tiny model proves highly suitable for real-time emergency detection on low-power edge devices.
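
The size figures in this abstract can be checked with a short worked calculation. The helper names below are hypothetical, and reading the "44% increase in inference speed" as a throughput gain (rather than a latency reduction) is an assumption; the abstract does not say which is meant.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def latency_fraction_after_speedup(speedup_percent):
    """Fraction of the original per-image latency left after a throughput gain."""
    return 1.0 / (1.0 + speedup_percent / 100.0)


# Model sizes in MB, as reported in the abstract.
size_reduction = percent_reduction(22.5, 6.4)
compression_ratio = 22.5 / 6.4

print(f"size reduction: {size_reduction:.1f}%")        # about 71.6%, reported as 71%
print(f"compression ratio: {compression_ratio:.2f}x")  # about 3.52x
print(f"latency remaining: {latency_fraction_after_speedup(44):.1%}")
```

The roughly 3.5x compression is somewhat below the 4x that would follow from storing every 32-bit value in 8 bits, which would be consistent with some tensors or metadata remaining at higher precision, although the abstract does not give a breakdown.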

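[14] Efficient Edge Deployment of Quantized YOLOv4-Tiny for Aerial Emergency Object Detection on Raspberry Pi 5 cs.CV

Sindhu Boddu, Arindam Mukherjee

TL;DR: The paper presents the deployment and performance evaluation of a quantized YOLOv4-Tiny on the Raspberry Pi 5 for real-time object detection in aerial emergency imagery. The quantized model runs efficiently under embedded conditions.

Details

Motivation: The study addresses the challenge of real-time object detection on resource-constrained edge devices such as the Raspberry Pi 5, particularly the need for low power consumption and high efficiency in emergency response applications.

Result: The quantized model achieves an inference time of 28.2 ms per image at an average power consumption of 13.85 W, a significant reduction in power usage compared with the FP32 version, and detection accuracy remains robust across key emergency classes.

Insight: Low-power embedded AI systems have potential for real-time deployment in safety-critical emergency response applications, which demonstrates the practical value of quantization.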

Abstract: This paper presents the deployment and performance evaluation of a quantized YOLOv4-Tiny model for real-time object detection in aerial emergency imagery on a resource-constrained edge device, the Raspberry Pi 5. The YOLOv4-Tiny model was quantized to INT8 precision using TensorFlow Lite post-training quantization techniques and evaluated for detection speed, power consumption, and thermal feasibility under embedded deployment conditions. The quantized model achieved an inference time of 28.2 ms per image with an average power consumption of 13.85 W, demonstrating a significant reduction in power usage compared to its FP32 counterpart. Detection accuracy remained robust across key emergency classes such as Ambulance, Police, Fire Engine, and Car Crash. These results highlight the potential of low-power embedded AI systems for real-time deployment in safety-critical emergency response applications.

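Tong Wang, Guanzhou Chen, Xiaodong Zhang, Chenxi Liu, Jiaqi Wang

TL;DR: The paper proposes MSSDF, a multi-modal self-supervised learning framework that pre-trains on RGB images, multi-spectral data, and digital surface models. Using an adaptive masking strategy and multi-task objectives, it significantly improves performance on downstream remote sensing tasks.

Details

Motivation: Labeled remote sensing data is costly and time-consuming to acquire, so an efficient self-supervised method is needed that exploits the information in multi-modal data for pre-training.

Result: The method is validated on 15 datasets and performs well across tasks, including semantic segmentation (mIoU 78.30%), depth estimation (RMSE 0.182), and change detection (mIoU 47.51%).

Insight: Combining multi-modal data with self-supervised learning can effectively improve performance on remote sensing tasks, especially when labeled data is limited.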

Abstract: Remote sensing image interpretation plays a critical role in environmental monitoring, urban planning, and disaster assessment. However, acquiring high-quality labeled data is often costly and time-consuming. To address this challenge, we propose a multi-modal self-supervised learning framework that leverages high-resolution RGB images, multi-spectral data, and digital surface models (DSM) for pre-training. By designing an information-aware adaptive masking strategy, cross-modal masking mechanism, and multi-task self-supervised objectives, the framework effectively captures both the correlations across different modalities and the unique feature structures within each modality. We evaluated the proposed method on multiple downstream tasks, covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation. Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks. The results demonstrate that the proposed method outperforms existing pretraining approaches in most tasks. Specifically, on the Potsdam and Vaihingen semantic segmentation tasks, our method achieved mIoU scores of 78.30% and 76.50%, using only 50% of the training set. For the US3D depth estimation task, the RMSE error is reduced to 0.182, and for the binary change detection task on the SECOND dataset, our method achieved an mIoU score of 47.51%, surpassing the second-best method, CS-MAE, by 3 percentage points. Our pretrain code, checkpoints, and HR-Pairs dataset can be found at https://github.com/CVEO/MSSDF.

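[16] An Effective End-to-End Solution for Multimodal Action Recognition cs.CV

Songping Wang, Xiantao Hu, Yueming Lyu, Caifeng Shan

TL;DR: This paper proposes an end-to-end solution for multimodal action recognition that significantly improves recognition performance through data augmentation, transfer learning, multimodal feature extraction, and prediction enhancement.

Details

Motivation: Multimodal action recognition is challenging because tri-modal data is scarce. The paper addresses this by combining multimodal information with optimization techniques.

Result: The solution reaches 99% Top-1 accuracy and 100% Top-5 accuracy on the competition leaderboard.

Insight: Combining comprehensive use of multimodal information with prediction enhancement methods yields clear gains in action recognition, particularly when data is scarce.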

Abstract: Recently, multimodal tasks have strongly advanced the field of action recognition with their rich multimodal information. However, due to the scarcity of tri-modal data, research on tri-modal action recognition tasks faces many challenges. To this end, we have proposed a comprehensive multimodal action recognition solution that effectively utilizes multimodal information. First, the existing data are transformed and expanded by optimizing data enhancement techniques to enlarge the training scale. At the same time, more RGB datasets are used to pre-train the backbone network, which is better adapted to the new task by means of transfer learning. Secondly, multimodal spatial features are extracted with the help of 2D CNNs and combined with the Temporal Shift Module (TSM) to achieve multimodal spatial-temporal feature extraction comparable to 3D CNNs and improve the computational efficiency. In addition, common prediction enhancement methods, such as Stochastic Weight Averaging (SWA), Ensemble and Test-Time augmentation (TTA), are used to integrate the knowledge of models from different training periods of the same architecture and different architectures, so as to predict the actions from different perspectives and fully exploit the target information. Ultimately, we achieved the Top-1 accuracy of 99% and the Top-5 accuracy of 100% on the competition leaderboard, demonstrating the superiority of our solution.
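
The Temporal Shift Module mentioned in this abstract lets a 2D CNN mix information across frames by shifting a fraction of channels along the time axis. The following is a minimal NumPy sketch of that shift operation only; the `(T, C, H, W)` layout, the `fold_div=8` default, and the function name are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along time for a clip of shape (T, C, H, W).

    The first C // fold_div channels take their values from the next frame,
    the following C // fold_div channels from the previous frame, and the
    remaining channels are left unchanged. Vacated positions are zero-filled.
    """
    num_channels = x.shape[1]
    fold = num_channels // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # pull from the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # pull from the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out


# Toy clip: 3 frames, 8 channels, 1x1 spatial size.
clip = np.arange(3 * 8, dtype=float).reshape(3, 8, 1, 1)
shifted = temporal_shift(clip)
print(shifted[:, :3, 0, 0])
```

Because the shift itself has no learnable parameters, a following 2D convolution sees features from adjacent frames at no extra multiply-add cost, which is the efficiency argument the abstract makes relative to 3D CNNs.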

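[17] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation cs.CV | cs.AI | cs.LG

Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren

TL;DR: The paper proposes autoregressive adversarial post-training (AAPT), which turns a pre-trained latent video diffusion model into a real-time interactive video generator that supports one-step generation and interactive control.

Details

Motivation: Existing large-scale video generation models are too computationally intensive for real-time interactive applications.

Result: The model achieves real-time 24fps video generation at 736x416 resolution on a single H100, and supports 1280x720 generation on 8xH100 for videos up to a minute long.

Insight: Adversarial training is an effective paradigm for autoregressive generation, and one-step generation combined with full use of the KV cache markedly improves efficiency.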

Abstract: Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2

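[18] A new approach for image segmentation based on diffeomorphic registration and gradient fields cs.CV

Junchao Zhou

TL;DR: The paper proposes a new image segmentation method based on diffeomorphic registration and gradient fields. It segments by deforming a template curve and comparing it with the image gradient field, without relying on large datasets.

Details

Motivation: Conventional segmentation relies either on large amounts of training data or on limited edge detection techniques. This work combines shape analysis and diffeomorphic transformations to offer a more flexible, theoretically grounded method.

Result: The method achieves accurate segmentation without relying on large datasets, and is both flexible and theoretically grounded.

Insight: Combining diffeomorphic transformations with geometric shape representations can notably improve segmentation accuracy, particularly when data is scarce.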

Abstract: Image segmentation is a fundamental task in computer vision aimed at delineating object boundaries within images. Traditional approaches, such as edge detection and variational methods, have been widely explored, while recent advances in deep learning have shown promising results but often require extensive training data. In this work, we propose a novel variational framework for 2D image segmentation that integrates concepts from shape analysis and diffeomorphic transformations. Our method models segmentation as the deformation of a template curve via a diffeomorphic transformation of the image domain, using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework. The curve evolution is guided by a loss function that compares the deformed curve to the image gradient field, formulated through the varifold representation of geometric shapes. The approach is implemented in Python with GPU acceleration using the PyKeops library. This framework allows for accurate segmentation with a flexible and theoretically grounded methodology that does not rely on large datasets.

[19] SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing cs.CV | cs.AI | cs.CRPDF

Hongguang Zhu, Yunchao Wei, Mengyu Wang, Siyu Jiao, Yan Fang

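TL;DR: SAGE proposes semantic-augment erasing to explore the boundaries of the unsafe concept domain. Cyclic self-check and self-erasure generalize erasure from a concept word to a concept domain, and a global-local collaborative retention mechanism limits the degradation of irrelevant concepts.

Details

Motivation: Diffusion models have made significant progress in text-to-image generation, but sensitive information absorbed during pre-training poses safety risks such as unsafe content generation and copyright infringement. Existing methods treat an unsafe concept as a fixed word and erase it repeatedly, trapping the model in a "word concept abyss" that limits generalized concept-related erasing.

Result: Experiments show that SAGE comprehensively outperforms other methods in safe generation with diffusion models.

Insight: Combining semantic-augment erasing with global-local retention offers a new approach to unsafe-concept erasure in diffusion models and improves both generalization and safety.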

Abstract: Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing finetunes weights to unlearn undesirable concepts, and has emerged as a promising solution. However, existing methods treat an unsafe concept as a fixed word and repeatedly erase it, trapping DMs in a "word concept abyss", which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing, which transforms concept word erasure into concept domain erasure through cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of the concept domain through semantic spatial relationships between the original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose a global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at https://github.com/KevinLight831/SAGE.

[20] ScaleLSD: Scalable Deep Line Segment Detection Streamlined cs.CV

Zeran Ke, Bin Tan, Xianwei Zheng, Yujun Shen, Tianfu Wu

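TL;DR: ScaleLSD proposes a scalable self-supervised learning approach to line segment detection in images. It outperforms the classical non-deep method and performs well across several tasks.

Details

Motivation: The goal is to learn a domain-agnostic, robust line segment detection model that works on any natural image, using self-supervised learning to overcome the scalability limits of earlier methods.

Result: Under zero-shot protocols, ScaleLSD performs strongly across several tasks and is the first deep model to outperform the classical non-deep method on every task tested.

Insight: Self-supervised learning on data at scale can significantly improve the robustness and generalization of line-geometry detection, offering a new direction for geometric image characterization.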

Abstract: This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic robust LSD model that works well for any natural image. Focusing on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. ScaleLSD detects many more line segments in natural images than even the pioneering non-deep LSD approach, giving a more complete and accurate geometric characterization of images using line segments. Experimentally, ScaleLSD is comprehensively evaluated under zero-shot protocols in detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, with excellent performance on all of them. Based on this thorough evaluation, ScaleLSD is observed to be the first deep approach that outperforms the pioneering non-deep LSD in all aspects we have tested, significantly expanding and reinforcing the versatility of the line geometry of images. Code and Models are available at https://github.com/ant-research/scalelsd

Jialong Zuo, Yongtai Deng, Mengdan Tan, Rui Jin, Dongyue Wu

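TL;DR: The paper introduces a new task, Omni Multi-modal Person Re-identification (OM-ReID), and builds ORBench, the first high-quality multi-modal dataset for it. It also proposes ReID5o, a multi-modal learning framework that fuses and aligns multiple modalities within a single model. Experiments confirm its advancement and practicality.

Details

Motivation: Existing person re-identification methods and datasets support only a limited set of modalities and cannot serve real-world multi-modal queries (e.g., RGB, infrared, textual descriptions).

Result: Experiments demonstrate the diversity of ORBench and the strong performance of ReID5o, which outperforms other models across modality combinations.

Insight: Multi-modal person re-identification is an important practical need, and unified model design and high-quality datasets are key to meeting it.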

Abstract: In real-world scenarios, person re-identification (ReID) expects to identify a person-of-interest via a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. This dataset also has significant superiority in terms of diversity, such as the painting perspectives and textual information. It could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism proposed. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.

[22] SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation cs.CV | I.2.6; I.5.1

Xinya Liu, Jianghao Wu, Tao Lu, Shaoting Zhang, Guotai Wang

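TL;DR: SRPL-SFDA proposes a SAM-guided reliable pseudo-label method for source-free domain adaptation (SFDA) in medical image segmentation. Using Test-Time Tri-branch Intensity Enhancement (T3IE), a pseudo-label selection module, and reliability-aware training, it substantially improves segmentation in the target domain.

Details

Motivation: Medical image segmentation models face significant domain shift when deployed at new clinical centers. SFDA protects privacy but must cope with insufficient supervision from unlabeled target-domain data.

Result: On two medical image datasets, SRPL-SFDA outperforms existing SFDA methods and approaches the performance of supervised training.

Insight: SAM's zero-shot capability can effectively improve pseudo-label quality, and a reliability-based selection strategy is especially important in unsupervised training.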

Abstract: Domain Adaptation (DA) is crucial for robust deployment of medical image segmentation models when applied to new clinical centers with significant domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal with privacy concerns and access constraints on source-domain data during adaptation to target-domain data. However, SFDA faces challenges such as insufficient supervision in the target domain with unlabeled images. In this work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch Intensity Enhancement (T3IE) that not only improves quality of raw pseudo-labels in the target domain, but also leads to SAM-compatible inputs with three channels to better leverage SAM’s zero-shot inference ability for refining the pseudo-labels; 2) A reliable pseudo-label selection module that rejects low-quality pseudo-labels based on Consistency of Multiple SAM Outputs (CMSO) under input perturbations with T3IE; and 3) A reliability-aware training procedure in the unlabeled target domain where reliable pseudo-labels are used for supervision and unreliable parts are regularized by entropy minimization. Experiments conducted on two multi-domain medical image segmentation datasets for fetal brain and the prostate respectively demonstrate that: 1) SRPL-SFDA effectively enhances pseudo-label quality in the unlabeled target domain, and improves SFDA performance by leveraging the reliability-aware training; 2) SRPL-SFDA outperformed state-of-the-art SFDA methods, and its performance is close to that of supervised training in the target domain. The code of this work is available online: https://github.com/HiLab-git/SRPL-SFDA.
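The third component of the abstract (supervise with reliable pseudo-labels, regularize unreliable parts by entropy minimization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the per-pixel list layout, and the `weight` balancing factor are assumptions made for illustration, and a real pipeline would operate on tensors.

```python
import math

def reliability_aware_loss(probs, pseudo_labels, reliable, weight=1.0):
    """Cross-entropy on reliable pixels plus mean entropy on unreliable ones.

    probs:         per-pixel class-probability lists (each sums to 1)
    pseudo_labels: per-pixel class index
    reliable:      per-pixel bool, True where the pseudo-label is trusted
    weight:        hypothetical factor balancing the entropy term
    """
    eps = 1e-12  # guards against log(0)
    ce_terms, ent_terms = [], []
    for p, y, ok in zip(probs, pseudo_labels, reliable):
        if ok:
            # Supervised term: negative log-probability of the pseudo-label.
            ce_terms.append(-math.log(p[y] + eps))
        else:
            # Regularizer: Shannon entropy of the prediction.
            ent_terms.append(-sum(q * math.log(q + eps) for q in p))
    ce = sum(ce_terms) / len(ce_terms) if ce_terms else 0.0
    ent = sum(ent_terms) / len(ent_terms) if ent_terms else 0.0
    return ce + weight * ent

# Two pixels: the first is trusted, the second is not.
example_loss = reliability_aware_loss(
    probs=[[0.9, 0.1], [0.5, 0.5]],
    pseudo_labels=[0, 1],
    reliable=[True, False],
)
print(example_loss)
```

Minimizing the entropy term pushes unreliable pixels toward confident predictions without committing them to a possibly wrong pseudo-label.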

[23] Synthetic Human Action Video Data Generation with Pose Transfer cs.CV | cs.AI

Vaclav Knapp, Matyas Bohacek

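TL;DR: The paper proposes generating synthetic human action video data via pose transfer, addressing the "uncanny" quality of traditional synthetic data in video understanding tasks, and shows its effectiveness for action recognition and for augmenting few-shot data.

Details

Motivation: Existing synthetic human action data is often ineffective because of its "uncanny" features, limiting its use in tasks such as gesture recognition and autonomous driving. The paper proposes a new method to generate more natural synthetic data.

Result: The method is validated on the Toyota Smarthome and NTU RGB+D datasets, where it improves action recognition performance and effectively scales few-shot datasets.

Insight: More natural pose transfer can resolve the uncanny quality of traditional synthetic data and offers an effective solution for data-scarce tasks.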

Abstract: In video understanding tasks, particularly those involving human motion, synthetic data generation often suffers from uncanny features, diminishing its effectiveness for training. Tasks such as sign language translation, gesture recognition, and human motion understanding in autonomous driving have thus been unable to exploit the full potential of synthetic data. This paper proposes a method for generating synthetic human action video data using pose transfer (specifically, controllable 3D Gaussian avatar models). We evaluate this method on the Toyota Smarthome and NTU RGB+D datasets and show that it improves performance in action recognition tasks. Moreover, we demonstrate that the method can effectively scale few-shot datasets, making up for groups underrepresented in the real training data and adding diverse backgrounds. We open-source the method along with RANDOM People, a dataset with videos and avatars of novel human identities for pose transfer crowd-sourced from the internet.

[24] Noise Conditional Variational Score Distillation cs.CV

Xinyu Peng, Ziyang Zheng, Yaoming Wang, Han Li, Nuowen Kan

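TL;DR: NCVSD is a new method for distilling pretrained diffusion models into generative denoisers. By revealing that the unconditional score function implicitly characterizes the score function of the denoising posterior distribution, it enables efficient learning and flexible sampling.

Details

Motivation: Existing diffusion models generate slowly. NCVSD aims to improve generation speed and sample quality by distilling denoisers while keeping the benefit of iterative refinement.

Result: NCVSD performs well in image generation and inverse problem solving: it outperforms the teacher diffusion models, is on par with larger consistency models, and achieves record LPIPS on inverse problems at low NFE.

Insight: Through noise conditioning and the implicit link between score functions, NCVSD balances efficiency and quality, offering a new direction for practical generative models.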

Abstract: We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefit of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.

[25] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation cs.CV | cs.AI

Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai

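TL;DR: The paper presents InterSyn, a high-quality dataset, and SynJudge, an automatic evaluation model, to improve the training and evaluation of interleaved image-text generation in multimodal models.

Details

Motivation: Large Multimodal Models (LMMs) have advanced markedly in multimodal understanding and generation, but they still struggle to generate tightly interleaved image-text outputs, mainly because existing training datasets are limited in scale, quality, and instructional richness.

Result: Experiments show that the SEIR method substantially improves dataset quality and that LMMs trained on InterSyn improve on all evaluation metrics.

Insight: High-quality datasets and reliable evaluation tools are key to improving interleaved generation and provide a foundation for future instruction-following LMMs.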

Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.

[26] A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning cs.CV

Swadhin Das, Divyansh Mundra, Priyanshu Dayal, Raksha Sharma

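TL;DR: The paper proposes a lightweight Transformer architecture with an edge-aware fusion strategy for remote sensing image captioning, improving computational efficiency and the capture of fine-grained spatial detail.

Details

Motivation: Existing Transformer models for remote sensing image captioning are computationally expensive, and multi-modal frameworks often overlook fine-grained structural features such as edges and boundaries.

Result: Experiments show that the method improves caption quality over state-of-the-art methods, using a lighter architecture intended to reduce computational cost.

Insight: Combining lightweight design with fine-grained feature extraction can improve both the performance and the practicality of remote sensing image captioning.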

Abstract: Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models primarily focus on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed by reducing the dimensionality of the encoder layers and employing a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network. Furthermore, an edge-aware enhancement strategy is incorporated to enhance image representation and object boundary understanding, enabling the model to capture fine-grained spatial details in remote sensing images. Experimental results demonstrate that the proposed approach significantly improves caption quality compared to state-of-the-art methods.

[27] TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision cs.CV | cs.AI

Ayush Gupta, Anirban Roy, Rama Chellappa, Nathaniel D. Bastian, Alvaro Velasquez

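TL;DR: The paper proposes TOGA, a model for temporally grounded open-ended video QA under weak supervision. It jointly generates the answer and its temporal grounding, using pseudo labels generated without temporal annotations.

Details

Motivation: To address temporal grounding in video question answering while achieving open-ended video QA under weak supervision, without temporal annotations.

Result: State-of-the-art performance on the NExT-GQA, MSVD-QA, and ActivityNet-QA benchmarks.

Insight: Jointly generating answers and temporal grounding improves both question answering and grounding, and weak supervision works effectively even without temporal annotations.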

Abstract: We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.

[28] Harmonizing and Merging Source Models for CLIP-based Domain Generalization cs.CV

Yuhe Ding, Jian Liang, Bo Jiang, Zi Wang, Aihua Zheng

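TL;DR: The paper proposes HAM, a new framework that addresses sample conflicts and optimization conflicts in multi-source training for CLIP-based domain generalization, improving generalization through sample enrichment and model merging.

Details

Motivation: Existing CLIP-based domain generalization methods suffer from sample conflicts and optimization conflicts during multi-source training, which limits generalization. HAM resolves these conflicts to improve performance on unseen domains.

Result: HAM achieves state-of-the-art performance on five widely used benchmark datasets.

Insight: Sample enrichment and model merging can effectively resolve conflicts in multi-source training and improve generalization, and they are particularly suited to CLIP-based domain generalization.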

Abstract: CLIP-based domain generalization aims to improve model generalization to unseen domains by leveraging the powerful zero-shot classification capabilities of CLIP and multiple source datasets. Existing methods typically train a single model across multiple source domains to capture domain-shared information. However, this paradigm inherently suffers from two types of conflicts: 1) sample conflicts, arising from noisy samples and extreme domain shifts among sources; and 2) optimization conflicts, stemming from competition and trade-offs during multi-source training. Both hinder the generalization and lead to suboptimal solutions. Recent studies have shown that model merging can effectively mitigate the competition of multi-objective optimization and improve generalization performance. Inspired by these findings, we propose Harmonizing and Merging (HAM), a novel source model merging framework for CLIP-based domain generalization. During the training process of the source models, HAM enriches the source samples without conflicting samples, and harmonizes the update directions of all models. Then, a redundancy-aware historical model merging method is introduced to effectively integrate knowledge across all source models. HAM comprehensively consolidates source domain information while enabling mutual enhancement among source models, ultimately yielding a final model with optimal generalization capabilities. Extensive experiments on five widely used benchmark datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance.

[29] Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization cs.CV

Amirreza Khoshbakht, Erchan Aptoula

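TL;DR: The paper proposes a novel open-set domain generalization framework for hyperspectral image classification that combines Spectrum-Invariant Frequency Disentanglement (SIFD), a Dual-Channel Residual Network (DCRN), Evidential Deep Learning (EDL), and Spectral-Spatial Uncertainty Disentanglement (SSUD) to handle unknown classes and domain shift.

Details

Motivation: Open-set domain generalization (OSDG) in hyperspectral image classification must handle unknown classes and generalize across multiple domains. Existing domain adaptation methods depend on target-domain data and cannot handle unknown classes, leading to negative transfer and degraded performance.

Result: On three cross-scene hyperspectral classification tasks, the approach performs comparably to state-of-the-art domain adaptation methods without any target-domain training data.

Insight: Frequency-domain analysis and uncertainty disentanglement are effective for open-set domain generalization and offer a new direction for multi-domain hyperspectral classification.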

Abstract: Open-set domain generalization (OSDG) for hyperspectral image classification presents significant challenges due to the presence of unknown classes in target domains and the need for models to generalize across multiple unseen domains without target-specific adaptation. Existing domain adaptation methods assume access to target domain data during training and fail to address the fundamental issue of domain shift when unknown classes are present, leading to negative transfer and reduced classification performance. To address these limitations, we propose a novel open-set domain generalization framework that combines four key components: Spectrum-Invariant Frequency Disentanglement (SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network (DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning (EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty Disentanglement (SSUD) for reliable open-set classification. The SIFD module extracts domain-invariant spectral features in the frequency domain through attention-weighted frequency analysis and domain-agnostic regularization, while DCRN captures complementary spectral and spatial information via parallel pathways with adaptive fusion. EDL provides principled uncertainty estimation using Dirichlet distributions, enabling the SSUD module to make reliable open-set decisions through uncertainty-aware pathway weighting and adaptive rejection thresholding. Experimental results on three cross-scene hyperspectral classification tasks show that our approach achieves performance comparable to state-of-the-art domain adaptation methods while requiring no access to the target domain during training. The implementation will be made available at https://github.com/amir-khb/SSUDOSDG upon acceptance.
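The Dirichlet-based uncertainty that EDL provides can be sketched using the standard evidential formulation: non-negative evidence e_k gives Dirichlet parameters alpha_k = e_k + 1, and the uncertainty mass is K divided by the sum of the alphas. The rejection rule below is a simplification. The abstract describes adaptive thresholding and pathway weighting, whereas this sketch uses a single fixed `reject_threshold`, which is a hypothetical parameter, as are the function names.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-class evidence to beliefs, uncertainty, and
    expected class probabilities under a Dirichlet distribution."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]       # Dirichlet parameters
    strength = sum(alpha)                     # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = k / strength                # beliefs + uncertainty sum to 1
    expected_prob = [a / strength for a in alpha]
    return belief, uncertainty, expected_prob

def classify_open_set(evidence, reject_threshold=0.5):
    """Return (label, uncertainty); label is "unknown" when evidence is weak.
    A fixed threshold is an illustrative stand-in for adaptive rejection."""
    _, u, prob = dirichlet_opinion(evidence)
    if u > reject_threshold:
        return "unknown", u
    return max(range(len(prob)), key=prob.__getitem__), u

print(classify_open_set([9.0, 0.0, 0.0]))   # strong evidence for class 0
print(classify_open_set([0.1, 0.2, 0.1]))   # little evidence for any class
```

A sample with little total evidence receives high uncertainty regardless of which class has the largest share, which is what makes this quantity usable for rejecting unknown classes.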

[30] Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing cs.CV

Maria Damanaki, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos

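TL;DR: The paper proposes a cooperative multi-object tracking framework based on graph signal processing that fuses information from multiple vehicles to improve object tracking in 3D LiDAR scenes.

Details

Motivation: Single-agent multi-object tracking (MOT) has limited perception because of occlusions, sensor failures, and similar problems, whereas multi-agent cooperation provides a more complete understanding of the environment.

Result: On the real-world V2V4Real dataset, the method significantly outperforms baseline frameworks such as DMSTrack and V2V4Real.

Insight: Graph signal processing reveals the inherent coherence of multi-agent detections, giving a new way to optimize MOT.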

Abstract: Multi-Object Tracking (MOT) plays a crucial role in autonomous driving systems, as it lays the foundations for advanced perception and precise path planning modules. Nonetheless, single-agent MOT has limited perception of its surroundings due to occlusions, sensor failures, etc. Hence, the integration of multi-agent information is essential for comprehensive understanding of the environment. This paper proposes a novel Cooperative MOT framework for tracking objects in 3D LiDAR scenes by formulating and solving a graph topology-aware optimization problem so as to fuse information coming from multiple vehicles. By exploiting a fully connected graph topology defined by the detected bounding boxes, we employ the Graph Laplacian processing optimization technique to smooth the position error of bounding boxes and effectively combine them. In that manner, we reveal and leverage inherent coherences of diverse multi-agent detections, and associate the refined bounding boxes to tracked objects at two stages, optimizing localization and tracking accuracies. An extensive evaluation study has been conducted, using the real-world V2V4Real dataset, where the proposed method significantly outperforms the baseline frameworks, including the state-of-the-art deep-learning DMSTrack and V2V4Real, in various testing sequences.
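Graph Laplacian smoothing over a fully connected graph can be illustrated with a generic Laplacian-regularized least-squares problem. The abstract does not give the paper's exact objective, edge weights, or association step, so the sketch below assumes unit edge weights and the textbook form min_x ||x - y||^2 + lam * x^T L x, applied per coordinate to box centers of one object seen by several agents. For a fully connected graph with n nodes, L = n*I - 1*1^T, and the solution of (I + lam*L) x = y has the closed form x_i = (y_i + lam * sum(y)) / (1 + lam * n).

```python
def laplacian_smooth(points, lam=1.0):
    """Smooth n observations of one box center over a fully connected graph.

    points: list of equal-length coordinate tuples, one per observing agent
    lam:    hypothetical regularization weight; larger values pull the
            observations closer to their common mean
    """
    n = len(points)
    dims = len(points[0])
    # Per-coordinate sum of all observations.
    sums = [sum(p[d] for p in points) for d in range(dims)]
    # Closed-form solution of (I + lam * L) x = y with L = n*I - 1*1^T.
    return [
        tuple((p[d] + lam * sums[d]) / (1.0 + lam * n) for d in range(dims))
        for p in points
    ]

centers = [(10.0, 2.0, 0.0), (10.4, 2.2, 0.0), (9.6, 1.8, 0.0)]
smoothed = laplacian_smooth(centers, lam=1.0)
for c in smoothed:
    print(tuple(round(v, 3) for v in c))
```

The smoothing preserves the mean of the observations and shrinks their spread, so disagreeing detections of the same object move toward a consensus position.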

Cheng Chen, Yunpeng Zhai, Yifan Zhao, Jinyang Gao, Bolin Ding

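TL;DR: The paper proposes an exploration-exploitation reinforcement learning framework for in-context learning (ICL) in multi-modal few-shot large vision-language models (LVLMs). It adaptively selects combinations of multi-modal demonstrations to improve task understanding and execution.

Details

Motivation: Current ICL methods rely on predefined or heuristically selected demonstrations, which cannot cover diverse task requirements and ignore interactions between demonstrations, leading to suboptimal performance. A method that dynamically optimizes the demonstration selection policy is therefore needed.

Result: The method is validated on four visual question answering (VQA) datasets, showing its advantage in improving few-shot LVLM performance.

Insight: Reinforcement learning can capture interactions between demonstrations and dynamically optimize the selection policy, making in-context learning more flexible.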

Abstract: In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: First, they rely on pre-defined demonstrations or heuristic selecting strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; Second, individually selecting each demonstration fails in modeling the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling the ability to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.

[32] Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20th century Urban Landscapes with Satellite Imageries cs.CV

Tianxiang Hao, Lixian Zhang, Yingjia Zhang, Mengxuan Chen, Jinxiao Zhang

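TL;DR: The paper presents Urban1960SatBench, the first semantic segmentation dataset built on mid-20th-century historical satellite imagery, together with Urban1960SatUSM, an unsupervised segmentation framework, for studying early urban development.

Details

Motivation: Historical satellite imagery (e.g., Keyhole data) is valuable for understanding early urban development and long-term change, but quality degradation (distortion, misalignment, spectral scarcity) and missing annotations have long hindered its semantic segmentation.

Result: Experiments show that Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on the Urban1960SatSeg dataset.

Insight: The work opens a path to quantifying long-term urban change with modern computer vision.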

Abstract: Historical satellite imagery, such as mid-20th century Keyhole data, offers rare insights into understanding early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and annotation absence have long hindered semantic segmentation on such historical RS imagery. To bridge this gap and enhance understanding of urban development, we introduce Urban1960SatBench, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing segmentation datasets, along with a benchmark framework for unsupervised segmentation tasks, Urban1960SatUSM. First, Urban1960SatBench serves as a novel, expertly annotated semantic segmentation dataset built on mid-20th century Keyhole imagery, covering 1,240 km² and key urban classes (buildings, roads, farmland, water). As the earliest segmentation dataset of its kind, it provides a pioneering benchmark for historical urban understanding. Second, Urban1960SatUSM (Unsupervised Segmentation Model) is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Experiments show Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg for segmenting historical urban scenes, paving the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and supplementary material are available at https://github.com/Tianxiang-Hao/Urban1960SatSeg.

[33] TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation cs.CV

Zetian Song, Jiaye Fu, Jiaqi Zhang, Xiaohan Lu, Chuanmin Jia

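TL;DR: TinySplat is a feedforward approach for generating compact 3D scene representations. By eliminating geometric, perceptual, and spatial redundancy, it compresses 3D Gaussian data by over 100x, needs only 6% of the storage of the state-of-the-art compression approach, and greatly reduces encoding and decoding time.

Details

Motivation: Existing feedforward 3D Gaussian Splatting (3DGS) methods reconstruct quickly but are costly to store, and existing compression methods are incompatible with feedforward architectures. TinySplat targets this storage bottleneck.

Result: On multiple benchmark datasets, TinySplat compresses 3D Gaussian data by over 100x. Compared with the state-of-the-art compression approach, it reaches comparable quality with 6% of the storage size, 25% of the encoding time, and 1% of the decoding time.

Insight: TinySplat's compression framework shows that 3D scene representations can be handled efficiently within a feedforward architecture while keeping high quality and low storage cost.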

Abstract: The recent development of feedforward 3D Gaussian Splatting (3DGS) presents a new paradigm to reconstruct 3D scenes. Using neural networks trained on large-scale multi-view datasets, it can directly infer 3DGS representations from sparse input views. Although the feedforward approach achieves high reconstruction speed, it still suffers from the substantial storage cost of 3D Gaussians. Existing 3DGS compression methods relying on scene-wise optimization are not applicable due to architectural incompatibilities. To overcome this limitation, we propose TinySplat, a complete feedforward approach for generating compact 3D scene representations. Built upon standard feedforward 3DGS methods, TinySplat integrates a training-free compression framework that systematically eliminates key sources of redundancy. Specifically, we introduce View-Projection Transformation (VPT) to reduce geometric redundancy by projecting geometric parameters into a more compact space. We further present Visibility-Aware Basis Reduction (VABR), which mitigates perceptual redundancy by aligning feature energy along dominant viewing directions via basis transformation. Lastly, spatial redundancy is addressed through an off-the-shelf video codec. Comprehensive experimental results on multiple benchmark datasets demonstrate that TinySplat achieves over 100x compression for 3D Gaussian data generated by feedforward methods. Compared to the state-of-the-art compression approach, we achieve comparable quality with only 6% of the storage size. Meanwhile, our compression framework requires only 25% of the encoding time and 1% of the decoding time.

[34] Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals cs.CV | eess.IV

Changhao Peng, Yuqi Ye, Wei Gao

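TL;DR: The paper proposes a generalized Gaussian entropy model and dynamic likelihood intervals to improve probability estimation and arithmetic coding in point cloud attribute compression.

Details

Motivation: The Gaussian and Laplacian entropy models used by existing methods do not fully exploit the information in entropy parameters estimated by neural networks, and fixed likelihood intervals limit performance, so a more flexible entropy model and dynamic adjustment are needed.

Result: Experiments show that the method significantly improves rate-distortion (RD) performance on three VAE-based point cloud attribute compression models and can be applied to image and video compression.

Insight: A more flexible entropy model and dynamic interval adjustment can significantly improve compression performance, which suggests that making efficient use of the information in the latents is key.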

Abstract: Gaussian and Laplacian entropy models have proved effective in learned point cloud attribute compression, as they assist in arithmetic coding of latents. However, we demonstrate through experiments that there is still unutilized information in entropy parameters estimated by neural networks in current methods, which can be used for more accurate probability estimation. Thus we introduce a generalized Gaussian entropy model, which controls the tail shape through a shape parameter to more accurately estimate the probability of latents. Meanwhile, to the best of our knowledge, existing methods use fixed likelihood intervals for each integer during arithmetic coding, which limits model performance. We propose a Mean Error Discriminator (MED) to determine whether the entropy parameter estimation is accurate and then dynamically adjust likelihood intervals. Experiments show that our method significantly improves rate-distortion (RD) performance on three VAE-based models for point cloud attribute compression, and our method can be applied to other compression tasks, such as image and video compression.
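How a shape parameter changes the likelihood assigned to an integer latent can be shown with the standard generalized Gaussian density, f(x) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x - mu| / alpha)^beta), which reduces to a Laplacian at beta = 1 and a Gaussian at beta = 2. The sketch below is not the paper's implementation: the interval probability is computed by plain numerical integration, and `half_width` is a hypothetical stand-in for an adjustable likelihood interval, since the abstract does not specify how the MED changes the interval.

```python
import math

def gg_pdf(x, mu, alpha, beta):
    """Generalized Gaussian density: location mu, scale alpha, shape beta."""
    coeff = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coeff * math.exp(-((abs(x - mu) / alpha) ** beta))

def interval_likelihood(y, mu, alpha, beta, half_width=0.5, steps=2000):
    """Probability mass on [y - half_width, y + half_width] (midpoint rule)."""
    lo = y - half_width
    h = 2.0 * half_width / steps
    return h * sum(gg_pdf(lo + (i + 0.5) * h, mu, alpha, beta)
                   for i in range(steps))

def code_length_bits(y, mu, alpha, beta, half_width=0.5):
    """Ideal arithmetic-coding cost of symbol y under the model, in bits."""
    return -math.log2(interval_likelihood(y, mu, alpha, beta, half_width))

# Same location and scale, different tail shapes.
for beta in (1.0, 1.5, 2.0):
    print(beta, round(code_length_bits(3, mu=0.0, alpha=1.0, beta=beta), 2))
```

With location and scale held fixed, a smaller beta gives heavier tails and therefore a cheaper code for latents far from the mean, which is the extra degree of freedom a shape parameter adds over a fixed Gaussian or Laplacian model.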

[35] HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene cs.CV

Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Chengxuan Qian

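TL;DR: HAIF-GS proposes hierarchical and induced flow-guided Gaussian splatting for dynamic scene reconstruction, addressing the limitations of existing methods in motion consistency and non-rigid deformation modeling.

Details

Motivation: Reconstructing dynamic 3D scenes is a fundamental challenge. Existing methods based on 3D Gaussian Splatting (3DGS) struggle to learn structured, temporally consistent motion representations, and they suffer from redundant updates, insufficient motion supervision, and weak modeling of non-rigid deformations.

Result: On synthetic and real-world benchmarks, HAIF-GS significantly outperforms existing dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.

Insight: Sparse anchor-driven deformation and a hierarchical deformation mechanism help handle complex non-rigid deformations in dynamic scene reconstruction, while self-supervision removes the need for explicit flow labels.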

Abstract: Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppress redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.

[36] Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs cs.CV | cs.AI | cs.CLPDF

Beomsik Cho, Jaehyung Kim

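TL;DR: The paper proposes ReVisiT, a decoding method that references vision tokens to guide text generation in LVLMs, making better use of visual information with little computational overhead.

Details

Motivation: Conventional LVLM decoding strategies fail to make full use of visual information, so generated content drifts away from the visual input. Existing remedies typically require additional training or multi-step inference, which limits efficiency. ReVisiT aims to solve this with a simple decoding-time mechanism.

Result: On three LVLM hallucination benchmarks, ReVisiT consistently improves visual grounding, and it matches or exceeds state-of-the-art baselines while reducing computational cost by up to 2×.

Insight: The paper shows that the language prior latent in vision tokens is valuable for LVLM decoding, which suggests new directions for efficient multimodal model design.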

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to $2\times$.

[37] Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS cs.CV | I.4.5PDF

Tao Wang, Mengyu Li, Geduo Zeng, Cheng Meng, Qiong Zhang

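TL;DR: The paper proposes a global Gaussian mixture reduction method, framed from an optimal transport perspective, to address redundant Gaussian primitives in 3D Gaussian Splatting (3DGS). It substantially reduces memory and rendering overhead while preserving rendering quality.

Details

Motivation: Although 3DGS is a powerful radiance field rendering technique, it typically requires millions of redundant Gaussian primitives, which overwhelms memory and rendering budgets. Existing compaction methods prune Gaussians by heuristic importance scores and offer no global fidelity guarantee.

Result: Experiments show that with only 10% of the Gaussians, the method achieves rendering quality (PSNR, SSIM, LPIPS) comparable to vanilla 3DGS and outperforms existing compaction techniques.

Insight: The method applies to any stage of vanilla or accelerated 3DGS pipelines, offering an efficient, pipeline-agnostic path to lightweight neural rendering.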

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance field rendering, but it typically requires millions of redundant Gaussian primitives, overwhelming memory and rendering budgets. Existing compaction approaches address this by pruning Gaussians based on heuristic importance scores, without global fidelity guarantee. To bridge this gap, we propose a novel optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction. Specifically, we first minimize the composite transport divergence over a KD-tree partition to produce a compact geometric representation, and then decouple appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Experiments on benchmark datasets show that our method (i) yields negligible loss in rendering quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% Gaussians; and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques. Notably, our method is applicable to any stage of vanilla or accelerated 3DGS pipelines, providing an efficient and agnostic pathway to lightweight neural rendering.
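
Illustration (not from the paper): Gaussian mixture reduction replaces groups of weighted Gaussians with fewer components. The sketch below shows the classic moment-preserving merge of two 1D components, which is a common building block of mixture reduction. It is an assumption chosen for illustration and is not the composite transport divergence minimization described in the abstract; `merge_gaussians` is a hypothetical name.

```python
def merge_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Merge two weighted 1D Gaussians into one, preserving the mixture's
    total weight, mean, and variance (moment matching)."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    # Variance = weighted within-component variance + spread of the means.
    var = (w1 * (var1 + (mu1 - mu) ** 2) + w2 * (var2 + (mu2 - mu) ** 2)) / w
    return w, mu, var


if __name__ == "__main__":
    # Two equally weighted unit-variance components centred at 0 and 2.
    print(merge_gaussians(0.5, 0.0, 1.0, 0.5, 2.0, 1.0))
```

The merged variance grows with the distance between the two means, which is why merging distant components costs fidelity and why a global criterion for choosing what to merge matters.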

[38] 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection cs.CVPDF

Yi Zhang, Yi Wang, Yawen Cui, Lap-Pui Chau

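TL;DR: The paper proposes 3DGeoDet, a general-purpose geometry-aware 3D object detection method that builds 3D geometric representations in both explicit and implicit ways and performs well on single- and multi-view RGB images in indoor and outdoor environments.

Details

Motivation: Image-based 3D object detection lacks 3D geometric cues, which makes the correspondence between images and 3D representations ambiguous. To address this, the paper combines explicit and implicit 3D geometric representations to improve the model's understanding of 3D geometry.

Result: mAP@0.5 improves by 9.3 on SUN RGB-D and by 3.3 on ScanNetV2, and AP3D@0.7 improves by 0.19 on KITTI, surpassing state-of-the-art image-based methods.

Insight: Combining explicit and implicit 3D geometric representations mitigates geometric ambiguity in image-based 3D detection and improves both generalization and performance.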

Abstract: This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that effectively handles single- and multi-view RGB images in indoor and outdoor environments, showcasing its general-purpose applicability. The key challenge for image-based 3D object detection tasks is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we utilize the predicted depth to learn voxel occupancy and optimize the voxelized 3D feature volume explicitly through the proposed voxel occupancy attention. To further enhance 3D awareness, the feature volume is integrated with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model’s comprehension of 3D geometry by leveraging intermediate 3D representations and achieve end-to-end training. Our approach surpasses the performance of state-of-the-art image-based methods on both single- and multi-view benchmark datasets across diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19 AP3D@0.7 improvement on the KITTI dataset. The project page is available at: https://cindy0725.github.io/3DGeoDet/.
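
Illustration (not from the paper): the abstract names the truncated signed distance function (TSDF) as its implicit 3D representation. A minimal sketch of the usual per-voxel TSDF value along a camera ray is given below, assuming a predicted surface depth and a truncation distance. The function name, arguments, and normalization to [-1, 1] are illustrative assumptions, not the authors' implementation.

```python
def truncated_sdf(voxel_depth, surface_depth, trunc):
    """TSDF value for a voxel on a camera ray.

    Positive in front of the surface, negative behind it, normalized by the
    truncation distance and clamped to [-1, 1].
    """
    sdf = surface_depth - voxel_depth
    return max(-1.0, min(1.0, sdf / trunc))


if __name__ == "__main__":
    surface = 1.0  # predicted depth of the surface along the ray (metres)
    for depth in (0.2, 0.95, 1.0, 1.05, 2.0):
        print(depth, truncated_sdf(depth, surface, trunc=0.1))
```

Only voxels within the truncation band around the predicted surface carry a graded value; everything else saturates at the clamp limits.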

[39] GLD-Road: A global-local decoding road network extraction model for remote sensing images cs.CVPDF

Ligao Deng, Yupeng Deng, Yu Meng, Jingbo Chen, Zhihao Xi

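TL;DR: The paper proposes GLD-Road, a two-stage model that combines global efficiency with local precision to improve both the accuracy and the speed of road network extraction.

Details

Motivation: Road network extraction is crucial for mapping, autonomous driving, and disaster response, but existing methods fall short in either accuracy or efficiency.

Result: APLS improves by 1.9% on City-Scale and by 0.67% on SpaceNet3; retrieval time is 40% lower than Sat2Graph and 92% lower than RNGDet++.

Insight: Combining the strengths of global and local methods keeps extraction efficient while improving accuracy, offering a new approach to road network extraction.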

Abstract: Road networks are crucial for mapping, autonomous driving, and disaster response. While manual annotation is costly, deep learning offers efficient extraction. Current methods include postprocessing (prone to errors), global parallel (fast but misses nodes), and local iterative (accurate but slow). We propose GLD-Road, a two-stage model combining global efficiency and local precision. First, it detects road nodes and connects them via a Connect Module. Then, it iteratively refines broken roads using local searches, drastically reducing computation. Experiments show GLD-Road outperforms state-of-the-art methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++ (local). The experimental results are available at https://github.com/ucas-dlg/GLD-Road.

[40] AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions cs.CV | cs.AIPDF

Zhaoyang Wei, Chenhui Qiang, Bowen Jiang, Xumeng Han, Xuehui Yu

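TL;DR: AD^2-Bench is the first Chain-of-Thought (CoT) benchmark for multimodal large models (MLLMs) in autonomous driving under adverse weather and complex scenes. It fills a gap in existing benchmarks and supports multi-step reasoning and fine-grained analysis with over 5.4k high-quality annotated instances.

Details

Motivation: Existing benchmarks do not adequately evaluate CoT reasoning for autonomous driving in adverse weather and complex environments. AD^2-Bench aims to fill this gap and to advance the robustness and interpretability of MLLMs in autonomous driving.

Result: State-of-the-art MLLMs reach accuracy below 60% on AD^2-Bench, which shows both the difficulty of the benchmark and the limitations of current models.

Insight: AD^2-Bench reveals how strongly adverse environments affect MLLM reasoning in autonomous driving and underlines the need for more robust and interpretable models.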

Abstract: Chain-of-Thought (CoT) reasoning has emerged as a powerful approach to enhance the structured, multi-step decision-making capabilities of Multi-Modal Large Models (MLLMs), and is particularly crucial for autonomous driving under adverse weather conditions and in complex traffic environments. However, existing benchmarks have largely overlooked the need for rigorous evaluation of CoT processes in these specific and challenging scenarios. To address this critical gap, we introduce AD^2-Bench, the first Chain-of-Thought benchmark specifically designed for autonomous driving with adverse weather and complex scenes. AD^2-Bench is meticulously constructed to fulfill three key criteria: comprehensive data coverage across diverse adverse environments, fine-grained annotations that support multi-step reasoning, and a dedicated evaluation framework tailored for assessing CoT performance. The core contribution of AD^2-Bench is its extensive collection of over 5.4k high-quality, manually annotated CoT instances. Each intermediate reasoning step in these annotations is treated as an atomic unit with explicit ground truth, enabling unprecedented fine-grained analysis of MLLMs’ inferential processes under text-level, point-level, and region-level visual prompts. Our comprehensive evaluation of state-of-the-art MLLMs on AD^2-Bench reveals accuracy below 60%, highlighting the benchmark’s difficulty and the need to advance robust, interpretable end-to-end autonomous driving systems. AD^2-Bench thus provides a standardized evaluation platform, driving research forward by improving MLLMs’ reasoning in autonomous driving, making it an invaluable resource.

[41] SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields cs.CVPDF

Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang

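TL;DR: SemanticSplat proposes a feed-forward, semantic-aware 3D reconstruction method that unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling, improving scene understanding from sparse views.

Details

Motivation: Existing feed-forward 3D scene understanding methods (e.g., LSM) only extract language-based semantics and suffer from low-quality geometry reconstruction and noisy artifacts, while per-scene optimization methods rely on dense input views, which limits their practicality.

Result: Experiments demonstrate the method's effectiveness on promptable and open-vocabulary segmentation tasks.

Insight: Jointly modeling semantics and geometry can make scene understanding more robust under sparse inputs, which is relevant to augmented reality and robotic interaction.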

Abstract: Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.

[42] ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting cs.CVPDF

Giacomo Rosin, Muhammad Rameez Ur Rahman, Sebastiano Vascon

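TL;DR: The paper proposes ECAM, a contrastive learning-based module that strengthens the environmental collision avoidance of trajectory forecasting models and significantly lowers their collision rate.

Details

Motivation: Existing trajectory forecasting methods often overlook the impact of the environment, so predicted trajectories collide with obstacles. ECAM is designed to address this.

Result: Experiments show that state-of-the-art models integrated with ECAM reduce their collision rate by 40-50%.

Insight: Environmental collision avoidance is a key factor in trajectory forecasting, and contrastive learning can improve the physical plausibility of predictions.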

Abstract: Human trajectory forecasting is crucial in applications such as autonomous driving, robotics and surveillance. Accurate forecasting requires models to consider various factors, including social interactions, multi-modal predictions, pedestrian intention and environmental context. While existing methods account for these factors, they often overlook the impact of the environment, which leads to collisions with obstacles. This paper introduces ECAM (Environmental Collision Avoidance Module), a contrastive learning-based module to enhance collision avoidance ability with the environment. The proposed module can be integrated into existing trajectory forecasting models, improving their ability to generate collision-free predictions. We evaluate our method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate its collision avoidance capabilities. Our experiments show that state-of-the-art methods significantly reduce (-40/50%) the collision rate when integrated with the proposed module. The code is available at https://github.com/CVML-CFU/ECAM.

[43] HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding cs.CV | cs.AIPDF

Yanzhao Shi, Xiaodan Zhang, Junzhong Ji, Haoning Jiang, Chengxin Zheng

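TL;DR: HSENet proposes a hybrid spatial encoding network for 3D medical vision-language understanding, using dual 3D vision encoders and spatial compression to improve diagnostic accuracy and efficiency.

Details

Motivation: Existing multimodal large language models (MLLMs) mainly target 2D medical images, which limits their ability to capture complex 3D anatomical structures and leads to diagnostic errors. HSENet aims to solve this problem.

Result: The method performs strongly on 3D language-visual retrieval (R@100 +5.96%), medical report generation (BLEU-4 +8.01%), and visual question answering (Major Class Accuracy +1.99%).

Insight: Combining 3D visual information with language models can substantially improve medical diagnostic tasks, and efficient spatial compression is key.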

Abstract: Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to LLM’s semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (39.85% of R@100, +5.96% gain), 3D medical report generation (24.01% of BLEU-4, +8.01% gain), and 3D visual question answering (73.60% of Major Class Accuracy, +1.99% gain), confirming its effectiveness. Our code is available at https://github.com/YanzhaoShi/HSENet.

[44] DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning cs.CV | cs.AIPDF

Dongxu Liu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han

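TL;DR: DGAE is an autoencoder whose decoder is guided by a diffusion model. It targets the performance degradation seen at high compression ratios while achieving a more compact latent representation.

Details

Motivation: Autoencoders degrade at high compression ratios and are unstable to train, particularly because of GAN-related challenges. The paper improves the decoder's expressiveness to learn latent representations more efficiently.

Result: DGAE achieves competitive image generation performance on ImageNet-1K with a 2x smaller latent space, and it speeds up the convergence of the diffusion model.

Insight: Diffusion guidance can markedly improve decoder expressiveness and the efficiency of latent representations, offering a new direction for generative models under high compression ratios.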

Abstract: Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, addressing the training instability caused by GAN remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder’s expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with Diffusion Models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.

[45] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios cs.CV | cs.LG | cs.MM | cs.RO | eess.IVPDF

Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng

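TL;DR: HopaDIFF is a novel diffusion framework for textual reference-guided action segmentation in multi-person scenarios. The paper introduces RHAS133, the first dataset for this task, and improves segmentation performance through holistic-partial aware Fourier conditioning.

Details

Motivation: Existing action segmentation methods mainly address single-person activities and overlook multi-person scenarios. The paper fills this gap by using a textual description to specify the target person for segmentation.

Result: HopaDIFF achieves state-of-the-art performance on RHAS133, outperforming existing methods.

Insight: Multi-person action segmentation needs both holistic and partial information, and Fourier-conditioned diffusion offers a new approach to this generation task.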

Abstract: Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions with 33h video data, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The code is available at https://github.com/KPeng9510/HopaDIFF.git.

Maik Dannecker, Vasiliki Sideri-Lampretsa, Sophie Starck, Angeline Mihailov, Mathieu Milh

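TL;DR: CINeMA proposes a novel conditional implicit neural multi-modal atlas framework that builds high-resolution, spatio-temporal, multimodal brain atlases in low-data settings, with substantial gains in efficiency and flexibility.

Details

Motivation: Studying the rapid development of fetal and neonatal brains requires atlases of high spatial and temporal resolution, but existing methods depend on large datasets and cannot handle pathologies for which data is scarce.

Result: CINeMA surpasses existing methods in accuracy, efficiency, and versatility, suits low-data settings, and can generate synthetic data for data augmentation.

Insight: Through implicit representations and conditional generation, CINeMA provides an efficient solution for brain development research, especially for pathologies with scarce data.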

Abstract: Magnetic resonance imaging of fetal and neonatal brains reveals rapid neurodevelopment marked by substantial anatomical changes unfolding within days. Studying this critical stage of the developing human brain, therefore, requires accurate brain models, referred to as atlases, of high spatial and temporal resolution. To meet these demands, established traditional atlases and recently proposed deep learning-based methods rely on large and comprehensive datasets. This poses a major challenge for studying brains in the presence of pathologies for which data remains scarce. We address this limitation with CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for creating high-resolution, spatio-temporal, multimodal brain atlases, suitable for low-data settings. Unlike established methods, CINeMA operates in latent space, avoiding compute-intensive image registration and reducing atlas construction times from days to minutes. Furthermore, it enables flexible conditioning on anatomical features including gestational age (GA), birth age, and pathologies like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA supports downstream tasks such as tissue segmentation and age prediction, while its generative properties enable synthetic data creation and anatomically informed data augmentation. Surpassing state-of-the-art methods in accuracy, efficiency, and versatility, CINeMA represents a powerful tool for advancing brain research. We release the code and atlases at https://github.com/m-dannecker/CINeMA.

[47] Reasoning Models Are More Easily Gaslighted Than You Think cs.CV | cs.AIPDF

Bin Zhu, Hailong Yin, Jingjing Chen, Yu-Gang Jiang

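TL;DR: A systematic evaluation shows that state-of-the-art reasoning models (OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash) are fragile under misleading user input, with accuracy dropping by 25-29% on average. The authors introduce the GaslightingBench-R benchmark, which exposes this vulnerability further, with accuracy drops exceeding 53% on average.

Details

Motivation: The robustness of reasoning models against misleading user input remains underexplored, and the paper sets out to fill this gap.

Result: Gaslighting negation prompts cause large accuracy drops: 25-29% on average on the existing benchmarks and more than 53% on average on GaslightingBench-R.

Insight: Reasoning models can reason step by step, yet they are fragile in belief persistence and are easily misled by users. This points to a direction for improving model robustness.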

Abstract: Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate how well reasoning models defend their beliefs under gaslighting negation prompts. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.

[48] Adding simple structure at inference improves Vision-Language Compositionality cs.CV | cs.CL | cs.LGPDF

Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

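TL;DR: The paper proposes adding simple structure at inference time to improve the compositionality of vision-language models (VLMs) without retraining. It splits the image into crops and the caption into text segments, matches them, and aggregates the match similarities, which consistently improves performance.

Details

Motivation: Dual-encoder vision-language models such as CLIP perform poorly on compositional tasks (for example, attribute-object binding), and their bag-of-words-like behavior limits retrieval performance. Many training approaches have been proposed to improve these models, but inference-time techniques have received little attention.

Result: Experiments show that the method consistently improves compositionality performance across various dual-encoder VLMs, with especially strong results on attribute-object binding. The analysis also shows that processing image crops is essential to the gains.

Insight: Inference-time techniques have underestimated potential. Matching image crops to text segments is a promising direction for vision-language compositionality, and future work can further refine the inference-time pipeline.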

Abstract: Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of those models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that better align with text segments obtaining matches, and iv) we compute the final image-text similarity aggregating the individual similarities of the matches. Based on various popular dual encoder VLMs, we evaluate our approach in controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding as shown in the controlled dataset. As a result of an extensive analysis: i) we show that processing image crops is actually essential for the observed gains in performance, and ii) we identify specific areas to further improve inference-time approaches.
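
Illustration (not from the paper): steps iii) and iv) of the abstract match text segments to image crops and aggregate the match similarities. A minimal sketch follows, assuming the crop and segment embeddings have already been produced by some dual-encoder VLM. Taking the best crop per segment and averaging the matches are assumptions made here, since the abstract does not state the exact matching or aggregation rule; all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def structured_similarity(segment_embs, crop_embs):
    """Image-text score built from segment-to-crop matches.

    For each text segment, keep the similarity of its best-aligned crop,
    then average over segments (assumed aggregation rule).
    """
    matches = [max(cosine(seg, crop) for crop in crop_embs)
               for seg in segment_embs]
    return sum(matches) / len(matches)


if __name__ == "__main__":
    # Toy 2D embeddings standing in for VLM outputs.
    segments = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "red cube", "blue ball"
    crops = [[0.9, 0.1], [0.2, 0.8]]      # two image crops
    print(structured_similarity(segments, crops))
```

Scoring each segment against its own best crop is what lets an attribute and its object be checked in the same image region, rather than anywhere in the image.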

[49] Towards Practical Alzheimer’s Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model cs.CV | cs.AIPDF

Changwei Wu, Yifei Chen, Yuxin Du, Jinying Zong, Jie Dong

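TL;DR: The paper proposes FasterSNN, a lightweight and interpretable spiking neural network for early diagnosis of Alzheimer's Disease (AD). It addresses the high energy cost of conventional deep learning methods and uses a hybrid architecture to improve expressiveness and training stability.

Details

Motivation: Current AD diagnosis relies on subjective assessments and multimodal imaging, which is costly and inefficient. Deep learning can automate diagnosis but consumes too much energy, while SNNs are promising but lack expressiveness and stability on complex medical tasks.

Result: On benchmark datasets, FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its use for practical AD screening.

Insight: SNNs have potential in medical diagnosis. With a suitable architecture, their expressiveness and stability problems can be addressed, enabling low-power and efficient automated diagnosis.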

Abstract: Early diagnosis of Alzheimer’s Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at https://github.com/wuchangw/FasterSNN.

[50] Non-Contact Health Monitoring During Daily Personal Care Routines cs.CV | cs.AIPDF

Xulin Ma, Jiankai Tang, Zhang Jiang, Songqin Cheng, Yuanchun Shi

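TL;DR: The paper presents LADH, the first long-term remote photoplethysmography (rPPG) dataset. Combining RGB and infrared video inputs improves the accuracy and robustness of non-contact physiological monitoring, reaching an MAE of 4.99 BPM in heart rate estimation.

Details

Motivation: Applying rPPG to long-term personal care scenarios, such as daily routines in high-altitude environments, is challenging because of ambient lighting variations, hand occlusions, and dynamic facial postures.

Result: Heart rate estimation with combined RGB and infrared inputs reaches an MAE of 4.99 BPM, and multi-task learning improves performance across multiple physiological indicators.

Insight: Combining multimodal data with multi-task learning can markedly improve the accuracy and robustness of non-contact physiological monitoring, especially in challenging environments.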

Abstract: Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
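
Illustration (not from the paper): the headline metric is the mean absolute error (MAE) of heart rate in beats per minute. A minimal sketch of how that metric is computed is shown below. The sample values are made up for the example and are not taken from the LADH dataset.

```python
def mean_absolute_error(predicted, reference):
    """Mean absolute error between predicted and reference values."""
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical per-clip heart rates in BPM (illustrative values only).
    estimated_bpm = [70.0, 82.0, 65.0]
    ground_truth_bpm = [72.0, 80.0, 65.0]
    print(mean_absolute_error(estimated_bpm, ground_truth_bpm))
```

An MAE of 4.99 BPM therefore means that the estimated heart rate is off by about five beats per minute on average across the evaluated clips.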

[51] Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning cs.CV | cs.AIPDF

Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong

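TL;DR: The paper argues that multimodal large language models (MLLMs) have overlooked the importance of visual processing, and proposes a simple visual perturbation framework that consistently improves mathematical reasoning performance.

Details

Motivation: Despite the rapid progress of MLLMs, their visual processing has been neglected. Experiments show that language-only models given image captions can match or outperform MLLMs that consume raw visual inputs, which suggests that MLLMs fail to integrate visual information effectively during reasoning.

Result: Across multiple datasets, mathematical reasoning performance improves consistently, with gains comparable to those from algorithmic changes. Qwen2.5-VL-7B trained with visual perturbation is competitive among open-source 7B RL-tuned models.

Insight: Visual perturbation plays a critical role in multimodal mathematical reasoning. Different perturbation strategies contribute to different aspects of visual reasoning, which supports the claim that better reasoning begins with better seeing.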

Abstract: Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.

[52] Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets cs.CVPDF

Yangrui Zhu, Junhua Bao, Yipan Wei, Yapeng Li, Bo Du

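TL;DR: The paper introduces the practical setting of Multi-Modal Heterogeneous Category-set Learning (MMHCL) and proposes a Class Similarity-based Cross-modal Fusion model (CSCF). CSCF uses semantic space alignment and uncertainty estimation for knowledge transfer and decision fusion across modalities, and outperforms existing methods on multiple benchmark datasets.

Details

Motivation: In real-world applications, category distributions across modalities are often inconsistent. Existing methods assume that all modalities share the same category set, which limits the model's ability to use cross-modal information.

Result: On multiple benchmark datasets, CSCF significantly outperforms existing state-of-the-art methods.

Insight: Aligning modalities and fusing information under heterogeneous category sets is key to improving multimodal classification.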

Abstract: Existing multimodal methods typically assume that different modalities share the same category set. However, in real-world applications, the category distributions in multimodal data exhibit inconsistencies, which can hinder the model’s ability to effectively utilize cross-modal information for recognizing all categories. In this work, we propose the practical setting termed Multi-Modal Heterogeneous Category-set Learning (MMHCL), where models are trained on multi-modal data with heterogeneous category sets and aim to recognize the complete class set of all modalities at test time. To effectively address this task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF). Specifically, CSCF aligns modality-specific features to a shared semantic space to enable knowledge transfer between seen and unseen classes. It then selects the most discriminative modality for decision fusion through uncertainty estimation. Finally, it integrates cross-modal information based on class similarity, where the auxiliary modality refines the prediction of the dominant one. Experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) approaches on multiple benchmark datasets, effectively addressing the MMHCL task.

[53] Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints cs.CV | cs.ROPDF

Xiangkai Zhang, Xiang Zhou, Mao Chen, Yuchen Lu, Xu Yang

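TL;DR: The paper proposes a hierarchical cross-source image matching method for UAV absolute visual localization that uses semantic and structural constraints to improve localization accuracy and robustness.

Details

Motivation: When GNSS signals are unavailable, conventional vision-based localization for UAVs struggles with image matching because of cross-source discrepancies and temporal variations.

Result: Evaluations on public datasets and the newly introduced CS-UAV dataset demonstrate the method's superior accuracy and robustness.

Insight: Introducing semantic and structural constraints effectively mitigates cross-source and temporal differences and improves matching accuracy.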

Abstract: Absolute localization, aiming to determine an agent’s location with respect to a global reference, is crucial for unmanned aerial vehicles (UAVs) in various applications, but it becomes challenging when global navigation satellite system (GNSS) signals are unavailable. Vision-based absolute localization methods, which locate the current view of the UAV in a reference satellite map to estimate its position, have become popular in GNSS-denied scenarios. However, existing methods mostly rely on traditional and low-level image matching, suffering from difficulties due to significant differences introduced by cross-source discrepancies and temporal variations. To overcome these limitations, in this paper, we introduce a hierarchical cross-source image matching method designed for UAV absolute localization, which integrates a semantic-aware and structure-constrained coarse matching module with a lightweight fine-grained matching module. Specifically, in the coarse matching module, semantic features derived from a vision foundation model first establish region-level correspondences under semantic and structural constraints. Then, the fine-grained matching module is applied to extract fine features and establish pixel-level correspondences. Building upon this, a UAV absolute visual localization pipeline is constructed without any reliance on relative localization techniques, mainly by employing an image retrieval module before the proposed hierarchical image matching modules. Experimental evaluations on public benchmark datasets and a newly introduced CS-UAV dataset demonstrate superior accuracy and robustness of the proposed method under various challenging conditions, confirming its effectiveness.

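[54] Q-SAM2: Accurate Quantization for Segment Anything Model 2 cs.CV | cs.AI

Nicola Farronato, Florian Scheidegger, Mattia Rigotti, Cristiano Malossi, Michele Magno

TL;DR: The paper proposes Q-SAM2, which combines linear layer calibration with quantization-aware training (QAT) to counter the performance degradation SAM2 suffers under low-bit quantization, substantially improving efficiency while keeping inference accurate.

Details

Motivation: SAM2's heavy compute and memory consumption limits its use in resource-constrained scenarios, so the authors propose an efficient low-bit quantization method that addresses the performance loss incurred during quantization.

Result: Q-SAM2 surpasses existing general quantization schemes, especially at ultra-low 2-bit quantization. In post-training quantization, the calibration technique yields up to a 66% mIoU improvement over non-calibrated models.

Insight: The calibration technique, although designed for quantization-aware training, also proves effective in post-training quantization.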

Abstract: The Segment Anything Model 2 (SAM2) has gained significant attention as a foundational approach for promptable image and video segmentation. However, its expensive computational and memory consumption poses a severe challenge for its application in resource-constrained scenarios. In this paper, we propose an accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To address the performance degradation caused by the singularities in weight and activation distributions during quantization, Q-SAM2 introduces two novel technical contributions. We first introduce a linear layer calibration method for low-bit initialization of SAM2, which minimizes the Frobenius norm over a small image batch to reposition weight distributions for improved quantization. We then propose a Quantization-Aware Training (QAT) pipeline that applies clipping to suppress outliers and allows the network to adapt to quantization thresholds during training. Our comprehensive experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses existing state-of-the-art general quantization schemes, especially for ultra-low 2-bit quantization. While designed for quantization-aware training, our proposed calibration technique also proves effective in post-training quantization, achieving up to a 66% mIoU accuracy improvement over non-calibrated models.
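The role of clipping in low-bit quantization can be illustrated with a generic uniform quantizer. The following is a minimal sketch, not Q-SAM2's actual pipeline: the function names, the 2-bit grid layout, and the fixed clip threshold of 0.5 are all assumptions chosen for illustration. It shows how a single outlier stretches the quantization grid and how clipping restores resolution for the remaining values.

```python
def fake_quantize(values, num_bits=2, clip=None):
    """Quantize then dequantize values on a uniform grid over [-clip, clip].

    Hypothetical helper for illustration; if clip is None, the grid spans
    the largest absolute value, so one outlier stretches every step.
    """
    if clip is None:
        clip = max(abs(v) for v in values)
    if clip == 0:
        return [0.0 for _ in values]
    step = 2.0 * clip / (2 ** num_bits - 1)  # 2**num_bits grid points
    out = []
    for v in values:
        clamped = max(-clip, min(clip, v))       # suppress outliers
        index = round((clamped + clip) / step)   # nearest grid point
        out.append(index * step - clip)          # back to real values
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.1, -0.2, 0.3, 8.0]   # the last value is an outlier
inliers = weights[:3]

no_clip = fake_quantize(weights, num_bits=2)
with_clip = fake_quantize(weights, num_bits=2, clip=0.5)

err_no_clip = mean_abs_error(inliers, no_clip[:3])
err_with_clip = mean_abs_error(inliers, with_clip[:3])
print(err_no_clip, err_with_clip)
```

In this toy case the clipped quantizer represents the small weights far more accurately, at the cost of saturating the outlier. In a QAT setting the clip threshold would typically be learned or tuned rather than fixed by hand.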

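[55] Accurate and efficient zero-shot 6D pose estimation with frozen foundation models cs.CV

Andrea Caraffa, Davide Boscaini, Fabio Poiesi

TL;DR: FreeZeV2 is a training-free 6D pose estimation method that leverages pre-trained foundation models to estimate the pose of novel objects, improving on FreeZe in both speed and accuracy.

Details

Motivation: Most existing approaches to generalizing 6D pose estimation to novel objects scale up training on task-specific synthetic data, which is computationally expensive. FreeZeV2 asks whether accurate pose estimation is possible without task-specific training, relying on pre-trained models instead.

Result: FreeZeV2 sets a new state of the art on the seven core datasets of the BOP Benchmark. With the same segmentation masks, it runs 8x faster than FreeZe and is 5% more accurate; with ensembles of segmentation models, it gains a further 8% in accuracy while still running 2.5x faster than FreeZe.

Insight: Pre-trained foundation models avoid costly task-specific training, and an efficient design combined with model ensembling further improves accuracy and speed.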

Abstract: Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To answer "No!", we introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.

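[56] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision cs.CV

Xiandong Zou, Ruihao Xia, Hongsong Wang, Pan Zhou

TL;DR: The paper proposes DreamCS, a framework that builds the first large-scale unpaired 3D preference dataset (3D-MeshPref) and trains a reward model (RewardCS) with a novel Cauchy-Schwarz divergence objective, enabling geometry-aware 3D generation that better matches human preferences.

Details

Motivation: Existing preference alignment techniques for text-to-3D generation typically rely on hard-to-collect preference-paired multi-view 2D images to train reward models, and this 2D bias leads to geometric artifacts. The work addresses this by learning human preferences directly from unpaired 3D data.

Result: Experiments show that DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred.

Insight: Learning geometric preferences directly from 3D data, rather than from paired 2D image data, avoids 2D bias and improves generation quality.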

Abstract: While text-to-3D generation has attracted growing interest, existing methods often struggle to produce 3D assets that align well with human preferences. Current preference alignment techniques for 3D content typically rely on hard-to-collect preference-paired multi-view 2D images to train 2D reward models, which then guide 3D generation, leading to geometric artifacts due to their inherent 2D bias. To address these limitations, we construct 3D-MeshPref, the first large-scale unpaired 3D preference dataset, featuring diverse 3D meshes annotated by a large language model and refined by human evaluators. We then develop RewardCS, the first reward model trained directly on unpaired 3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling effective learning of human-aligned 3D geometric preferences without requiring paired comparisons. Building on this, we propose DreamCS, a unified framework that integrates RewardCS into text-to-3D pipelines, enhancing both implicit and explicit 3D generation with human preference feedback. Extensive experiments show DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred. Code and models will be released publicly.
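For readers unfamiliar with the Cauchy-Schwarz divergence, the sketch below computes its standard discrete form, D_CS(p, q) = -log(sum(p_i * q_i) / sqrt(sum(p_i^2) * sum(q_i^2))), between two histograms. This is only the textbook definition; how RewardCS turns it into a training objective over unpaired preferred and dispreferred meshes is not described in the abstract, and the example histograms are made up.

```python
import math


def cauchy_schwarz_divergence(p, q):
    """Discrete Cauchy-Schwarz divergence between two non-negative vectors.

    Returns 0 when p and q are proportional and grows as they diverge.
    Illustrative helper only; not taken from the DreamCS code base.
    """
    inner = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = sum(pi * pi for pi in p)
    norm_q = sum(qi * qi for qi in q)
    if inner == 0:
        return math.inf  # disjoint supports
    return -math.log(inner / math.sqrt(norm_p * norm_q))


uniform = [0.5, 0.5]
skewed = [0.9, 0.1]

d_same = cauchy_schwarz_divergence(uniform, uniform)
d_diff = cauchy_schwarz_divergence(uniform, skewed)
d_diff_swapped = cauchy_schwarz_divergence(skewed, uniform)
print(d_same, d_diff)
```

The divergence is symmetric and non-negative by the Cauchy-Schwarz inequality, which is one reason it can compare two sample sets without requiring paired examples.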

Chuang Ma, Yu Pei, Jianhang Zhang, Shaokai Zhao, Bowen Ji

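TL;DR: The paper introduces MMME, the first multimodal micro-expression dataset, which synchronously records facial action signals, central nervous system signals, and peripheral physiological signals. It fills the gap in multimodal data in micro-expression research, and fusing the modalities significantly improves micro-expression recognition and spotting.

Details

Motivation: Existing micro-expression research focuses only on the visual modality and overlooks the rich emotional information carried by physiological modalities, leaving performance far below practical needs. Exploring the cross-modal association between visual micro-expression features and physiological signals is therefore a key step in advancing micro-expression analysis.

Result: Experiments show that multimodal fusion significantly improves micro-expression recognition and spotting. According to the authors, MMME is the most comprehensive micro-expression dataset to date in terms of modality diversity, providing important data support for related research.

Insight: Multimodal data, physiological signals in particular, adds a new dimension to micro-expression analysis, reveals the potential of visual-physiological synergy, and pushes the field from single-modality visual analysis toward multimodal fusion.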

Abstract: Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an individual’s genuine emotional state. Their analysis has attracted considerable interest due to its promising applications in fields such as healthcare, criminal investigation, and human-computer interaction. However, existing ME research is limited to single visual modality, overlooking the rich emotional information conveyed by other physiological modalities, resulting in ME recognition and spotting performance far below practical application needs. Therefore, exploring the cross-modal association mechanism between ME visual features and physiological signals (PS), and developing a multimodal fusion framework, represents a pivotal step toward advancing ME analysis. This study introduces a novel ME dataset, MMME, which, for the first time, enables synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). By overcoming the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS, establishing a robust foundation for investigating ME neural mechanisms and conducting multimodal fusion-based analyses. Extensive experiments validate the dataset’s reliability and provide benchmarks for ME analysis, demonstrating that integrating MEs with PS significantly enhances recognition and spotting performance. To the best of our knowledge, MMME is the most comprehensive ME dataset to date in terms of modality diversity. It provides critical data support for exploring the neural mechanisms of MEs and uncovering the visual-physiological synergistic effects, driving a paradigm shift in ME research from single-modality visual analysis to multimodal fusion. The dataset will be publicly available upon acceptance of this paper.

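[58] DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction cs.CV | cs.AI

Junli Deng, Ping Shi, Qipei Li, Jinyang Guo

TL;DR: DynaSplat extends Gaussian Splatting to dynamic scene reconstruction by combining dynamic-static separation with hierarchical motion modeling, improving reconstruction quality on complex dynamic scenes.

Details

Motivation: Existing methods reconstruct complex dynamic scenes poorly. DynaSplat aims to address this through dynamic-static separation and hierarchical motion modeling.

Result: On challenging datasets, DynaSplat surpasses existing methods in accuracy and realism while being more compact and efficient.

Insight: Combining dynamic-static separation with hierarchical motion modeling markedly improves dynamic scene reconstruction, and physically-based opacity estimation strengthens visual coherence.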

Abstract: Reconstructing intricate, ever-changing environments remains a central ambition in computer vision, yet existing solutions often crumble before the complexity of real-world dynamics. We present DynaSplat, an approach that extends Gaussian Splatting to dynamic scenes by integrating dynamic-static separation and hierarchical motion modeling. First, we classify scene elements as static or dynamic through a novel fusion of deformation offset statistics and 2D motion flow consistency, refining our spatial representation to focus precisely where motion matters. We then introduce a hierarchical motion modeling strategy that captures both coarse global transformations and fine-grained local movements, enabling accurate handling of intricate, non-rigid motions. Finally, we integrate physically-based opacity estimation to ensure visually coherent reconstructions, even under challenging occlusions and perspective shifts. Extensive experiments on challenging datasets reveal that DynaSplat not only surpasses state-of-the-art alternatives in accuracy and realism but also provides a more intuitive, compact, and efficient route to dynamic scene reconstruction.

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng

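TL;DR: OctoNav proposes a generalist navigation agent framework that follows free-form instructions, built on the multimodal benchmark OctoNav-Bench and the hybrid-trained method OctoNav-R1.

Details

Motivation: Existing navigation research is fragmented into separate tasks (e.g., ObjNav, ImgNav) and lacks generality. The paper aims to build a generalist navigation agent that handles free-form instructions spanning multiple modalities and capabilities.

Result: OctoNav-R1 outperforms previous methods, supporting the feasibility of generalist navigation agents.

Insight: Incorporating chain-of-thought (CoT) reasoning into navigation improves the model's reasoning ability and offers a new direction for generalist navigation tasks.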

Abstract: Embodied navigation stands as a foundational pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of multiple modalities and capabilities. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains specifically designed learning policies and rewards. Importantly, for TBA-SFT and Nav-GRPO designs, we are inspired by the OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer. Thus, we aim to investigate how to achieve thinking-before-action in the embodied navigation field, to improve the model's reasoning ability toward generalists. Specifically, we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a cold-start phase and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.

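[60] Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition cs.CV | cs.AI

Panagiotis Kaliosis, John Pavlopoulos

TL;DR: The paper proposes a new loss function that uses the Wasserstein distance to align the character frequency distribution of predicted text with a target distribution, improving the accuracy and robustness of handwritten text recognition, and shows a retraining-free variant that applies the same idea at inference time.

Details

Motivation: Handwritten text recognition degrades when character frequency distributions shift across time periods or regions, and existing methods cope poorly with this kind of distribution shift.

Result: Experiments across multiple datasets and architectures confirm the method's effectiveness in improving generalization and performance.

Insight: Aligning character frequency distributions is key to making handwritten text recognition more robust, and it can improve existing models without retraining.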

Abstract: Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at https://github.com/pkaliosis/fada.
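To make the idea of comparing character frequency distributions concrete, here is a minimal sketch of a one-dimensional Wasserstein-1 distance between the character histogram of a predicted string and that of a reference corpus. It assumes, purely for illustration, that characters sit at unit-spaced positions in a fixed alphabet order; the paper's actual ground metric, differentiable formulation, and decoding integration are not specified in the abstract, and the helper names are hypothetical.

```python
from collections import Counter


def char_frequencies(text, alphabet):
    """Relative frequency of each alphabet character in text."""
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[ch] / total for ch in alphabet]


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on unit-spaced bins.

    For 1D distributions this equals the summed absolute difference of
    the cumulative distributions.
    """
    cumulative_gap = 0.0
    distance = 0.0
    for pi, qi in zip(p, q):
        cumulative_gap += pi - qi
        distance += abs(cumulative_gap)
    return distance


alphabet = "abc"
target = char_frequencies("abc", alphabet)      # stands in for training data
predicted = char_frequencies("aab", alphabet)   # stands in for a hypothesis
penalty = wasserstein_1d(predicted, target)
print(penalty)
```

A score like `penalty` could be added to a training loss or used to re-rank decoding hypotheses, which mirrors the two uses the abstract describes, though the exact weighting is left open here.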

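[61] IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments cs.CV

Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat

TL;DR: IntPhys 2 is a video benchmark for evaluating how well deep learning models understand intuitive physics. It focuses on four core principles and uses a violation-of-expectation framework to challenge models in complex virtual environments.

Details

Motivation: Current deep learning models understand intuitive physics in complex scenes far less well than humans do, and a benchmark is needed to drive improvements in model architectures and training methods.

Result: Most evaluated models perform at chance level (50%) across the four principles, in stark contrast to near-perfect human accuracy.

Insight: The study exposes a large gap between current deep learning models and human-like intuitive physics understanding, pointing to the need for advances in model design and training.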

Abstract: We present IntPhys 2, a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. Building on the original IntPhys benchmark, IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity. These conditions are inspired by research into intuitive physical understanding emerging during early childhood. IntPhys 2 offers a comprehensive suite of tests, based on the violation of expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments. Alongside the benchmark, we provide performance evaluations of several state-of-the-art models. Our findings indicate that while these models demonstrate basic visual understanding, they face significant challenges in grasping intuitive physics across the four principles in complex scenes, with most models performing at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy. This underscores the gap between current models and human-like intuitive physics understanding, highlighting the need for advancements in model architectures and training methodologies.
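The 50% chance level follows naturally when evaluation is done on matched pairs of possible and impossible videos. The sketch below scores such pairs under an assumed protocol: each video receives a scalar "surprise" score from the model, and a pair counts as correct when the impossible video is more surprising. Both the protocol details and the scores are illustrative assumptions, not the benchmark's official evaluation code.

```python
def pairwise_voe_accuracy(pairs):
    """Fraction of (possible, impossible) pairs ranked correctly.

    Each pair holds two hypothetical surprise scores; the model is right
    when it finds the impossible event strictly more surprising.
    """
    correct = sum(1 for possible, impossible in pairs if impossible > possible)
    return correct / len(pairs)


# Made-up surprise scores for four matched video pairs.
scores = [(0.1, 0.9), (0.5, 0.2), (0.3, 0.4), (0.7, 0.6)]
accuracy = pairwise_voe_accuracy(scores)
print(accuracy)  # a model guessing at random would hover around 0.5
```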

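[62] 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation cs.CV | cs.AI

Seonho Lee, Jiho Choi, Inha Kang, Jiwook Kim, Junsung Park

TL;DR: The paper proposes Geometric Distillation, a lightweight fine-tuning method that distills geometric cues (sparse correspondences, relative depth relations, and dense cost volumes) from off-the-shelf 3D foundation models to strengthen the 3D spatial understanding of pretrained vision-language models (VLMs).

Details

Motivation: Existing vision-language models perform well on diverse visual and linguistic tasks but remain limited in their understanding of 3D spatial structure.

Result: On 3D vision-language reasoning and 3D perception benchmarks, the method consistently outperforms prior approaches at significantly lower computational cost.

Insight: Geometric distillation offers an efficient and scalable path from 2D-trained VLMs to 3D understanding, widening their use in spatially grounded multimodal tasks.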

Abstract: Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.

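[63] The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge cs.CV

Haoru Wang, Kai Ye, Yangyan Li, Wenzheng Chen, Baoquan Chen

TL;DR: The paper proposes a generalizable novel view synthesis framework that reduces reliance on 3D knowledge. By minimizing 3D inductive bias and pose dependence, it learns implicit 3D awareness directly from sparse 2D images and generates high-quality novel views.

Details

Motivation: Current novel view synthesis methods usually depend on strong 3D knowledge, such as explicit 3D representations or known camera poses, which limits their generalization and their ability to exploit data. The paper examines the potential of reducing this dependence and finds that methods requiring less 3D knowledge improve faster as data scales.

Result: Experiments show that the method generates photorealistic, 3D-consistent novel views without 3D inductive bias or pose annotations, with performance comparable even to methods that rely on posed inputs.

Insight: As data scales up, reducing dependence on 3D knowledge allows implicit 3D awareness to be learned more efficiently, yielding a more flexible and scalable approach to novel view synthesis.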

Abstract: We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations, such as NeRF or 3DGS, into network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses of input views, critical questions regarding the role of 3D knowledge and the necessity of circumventing its use remain under-explored. In this work, we conduct a systematic analysis of 3D knowledge and uncover a critical trend: the performance of methods that require less 3D knowledge improves faster as data scales, eventually reaching performance on par with their 3D knowledge-driven counterparts, which highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by and following this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving performance comparable even to methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: https://pku-vcl-geometry.github.io/Less3Depend/ .

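[64] EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks cs.CV

Athinoulla Konstantinou, Georgios Leontidis, Mamatha Thota, Aiden Durrant

TL;DR: EquiCaps is a predictor-free capsule network approach to pose-aware self-supervised learning. It exploits the intrinsic pose awareness of capsules to improve pose estimation without a dedicated predictor.

Details

Motivation: The work explores self-supervised representation learning that achieves equivariance without relying on a dedicated predictor, and tests the inherent advantages of capsule networks for pose-aware representations.

Result: On the 3DIEBench rotation prediction benchmark, EquiCaps reaches an $R^2$ of 0.78, outperforming SIE and CapsIE, and it maintains robust equivariant performance under combined geometric transformations.

Insight: Capsule networks are inherently pose-aware, which removes the need for a dedicated predictor, and the predictor-free design generalizes better on more complex tasks.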

Abstract: Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on rotation prediction, achieving a supervised-level $R^2$ of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 $R^2$, respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures.
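The reported metric is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The following minimal sketch shows how that score is computed for scalar regression targets; the benchmark's actual evaluation (for example, how multi-dimensional rotation parameters are aggregated) is not described in the abstract, and the sample values below are made up.

```python
def r2_score(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


targets = [1.0, 2.0, 3.0]
perfect = r2_score(targets, [1.0, 2.0, 3.0])    # exact predictions
baseline = r2_score(targets, [2.0, 2.0, 2.0])   # always predict the mean
print(perfect, baseline)
```

A score of 1 means the predictions explain all variance in the targets, while 0 means they do no better than predicting the mean, which gives a sense of scale for the 0.78 reported above.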

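[65] Only-Style: Stylistic Consistency in Image Generation without Content Leakage cs.CV

Tilemachos Aravanis, Panagiotis Filntisis, Petros Maragos, George Retsinas

TL;DR: The paper proposes Only-Style, a method that addresses content leakage in style-consistent image generation. By localizing leakage and adaptively tuning a style alignment parameter, it separates semantic content from stylistic elements, and a new evaluation framework shows significant improvement over prior methods.

Details

Motivation: Existing style-consistent generation methods struggle to separate semantic content from stylistic elements, which causes content to leak from the reference image into the targets. The paper targets a method that preserves stylistic consistency while avoiding this leakage.

Result: Experiments show that Only-Style significantly outperforms existing methods in both stylistic consistency and leakage elimination, with robust results across diverse instances.

Insight: Adaptively tuning the style alignment parameter avoids the semantic content leakage that style alignment can cause while keeping the visual style consistent.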

Abstract: Generating images in a consistent reference visual style remains a challenging computer vision task. State-of-the-art methods aiming for style-consistent generation struggle to effectively separate semantic content from stylistic elements, leading to content leakage from the image provided as a reference to the targets. To address this challenge, we propose Only-Style: a method designed to mitigate content leakage in a semantically coherent manner while preserving stylistic consistency. Only-Style works by localizing content leakage during inference, allowing the adaptive tuning of a parameter that controls the style alignment process, specifically within the image patches containing the subject in the reference image. This adaptive process best balances stylistic consistency with leakage elimination. Moreover, the localization of content leakage can function as a standalone component, given a reference-target image pair, allowing the adaptive tuning of any method-specific parameter that provides control over the impact of the stylistic reference. In addition, we propose a novel evaluation framework to quantify the success of style-consistent generations in avoiding undesired content leakage. Our approach demonstrates a significant improvement over state-of-the-art methods through extensive evaluation across diverse instances, consistently achieving robust stylistic consistency without undesired content leakage.

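[66] MetricHMR: Metric Human Mesh Recovery from Monocular Images cs.CV

He Zhang, Chentao Song, Hongwen Zhang, Tao Yu

TL;DR: MetricHMR is a new approach that recovers a metric human mesh and accurate global translation from monocular images, addressing the scale and depth ambiguity of existing methods.

Details

Motivation: Existing human mesh recovery (HMR) methods suffer from severe scale and depth ambiguity, which makes the reconstructed global translation and body shape inaccurate and unfit for practical use.

Result: MetricHMR achieves state-of-the-art performance in metric pose, shape, and global translation estimation across both indoor and in-the-wild scenarios.

Insight: The standard perspective projection model is key to metric-scale HMR, and a ray map built on it can jointly encode bounding-box information, camera parameters, and geometric cues to improve reconstruction accuracy.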

Abstract: We introduce MetricHMR (Metric Human Mesh Recovery), an approach for metric human mesh recovery with accurate global translation from monocular images. In contrast to existing HMR methods that suffer from severe scale and depth ambiguity, MetricHMR is able to produce geometrically reasonable body shape and global translation in the reconstruction results. To this end, we first systematically analyze previous HMR methods on camera models to emphasize the critical role of the standard perspective projection model in enabling metric-scale HMR. We then validate the acceptable ambiguity range of metric HMR under the standard perspective projection model. Finally, we contribute a novel approach that introduces a ray map based on the standard perspective projection to jointly encode bounding-box information, camera parameters, and geometric cues for End2End metric HMR without any additional metric-regularization modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance, even compared with sequential HMR methods, in metric pose, shape, and global translation estimation across both indoor and in-the-wild scenarios.
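A ray map under the standard perspective (pinhole) projection can be sketched as follows: each sampled pixel inside a person's bounding box is mapped to the unit direction of the camera ray passing through it, so the map depends on both the box location and the camera intrinsics. This is only an illustration of the general idea; the grid sampling, the function names, and the intrinsics used here are assumptions, and the paper's actual encoding may differ.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit direction of the camera ray through pixel (u, v) for a pinhole camera."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def ray_map(bbox, grid, fx, fy, cx, cy):
    """Sample a grid x grid map of rays over a bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    rows = []
    for j in range(grid):
        v = y0 + (j + 0.5) * (y1 - y0) / grid   # cell-center sampling
        row = []
        for i in range(grid):
            u = x0 + (i + 0.5) * (x1 - x0) / grid
            row.append(pixel_ray(u, v, fx, fy, cx, cy))
        rows.append(row)
    return rows


# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0
center_ray = pixel_ray(cx, cy, fx, fy, cx, cy)
rays = ray_map((100.0, 50.0, 300.0, 450.0), 4, fx, fy, cx, cy)
print(center_ray, rays[0][0])
```

Because the same crop placed at a different image location yields different ray directions, a representation of this kind carries the positional and focal-length information that a cropped image alone discards.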

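[67] Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering cs.CV

Jianhan Qi, Yuheng Jia, Hui Liu, Junhui Hou

TL;DR: The paper proposes a hyperspectral image clustering method based on structural-spectral graph convolution and evidence-guided edge learning, integrated into a contrastive learning framework, that improves clustering accuracy.

Details

Motivation: Hyperspectral image clustering works without annotations, existing graph neural networks do not fully exploit spectral information, and inaccurate superpixel topological graphs can confuse class semantics during aggregation, all of which limit clustering quality.

Result: On four datasets, clustering accuracy improves by 2.61%, 6.06%, 4.96%, and 3.15% over the best compared methods.

Insight: Jointly extracting spatial and spectral features and adaptively refining graph edges are key to better hyperspectral image clustering.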

Abstract: Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.

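[68] Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos cs.CV | cs.AI | cs.CL

Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck

TL;DR: The paper presents OKCV, a video-grounded multi-turn dialogue dataset that requires models to combine video content with outside knowledge to answer questions, and reports baseline results that expose the task's challenges.

Details

Motivation: The work extends OK-VQA to video-grounded dialogue, where a model must both recognize visual information in the video and draw on external knowledge during the conversation, and it provides data to support research on this setting.

Result: Baseline results on the dataset reveal the task's challenges, such as combining video content with external knowledge.

Insight: Video dialogue requires handling temporal visual information and external knowledge together, opening a new direction for multimodal dialogue systems.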

Abstract: In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.

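[69] LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation cs.CV

Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He

TL;DR: The paper proposes LEO-VL, a 3D vision-language model built on an efficient condensed feature grid (CFG) scene representation and trained on large-scale 3D-VL data. It removes the data scalability obstacle facing 3D-VL generalists and achieves state-of-the-art results on several 3D QA benchmarks.

Details

Motivation: 3D-VL models still lag behind their 2D counterparts in capability and robustness, mainly because of limited data scalability and high token overhead. The goal is a generalist model that understands 3D scenes and performs a wide range of tasks.

Result: LEO-VL reaches state-of-the-art performance on benchmarks including SQA3D, MSQA, and Beacon3D, and ablations confirm the efficiency of CFG and the value of task and scene diversity.

Insight: An efficient scene representation and diverse data are key to 3D-VL generalists, and SceneDPO effectively improves model robustness.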

Abstract: Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation. We propose LEO-VL, a 3D-VL model built upon condensed feature grid (CFG), an efficient scene representation that bridges 2D perception and 3D spatial structure while significantly reducing token overhead. This efficiency unlocks large-scale training towards 3D-VL generalist, for which we curate over 700k high-quality 3D-VL data spanning four domains of real-world indoor scenes and five tasks such as captioning and dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the efficiency of our representation, the importance of task and scene diversity, and the validity of our data curation principle. Furthermore, we introduce SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL models. We hope our findings contribute to the advancement of scalable and robust 3D-VL generalists.

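[70] CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models cs.CV | cs.AI | I.2.10; I.4.8

Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi

TL;DR: CausalVQA is a causal reasoning benchmark built on real-world videos that tests models' understanding of causality in the physical world through five question types. Frontier multimodal models fall substantially below human performance, especially on anticipation and hypothetical questions.

Details

Motivation: Existing VQA benchmarks either focus on surface perceptual understanding or are confined to narrow physical reasoning questions in simulated environments. CausalVQA fills this gap with more challenging real-world causal reasoning questions that test a model's ability to predict the outcomes of actions and events.

Result: Experiments show that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions.

Insight: Current systems remain weak at spatial-temporal reasoning, understanding physical principles, and reasoning about alternative possibilities, and further research is needed to improve prediction in real-world settings.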

Abstract: We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models’ understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models’ ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.

Ziyi Wang, Yanran Zhang, Jie Zhou, Jiwen Lu

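TL;DR: UniPre3D proposes a unified pre-training method for 3D point clouds that uses cross-modal Gaussian splatting to learn effectively from both object-level and scene-level point clouds.

Details

Motivation: Existing 3D point cloud pre-training methods do not transfer uniformly across scales and model architectures, so performance is uneven between object-level and scene-level tasks. UniPre3D aims to solve this problem.

Result: Experiments across a variety of object-level and scene-level tasks validate the generality and effectiveness of UniPre3D.

Insight: Introducing cross-modal information, such as 2D features, helps the model focus on learning 3D geometric structure, and Gaussian splatting provides an efficient form of pixel-level supervision.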

Abstract: The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model’s focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at https://github.com/wangzy22/UniPre3D.

[72] Vision Generalist Model: A Survey cs.CV | cs.AIPDF

Ziyi Wang, Yongming Rao, Shuofeng Sun, Xinrun Liu, Yi Wei

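TL;DR: This paper surveys vision generalist models, covering their background, framework design, and performance-enhancing techniques, and discusses application scenarios and future research directions.

Details

Motivation: Inspired by the success of generalist models in natural language processing, researchers are applying them to computer vision tasks. The inputs and outputs of vision tasks are far more diverse, however, and representing them in a unified way is a major challenge.

Result: The survey highlights the potential of vision generalist models and their prospects across diverse tasks.

Insight: Progress on vision generalist models depends on handling the diversity of inputs and outputs; future work could combine multimodal techniques with task-adaptive methods to improve performance further.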

Abstract: Recently, we have witnessed the great success of the generalist model in natural language processing. The generalist model is a general framework trained with massive data and is able to process various downstream tasks simultaneously. Encouraged by their impressive performance, an increasing number of researchers are venturing into the realm of applying these models to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to summarize them as a unified representation. In this paper, we provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. First, we review the background, including the datasets, tasks, and benchmarks. Then, we dig into the design of frameworks that have been proposed in existing research, while also introducing the techniques employed to enhance their performance. To better help the researchers comprehend the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we provide some real-world application scenarios, undertake a thorough examination of the persistent challenges, and offer insights into possible directions for future research endeavors.

[73] Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy cs.CV | cs.LG | 68T45 (Machine learning), 92C55 (Biomedical imaging and signal processing) | I.2.10; I.2.6; J.3PDF

Sushant Gautam, Michael A. Riegler, Pål Halvorsen

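TL;DR: Introduces Kvasir-VQA-x1, a large-scale gastrointestinal endoscopy dataset for medical visual question answering (MedVQA). It extends the original Kvasir-VQA with 159,549 new question-answer pairs and adds visual augmentations that mimic real clinical imaging conditions.

Details

Motivation: Existing MedVQA datasets lack clinical complexity and visual diversity, which limits the development of clinical decision support systems. Kvasir-VQA-x1 aims to fill this gap.

Result: The new dataset supports both standard VQA evaluation and robustness testing, providing a more challenging and clinically relevant benchmark.

Insight: By introducing complexity-stratified questions and visual perturbations, Kvasir-VQA-x1 allows a fuller assessment of model behavior in realistic clinical settings and supports the development of more reliable multimodal AI systems.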

Abstract: Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model’s inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: https://github.com/Simula/Kvasir-VQA-x1 and https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1

[74] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing cs.CV | cs.AI | I.2PDF

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu

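TL;DR: Proposes strengthening the spatial reasoning of vision-language models through visual drawing, trained with a three-stage framework; experiments show clear gains across several spatial reasoning tasks.

Details

Motivation: Existing vision-language models underperform on spatial reasoning because they rely on purely textual reasoning, which cannot handle geometric and spatial relations precisely. The paper addresses this by giving models drawing operations with which to express spatial relations directly.

Result: The proposed model, VILASR, consistently outperforms existing methods on maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning, with an average improvement of 18.4%.

Insight: Expressing spatial relations through drawing operations is closer to human reasoning than text-only reasoning and lifts the performance ceiling of earlier approaches. The three-stage training framework (cold-start training, reflective rejection sampling, reinforcement learning) offers a scalable recipe for training spatial reasoning in vision-language models.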

Abstract: As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.

[75] Efficient Part-level 3D Object Generation via Dual Volume Packing cs.CVPDF

Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li

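TL;DR: Proposes an end-to-end framework with a dual volume packing strategy that generates high-quality, multi-part 3D objects from a single image, with an arbitrary number of semantically meaningful, editable parts.

Details

Motivation: Existing 3D object generation methods usually produce a single fused mesh, which limits part-level editing. The number of parts varies widely across objects, and flexible ways of handling this variation are lacking.

Result: Experiments show that the method outperforms previous image-based part-level generation methods in quality, diversity, and generalization.

Insight: Dual volume packing handles a variable number of parts while keeping each part complete, semantically meaningful, and editable.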

Abstract: Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.

[76] ReSim: Reliable World Simulation for Autonomous Driving cs.CV | cs.ROPDF

Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao

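TL;DR: ReSim is a reliable world simulation method that combines real-world driving data with simulated driving data. Built on a diffusion transformer video generator, it improves the controllability and fidelity of simulated driving scenarios and adds a Video2Reward module that estimates action rewards. Reported results show gains in both visual fidelity and downstream task performance.

Details

Motivation: Current driving world models are trained mainly on expert driving data and struggle to simulate hazardous or non-expert behaviors, which limits their use in tasks such as policy evaluation.

Result: ReSim achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

Insight: Combining heterogeneous data with a diffusion transformer architecture improves the diversity and reliability of driving scenario simulation, making it suitable for evaluating complex driving behaviors.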

Abstract: How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim’s simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

[77] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions cs.CV | cs.AI | cs.SDPDF

Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin

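TL;DR: Proposes InterActHuman, a framework for highly controllable human animation in multi-concept scenes. Explicit layout control and region-level matching of multi-modal conditions address the inability of existing methods to precisely control interactions among multiple entities.

Details

Motivation: Existing end-to-end human animation methods typically animate a single subject and inject multi-modal conditions globally, so they cannot handle scenes with multi-entity interaction, which limits practical use.

Result: Experiments and ablation studies indicate that explicit layout control for multi-modal conditions is more effective than implicit counterparts and other existing methods.

Insight: Explicit layout control matters for multi-concept interaction scenes and offers a workable approach for future multi-entity animation generation.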

Abstract: End-to-end human animation with rich multi-modal conditions, such as text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity’s spatiotemporal footprint. Given reference images of multiple concepts, our method automatically infers layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.

[78] A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs cs.CV | cs.LGPDF

Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha

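TL;DR: Proposes MVP (Minimal Video Pairs), a benchmark for evaluating the physical understanding of video-language models; minimal-change pairs prevent models from scoring well through shortcuts based on superficial visual or textual cues.

Details

Motivation: Existing video QA benchmarks are susceptible to shortcut solutions based on superficial visual or textual cues, which inflates scores and makes evaluation inaccurate.

Result: Human performance on MVP is 92.9%, the best open-source video-language model reaches 40.2%, and random performance is 25%, which indicates how challenging the benchmark is.

Insight: Because a question counts as correct only when both examples in a minimal-change pair are answered correctly, models that rely solely on shortcuts score below random, which gives a more accurate picture of physical understanding.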

Abstract: Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark is comprised of 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair – a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that solely rely on visual or textual biases would achieve below random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2% compared to random performance at 25%.
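The paired scoring rule described in the abstract can be illustrated with a minimal sketch. The scoring rule itself (both members of a pair must be correct) comes from the abstract; the result format, the function names, and the toy data below are hypothetical and are not taken from the benchmark's code.

```python
from typing import List, Tuple

# Each tuple holds (correct_on_video_a, correct_on_video_b) for one
# minimal-change pair. This result format is a hypothetical choice.
PairResult = Tuple[bool, bool]


def paired_accuracy(results: List[PairResult]) -> float:
    """Fraction of pairs in which BOTH examples were answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(1 for a, b in results if a and b) / len(results)


def per_example_accuracy(results: List[PairResult]) -> float:
    """Conventional accuracy that scores each example independently."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(int(a) + int(b) for a, b in results) / (2 * len(results))


# A model that ignores the video gives the same answer to both members of a
# pair. Since the two ground-truth answers differ, it can be right on at most
# one of them; this toy data assumes it is right on exactly one.
shortcut_model = [(True, False)] * 50 + [(False, True)] * 50

print(per_example_accuracy(shortcut_model))  # 0.5
print(paired_accuracy(shortcut_model))       # 0.0
```

Under per-example scoring the shortcut model looks moderately competent, while paired scoring gives it no credit, which is the effect the abstract describes.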

[79] EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits cs.CV | cs.AI | cs.LGPDF

Ron Yosef, Moran Yanuka, Yonatan Bitton, Dani Lischinski

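TL;DR: Introduces EditInspector, a human-annotated benchmark for evaluating text-guided image edits, uses it to assess how well existing models verify edits, and proposes two new methods that outperform state-of-the-art models in artifact detection and difference caption generation.

Details

Motivation: Rapid progress in generative AI has made text-guided image editing widespread, but there is no systematic way to verify the quality and accuracy of these edits.

Result: Existing models struggle to evaluate edits comprehensively and frequently hallucinate when describing changes; the proposed methods outperform SoTA models on artifact detection and difference caption generation.

Insight: 1. Evaluating text-guided edits calls for a more systematic approach; 2. current models still need to improve on complex evaluation tasks; 3. human-annotated data plays an important role in evaluating generative tasks.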

Abstract: Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.

[80] Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes cs.CVPDF

Yiming Dou, Wonseok Oh, Yuqing Luo, Antonio Loquercio, Andrew Owens

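TL;DR: Studies how to make 3D scene reconstructions interactive by predicting the sounds of physical interactions from 3D hand trajectories; experiments show that the generated sounds accurately convey material properties and actions.

Details

Motivation: The goal is to make 3D scenes more interactive, so that they can be not only visualized but can also convey details of physical interaction, such as material and action, through sound.

Result: The generated sounds accurately convey material properties and actions, and human observers often cannot distinguish them from real sounds.

Insight: Sound is an effective complement to visual reconstruction for conveying physical interaction in 3D scenes and may help make virtual reality or games more interactive.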

Abstract: We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: https://www.yimingdou.com/hearing_hands/

[81] PlayerOne: Egocentric World Simulator cs.CVPDF

Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang

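TL;DR: PlayerOne is presented as the first egocentric realistic world simulator. It generates immersive videos that are strictly aligned with the user's real motion, with precise control achieved through a coarse-to-fine training pipeline and a part-disentangled motion injection scheme.

Details

Motivation: There has been no simulator that can dynamically generate egocentric videos aligned with real user motion, a capability that matters for areas such as virtual reality and human-computer interaction.

Result: Experiments show precise control over varying human movements and world-consistent modeling of diverse scenarios, indicating strong generalization.

Insight: Egocentric video simulation requires precise part-level control and scene-consistent modeling, which opens new directions for world modeling and its applications.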

Abstract: We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.

cs.CL [Back]

[82] PHRASED: Phrase Dictionary Biasing for Speech Translation cs.CL | cs.AI | cs.SD | eess.ASPDF

Peidong Wang, Jian Xue, Rui Zhao, Junkun Chen, Aswin Shanmugam Subramanian

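TL;DR: Proposes a phrase dictionary biasing method that improves the translation accuracy of phrases in speech translation, validated on a streaming speech translation model and a multimodal large language model.

Details

Motivation: Phrases occur rarely in training data, which makes translating them accurately a challenge in speech translation. The paper proposes phrase dictionary biasing to address this.

Result: Phrase dictionary biasing outperforms phrase list biasing by 21% relative for the streaming speech translation model and yields an 85% relative improvement in phrase recall for the multimodal large language model.

Insight: Phrase dictionary biasing applies not only to conventional speech translation models but also improves phrase translation in large language models, suggesting a way to optimize models with external knowledge.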

Abstract: Phrases are essential to understanding the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to two types of widely adopted models, a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that the phrase dictionary biasing method outperforms phrase list biasing by 21% relative for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving an 85% relative improvement in phrase recall.

[83] Extrapolation by Association: Length Generalization Transfer in Transformers cs.CL | cs.AIPDF

Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos

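TL;DR: Studies length generalization in Transformer language models. Through the lens of task association, it shows that a model can inherit generalization ability from a related task and thereby extrapolate from shorter to longer inputs.

Details

Motivation: Transformers generalize impressively in natural language domains, but the mechanism behind length generalization (extrapolating from short to long inputs) is not well understood. The paper examines the question from the angle of task association.

Result: Models can inherit length generalization from related tasks when trained jointly, and similar transfer effects appear in pretrained models. Initial mechanistic evidence indicates that the transfer correlates with the reuse of the same attention heads across tasks.

Insight: Generalization in Transformers can arise through associations between tasks, suggesting that models hold reusable computational structure and compose it across tasks. This gives a new perspective on how models extrapolate.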

Abstract: Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization (the ability to extrapolate from shorter to longer inputs) through the lens of task association. We find that length generalization can be transferred across related tasks. That is, training a model with a longer and related auxiliary task can lead it to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.

[84] Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat cs.CL | cs.AI | cs.CY | I.2.7; K.4PDF

Zhuofang Li, Rafal Kocielnik, Fereshteh Soltani, Penphob Boonyarungsrit

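TL;DR: Proposes the Self-Anchored Attention Model (SAAM) for sample-efficient classification of prosocial text in game chat under low-resource conditions, with a 7.9% improvement over the best existing technique.

Details

Motivation: Prior work on in-game chat has focused mainly on detecting relatively small volumes of toxic content. Recognizing and promoting prosocial behavior matters for encouraging positive interaction, yet datasets and models for it are scarce.

Result: SAAM improves on the best existing technique by 7.9% on prosocial behavior classification and was applied to Call of Duty: Modern Warfare II.

Insight: The work points toward moderation that actively encourages positive interaction instead of only penalizing toxicity, and shows that NLP can be applied in low-resource settings.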

Abstract: Millions of players engage daily in competitive online games, communicating through in-game chat. Prior research has focused on detecting relatively small volumes of toxic content using various Natural Language Processing (NLP) techniques for the purpose of moderation. However, recent studies emphasize the importance of detecting prosocial communication, which can be as crucial as identifying toxic interactions. Recognizing prosocial behavior allows for its analysis, rewarding, and promotion. Unlike toxicity, there are limited datasets, models, and resources for identifying prosocial behaviors in game-chat text. In this work, we employed unsupervised discovery combined with game domain expert collaboration to identify and categorize prosocial player behaviors from game chat. We further propose a novel Self-Anchored Attention Model (SAAM), which gives a 7.9% improvement compared to the best existing technique. The approach utilizes the entire training set as “anchors” to help improve model performance under the scarcity of training data. This approach led to the development of the first automated system for classifying prosocial behaviors in in-game chats, particularly in low-resource settings where large-scale labeled data is not available. Our methodology was applied to one of the most popular online gaming titles, Call of Duty®: Modern Warfare® II, showcasing its effectiveness. This research is novel in applying NLP techniques to discover and classify prosocial behaviors in player in-game chat communication. It can help shift the focus of moderation from solely penalizing toxicity to actively encouraging positive interactions on online platforms.

[85] Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models cs.CLPDF

Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

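TL;DR: Proposes a framework that quantitatively measures the faithfulness of self-generated natural language explanations (self-NLE) from large language models by comparing them with interpretations of the model's internal hidden states, linking self-explanations to the model's actual reasoning.

Details

Motivation: Existing methods assess self-NLE faithfulness mainly through behavioral tests or computational block identification and do not examine the model's neural activity. The paper fills this gap by analyzing hidden states directly.

Result: The framework establishes a direct connection between self-NLE and model reasoning and offers a new view on self-NLE faithfulness; the abstract reports no quantitative results.

Insight: Self-explanations can look logically sound without reflecting the model's real decision process; analyzing neural activity directly can help expose such unfaithfulness.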

Abstract: Large Language Models (LLMs) have demonstrated the capability of generating free-text self Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model’s reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model’s internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.

[86] $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding cs.CL | cs.AIPDF

Cesare Spinoso-Di Piano, David Austin, Pablo Piantanida, Jackie Chi Kit Cheung

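TL;DR: Proposes $(RSA)^2$, a new RSA framework that interprets figurative language (such as irony and hyperbole) by considering the speaker's rhetorical strategy, without modeling the speaker's motivations for being non-literal. It achieves state-of-the-art performance on the ironic split of the new PragMega+ dataset.

Details

Motivation: Figurative language such as irony and hyperbole is ubiquitous in human communication, yet its literal and intended meanings do not match. Existing RSA implementations either cannot account for such expressions or require setting-specific modeling of the speaker's motivations, so a more general approach is needed.

Result: Combined with large language models (LLMs), $(RSA)^2$ achieves state-of-the-art performance on the ironic split of PragMega+, an irony interpretation dataset introduced in the study.

Insight: Modeling rhetorical strategy instead of the speaker's specific motivations gives a more general way to interpret figurative language, which matters for capturing complex intent in natural language processing tasks.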

Abstract: Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in human communication, resulting in utterances where the literal and the intended meanings do not match. The Rational Speech Act (RSA) framework, which explicitly models speaker intentions, is the most widespread theory of probabilistic pragmatics, but existing implementations are either unable to account for figurative expressions or require modeling the implicit motivations for using figurative language (e.g., to express joy or annoyance) in a setting-specific way. In this paper, we introduce the Rhetorical-Strategy-Aware RSA $(RSA)^2$ framework which models figurative language use by considering a speaker’s employed rhetorical strategy. We show that $(RSA)^2$ enables human-compatible interpretations of non-literal utterances without modeling a speaker’s motivations for being non-literal. Combined with LLMs, it achieves state-of-the-art performance on the ironic split of PragMega+, a new irony interpretation dataset introduced in this study.

[87] Towards Efficient and Effective Alignment of Large Language Models cs.CLPDF

Yuxin Jiang

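TL;DR: Presents a set of methods for aligning large language models efficiently and effectively, with new techniques for data collection, training, and evaluation.

Details

Motivation: Large language models (LLMs) perform well across many tasks, but aligning them precisely with human expectations remains a key challenge. Existing approaches to data collection, training, and evaluation have limitations that make alignment hard to achieve.

Result: The proposed methods perform well on zero-shot reasoning, data diversity, and alignment tasks. FollowBench exposes weaknesses in current models' adherence to constraints.

Insight: Aligning LLMs requires progress on several fronts, from data generation through training optimization to evaluation. Evaluating constraint-following ability is an important direction for future improvement.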

Abstract: Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation. We first address alignment data collection. Existing approaches rely heavily on manually curated datasets or proprietary models. To overcome these limitations, we propose Lion, an adversarial distillation framework that iteratively refines training data by identifying and generating challenging instructions, enabling state-of-the-art zero-shot reasoning. Additionally, we introduce Web Reconstruction (WebR), a fully automated framework that synthesizes instruction-tuning data directly from raw web documents, significantly improving data diversity and scalability over existing synthetic data methods. Next, we enhance alignment training through novel optimization techniques. We develop Learning to Edit (LTE), a framework that enables LLMs to efficiently integrate new knowledge while preserving existing information. LTE leverages meta-learning to improve both real-time and batch knowledge updates. Furthermore, we introduce Bridging and Modeling Correlations (BMC), a refinement of Direct Preference Optimization (DPO) that explicitly captures token-level correlations in preference data, leading to superior alignment across QA and mathematical reasoning tasks. Finally, we tackle the challenge of evaluating alignment. Existing benchmarks emphasize response quality but overlook adherence to specific constraints. To bridge this gap, we introduce FollowBench, a multi-level, fine-grained benchmark assessing LLMs’ ability to follow complex constraints across diverse instruction types. Our results expose key weaknesses in current models’ constraint adherence, offering insights for future improvements.
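For readers unfamiliar with the objective that BMC refines, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It illustrates plain DPO only, not BMC: the token-level correlation modeling mentioned in the abstract is not shown. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    # How much more the policy prefers the chosen response over the rejected
    # one, relative to the reference model, scaled by beta.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy matches the reference, the loss equals log(2).
print(dpo_loss(-1.5, -1.5, -1.5, -1.5))
# Raising the chosen log-probability relative to the reference lowers the loss.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

The loss decreases as the policy widens its preference margin for the chosen response beyond what the reference model already assigns.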

[88] Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation cs.CL | cs.AI | cs.MAPDF

Arjun Vaithilingam Sudhakar

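TL;DR: Examines whether large language models (LLMs) possess a form of theory of mind, that is, whether they can reason about the intentions of others, and studies the question through cooperative multi-agent reinforcement learning (MARL) with the aim of improving cooperation between AI agents and human or artificial partners.

Details

Motivation: Modern LLMs generalize strongly in zero-shot and few-shot settings, but it is unclear whether they can understand and reason about others' intentions. This ability is crucial for multi-agent cooperation, especially between humans and AI.

Result: The abstract reports no quantitative results. The work investigates theory of mind in LLM-based agents that cooperate through natural language interaction, as a step toward hybrid human-AI systems.

Insight: LLMs may be useful not only for single-agent tasks but also in complex social interaction, which could advance human-AI collaboration.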

Abstract: Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others’ intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents’ ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.

[89] RePO: Replay-Enhanced Policy Optimization cs.CL | cs.AI | cs.LGPDF

Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu

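TL;DR: RePO uses diverse replay strategies to retrieve off-policy samples from a replay buffer, so that each prompt is optimized over a broader set of samples, improving data efficiency and performance over GRPO when training large language models.

Details

Motivation: GRPO estimates advantages from multiple on-policy samples per prompt, which is computationally expensive and data-inefficient; a more efficient method is needed.

Result: Across seven mathematical reasoning benchmarks, RePO outperforms GRPO, with absolute average gains of 18.4 points for Qwen2.5-Math-1.5B and 4.1 points for Qwen3-1.7B.

Insight: By combining on-policy and off-policy samples, RePO raises the number of effective optimization steps by 48% at a 15% increase in computational cost (Qwen3-1.7B, with 8 on-policy and 8 off-policy samples).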

Abstract: Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to 8. The repository can be accessed at https://github.com/SihengLi99/RePO.
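To make the idea of mixing on-policy and replayed samples concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the `ReplayBuffer` class, the `build_group` helper, and the uniform random retrieval are all hypothetical stand-ins (the abstract does not describe the actual replay strategies), and the advantage is the usual group-relative normalization (reward minus group mean, divided by group standard deviation).

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Hypothetical per-prompt store of (response, reward) pairs."""

    def __init__(self):
        self._store = {}

    def add(self, prompt, samples):
        self._store.setdefault(prompt, []).extend(samples)

    def retrieve(self, prompt, k, rng):
        # Assumption: uniform random retrieval stands in for the paper's
        # "diverse replay strategies", which the abstract does not specify.
        pool = self._store.get(prompt, [])
        return rng.sample(pool, min(k, len(pool)))

    def size(self, prompt):
        return len(self._store.get(prompt, []))


def build_group(prompt, on_policy, buffer, k_off, rng):
    """Combine fresh on-policy samples with replayed off-policy ones."""
    off_policy = buffer.retrieve(prompt, k_off, rng)
    group = list(on_policy) + off_policy
    advantages = group_relative_advantages([r for _, r in group])
    buffer.add(prompt, on_policy)  # fresh samples become future replay data
    return group, advantages


rng = random.Random(0)
buffer = ReplayBuffer()
buffer.add("2+2=?", [("4", 1.0), ("5", 0.0)])
group, adv = build_group("2+2=?", [("4", 1.0), ("22", 0.0)], buffer, 2, rng)
```

A real training loop would also have to correct for the fact that replayed samples come from an older policy (for example with importance weights); that part is omitted here.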

[90] Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models cs.CL | cs.AIPDF

Jui-Ming Yao, Hao-Yuan Chen, Zi-Xian Tang, Bing-Jia Tan, Sheng-Wei Peng

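TL;DR: The paper proposes Token Constraint Decoding (TCD), an inference-time algorithm that improves the robustness of large language models under noisy inputs, particularly on multiple-choice question answering, with absolute gains of up to 39%.

Details

Motivation: Large language models perform well on multiple-choice question answering but are highly sensitive to input perturbations, which degrade performance. The paper aims to make models more robust in noisy settings.

Result: Experiments on CommonsenseQA, MMLU, and MMLU-Pro show that TCD substantially improves robustness, with gains of up to 39% on weaker models such as Gemma3 1B.

Insight: TCD implicitly regularizes overconfident outputs, which stabilizes models under noise. Different models need different penalty schedules to maximize robustness, a useful reference for future research and deployment.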

Abstract: Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
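The abstract does not spell out the decoding rule, so the following Python sketch shows only one plausible reading of a penalty-based token constraint for multiple-choice answers: tokens outside an allowed answer set have a fixed penalty subtracted from their logits before the argmax. The function name `constrained_argmax`, the dictionary representation of logits, and the penalty values are assumptions made for illustration, not the paper's algorithm.

```python
def constrained_argmax(logits, allowed, penalty):
    """Pick the best token after penalizing tokens outside `allowed`.

    logits:  dict mapping token string -> raw score (hypothetical format)
    allowed: set of tokens that are valid answers, e.g. {"A", "B", "C", "D"}
    penalty: non-negative value subtracted from every disallowed token
    """
    adjusted = {
        tok: score if tok in allowed else score - penalty
        for tok, score in logits.items()
    }
    return max(adjusted, key=adjusted.get)


# A noisy prompt nudges the model toward a non-answer token ("The").
logits = {"A": 2.0, "B": 1.5, "The": 2.4}
options = {"A", "B", "C", "D"}

unconstrained = constrained_argmax(logits, options, penalty=0.0)
constrained = constrained_argmax(logits, options, penalty=1.0)
```

With a zero penalty the non-answer token wins; with a penalty of 1.0 the prediction falls back to a valid option. Sweeping the penalty value is a simple way to explore the model-specific penalty schedules the abstract mentions.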

[91] PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering cs.CL | cs.IRPDF

Xiujun Zhou, Pingjian Zhang, Deyou Tang

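TL;DR: The paper proposes PGDA-KGQA, a framework that uses multiple data augmentation strategies to improve knowledge graph question answering (KGQA), addressing data scarcity and the shortage of multi-hop reasoning samples and outperforming existing methods.

Details

Motivation: Existing KGQA methods are limited by scarce annotated data and too few multi-hop reasoning samples. Traditional data augmentation is prone to semantic distortion, while LLM-based methods neglect multi-hop reasoning. An augmentation method is needed that preserves semantics while increasing diversity.

Result: On the standard datasets WebQSP and ComplexWebQuestions, PGDA-KGQA improves F1, Hits@1, and accuracy by 2.8%, 1.2%, and 3.1%, and by 1.8%, 1.1%, and 2.4%, respectively, over existing methods.

Insight: Prompt design is key to using LLMs efficiently; combining several augmentation strategies balances semantic alignment and diversity; and reverse path exploration yields more realistic multi-hop training samples.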

Abstract: Knowledge Graph Question Answering (KGQA) is a crucial task in natural language processing that requires reasoning over knowledge graphs (KGs) to answer natural language questions. Recent methods utilizing large language models (LLMs) have shown remarkable semantic parsing capabilities but are limited by the scarcity of diverse annotated data and multi-hop reasoning samples. Traditional data augmentation approaches focus mainly on single-hop questions and are prone to semantic distortion, while LLM-based methods primarily address semantic distortion but usually neglect multi-hop reasoning, thus limiting data diversity. The scarcity of multi-hop samples further weakens models’ generalization. To address these issues, we propose PGDA-KGQA, a prompt-guided generative framework with multiple data augmentation strategies for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by crafting meticulously engineered prompts that integrate the provided textual content, it leverages LLMs to generate large-scale (question, logical form) pairs for model training. Specifically, PGDA-KGQA enriches its training set by: (1) generating single-hop pseudo questions to improve the alignment of question semantics with KG relations; (2) applying semantic-preserving question rewriting to improve robustness against linguistic variations; (3) employing answer-guided reverse path exploration to create realistic multi-hop questions. By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA utilizes the augmented data to enhance the accuracy of logical form generation and thus improve answer retrieval performance. Experiments demonstrate that PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving improvements on WebQSP of 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions of 1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.

[92] Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings cs.CLPDF

Md Messal Monem Miah, Adrita Anika, Xi Shi, Ruihong Huang

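TL;DR: The paper evaluates large language models (LLMs) and large multimodal models (LMMs) on deception detection. Fine-tuned LLMs perform best on textual deception detection, while LMMs are limited in their use of multimodal cues.

Details

Motivation: Detecting deception in the digital world is increasingly challenging, so the performance and potential of existing models in multimodal settings need to be assessed.

Result: Fine-tuned LLMs achieve state-of-the-art performance on textual deception detection, whereas LMMs fail to fully exploit cross-modal cues. Auxiliary features such as non-verbal gestures have limited effect on performance.

Insight: LLMs show considerable potential for multimodal deception detection but need better cross-modal information fusion; the prompting strategy has a marked effect on model performance.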

Abstract: Detecting deception in an increasingly digital world is both a critical and challenging task. In this study, we present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across diverse domains. We assess the performance of both open-source and commercial LLMs on three distinct datasets: real life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the effectiveness of different experimental setups for deception detection, including zero-shot and few-shot approaches with random or similarity-based in-context example selection. Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues. Additionally, we analyze the impact of auxiliary features, such as non-verbal gestures and video summaries, and examine the effectiveness of different prompting strategies, including direct label generation and chain-of-thought reasoning. Our findings provide key insights into how LLMs process and interpret deceptive cues across modalities, highlighting their potential and limitations in real-world deception detection applications.

[93] Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms cs.CL | cs.LGPDF

Zeguan Xiao, Yun Chen, Guanhua Chen

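TL;DR: The paper proposes POET, a method that addresses the "reward-generation gap" in direct alignment algorithms (DAAs) by truncating preferred and dispreferred responses to the same length, improving on standard implementations.

Details

Motivation: The authors identify a key problem in DAAs: a gap between the training-time optimization objective and inference-time generation quality. One contributor is that the importance of prefix tokens during generation is not matched by the weight the implicit reward function gives them.

Result: POET improves DAAs such as DPO and SimPO, with gains of up to 15.6 points on AlpacaEval 2 and overall improvements on downstream tasks.

Insight: The paper exposes the mismatch between reward optimization and generation performance and addresses it with a simple, effective method, suggesting a new direction for improving DAAs.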

Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the “reward-generation gap”: a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find that a contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one’s length. Because POET truncates both responses in each sample to equal length, and the truncated lengths vary across samples, the optimization of the DAA objective is implicitly constrained to converge across all positions, thus paying more attention to prefix tokens than standard DAAs do. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving gains of up to 15.6 points on AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
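The truncation step described in the abstract is simple enough to sketch directly. The Python below is a minimal illustration, assuming responses are already tokenized into lists; the function name `poet_truncate` is hypothetical and the snippet covers only the data preprocessing, not the DPO or SimPO loss that would consume the truncated pairs.

```python
def poet_truncate(chosen, rejected):
    """Truncate both responses to the length of the shorter one.

    chosen, rejected: token sequences (lists) for the preferred and
    dispreferred response of one preference pair.
    """
    n = min(len(chosen), len(rejected))
    return chosen[:n], rejected[:n]


chosen = ["The", "answer", "is", "42", "because", "6", "*", "7"]
rejected = ["The", "answer", "is", "41"]
short_chosen, short_rejected = poet_truncate(chosen, rejected)
```

Because the shorter response differs from pair to pair, the truncation length varies across the dataset, which is the property the abstract relies on to spread the training signal toward earlier (prefix) positions.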

[94] Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers cs.CL | cs.LGPDF

Ilanit Sobol, Shir Lissak, Refael Tikochinski, Tal Nakash, Anat Brunstein Klomek

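TL;DR: By combining computational methods with expert knowledge, this paper studies digital markers of suicidal behavior on YouTube and uncovers new behavioral patterns and clinical insights.

Details

Motivation: Suicide is a leading cause of death in Western countries, and social media offers a new data source for studying suicidal behavior. The paper explores how signs of suicidal behavior can be identified from digital footprints on YouTube.

Result: Among the topics associated with suicide attempts, Mental Health Struggles and YouTube Engagement show significant temporal changes. YouTube Engagement was not flagged by the clinical expert, which demonstrates the distinct value of the bottom-up approach. The study also finds differences in motivation for sharing the experience: some individuals aim to help others, while others frame it as part of their personal recovery.

Insight: The paper underlines the importance of combining computational methods with expert knowledge and shows that platform-specific behavioral patterns may serve as new markers of suicide risk, offering a new research perspective for suicide prevention.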

Abstract: Suicide remains a leading cause of death in Western countries, underscoring the need for new research approaches. As social media becomes central to daily life, digital footprints offer valuable insight into suicidal behavior. Focusing on individuals who attempted suicide while uploading videos to their channels, we investigate: How do suicidal behaviors manifest on YouTube, and how do they differ from expert knowledge? We applied complementary approaches: computational bottom-up, hybrid, and expert-driven top-down, on a novel longitudinal dataset of 181 YouTube channels from individuals with life-threatening attempts, alongside 134 control channels. In the bottom-up approach, we applied LLM-based topic modeling to identify behavioral indicators. Of 166 topics, five were associated with suicide attempts, with two also showing temporal attempt-related changes ($p<.01$): Mental Health Struggles ($+0.08$)* and YouTube Engagement ($+0.1$)*. In the hybrid approach, a clinical expert reviewed LLM-derived topics and flagged 19 as suicide-related. However, none showed significant attempt-related temporal effects beyond those identified bottom-up. Notably, YouTube Engagement, a platform-specific indicator, was not flagged by the expert, underscoring the value of bottom-up discovery. In the top-down approach, psychological assessment of suicide attempt narratives revealed that the only significant difference between individuals who attempted before and those who attempted during their upload period was the motivation to share this experience: the former aimed to Help Others ($\beta=-1.69$, $p<.01$), while the latter framed it as part of their Personal Recovery ($\beta=1.08$, $p<.01$). By integrating these approaches, we offer a nuanced understanding of suicidality, bridging digital behavior and clinical insights. * Within-group changes in relation to the suicide attempt.

[95] Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning cs.CLPDF

Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li

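TL;DR: The paper examines how system configuration and numerical precision (for example GPU type, GPU count, and batch size) make the outputs of large language models (LLMs) hard to reproduce, and proposes LayerCast, a lightweight inference pipeline that balances memory efficiency and numerical stability.

Details

Motivation: LLM outputs differ noticeably across hardware configurations and numerical precisions. In reasoning tasks in particular, tiny floating-point errors can cause outputs to diverge, which undermines the credibility of existing evaluations.

Result: In reasoning tasks, differences in GPU count, GPU type, and batch size can change accuracy by up to 9% and response length by up to 9,000 tokens. LayerCast markedly improves the stability of results.

Insight: The non-associativity of floating-point arithmetic amplifies errors under limited precision and affects LLM outputs; ignoring numerical precision in evaluation practice can lead to misleading conclusions.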

Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a 9,000-token difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision, while critical for reproducibility, is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
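The root cause named in the abstract, non-associative floating-point addition, can be reproduced in a few lines of Python. This is a generic illustration using 64-bit floats, not an excerpt from the paper's experiments; the effect is larger at bfloat16 precision, and the variable names are chosen only for this example.

```python
# The same three numbers, added in two different groupings.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one reduction order
right = a + (b + c)  # another reduction order, e.g. a different kernel tiling

# Mathematically identical, but the rounded results differ in the last bit.
same = left == right
gap = abs(left - right)
```

In an LLM, a different batch size or GPU count changes how partial sums are grouped inside matrix multiplications, so logits can differ by amounts like `gap`. Under greedy decoding, such a difference can flip the argmax when two tokens are nearly tied, after which the generated sequences diverge.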

[96] TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding cs.CL | cs.AIPDF

Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, Yuyu Luo

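TL;DR: The paper proposes a unified rotary position embedding (RoPE) method that lets Transformer and state space model (SSM) layers be combined in a hybrid architecture, TransXSSM, improving the efficiency and performance of long-sequence modeling.

Details

Motivation: Transformers capture long-range dependencies well and SSMs support linear-time sequence modeling, but their positional encoding mechanisms are inconsistent (explicit RoPE versus implicit positional representations via convolutions), which hurts performance. A unified solution is needed.

Result: At a 4K sequence length, TransXSSM trains 42.3% faster and runs inference 29.5% faster than standard Transformer models, improves accuracy on language modeling benchmarks by over 4%, and the 1.3B version gains 7.22% in average accuracy over the 320M version.

Insight: Unified positional encoding is key to efficient long-context modeling in hybrid models, because it avoids the performance loss caused by inconsistent positional mechanisms.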

Abstract: Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently integrates the Transformer and SSM layers under this unified positional encoding scheme. At a 4K sequence length, TransXSSM exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains 7.22% in average accuracy over its 320M version (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.
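For readers unfamiliar with rotary embeddings, the Python sketch below implements standard RoPE on a plain list of floats and checks its defining property: the dot product of a rotated query and key depends only on their relative offset. This is the ordinary Transformer formulation, shown as background; it is not the paper's unified variant for SSM layers, and the helper names `rope` and `dot` are chosen for this example.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector.

    Each consecutive pair (vec[i], vec[i+1]) is rotated by the angle
    pos * base ** (-i / d), where d is the vector length.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Positions (5, 3) and (2, 0) share the same relative offset of 2.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 2), rope(k, 0))
```

The two scores agree up to rounding, which is why RoPE encodes relative position inside attention. A hybrid model needs its non-attention layers to see position in a compatible way, which is the gap the abstract says the unified scheme closes.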

[97] ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning cs.CL | cs.AI | cs.MAPDF

Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao

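TL;DR: The paper introduces ReasonMed, a dataset of 370K high-quality medical reasoning examples produced through a multi-agent verification and refinement process, and investigates the best fine-tuning strategy for medical reasoning models.

Details

Motivation: Reasoning-oriented large language models excel in mathematics and programming, but their abilities in knowledge-intensive medical question answering remain underexplored. The paper aims to fill this gap.

Result: The trained ReasonMed-7B model is the best among sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.

Insight: Multi-agent verification and refinement markedly improve dataset quality; combining detailed chain-of-thought reasoning with concise answer summaries is the most effective fine-tuning strategy for medical question answering.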

Abstract: Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.

[98] MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions cs.CLPDF

Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou

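TL;DR: The paper proposes MEDUSA, a multimodal deep-fusion framework with multi-stage training for speech emotion recognition (SER) in naturalistic conditions. Its four-stage training pipeline handles class imbalance and emotion ambiguity, and it ranked first in the Interspeech 2025 challenge.

Details

Motivation: SER in naturalistic conditions is difficult because emotions are subjective and unevenly represented in the data, and conventional methods struggle with both problems.

Result: MEDUSA ranked 1st in Task 1 of the Interspeech 2025 challenge.

Insight: Multimodal fusion and multi-stage training substantially improve SER performance, especially under the data imbalance and ambiguity of naturalistic conditions.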

Abstract: SER is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.

[99] Gender Bias in English-to-Greek Machine Translation cs.CLPDF

Eleni Gkovedarou, Joke Daems, Luna De Bruyne

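TL;DR: The study investigates gender bias in English-to-Greek translation by commercial machine translation systems (Google Translate and DeepL), documenting male bias, occupational stereotyping, and errors in anti-stereotypical translations, and explores GPT-4o as a bias mitigation tool.

Details

Motivation: As demand for inclusive language grows, so does concern that machine translation systems reinforce gender stereotypes. This study focuses on the understudied English-to-Greek language pair and evaluates gender bias in commercial systems.

Result: The MT systems perform well when gender is explicit (DeepL outperforms Google Translate and GPT-4o on feminine gender-unambiguous sentences) but fail to produce inclusive or neutral translations when gender is unspecified. GPT-4o can generate alternatives, but residual bias remains.

Insight: Commercial MT systems still fall short on gender bias. GPT-4o shows promise as a generative tool but needs further refinement to achieve genuinely gender-inclusive translation.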

Abstract: As the demand for inclusive language increases, concern has grown over the susceptibility of machine translation (MT) systems to reinforce gender stereotypes. This study investigates gender bias in two commercial MT systems, Google Translate and DeepL, focusing on the understudied English-to-Greek language pair. We address three aspects of gender bias: i) male bias, ii) occupational stereotyping, and iii) errors in anti-stereotypical translations. Additionally, we explore the potential of prompted GPT-4o as a bias mitigation tool that provides both gender-explicit and gender-neutral alternatives when necessary. To achieve this, we introduce GendEL, a manually crafted bilingual dataset of 240 gender-ambiguous and unambiguous sentences that feature stereotypical occupational nouns and adjectives. We find persistent gender bias in translations by both MT systems; while they perform well in cases where gender is explicitly defined, with DeepL outperforming both Google Translate and GPT-4o in feminine gender-unambiguous sentences, they are far from producing gender-inclusive or neutral translations when the gender is unspecified. GPT-4o shows promise, generating appropriate gendered and neutral alternatives for most ambiguous cases, though residual biases remain evident.

[100] From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies cs.CL | cs.AI | cs.LGPDF

Blaž Škrlj, Boshko Koloski, Senja Pollak, Nada Lavrač

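TL;DR: This systematic survey examines the synergy between knowledge graphs (KGs) and large language models (LLMs), dividing existing work into KG-enhanced LLMs and LLM-augmented KGs and highlighting mutual benefits and future research directions.

Details

Motivation: The goal is to combine the structured knowledge of KGs with the language abilities of LLMs, in order to strengthen LLM reasoning, reduce hallucinations, and make KG construction and querying more efficient.

Result: The survey shows the potential of integrating KGs and LLMs, including improved reasoning and fewer hallucinations, and its role in bringing intelligent systems to more complex tasks.

Insight: Key takeaways include the advantages of two-way synergy, the central role of computational efficiency and data quality, and open challenges in neuro-symbolic integration and ethics.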

Abstract: Integrating structured knowledge from Knowledge Graphs (KGs) into Large Language Models (LLMs) enhances factual grounding and reasoning capabilities. This survey paper systematically examines the synergy between KGs and LLMs, categorizing existing approaches into two main groups: KG-enhanced LLMs, which improve reasoning, reduce hallucinations, and enable complex question answering; and LLM-augmented KGs, which facilitate KG construction, completion, and querying. Through comprehensive analysis, we identify critical gaps and highlight the mutual benefits of structured knowledge integration. Compared to existing surveys, our study uniquely emphasizes scalability, computational efficiency, and data quality. Finally, we propose future research directions, including neuro-symbolic integration, dynamic KG updating, data reliability, and ethical considerations, paving the way for intelligent systems capable of managing more complex real-world knowledge tasks.

[101] Using Sign Language Production as Data Augmentation to enhance Sign Language Translation cs.CL | cs.CVPDF

Harry Walsh, Maksym Ivashechkin, Richard Bowden

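TL;DR: The paper proposes using Sign Language Production as data augmentation to improve Sign Language Translation models on low-resource sign language datasets.

Details

Motivation: Sign language data is scarce and expensive to collect, which limits the performance of sign language translation models. Sign Language Production offers a new route to data augmentation.

Result: Data augmentation improves translation performance by up to 19%, providing a workable approach to sign language translation in low-resource settings.

Insight: Progress in Sign Language Production can ease data scarcity and may offer lessons for other low-resource language tasks.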

Abstract: Machine learning models fundamentally rely on large quantities of high-quality data. Collecting the necessary data for these models can be challenging due to cost, scarcity, and privacy restrictions. Signed languages are visual languages used by the deaf community and are considered low-resource languages. Sign language datasets are often orders of magnitude smaller than their spoken language counterparts. Sign Language Production is the task of generating sign language videos from spoken language sentences, while Sign Language Translation is the reverse translation task. Here, we propose leveraging recent advancements in Sign Language Production to augment existing sign language datasets and enhance the performance of Sign Language Translation models. For this, we utilize three techniques: a skeleton-based approach to production, sign stitching, and two photo-realistic generative models, SignGAN and SignSplat. We evaluate the effectiveness of these techniques in enhancing the performance of Sign Language Translation models by generating variation in the signer’s appearance and the motion of the skeletal data. Our results demonstrate that the proposed methods can effectively augment existing datasets and enhance the performance of Sign Language Translation models by up to 19%, paving the way for more robust and accurate Sign Language Translation systems, even in resource-constrained environments.

[102] Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering cs.CL | cs.IR | cs.LG | I.2.6PDF

Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu

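TL;DR: The paper proposes RAPL, a framework that improves the efficiency and generalization of graph retrieval for knowledge graph question answering (KGQA). RAPL combines a two-stage labeling strategy, a model-agnostic graph transformation, and a path-based reasoning strategy to improve retrieval performance and generalization.

Details

Motivation: Conventional retrieval-augmented generation (RAG) relies on unstructured text, which limits interpretability and structured reasoning. Knowledge graphs are a better fit because of their structure, but existing graph retrievers generalize poorly. The paper addresses this problem.

Result: RAPL outperforms existing methods by 2.66% to 20.34%, significantly narrows the gap between smaller and larger LLM-based reasoners, and performs well in cross-dataset settings.

Insight: Through structured supervision and path-based reasoning, RAPL shows how to improve the generalization of graph retrieval, suggesting a new way to combine knowledge graphs with LLMs.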

Abstract: Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by 2.66% to 20.34%, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Code is available at: https://github.com/tianyao-aka/RAPL.

[103] Query-Level Uncertainty in Large Language Models cs.CLPDF

Lihu Chen, Gaël Varoquaux

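TL;DR: This paper addresses query-level uncertainty detection in large language models. It proposes Internal Confidence, a training-free method that uses self-evaluation to judge whether the model can answer a query; experiments show it outperforms several baselines and can be used for efficient RAG and model cascading.

Details

Motivation: To make large language models more efficient and trustworthy, they need to recognize the boundary of their own knowledge, which enables adaptive inference such as invoking RAG or abstaining.

Result: In the experiments, Internal Confidence outperforms baseline methods on several tasks and effectively reduces inference cost.

Insight: A model's internal confidence is an effective indicator of its knowledge boundary, and training-free methods are more flexible in practice.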

Abstract: It is important for Large Language Models to be aware of the boundary of their knowledge, the mechanism of identifying known and unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine if the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, which is able to reduce inference costs while maintaining performance.

[104] Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data cs.CL | cs.AIPDF

Hao Xiong, Chuanyuan Tan, Wenliang Chen

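TL;DR: The paper tackles two problems in unstructured knowledge editing (UKE): the lack of locality evaluation and the abnormal failure of fine-tuning (FT) based methods. It builds two extended datasets, UnKEBench-Loc and AKEW-Loc, and derives an optimized fine-tuning method, FT-UKE. Experiments show that FT-UKE outperforms existing state-of-the-art methods, including in batch editing.

Details

Motivation: UKE is essential for updating the knowledge of large language models (LLMs), but existing methods lack locality evaluation and fine-tuning methods fail abnormally. The paper aims to resolve both issues and find a better fine-tuning recipe.

Result: FT-UKE clearly outperforms existing state-of-the-art methods. In batch editing its advantage grows with batch size, with the average metric lead rising from +6.78% to +10.80%.

Insight: A properly tuned fine-tuning method is strong on unstructured knowledge editing, locality evaluation is key to assessing edited models, and the method scales well in batch editing.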

Abstract: Unstructured Knowledge Editing (UKE) is crucial for updating the relevant knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed effective methods and tested them, some issues exist: (1) Lack of Locality evaluation for UKE, and (2) Abnormal failure of fine-tuning (FT) based methods for UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how the well-performing FT-based methods should be trained for the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE shows strong performance as well, with its advantage over SOTA methods increasing as the batch size grows, expanding the average metric lead from +6.78% to +10.80%.

[105] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation cs.CL | cs.CV | cs.SEPDF

Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo

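TL;DR: ComfyUI-R1 is the first large reasoning model for automated workflow generation. It is trained with a two-stage framework (CoT fine-tuning followed by reinforcement learning) and significantly outperforms existing methods.

Details

Motivation: AI-generated content has moved from single models to modular workflows (for example on the ComfyUI platform), but designing workflows requires expertise and presents a steep learning curve. ComfyUI-R1 is proposed to simplify this process.

Result: The 7B-parameter model achieves a 97% format validity rate and significantly surpasses methods built on leading closed-source models such as GPT-4o and Claude in node-level and graph-level F1 scores.

Insight: Chain-of-thought reasoning and representing workflows as code are the key ingredients, and the model is especially strong at synthesizing intricate workflows with diverse nodes for AI art creation.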

Abstract: AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.

[106] CoRT: Code-integrated Reasoning within Thinking cs.CL | cs.AI | cs.LGPDF

Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao

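TL;DR: CoRT is a post-training framework that uses code-integrated reasoning to improve the efficiency and accuracy of large reasoning models (LRMs) on complex mathematical operations, and it addresses data scarcity through Hint-Engineering.

Details

Motivation: Large reasoning models such as o1 and DeepSeek-R1 perform well at natural language reasoning but are inefficient or inaccurate on complex mathematical operations. Directly attaching computational tools such as a code interpreter brings in external knowledge beyond the model's internal text representations, so the naive combination is not efficient.

Result: Across five mathematical reasoning datasets, Hint-Engineering models achieve absolute improvements of 4% and 8% on the 32B and 1.5B models, respectively, while using about 30% and 50% fewer reasoning tokens.

Insight: 1. Hint-Engineering effectively addresses data scarcity for code-integrated reasoning; 2. Efficient use of the code interpreter markedly improves both performance and efficiency on mathematical reasoning; 3. A small amount of high-quality data can still yield substantial gains.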

Abstract: Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model’s internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.

[107] EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection cs.CL | cs.AIPDF

Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus

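TL;DR: EmoNet-Voice is a fine-grained, expert-verified benchmark for speech emotion detection. It comprises a large-scale pre-training dataset and an expert-annotated benchmark dataset, evaluates speech emotion recognition models on 40 emotion categories, and is validated by psychology experts.

Details

Motivation: Current speech emotion recognition datasets are limited by coarse emotional granularity, privacy concerns, and reliance on acted portrayals. A more comprehensive, privacy-preserving benchmark is needed to advance the emotional understanding capabilities of AI systems.

Result: Evaluations show that high-arousal emotions (e.g., anger) are easier to detect than low-arousal states (e.g., concentration), and the proposed models agree closely with expert annotations.

Insight: Synthetic data combined with expert validation is an effective privacy-preserving approach, and fine-grained emotion categories with intensity labels support a more complete evaluation of speech emotion recognition.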

Abstract: The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.

[108] Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning cs.CL | cs.AI | math.ST | stat.ME | stat.THPDF

Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu

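TL;DR: The paper proposes a framework based on causal sufficiency and necessity to improve chain-of-thought (CoT) reasoning. By quantifying the influence of each reasoning step and guiding the addition and pruning of steps, it improves reasoning efficiency and cost-effectiveness.

Details

Motivation: Chain-of-thought (CoT) is essential for giving large language models (LLMs) complex reasoning capabilities, but reasoning steps may be insufficient to support the conclusion or unnecessary for it, which hurts both accuracy and efficiency.

Result: On multiple mathematical and commonsense reasoning benchmarks, the method substantially improves reasoning efficiency and reduces token usage while maintaining accuracy.

Insight: Quantifying causal sufficiency and necessity is an effective way to optimize reasoning chains, offering a new direction for improving LLM reasoning performance and controlling cost.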

Abstract: Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.

[109] Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs cs.CL | cs.AIPDF

Rodion Oblovatny, Alexandra Bazarova, Alexey Zaytsev

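TL;DR: This paper proposes a new method for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between the hidden-state distributions of the prompt and the response, using trainable deep kernels to increase sensitivity.

Details

Motivation: Hallucination in large language models is an increasingly prominent problem. Traditional methods rely on external knowledge or auxiliary models and lack a model-intrinsic detection mechanism. The authors aim to use distributional distance as a principled hallucination score.

Result: The method outperforms existing baselines on several benchmarks and remains competitive even without kernel training.

Insight: Hallucinations may stem from superficial rephrasing rather than substantive reasoning, and distributional distance can serve as an effective detection signal.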

Abstract: We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.
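
The idea of scoring a response by its distributional distance from the prompt can be illustrated with a small sketch. The abstract does not name the distance or the kernel, so the squared maximum mean discrepancy (MMD) with a fixed RBF kernel used here is an assumption standing in for the paper's trainable deep kernels, and the two-dimensional "hidden states" are made-up toy vectors.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Fixed RBF kernel; a stand-in for a learned deep kernel (assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of vectors."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, gamma) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy hidden states (hypothetical values, not taken from any model).
prompt_states = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05]]
near_response = [[0.0, 0.12], [0.1, 0.02], [0.06, 0.05]]   # stays close to the prompt
far_response = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.95]]       # moves away from the prompt

score_near = mmd_squared(prompt_states, near_response)
score_far = mmd_squared(prompt_states, far_response)
```

Under the abstract's finding, a small distance such as `score_near` would be the more suspicious case, since hallucinated responses are reported to deviate less from their prompts than grounded ones.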

[110] The Emergence of Abstract Thought in Large Language Models Beyond Any Language cs.CL | cs.AIPDF

Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi

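TL;DR: The study finds that large language models (LLMs) gradually develop a core language-agnostic parameter space during training that supports abstract thought across languages. Both the proportion and the importance of shared neurons grow as models develop.

Details

Motivation: Preliminary studies suggested that the hidden activations of LLMs are dominated by English, but improving multilingual performance challenges this view. The authors investigate whether LLMs really rely on a specific language to think.

Result: Experiments show that the proportion and functional importance of shared neurons increase over time, supporting the emergence of cross-lingual abstract thought. The proposed training strategies are effective across different LLM families.

Insight: The abstract thinking ability of LLMs is not tied to a specific language but is realized through shared neurons. A model's development stage is an important factor in designing its training strategy.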

Abstract: As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may “think” in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space: a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model’s ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons, i.e., those that are consistently activated during the processing of particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs’ language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.

[111] VerIF: Verification Engineering for Reinforcement Learning in Instruction Following cs.CL | cs.AIPDF

Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou

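TL;DR: The paper proposes VerIF, a verification method that combines rule-based code verification with verification by a large reasoning model (e.g., QwQ-32B) to strengthen reinforcement learning with verifiable rewards (RLVR) for instruction following.

Details

Motivation: Although reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), best practices for applying it to instruction following remain underexplored.

Result: The trained models reach state-of-the-art performance among models of comparable size, generalize well to unseen constraints, and retain their general capabilities.

Insight: VerIF can be integrated into existing reinforcement learning recipes to significantly improve performance while preserving general capabilities.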

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at https://github.com/THU-KEG/VerIF.

[112] Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking cs.CLPDF

Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye

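TL;DR: The paper introduces QRHEAD, a set of query-focused retrieval heads that improve retrieval from long context, and builds on them an efficient retriever, QR-RETRIEVER, which performs strongly on long-context reasoning tasks.

Details

Motivation: Prior work identified the role of retrieval heads in long-context language models, but how to further improve these heads to strengthen retrieval and reasoning remains an open question.

Result: On the long-context reasoning tasks LongMemEval and CLIPPER, QR-RETRIEVER yields gains of over 10% compared with using the full context and outperforms other dense retrievers. On the BEIR benchmark it performs strongly as a re-ranker, surpassing LLM-based re-rankers such as RankGPT.

Insight: Query-context attention scoring and task selection are both key to identifying QRHEAD with downstream utility, and the work offers a new perspective on the long-context capabilities of LMs.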

Abstract: Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QR-RETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QR-RETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QR-RETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
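
Using accumulated attention mass as a retrieval score can be sketched roughly as follows. The abstract does not specify the aggregation, so summing over the selected heads and over all query tokens is an assumption, and the attention values and passage spans are toy numbers rather than outputs of a real model.

```python
def passage_scores(attention, passage_spans):
    """Sum attention mass from query tokens onto each passage.

    attention[h][q][c] is the weight that query token q puts on context
    token c in selected head h (hypothetical layout for this sketch).
    passage_spans is a list of (start, end) context-token ranges, end exclusive.
    """
    scores = []
    for start, end in passage_spans:
        mass = 0.0
        for head in attention:
            for query_row in head:
                mass += sum(query_row[start:end])
        scores.append(mass)
    return scores


# Two selected heads, two query tokens, four context tokens (toy values).
attention = [
    [[0.1, 0.1, 0.5, 0.3], [0.2, 0.0, 0.6, 0.2]],
    [[0.0, 0.2, 0.4, 0.4], [0.1, 0.1, 0.3, 0.5]],
]
passage_spans = [(0, 2), (2, 4)]

scores = passage_scores(attention, passage_spans)
# Passage indices ordered from highest to lowest accumulated attention.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The top-ranked passages would then be kept as the reduced context, or the ordering used directly for re-ranking.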

[113] Resa: Transparent Reasoning Models via SAEs cs.CLPDF

Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu

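TL;DR: Resa introduces an efficient sparse autoencoder tuning procedure (SAE-Tuning) that captures reasoning abilities from a source model and uses them to guide training of a target model, sharply reducing training cost and time while retaining strong performance.

Details

Motivation: How can strong reasoning be elicited from language models cheaply and efficiently? Existing methods typically rely on expensive reinforcement learning (RL) training. Resa aims to extract and transfer reasoning abilities with sparse autoencoders (SAEs), greatly reducing cost and training time.

Result: 1. Strong results on tasks such as AIME24 and AMC23 (43.33% Pass@1 and 90% Pass@1, respectively); 2. Training cost reduced by more than 2000x and training time by more than 450x; 3. The extracted abilities are generalizable and modular.

Insight: 1. Sparse autoencoders can effectively extract and transfer the reasoning abilities of language models; 2. The generality and modularity of these abilities enable flexible reuse across models; 3. Low-cost, efficient training opens new options for resource-constrained settings.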

Abstract: How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart’s reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.

[114] Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs cs.CLPDF

Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara

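TL;DR: This paper proposes a step-by-step instruction strategy and a simplified output format that significantly improve the accuracy of LLMs on dependency parsing, reaching SOTA performance on Universal Dependencies datasets across 17 languages.

Details

Motivation: Although large language models (LLMs) perform well on many tasks, standard prompting struggles to produce structurally valid and accurate output for dependency parsing. The paper addresses this by improving the instruction strategy and the output format.

Result: The method achieves SOTA performance on Universal Dependencies datasets across 17 languages, with no hallucination or contamination in the output, and improves cross-language generalization.

Insight: Explicit reasoning steps and a consistent output format are crucial for LLM-based dependency parsing, and multilingual fine-tuning is an effective way to improve cross-language generalization.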

Abstract: Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.

[115] Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages cs.CLPDF

Amel Muminovic, Amela Kadric Muminovic

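TL;DR: The study evaluates how well large language models detect toxic language in the low-resource languages Serbian, Croatian, and Bosnian, and shows that adding a short context snippet and improving prompt design can significantly improve performance.

Details

Motivation: Online toxic language causes real harm, especially in regions whose low-resource languages lack labeled data. Studying how large language models can detect toxic content in these languages is therefore important.

Result: The context-augmented mode raises recall by about 0.12 on average and F1 by up to 0.10. Gemini performs best in context-augmented mode (F1 = 0.82, accuracy = 0.82), while zero-shot GPT-4.1 leads on precision and has the lowest false positive rate.

Insight: In low-resource languages, simple context augmentation and prompt design alone can significantly improve toxicity detection, offering a practical strategy for real-world use.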

Abstract: Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
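
For reference, the five reported metrics can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions with made-up labels (1 = toxic, 0 = non-toxic); it is illustrative only and does not reproduce the study's data or evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, accuracy and false positive rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "fpr": false_positive_rate,
    }


# Hypothetical labels for eight comments.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

The trade-off described in the abstract shows up directly in these formulas: flagging more comments tends to raise recall (fewer false negatives) while also raising the false positive rate.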

[116] From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring cs.CL | cs.CYPDF

Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao

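TL;DR: The paper proposes a streaming content monitor (SCM) that detects and stops harmful output early during LLM generation, addressing the high latency of conventional full detection. With the fine-grained annotated dataset FineHarm and a model trained under dual supervision, SCM matches full detection while seeing only the first 18% of tokens.

Details

Motivation: Existing LLM moderation relies on detection over the complete output, which causes high latency. Partial detection methods directly reuse moderators trained for full detection, creating a training-inference gap. The paper aims for a solution that natively supports partial detection.

Result: SCM reaches a macro F1 score of 0.95+ while seeing only the first 18% of tokens, which is comparable to full detection, and it can also improve safety alignment.

Insight: Fine-grained annotation and dual-supervision training are the keys to early detection. SCM not only intercepts harmful content efficiently but can also serve as a pseudo-annotation tool for improving LLM safety alignment.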

Abstract: Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and leads to a higher harmlessness score than DPO.
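
The control flow of partial detection, where a monitor follows the output stream and halts generation as soon as harmfulness is detected, might look roughly like the sketch below. The keyword-based `toy_harm_score` is a placeholder assumption standing in for the learned token-level classifier; the function names, threshold, and flagged terms are all hypothetical.

```python
def monitor_stream(token_stream, harm_score, threshold=0.5):
    """Release tokens one by one, stopping before the first flagged token.

    harm_score(prefix) returns a harmfulness score for the tokens seen so far.
    Returns (released_tokens, stopped_early).
    """
    released = []  # tokens already passed on to the user
    seen = []      # tokens the monitor has inspected, including the newest
    for token in token_stream:
        seen.append(token)
        if harm_score(seen) >= threshold:
            return released, True
        released.append(token)
    return released, False


# Placeholder scorer: flags a prefix when its newest token is a listed term.
FLAGGED_TERMS = {"toxin"}


def toy_harm_score(prefix):
    return 1.0 if prefix[-1] in FLAGGED_TERMS else 0.0


harmful = ["first", "mix", "the", "toxin", "with", "water"]
benign = ["first", "mix", "the", "flour", "with", "water"]

harmful_out, harmful_stopped = monitor_stream(iter(harmful), toy_harm_score)
benign_out, benign_stopped = monitor_stream(iter(benign), toy_harm_score)
```

Because the decision is made per prefix, latency depends on how early the scorer fires rather than on the full response length, which is the advantage over full detection that the abstract describes.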

eess.AS [Back]

[117] Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements eess.AS | cs.CL | cs.HC | 68T07 | I.2.7; I.5.4; H.5.2PDF

Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga

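TL;DR: This paper presents a LoRA fine-tuned audio-language model that automatically marks the temporal boundaries of key phases in Prolonged Exposure (PE) therapy, achieving precise localization on a dataset of real PE sessions (MAE of 5.3 seconds).

Details

Motivation: In PE therapy, assessing therapist fidelity depends on manual review of session recordings, which is slow and labor-intensive. The paper aims to locate the start and stop times of core PE phases automatically, in support of clinical supervision and training.

Result: On 313 real PE sessions, the best configuration (LoRA rank 8, 30-second windows) achieves an MAE of 5.3 seconds. Window size and LoRA rank both have a marked effect on performance.

Insight: 1. Context granularity (window size) is critical for temporal localization; 2. Lightweight adaptation with LoRA suits tasks with small datasets; 3. Soft supervision can effectively learn fuzzy boundaries.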

Abstract: Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements – identifying their start and stop times – directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases – therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) – are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
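
To make the reported metric concrete, the sketch below converts normalized boundary offsets back to session time and computes the mean absolute error in seconds. The convention that an offset in [0, 1] is measured from the start of its window is an assumption, since the abstract does not define the normalization, and all numbers are made up.

```python
def offset_to_seconds(window_start, normalized_offset, window_len=30.0):
    """Map an offset in [0, 1] within a window to absolute session time.

    Assumes the offset is measured from the window start (hypothetical
    convention for this sketch).
    """
    return window_start + normalized_offset * window_len


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired boundary times, in seconds."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Two predicted boundaries from 30-second windows starting at 60 s and 300 s.
predicted = [offset_to_seconds(60.0, 0.5), offset_to_seconds(300.0, 0.25)]
reference = [70.0, 310.0]  # hypothetical rater-verified boundary times
mae = mean_absolute_error(predicted, reference)
```

Here the two errors are 5.0 s and 2.5 s, giving an MAE of 3.75 s for this toy pair.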

cs.CR [Back]

[118] Adversarial Text Generation with Dynamic Contextual Perturbation cs.CR | cs.CLPDF

Hetvi Waghela, Jaydip Sen, Sneha Rakshit, Subhasis Dasgupta

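TL;DR: The paper proposes Dynamic Contextual Perturbation (DCP), a new adversarial text attack that dynamically generates context-aware perturbations, improving the semantic consistency and fluency of adversarial examples and effectively challenging the robustness of state-of-the-art NLP models.

Details

Motivation: Existing adversarial text attacks are mostly limited to modifying words or local text segments and ignore the broader context, which makes the perturbations detectable or semantically inconsistent.

Result: Experiments show that DCP produces more natural and more effective adversarial examples and poses a substantial challenge to the robustness of existing NLP models.

Insight: Context plays a critical role in adversarial attacks, and future work needs robust NLP methods that can withstand such sophisticated attacks.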

Abstract: Adversarial attacks on Natural Language Processing (NLP) models expose vulnerabilities by introducing subtle perturbations to input text, often leading to misclassification while maintaining human readability. Existing methods typically focus on word-level or local text segment alterations, overlooking the broader context, which results in detectable or semantically inconsistent perturbations. We propose a novel adversarial text attack scheme named Dynamic Contextual Perturbation (DCP). DCP dynamically generates context-aware perturbations across sentences, paragraphs, and documents, ensuring semantic fidelity and fluency. Leveraging the capabilities of pre-trained language models, DCP iteratively refines perturbations through an adversarial objective function that balances the dual objectives of inducing model misclassification and preserving the naturalness of the text. This comprehensive approach allows DCP to produce more sophisticated and effective adversarial examples that better mimic natural language patterns. Our experimental results, conducted on various NLP models and datasets, demonstrate the efficacy of DCP in challenging the robustness of state-of-the-art NLP systems. By integrating dynamic contextual analysis, DCP significantly enhances the subtlety and impact of adversarial attacks. This study highlights the critical role of context in adversarial attacks and lays the groundwork for creating more robust NLP systems capable of withstanding sophisticated adversarial strategies.

[119] DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt cs.CR | cs.CVPDF

Yitong Zhang, Jia Li, Liyi Cai, Ge Li

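TL;DR: The paper proposes DAVSP, which uses a visual safety prompt and deep alignment to strengthen the defense of large vision-language models against malicious queries while preserving utility on benign inputs.

Details

Motivation: Existing alignment approaches struggle to resist malicious queries while still preserving utility on benign inputs. This work aims to solve that problem.

Result: Across multiple benchmarks, DAVSP resists malicious queries while preserving utility on benign inputs, and it shows cross-model generation ability.

Insight: Combining the visual safety prompt with deep alignment is the key to achieving safety alignment.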

Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while preserving utility on benign ones effectively. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model’s activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits great cross-model generation ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at https://github.com/zhangyitonggg/DAVSP.

cs.LG [Back]

[120] An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks cs.LG | cs.AI | cs.CL | cs.CRPDF

Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping

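TL;DR: The paper proposes a unified threat model that evaluates LLM jailbreak attacks using the perplexity of an N-gram language model. It finds that discrete-optimization attacks outperform LLM-based attacks and reveals the patterns that successful attacks rely on.

Details

Motivation: Existing jailbreak attacks differ widely in fluency and computational cost, and there is no unified evaluation standard. An interpretable, LLM-agnostic threat model is needed to compare these methods fairly.

Result: Attack success rates against modern safety-tuned LLMs are lower than previously reported, and discrete-optimization attacks outperform LLM-based attacks. Successful attacks often exploit rare or domain-specific bigrams.

Insight: The interpretable N-gram model exposes the underlying patterns of attacks and points to a defense direction: detecting rare or anomalous text fragments.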

Abstract: A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, nonparametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent bigrams, either selecting the ones absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.
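
The notion of checking whether a prompt is likely under an N-gram model can be illustrated with a toy bigram model. The add-one smoothing, the tiny corpus, and the function names below are assumptions made for this sketch; the paper's model is built on 1T tokens and its exact smoothing and acceptance threshold are not given in the abstract.

```python
import math
from collections import Counter


def train_bigram(corpus_tokens):
    """Count bigrams and their left contexts from a token list."""
    context_counts = Counter(corpus_tokens[:-1])
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(set(corpus_tokens)) + 1  # one extra slot for unseen tokens
    return context_counts, bigram_counts, vocab_size


def bigram_perplexity(tokens, context_counts, bigram_counts, vocab_size):
    """Perplexity of a token list under an add-one smoothed bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    log_prob = 0.0
    for prev, cur in pairs:
        # Counter returns 0 for bigrams or contexts never seen in the corpus.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = train_bigram(corpus)

fluent = "the cat sat on the rug".split()
garbled = "rug the on sat cat the".split()
ppl_fluent = bigram_perplexity(fluent, *model)
ppl_garbled = bigram_perplexity(garbled, *model)
```

A threat model of this kind would admit only attack strings whose perplexity stays below a chosen threshold; because the score decomposes over individual bigrams, the rare or unseen pairs that drive it up can be inspected directly.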

[121] Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers cs.LG | cs.AI | cs.CLPDF

Joshua Barron, Devin White

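TL;DR: By pre-training capacity-limited Transformer models, the paper studies the relationship between memorization and generalization. Small models generalize well but memorize poorly, large models do the opposite, and under joint training no model manages both.

Details

Motivation: To examine the relationship between memorization and generalization in large language models and to understand how model capacity affects these two learning modes.

Result: Small models are good at generalization but poor at memorization, and large models show the reverse. When trained on both tasks jointly, no model succeeds at generalization.

Insight: Model capacity is a key factor in determining the learning mode, which may have implications for the design and deployment of small language models.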

Abstract: The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and offers broader implications for the design and deployment of small language models.

[122] SensorLM: Learning the Language of Wearable Sensors cs.LG | cs.AI | cs.CLPDF

Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A. Ali Heydari, Girish Narayanswamy

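TL;DR: SensorLM is a family of sensor-language foundation models for understanding wearable sensor data through natural language. It tackles the challenge of aligning sensor data with language and builds the largest sensor-language dataset to date.

Details

Motivation: Wearable sensor data is pervasive, but the lack of richly annotated data makes it difficult to align and interpret sensor data with natural language.

Result: SensorLM outperforms existing methods on zero-shot recognition, few-shot learning, and cross-modal retrieval, and demonstrates scaling behavior, label efficiency, and zero-shot generalization.

Insight: SensorLM shows the potential of aligning sensor data with natural language and offers a new paradigm for sensor data understanding.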

Abstract: We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.

[123] Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search cs.LG | cs.AI | cs.CL | 68T07, 68T20, 68T30, 93E35 | I.2.6; I.2.7; I.2.8PDF

Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

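TL;DR: This paper proposes an in-context learning approach based on atomic fact augmentation and lookahead search that improves the planning ability of LLM agents, enabling more efficient multi-step reasoning in complex interactive environments.

Details

Motivation: Existing large language models (LLMs) need significant guidance or extensive interaction history to work effectively in complex interactive environments, and they struggle to adapt to new information or to use past experience efficiently for multi-step reasoning. The paper aims to improve the planning ability of LLM agents through in-context learning, without fine-tuning.

Result: On challenging interactive tasks such as TextFrozenLake and ALFWorld, the agent shows better performance and adaptability, and its behavior improves as it accumulates experience.

Insight: With atomic fact augmentation and lookahead search, an agent can improve its planning through in-context learning without weight updates, which suggests a new way to apply LLMs to interactive tasks.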

Abstract: Large Language Models (LLMs) are increasingly capable but often require significant guidance or extensive interaction history to perform effectively in complex, interactive environments. Existing methods may struggle with adapting to new information or efficiently utilizing past experiences for multi-step reasoning without fine-tuning. We introduce a novel LLM agent framework that enhances planning capabilities through in-context learning, facilitated by atomic fact augmentation and a recursive lookahead search. Our agent learns to extract task-critical “atomic facts” from its interaction trajectories. These facts dynamically augment the prompts provided to LLM-based components responsible for action proposal, latent world model simulation, and state-value estimation. Planning is performed via a depth-limited lookahead search, where the LLM simulates potential trajectories and evaluates their outcomes, guided by the accumulated facts and interaction history. This approach allows the agent to improve its understanding and decision-making online, leveraging its experience to refine its behavior without weight updates. We provide a theoretical motivation linking performance to the quality of fact-based abstraction and LLM simulation accuracy. Empirically, our agent demonstrates improved performance and adaptability on challenging interactive tasks, achieving more optimal behavior as it accumulates experience, showcased in tasks such as TextFrozenLake and ALFWorld.

[124] Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models cs.LG | cs.AI | cs.CL | cs.CVPDF

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li

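TL;DR: Athena-PRM is a multimodal process reward model that scores each step in solving complex reasoning problems. It generates high-quality process-labeled data in a data-efficient way and delivers substantial performance gains.

Details

Motivation: Developing process reward models (PRMs) conventionally takes significant time and money, and automated labeling methods such as Monte Carlo estimation often produce noisy labels at high computational cost. The paper proposes a more efficient approach to address these problems.

Result: Athena-PRM performs strongly on multiple benchmarks: with test-time scaling it gains 10.2 points on WeMath and 7.1 points on MathVista, and on VisualProcessBench it beats the previous SoTA by 3.9 F1 points.

Insight: Combining prediction consistency with data-efficient strategies markedly improves the accuracy of multimodal reasoning evaluation and offers a new direction for optimizing complex reasoning tasks.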

Abstract: We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
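The abstract names prediction consistency between weak and strong completers as the criterion for reliable labels but does not give the exact rule. The sketch below shows one possible reading, under stated assumptions: each completer yields a per-step score in [0, 1] (for example, the fraction of its completions that reach the correct answer), scores are binarized at a hypothetical `threshold`, and a step keeps its label only when both completers agree. The function names and the threshold are illustrative, not taken from the paper.

```python
from typing import List, Optional, Sequence


def binarize(score: float, threshold: float) -> int:
    """Turn a completer's step score into a hard label (1 = correct step)."""
    return 1 if score >= threshold else 0


def consistent_step_labels(weak_scores: Sequence[float],
                           strong_scores: Sequence[float],
                           threshold: float = 0.5) -> List[Optional[int]]:
    """Keep a step label only where the weak and strong completers agree.

    Returns one entry per step: 1 or 0 when both completers give the same
    hard label, and None when they disagree (the step is left unlabeled).
    """
    if len(weak_scores) != len(strong_scores):
        raise ValueError("both completers must score the same steps")
    labels: List[Optional[int]] = []
    for weak, strong in zip(weak_scores, strong_scores):
        weak_label = binarize(weak, threshold)
        strong_label = binarize(strong, threshold)
        labels.append(weak_label if weak_label == strong_label else None)
    return labels
```

Under this reading, disagreement is treated as a sign of label noise, so filtered steps are dropped rather than relabeled; whether the paper drops or reweights them is not stated in the abstract.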

[125] Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling cs.LG | cs.CLPDF

Tim Z. Xiao, Johannes Zenn, Zhen Liu, Weiyang Liu, Robert Bamler

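TL;DR: This paper proposes Verbalized Rejection Sampling (VRS), a natural-language method for reducing the sampling bias of large language models (LLMs), studied for Bernoulli distributions.

Details

Motivation: LLMs can describe probability distributions accurately but struggle to generate faithful samples from them, which limits their use in tasks that need reliable randomness.

Result: Experiments show that VRS substantially reduces sampling bias, and theoretical analysis supports its effectiveness.

Insight: Classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without access to model internals.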

Abstract: Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
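For reference, the classical rejection sampling that VRS adapts can be sketched as follows. This is the textbook algorithm driven by a pseudo-random generator, not the verbalized variant: in VRS the LLM itself reasons about accepting or rejecting a proposal, which this sketch does not model. The biased proposal coin stands in for a sampler whose heads probability is off target; the function names are illustrative.

```python
import random


def bernoulli_pmf(x: int, p: float) -> float:
    """Probability of outcome x (0 or 1) under Bernoulli(p)."""
    return p if x == 1 else 1.0 - p


def rejection_sample(target_p: float, proposal_p: float, rng: random.Random) -> int:
    """Draw one sample from Bernoulli(target_p) using a Bernoulli(proposal_p) proposal.

    Requires 0 < proposal_p < 1 so that every outcome can be proposed.
    """
    # Envelope constant M with target(x) <= M * proposal(x) for x in {0, 1}.
    m = max(target_p / proposal_p, (1.0 - target_p) / (1.0 - proposal_p))
    while True:
        x = 1 if rng.random() < proposal_p else 0
        accept_prob = bernoulli_pmf(x, target_p) / (m * bernoulli_pmf(x, proposal_p))
        if rng.random() < accept_prob:
            return x
```

With a target of 0.3 and a proposal of 0.7, accepted samples follow the target distribution even though the proposal is strongly biased in the opposite direction; the cost is that some proposals are discarded.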

[126] MultiNet: An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models cs.LG | cs.CVPDF

Pranav Guruprasad, Yangyue Wang, Harshvardhan Sikka

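TL;DR: MultiNet is an open-source software toolkit and benchmark suite for evaluating and adapting multimodal action models across the vision, language, and action domains.

Details

Motivation: Multimodal action models are promising for general-purpose agentic systems, but standardized evaluation tools and datasets are lacking.

Result: MultiNet has been used in downstream research that revealed limitations in the generalization of vision-language-action models.

Insight: Evaluating multimodal models requires standardized cross-domain tools and rich datasets to drive further research.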

Abstract: Recent innovations in multimodal action models represent a promising direction for developing general-purpose agentic systems, combining visual understanding, language comprehension, and action generation. We introduce MultiNet - a novel, fully open-source benchmark and surrounding software ecosystem designed to rigorously evaluate and adapt models across vision, language, and action domains. We establish standardized evaluation protocols for assessing vision-language models (VLMs) and vision-language-action models (VLAs), and provide open source software to download relevant data, models, and evaluations. Additionally, we provide a composite dataset with over 1.3 trillion tokens of image captioning, visual question answering, commonsense reasoning, robotic control, digital game-play, simulated locomotion/manipulation, and many more tasks. The MultiNet benchmark, framework, toolkit, and evaluation harness have been used in downstream research on the limitations of VLA generalization.

[127] LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization cs.LG | cs.AI | cs.CVPDF

Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen

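TL;DR: The paper proposes LPO, a new method that improves the interaction accuracy of GUI agents by optimizing location preferences. Using information entropy and a dynamic location reward function, it significantly improves interaction precision and achieves SOTA results in experiments.

Details

Motivation: Current GUI agents rely mainly on SFT for spatial localization, but these methods are limited in perceiving positional data, while approaches such as reinforcement learning fail to assess positional accuracy effectively. A more effective solution is needed.

Result: Experiments show that LPO reaches SOTA performance on both offline benchmarks and online evaluations.

Insight: Optimizing with location data is critical to the interaction precision of GUI agents, and combining information entropy with a dynamic reward mechanism can significantly improve task completion quality.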

Abstract: The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. Besides, it further introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO’s superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at https://github.com/AIDC-AI/LPO.

[128] FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models cs.LG | cs.CVPDF

Weiying Zheng, Ziyue Lin, Pengxin Guo, Yuyin Zhou, Feifei Wang

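TL;DR: The paper introduces FedVLMBench, the first systematic benchmark for federated fine-tuning of vision-language models (VLMs). It covers multiple architectures, strategies, algorithms, and datasets, and reports key findings such as how data heterogeneity and task type affect FL methods.

Details

Motivation: Most existing VLM fine-tuning methods rely on centralized training, which is unsuitable for domains with strict privacy requirements. Federated learning (FL) has been introduced, but there is no systematic benchmark for evaluating it comprehensively.

Result: For encoder-based VLMs in FL, a 2-layer MLP connector with concurrent tuning of the connector and the LLM works best; FL methods are more sensitive to data heterogeneity in vision-centric tasks than in text-centric ones.

Insight: Data heterogeneity and task type significantly affect the performance of FL methods; the benchmark provides empirical guidance for privacy-preserving training of multimodal foundation models.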

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in cross-modal understanding and generation by integrating visual and textual information. While instruction tuning and parameter-efficient fine-tuning methods have substantially improved the generalization of VLMs, most existing approaches rely on centralized training, posing challenges for deployment in domains with strict privacy requirements like healthcare. Recent efforts have introduced Federated Learning (FL) into VLM fine-tuning to address these privacy concerns, yet comprehensive benchmarks for evaluating federated fine-tuning strategies, model architectures, and task generalization remain lacking. In this work, we present FedVLMBench, the first systematic benchmark for federated fine-tuning of VLMs. FedVLMBench integrates two mainstream VLM architectures (encoder-based and encoder-free), four fine-tuning strategies, five FL algorithms, six multimodal datasets spanning four cross-domain single-task scenarios and two cross-domain multitask settings, covering four distinct downstream task categories. Through extensive experiments, we uncover key insights into the interplay between VLM architectures, fine-tuning strategies, data heterogeneity, and multi-task federated optimization. Notably, we find that a 2-layer multilayer perceptron (MLP) connector with concurrent connector and LLM tuning emerges as the optimal configuration for encoder-based VLMs in FL. Furthermore, current FL methods exhibit significantly higher sensitivity to data heterogeneity in vision-centric tasks than text-centric ones, across both encoder-free and encoder-based VLM architectures. Our benchmark provides essential tools, datasets, and empirical guidance for the research community, offering a standardized platform to advance privacy-preserving, federated training of multimodal foundation models.

eess.IV [Back]

[129] Exploring Image Transforms derived from Eye Gaze Variables for Progressive Autism Diagnosis eess.IV | cs.AI | cs.CV | cs.HC | cs.LGPDF

Abigail Copiaco, Christian Ritz, Yassine Himeur, Valsamma Eapen, Ammar Albanna

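TL;DR: The paper proposes an AI-based assistive technology that applies transfer learning to image transforms derived from eye gaze variables to diagnose Autism Spectrum Disorder (ASD) efficiently, aiming to streamline diagnosis while preserving user privacy.

Details

Motivation: ASD prevalence is rising rapidly, while existing diagnostic methods are time-consuming and costly. A more convenient and efficient technique is needed to improve diagnostic efficiency and user experience.

Result: The method enables efficient, privacy-preserving ASD diagnosis and offers a convenient option for families and healthcare systems.

Insight: 1. Image transforms can improve diagnostic efficiency while preserving privacy. 2. Transfer learning can make effective use of limited medical data. 3. In-home diagnosis may become a future direction for medical assistive technology.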

Abstract: The prevalence of Autism Spectrum Disorder (ASD) has surged rapidly over the past decade, posing significant challenges in communication, behavior, and focus for affected individuals. Current diagnostic techniques, though effective, are time-intensive, leading to high social and economic costs. This work introduces an AI-powered assistive technology designed to streamline ASD diagnosis and management, enhancing convenience for individuals with ASD and efficiency for caregivers and therapists. The system integrates transfer learning with image transforms derived from eye gaze variables to diagnose ASD. This facilitates and opens opportunities for in-home periodical diagnosis, reducing stress for individuals and caregivers, while also preserving user privacy through the use of image transforms. The accessibility of the proposed method also offers opportunities for improved communication between guardians and therapists, ensuring regular updates on progress and evolving support needs. Overall, the approach proposed in this work ensures timely, accessible diagnosis while protecting the subjects’ privacy, improving outcomes for individuals with ASD.

[130] Foundation Models in Medical Imaging – A Review and Outlook eess.IV | cs.AI | cs.CVPDF

Vivien van Veldhuizen, Vanessa Botha, Chunyao Lu, Melis Erdal Cesur, Kevin Groot Lipman

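TL;DR: This review examines the development and application of foundation models (FMs) in medical imaging, covering pathology, radiology, and ophthalmology. It draws on over 150 studies, explains core components, self-supervised learning methods, and downstream adaptation strategies, and outlines future research directions.

Details

Motivation: Medical image analysis usually depends on large amounts of annotated data. Foundation models learn general-purpose visual features from large-scale unlabeled data, reducing the need for annotation and opening new possibilities for medical image analysis.

Result: The reviewed studies indicate that foundation models perform well in medical imaging domains such as pathology, radiology, and ophthalmology and can substantially reduce reliance on annotated data.

Insight: Foundation models provide new tools for medical image analysis, but challenges remain in model generalization, data privacy, and computational resources.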

Abstract: Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.

[131] Low-Rank Augmented Implicit Neural Representation for Unsupervised High-Dimensional Quantitative MRI Reconstruction eess.IV | cs.CV | cs.LGPDF

Haonan Zhang, Guoyan Lao, Yuyao Zhang, Hongjiang Wei

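TL;DR: This paper proposes LoREIN, an unsupervised dual-prior-integrated framework for accelerated 3D multi-parametric quantitative MRI reconstruction that combines a low-rank prior and a continuity prior to improve reconstruction accuracy.

Details

Motivation: Current reconstruction methods usually rely on a single prior or physics-informed model to solve a highly ill-posed inverse problem, which leads to suboptimal results. The paper aims to improve reconstruction quality by combining two priors (low-rank and continuity).

Result: LoREIN reconstructs weighted images with high fidelity and uses the structural and quantitative information in multi-contrast weighted images to improve the reconstruction accuracy of quantitative parameter maps.

Insight: Combining low-rank and continuity priors shows promise for high-dimensional medical image reconstruction, and the zero-shot learning paradigm may extend to other complex image reconstruction tasks.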

Abstract: Quantitative magnetic resonance imaging (qMRI) provides tissue-specific parameters vital for clinical diagnosis. Although simultaneous multi-parametric qMRI (MP-qMRI) technologies enhance imaging efficiency, robustly reconstructing qMRI from highly undersampled, high-dimensional measurements remains a significant challenge. This difficulty arises primarily because current reconstruction methods rely solely on a single prior or physics-informed model to solve the highly ill-posed inverse problem, which often leads to suboptimal results. To overcome this limitation, we propose LoREIN, a novel unsupervised and dual-prior-integrated framework for accelerated 3D MP-qMRI reconstruction. Technically, LoREIN incorporates both a low-rank prior and a continuity prior via low-rank representation (LRR) and implicit neural representation (INR), respectively, to enhance reconstruction fidelity. The powerful continuous representation of INR enables the estimation of optimal spatial bases within the low-rank subspace, facilitating high-fidelity reconstruction of weighted images. Simultaneously, the predicted multi-contrast weighted images provide essential structural and quantitative guidance, further enhancing the reconstruction accuracy of quantitative parameter maps. Furthermore, our work introduces a zero-shot learning paradigm with broad potential in complex spatiotemporal and high-dimensional image reconstruction tasks, further advancing the field of medical imaging.

[132] The RSNA Lumbar Degenerative Imaging Spine Classification (LumbarDISC) Dataset eess.IV | cs.CVPDF

Tyler J. Richards, Adam E. Flanders, Errol Colak, Luciano M. Prevedello, Robyn L. Ball

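TL;DR: RSNA LumbarDISC is the largest publicly available dataset of adult lumbar spine MRI examinations annotated for degenerative changes. It contains 8,593 image series from 2,697 patients at 8 institutions and is free for non-commercial use.

Details

Motivation: Public datasets for research on lumbar degenerative disease are scarce, which hinders progress in machine learning and imaging analysis research.

Result: The dataset has been released and was used in the RSNA 2024 competition, advancing the application of deep learning models to grading lumbar degeneration.

Insight: The dataset fills a data gap in lumbar degeneration research and provides a resource for improving clinical efficiency and patient care.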

Abstract: The Radiological Society of North America (RSNA) Lumbar Degenerative Imaging Spine Classification (LumbarDISC) dataset is the largest publicly available dataset of adult MRI lumbar spine examinations annotated for degenerative changes. The dataset includes 2,697 patients with a total of 8,593 image series from 8 institutions across 6 countries and 5 continents. The dataset is available for free for non-commercial use via Kaggle and RSNA Medical Imaging Resource of AI (MIRA). The dataset was created for the RSNA 2024 Lumbar Spine Degenerative Classification competition where competitors developed deep learning models to grade degenerative changes in the lumbar spine. The degree of spinal canal, subarticular recess, and neural foraminal stenosis was graded at each intervertebral disc level in the lumbar spine. The images were annotated by expert volunteer neuroradiologists and musculoskeletal radiologists from the RSNA, American Society of Neuroradiology, and the American Society of Spine Radiology. This dataset aims to facilitate research and development in machine learning and lumbar spine imaging to lead to improved patient care and clinical efficiency.

[133] Sampling Theory for Super-Resolution with Implicit Neural Representations eess.IV | cs.CVPDF

Mahrokh Najaf, Gregory Ongie

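TL;DR: The paper studies the sampling theory of recovering continuous-domain images from low-pass Fourier samples using implicit neural representations (INRs). It relates the non-convex parameter-space optimization problem to a convex penalty over an infinite-dimensional space and verifies that exact recovery is achievable.

Details

Motivation: INRs have shown strong potential for inverse problems in computer vision and computational imaging, but their sample complexity is poorly understood, especially for linear inverse problems. The paper aims to fill this gap.

Result: The theory identifies conditions under which an INR achieves exact recovery, and experiments assess low-width INRs on continuous-domain super-resolution recovery.

Insight: The paper highlights the potential of INRs for inverse problems; in particular, viewing the non-convex optimization problem through an infinite-dimensional space offers a new angle for theoretical analysis.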

Abstract: Implicit neural representations (INRs) have emerged as a powerful tool for solving inverse problems in computer vision and computational imaging. INRs represent images as continuous domain functions realized by a neural network taking spatial coordinates as inputs. However, unlike traditional pixel representations, little is known about the sample complexity of estimating images using INRs in the context of linear inverse problems. Towards this end, we study the sampling requirements for recovery of a continuous domain image from its low-pass Fourier samples by fitting a single hidden-layer INR with ReLU activation and a Fourier features layer using a generalized form of weight decay regularization. Our key insight is to relate minimizers of this non-convex parameter space optimization problem to minimizers of a convex penalty defined over an infinite-dimensional space of measures. We identify a sufficient number of Fourier samples for which an image realized by an INR is exactly recoverable by solving the INR training problem. To validate our theory, we empirically assess the probability of achieving exact recovery of images realized by low-width single hidden-layer INRs, and illustrate the performance of INRs on super-resolution recovery of continuous domain phantom images.
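To make the model class concrete, here is a minimal sketch of a single hidden-layer INR with a Fourier features layer and ReLU activation, as named in the abstract. The specific feature map (integer frequencies up to a hypothetical `max_freq`, with cosine and sine channels), the random Gaussian initialization, and the function names are assumptions for illustration; the paper's exact parameterization and its weight-decay regularizer are not reproduced here.

```python
import math
import random
from typing import List, Tuple

Params = Tuple[List[List[float]], List[float]]


def fourier_features(x: float, y: float, max_freq: int) -> List[float]:
    """Map a 2-D coordinate to cos/sin features at integer frequencies."""
    feats: List[float] = []
    for kx in range(-max_freq, max_freq + 1):
        for ky in range(-max_freq, max_freq + 1):
            phase = 2.0 * math.pi * (kx * x + ky * y)
            feats.append(math.cos(phase))
            feats.append(math.sin(phase))
    return feats


def make_inr(width: int, max_freq: int, rng: random.Random) -> Params:
    """Randomly initialize hidden-layer weights and output weights."""
    n_feats = 2 * (2 * max_freq + 1) ** 2
    hidden = [[rng.gauss(0.0, 1.0) for _ in range(n_feats)] for _ in range(width)]
    outer = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return hidden, outer


def inr(params: Params, x: float, y: float, max_freq: int) -> float:
    """Evaluate the image intensity at (x, y): sum_i a_i * ReLU(w_i . features)."""
    hidden, outer = params
    feats = fourier_features(x, y, max_freq)
    total = 0.0
    for a, row in zip(outer, hidden):
        pre_activation = sum(w * f for w, f in zip(row, feats))
        total += a * max(0.0, pre_activation)
    return total
```

No separate bias term is needed in this sketch because the zero-frequency cosine feature is constantly 1. With integer frequencies the represented image is periodic with period 1 in each coordinate, which is the natural setting for recovery from Fourier samples.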

cs.IR [Back]

[134] ThinkQE: Query Expansion via an Evolving Thinking Process cs.IR | cs.CLPDF

Yibin Lei, Tao Shen, Andrew Yates

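TL;DR: ThinkQE proposes a new query expansion framework that combines a thinking process with a corpus-interaction strategy, significantly improving retrieval performance, particularly in diversity and exploration.

Details

Motivation: Existing LLM-based query expansion methods perform well but often focus too narrowly on particular semantics, neglecting diversity and exploration. ThinkQE addresses this through deeper semantic exploration and iterative refinement.

Result: On DL19, DL20, and BRIGHT, ThinkQE performs strongly and outperforms prior approaches, including training-intensive dense retrievers and rerankers.

Insight: Iterative feedback and deep semantic exploration can significantly improve the diversity and retrieval performance of query expansion, especially for applications that require broad exploration.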

Abstract: Effective query expansion for web search benefits from promoting both exploration and result diversity to capture multiple interpretations and facets of a query. While recent LLM-based methods have improved retrieval performance and demonstrate strong domain generalization without additional training, they often generate narrowly focused expansions that overlook these desiderata. We propose ThinkQE, a test-time query expansion framework addressing this limitation through two key components: a thinking-based expansion process that encourages deeper and comprehensive semantic exploration, and a corpus-interaction strategy that iteratively refines expansions using retrieval feedback from the corpus. Experiments on diverse web search benchmarks (DL19, DL20, and BRIGHT) show ThinkQE consistently outperforms prior approaches, including training-intensive dense retrievers and rerankers.

cs.SD [Back]

[135] SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research cs.SD | cs.AI | cs.CL | eess.ASPDF

Ahmed Adel Attia, Jing Liu, Carl Espy-Wilson

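TL;DR: The paper proposes a method for synthesizing classroom noise and speech data with game engines, addressing the scarcity of speech data in education, and produces a dataset called SimClass.

Details

Motivation: Public classroom speech data is scarce and there is no dedicated classroom noise corpus, which limits the development of speech recognition models for education. The paper aims to address this.

Result: Experiments show that SimClass closely approximates real classroom speech, providing a resource for developing robust speech recognition and enhancement models.

Insight: Synthesizing data with game engines is scalable and can extend to other domains, offering a new approach to data scarcity.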

Abstract: The scarcity of large-scale classroom speech data has hindered the development of AI-driven speech models for education. Public classroom datasets remain limited, and the lack of a dedicated classroom noise corpus prevents the use of standard data augmentation techniques. In this paper, we introduce a scalable methodology for synthesizing classroom noise using game engines, a framework that extends to other domains. Using this methodology, we present SimClass, a dataset that includes both a synthesized classroom noise corpus and a simulated classroom speech dataset. The speech data is generated by pairing a public children’s speech corpus with YouTube lecture videos to approximate real classroom interactions in clean conditions. Our experiments on clean and noisy speech demonstrate that SimClass closely approximates real classroom speech, making it a valuable resource for developing robust speech recognition and enhancement models.

[136] Training-Free Voice Conversion with Factorized Optimal Transport cs.SD | cs.CV | cs.LG | eess.ASPDF

Alexander Lobashev, Assel Yermekova, Maria Larchenko

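TL;DR: This paper proposes MKL-VC, a training-free voice conversion method that uses a factorized optimal transport map in WavLM embedding subspaces to achieve high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio.

Details

Motivation: Existing voice conversion pipelines such as kNN-VC perform poorly in cross-lingual settings and when reference audio is short. MKL-VC aims to address this with a factorized optimal transport map, improving content preservation and robustness with short reference audio.

Result: Experiments show that MKL-VC significantly outperforms kNN-VC on LibriSpeech and FLEURS, performing especially well on cross-lingual voice conversion, where it is comparable to FACodec.

Insight: A factorized optimal transport map handles non-uniform variance across dimensions of a high-dimensional embedding space, giving an efficient, training-free approach to voice conversion.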

Abstract: This paper introduces Factorized MKL-VC, a training-free modification for the kNN-VC pipeline. In contrast with the original pipeline, our algorithm performs high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio. MKL-VC replaces kNN regression with a factorized optimal transport map in WavLM embedding subspaces, derived from the Monge-Kantorovich linear solution. Factorization addresses non-uniform variance across dimensions, ensuring effective feature transformation. Experiments on LibriSpeech and FLEURS datasets show MKL-VC significantly improves content preservation and robustness with short reference audio, outperforming kNN-VC. MKL-VC achieves performance comparable to FACodec, especially in the cross-lingual voice conversion domain.
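For Gaussian source and target distributions, the Monge-Kantorovich linear map has the closed form T(x) = mu_t + A (x - mu_s), with A = S_s^(-1/2) (S_s^(1/2) S_t S_s^(1/2))^(1/2) S_s^(-1/2), where S_s and S_t are the source and target covariances. The sketch below covers only the simplest special case, diagonal covariances, where A reduces to a per-dimension ratio of standard deviations. Treating every dimension as its own one-dimensional subspace is a simplifying assumption for illustration; how the paper actually factorizes WavLM embeddings into subspaces is not given in the abstract, and the function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

MapParams = List[Tuple[float, float, float]]


def fit_diagonal_mk_map(source: Sequence[Sequence[float]],
                        target: Sequence[Sequence[float]],
                        eps: float = 1e-8) -> MapParams:
    """Fit per-dimension (source mean, target mean, scale) from two feature sets.

    Each row is one feature vector; both sets must share the same dimension.
    `eps` guards against division by zero for constant source dimensions.
    """
    dims = len(source[0])
    params: MapParams = []
    for d in range(dims):
        src = [row[d] for row in source]
        tgt = [row[d] for row in target]
        scale = pstdev(tgt) / (pstdev(src) + eps)
        params.append((fmean(src), fmean(tgt), scale))
    return params


def apply_map(params: MapParams, x: Sequence[float]) -> List[float]:
    """Transport one source vector: mu_t + (sigma_t / sigma_s) * (x - mu_s)."""
    return [mu_t + scale * (value - mu_s)
            for value, (mu_s, mu_t, scale) in zip(x, params)]
```

Because each dimension is rescaled by its own variance ratio, dimensions with very different spreads are matched independently, which is one way to see why factorization helps with non-uniform variance across dimensions.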

cs.AI [Back]

[137] Ming-Omni: A Unified Multimodal Model for Perception and Generation cs.AI | cs.CL | cs.CV | cs.LG | cs.SD | eess.ASPDF

Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou

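TL;DR: Ming-Omni is a unified multimodal model that processes images, text, audio, and video and is proficient at speech and image generation. It uses dedicated encoders and an MoE architecture for efficient multimodal processing and fusion, and supports audio and image generation.

Details

Motivation: Current multimodal models usually need separate task-specific models or structural redesign, which limits flexibility and efficiency. Ming-Omni aims to provide a unified framework supporting perception and generation across modalities.

Result: Experiments show Ming-Omni performs well on perception and generation tasks, supports tasks such as context-aware chatting, text-to-speech, and image editing, and matches GPT-4o in modality support.

Insight: A unified multimodal model reduces the need for task-specific models and improves flexibility and efficiency; the MoE architecture with modality-specific routers is key to efficient multimodal processing.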

Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.

[138] Intent Factored Generation: Unleashing the Diversity in Your Language Model cs.AI | cs.CL | cs.LGPDF

Eltayeb Ahmed, Uljad Berdica, Martha Elliott, Danijela Horak, Jakob N. Foerster

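TL;DR: The paper proposes Intent Factored Generation (IFG), which samples a semantically dense intent before the final response to improve the diversity of language model samples while maintaining quality.

Details

Motivation: Current methods for generating diverse samples under a fixed prompt usually operate only at the token level, which leads to poor exploration on reasoning tasks and repetitive, unengaging conversational agents. IFG aims to address this.

Result: Experiments show IFG improves performance on maths and code tasks and on conversational generation, and increases diversity while maintaining generation quality on a general language modelling task.

Insight: By modelling intent explicitly, IFG gives better control over diversity while keeping outputs coherent. The method is simple, easy to integrate, and applicable to many settings.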

Abstract: Obtaining multiple meaningfully diverse, high quality samples from Large Language Models for a fixed prompt remains an open challenge. Current methods for increasing diversity often only operate at the token-level, paraphrasing the same response. This is problematic because it leads to poor exploration on reasoning problems and to unengaging, repetitive conversational agents. To address this we propose Intent Factored Generation (IFG), factorising the sampling process into two stages. First, we sample a semantically dense intent, e.g., a summary or keywords. Second, we sample the final response conditioning on both the original prompt and the intent from the first stage. This allows us to use a higher temperature during the intent step to promote conceptual diversity, and a lower temperature during the final generation to ensure the outputs are coherent and self-consistent. Additionally, we find that prompting the model to explicitly state its intent for each step of the chain-of-thought before generating the step is beneficial for reasoning tasks. We demonstrate our method’s effectiveness across a diverse set of tasks. We show this method improves both pass@k and Reinforcement Learning from Verifier Feedback on maths and code tasks. For instruction-tuning, we combine IFG with Direct Preference Optimisation to increase conversational diversity without sacrificing reward. Finally, we achieve higher diversity while maintaining the quality of generations on a general language modelling task, using a new dataset of reader comments and news articles that we collect and open-source. In summary, we present a simple method of increasing the sample diversity of LLMs while maintaining performance. This method can be implemented by changing the prompt and varying the temperature during generation, making it easy to integrate into many algorithms for gains across various applications.
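The two-stage sampling with separate temperatures can be sketched with a toy categorical model in place of an LLM. The logit tables, the intent and response strings, and the default temperatures below are all invented for illustration; only the structure (sample an intent at a higher temperature, then sample the response conditioned on that intent at a lower temperature) follows the abstract.

```python
import math
import random
from typing import Dict, Tuple

# Hypothetical toy "model": logits over intents, and over responses per intent.
INTENT_LOGITS: Dict[str, float] = {"algebraic": 2.0, "geometric": 1.0, "numeric": 0.0}
RESPONSE_LOGITS: Dict[str, Dict[str, float]] = {
    "algebraic": {"solve the equation symbolically": 2.0, "factor the expression": 0.0},
    "geometric": {"draw the triangle and compare areas": 2.0, "use symmetry": 0.0},
    "numeric": {"try small cases": 2.0, "estimate numerically": 0.0},
}


def sample_with_temperature(logits: Dict[str, float], temperature: float,
                            rng: random.Random) -> str:
    """Sample a key from softmax(logits / temperature)."""
    scaled = {token: logit / temperature for token, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {token: math.exp(s - peak) for token, s in scaled.items()}
    threshold = rng.random() * sum(weights.values())
    running = 0.0
    for token, weight in weights.items():
        running += weight
        if threshold < running:
            return token
    return token  # guard against floating-point round-off


def ifg_sample(rng: random.Random, intent_temperature: float = 1.5,
               response_temperature: float = 0.3) -> Tuple[str, str]:
    """Stage 1: sample an intent (hot). Stage 2: sample a response given it (cold)."""
    intent = sample_with_temperature(INTENT_LOGITS, intent_temperature, rng)
    response = sample_with_temperature(RESPONSE_LOGITS[intent],
                                       response_temperature, rng)
    return intent, response
```

Raising `intent_temperature` spreads probability across intents, which is where conceptual diversity comes from, while the low `response_temperature` keeps each response close to the most likely continuation for its intent.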

[139] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI | cs.CV | cs.LG | cs.ROPDF

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes

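TL;DR: V-JEPA 2 is a self-supervised video model trained on internet-scale video and a small amount of robot interaction data, enabling understanding, prediction, and planning in the physical world.

Details

Motivation: A major challenge for modern AI is learning to understand the world and to act largely by observation. The paper explores a self-supervised approach that combines internet video with a small amount of robot data to develop models that can understand and plan in the physical world.

Result: 1. Motion understanding: 77.3 top-1 accuracy on Something-Something v2. 2. Video question answering: 84.0 on PerceptionTest. 3. Successful zero-shot robotic planning.

Insight: Self-supervised learning on large-scale video data plus a small amount of robot data can efficiently build a general world model that supports applications across tasks and scenarios.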

Abstract: A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

cs.GR [Back]

[140] SILK: Smooth InterpoLation frameworK for motion in-betweening A Simplified Computational Approach cs.GR | cs.CV | cs.LGPDF

Elly Akhoundi, Hung Yu Ling, Anup Anand Deshmukh, Judith Butepage

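TL;DR: This paper proposes SILK, a simple Transformer-based framework for motion in-betweening. Through data modeling choices and a single Transformer encoder it produces high-quality animation, challenging the assumption that model complexity determines animation quality.

Details

Motivation: Motion in-betweening is crucial for animators, but existing methods rely on complex models or multiple training steps. A simpler, effective solution is needed.

Result: Experiments show that increasing data volume and choosing a suitable pose representation improve result quality, and that velocity input features enhance animation performance.

Insight: In-betweening performance depends more on data modeling than on model complexity, which gives animation research a data-centric perspective.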

Abstract: Motion in-betweening is a crucial tool for animators, enabling intricate control over pose-level details in each keyframe. Recent machine learning solutions for motion in-betweening rely on complex models, incorporating skeleton-aware architectures or requiring multiple modules and training steps. In this work, we introduce a simple yet effective Transformer-based framework, employing a single Transformer encoder to synthesize realistic motions for motion in-betweening tasks. We find that data modeling choices play a significant role in improving in-betweening performance. Among others, we show that increasing data volume can yield equivalent or improved motion transitions, that the choice of pose representation is vital for achieving high-quality results, and that incorporating velocity input features enhances animation performance. These findings challenge the assumption that model complexity is the primary determinant of animation quality and provide insights into a more data-centric approach to motion interpolation. Additional videos and supplementary material are available at https://silk-paper.github.io.

[141] VideoMat: Extracting PBR Materials from Video Diffusion Models cs.GR | cs.CVPDF

Jacob Munkberg, Zian Wang, Ruofan Liang, Tianchang Shen, Jon Hasselgren

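TL;DR: VideoMat uses video diffusion models and physically based rendering to generate high-quality materials for 3D models from a text prompt or a single image. It generates multiple views with coherent material properties, then combines intrinsic decomposition and differentiable rendering to output PBR materials compatible with common content creation tools.

Details

Motivation: Existing methods for generating high-quality PBR materials from text or images are limited, particularly in multi-view consistency and physical realism. The paper aims for higher-quality material generation by combining diffusion models with physically based rendering.

Result: The generated materials are reported to surpass existing methods in visual quality and physical consistency, and can be used directly in common 3D content creation tools.

Insight: Video diffusion models show promise for multi-view generation and material consistency, and combining intrinsic decomposition with physically based rendering can further improve the quality and usability of generated materials.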

Abstract: We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high quality materials for 3D models given a text prompt or a single image. We condition a video diffusion model to respect the input geometry and lighting condition. This model produces multiple views of a given 3D model with coherent material properties. Secondly, we use a recent model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use the intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools.

[142] DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos cs.GR | cs.AI | cs.CV | cs.LGPDF

Chieh Hubert Lin, Zhaoyang Lv, Songyin Wu, Zhen Xu, Thu Nguyen-Phuoc

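TL;DR: DGS-LRM is the first feed-forward method that predicts deformable 3D Gaussian splats from monocular video, targeting dynamic scene reconstruction.

Details

Motivation: Real-time reconstruction of dynamic scenes is held back by scarce training data and a lack of suitable 3D representations. The work aims to close this gap.

Result: Its dynamic reconstruction quality is comparable to optimization-based methods, and it performs well on long-range 3D tracking.

Insight: A deformable 3D Gaussian representation and a training paradigm driven by synthetic data offer a new approach to dynamic scene reconstruction.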

Abstract: We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.

cs.RO [Back]

[143] UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation cs.RO | cs.AI | cs.CVPDF

Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan

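TL;DR: The paper proposes UAD, an unsupervised method that extracts object affordance knowledge from foundation models and distills it into a task-conditioned affordance model without manual annotation, generalizing broadly under open-ended task instructions.

Details

Motivation: Existing visual affordance prediction methods rely on manual annotation or a predefined set of tasks, which limits their generalization to open-ended tasks and unstructured environments. UAD aims to remove this limitation in an unsupervised way.

Result: Although trained only on simulated data, UAD generalizes well to real-world scenes and human activities; an imitation learning policy generalizes to new object categories and task variations after as few as 10 demonstrations.

Insight: Combining the complementary strengths of foundation models efficiently solves affordance annotation without supervision; a lightweight task-conditioned model performs well in real-world scenes.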

Abstract: Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance predictions often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed <instruction, visual affordance> pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite only being trained on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/

Guanghu Xie, Zhiduo Jiang, Yonglong Zhang, Yang Liu, Zongwu Xie

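TL;DR: The paper proposes DCIRNet, a multimodal depth completion network that addresses missing depth information for transparent and reflective objects and yields a 44% improvement in grasp success rate.

Details

Motivation: The distinctive visual properties of transparent and reflective objects, such as specular reflection and light transmission, make it hard for depth sensors to estimate depth accurately, which degrades downstream tasks.

Result: DCIRNet achieves superior performance on public datasets, and integrating it into dexterous grasping frameworks improves the grasp success rate for transparent and reflective objects by 44%.

Insight: Integrating multimodal data with an iterative refinement strategy markedly improves depth estimation for transparent and reflective objects, which in turn improves robotic grasping.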

Abstract: Transparent and reflective objects in everyday environments pose significant challenges for depth sensors due to their unique visual properties, such as specular reflections and light transmission. These characteristics often lead to incomplete or inaccurate depth estimation, which severely impacts downstream geometry-based vision tasks, including object recognition, scene reconstruction, and robotic manipulation. To address the issue of missing depth information in transparent and reflective objects, we propose DCIRNet, a novel multimodal depth completion network that effectively integrates RGB images and depth maps to enhance depth estimation quality. Our approach incorporates an innovative multimodal feature fusion module designed to extract complementary information between RGB images and incomplete depth maps. Furthermore, we introduce a multi-stage supervision and depth refinement strategy that progressively improves depth completion and effectively mitigates the issue of blurred object boundaries. We integrate our depth completion model into dexterous grasping frameworks and achieve a 44% improvement in the grasp success rate for transparent and reflective objects. We conduct extensive experiments on public datasets, where DCIRNet demonstrates superior performance. The experimental results validate the effectiveness of our approach and confirm its strong generalization capability across various transparent and reflective objects.

[145] From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models cs.RO | cs.CVPDF

Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng

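TL;DR: The paper introduces a unified probing suite of 50 simulation-based tasks to systematically evaluate the generalization of Vision-Language-Action (VLA) models, and finds that VLM pretraining gives models strong perception and high-level planning but unreliable action execution.

Details

Motivation: Current evaluation of VLA models is insufficient: traditional imitation learning benchmarks lack language instructions, and existing VLA benchmarks cover few tasks, making it hard to quantify how much VLM pretraining contributes to the generalization of downstream robot policies.

Result: VLM backbones provide strong perception and planning (the "intentions"), but action execution is unreliable; finetuning on action data can erode the original VLM's generalist reasoning abilities.

Insight: A clear gap separates perception from action execution in VLA models, and closing it requires further research.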

Abstract: One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, “generalist” robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM’s generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/

[146] Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation cs.RO | cs.CV | cs.LGPDF

Wenbo Zhang, Tianrun Hu, Yanyuan Qiao, Hanbo Zhang, Yuchu Qin

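TL;DR: CoA is a new visuo-motor policy paradigm that generates an entire trajectory by backward reasoning from task-specific goals, imposing global-to-local constraints on actions and improving robot performance on simulated and real-world tasks.

Details

Motivation: Conventional methods predict the next action(s) forward and may lack a global view; CoA instead uses backward reasoning and an action-level Chain-of-Thought (CoT) process to generate goal-driven trajectories and improve generalization.

Result: Achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks, with strong spatial generalization.

Insight: Backward reasoning integrates the task goal into action generation more effectively, and the global-to-local structure significantly improves action planning and generalization.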

Abstract: We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict next step action(s) forward, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA gives strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, we observe CoA achieves the state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
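
The goal-first generation order described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: `toy_policy` is a hypothetical stand-in for the learned network, 2D points stand in for real actions, the distance-based stopping rule is only one possible reading of "dynamic stopping", and reversing the sequence for execution is assumed rather than stated in the abstract.

```python
from typing import Callable, List, Tuple

Action = Tuple[float, float]  # toy 2D action; a real policy would use richer actions


def generate_backward(
    keyframe: Action,
    current: Action,
    step_fn: Callable[[List[Action], Action], Action],
    stop_tol: float = 1e-3,
    max_len: int = 100,
) -> List[Action]:
    """Generate a trajectory goal-first and return it in execution order."""
    # The first token is the keyframe action that encodes the task goal.
    reverse_traj: List[Action] = [keyframe]
    while len(reverse_traj) < max_len:
        # Each new action is conditioned on the keyframe and all earlier actions.
        nxt = step_fn(reverse_traj, current)
        reverse_traj.append(nxt)
        # Assumed stopping rule: finish once the chain reaches the current state.
        if max(abs(nxt[0] - current[0]), abs(nxt[1] - current[1])) <= stop_tol:
            break
    # Reverse so the robot would execute from the current state toward the goal.
    return reverse_traj[::-1]


def toy_policy(history: List[Action], current: Action) -> Action:
    """Hypothetical stand-in for the model: step halfway back toward `current`."""
    last = history[-1]
    return ((last[0] + current[0]) / 2, (last[1] + current[1]) / 2)


traj = generate_backward(keyframe=(1.0, 1.0), current=(0.0, 0.0), step_fn=toy_policy)
```

Because generation starts at the keyframe, every later action in this sketch is tied to the goal, which is the global-to-local structure the abstract describes; the trajectory length is not fixed in advance but decided by the stopping rule.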

Table of Contents

cs.CV [Back]

[1] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations cs.CV | cs.AIPDF

[2] Segment Any Architectural Facades (SAAF):An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance cs.CV | cs.AIPDF

[3] VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks cs.CV | cs.AIPDF

[4] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation cs.CV | cs.AI | cs.CLPDF

[5] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models cs.CV | cs.AI | cs.LGPDF

[6] BakuFlow: A Streamlining Semi-Automatic Label Generation Tool cs.CV | cs.AIPDF

[7] Bias Analysis in Unconditional Image Generative Models cs.CV | cs.LGPDF

[8] CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation cs.CV | cs.CLPDF

[9] Seedance 1.0: Exploring the Boundaries of Video Generation Models cs.CVPDF

[10] Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models cs.CVPDF

[11] PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies cs.CV | cs.LGPDF

[12] UFM: A Simple Path towards Unified Dense Correspondence with Flow cs.CV | cs.LG | cs.ROPDF

[13] Lightweight Object Detection Using Quantized YOLOv4-Tiny for Emergency Response in Aerial Imagery cs.CV | cs.LGPDF

[14] Efficient Edge Deployment of Quantized YOLOv4-Tiny for Aerial Emergency Object Detection on Raspberry Pi 5 cs.CVPDF

[15] MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning cs.CVPDF

[16] An Effective End-to-End Solution for Multimodal Action Recognition cs.CVPDF

[17] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation cs.CV | cs.AI | cs.LGPDF

[18] A new approach for image segmentation based on diffeomorphic registration and gradient fields cs.CVPDF

[19] SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing cs.CV | cs.AI | cs.CRPDF

[20] ScaleLSD: Scalable Deep Line Segment Detection Streamlined cs.CVPDF

[21] ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model cs.CVPDF

[22] SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation cs.CV | I.2.6; I.5.1PDF

[23] Synthetic Human Action Video Data Generation with Pose Transfer cs.CV | cs.AIPDF

[24] Noise Conditional Variational Score Distillation cs.CVPDF

[25] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation cs.CV | cs.AIPDF

[26] A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning cs.CVPDF

[27] TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision cs.CV | cs.AIPDF

[28] Harmonizing and Merging Source Models for CLIP-based Domain Generalization cs.CVPDF

[29] Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization cs.CVPDF

[30] Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing cs.CVPDF

[31] Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning cs.CVPDF

[32] Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imageries cs.CVPDF

[33] TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation cs.CVPDF

[34] Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals cs.CV | eess.IVPDF

[35] HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene cs.CVPDF

[36] Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs cs.CV | cs.AI | cs.CLPDF

[37] Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS cs.CV | I.4.5PDF

[38] 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection cs.CVPDF

[39] GLD-Road:A global-local decoding road network extraction model for remote sensing images cs.CVPDF

[40] AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions cs.CV | cs.AIPDF

[41] SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields cs.CVPDF

[42] ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting cs.CVPDF

[43] HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding cs.CV | cs.AIPDF

[44] DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning cs.CV | cs.AIPDF

[45] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios cs.CV | cs.LG | cs.MM | cs.RO | eess.IVPDF

[46] CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain cs.CV | cs.LGPDF

[47] Reasoning Models Are More Easily Gaslighted Than You Think cs.CV | cs.AIPDF

[48] Adding simple structure at inference improves Vision-Language Compositionality cs.CV | cs.CL | cs.LGPDF

[49] Towards Practical Alzheimer’s Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model cs.CV | cs.AIPDF

[50] Non-Contact Health Monitoring During Daily Personal Care Routines cs.CV | cs.AIPDF

[51] Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning cs.CV | cs.AIPDF

[52] Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets cs.CVPDF

[53] Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints cs.CV | cs.ROPDF

[54] Q-SAM2: Accurate Quantization for Segment Anything Model 2 cs.CV | cs.AIPDF

[55] Accurate and efficient zero-shot 6D pose estimation with frozen foundation models cs.CVPDF

[56] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision cs.CVPDF

[57] MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion cs.CVPDF

[58] DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction cs.CV | cs.AIPDF

[59] OctoNav: Towards Generalist Embodied Navigation cs.CV | cs.AI | cs.ROPDF

[60] Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition cs.CV | cs.AIPDF

[61] IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments cs.CVPDF

[62] 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation cs.CV | cs.AIPDF

[63] The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge cs.CVPDF

[64] EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks cs.CVPDF

[65] Only-Style: Stylistic Consistency in Image Generation without Content Leakage cs.CVPDF

[66] MetricHMR: Metric Human Mesh Recovery from Monocular Images cs.CVPDF

[67] Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering cs.CVPDF

[68] Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos cs.CV | cs.AI | cs.CLPDF

[69] LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation cs.CVPDF

[70] CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models cs.CV | cs.AI | I.2.10; I.4.8PDF

[71] UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting cs.CV | cs.AIPDF

[72] Vision Generalist Model: A Survey cs.CV | cs.AIPDF

[73] Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy cs.CV | cs.LG | 68T45 (Machine learning), 92C55 (Biomedical imaging and signal

[74] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing cs.CV | cs.AI | I.2PDF

[75] Efficient Part-level 3D Object Generation via Dual Volume Packing cs.CVPDF

[76] ReSim: Reliable World Simulation for Autonomous Driving cs.CV | cs.ROPDF

[77] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions cs.CV | cs.AI | cs.SDPDF

[78] A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs cs.CV | cs.LGPDF