cs.CV [Total: 57]
cs.CL [Total: 33]
cs.MA [Total: 1]
eess.AS [Total: 1]
cs.GR [Total: 1]
eess.IV [Total: 6]
cs.CR [Total: 1]
cs.AI [Total: 3]
physics.optics [Total: 1]
cs.IR [Total: 1]
cs.RO [Total: 3]
cs.SE [Total: 1]
cs.LG [Total: 5]
cs.DC [Total: 1]

cs.CV

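[1] Non-planar Object Detection and Identification by Features Matching and Triangulation Growth cs.CV | cs.AI

Filippo Leveni

TL;DR: The paper proposes a method based on feature matching and triangulation growth for detecting and identifying non-planar objects in scene images; it performs especially well where geometric models do not apply.

Motivation: Traditional methods based on geometric models such as homography cannot handle the deformation or distortion of non-planar objects, so a more flexible feature-matching approach is needed.

Result: Experiments show that the method performs on par with RANSAC-based homography when distortion is small and gives better description performance when deformation is significant.

Insight: By relying on local consistency, the method avoids the limitations of a global model assumption and offers a flexible, robust way to handle non-planar objects.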

Abstract: Object detection and identification is a fundamental topic in computer vision; it plays a crucial role in many applications such as object tracking, industrial robot control, and image retrieval. We propose a feature-based approach for detecting and identifying distorted occurrences of a given template in a scene image by incremental grouping of feature matches between the image and the template. For this purpose, we use the Delaunay triangulation of template features as a guide for this iterative approach. The triangulation is treated as a graph and, starting from a single triangle, neighboring nodes are considered and the corresponding features are identified; the matches related to them are then evaluated to determine whether they are worth grouping. This evaluation is based on local consistency criteria derived from geometric and photometric properties of local features. Our solution allows the identification of the object in situations where geometric models (e.g. homography) do not hold, thus enabling the detection of objects when the template is non-planar or when it is planar but appears distorted in the image. We show that our approach performs as well as or better than homography-based RANSAC in scenarios in which distortion is nearly absent, while when the deformation becomes relevant our method shows better description performance.
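The incremental grouping described in this abstract can be pictured as a breadth-first growth over the triangulation graph. The sketch below is only an illustration under stated assumptions, not the author's implementation: `grow_matches`, the `is_consistent` callback, and the toy `shift` values are hypothetical, and the callback merely stands in for the paper's geometric and photometric consistency criteria, which the abstract does not spell out. Rejecting a feature permanently after one failed test is also a simplification.

```python
from collections import deque


def grow_matches(adjacency, seed_nodes, is_consistent):
    """Grow a group of accepted template features outward from a seed triangle.

    adjacency: dict mapping a template feature id to its neighbouring ids
               (the edges of the template's Delaunay triangulation).
    seed_nodes: the three feature ids of the starting triangle.
    is_consistent: hypothetical callback (candidate, accepted) -> bool that
                   stands in for a local consistency test.
    """
    accepted = set(seed_nodes)
    rejected = set()
    frontier = deque(seed_nodes)
    while frontier:
        node = frontier.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in accepted or neighbour in rejected:
                continue
            if is_consistent(neighbour, accepted):
                accepted.add(neighbour)
                frontier.append(neighbour)
            else:
                # Simplification: a feature that fails once is never retried.
                rejected.add(neighbour)
    return accepted


# Toy triangulation graph with five features (hand-written, symmetric edges).
adjacency = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
# Toy per-feature displacement between template and scene; feature 4 is an outlier.
shift = {0: 1.0, 1: 1.1, 2: 0.9, 3: 1.2, 4: 9.0}


def close_to_accepted_neighbours(candidate, accepted):
    """Accept a candidate whose shift is near the mean of its accepted neighbours."""
    neighbours = [n for n in adjacency[candidate] if n in accepted]
    mean_shift = sum(shift[n] for n in neighbours) / len(neighbours)
    return abs(shift[candidate] - mean_shift) < 0.5


group = grow_matches(adjacency, [0, 1, 2], close_to_accepted_neighbours)
print(sorted(group))
```

Because each candidate is compared only with its already accepted neighbours, no single global transform such as a homography is ever fitted, which is the property the abstract relies on for distorted templates.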

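[2] CDST: Color Disentangled Style Transfer for Universal Style Reference Customization cs.CV

Shiwen Zhang, Zhuowei Chen, Lang Chen, Yanze Wu

TL;DR: CDST proposes a novel two-stream style transfer training paradigm that fully separates color from style, enabling universal style transfer without tuning.

Motivation: To resolve the entanglement of color and style in style transfer and to achieve high-quality universal style transfer without tuning.

Result: Qualitative and quantitative experiments and human evaluations show that CDST achieves state-of-the-art performance on a variety of style transfer tasks.

Insight: Disentangling color from style is key to universal style transfer, while embedding compression and a new style definition further improve performance.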

Abstract: We introduce Color Disentangled Style Transfer (CDST), a novel and efficient two-stream style transfer training paradigm which completely isolates color from style and forces the style stream to be color-blinded. With a single model, CDST unlocks universal style transfer capabilities in a tuning-free manner during inference. In particular, characteristics-preserved style transfer with style and content references is solved in a tuning-free way for the first time. CDST significantly improves the style similarity by multi-feature image embeddings compression and preserves strong editing capability via our new CDST style definition inspired by Diffusion UNet disentanglement law. By conducting thorough qualitative and quantitative experiments and human evaluations, we demonstrate that CDST achieves state-of-the-art results on various style transfer tasks.

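[3] Hidden Bias in the Machine: Stereotypes in Text-to-Image Models cs.CV | cs.AI | cs.CY | cs.LG

Sedat Porikli, Vedat Porikli

TL;DR: The paper studies hidden bias in text-to-image (T2I) models and shows that these models replicate and amplify societal stereotypes when generating images, particularly with respect to gender, race, and age.

Motivation: As T2I models are widely adopted, their potential to reproduce and reinforce societal bias has drawn broad concern. The paper aims to expose these biases through systematic analysis.

Result: The analysis shows significant disparities in generated images across gender, race, age, and other factors, which often align with societal stereotypes and can even intensify them.

Insight: The paper stresses the need for more inclusive datasets and development practices in generative visual systems to reduce bias and promote fairness.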

Abstract: Text-to-Image (T2I) models have transformed visual content creation, producing highly realistic images from natural language prompts. However, concerns persist around their potential to replicate and magnify existing societal biases. To investigate these issues, we curated a diverse set of prompts spanning thematic categories such as occupations, traits, actions, ideologies, emotions, family roles, place descriptions, spirituality, and life events. For each of the 160 unique topics, we crafted multiple prompt variations to reflect a wide range of meanings and perspectives. Using Stable Diffusion 1.5 (UNet-based) and Flux-1 (DiT-based) models with original checkpoints, we generated over 16,000 images under consistent settings. Additionally, we collected 8,000 comparison images from Google Image Search. All outputs were filtered to exclude abstract, distorted, or nonsensical results. Our analysis reveals significant disparities in the representation of gender, race, age, somatotype, and other human-centric factors across generated images. These disparities often mirror and reinforce harmful stereotypes embedded in societal narratives. We discuss the implications of these findings and emphasize the need for more inclusive datasets and development practices to foster fairness in generative visual systems.

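[4] Fake it till You Make it: Reward Modeling as Discriminative Prediction cs.CV | cs.AI | cs.LG

Runtao Liu, Jiahao Zhan, Yingqing He, Chen Wei, Alan Yuille

TL;DR: The paper proposes GAN-RM, an efficient reward modeling framework that uses adversarial training to remove the reliance of conventional methods on large amounts of human annotation or elaborate quality-dimension design; a small set of target samples is enough to train an effective reward model.

Motivation: Conventional reward modeling relies on large amounts of human-annotated preference data or intricately designed quality dimensions, which makes it complex to implement and incomplete. Inspired by adversarial training in GANs, the paper aims for a more efficient reward modeling method.

Result: Experiments show that GAN-RM performs well in several key applications, including Best-of-N sample filtering, SFT, and DPO, supporting its effectiveness.

Insight: Adversarial training can efficiently replace reward modeling that depends on heavy annotation or manual design, simplifying the use of reinforcement learning for visual generative models.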

Abstract: An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate our GAN-RM’s effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering and post-training approaches like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
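Best-of-N sample filtering, mentioned in this abstract as one use of the reward model, amounts to scoring N candidates and keeping the top ones. The following is a minimal sketch under assumptions: `best_of_n` and `toy_reward` are hypothetical names, and the toy scoring function only stands in for a trained discriminator-style reward model such as the one the paper describes.

```python
def best_of_n(candidates, reward_fn, keep=1):
    """Return the `keep` highest-scoring candidates according to reward_fn."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[:keep]


def toy_reward(sample):
    """Hypothetical stand-in for a reward model: prefers values close to 0.8."""
    return -abs(sample - 0.8)


# Each number plays the role of one generated sample.
samples = [0.1, 0.75, 0.4, 0.95]
top_one = best_of_n(samples, toy_reward)
top_two = best_of_n(samples, toy_reward, keep=2)
print(top_one, top_two)
```

The generator itself is untouched here; only the selection step changes, which is why this kind of filtering is described as test-time scaling.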

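[5] DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding cs.CV

Thomas Kreutz, Max Mühlhäuser, Alejandro Sanchez Guinea

TL;DR: DeSPITE proposes a multi-modal contrastive learning framework that jointly embeds LiDAR point clouds, human skeleton poses, IMU data, and text, enabling new human activity understanding tasks.

Motivation: LiDAR is a privacy-preserving alternative to RGB cameras, but its use in multi-modal contrastive pre-training remains underexplored. The paper aims to close this gap with a joint embedding approach.

Result: Experiments show that DeSPITE performs well on point cloud human activity recognition (MSR-Action3D and HMPEAR) and supports cross-modal retrieval.

Insight: Multi-modal joint embedding can substantially improve human activity understanding from point clouds, especially when labeled data is scarce.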

Abstract: Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities through noise contrastive estimation. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton<->Pointcloud<->IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments in MSR-Action3D and HMPEAR.
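To make the idea of a contrastively learned joint embedding space concrete, here is a small sketch of an InfoNCE-style loss between two modalities. This is an assumption on my part: the abstract says only "noise contrastive estimation" and does not give the exact objective, temperature, or batching, so `info_nce`, `cosine`, and the toy vectors are illustrative names and values rather than the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where positives[i] is the true match for anchors[i].

    All other entries of `positives` act as negatives for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        max_logit = max(logits)
        # Numerically stable log-sum-exp over all candidates.
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: e.g. point cloud features as anchors, skeleton features as positives.
pointcloud = [[1.0, 0.0], [0.0, 1.0]]
skeleton_aligned = [[0.9, 0.1], [0.1, 0.9]]
skeleton_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(pointcloud, skeleton_aligned)
loss_shuffled = info_nce(pointcloud, skeleton_shuffled)
print(loss_aligned, loss_shuffled)
```

Correctly paired embeddings give a loss near zero, while mismatched pairs are penalised heavily; minimising this quantity over pairs of modalities is one common way to pull matching samples together in a shared space.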

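[6] Intelligent Image Sensing for Crime Analysis: A ML Approach towards Enhanced Violence Detection and Investigation cs.CV | cs.AI

Aritra Dutta, Pushpita Boral, G Suseela

TL;DR: The paper proposes a machine-learning-based intelligent image sensing framework for detecting and classifying violent events in real time, combining 3D convolutional neural networks with a bidirectional LSTM to improve computational efficiency and accuracy.

Motivation: Crime rates are rising globally, and traditional surveillance methods are limited in detecting violent events promptly, so automated violence detection is urgently needed.

Result: Experimental results show that the method performs well in terms of computational resource efficiency and accuracy.

Insight: Combining a 3D CNN with a temporal model such as a bidirectional LSTM can effectively capture the spatio-temporal features of violent events, offering a feasible approach to real-time monitoring.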

Abstract: The increasing global crime rate, coupled with substantial human and property losses, highlights the limitations of traditional surveillance methods in promptly detecting diverse and unexpected acts of violence. Addressing this pressing need for automatic violence detection, we leverage Machine Learning to detect and categorize violent events in video streams. This paper introduces a comprehensive framework for violence detection and classification, employing Supervised Learning for both binary and multi-class violence classification. The detection model relies on 3D Convolutional Neural Networks, while the classification model utilizes the separable convolutional 3D model for feature extraction and bidirectional LSTM for temporal processing. Training is conducted on diverse customized datasets with frame-level annotations, incorporating videos from surveillance cameras, human recordings, and the hockey fight, sohas and wvd datasets across various platforms. Additionally, a camera module integrated with a Raspberry Pi is used to capture a live video feed, which is sent to the ML model for processing. The framework thus demonstrates improved performance in terms of computational resource efficiency and accuracy.

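[7] HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment cs.CV | cs.AI

Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais

TL;DR: HierVL proposes a semi-supervised semantic segmentation framework that exploits vision-language synergy; through dynamic text-spatial query alignment and a hierarchical semantic query generator, it significantly improves segmentation under sparse annotation.

Motivation: Semi-supervised semantic segmentation performs poorly under label scarcity and domain shift. Vision-only methods struggle to distinguish similar classes and to generalize to new domains, while vision-language models have robust semantics but lack spatial grounding.

Result: Achieves state-of-the-art performance on COCO (+4.4% mIoU), Pascal VOC (+3.1%), ADE20 (+5.9%), and Cityscapes (+1.8%).

Insight: Language-guided segmentation significantly improves label efficiency and enables fine-grained, instance-aware generalization.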

Abstract: Semi-supervised semantic segmentation remains challenging under severe label scarcity and domain variability. Vision-only methods often struggle to generalize, resulting in pixel misclassification between similar classes, poor generalization and boundary localization. Vision-Language Models offer robust, domain-invariant semantics but lack the spatial grounding required for dense prediction. We introduce HierVL, a unified framework that bridges this gap by integrating abstract text embeddings into a mask-transformer architecture tailored for semi-supervised segmentation. HierVL features three novel components: a Hierarchical Semantic Query Generator that filters and projects abstract class embeddings into multi-scale queries to suppress irrelevant classes and handle intra-class variability; a Cross-Modal Spatial Alignment Module that aligns semantic queries with pixel features for sharper boundaries under sparse supervision; and a Dual-Query Transformer Decoder that fuses semantic and instance-level queries to prevent instance collapse. We also introduce targeted regularization losses that maintain vision-language alignment throughout training to reinforce semantic grounding. HierVL establishes a new state-of-the-art by achieving a +4.4% mean improvement of the intersection over the union on COCO (with 232 labeled images), +3.1% on Pascal VOC (with 92 labels), +5.9% on ADE20 (with 158 labels) and +1.8% on Cityscapes (with 100 labels), demonstrating better performance under 1% supervision on four benchmark datasets. Our results show that language-guided segmentation closes the label efficiency gap and unlocks new levels of fine-grained, instance-aware generalization.

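[8] Mapping Farmed Landscapes from Remote Sensing cs.CV | cs.LG

Michelangelo Conserva, Alex Wilson, Charlotte Stanton, Vishal Batchu, Varun Gulshan

TL;DR: The paper presents Farmscapes, which uses a deep learning segmentation model to produce a high-resolution map of farmed landscapes covering most of England, including ecologically important features such as hedgerows, woodlands, and stone walls.

Motivation: Precise management of agricultural landscapes is critical to global biodiversity targets, but large-scale, high-resolution ecological maps are lacking.

Result: The model performs well, with F1-scores of 96% for woodland, 95% for farmed land, and 72% for hedgerow segmentation.

Insight: The tool gives ecologists and policymakers data-driven support for habitat restoration planning and can be used to monitor initiatives such as the EU Biodiversity Strategy.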

Abstract: Effective management of agricultural landscapes is critical for meeting global biodiversity targets, but efforts are hampered by the absence of detailed, large-scale ecological maps. To address this, we introduce Farmscapes, the first large-scale (covering most of England), high-resolution (25cm) map of rural landscape features, including ecologically vital elements like hedgerows, woodlands, and stone walls. This map was generated using a deep learning segmentation model trained on a novel dataset of 942 manually annotated tiles derived from aerial imagery. Our model accurately identifies key habitats, achieving high F1-scores for woodland (96%) and farmed land (95%), and demonstrates strong capability in segmenting linear features, with an F1-score of 72% for hedgerows. By releasing the England-wide map on Google Earth Engine, we provide a powerful, open-access tool for ecologists and policymakers. This work enables data-driven planning for habitat restoration, supports the monitoring of initiatives like the EU Biodiversity Strategy, and lays the foundation for advanced analysis of landscape connectivity.

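[9] FindMeIfYouCan: Bringing Open Set metrics to $\textit{near}$, $\textit{far}$ and $\textit{farther}$ Out-of-Distribution Object Detection cs.CV

Daniel Montoya, Aymen Bouguerra, Alexandra Gomez-Villa, Fabio Arnez

TL;DR: The paper examines out-of-distribution object detection (OOD-OD) from an open-set perspective, points out that the existing evaluation protocol overlooks overlap between unknown objects and the known distribution, and builds new evaluation splits (near, far, farther) based on semantic similarity.

Motivation: Existing object detection methods rest on a closed-set assumption and cannot handle unknown objects effectively, yet safety-critical domains such as autonomous driving and medical imaging need this capability.

Result: Experiments show that OOD objects that are semantically and visually close are easier to localize but more easily confused with known classes, while far OOD objects are harder to localize but less often misclassified.

Insight: Semantic similarity is a key factor in OOD detection performance, and evaluation protocols need to better reflect the diversity of real-world scenarios.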

Abstract: State-of-the-art Object Detection (OD) methods predominantly operate under a closed-world assumption, where test-time categories match those encountered during training. However, detecting and localizing unknown objects is crucial for safety-critical applications in domains such as autonomous driving and medical imaging. Recently, Out-Of-Distribution (OOD) detection has emerged as a vital research direction for OD, focusing on identifying incorrect predictions typically associated with unknown objects. This paper shows that the current evaluation protocol for OOD-OD violates the assumption of non-overlapping objects with respect to the In-Distribution (ID) datasets, and obscures crucial situations such as ignoring unknown objects, potentially leading to overconfidence in deployment scenarios where truly novel objects might be encountered. To address these limitations, we manually curate and enrich the existing benchmark by exploiting semantic similarity to create new evaluation splits categorized as $\textit{near}$, $\textit{far}$, and $\textit{farther}$ from ID distributions. Additionally, we incorporate established metrics from the Open Set community, providing deeper insights into how effectively methods detect unknowns, when they ignore them, and when they mistakenly classify OOD objects as ID. Our comprehensive evaluation demonstrates that semantically and visually close OOD objects are easier to localize than far ones, but are also more easily confounded with ID objects. $\textit{Far}$ and $\textit{farther}$ objects are harder to localize but less prone to be taken for an ID object.

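[10] Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation cs.CV

Nick Yiwen Huang, Akin Caliskan, Berkay Kicanaoglu, James Tompkin, Hyeongwoo Kim

TL;DR: The paper proposes a method for disentangling 3D information from large vision-language models (LVLMs) to generate controllable 3D portraits. By combining a pre-trained LVLM (CLIP) with a 3D morphable model (FLAME), it allows free-form control of portrait appearance (e.g., age, hairstyle) and geometry (e.g., expression, camera pose).

Motivation: Current 3D generation methods usually require large amounts of labeled data or the training of complex models. The authors aim to disentangle the 3D information in LVLMs so that existing 2D data and pre-trained models can provide more flexible, low-resource control over 3D portrait generation.

Result: Compared with existing methods, the generated portraits stay consistent under text and 3D geometry control while maintaining high quality and diversity.

Insight: Disentangling 3D information from pre-trained models enables flexible control of 3D generators without large-scale labeling or training, giving creators a low-resource solution.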

Abstract: We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM’s embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.

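Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Shengyu Dai

TL;DR: SimpleDoc is a lightweight yet powerful retrieval-augmented framework for DocVQA that significantly improves multi-page document question answering through dual-cue retrieval and iterative refinement.

Motivation: DocVQA is a challenging task that requires combining multi-page, multi-modal information to answer questions. Existing methods typically rely on VLM embedding models, whose retrieval is imprecise. SimpleDoc aims to address this with more efficient retrieval and iterative refinement.

Result: SimpleDoc improves performance by 3.2% on average across 4 DocVQA datasets while retrieving fewer pages.

Insight: Combining embeddings with content summaries lets dual-cue retrieval select relevant pages more precisely; iterative refinement gradually increases answer confidence and reduces redundant retrieval.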

Abstract: Document Visual Question Answering (DocVQA) is a practical yet challenging task, which is to ask questions based on documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but utilize a Visual Language Model (VLM)-based embedding model to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets with much fewer pages retrieved. Our code is available at https://github.com/ag2ai/SimpleDoc.
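The two-stage page selection in this abstract (embedding similarity first, then filtering and re-ranking on page summaries) can be sketched as follows. This is an illustrative sketch only and not the code from the linked repository: `dual_cue_retrieve`, `keyword_score`, the page dictionaries, and the keyword-count scoring are hypothetical stand-ins, since in the paper the summary-based step is performed by a model rather than by keyword matching. The iterative reasoner loop is omitted.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dual_cue_retrieve(query_vec, pages, summary_score, top_k=3, keep=2):
    """Select page ids using two cues.

    pages: list of dicts with 'id', 'embedding' and 'summary' keys.
    summary_score: hypothetical callback mapping a summary string to a
                   relevance score; pages scoring 0 or less are filtered out.
    """
    # Cue 1: shortlist the top_k pages by embedding similarity to the query.
    shortlist = sorted(
        pages, key=lambda p: dot(query_vec, p["embedding"]), reverse=True
    )[:top_k]
    # Cue 2: filter and re-rank the shortlist using the page summaries.
    scored = [(summary_score(p["summary"]), p) for p in shortlist]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [p["id"] for _, p in relevant[:keep]]


def keyword_score(summary):
    """Toy relevance score: number of times 'revenue' appears in the summary."""
    return summary.split().count("revenue")


pages = [
    {"id": 1, "embedding": [0.9, 0.1], "summary": "quarterly revenue table"},
    {"id": 2, "embedding": [0.8, 0.2], "summary": "company history"},
    {"id": 3, "embedding": [0.1, 0.9], "summary": "revenue chart by region"},
    {"id": 4, "embedding": [0.7, 0.3], "summary": "revenue forecast and revenue risks"},
]

selected = dual_cue_retrieve([1.0, 0.0], pages, keyword_score)
print(selected)
```

Page 2 is close to the query in embedding space but is dropped by the summary cue, while page 3 has a relevant summary but never reaches the shortlist; using both cues in sequence is what keeps the number of retrieved pages small.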

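[12] Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems cs.CV | cs.AI

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

TL;DR: The paper surveys the application of large language models (LLMs) to image segmentation, with a focus on their potential and challenges in intelligent transportation systems (ITS).

Motivation: ITS require accurate scene understanding for safety and efficiency, and combining LLMs with computer vision offers a new paradigm for this field.

Result: The survey shows how LLMs can improve road scene understanding (e.g., autonomous driving and traffic monitoring) and identifies key issues such as real-time performance and reliability.

Insight: Explainable, human-centric AI is a prerequisite for successfully deploying LLMs in next-generation transportation systems.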

Abstract: The integration of Large Language Models (LLMs) with computer vision is profoundly transforming perception tasks like image segmentation. For intelligent transportation systems (ITS), where accurate scene understanding is critical for safety and efficiency, this new paradigm offers unprecedented capabilities. This survey systematically reviews the emerging field of LLM-augmented image segmentation, focusing on its applications, challenges, and future directions within ITS. We provide a taxonomy of current approaches based on their prompting mechanisms and core architectures, and we highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. Finally, we identify key challenges, including real-time performance and safety-critical reliability, and outline a perspective centered on explainable, human-centric AI as a prerequisite for the successful deployment of this technology in next-generation transportation systems.

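[13] FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution cs.CV

Siyu Xu, Wenjie Li, Guangwei Gao, Jian Yang, Guo-Jun Qi

TL;DR: FADPNet is a frequency-aware dual-path network for face super-resolution (FSR). It decomposes features into high- and low-frequency components, processes them with a CNN and Mamba respectively, and thereby allocates computation more effectively and improves performance.

Motivation: Existing FSR methods treat all pixels equally, which leads to suboptimal allocation of computation and degraded performance. CNNs are sensitive to high-frequency features such as contours, while Mamba excels at low-frequency features such as color and texture with low computational complexity. FADPNet is proposed to address these issues.

Result: FADPNet performs well on FSR, outperforming existing methods while balancing quality and model efficiency.

Insight: Decomposing features into high- and low-frequency components and processing them separately uses computation more efficiently and improves super-resolution performance. Mamba captures low-frequency features at lower complexity than Transformers, while CNNs retain an advantage on high-frequency features.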

Abstract: Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performance. CNN is relatively sensitive to high-frequency facial features, such as component contours and facial outlines. Meanwhile, Mamba excels at capturing low-frequency features like facial color and fine-grained texture, and does so with lower complexity than Transformers. Motivated by these observations, we propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components and processes them via dedicated branches. For low-frequency regions, we introduce a Mamba-based Low-Frequency Enhancement Block (LFEB), which combines state-space attention with squeeze-and-excitation operations to extract low-frequency global interactions and emphasize informative channels. For high-frequency regions, we design a CNN-based Deep Position-Aware Attention (DPA) module to enhance spatially-dependent structural details, complemented by a lightweight High-Frequency Refinement (HFR) module that further refines frequency-specific representations. Through the above designs, our method achieves an excellent balance between FSR quality and model efficiency, outperforming existing approaches.

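[14] Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology cs.CV

Nafiz Sadman, Farhana Zulkernine, Benjamin Kwan

TL;DR: The paper studies BiomedCLIP on a highly imbalanced, out-of-distribution (OOD) multi-label medical dataset (IU-xray), analyzing its embedding space and classification ability; zero-shot prediction has low precision, while full fine-tuning improves results.

Motivation: To explore BiomedCLIP's capabilities on medical imaging tasks, particularly under high imbalance and out-of-distribution data, in order to improve reliability in practical use.

Result: Zero-shot prediction has low precision and over-predicts labels; full fine-tuning brings clear improvement; linear probing captures overlapping features.

Insight: Models need careful adaptation to be reliable in real-world medical settings.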

Abstract: In this paper, we pursue two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on the IU-xray dataset, which exhibits the aforementioned criteria, and evaluate BiomedCLIP in classifying images (radiographs) in three contexts: zero-shot inference, full fine-tuning, and linear probing. The results show that the model under zero-shot settings over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing detects overlapping features. We demonstrate visual understanding of the model using Grad-CAM heatmaps and compare with 15 annotations by a radiologist. We highlight the need for careful adaptations of the models to foster reliability and applicability in a real-world setting. The code for the experiments in this work is available and maintained on GitHub.

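[15] RadFabric: Agentic AI System with Reasoning Capability for Radiology cs.CV | cs.CL

Wenting Chen, Yi Dong, Zhaojun Ding, Yucheng Shi, Yifan Zhou

TL;DR: RadFabric is a multi-agent, multimodal reasoning framework that combines visual and textual analysis to improve comprehensive diagnosis from chest X-rays (CXR).

Motivation: Current automated systems for chest X-ray diagnosis suffer from limited pathology coverage, low diagnostic accuracy, and insufficient integration of visual and textual reasoning, so a more comprehensive solution is needed.

Result: Achieves 1.000 accuracy on challenging pathologies such as fractures and an overall diagnostic accuracy of 0.799, well above traditional systems (0.229 to 0.527).

Insight: Through cross-modal feature alignment and preference-driven reasoning, RadFabric offers transparent, anatomically precise, and clinically actionable diagnosis for AI-driven radiology.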

Abstract: Chest X-ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadFabric is built on the Model Context Protocol (MCP), enabling modularity, interoperability, and scalability for seamless integration of new diagnostic agents. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence-based diagnoses. RadFabric achieves significant performance improvements, with near-perfect detection of challenging pathologies like fractures (1.000 accuracy) and superior overall diagnostic accuracy (0.799) compared to traditional systems (0.229 to 0.527). By integrating cross-modal feature alignment and preference-driven reasoning, RadFabric advances AI-driven radiology toward transparent, anatomically precise, and clinically actionable CXR analysis.

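[16] SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability cs.CV | cs.AI

Juho Bai, Inwook Shim

TL;DR: SceneAware is a new pedestrian trajectory prediction framework that improves accuracy by incorporating scene understanding (a ViT encoder and LLM-generated walkability masks), and it significantly outperforms existing methods on the ETH/UCY datasets.

Motivation: Existing pedestrian trajectory prediction methods focus mainly on social interactions between pedestrians and overlook the strong influence of environmental context on human movement, so a more comprehensive approach is needed.

Result: More than 50% improvement on the ETH/UCY datasets, with consistent performance across different types of pedestrian movement.

Insight: Explicit scene information and physical constraints are essential for pedestrian trajectory prediction, and SceneAware demonstrates their effectiveness and reliability.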

Abstract: Accurate prediction of pedestrian trajectories is essential for applications in robotics and surveillance systems. While existing approaches primarily focus on social interactions between pedestrians, they often overlook the rich environmental context that significantly shapes human movement patterns. In this paper, we propose SceneAware, a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. Our method leverages a Vision Transformer (ViT) scene encoder to process environmental context from static scene images, while Multi-modal Large Language Models (MLLMs) generate binary walkability masks that distinguish between accessible and restricted areas during training. We combine a Transformer-based trajectory encoder with the ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. The framework integrates collision penalty mechanisms that discourage predicted trajectories from violating physical boundaries, ensuring physically plausible predictions. SceneAware is implemented in both deterministic and stochastic variants. Comprehensive experiments on the ETH/UCY benchmark datasets show that our approach outperforms state-of-the-art methods, with more than 50% improvement over previous models. Our analysis based on different trajectory categories shows that the model performs consistently well across various types of pedestrian movement. This highlights the importance of using explicit scene information and shows that our scene-aware approach is both effective and reliable in generating accurate and physically plausible predictions. Code is available at: https://github.com/juho127/SceneAware.

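[17] VideoMAR: Autoregressive Video Generation with Continuous Tokens cs.CV | cs.AI

Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai

TL;DR: The paper proposes VideoMAR, an autoregressive video generation model based on continuous tokens; by combining temporal and spatial generation, it significantly improves the efficiency and quality of video generation while reducing compute requirements.

Motivation: Mask-based autoregressive models perform well at image generation, but their potential for video generation has not been fully explored. This work investigates autoregressive models for video generation.

Result: On the VBench-I2V benchmark, VideoMAR surpasses the previous SOTA model (Cosmos I2V) while using only 9.3% of the parameters, 0.5% of the training data, and 0.2% of the GPU resources.

Insight: Autoregressive video generation can become much more efficient by combining temporal and spatial generation strategies, and language-model properties such as extrapolation can transfer to video generation.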

Abstract: Masked-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, composing temporal frame-by-frame and spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss for the integration of mask and video generation. Besides, the huge cost and difficulty of long sequence autoregressive modeling is a basic but crucial issue. To this end, we propose the temporal short-to-long curriculum learning and spatial progressive resolution training, and employ progressive temperature strategy at inference time to mitigate the accumulation error. Furthermore, VideoMAR replicates several unique capacities of language models to video generation. It inherently bears high efficiency due to simultaneous temporal-wise KV cache and spatial-wise parallel generation, and presents the capacity of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%).

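[18] A multi-stage augmented multimodal interaction network for fish feeding intensity quantification cs.CV | cs.AI | cs.ET

Shulong Zhang, Mingyuan Yao, Jiayin Zhao, Xiao Liu, Haihua Wang

TL;DR: The paper proposes a Multi-stage Augmented Multimodal Interaction Network (MAINet) for quantifying fish feeding intensity. Efficient feature extraction, enhanced inter-modal interaction, and evidence reasoning significantly improve the model's accuracy and reliability.

Motivation: In recirculating aquaculture systems, accurately assessing fish feeding intensity is essential for reducing feed costs and optimizing feeding times. Existing studies are limited in modality selection, feature extraction and fusion, and collaborative inference, which holds back multimodal fusion models.

Result: MAINet exceeds 96.7% in accuracy, precision, recall, and F1-score, significantly outperforming the other models. Ablation experiments confirm the key role of the proposed improvements in model robustness and feature utilization efficiency.

Insight: A fusion strategy that combines inter-modal interaction with evidence reasoning can effectively improve multimodal models and offers a new technical route to intelligent decision-making in aquaculture.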

Abstract: In recirculating aquaculture systems, accurate and effective assessment of fish feeding intensity is crucial for reducing feed costs and calculating optimal feeding times. However, current studies have limitations in modality selection, feature extraction and fusion, and co-inference for decision making, which restrict further improvement in the accuracy, applicability and reliability of multimodal fusion models. To address this problem, this study proposes a Multi-stage Augmented Multimodal Interaction Network (MAINet) for quantifying fish feeding intensity. First, a general feature extraction framework is proposed to efficiently extract feature information from input image, audio and water wave data. Second, an Auxiliary-modality Reinforcement Primary-modality Mechanism (ARPM) is designed for inter-modal interaction and to generate enhanced features; it consists of a Channel Attention Fusion Network (CAFN) and a Dual-mode Attention Fusion Network (DAFN). Finally, an Evidence Reasoning (ER) rule is introduced to fuse the output results of each modality and make decisions, thereby completing the quantification of fish feeding intensity. The experimental results show that the constructed MAINet reaches 96.76%, 96.78%, 96.79% and 96.79% in accuracy, precision, recall and F1-Score respectively, and its performance is significantly higher than the comparison models. Compared with models that adopt single-modality, dual-modality fusion and different decision-making fusion methods, it also has obvious advantages. Meanwhile, the ablation experiments further verified the key role of the proposed improvement strategy in improving the robustness and feature utilization efficiency of the model, which can effectively improve the accuracy of the quantitative results of fish feeding intensity.

[19] One-Shot Neural Architecture Search with Network Similarity Directed Initialization for Pathological Image Classification cs.CVPDF

Renao Yan

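TL;DR: The paper proposes a Network Similarity Directed Initialization (NSDI) strategy and combines it with domain adaptation to improve the stability and performance of one-shot neural architecture search (One-Shot NAS) for pathological image classification.

Details

Motivation: Existing methods typically apply computer vision models directly to medical tasks and neglect the distinct characteristics of pathological images. This mismatch leads to computational inefficiency, especially in edge-computing scenarios.

Result: Experiments on the BRACS dataset show that the method outperforms existing approaches, with better classification performance and localization of clinically relevant features.

Insight: Combining NSDI with domain adaptation effectively addresses domain differences in pathological image analysis and suits resource-constrained settings.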

Abstract: Deep learning-based pathological image analysis presents unique challenges due to the practical constraints of network design. Most existing methods apply computer vision models directly to medical tasks, neglecting the distinct characteristics of pathological images. This mismatch often leads to computational inefficiencies, particularly in edge-computing scenarios. To address this, we propose a novel Network Similarity Directed Initialization (NSDI) strategy to improve the stability of neural architecture search (NAS). Furthermore, we introduce domain adaptation into one-shot NAS to better handle variations in staining and semantic scale across pathology datasets. Experiments on the BRACS dataset demonstrate that our method outperforms existing approaches, delivering both superior classification performance and clinically relevant feature localization.

[20] Meta-SurDiff: Classification Diffusion Model Optimized by Meta Learning is Reliable for Online Surgical Phase Recognition cs.CVPDF

Yufei Li, Jirui Wu, Long Tian, Liming Wang, Xiaonan Liu

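TL;DR: The paper proposes Meta-SurDiff, a classification diffusion model optimized by meta-learning for online surgical phase recognition, which improves reliability by modeling the uncertainty in surgical videos.

Details

Motivation: Uncertainty in surgical videos (such as frame ambiguity and the unbalanced distribution of surgical phases) has not been fully explored, yet it is crucial for reliable online recognition.

Result: The model's effectiveness is validated with multiple metrics on five datasets (Cholec80, AutoLaparo, and others).

Insight: Combining generative models with meta-learning can effectively model the uncertainty in surgical videos and improve the reliability of online recognition.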

Abstract: Online surgical phase recognition has drawn great attention most recently due to its potential downstream applications closely related to human life and health. Although deep models have made significant advances in capturing the discriminative long-term dependency of surgical videos to achieve improved recognition, they rarely account for exploring and modeling the uncertainty in surgical videos, which should be crucial for reliable online surgical phase recognition. We categorize the sources of uncertainty into two types, frame ambiguity in videos and unbalanced distribution among surgical phases, which are inevitable in surgical videos. To address this pivotal issue, we introduce a meta-learning-optimized classification diffusion model (Meta-SurDiff) to take full advantage of the deep generative model and meta-learning in achieving precise frame-level distribution estimation for reliable online surgical phase recognition. For coarse recognition caused by ambiguous video frames, we employ a classification diffusion model to assess the confidence of recognition results at a finer-grained frame-level instance. For coarse recognition caused by unbalanced phase distribution, we use a meta-learning based objective to learn the diffusion model, thus enhancing the robustness of classification boundaries for different surgical phases. We establish the effectiveness of Meta-SurDiff in online surgical phase recognition through extensive experiments on five widely used datasets using more than four practical metrics. The datasets include Cholec80, AutoLaparo, M2Cai16, OphNet, and NurViD, where OphNet comes from ophthalmic surgeries, NurViD is the daily care dataset, while the others come from laparoscopic surgeries. We will release the code upon acceptance.

[21] Egocentric Human-Object Interaction Detection: A New Benchmark and Method cs.CVPDF

Kunyuan Deng, Yi Wang, Lap-Pui Chau

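TL;DR: The paper presents Ego-HOIBench, a new dataset for egocentric human-object interaction (Ego-HOI) detection, and designs a Hand Geometry and Interactivity Refinement (HGIR) method that significantly improves detection performance.

Details

Motivation: Existing human-object interaction detection methods focus mainly on third-person perspectives and overlook the more intuitive egocentric view, which matters in practical applications.

Result: HGIR achieves state-of-the-art performance on the Ego-HOIBench dataset, and its lightweight design makes it plug-and-play.

Insight: Egocentric human-object interaction detection must account for hand occlusion and complex hand configurations; hand geometry provides important cues for understanding interactions.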

Abstract: Understanding the interaction between humans and objects has gained much attention in recent years. Existing human-object interaction (HOI) detection methods mainly focus on third-person perspectives, overlooking a more intuitive way from the egocentric view of HOI, namely Ego-HOI. This paper introduces Ego-HOIBench, a new dataset to promote the benchmarking and development of Ego-HOI detection. Our Ego-HOIBench comprises more than 27K egocentric images with high-quality hand-verb-object triplet annotations across 123 fine-grained interaction categories and locations, covering a rich diversity of scenarios, object types, and hand configurations in daily activities. In addition, we explore and adapt third-person HOI detection methods to Ego-HOIBench and illustrate the challenges of hand-occluded objects and the complexity of single- and two-hand interactions. To build a new baseline, we propose a Hand Geometry and Interactivity Refinement (HGIR) scheme, which leverages hand pose and geometric information as valuable cues for interpreting interactions. Specifically, the HGIR scheme explicitly extracts global hand geometric features from the estimated hand pose proposals and refines the interaction-specific features using pose-interaction attention. This scheme enables the model to obtain a robust and powerful interaction representation, significantly improving the Ego-HOI detection capability. Our approach is lightweight and effective, and it can be easily applied to HOI baselines in a plug-and-play manner to achieve state-of-the-art results on Ego-HOIBench. Our project is available at: https://dengkunyuan.github.io/EgoHOIBench/

[22] Unified Representation Space for 3D Visual Grounding cs.CVPDF

Yinuo Zheng, Lipeng Gu, Honghua Chen, Liangliang Nan, Mingqiang Wei

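TL;DR: This paper proposes UniSpace-3D, a unified representation space approach for 3D visual grounding that combines the pre-trained CLIP model with multi-modal contrastive learning to significantly narrow the gap between the visual and textual modalities, outperforming baseline models on multiple datasets.

Details

Motivation: Existing 3D visual grounding methods rely on separately pre-trained vision and text encoders, which creates geometric and semantic gaps between the modalities and hurts the accuracy of object positioning and classification.

Result: UniSpace-3D outperforms baselines by at least 2.24% on multiple datasets.

Insight: A unified representation space with multi-modal contrastive learning can effectively reduce the gap between the visual and textual modalities and improve accuracy on 3D visual grounding.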

Abstract: 3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes based on text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. The paper proposes UniSpace-3D, which innovatively introduces a unified representation space for 3DVG, effectively bridging the gap between visual and textual features. Specifically, UniSpace-3D incorporates three innovative designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a unified representation space, effectively bridging the gap between the two modalities; ii) a multi-modal contrastive learning module that further reduces the modality gap; iii) a language-guided query selection module that utilizes the positional and semantic information to identify object candidate points aligned with textual descriptions. Extensive experiments demonstrate that UniSpace-3D outperforms baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. The code will be made available upon acceptance of the paper.

Xiaohui Jiang, Haijiang Zhu, Chadei Li, Fulin Tang, Ning An

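TL;DR: The paper proposes a cross-modal geometric hierarchy fusion framework driven by implicit submaps. It addresses the problems that inconsistent point cloud density and single-level geometric abstraction cause in LiDAR place recognition, and achieves high-performance 3D place recognition.

Details

Motivation: Existing LiDAR place recognition methods rely on handcrafted feature extraction; inconsistent point cloud density and single-level geometric abstraction lead to unstable descriptors and fragile representations. The paper aims to provide a density-agnostic geometric reasoning framework that solves these problems.

Result: The method achieves state-of-the-art performance on multiple datasets and performs well in accuracy, runtime, and memory optimization.

Insight: Implicit representation and cross-modal geometric hierarchy fusion can significantly improve the robustness and scalability of LiDAR place recognition.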

Abstract: LiDAR-based place recognition serves as a crucial enabler for long-term autonomy in robotics and autonomous driving systems. Yet, prevailing methodologies relying on handcrafted feature extraction face dual challenges: (1) inconsistent point cloud density, induced by ego-motion dynamics and environmental disturbances during repeated traversals, leads to descriptor instability, and (2) representation fragility stems from reliance on single-level geometric abstractions that lack discriminative power in structurally complex scenarios. To address these limitations, we propose a novel framework that redefines 3D place recognition through density-agnostic geometric reasoning. Specifically, we introduce an implicit 3D representation based on elastic points, which is immune to the interference of original scene point cloud density and achieves the characteristic of uniform distribution. Subsequently, we derive the occupancy grid and normal vector information of the scene from this implicit representation. Finally, with the aid of these two types of information, we obtain descriptors that fuse geometric information from both bird’s-eye view (capturing macro-level spatial layouts) and 3D segment (encoding micro-scale surface geometries) perspectives. We conducted extensive experiments on numerous datasets (KITTI, KITTI-360, MulRan, NCLT) across diverse environments. The experimental results demonstrate that our method achieves state-of-the-art performance. Moreover, our approach strikes an optimal balance between accuracy, runtime, and memory optimization for historical maps, showcasing excellent resilience and scalability. Our code will be open-sourced in the future.

[24] Comparison of Two Methods for Stationary Incident Detection Based on Background Image cs.CVPDF

Deepak Ghimire, Joonwhoan Lee

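TL;DR: The paper compares two background-image-based methods for stationary incident detection, one using a single background and the other using dual backgrounds, and experimentally compares their detection performance and computational complexity. The final method runs in real time and is robust to partial occlusion and illumination changes.

Details

Motivation: Traditional background subtraction is mainly used to detect moving objects, while the detection of stationary objects is often overlooked. This paper adapts background subtraction and proposes two methods for detecting stationary objects.

Result: Experiments show that the dual-background method outperforms the single-background method in detection performance and computational complexity, especially under partial occlusion and illumination changes.

Insight: By dynamically adjusting the learning rate, the dual-background method can detect stationary objects more accurately while remaining real-time and robust.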

Abstract: In general, background subtraction-based methods are used to detect moving objects in visual tracking applications. In this paper, we employ a background subtraction-based scheme to detect temporarily stationary objects. We propose two schemes for stationary object detection and compare them in terms of detection performance and computational complexity. In the first approach, we use a single background, and in the second approach, we use dual backgrounds, generated with different learning rates, in order to detect temporarily stopped objects. Finally, we use normalized cross correlation (NCC) based image comparison to monitor and track the detected stationary object in a video scene. The proposed method is robust to partial occlusion, short-time full occlusion, and illumination changes, and it can operate in real time.
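
The dual-background idea in the abstract can be illustrated with a minimal sketch. It assumes exponential running-average backgrounds and a simple "fast background differs from slow background" test; the abstract does not specify the update rule, the thresholds, or the decision logic, so the function names, learning rates, and threshold below are all hypothetical. The `ncc` helper shows the standard normalized cross-correlation that could be used to check whether a detected patch still matches in later frames.

```python
import math

def update_background(background, frame, learning_rate):
    """Exponential running average: bg <- (1 - a) * bg + a * frame."""
    return [(1.0 - learning_rate) * b + learning_rate * f
            for b, f in zip(background, frame)]

def stationary_candidates(fast_bg, slow_bg, threshold):
    """Flag pixels the fast background has absorbed but the slow one has not
    (an assumed decision rule, not taken from the paper)."""
    return [abs(f - s) > threshold for f, s in zip(fast_bg, slow_bg)]

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    mean_a = sum(patch_a) / len(patch_a)
    mean_b = sum(patch_b) / len(patch_b)
    da = [a - mean_a for a in patch_a]
    db = [b - mean_b for b in patch_b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat patch: correlation is undefined, report no match
    return sum(x * y for x, y in zip(da, db)) / denom

# Toy 1-D "video": an object (value 200) stops on pixels 2-3 of a dark scene.
empty = [10.0] * 6
with_object = [10.0, 10.0, 200.0, 200.0, 10.0, 10.0]
fast_bg, slow_bg = list(empty), list(empty)
for _ in range(30):
    fast_bg = update_background(fast_bg, with_object, 0.2)    # adapts quickly
    slow_bg = update_background(slow_bg, with_object, 0.005)  # adapts slowly
mask = stationary_candidates(fast_bg, slow_bg, threshold=50.0)
```

After 30 frames the fast background has nearly converged to the stopped object while the slow background still resembles the empty scene, so only the object's pixels are flagged. A moving object would not stay long enough to be absorbed by either background.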

[25] Exploring Non-contrastive Self-supervised Representation Learning for Image-based Profiling cs.CVPDF

Siran Dai, Qianqian Xu, Peisong Wen, Yang Liu, Qingming Huang

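TL;DR: The paper proposes SSLProfiler, a non-contrastive self-supervised learning framework designed specifically for cell image analysis. It addresses two main challenges of existing methods through customized data augmentation and representation post-processing, and won a challenge at CVPR 2025.

Details

Motivation: Cell image analysis is important in drug discovery, but because cell images differ substantially in distribution from natural images and inputs usually involve multiple images, conventional self-supervised learning methods are hard to apply directly. A non-contrastive self-supervised learning method tailored to cell images is therefore needed.

Result: SSLProfiler won the Cell Line Transferability challenge at CVPR 2025, demonstrating its effectiveness for cell image analysis.

Insight: The paper shows that tailoring the data augmentation and representation post-processing of self-supervised learning to a specific domain (such as cell images) can significantly improve generality and robustness, an approach that may also inform other biomedical imaging tasks.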

Abstract: Image-based cell profiling aims to create informative representations of cell images. This technique is critical in drug discovery and has greatly advanced with recent improvements in computer vision. Inspired by recent developments in non-contrastive Self-Supervised Learning (SSL), this paper provides an initial exploration into training a generalizable feature extractor for cell images using such methods. However, there are two major challenges: 1) There is a large difference between the distributions of cell images and natural images, causing the view-generation process in existing SSL methods to fail; and 2) Unlike typical scenarios where each representation is based on a single image, cell profiling often involves multiple input images, making it difficult to effectively combine all available information. To overcome these challenges, we propose SSLProfiler, a non-contrastive SSL framework specifically designed for cell profiling. We introduce specialized data augmentation and representation post-processing methods tailored to cell images, which effectively address the issues mentioned above and result in a robust feature extractor. With these improvements, SSLProfiler won the Cell Line Transferability challenge at CVPR 2025.

[26] Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment cs.CVPDF

Weiming Zhang, Dingwen Xiao, Aobotao Dai, Yexin Liu, Tianbo Pan

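TL;DR: Leader360V is the first large-scale, labeled real-world 360 video dataset for instance segmentation and tracking, and the paper proposes an automatic labeling pipeline to improve annotation efficiency.

Details

Motivation: 360 video scene understanding tasks (such as segmentation and tracking) are crucial for real applications (such as autonomous driving and robotics), but large-scale labeled datasets are lacking. Spherical properties and content discontinuities make annotation costly and complex.

Result: User studies and experiments show that the labeling pipeline is efficient and that the dataset significantly improves performance on 360 video segmentation and tracking.

Insight: An automatic labeling pipeline that combines 2D segmentors with LLMs can efficiently handle the complexity and distortion of 360 video, providing data support for future 360 scene understanding tasks.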

Abstract: 360 video captures the complete surrounding scene with an ultra-large field of view of 360°×180°. This makes 360 scene understanding tasks, e.g., segmentation and tracking, crucial for applications such as autonomous driving and robotics. With the recent emergence of foundation models, the community is, however, impeded by the lack of large-scale, labeled real-world datasets. This is caused by the inherent spherical properties, e.g., severe distortion in polar regions, and content discontinuities, rendering the annotation costly yet complex. This paper introduces Leader360V, the first large-scale, labeled real-world 360 video dataset for instance segmentation and tracking. Our dataset enjoys high scene diversity, ranging from indoor and urban settings to natural and dynamic outdoor scenes. To automate annotation, we design an automatic labeling pipeline, which subtly coordinates pre-trained 2D segmentors and large language models to facilitate the labeling. The pipeline operates in three novel stages. Specifically, in the Initial Annotation Phase, we introduce a Semantic- and Distortion-aware Refinement (SDR) module, which combines object mask proposals from multiple 2D segmentors with LLM-verified semantic labels. These are then converted into mask prompts to guide SAM2 in generating distortion-aware masks for subsequent frames. In the Auto-Refine Annotation Phase, missing or incomplete regions are corrected either by applying the SDR again or resolving the discontinuities near the horizontal borders. The Manual Revision Phase finally incorporates LLMs and human annotators to further refine and validate the annotations. Extensive user studies and evaluations demonstrate the effectiveness of our labeling pipeline. Meanwhile, experiments confirm that Leader360V significantly enhances model performance for 360 video segmentation and tracking, paving the way for more scalable 360 scene understanding.

Avigail Cohen Rimon, Mirela Ben-Chen, Or Litany

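TL;DR: This paper proposes a novel approach that treats functional maps as 2D images and trains an image diffusion model directly in the space of functional maps to refine correspondences between shapes.

Details

Motivation: Existing correspondence refinement methods usually rely on complex optimization or learning frameworks, whereas the functional map representation can be treated as an image, which opens a new route to refinement.

Result: Experiments show that the method is competitive on functional map refinement, demonstrating the potential of diffusion models for functional map processing.

Insight: Treating functional maps as images and applying diffusion models yields an efficient and flexible correspondence refinement framework and extends the use of diffusion models in geometry processing.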

Abstract: We propose a novel approach for refining a given correspondence map between two shapes. A correspondence map represented as a functional map, namely a change of basis matrix, can be additionally treated as a 2D image. With this perspective, we train an image diffusion model directly in the space of functional maps, enabling it to generate accurate maps conditioned on an inaccurate initial map. The training is done purely in the functional space, and thus is highly efficient. At inference time, we use the pointwise map corresponding to the current functional map as guidance during the diffusion process. The guidance can additionally encourage different functional map objectives, such as orthogonality and commutativity with the Laplace-Beltrami operator. We show that our approach is competitive with state-of-the-art methods of map refinement and that guided diffusion models provide a promising pathway to functional map processing.
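
The two functional map objectives named in the abstract, orthogonality and commutativity with the Laplace-Beltrami operator, have standard matrix forms that a short sketch can make concrete. The code below assumes the functional map `c` is expressed in truncated Laplace-Beltrami eigenbases, so each operator reduces to a diagonal matrix of eigenvalues, and that `c` maps source coefficients to target coefficients. How the paper turns these quantities into diffusion guidance is not stated in the abstract; the function names and toy matrices here are illustrative only.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def frobenius(a):
    return math.sqrt(sum(x * x for row in a for x in row))

def orthogonality_residual(c):
    """|| C^T C - I ||_F, zero when C is orthogonal."""
    ctc = matmul(transpose(c), c)
    n = len(ctc)
    return frobenius([[ctc[i][j] - (1.0 if i == j else 0.0)
                       for j in range(n)] for i in range(n)])

def commutativity_residual(c, evals_src, evals_tgt):
    """|| C diag(evals_src) - diag(evals_tgt) C ||_F.

    Entry (i, j) of the difference is c[i][j] * (evals_src[j] - evals_tgt[i]).
    """
    return frobenius([[c[i][j] * (evals_src[j] - evals_tgt[i])
                       for j in range(len(evals_src))]
                      for i in range(len(evals_tgt))])

# Toy 2x2 maps: the identity satisfies both objectives; a swap of two basis
# functions is still orthogonal but mixes different eigenvalues.
c_identity = [[1.0, 0.0], [0.0, 1.0]]
c_swap = [[0.0, 1.0], [1.0, 0.0]]
evals = [0.0, 1.0]
identity_scores = (orthogonality_residual(c_identity),
                   commutativity_residual(c_identity, evals, evals))
swap_scores = (orthogonality_residual(c_swap),
               commutativity_residual(c_swap, evals, evals))
```

Both residuals are differentiable in the entries of `c`, which is what makes them usable as penalty terms during a guided sampling process.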

[28] FGA-NN: Film Grain Analysis Neural Network cs.CV | eess.IVPDF

Zoubida Ameur, Frédéric Lefebvre, Philippe De Lagrange, Miloš Radosavljević

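TL;DR: FGA-NN is a learning-based film grain analysis method that aims to preserve the aesthetic effect of film grain after compression while balancing analysis accuracy and synthesis complexity.

Details

Motivation: Film grain is an important aesthetic element of cinematic content, but it is lost under medium-to-low-bitrate compression. Preserving its artistic effect under compression requires an effective analysis and modeling method.

Result: Experiments show that FGA-NN achieves a superior balance between analysis accuracy and synthesis complexity and is robust and practical.

Insight: Learning-based techniques can effectively address film grain preservation under compression and suggest new ideas for analyzing and modeling other kinds of random noise.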

Abstract: Film grain, once a by-product of analog film, is now present in most cinematographic content for aesthetic reasons. However, when such content is compressed at medium to low bitrates, film grain is lost due to its random nature. To preserve artistic intent while compressing efficiently, film grain is analyzed and modeled before encoding and synthesized after decoding. This paper introduces FGA-NN, the first learning-based film grain analysis method to estimate conventional film grain parameters compatible with conventional synthesis. Quantitative and qualitative results demonstrate FGA-NN’s superior balance between analysis accuracy and synthesis complexity, along with its robustness and applicability.

[29] EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization cs.CV | cs.AIPDF

Xiaoqi Wang, Yi Wang, Lap-Pui Chau

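TL;DR: EVA02-AT is a suite of EVA02-based video-language foundation models focused on egocentric video understanding. Single-stage pre-training, spatial-temporal rotary positional embeddings, and symmetric optimization significantly improve efficiency and performance.

Details

Motivation: Existing methods struggle with efficient spatial-temporal modeling and pre-training cost: multi-stage pre-training is expensive, spatial-temporal feature encoding is inadequate, and soft-label learning objectives are imprecise.

Result: The models achieve SOTA performance on Ego4D, EPIC-Kitchens-100, and Charades-Ego with fewer parameters, with significant gains on retrieval tasks.

Insight: Joint encoding of spatial-temporal features and symmetric optimization can significantly improve performance on video understanding and retrieval tasks.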

Abstract: Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) Excessive pre-training cost arising from multi-stage pre-training pipelines, 2) Ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) Imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT first efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pretraining. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at https://github.com/xqwang14/EVA02-AT .

[30] HydroChronos: Forecasting Decades of Surface Water Change cs.CVPDF

Daniele Rege Cambrin, Eleonora Poeta, Eliana Pastor, Isaac Corley, Tania Cerquitelli

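TL;DR: HydroChronos is a large-scale, multi-modal spatiotemporal dataset for forecasting surface water dynamics that fills the field's dataset and benchmark gap. The paper also proposes the AquaClimaTempo UNet model, which significantly outperforms the baseline.

Details

Motivation: Surface water dynamics forecasting currently lacks comprehensive datasets and standardized benchmarks, which hinders progress in water resource management and climate change adaptation research.

Result: The model beats the Persistence baseline by 14% and 11% F1 on change detection and direction-of-change classification, and by 0.1 MAE on magnitude-of-change regression.

Insight: Explainability analysis reveals the key climate variables and input channels that influence surface water change, providing guidance for future modeling.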

Abstract: Forecasting surface water dynamics is crucial for water resource management and climate change adaptation. However, the field lacks comprehensive datasets and standardized benchmarks. In this paper, we introduce HydroChronos, a large-scale, multi-modal spatiotemporal dataset for surface water dynamics forecasting designed to address this gap. We couple the dataset with three forecasting tasks. The dataset includes over three decades of aligned Landsat 5 and Sentinel-2 imagery, climate data, and Digital Elevation Models for diverse lakes and rivers across Europe, North America, and South America. We also propose AquaClimaTempo UNet, a novel spatiotemporal architecture with a dedicated climate data branch, as a strong benchmark baseline. Our model significantly outperforms a Persistence baseline for forecasting future water dynamics by +14% and +11% F1 across change detection and direction of change classification tasks, and by +0.1 MAE on the magnitude of change regression. Finally, we conduct an Explainable AI analysis to identify the key climate variables and input channels that influence surface water change, providing insights to inform and guide future modeling efforts.

[31] Discrete JEPA: Learning Discrete Token Representations without Reconstruction cs.CVPDF

Junyeob Baek, Hosung Lee, Christopher Hoang, Mengye Ren, Sungjin Ahn

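TL;DR: The paper proposes Discrete-JEPA, which extends the latent predictive coding framework with semantic tokenization and new objectives to address the limitations of current image tokenization methods on symbolic abstraction and logical reasoning tasks, significantly improving visual symbolic prediction performance.

Details

Motivation: Current image tokenization methods perform poorly on tasks that require symbolic abstraction and systematic reasoning, which limits progress in cognitive intelligence.

Result: The method significantly outperforms baselines on visual symbolic prediction tasks, and the learned semantic token space exhibits systematic patterns.

Insight: The spontaneous emergence of systematic patterns in the semantic token space may offer a new research direction for symbolic world modeling and planning.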

Abstract: The cornerstone of cognitive intelligence lies in extracting hidden patterns from observations and leveraging these principles to systematically predict future outcomes. However, current image tokenization methods demonstrate significant limitations in tasks requiring symbolic abstraction and logical reasoning capabilities essential for systematic inference. To address this challenge, we propose Discrete-JEPA, extending the latent predictive coding framework with semantic tokenization and novel complementary objectives to create robust tokenization for symbolic reasoning tasks. Discrete-JEPA dramatically outperforms baselines on visual symbolic prediction tasks, while striking visual evidence reveals the spontaneous emergence of deliberate systematic patterns within the learned semantic token space. Though an initial model, our approach promises a significant impact for advancing symbolic world modeling and planning capabilities in artificial intelligence systems.

[32] DepthSeg: Depth prompting in remote sensing semantic segmentation cs.CV | cs.AIPDF

Ning Zhou, Shanxiong Chen, Mingting Zhou, Haigang Sui, Lieyun Hu

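TL;DR: DepthSeg is a framework that uses depth prompting to improve semantic segmentation of remote sensing images. By incorporating height information it addresses spectral confusion and shadow occlusion, significantly improving land cover classification accuracy.

Details

Motivation: Existing remote sensing semantic segmentation methods rely mainly on spectral features and ignore height differences between targets, which causes misclassification in complex scenes. DepthSeg is proposed to address this problem.

Result: Experiments on the LiuZhou dataset validate DepthSeg's advantages, and ablation studies confirm the importance of the depth prompts.

Insight: Height information is crucial for remote sensing semantic segmentation, and depth prompts can effectively mitigate the effects of spectral confusion and shadow occlusion.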

Abstract: Remote sensing semantic segmentation is crucial for extracting detailed land surface information, enabling applications such as environmental monitoring, land use planning, and resource assessment. In recent years, advancements in artificial intelligence have spurred the development of automatic remote sensing semantic segmentation methods. However, the existing semantic segmentation methods focus on distinguishing spectral characteristics of different objects while ignoring the differences in the elevation of the different targets. This results in land cover misclassification in complex scenarios involving shadow occlusion and spectral confusion. In this paper, we introduce a depth prompting two-dimensional (2D) remote sensing semantic segmentation framework (DepthSeg). It automatically models depth/height information from 2D remote sensing images and integrates it into the semantic segmentation framework to mitigate the effects of spectral confusion and shadow occlusion. During the feature extraction phase of DepthSeg, we introduce a lightweight adapter to enable cost-effective fine-tuning of the large-parameter vision transformer encoder pre-trained on natural images. In the depth prompting phase, we propose a depth prompter to model depth/height features explicitly. In the semantic prediction phase, we introduce a semantic classification decoder that couples the depth prompts with high-dimensional land-cover features, enabling accurate extraction of land-cover types. Experiments on the LiuZhou dataset validate the advantages of the DepthSeg framework in land cover mapping tasks. Detailed ablation studies further highlight the significance of the depth prompts in remote sensing semantic segmentation.

[33] GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image Fusion cs.CVPDF

Huan Kang, Hui Li, Xiao-Jun Wu, Tianyang Xu, Rui Wang

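TL;DR: GrFormer is a novel Transformer based on the Grassmann manifold for infrared and visible image fusion. It decouples high-frequency details from low-frequency semantics for multi-scale semantic fusion and, combined with a cross-modal fusion strategy (CMS), significantly improves fusion performance.

Details

Motivation: Existing methods struggle to capture the intrinsic topological structure of images in non-Euclidean space, and the Euclidean inner product measures algebraic rather than semantic similarity, which leads to poor attention outputs. GrFormer is proposed to address this problem.

Result: Experiments show that GrFormer outperforms existing methods both qualitatively and quantitatively on multiple image fusion benchmarks.

Insight: Grassmann manifold methods can better capture image features in non-Euclidean space and achieve efficient fusion by decoupling multi-scale semantics.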

Abstract: In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. However, in practice, the source images are often located in a non-Euclidean space, where the Euclidean methods usually cannot encapsulate the intrinsic topological structure. Typically, the inner product performed in the Euclidean space calculates the algebraic similarity rather than the semantic similarity, which results in undesired attention output and a decrease in fusion performance. Moreover, the balance of low-level details and high-level semantics should be considered in the infrared and visible image fusion task. To address this issue, in this paper, we propose a novel attention mechanism based on the Grassmann manifold for infrared and visible image fusion (GrFormer). Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask to maximise the complementary properties between different modalities and to suppress the features with high correlation, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The codes are available at https://github.com/Shaoyun2023.

[34] Causally Steered Diffusion for Automated Video Counterfactual Generation cs.CV | cs.AIPDF

Nikos Spyrou, Athanasios Vlontzos, Paraskevas Pegios, Thomas Melistas, Nefeli Gkouti

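TL;DR: The paper proposes a causality-based framework for video counterfactual generation that uses prompt optimization and vision-language model guidance to keep generated content consistent with causal logic.

Details

Motivation: Existing text-to-image diffusion models struggle to preserve causal relationships in video editing, which can produce unrealistic results.

Result: Experiments show that the method can generate causally faithful video counterfactuals, validated with quality and causal criteria.

Insight: Controlling the causal relationships of generated content through prompt optimization is compatible with any black-box video editing system and has broad application potential.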

Abstract: Adapting text-to-image (T2I) latent diffusion models for video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining causal relationships in video content. Edits affecting causally dependent attributes risk generating unrealistic or misleading outcomes if these relationships are ignored. In this work, we propose a causally faithful framework for counterfactual video generation, guided by a vision-language model (VLM). Our method is agnostic to the underlying video editing system and does not require access to its internal mechanisms or finetuning. Instead, we guide the generation by optimizing text prompts based on an assumed causal graph, addressing the challenge of latent space control in LDMs. We evaluate our approach using standard video quality metrics and counterfactual-specific criteria, such as causal effectiveness and minimality. Our results demonstrate that causally faithful video counterfactuals can be effectively generated within the learned distribution of LDMs through prompt-based causal steering. With its compatibility with any black-box video editing system, our method holds significant potential for generating realistic “what-if” video scenarios in diverse areas such as healthcare and digital media.

[35] Compositional Attribute Imbalance in Vision Datasets cs.CV | cs.AIPDF

Jiayi Chen, Yanbiao Ma, Andi Zhang, Weidong Tang, Wei Dai

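TL;DR: The paper studies visual attribute imbalance, proposes a CLIP-based framework to build a visual attribute dictionary, and effectively mitigates the impact of attribute imbalance on model performance by adjusting sample sampling probabilities and applying data augmentation.

Details

Motivation: Visual attribute imbalance is a common yet underexplored problem in image classification that significantly affects model performance and generalization. The paper analyzes attribute imbalance and proposes a solution to improve model robustness and fairness.

Result: Experiments on benchmark datasets show that the method significantly mitigates attribute imbalance and improves model robustness and fairness.

Insight: Modeling visual attribute distributions is key to long-tail classification tasks, and the dynamic adjustment strategy based on compositional attribute rarity offers a scalable direction for future research.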

Abstract: Visual attribute imbalance is a common yet underexplored issue in image classification, significantly impacting model performance and generalization. In this work, we first define the first-level and second-level attributes of images and then introduce a CLIP-based framework to construct a visual attribute dictionary, enabling automatic evaluation of image attributes. By systematically analyzing both single-attribute imbalance and compositional attribute imbalance, we reveal how the rarity of attributes affects model performance. To tackle these challenges, we propose adjusting the sampling probability of samples based on the rarity of their compositional attributes. This strategy is further integrated with various data augmentation techniques (such as CutMix, Fmix, and SaliencyMix) to enhance the model’s ability to represent rare attributes. Extensive experiments on benchmark datasets demonstrate that our method effectively mitigates attribute imbalance, thereby improving the robustness and fairness of deep neural networks. Our research highlights the importance of modeling visual attribute distributions and provides a scalable solution for long-tail image classification tasks.
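
The rarity-based sampling idea can be sketched in a few lines. The abstract only says that sampling probability is adjusted by the rarity of each sample's compositional attributes; the inverse-frequency weighting, the `power` knob, and the toy attribute tuples below are assumptions made for illustration, not the paper's formula.

```python
import random
from collections import Counter

def rarity_sampling_probs(attribute_combos, power=1.0):
    """Give each sample a probability proportional to (1 / combo count) ** power.

    attribute_combos holds one hashable attribute tuple per sample.
    power=0 recovers uniform sampling; larger values favor rare combos more.
    """
    counts = Counter(attribute_combos)
    weights = [(1.0 / counts[combo]) ** power for combo in attribute_combos]
    total = sum(weights)
    return [w / total for w in weights]

def draw_batch(samples, probs, batch_size, seed=0):
    """Draw a batch with replacement according to the given probabilities."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=probs, k=batch_size)

# Toy dataset: three samples share a common combo, one sample has a rare combo.
combos = [("red", "round")] * 3 + [("blue", "square")]
probs = rarity_sampling_probs(combos)
batch = draw_batch(list(range(len(combos))), probs, batch_size=8)
```

With `power=1.0` every attribute combination receives the same total probability mass, so the single rare sample is drawn about as often as the three common ones combined. Mixing augmentations such as CutMix would then be applied to the drawn batch.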

[36] Toward Rich Video Human-Motion2D Generation cs.CVPDF

Ruihao Xi, Xuekuan Wang, Yongcheng Li, Shuhua Li, Zichen Wang

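TL;DR: The paper introduces a large-scale rich video human motion 2D dataset (Motion2D-Video-150K) and a diffusion-based generation model, RVHM2D, which uses an enhanced textual conditioning mechanism and a two-stage training strategy to generate high-quality single-character and interactive double-character motions.

Details

Motivation: Because of data scarcity and the complexity of modeling inter-personal interaction, generating realistic and controllable multi-character interactive motion remains a challenge.

Result: RVHM2D performs strongly on the Motion2D-Video-150K dataset and can generate high-quality single-character and interactive double-character motions.

Insight: A textual conditioning mechanism that combines global and local features, together with reinforcement learning fine-tuning, effectively improves the realism and controllability of motion generation.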

Abstract: Generating realistic and controllable human motions, particularly those involving rich multi-character interactions, remains a significant challenge due to data scarcity and the complexities of modeling inter-personal dynamics. To address these limitations, we first introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) comprising 150,000 video sequences. Motion2D-Video-150K features a balanced distribution of diverse single-character and, crucially, double-character interactive actions, each paired with detailed textual descriptions. Building upon this dataset, we propose a novel diffusion-based rich video human motion2D generation (RVHM2D) model. RVHM2D incorporates an enhanced textual conditioning mechanism utilizing either dual text encoders (CLIP-L/B) or T5-XXL with both global and local features. We devise a two-stage training strategy: the model is first trained with a standard diffusion objective, and then fine-tuned using reinforcement learning with an FID-based reward to further enhance motion realism and text alignment. Extensive experiments demonstrate that RVHM2D achieves leading performance on the Motion2D-Video-150K benchmark in generating both single and interactive double-character scenarios.

[37] MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models cs.CV | cs.LG

Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai

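TL;DR: Proposes MoTE, a Mixture-of-Ternary-Experts approach for training memory-efficient large multimodal models. Combining low-precision routed experts with a shared expert reduces the memory footprint while maintaining performance.

Details

Motivation: Existing large multimodal Mixture-of-Experts (MoE) models perform well but have a high memory footprint, which makes them hard to deploy on edge devices. The authors propose low-precision experts to improve memory efficiency.

Result: With the same 3.4GB expert memory footprint and combined with post-training quantization, MoTE outperforms the full-precision MoE-LLaVA by 4.3% in average accuracy on end tasks, and it shows a promising scaling trend.

Insight: Low-precision experts can reduce memory while maintaining performance, and combining them with post-training quantization further improves efficiency, which suits deployment on memory-constrained devices.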

Abstract: Large multimodal Mixture-of-Experts (MoEs) effectively scale the model size to boost performance while maintaining fixed active parameters. However, previous works primarily utilized full-precision experts during sparse up-cycling. Although they show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach has a promising scaling trend along model size. MoTE achieves comparable performance to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and its advantage grows as the memory constraint tightens. Given the same expert memory footprint of 3.4GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
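
To make "parameters in {-1, 0, 1}" concrete, here is a minimal sketch of ternary weight quantization. The absmean scaling rule and the function name `ternarize` are assumptions; the abstract does not say which rounding rule MoTE uses or how it is applied during training.

```python
import numpy as np

def ternarize(weights, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, 1} with one per-tensor scale.

    Assumption: scale by the mean absolute value, then round and clip.
    """
    scale = np.mean(np.abs(weights)) + eps
    ternary = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return ternary, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

t, s = ternarize(w)
w_hat = t * s  # dequantized approximation of the original weights
```

Each ternary entry needs fewer than two bits of storage, which is where the memory saving over full-precision experts comes from.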

[38] Model compression using knowledge distillation with integrated gradients cs.CV | cs.AI | cs.LG

David E. Hernandez, Jose Chang, Torbjörn E. M. Nordling

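TL;DR: The paper proposes a model compression method that enhances knowledge distillation with integrated gradients (IG). Overlaying IG maps onto input images gives the student deeper insight into the teacher's decisions and significantly improves compression results and inference speed.

Details

Motivation: Model compression is critical for deploying deep learning models on resource-constrained devices, yet conventional methods struggle to preserve accuracy after compression.

Result: On CIFAR-10 the method reaches 92.6% test accuracy at a 4.1x compression factor, with inference time dropping from 140 ms to 13 ms, and it significantly outperforms the non-distilled model.

Insight: IG maps not only explain model decisions but can also serve as a data augmentation tool that strengthens knowledge distillation, and precomputing them turns the runtime cost into a one-time preprocessing step.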

Abstract: Model compression is critical for deploying deep learning models on resource-constrained devices. We introduce a novel method enhancing knowledge distillation with integrated gradients (IG) as a data augmentation strategy. Our approach overlays IG maps onto input images during training, providing student models with deeper insights into teacher models’ decision-making processes. Extensive evaluation on CIFAR-10 demonstrates that our IG-augmented knowledge distillation achieves 92.6% testing accuracy with a 4.1x compression factor, a significant 1.1 percentage point improvement ($p<0.001$) over non-distilled models (91.5%). This compression reduces inference time from 140 ms to 13 ms. Our method precomputes IG maps before training, transforming substantial runtime costs into a one-time preprocessing step. Our comprehensive experiments include: (1) comparisons with attention transfer, revealing complementary benefits when combined with our approach; (2) Monte Carlo simulations confirming statistical robustness; (3) systematic evaluation of compression factor versus accuracy trade-offs across a wide range (2.2x-1122x); and (4) validation on an ImageNet subset aligned with CIFAR-10 classes, demonstrating generalisability beyond the initial dataset. These extensive ablation studies confirm that IG-based knowledge distillation consistently outperforms conventional approaches across varied architectures and compression ratios. Our results establish this framework as a viable compression technique for real-world deployment on edge devices while maintaining competitive accuracy.
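
As a rough illustration of what an IG map is, the sketch below computes integrated gradients for a toy logistic model with a hand-written gradient. The toy model, the zero baseline, and the midpoint Riemann sum are assumptions chosen to keep the example self-contained; the paper applies IG to image classifiers and overlays the maps on inputs, which is not shown here.

```python
import numpy as np

def model(x, w):
    """Toy scalar model: sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def model_grad(x, w):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight path."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([model_grad(baseline + a * (x - baseline), w) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

w = np.array([0.8, -1.2, 0.5])
x = np.array([1.0, 0.5, -2.0])
baseline = np.zeros_like(x)

attributions = integrated_gradients(x, baseline, w)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(attributions.sum() - (model(x, w) - model(baseline, w)))
```

The completeness property makes a convenient sanity check when precomputing IG maps as a one-time preprocessing step.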

[39] Adapting Lightweight Vision Language Models for Radiological Visual Question Answering cs.CV | cs.AI

Aditya Shourya, Michel Dumontier, Chang Sun

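TL;DR: This paper adapts a lightweight vision-language model to radiological visual question answering (VQA) using a carefully tuned training pipeline and synthetic data generation, achieving promising performance despite its small parameter count and limited training data.

Details

Motivation: Radiological VQA faces scarce expert-labeled data, complex image patterns, and a lack of evaluation tools. The paper aims to build a lightweight model that addresses these problems through efficient data use and training.

Result: Although far smaller than state-of-the-art models such as LLaVA-Med, the model performs well on both open- and closed-ended questions. The saliency analysis tool successfully identifies the model's failure modes.

Insight: With efficient data use and targeted training, lightweight models can achieve strong performance under limited resources, and the saliency analysis tool offers practical support for model debugging.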

Abstract: Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) models. However, some challenges remain across each stage of model development: the limited supply of expert-labeled images hinders data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline from synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological domain-targeted datasets (e.g., ROCO v2.0, MedPix v2.0). Our results show that despite operating at a fraction of the scale of state-of-the-art models such as LLaVA-Med, our model achieves promising performance given its small parameter size and the limited scale of training data. We introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes through saliency analysis.

[40] Dense360: Dense Understanding from Omnidirectional Panoramas cs.CV

Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li

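TL;DR: The paper proposes an approach to dense visual understanding from omnidirectional panoramas and introduces a panorama dataset and the first benchmark for this setting.

Details

Motivation: The visual inputs of existing multimodal large language models (MLLMs) are limited to a narrow field of view (e.g., 70 degrees), which prevents a complete understanding of the physical world. Panoramas offer a more complete, compact, and continuous representation, but spatial continuity and varying information density must be handled.

Result: The authors build a panorama dataset with 5M dense annotations and use Dense360-Bench to evaluate panoramic vision-language understanding.

Insight: Panoramas open new possibilities for dense visual understanding, but model design must be adapted to their properties, such as the equirectangular projection (ERP).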

Abstract: Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
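
The two ERP challenges named in this abstract can be illustrated numerically. This is not ERP-RoPE itself, whose details the abstract does not give; the sketch only shows the underlying geometry, and the function names are hypothetical.

```python
import numpy as np

def erp_row_weights(height):
    """cos(latitude) for each pixel row of an equirectangular image.

    Rows near the poles cover far less of the sphere than rows near the
    equator, which is the latitude-dependent information density.
    """
    # Row centers mapped to latitude from +pi/2 (top) down to -pi/2 (bottom).
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    return np.cos(lat)

def wrapped_column_distance(u1, u2, width):
    """Horizontal pixel distance that respects the 360-degree wrap-around."""
    d = abs(u1 - u2) % width
    return min(d, width - d)

weights = erp_row_weights(512)

# Columns 2 and 1022 look far apart in a 1024-wide image but are neighbors
# on the sphere, because the left and right image edges meet.
dist = wrapped_column_distance(2, 1022, width=1024)
```

A position encoding that ignores the wrap-around would treat those two columns as 1020 pixels apart instead of 4.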

[41] I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs cs.CV

Yu Qi, Lipeng Gu, Honghua Chen, Liangliang Nan, Mingqiang Wei

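TL;DR: The paper proposes SpeechRefer, a new framework for 3D visual grounding that handles noisy and ambiguous speech inputs, improving performance through a Speech Complementary Module and a Contrastive Complementary Module.

Details

Motivation: Existing 3D visual grounding methods depend on precise text prompts, whereas real-world speech inputs often contain transcription errors caused by accents, background noise, and similar factors, which limits the applicability of these methods.

Result: Experiments on the SpeechRefer and SpeechNr3D datasets show that SpeechRefer substantially improves the performance of existing 3DVG methods.

Insight: The speech signal can serve as important complementary information for 3D visual grounding, and multimodal alignment improves robustness in noisy conditions.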

Abstract: Existing 3D visual grounding methods rely on precise text prompts to locate objects within 3D scenes. Speech, as a natural and intuitive modality, offers a promising alternative. Real-world speech inputs, however, often suffer from transcription errors due to accents, background noise, and varying speech rates, limiting the applicability of existing 3DVG methods. To address these challenges, we propose SpeechRefer, a novel 3DVG framework designed to enhance performance in the presence of noisy and ambiguous speech-to-text transcriptions. SpeechRefer integrates seamlessly with existing 3DVG models and introduces two key innovations. First, the Speech Complementary Module captures acoustic similarities between phonetically related words and highlights subtle distinctions, generating complementary proposal scores from the speech signal. This reduces dependence on potentially erroneous transcriptions. Second, the Contrastive Complementary Module employs contrastive learning to align erroneous text features with corresponding speech features, ensuring robust performance even when transcription errors dominate. Extensive experiments on the SpeechRefer and SpeechNr3D datasets demonstrate that SpeechRefer improves the performance of existing 3DVG methods by a large margin, which highlights SpeechRefer’s potential to bridge the gap between noisy speech inputs and reliable 3DVG, enabling more intuitive and practical multimodal systems.

[42] SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks cs.CV

Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin

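TL;DR: SIRI-Bench is a benchmark for evaluating the spatial intelligence and complex reasoning abilities of vision-language models (VLMs).

Details

Motivation: Large language models (LLMs) now excel at complex reasoning tasks such as mathematics and programming, but the complex reasoning ability of VLMs in spatial contexts has not been systematically evaluated.

Result: Experiments show that current state-of-the-art VLMs perform poorly on SIRI-Bench, underscoring the difficulty of spatial reasoning.

Insight: Spatial reasoning is a key capability for VLMs in real-world interaction, and SIRI-Bench offers a new direction for studying progress in visual problem-solving.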

Abstract: Large Language Models (LLMs) are experiencing rapid advancements in complex reasoning, exhibiting remarkable generalization in mathematics and programming. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic evaluation of their complex reasoning ability within spatial contexts remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs’ spatial intelligence through video-based reasoning tasks. SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video. By carefully designing questions and corresponding 3D scenes, our benchmark ensures that solving the questions requires both spatial comprehension for extracting information and high-level reasoning for deriving solutions, making it a challenging benchmark for evaluating VLMs. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine. This engine, leveraging multiple specialized LLM agents, can generate realistic 3D scenes from abstract math problems, ensuring faithfulness to the original descriptions. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of spatial reasoning. We hope that our study will bring researchers’ attention to spatially grounded reasoning and advance VLMs in visual problem-solving.

[43] VisLanding: Monocular 3D Perception for UAV Safe Landing via Depth-Normal Synergy cs.CV | cs.RO

Zhuoyue Tan, Boyong He, Yuxiang Ji, Liaoni Wu

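TL;DR: VisLanding proposes a monocular 3D perception framework for safe UAV landing. Its depth-normal joint optimization mechanism significantly improves the accuracy of safe zone identification.

Details

Motivation: The core challenge of autonomous safe UAV landing in complex, unknown environments is accurate perception of 3D information, and existing methods fall short in generalization and robustness.

Result: Experiments show that VisLanding has superior generalization and robustness in cross-domain testing, and that it can estimate landing zone area from depth and normal information.

Insight: Jointly optimizing depth and normal information is an effective route to better monocular 3D perception, with a zero-shot generalization advantage in unknown environments.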

Abstract: This paper presents VisLanding, a monocular 3D perception-based framework for safe UAV (Unmanned Aerial Vehicle) landing. Addressing the core challenge of autonomous UAV landing in complex and unknown environments, this study innovatively leverages the depth-normal synergy prediction capabilities of the Metric3D V2 model to construct an end-to-end safe landing zone (SLZ) estimation framework. By introducing a safe zone segmentation branch, we transform the landing zone estimation task into a binary semantic segmentation problem. The model is fine-tuned and annotated using the WildUAV dataset from a UAV perspective, while a cross-domain evaluation dataset is constructed to validate the model’s robustness. Experimental results demonstrate that VisLanding significantly enhances the accuracy of safe zone identification through a depth-normal joint optimization mechanism, while retaining the zero-shot generalization advantages of Metric3D V2. The proposed method exhibits superior generalization and robustness in cross-domain testing compared to other approaches. Furthermore, it enables the estimation of landing zone area by integrating predicted depth and normal information, providing critical decision-making support for practical applications.
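
One way landing zone area could be estimated from depth and normals is sketched below. The pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, and the per-pixel area formula are assumptions for illustration; the abstract does not describe the authors' actual computation.

```python
import numpy as np

def landing_zone_area(depth, normals, mask, fx, fy, cx, cy):
    """Sum the surface area (in squared depth units) covered by masked pixels.

    Assumptions: pinhole camera, depth measured along the optical axis (z),
    unit surface normals expressed in camera coordinates.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    # Unit viewing ray through each pixel.
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones((h, w))], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    cos_phi = rays[..., 2]  # angle between the ray and the optical axis
    # Foreshortening: a tilted surface covers more area per pixel.
    cos_incidence = np.clip(np.abs(np.sum(normals * rays, axis=-1)), 1e-3, None)
    pixel_area = depth ** 2 * cos_phi / (fx * fy * cos_incidence)
    return float(pixel_area[mask].sum())

# Toy scene: flat ground facing the camera at 10 m, seen by a 100x100 sensor.
h, w = 100, 100
depth = np.full((h, w), 10.0)
normals = np.zeros((h, w, 3))
normals[..., 2] = -1.0  # surface normal points back toward the camera
mask = np.ones((h, w), dtype=bool)  # every pixel is labeled "safe"

area = landing_zone_area(depth, normals, mask, fx=100.0, fy=100.0, cx=49.5, cy=49.5)
```

In this toy case each pixel covers 0.1 m by 0.1 m on the ground, so the full frame corresponds to a 10 m by 10 m patch.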

[44] PoseGRAF: Geometric-Reinforced Adaptive Fusion for Monocular 3D Human Pose Estimation cs.CV | cs.AI

Ming Xu, Xu Zhang

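TL;DR: PoseGRAF is a new framework for monocular 3D human pose estimation. It processes joint and bone graphs separately with a dual graph convolutional structure and adds a Cross-Attention module and a dynamic fusion module, significantly improving the accuracy and robustness of pose estimation.

Details

Motivation: Existing methods rely mainly on joint positional features and overlook the directional and angular correlations inherent in the skeleton, which produces implausible poses under joint occlusion or rapid motion.

Result: The method outperforms existing approaches on the Human3.6M and MPI-INF-3DHP datasets, and its generalization is validated on in-the-wild videos.

Insight: Explicitly modeling bone direction and angle features, combined with a dynamic fusion mechanism, effectively improves the robustness and accuracy of pose estimation.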

Abstract: Existing monocular 3D pose estimation methods primarily rely on joint positional features, while overlooking intrinsic directional and angular correlations within the skeleton. As a result, they often produce implausible poses under joint occlusions or rapid motion changes. To address these challenges, we propose the PoseGRAF framework. We first construct a dual graph convolutional structure that separately processes joint and bone graphs, effectively capturing their local dependencies. A Cross-Attention module is then introduced to model interdependencies between bone directions and joint features. Building upon this, a dynamic fusion module is designed to adaptively integrate both feature types by leveraging the relational dependencies between joints and bones. An improved Transformer encoder is further incorporated in a residual manner to generate the final output. Experimental results on the Human3.6M and MPI-INF-3DHP datasets show that our method exceeds state-of-the-art approaches. Additional evaluations on in-the-wild videos further validate its generalizability. The code is publicly available at https://github.com/iCityLab/PoseGRAF.
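
The "directional and angular correlations within the skeleton" can be made concrete with a minimal sketch that derives bone directions and an inter-bone angle from joint positions. The five-joint skeleton and the `PARENTS` table are hypothetical; how PoseGRAF builds its bone graph is not specified in the abstract.

```python
import numpy as np

# Hypothetical 5-joint skeleton: PARENTS[i] is the parent of joint i (-1 = root).
PARENTS = [-1, 0, 1, 0, 3]

def bone_directions(joints, parents=PARENTS, eps=1e-8):
    """Unit vectors from each parent joint to its child (one per non-root joint)."""
    child = [i for i, p in enumerate(parents) if p >= 0]
    parent = [parents[i] for i in child]
    bones = joints[child] - joints[parent]
    return bones / (np.linalg.norm(bones, axis=-1, keepdims=True) + eps)

def bone_angle(d1, d2):
    """Angle in radians between two unit bone directions."""
    return float(np.arccos(np.clip(np.dot(d1, d2), -1.0, 1.0)))

joints = np.array(
    [[0, 0, 0], [0, 1, 0], [0, 2, 0], [1, 0, 0], [2, 0, 0]], dtype=float
)
dirs = bone_directions(joints)

# Angle between the first vertical bone and the first horizontal bone.
angle = bone_angle(dirs[0], dirs[2])
```

Features like these stay informative when a joint position is occluded, since neighboring bones still constrain plausible directions.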

[45] Align Your Flow: Scaling Continuous-Time Flow Map Distillation cs.CV | cs.LG

Amirmojtaba Sabour, Sanja Fidler, Karsten Kreis

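TL;DR: This paper presents Align Your Flow, which trains flow maps with continuous-time objectives. Unlike consistency models, whose performance degrades as the number of sampling steps grows, flow maps stay effective across step counts, and the method achieves state-of-the-art few-step results in image and text-to-image generation.

Details

Motivation: Diffusion and flow models generate high-quality samples but need many sampling steps, which is inefficient. Consistency models can be distilled into one-step generators, but their performance degrades as the number of steps increases. This paper aims to solve that problem.

Result: The method achieves state-of-the-art few-step generation on ImageNet 64x64 and 512x512, and its text-to-image flow map models outperform existing non-adversarially trained few-step samplers.

Insight: Flow maps are promising for few-step generation, and continuous-time objectives together with additional training techniques further improve model performance.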

Abstract: Diffusion- and flow-based models have emerged as state-of-the-art generative modeling approaches, but they require many sampling steps. Consistency models can distill these models into efficient one-step generators; however, unlike flow- and diffusion-based methods, their performance inevitably degrades when increasing the number of steps, which we show both analytically and empirically. Flow maps generalize these approaches by connecting any two noise levels in a single step and remain effective across all step counts. In this paper, we introduce two new continuous-time objectives for training flow maps, along with additional novel training techniques, generalizing existing consistency and flow matching objectives. We further demonstrate that autoguidance can improve performance, using a low-quality model for guidance during distillation, and an additional boost can be achieved by adversarial finetuning, with minimal loss in sample diversity. We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64x64 and 512x512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.
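
The phrase "connecting any two noise levels in a single step" can be illustrated with a toy case where the flow map has a closed form. This sketch assumes a 1-D Gaussian data distribution, the linear interpolation x_t = (1 - t) x_0 + t ε with t = 1 as pure noise, and the resulting exact flow map; none of this is the paper's trained model, which uses neural networks.

```python
import numpy as np

SIGMA_DATA = 2.0  # assumed std of the toy 1-D Gaussian "data" distribution

def marginal_var(t):
    """Variance of x_t = (1 - t) * x_0 + t * eps for Gaussian x_0 and eps."""
    return (1.0 - t) ** 2 * SIGMA_DATA ** 2 + t ** 2

def flow_map(x_t, t, s):
    """Exact jump from noise level t to level s along the probability flow ODE.

    For Gaussian marginals the ODE solution is a pure rescaling.
    """
    return x_t * np.sqrt(marginal_var(s) / marginal_var(t))

def sample(noise, times):
    """Apply the flow map along a schedule of decreasing times."""
    x = noise
    for t, s in zip(times[:-1], times[1:]):
        x = flow_map(x, t, s)
    return x

rng = np.random.default_rng(0)
noise = rng.normal(size=10_000)

one_step = sample(noise, [1.0, 0.0])
four_step = sample(noise, [1.0, 0.75, 0.5, 0.25, 0.0])
```

Because an exact flow map composes consistently, the one-step and four-step samples coincide here; a learned flow map aims to approximate this behavior at every step count.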

[46] Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching cs.CV | cs.LG | eess.IV

Giacomo Meanti, Thomas Ryckeboer, Michael Arbel, Julien Mairal

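TL;DR: The paper proposes an unsupervised method based on diffusion distribution matching for imaging inverse problems (such as deblurring and non-uniform point spread function calibration) that needs only a small unpaired dataset and no known forward model.

Details

Motivation: Traditional methods usually require full knowledge of the forward model or paired degraded and ground-truth images, conditions that are hard to meet in practice. This work relaxes those assumptions to suit real-world scenarios.

Result: The method outperforms single-image blind and unsupervised approaches on deblurring and non-uniform PSF calibration, and it matches state-of-the-art performance on blind super-resolution.

Insight: The method works well in real-world settings such as lens calibration, greatly reduces data acquisition requirements, and offers a new approach to unsupervised inverse problems.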

Abstract: This work addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches – which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images – the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.

[47] VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning cs.CV | cs.CL

Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Muhammad Ziaur Rahman, Shahanur Rahman Bappy

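TL;DR: This paper introduces VisText-Mosquito, a multimodal dataset and benchmark for AI-based mosquito breeding site detection and reasoning. The dataset combines visual and textual data and supports object detection, segmentation, and natural language reasoning tasks.

Details

Motivation: Mosquito-borne diseases pose a major global health threat, requiring early detection and proactive control of breeding sites to prevent outbreaks. Existing datasets usually cover visual data only and lack multimodal (visual plus textual) support.

Result: For object detection, YOLOv9s achieves the highest precision (0.92926) and mAP@50 (0.92891). For segmentation, YOLOv11n-Seg reaches a precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, the BLIP model scores BLEU 54.7, BERTScore 0.91, and ROUGE-L 0.87.

Insight: Multimodal data (visual plus textual) makes mosquito breeding site analysis more comprehensive and interpretable, underscoring the theme "Prevention is Better than Cure". The publicly released dataset and code are a valuable resource for future research.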

Abstract: Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, our fine-tuned BLIP model achieves a final loss of 0.0028, with a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87. This dataset and model framework emphasize the theme “Prevention is Better than Cure”, showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito

[48] 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting cs.CV

Yuke Xing, Jiarui Wang, Peizhi Niu, Wenjie Huang, Guangtao Zhai

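TL;DR: 3DGS-IEval-15K is a large-scale image quality assessment dataset for 3D Gaussian Splatting (3DGS) containing 15,200 images, intended to fill the gap in evaluating the perceptual quality of compressed 3DGS methods.

Details

Motivation: 3DGS excels at real-time rendering, but its high storage requirements limit practical use. Existing work lacks a systematic evaluation of perceptual quality after compression, so a comprehensive evaluation framework is needed.

Result: Dataset quality is validated through scene diversity and MOS distribution analysis, and a benchmark of 30 IQA metrics is established.

Insight: 3DGS compression introduces distinctive view-dependent distortions, and the dataset provides a basis for developing specialized IQA metrics and studying these distortion patterns.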

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K.

[49] DDS-NAS: Dynamic Data Selection within Neural Architecture Search via On-line Hard Example Mining applied to Image Classification cs.CV

Matt Poyser, Toby P. Breckon

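TL;DR: The paper proposes DDS-NAS, which speeds up neural architecture search (NAS) training through dynamic data selection and on-line hard example mining. An autoencoder and a kd-tree enable efficient sample ordering, and curriculum learning dynamically re-forms the training subset. Experiments show that DDS-NAS speeds up gradient-based NAS strategies by up to 27x without loss in performance.

Details

Motivation: NAS training is expensive, especially when architectures must be evaluated repeatedly on large datasets. The goal is to cut NAS training time and compute by dynamically selecting and refining the training data.

Result: DDS-NAS speeds up gradient-based NAS training by up to 27x while maintaining model performance.

Insight: Dynamic data selection and hard example mining substantially reduce NAS training time, and a curriculum learning strategy uses data samples more efficiently and speeds up convergence.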

Abstract: In order to address the scalability challenge within Neural Architecture Search (NAS), we speed up NAS training via dynamic hard example mining within a curriculum learning framework. By utilizing an autoencoder that enforces an image similarity embedding in latent space, we construct an efficient kd-tree structure to order images by furthest neighbour dissimilarity in a low-dimensional embedding. From a given query image from our subsample dataset, we can identify the most dissimilar image within the global dataset in logarithmic time. Via curriculum learning, we then dynamically re-formulate an unbiased subsample dataset for NAS optimisation, upon which the current NAS solution architecture performs poorly. We show that our DDS-NAS framework speeds up gradient-based NAS strategies by up to 27x without loss in performance. By maximising the contribution of each image sample during training, we reduce the duration of a NAS training cycle and the number of iterations required for convergence.
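
The two selection steps described in this abstract can be sketched in simplified form. This version uses a brute-force linear scan instead of the kd-tree the authors describe (so it is O(N) per query rather than logarithmic), and the function names and toy data are hypothetical.

```python
import numpy as np

def furthest_neighbour(query, embeddings):
    """Index of the embedding most dissimilar to the query (Euclidean distance).

    Brute-force stand-in for the paper's kd-tree lookup.
    """
    distances = np.linalg.norm(embeddings - query, axis=1)
    return int(np.argmax(distances))

def hardest_examples(losses, k):
    """Indices of the k samples with the highest loss, hardest first."""
    return np.argsort(losses)[::-1][:k]

# Toy low-dimensional embeddings, e.g. from an autoencoder's latent space.
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [-1.0, -1.0]])
most_dissimilar = furthest_neighbour(np.array([0.0, 0.0]), embeddings)

# Toy per-sample losses under the current candidate architecture.
losses = np.array([0.1, 2.0, 0.5, 1.5])
hard_idx = hardest_examples(losses, k=2)
```

Swapping easy samples in the subset for dissimilar or high-loss ones is the general idea behind re-forming the training subset during the search.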

[50] Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models cs.CV

Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, Jiaheng Wei

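TL;DR: The paper proposes GLOBE, a reasoning-based approach to image geo-localization. By building the diverse MP16-Reason dataset and using task-specific rewards, it significantly improves geo-localization accuracy and interpretability.

Details

Motivation: Traditional geo-localization methods lack interpretability, while existing reasoning-focused datasets and methods fall short in visual diversity and reasoning ability.

Result: GLOBE outperforms existing open-source large vision-language models on geo-localization in diverse visual scenes and produces more insightful reasoning trajectories.

Insight: Reframing geo-localization as a reasoning-driven problem, combined with diverse data and task-specific rewards, substantially improves model performance and interpretability.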

Abstract: Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Locatability assessment and Optimized visual-clue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance locatability assessment, visual clue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories.
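
The "group-relative" part of GLOBE's name refers to scoring each sampled response against the other responses for the same input. A minimal sketch of that normalization follows; the equal-weight sum of the three reward terms and all numeric scores are hypothetical, since the abstract does not say how the rewards are combined.

```python
import numpy as np

def combined_reward(locatability, reasoning, geo_accuracy, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three per-response scores (the weighting is hypothetical)."""
    w_loc, w_reason, w_geo = weights
    return w_loc * locatability + w_reason * reasoning + w_geo * geo_accuracy

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same image."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled responses for one image, each scored on the three criteria.
group_rewards = [
    combined_reward(0.2, 0.1, 0.0),
    combined_reward(0.9, 0.8, 1.0),
    combined_reward(0.5, 0.4, 0.0),
    combined_reward(0.6, 0.7, 1.0),
]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above the group mean get positive advantages and are reinforced; no separate value network is needed for the baseline.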

[51] FocalClick-XL: Towards Unified and High-quality Interactive Segmentation cs.CV

Xi Chen, Hengshuang Zhao

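TL;DR: FocalClick-XL proposes a unified interactive segmentation method built on multi-stage task decomposition and subnet pretraining. It supports multiple interaction forms (such as clicks, scribbles, and boxes) and achieves state-of-the-art performance and adaptability.

Details

Motivation: Existing interactive segmentation methods support only limited interaction forms and struggle to capture fine details. FocalClick-XL addresses both problems with a unified framework that supports multiple interaction forms and improves segmentation quality.

Result: The method achieves state-of-the-art performance on click-based benchmarks and adapts flexibly to other interaction forms. It can also generate fine-grained alpha mattes, which broadens its practical use.

Insight: Combining task decomposition with large-scale pretraining yields an efficient, general solution for interactive segmentation and highlights the value of sharing knowledge across interaction forms.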

Abstract: Interactive segmentation enables users to extract binary masks of target objects through simple interactions such as clicks, scribbles, and boxes. However, existing methods often support only limited interaction forms and struggle to capture fine details. In this paper, we revisit the classical coarse-to-fine design of FocalClick and introduce significant extensions. Inspired by its multi-stage strategy, we propose a novel pipeline, FocalClick-XL, to address these challenges simultaneously. Following the emerging trend of large-scale pretraining, we decompose interactive segmentation into meta-tasks that capture different levels of information – context, object, and detail – assigning a dedicated subnet to each level. This decomposition allows each subnet to undergo scaled pretraining with independent data and supervision, maximizing its effectiveness. To enhance flexibility, we share context- and detail-level information across different interaction forms as common knowledge while introducing a prompting layer at the object level to encode specific interaction types. As a result, FocalClick-XL achieves state-of-the-art performance on click-based benchmarks and demonstrates remarkable adaptability to diverse interaction formats, including boxes, scribbles, and coarse masks. Beyond binary mask generation, it is also capable of predicting alpha mattes with fine-grained details, making it a versatile and powerful tool for interactive segmentation.

[52] YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework cs.CV

Dahang Wan, Rongsheng Lu, Yang Fang, Xianli Lang, Shuangbao Shu

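TL;DR: Proposes YOLOv11-RGBT, a YOLOv11-based multispectral object detection framework that addresses the challenges existing methods face in cross-modal interaction, balancing performance against fusion strategy, and allocating modality weights.

Details

Motivation: Multispectral object detection integrates information from multiple bands to improve detection accuracy and environmental adaptability, but existing methods lack a unified single-stage framework and have trouble balancing performance and allocating modality weights.

Result: The framework performs well on datasets such as LLVIP and FLIR. On FLIR, the fine-tuning strategy improves the mAP of YOLOv11 models by 3.41%-5.65%, reaching a maximum of 47.61%.

Insight: The multispectral controllable fine-tuning strategy significantly improves model adaptability and robustness, validating the framework and its strategies.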

Abstract: Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweighting, there are still challenges like the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, such as LLVIP and FLIR. Particularly, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved YOLOv11 models’ mAP by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the framework and strategies’ effectiveness. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.

[53] SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting cs.CV

Ziqiao Peng, Wentao Hu, Junyuan Ma, Xiangyu Zhu, Xiaomei Zhang

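TL;DR: SyncTalk++ proposes a high-fidelity synchronized talking head synthesis method based on Gaussian Splatting. Its Dynamic Portrait Renderer, Face-Sync Controller, and Head-Sync Stabilizer significantly improve synchronization and rendering quality.

Details

Motivation: Synthesizing realistic, highly synchronized speech-driven talking head videos is challenging, and a lack of synchronization produces unnatural results. SyncTalk++ therefore targets the synchronization problem.

Result: SyncTalk++ renders at up to 101 FPS and outperforms existing methods in synchronization and realism.

Insight: Applying Gaussian Splatting to talking head synthesis improves rendering efficiency and visual consistency, while the coordinated multi-module design handles the complex synchronization problem.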

Abstract: Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the “devil” in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability. Additionally, SyncTalk++ enhances robustness to out-of-distribution (OOD) audio by incorporating an Expression Generator and a Torso Restorer, which generate speech-matched facial expressions and seamless torso regions. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Extensive experiments and user studies demonstrate that SyncTalk++ outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk++.

[54] Cost-Aware Routing for Efficient Text-To-Image Generation cs.CV | cs.LG

Qinchan Li, Kenneth Chen, Changyue Su

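TL;DR: The paper proposes a cost-aware routing framework that balances quality and computational cost in text-to-image generation by dynamically routing prompts of differing complexity to different generation models or numbers of denoising steps.

Details

Motivation: Diffusion models generate high-quality images, but their iterative denoising process is computationally expensive. Striking the best balance between quality and efficiency requires adjusting compute to each prompt's complexity.

Result: Experiments on COCO and DiffusionDB show that the approach delivers higher average quality than any single model used alone.

Insight: Dynamic routing can markedly improve the use of compute while preserving generation quality, offering a new approach to efficient text-to-image generation.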

Abstract: Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
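
The routing decision described in this abstract can be sketched as a per-prompt trade-off between predicted quality and cost. The scoring rule (quality minus a weighted cost), the `Candidate` class, and all numbers are assumptions for illustration; in the paper the quality predictor is learned, whereas here it is stubbed with fixed dictionaries.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost: float  # relative compute cost, in arbitrary units

def route(predicted_quality, candidates, cost_weight):
    """Pick the candidate maximizing predicted quality minus weighted cost."""
    best = max(
        candidates,
        key=lambda c: predicted_quality[c.name] - cost_weight * c.cost,
    )
    return best.name

candidates = [Candidate("small_distilled", cost=1.0), Candidate("large_100_steps", cost=10.0)]

# Hypothetical per-prompt quality predictions from a learned router.
simple_prompt = {"small_distilled": 0.80, "large_100_steps": 0.82}
complex_prompt = {"small_distilled": 0.40, "large_100_steps": 0.85}

choice_simple = route(simple_prompt, candidates, cost_weight=0.01)
choice_complex = route(complex_prompt, candidates, cost_weight=0.01)
```

With these toy numbers the simple prompt goes to the cheap model and the complex prompt to the expensive one; raising `cost_weight` shifts more prompts toward cheaper choices.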

[55] Scaling-Up the Pretraining of the Earth Observation Foundation Model PhilEO to the MajorTOM Dataset cs.CVPDF

Nikolaos Dionelis, Jente Bosmans, Riccardo Musto, Giancarlo Paoletti, Simone Sarti

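TL;DR: This paper scales the pretraining of the Earth Observation foundation model PhilEO up to the 23TB MajorTOM dataset, validating the effectiveness of dataset and model scaling and showing performance gains on several downstream tasks.

Details

Motivation: Earth Observation satellites produce massive volumes of data, but labeled data is scarce, so large foundation models need to be pretrained on unlabeled data to make downstream tasks more efficient and accurate.

Result: The PhilEO 44M model pretrained on MajorTOM 23TB outperforms PhilEO Globe 0.5TB 44M on road density regression for all n-shots, while PhilEO 200M FastTOM outperforms all other models for most n-shots on road density estimation and building density regression, validating both dataset and model scaling.

Insight: Scaling both the dataset and the model is key to improving Earth Observation foundation models, and ViT architectures may hold an advantage at high parameter counts.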

Abstract: Today, Earth Observation (EO) satellites generate massive volumes of data, with the Copernicus Sentinel-2 constellation alone producing approximately 1.6TB per day. To fully exploit this information, it is essential to pretrain EO Foundation Models (FMs) on large unlabeled datasets, enabling efficient fine-tuning for several different downstream tasks with minimal labeled data. In this work, we present the scaling-up of our recently proposed EO Foundation Model, PhilEO Geo-Aware U-Net, on the unlabeled 23TB dataset MajorTOM, which covers the vast majority of the Earth’s surface, as well as on the specialized subset FastTOM 2TB that does not include oceans and ice. We develop and study various PhilEO model variants with different numbers of parameters and architectures. Finally, we fine-tune the models on the PhilEO Bench for road density estimation, building density pixel-wise regression, and land cover semantic segmentation, and we evaluate the performance. Our results demonstrate that for all n-shots for road density regression, the PhilEO 44M MajorTOM 23TB model outperforms PhilEO Globe 0.5TB 44M. We also show that for most n-shots for road density estimation and building density regression, PhilEO 200M FastTOM outperforms all the other models. The effectiveness of both dataset and model scaling is validated using the PhilEO Bench. We also study the impact of architecture scaling, transitioning from U-Net Convolutional Neural Networks (CNN) to Vision Transformers (ViT).

[56] ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM cs.CV | cs.CLPDF

Yujun Wang, Jinhe Bi, Yunpu Ma, Soeren Pirk

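TL;DR: This paper proposes Attention-Steerable Contrastive Decoding (ASCD), a framework that intervenes directly in the attention mechanisms of multimodal large language models (MLLMs), significantly reducing hallucinations and performing well on multiple benchmarks.

Details

Motivation: Existing contrastive decoding methods such as VCD and ICD reduce hallucinations by perturbing inputs or adding negative prefixes, operating only on the inputs and output logits of the model. The authors find that these methods in fact alter the attention dynamics of the model, and therefore propose steering attention directly.

Result: Experiments show that ASCD significantly reduces hallucinations, outperforms existing methods on benchmarks such as POPE, CHAIR, and MMHal-Bench, and also improves performance on VQA tasks.

Insight: Existing contrastive decoding methods such as VCD and ICD work by shifting the attention distribution of the model rather than merely by modifying logits at the surface. This finding motivates more direct and effective attention-steering methods.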

Abstract: Multimodal Large Language Models (MLLMs) often suffer from hallucinations. They over-rely on partial cues and generate incorrect responses. Recently, methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations by contrasting predictions from perturbed or negatively prefixed inputs against original outputs. In this work, we uncover that methods like VCD and ICD fundamentally influence internal attention dynamics of the model. This observation suggests that their effectiveness may not stem merely from surface-level modifications to logits but from deeper shifts in attention distribution. Inspired by this insight, we propose an attention-steerable contrastive decoding framework that directly intervenes in attention mechanisms of the model to offer a more principled approach to mitigating hallucinations. Our experiments across multiple MLLM architectures and diverse decoding methods demonstrate that our approach significantly reduces hallucinations and improves the performance on benchmarks such as POPE, CHAIR, and MMHal-Bench, while simultaneously enhancing performance on standard VQA benchmarks.
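
To make the logit-level contrast used by VCD-style methods concrete, here is a minimal sketch. It assumes the commonly used formulation in which original logits are amplified and logits from a perturbed input are subtracted. It does not implement the attention steering that ASCD itself proposes, and the vocabulary and logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, perturbed, alpha=1.0):
    """Amplify what the original input supports and subtract what the
    perturbed input still predicts: (1 + alpha) * original - alpha * perturbed."""
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def argmax_token(vocab, logits):
    return vocab[max(range(len(logits)), key=logits.__getitem__)]


vocab = ["cat", "dog", "frisbee"]
# Hypothetical next-token logits. "dog" stays likely even when the image is
# distorted, which suggests it comes from a language prior, not visual evidence.
original = [2.0, 2.2, 0.5]   # conditioned on the real image
perturbed = [0.5, 2.1, 0.4]  # conditioned on a distorted image

adjusted = contrastive_logits(original, perturbed, alpha=1.0)
print(argmax_token(vocab, original))   # plain decoding follows the prior
print(argmax_token(vocab, adjusted))   # contrast favours the visually grounded token
print(softmax(adjusted))
```

In this toy case the prior-driven token wins under plain decoding, while the contrast promotes the token whose score actually depends on the image.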

[57] CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion cs.CV | cs.ROPDF

Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo

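TL;DR: The paper proposes Causal Diffusion Policy (CDP), a causal-diffusion-based method that conditions on historical action sequences to improve action prediction for robots on complex tasks and uses a caching mechanism to reduce computational overhead. Experiments show that CDP remains highly accurate even when input quality degrades.

Details

Motivation: In real-world applications, hardware limitations degrade data quality, and real-time constraints restrict the model to inferring from instantaneous state and scene observations. Both reduce the effectiveness of imitating expert demonstrations and harm task execution.

Result: In 2D and 3D manipulation experiments in simulated and real-world environments, CDP significantly outperforms existing methods and maintains high accuracy when input quality degrades.

Insight: Combining historical action sequences with a compute-saving caching mechanism makes CDP more robust under realistic, imperfect conditions.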

Abstract: Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
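
The key-value caching mentioned in this abstract can be sketched generically. This is a toy single-head attention example, not the CDP architecture: the scalar "projection weights", the two-dimensional action vectors, and the use of the raw action as the query are simplifying assumptions made only to show why caching removes redundant work.

```python
import math

W_K, W_V = 0.5, 2.0            # toy scalar stand-ins for key/value projection weights
counter = {"projections": 0}   # counts how often a key or value is (re)computed


def project(vec, weight):
    """Toy linear projection; a real model would multiply by a weight matrix."""
    counter["projections"] += 1
    return [weight * x for x in vec]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def step_without_cache(history):
    """Recompute keys and values for the whole history at every step."""
    keys = [project(a, W_K) for a in history]
    values = [project(a, W_V) for a in history]
    return attend(history[-1], keys, values)


class KVCache:
    """Keep keys/values of past steps so each step projects only the new item."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, action):
        self.keys.append(project(action, W_K))
        self.values.append(project(action, W_V))
        return attend(action, self.keys, self.values)


actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # hypothetical history

counter["projections"] = 0
uncached = [step_without_cache(actions[: t + 1]) for t in range(len(actions))]
uncached_calls = counter["projections"]

counter["projections"] = 0
cache = KVCache()
cached = [cache.step(a) for a in actions]
cached_calls = counter["projections"]

print(uncached_calls, cached_calls)
```

Both variants produce the same attention outputs, but over four steps the uncached version performs 20 projections against 8 with the cache. The gap grows quadratically with the length of the history.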

cs.CL [Back]

[58] Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles cs.CL | cs.AIPDF

Antara Raaghavi Bhattacharya, Isabel Papadimitriou, Kathryn Davidson, David Alvarez-Melis

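TL;DR: The paper uses multilingual number puzzles to study how linguistic and mathematical reasoning interact in language models, finding that current models struggle to flexibly infer the implicit compositional rules of numerals.

Details

Motivation: The study explores how language models handle linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve but on which language models perform poorly.

Result: Models cannot solve the problems consistently unless the mathematical operations are explicitly marked (as in “twenty + three”). Unlike humans, they struggle to infer compositional rules from implicit patterns.

Insight: Current reasoning models still find it hard to flexibly infer compositional rules from implicit patterns, which may limit their use on complex linguistic-mathematical tasks.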

Abstract: Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols ($+$, $\times$, etc, as in “twenty + three”). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.

[59] VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training cs.CL | cs.CVPDF

Jipeng Zhang, Kehao Miao, Renjie Pi, Zhaowei Wang, Runtao Liu

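TL;DR: The paper proposes VL-GenRM, an iterative training framework that combines vision experts, Chain-of-Thought rationales, and margin-based rejection sampling to address the data bootstrapping and modality bias problems in training vision-language reward models, improving hallucination detection and multimodal reasoning.

Details

Motivation: Vision-language reward models (VL-RMs) play a key role in reinforcement fine-tuning, but they face two challenges: a bootstrapping loop in which high-quality training data depends on already strong VL models, and modality bias with negative example amplification. These problems degrade training data quality and limit model performance.

Result: The approach performs strongly on VL-RM benchmarks, with clear gains in hallucination detection and multimodal reasoning.

Insight: Collaboration among vision experts combined with iterative refinement of data quality can mitigate the self-bootstrapping problem and modality bias in VL model training, advancing the use of reinforcement learning for vision-language tasks.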

Abstract: Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, the bootstrapping dilemma arises as high-quality training data depends on already strong VL models, creating a cycle where self-generated supervision reinforces existing biases. Second, modality bias and negative example amplification occur when VL models hallucinate incorrect visual attributes, leading to flawed preference data that further misguides training. To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.

[60] ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection cs.CL | cs.AI | cs.ROPDF

Shang-Chi Tsai, Seiya Kawano, Angel Garcia Contreras, Koichiro Yoshino, Yun-Nung Chen

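TL;DR: This paper presents a novel framework that uses a large language model and a stable diffusion model to generate dialogues and environment images as data augmentation, improving action selection in robotic assistance scenarios. Experiments show a significant performance gain.

Details

Motivation: When robots assist with everyday activities, understanding user intent requires combining visual and linguistic information, but building large-scale multimodal datasets is difficult and time-consuming.

Result: On a dataset collected from real-world scenarios, the method significantly improves the accuracy of robot action selection and achieves state-of-the-art performance.

Insight: Generative models can effectively alleviate the scarcity of multimodal data, offering a new data augmentation approach for robotic assistance scenarios.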

Abstract: When designing robots to assist in everyday human activities, it is crucial to enhance user requests with visual cues from their surroundings for improved intent understanding. This process is defined as a multimodal classification task. However, gathering a large-scale dataset encompassing both visual and linguistic elements for model training is challenging and time-consuming. To address this issue, our paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios, encompassing both dialogues and related environmental imagery. This approach involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts, followed by the use of a stable diffusion model to create images depicting these environments. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions in response to user interactions with the limited target data. Our experimental results, based on a dataset collected from real-world scenarios, demonstrate that our methodology significantly enhances the robot’s action selection capabilities, achieving state-of-the-art performance.

[61] Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text cs.CLPDF

Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang

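TL;DR: This paper systematically evaluates how well large language models (LLMs) understand code-switched (CSW) text. By generating CSW variants of reasoning and comprehension benchmarks, the study finds that foreign tokens disrupting English text degrade performance, whereas embedding English into other languages often improves comprehension. Prompting yields mixed results, while fine-tuning mitigates the degradation more reliably.

Details

Motivation: Code-switching is widespread in multilingual communities and online content, so LLMs frequently have to process mixed-language input. How LLMs understand and reason about such text remains unclear, which calls for a systematic evaluation.

Result: Experiments show that performance degrades when foreign tokens disrupt English text, whereas embedding English into other languages often improves comprehension. Prompting gives mixed results, while fine-tuning mitigates the degradation more stably.

Insight: LLMs have limitations in handling CSW text, but fine-tuning can improve their performance. The study also shows that how languages are mixed affects model comprehension, pointing to directions for improving LLMs on mixed-language tasks.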

Abstract: Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text, even under linguistic constraints, embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation.

[62] MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation cs.CLPDF

Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He

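TL;DR: MultiFinBen is the first multilingual, multimodal benchmark tailored to the global financial domain. It evaluates LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) and introduces a dynamic, difficulty-aware selection mechanism.

Details

Motivation: Existing financial benchmarks are mostly monolingual and unimodal and fail to reflect the complexity of real-world financial communication, so a more comprehensive evaluation tool is needed.

Result: An evaluation of 22 state-of-the-art models shows that even the strongest models struggle with complex cross-lingual and multimodal tasks.

Insight: Multilingual and multimodal capabilities in the financial domain still need substantial improvement; existing models fall short on complex financial tasks.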

Abstract: Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two sets of novel tasks: PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.

[63] An Interdisciplinary Review of Commonsense Reasoning and Intent Detection cs.CL | cs.HCPDF

Md Nazmus Sakib

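TL;DR: This paper reviews 28 recent papers (2020-2025) on commonsense reasoning and intent detection, summarizing progress in zero-shot learning, cultural adaptation, structured evaluation, and related areas, and identifying gaps in model generalization and benchmark design.

Details

Motivation: Commonsense reasoning and intent detection are core challenges in natural language understanding, yet existing work lacks a systematic summary and an interdisciplinary perspective. This review aims to fill that gap.

Result: The review identifies trends toward more adaptive, multilingual, and context-aware models for commonsense reasoning and intent detection, and exposes gaps in current research.

Insight: Future research should focus on model generalization, better benchmark design, and interdisciplinary collaboration.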

Abstract: This review explores recent advances in commonsense reasoning and intent detection, two key challenges in natural language understanding. We analyze 28 papers from ACL, EMNLP, and CHI (2020-2025), organizing them by methodology and application. Commonsense reasoning is reviewed across zero-shot learning, cultural adaptation, structured evaluation, and interactive contexts. Intent detection is examined through open-set models, generative formulations, clustering, and human-centered systems. By bridging insights from NLP and HCI, we highlight emerging trends toward more adaptive, multilingual, and context-aware models, and identify key gaps in grounding, generalization, and benchmark design.

[64] S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models cs.CL | cs.AIPDF

Tao He, Guang Huang, Yu Yang, Tianshi Xu, Sicheng Zhao

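TL;DR: The paper proposes S$^4$C, a speculative sampling framework that exploits syntactic and semantic coherence to significantly improve the inference efficiency of large language models (LLMs) and reduce computational resource consumption.

Details

Motivation: LLMs suffer from substantial inference latency, which hinders real-time applications. Current speculative sampling methods do not fully exploit the coherence of generated text, limiting their efficiency.

Result: S$^4$C surpasses baseline methods on mainstream tasks and achieves an acceleration ratio of 2.26x-2.60x on Spec-bench while using fewer computational resources.

Insight: Exploiting syntactic and semantic coherence makes speculative sampling markedly more efficient, and feature reuse in the verification tree further improves the use of computational resources.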

Abstract: Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
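
The draft-then-verify step that speculative sampling builds on can be sketched as follows. This shows only the standard accept/reject rule for a single drafted token, with made-up target and draft distributions. The multi-head drafting and continuous verification tree of S$^4$C are not reproduced here.

```python
import random


def residual_distribution(target, draft):
    """Normalised max(0, target - draft), used to resample after a rejection."""
    residual = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(residual)
    # If target == draft the residual is empty, but then nothing is ever rejected.
    return [r / total for r in residual] if total > 0 else list(target)


def verify_token(token, target, draft, rng):
    """Accept a drafted token with probability min(1, p/q); otherwise resample.

    Returns (final_token, accepted_flag).
    """
    accept_prob = min(1.0, target[token] / draft[token])
    if rng.random() < accept_prob:
        return token, True
    weights = residual_distribution(target, draft)
    return rng.choices(range(len(target)), weights=weights)[0], False


def sample_with_speculation(target, draft, rng):
    """Draw a token from the cheap draft model, then verify it against the target."""
    token = rng.choices(range(len(draft)), weights=draft)[0]
    return verify_token(token, target, draft, rng)[0]


# Hypothetical next-token distributions over a 3-token vocabulary.
target = [0.6, 0.3, 0.1]  # large (slow) model
draft = [0.3, 0.5, 0.2]   # small (fast) draft model

rng = random.Random(0)
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    counts[sample_with_speculation(target, draft, rng)] += 1
frequencies = [c / trials for c in counts]
print(frequencies)  # empirically close to the target distribution
```

The point of the rule is that the accepted and resampled tokens together follow the target distribution exactly, so the speed-up from drafting does not change what the large model would have generated.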

[65] MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind cs.CLPDF

Yanlin Li, Hao Liu, Huimin Liu, Yinwei Wei, Yupeng Hu

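TL;DR: The paper proposes MIST, a framework for evaluating multi-dimensional implicit bias and stereotypes in large language models (LLMs). It reconceptualizes bias through Theory of Mind (ToM) and designs two indirect tasks (WABT and AAT) to reveal complex bias structures.

Details

Motivation: Conventional direct-query methods are susceptible to social desirability effects and struggle to capture the subtle, multi-dimensional nature of implicit bias, so a more robust method is needed to evaluate implicit bias in LLMs comprehensively.

Result: Experiments on 8 state-of-the-art LLMs reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification.

Insight: The framework offers a more robust methodology for identifying the structural nature of implicit bias and frames it as a systematic, multi-dimensional failure of ToM in LLMs.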

Abstract: Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework’s capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.

[66] GRAM: A Generative Foundation Reward Model for Reward Generalization cs.CL | cs.AIPDF

Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He

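TL;DR: The paper proposes GRAM, a generative foundation reward model that combines unsupervised and supervised learning to make reward models generalize, and that outperforms baseline models on multiple tasks.

Details

Motivation: Existing reward models are trained as discriminative models and rely mainly on labeled human preference data. The authors aim to improve generalization and adaptability by combining generative models with unlabeled data.

Result: The model achieves significant improvements over strong baselines on several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation.

Insight: Combining generative models with unlabeled data can markedly improve reward model generalization, and label smoothing provides a unified view of training generative and discriminative reward models.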

Abstract: In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
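
One way to see how label smoothing can turn into a regularized pairwise ranking loss is the small numerical check below. It assumes a Bradley-Terry style preference probability, sigmoid(margin), where the margin is the score gap between the preferred and the rejected response. The paper's exact derivation may differ. Under that assumption, the identity log sigmoid(-m) = log sigmoid(m) - m gives: smoothed loss = ranking loss + eps * margin.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def smoothed_pairwise_loss(margin, eps):
    """Cross-entropy on the preference label with label smoothing eps.

    margin = score(preferred) - score(rejected); eps is the smoothing weight.
    """
    return -(1 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)


def regularized_ranking_loss(margin, eps):
    """Plain pairwise ranking loss plus a linear penalty on the margin."""
    return -log_sigmoid(margin) + eps * margin


eps = 0.1  # hypothetical smoothing weight
for margin in (-3.0, -0.5, 0.0, 1.0, 4.0):
    a = smoothed_pairwise_loss(margin, eps)
    b = regularized_ranking_loss(margin, eps)
    print(f"margin={margin:+.1f}  smoothed={a:.6f}  regularized={b:.6f}")
```

The two columns agree for every margin. The extra `eps * margin` term discourages unbounded score gaps: in this toy setting the loss is minimized where sigmoid(margin) = 1 - eps rather than at an infinite margin.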

[67] CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation cs.CLPDF

Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong, Chun-Ming Xia, Fei Dai

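TL;DR: This paper proposes CausalDiffTab, a diffusion-based generative model designed for mixed-type tabular data, which improves generation quality through adaptive causal regularization.

Details

Motivation: High-quality data generation is important for privacy protection and a wide range of applications, but generating mixed-type tabular data is challenging because of heterogeneous data types and complex relationships among variables.

Result: Experiments on seven datasets show that CausalDiffTab outperforms baseline methods across all metrics.

Insight: Adaptively controlling the weight of causal regularization can further improve model performance without compromising generative capability.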

Abstract: Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data, synthetic data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. Generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherent heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model’s performance without compromising its generative capabilities. Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: https://github.com/Godz-z/CausalDiffTab.

[68] Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation cs.CLPDF

Sina Abdidizaji, Md Kowsher, Niloofar Yousefi, Ivan Garibay

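TL;DR: The paper proposes a data augmentation approach that improves the detection of implicit influential patterns in conversations, with notable gains in both detection and the related multi-label classification tasks.

Details

Motivation: As digital platforms become the main channel for communication, malicious actors have shifted toward implicit linguistic patterns to influence the public, and existing models struggle to detect them.

Result: Detection of implicit influential patterns improves by 6%, and the multi-label classification tasks for influence techniques and victim vulnerability improve by 33% and 43%, respectively.

Insight: Combining data augmentation with language models effectively improves the detection of implicit linguistic patterns, with especially large gains on the multi-label classification tasks.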

Abstract: In the era of digitalization, as individuals increasingly rely on digital platforms for communication and news consumption, various actors employ linguistic strategies to influence public perception. While models have become proficient at detecting explicit patterns, which typically appear in texts as single remarks referred to as utterances, such as social media posts, malicious actors have shifted toward utilizing implicit influential verbal patterns embedded within conversations. These verbal patterns aim to mentally penetrate the victim’s mind in order to influence them, enabling the actor to obtain the desired information through implicit means. This paper presents an improved approach for detecting such implicit influential patterns. Furthermore, the proposed model is capable of identifying the specific locations of these influential elements within a conversation. To achieve this, the existing dataset was augmented using the reasoning capabilities of state-of-the-art language models. Our designed framework resulted in a 6% improvement in the detection of implicit influential patterns in conversations. Moreover, this approach improved the multi-label classification tasks related to both the techniques used for influence and the vulnerability of victims by 33% and 43%, respectively.

[69] Chaining Event Spans for Temporal Relation Grounding cs.CLPDF

Jongho Kim, Dohyeon Lee, Minsoo Kim, Seung-won Hwang

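TL;DR: This paper proposes the Timeline Reasoning Network (TRN), which predicts the time spans of events to address the unreliable results caused by relying on answer overlap in temporal relation tasks.

Details

Motivation: Existing methods use answer overlap as a proxy label for temporal relation problems, but this can produce unreliable results because dissimilar questions may coincidentally share the same answers.

Result: Experiments on TORQUE and TB-dense show that TRN effectively resolves spurious answer overlaps and outperforms previous methods.

Insight: Explicitly predicting the timeline of events grounds temporal relation answers more reliably and avoids dependence on unreliable answer-overlap signals.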

Abstract: Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example, in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: “What finished right before the decision?” and “What finished right after the decision?”. To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN), which operates in a two-step inductive reasoning process: in the first step, the model initially answers each question with semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on TORQUE and TB-dense, for the TRC and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.

[70] Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team cs.CL | cs.AIPDF

Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, Md Rizwan Parvez

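TL;DR: Xolver is a training-free multi-agent reasoning framework that uses a persistent memory to integrate diverse experience modalities (such as retrieval, tool use, and collaborative interaction), enabling large language models (LLMs) to learn from and use experience at inference time and significantly improving performance on complex tasks.

Details

Motivation: Current LLMs usually treat each problem in isolation on complex reasoning tasks, without accumulating or integrating experience, whereas human experts such as Olympiad teams draw on rich experience to collaborate and learn iteratively. Xolver aims to emulate this experience-driven style of reasoning.

Result: Even with lightweight backbones such as QWQ-32B, Xolver often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it sets new best results on GSM8K (98.1%), AIME’24 (94.4%), AIME’25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%).

Insight: Experience-aware reasoning is a key step toward generalist agents. Xolver demonstrates the potential of integrating diverse experience modalities and shows that lightweight models can also reach advanced performance through experience learning.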

Abstract: Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME’24 (94.4%), AIME’25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at https://kagnlp.github.io/xolver.github.io/.

[71] A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs cs.CLPDF

Yimin Deng, Yuxia Wu, Yejing Wang, Guoshuai Zhao, Li Zhu

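TL;DR: This paper proposes the Multi-Expert Structural-Semantic Hybrid (MESH) framework, which combines structural and semantic reasoning perspectives and uses three kinds of expert modules to handle different event prediction scenarios, addressing limitations of existing methods on temporal knowledge graphs.

Details

Motivation: Existing methods fail to integrate structural and semantic reasoning and cannot capture the inherent differences between historical and non-historical events, which limits their generalization across temporal contexts.

Result: Experiments on three datasets demonstrate the effectiveness of the MESH framework.

Insight: Multi-expert collaboration that integrates structural and semantic information can improve temporal knowledge graph reasoning, particularly in distinguishing historical from non-historical events.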

Abstract: Temporal knowledge graph reasoning aims to predict future events with knowledge of existing facts and plays a key role in various downstream tasks. Previous methods focused on either graph structure learning or semantic reasoning, failing to integrate dual reasoning perspectives to handle different prediction scenarios. Moreover, they lack the capability to capture the inherent differences between historical and non-historical events, which limits their generalization across different temporal contexts. To this end, we propose a Multi-Expert Structural-Semantic Hybrid (MESH) framework that employs three kinds of expert modules to integrate both structural and semantic information, guiding the reasoning process for different events. Extensive experiments on three datasets demonstrate the effectiveness of our approach.

[72] Re-Initialization Token Learning for Tool-Augmented Large Language Models cs.CL | cs.AIPDF

Chenghao Li, Liu Liu, Baosheng Yu, Jiayan Qiu, Yibing Zhan

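TL;DR: This paper proposes a token learning method for tool-augmented large language models (LLMs) that aligns tool tokens with the existing word embedding space at initialization, improving model performance. The method clearly outperforms baselines on tasks such as numerical reasoning and knowledge-based question answering.

Details

Motivation: Current methods let LLMs call external tools through unique tokens but ignore the relationship between tool tokens and word tokens, limiting adaptability within pre-trained models. A method is therefore needed to align tool tokens with the word embedding space and improve tool call accuracy.

Result: On numerical reasoning, knowledge-based question answering, and embodied plan generation tasks, the method shows clear improvements over baselines including CoT, REACT, ICL, and ToolkenGPT.

Insight: Aligning tool tokens with the word embedding space improves the ability of LLMs to call external tools and applies across diverse task domains.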

Abstract: Large language models have demonstrated exceptional performance, yet struggle with complex tasks such as numerical reasoning and plan generation. Integrating external tools, such as calculators and databases, into large language models (LLMs) is crucial for enhancing problem-solving capabilities. Current methods assign a unique token to each tool, enabling LLMs to call tools through token prediction, similar to word generation. However, this approach fails to account for the relationship between tool and word tokens, limiting adaptability within pre-trained LLMs. To address this issue, we propose a novel token learning method that aligns tool tokens with the existing word embedding space from the perspective of initialization, thereby enhancing model performance. We begin by constructing prior token embeddings for each tool based on the tool’s name or description, which are used to initialize and regularize the learnable tool token embeddings. This ensures the learned embeddings are well-aligned with the word token space, improving tool call accuracy. We evaluate the method on tasks such as numerical reasoning, knowledge-based question answering, and embodied plan generation using GSM8K-XL, FuncQA, KAMEL, and VirtualHome datasets. The results demonstrate clear improvements over recent baselines, including CoT, REACT, ICL, and ToolkenGPT, indicating that our approach effectively augments LLMs with tools through relevant tokens across diverse domains.
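
The initialize-and-regularize idea described in this abstract can be sketched in a few lines. The details here are assumptions, not the authors' implementation: the prior is taken to be the mean of the word embeddings of the tool name, the regularizer is a squared-distance penalty, and the three-dimensional embeddings and the `weight` value are made up for illustration.

```python
# Hypothetical toy word embeddings (3 dimensions) from a pre-trained vocabulary.
WORD_EMBEDDINGS = {
    "square": [0.2, 0.8, 0.0],
    "root": [0.4, 0.6, 0.2],
    "add": [0.9, 0.1, 0.0],
}


def prior_embedding(tool_name_tokens):
    """Assumed prior: average the word embeddings of the tokens in the tool name."""
    vectors = [WORD_EMBEDDINGS[token] for token in tool_name_tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def regularization_penalty(tool_embedding, prior, weight):
    """Assumed regularizer: weight * squared distance from the prior embedding."""
    return weight * sum((e - p) ** 2 for e, p in zip(tool_embedding, prior))


# Build the prior for a hypothetical "square root" tool and initialise from it.
prior = prior_embedding(["square", "root"])
tool_embedding = list(prior)  # the learnable embedding starts at the prior

print(prior)
print(regularization_penalty(tool_embedding, prior, weight=0.5))  # zero at initialisation

# After some (simulated) training updates the embedding has drifted;
# the penalty now pulls it back toward the word embedding space.
drifted = [e + 0.2 for e in tool_embedding]
print(regularization_penalty(drifted, prior, weight=0.5))
```

During training this penalty would be added to the task loss, so the tool token can adapt to its task while staying close to the region of embedding space the pre-trained model already understands.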

[73] A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis cs.CL | cs.IRPDF

Bruno Martins, Piotr Szymański, Piotr Gramacki

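TL;DR: This paper presents a vision for deep research systems with geo-temporal capabilities, addressing the inability of current systems to handle questions that involve geographic and temporal constraints.

Details

Motivation: Current LLM-based deep research systems lack geo-temporal reasoning capabilities, which are essential for answering questions in domains such as public health, environmental science, and socio-economic analysis.

Result: The paper reports no experimental results; it identifies technical, infrastructural, and evaluative challenges and outlines a path toward geo-temporally aware deep research systems.

Insight: Future AI-driven information access systems need to integrate geographic and temporal information better in order to support more complex, multi-dimensional research and analysis tasks.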

Abstract: The emergence of Large Language Models (LLMs) has transformed information access, with current LLMs also powering deep research systems that can generate comprehensive report-style answers, through planned iterative search, retrieval, and reasoning. Still, current deep research systems lack the geo-temporal capabilities that are essential for answering context-rich questions involving geographic and/or temporal constraints, frequently occurring in domains like public health, environmental science, or socio-economic analysis. This paper reports our vision towards next generation systems, identifying important technical, infrastructural, and evaluative challenges in integrating geo-temporal reasoning into deep research pipelines. We argue for augmenting retrieval and synthesis processes with the ability to handle geo-temporal constraints, supported by open and reproducible infrastructures and rigorous evaluation protocols. Our vision outlines a path towards more advanced and geo-temporally aware deep research systems, of potential impact to the future of AI-driven information access.

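[74] ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection cs.CL | cs.HC

Lucile Favero, Daniel Frases, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver

TL;DR: The paper proposes a two-step LLM-based framework that fosters critical thinking by generating and then selecting critical questions; it won the critical question generation shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025.

Details

Motivation: In response to concerns that LLMs may encourage superficial learning, the work explores how they can instead generate challenging critical questions that promote deeper reasoning.

Result: The system ranked first in the shared task, which the authors present as evidence that the LLM-based framework can encourage critical engagement with argumentative texts.

Insight: LLMs can be used not only for information retrieval but also, through a question generation and selection framework, to promote deeper learning and critical thinking.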

Abstract: The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.

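[75] Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding cs.CL

Yeonkyoung So, Gyuseong Lee, Sungmok Jung, Joonhak Lee, JiA Kang

TL;DR: Thunder-NUBench is a new benchmark designed to evaluate sentence-level negation understanding in large language models (LLMs); it goes beyond surface-level negation cue detection and covers multiple forms of negation.

Details

Motivation: Negation is a fundamental linguistic phenomenon, yet it challenges the semantic understanding of LLMs. Existing benchmarks lack tasks that exclusively target negation understanding, so a new benchmark is needed to fill the gap.

Result: No experimental results are stated explicitly, but the benchmark provides an evaluation tool for negation understanding in LLMs.

Insight: Specialized benchmarks, such as one for negation understanding, are important for improving LLM performance on specific linguistic phenomena.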

Abstract: Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Existing benchmarks often treat negation as a side case within broader tasks like natural language inference, resulting in a lack of benchmarks that exclusively target negation understanding. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives such as local negation, contradiction, and paraphrase. The benchmark consists of manually curated sentence-negation pairs and a multiple-choice dataset that enables in-depth evaluation of models’ negation understanding.

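[76] ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge cs.CL | cs.AI

Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze

TL;DR: ImpliRet is a new benchmark that evaluates whether retrieval systems can infer relevance from facts stated implicitly in documents (such as temporal, arithmetic, and world knowledge relationships), even when the queries are simple. Existing retrieval systems perform poorly, which highlights the difficulty of document-side reasoning.

Details

Motivation: Existing retrieval systems rely on shallow signals such as keyword overlap, whereas the new benchmark makes relevance depend on reasoning over implicit document-side facts.

Result: The best nDCG@10 is only 15.07%, and GPT-4.1 scores 35.06% in a short context of ten documents that includes the positive document, showing that document-side reasoning remains difficult.

Insight: Reasoning over implicit document-side facts is a new challenge for retrieval systems that current techniques have not solved; more advanced reasoning capabilities are needed.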

Abstract: Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Our codes are available at github.com/ZeinabTaghavi/IMPLIRET.Contribution.
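For readers unfamiliar with the nDCG@10 metric reported above, the following is a minimal sketch of the standard definition with linear gain. The abstract does not state which gain variant the benchmark uses, so this formulation and the function names `dcg_at_k` and `ndcg_at_k` are assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # enumerate() is 0-based, so position 1 is discounted by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant document, retrieved at rank 1 versus rank 3.
top_ranked = ndcg_at_k([1, 0, 0])
third_ranked = ndcg_at_k([0, 0, 1])
```

With a single relevant document per query, the score depends only on that document's rank, so a low average nDCG@10 means the positive document is usually ranked low or missing from the top ten.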

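[77] How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison cs.CL

Jiayin Wang, Zhiquang Guo, Weizhi Ma, Min Zhang

TL;DR: The paper argues for evaluating the dynamic learning ability of large language models (LLMs), i.e., test-time learning. It proposes an evaluation framework based on semantic games and compares against human performance, finding that LLMs do exhibit test-time learning but improve more slowly and less stably than humans.

Details

Motivation: Existing benchmarks mainly assess static knowledge and overlook a model's ability to learn dynamically. The paper proposes evaluating test-time learning to measure model intelligence more comprehensively.

Result: LLMs show measurable test-time learning, but under cumulative experience their improvements are slower and less stable than those observed in humans.

Insight: The paper reveals weaknesses in the dynamic learning ability of LLMs and shows that static benchmarks cannot fully reflect a model's overall capabilities; learning mechanisms need further improvement.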

Abstract: As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.

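[78] LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM’s Textual Training Data cs.CL | cs.CR

Eyal German, Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici

TL;DR: LexiMark is a novel text watermarking technique that substitutes synonyms for high-entropy words to strengthen an LLM's memorization of the watermarked text while preserving semantic integrity. Compared with existing methods, it is stealthier and more resistant to removal, and it yields significantly higher AUROC scores.

Details

Motivation: Existing dataset watermarking methods lack stealth and are easy to detect and remove, so they cannot reliably verify whether an LLM was trained on particular data without authorization.

Result: Across several open-source models (e.g., LLaMA-1 7B, Mistral 7B) and different training settings, LexiMark achieves significantly higher AUROC scores than baseline methods.

Insight: Embedding watermarks through contextually appropriate synonym substitution is an effective and stealthy approach that can strengthen copyright protection for LLM training data.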

Abstract: Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner’s consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM’s memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method’s effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.
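The substitution step described above can be pictured with a toy sketch. The abstract does not specify how word entropy is estimated, how synonyms are chosen, or how many words are replaced, so the precomputed `entropy_scores`, the `synonyms` dictionary, the parameter `k`, and the function name `watermark` are all hypothetical.

```python
def watermark(words, entropy_scores, synonyms, k=2):
    """Replace the k highest-scoring words that have a synonym available."""
    # Only words with a known synonym are candidates for substitution.
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    # Pick the k candidates with the highest (assumed, precomputed) entropy score.
    chosen = sorted(candidates, key=lambda i: entropy_scores[i], reverse=True)[:k]
    out = list(words)
    for i in chosen:
        out[i] = synonyms[words[i]]
    return out


words = ["the", "quick", "fox", "runs"]
entropy_scores = [0.1, 2.0, 1.5, 0.7]  # hypothetical per-word scores
synonyms = {"quick": "swift", "runs": "sprints"}  # hypothetical synonym table
marked = watermark(words, entropy_scores, synonyms, k=1)
```

A real system would need context-aware synonym selection to keep the text natural, which this sketch does not attempt.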

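[79] LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops cs.CL | cs.CR

Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo

TL;DR: LingoLoop is a new attack that uses linguistic context and state entrapment to drive MLLMs into endless loops, substantially increasing computational resource consumption.

Details

Motivation: Existing energy-latency attacks neglect the effect of part-of-speech (POS) characteristics and sentence structure on output length, which limits their effectiveness. This paper aims to design a more effective attack by exploiting linguistic properties.

Result: The attack increases the number of generated tokens by up to 30 times and raises energy consumption by a comparable factor, driving tested models such as Qwen2.5-VL-3B towards their maximum generation limits.

Insight: The sensitivity of MLLMs to linguistic properties and hidden-state manipulation exposes a vulnerability that should be taken into account when designing them.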

Abstract: Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments demonstrate LingoLoop can increase generated tokens by up to 30 times and energy consumption by a comparable factor on models like Qwen2.5-VL-3B, consistently driving MLLMs towards their maximum generation limits. These findings expose significant vulnerabilities in MLLMs, posing challenges for their reliable deployment. The code will be released publicly following the paper’s acceptance.

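[80] M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models cs.CL

Can Zheng, Jiguang He, Chung G. Kang, Guofa Cai, Zitong Yu

TL;DR: M2BeamLLM is a new neural network framework for beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. It combines multimodal sensor data with large language models such as GPT-2 and significantly improves prediction accuracy and robustness.

Details

Motivation: Beam prediction in mmWave communication systems has traditionally relied on deep learning models, and the potential of multimodal sensor data has not been fully exploited. The paper aims to improve prediction performance by combining multimodal data with the reasoning capabilities of LLMs.

Result: M2BeamLLM outperforms traditional deep learning models in beam prediction accuracy and robustness, improves as the sensing modalities become more diverse, and provides an efficient solution for V2I communication.

Insight: Combining multimodal data with LLMs opens a new direction for beam prediction in mmWave systems and underlines the importance of modality diversity and model reasoning capability.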

Abstract: This paper introduces a novel neural network framework called M2BeamLLM for beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. M2BeamLLM integrates multi-modal sensor data, including images, radar, LiDAR, and GPS, leveraging the powerful reasoning capabilities of large language models (LLMs) such as GPT-2 for beam prediction. By combining sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT), M2BeamLLM achieves significantly higher beam prediction accuracy and robustness, demonstrably outperforming traditional deep learning (DL) models in both standard and few-shot scenarios. Furthermore, its prediction performance consistently improves with increased diversity in sensing modalities. Our study provides an efficient and intelligent beam prediction solution for vehicle-to-infrastructure (V2I) mmWave communication systems.

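[81] When Does Meaning Backfire? Investigating the Role of AMRs in NLI cs.CL

Junghyun Min, Xiulin Yang, Shira Wein

TL;DR: The paper examines how adding Abstract Meaning Representation (AMR) affects pretrained language models on natural language inference (NLI). Adding AMR during fine-tuning hinders generalization, whereas adding it in prompts brings slight gains; however, the gains come from amplifying surface-level differences rather than from semantic reasoning.

Details

Motivation: The study tests whether semantic information such as AMR improves model performance on NLI, and how that information acts in different settings (fine-tuning versus prompting).

Result: Adding AMR in fine-tuning harms generalization, while prompting with AMR gives slight gains with GPT-4o; the improvement stems from amplified surface-level differences rather than from help with semantic reasoning.

Insight: Introducing semantic information can mislead model predictions, especially when the model attends to surface differences rather than semantic content, so the way such information is integrated needs careful design.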

Abstract: Natural Language Inference (NLI) relies heavily on adequately parsing the semantic content of the premise and hypothesis. In this work, we investigate whether adding semantic information in the form of an Abstract Meaning Representation (AMR) helps pretrained language models better generalize in NLI. Our experiments integrating AMR into NLI in both fine-tuning and prompting settings show that the presence of AMR in fine-tuning hinders model generalization while prompting with AMR leads to slight gains in GPT-4o. However, an ablation study reveals that the improvement comes from amplifying surface-level differences rather than aiding semantic reasoning. This amplification can mislead models to predict non-entailment even when the core meaning is preserved.

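[82] Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models cs.CL | cs.AI

Chenchen Yuan, Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

TL;DR: The paper proposes a framework that uses probabilistic aggregation and embedding optimization to synthesize the moral judgments of multiple large language models (LLMs) into a collective judgment, and realigns models that deviate from the consensus by optimizing their embeddings.

Details

Motivation: LLMs behave inconsistently on complex moral dilemmas, so a method is needed to combine the judgments of multiple models into a more robust moral consensus.

Result: On a large-scale social moral dilemma dataset, the method builds a robust consensus and improves individual model fidelity.

Insight: Data-driven moral alignment across multiple models helps build safer, more consistent AI systems.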

Abstract: Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs’ moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
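The two quantities named in this abstract, a reliability-weighted collective probability and the Jensen-Shannon (JS) divergence to the consensus, can be sketched as follows. The abstract does not give the exact aggregation formula, so the weighted mean used here, the binary acceptable/unacceptable distribution, and the function names are assumptions; only the JS divergence definition is standard.

```python
import math


def aggregate(scores, reliabilities):
    """Reliability-weighted mean of per-model acceptability scores (assumed form)."""
    total = sum(reliabilities)
    return sum(s * w for s, w in zip(scores, reliabilities)) / total


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, skipping zero-mass terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Two hypothetical models score a dilemma; the first is treated as more reliable.
consensus = aggregate([0.9, 0.5], [3.0, 1.0])

# Distance of a deviating model from the consensus, as distributions over
# (acceptable, unacceptable).
deviation = js_divergence([0.5, 0.5], [consensus, 1.0 - consensus])
```

In a realignment step of the kind the abstract describes, a quantity like `deviation` would be the objective being minimized for a misaligned model.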

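[83] AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation cs.CL | cs.AI | cs.CY

Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessika Daikeler

TL;DR: This paper examines the feasibility of using large language models (LLMs) to code German open-ended survey responses, compares the performance of different LLMs and prompting approaches, and finds that only a fine-tuned LLM performs satisfactorily.

Details

Motivation: The study evaluates how LLMs perform in a non-English context and on complex topics, as an alternative to time-consuming manual coding or the pre-training of supervised machine learning models.

Result: Performance differs greatly between LLMs, and only a fine-tuned LLM reaches a satisfactory level; the effectiveness of a prompting approach depends on the LLM used.

Insight: Unequal classification performance across categories can distort the resulting categorical distributions; the study recommends fine-tuning and weighing performance against resource costs when applying LLMs.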

Abstract: The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs’ performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs’ unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

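[84] Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot cs.CL | cs.AI | cs.LG

Xiang Cheng, Chengyan Pan, Minjun Zhao, Deyang Li, Fangchao Liu

TL;DR: The paper finds that for recent strong language models such as the Qwen2.5 series, traditional or enhanced Chain-of-Thought (CoT) exemplars do not improve performance on mathematical reasoning tasks; their main effect is to align the output format.

Details

Motivation: To examine whether CoT exemplars are actually useful for recent strong language models (such as Qwen2.5) on reasoning tasks, and to question the effectiveness of the current in-context learning (ICL) paradigm.

Result: Experiments show that CoT exemplars do not improve reasoning performance and serve only to align the output format. Enhanced exemplars are likewise ineffective, and models tend to ignore the exemplars.

Insight: The current ICL+CoT framework has limitations in mathematical reasoning; more meaningful exemplars or a revised learning paradigm are needed.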

Abstract: In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1. Experimental results indicate that these enhanced exemplars still fail to improve the model’s reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

. Pazzaglia, V. Vendetti, L. D. Comencini, F. Deriu, V. Modugno

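TL;DR: The study investigates whether fine-tuning an open-source large language model (LLM) enables it to generate text resembling politically polarized social media comments. In linguistic analysis, sentiment scoring, and human annotation, the outputs prove highly plausible and provocative, raising ethical questions about AI in political discourse.

Details

Motivation: As LLMs grow more sophisticated, they may contribute to ideological polarization, particularly through automatically generated biased content. The study explores whether fine-tuned LLMs can replicate and amplify polarizing discourse in online environments.

Result: The fine-tuned LLM produces polarized, ideologically aligned comments that are often indistinguishable from human writing in credibility and rhetorical effect.

Insight: The study highlights the potential harm of AI in political discourse and stresses the importance of AI governance, platform regulation, and detection tools that mitigate adversarial fine-tuning risks.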

Abstract: The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content. This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments. Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses. The model’s outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse. The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans. These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns. The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks.

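[86] Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality cs.CL

Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao

TL;DR: Through large-scale supervised fine-tuning experiments, the paper shows how dataset properties, layer-wise modifications, and training-task synergies affect LLM alignment quality, and finds that perplexity is an important predictor of SFT effectiveness.

Details

Motivation: Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of it remain poorly understood.

Result: Some training-task synergies are consistent across all models while others vary substantially; perplexity is a strong predictor of SFT effectiveness, and mid-layer weight changes correlate most strongly with performance gains.

Insight: The paper stresses the importance of model-specific strategies and notes that superficial similarity between training data and benchmark may predict SFT effectiveness less reliably than perplexity.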

Abstract: Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often surpassing superficial similarity between training data and benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We will release these 1,000+ SFT models and benchmark results to accelerate further research.

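[87] Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers cs.CL | cs.LG

Daniel D’souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker

TL;DR: The paper introduces training-time markers to improve model performance and controllability on rare (long-tail) use cases, with marked gains in generation quality and task adaptability.

Details

Motivation: Modern machine learning models perform well on high-frequency use cases but poorly on long-tail (rare or underrepresented) data. Prompt engineering and few-shot examples give unstable results and are hard to control.

Result: Experiments show an average lift of 5.7% in win rate for open-ended generation quality, gains of over 9.1% in underrepresented domains, relative lifts of up to 14.1% on underrepresented tasks such as CodeRepair, and an absolute improvement of 35.3% on length instruction following.

Insight: Introducing markers during training lets the model adapt better to long-tail data and makes generation more controllable, giving users flexible control levers at inference time.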

Abstract: One of the most profound challenges of modern machine learning is performing well on the long-tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize the output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredicted ways or rely on a fixed system prompt for maintaining performance. In this work, we ask: “Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?” We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% win rates in open-ended generation quality with our markers, we see over 9.1% gains in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.

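[88] Capacity Matters: a Proof-of-Concept for Transformer Memorization on Real-World Data cs.CL

Anton Changalidis, Aki Härmä

TL;DR: The paper studies how model architecture and data configuration affect the empirical memorization capacity of generative transformers. Experiments on synthetic text datasets show that embedding size is the primary determinant, and that the Softmax activation function is more stable.

Details

Motivation: The study aims to understand the memorization capacity of transformer models on real-world data and how model design and data configuration can optimize it.

Result: Embedding size dominates learning speed and capacity; additional layers offer limited benefit and may hurt on simpler datasets; Softmax shows greater stability and capacity.

Insight: Increasing dataset complexity appears to improve final memorization, and the findings offer a practical framework for optimizing transformer model design.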

Abstract: This paper studies how the model architecture and data configurations influence the empirical memorization capacity of generative transformers. The models are trained using synthetic text datasets derived from the Systematized Nomenclature of Medicine (SNOMED) knowledge graph: triplets, representing static connections, and sequences, simulating complex relation patterns. The results show that embedding size is the primary determinant of learning speed and capacity, while additional layers provide limited benefits and may hinder performance on simpler datasets. Activation functions play a crucial role, and Softmax demonstrates greater stability and capacity. Furthermore, increasing the complexity of the data set seems to improve the final memorization. These insights improve our understanding of transformer memory mechanisms and provide a framework for optimizing model design with structured real-world data.

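[89] Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs cs.CL | cs.AI

Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu

TL;DR: Ring-lite is a Mixture-of-Experts (MoE) large language model optimized with reinforcement learning for efficient and robust reasoning. It matches the performance of state-of-the-art small-scale reasoning models while activating only one-third of the parameters that comparable models require.

Details

Motivation: Current large language models consume substantial compute on reasoning tasks; Ring-lite aims for efficient reasoning by optimizing the training method and the number of activated parameters.

Result: It matches SOTA small-scale reasoning models on benchmarks such as AIME and LiveCodeBench while activating only a fraction of the parameters.

Insight: Algorithm-system co-design is key to training stability, and training on multi-domain data needs a dedicated training strategy to avoid domain conflicts.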

Abstract: We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed datasets. We will release the model, dataset, and code.

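[90] Reasoning with Exploration: An Entropy Perspective cs.CL

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao

TL;DR: The paper examines the balance between exploration and exploitation in language model (LM) reasoning, revisits exploratory reasoning through the signal of entropy, and proposes a simple modification that strengthens reasoning.

Details

Motivation: Existing LM reasoning methods lean toward exploitation rather than exploration and run into performance plateaus. The paper takes the entropy perspective from reinforcement learning to investigate how exploratory reasoning can break through this bottleneck.

Result: The method yields significant gains on the Pass@K metric, even at very large K, pushing the boundaries of LM reasoning.

Insight: Entropy is not only a signal of exploration; it can also guide language models towards longer and deeper reasoning chains, suggesting a new direction for improving reasoning.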

Abstract: Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy – a signal of exploration in RL – and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric – an upper-bound estimator of LM reasoning capabilities – even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
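Two ideas in this abstract lend themselves to a small sketch: adding an entropy-based term to the advantage, and estimating Pass@K. The abstract does not give the exact form of the entropy term, so the plain additive bonus with coefficient `alpha` below is an assumption, not the authors' formula. The Pass@K function uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples with c correct; whether the paper uses this exact estimator is also an assumption.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaped_advantage(advantage, probs, alpha=0.1):
    """Advantage plus an assumed additive entropy bonus (illustrative only)."""
    return advantage + alpha * token_entropy(probs)


def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate from n samples of which c are correct."""
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# A confident token gets no bonus; a uniform distribution over 4 tokens gets the most.
confident = shaped_advantage(1.0, [1.0, 0.0])
uncertain = shaped_advantage(1.0, [0.25, 0.25, 0.25, 0.25])
```

In practice an entropy term of this kind is usually treated as a constant with respect to gradients and clipped so that it cannot dominate or flip the sign of the original advantage; the sketch omits both details.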

cs.MA

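[91] Investigating the Potential of Large Language Model-Based Router Multi-Agent Architectures for Foundation Design Automation: A Task Classification and Expert Selection Study cs.MA | cs.AI | cs.CL

Sompote Youwai, David Phim, Vianne Gayl Murcia, Rianne Clair Onas

TL;DR: The paper studies router-based multi-agent systems for foundation design automation; task classification and expert selection improve performance, with strong results on shallow foundation and pile design.

Details

Motivation: To explore how large language models (LLMs) and multi-agent architectures can automate foundation design calculations while meeting professional standards and safety requirements.

Result: The router-based configuration achieves performance scores of 95.00% for shallow foundations and 90.63% for pile design, outperforming single-agent processing and conventional agentic workflows. Grok 3 shows the best standalone performance.

Insight: The results position router-based multi-agent systems as the best of the evaluated options for foundation design automation, but human oversight must be retained to ensure safety and professional quality in engineering use.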

Abstract: This study investigates router-based multi-agent systems for automating foundation design calculations through intelligent task classification and expert selection. Three approaches were evaluated: single-agent processing, multi-agent designer-checker architecture, and router-based expert selection. Performance assessment utilized baseline models including DeepSeek R1, ChatGPT 4 Turbo, Grok 3, and Gemini 2.5 Pro across shallow foundation and pile design scenarios. The router-based configuration achieved performance scores of 95.00% for shallow foundations and 90.63% for pile design, representing improvements of 8.75 and 3.13 percentage points over standalone Grok 3 performance respectively. The system outperformed conventional agentic workflows by 10.0 to 43.75 percentage points. Grok 3 demonstrated superior standalone performance without external computational tools, indicating advances in direct LLM mathematical reasoning for engineering applications. The dual-tier classification framework successfully distinguished foundation types, enabling appropriate analytical approaches. Results establish router-based multi-agent systems as optimal for foundation design automation while maintaining professional documentation standards. Given safety-critical requirements in civil engineering, continued human oversight remains essential, positioning these systems as advanced computational assistance tools rather than autonomous design replacements in professional practice.

eess.AS

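[92] Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience eess.AS | cs.CL | cs.HC | cs.LG | cs.MM

Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang

TL;DR: Using semi-supervised learning (SSL) and multimodal fusion, the paper reduces the amount of annotated data needed to model videoconference conversation experience, yielding an annotation-efficient and well-performing predictive model.

Details

Motivation: Moments of negative experience in videoconferences (such as conversations that lose fluidity or enjoyment) are rare in naturalistic data, and fully supervised learning (SL) requires large amounts of costly manual annotation. To address this, the paper uses SSL to train on labeled and unlabeled data together, lowering annotation requirements.

Result: The model reaches an ROC-AUC of 0.9 and an F1 score of 0.6; with only 8% labeled data, it matches 96% of the full-data SL performance, greatly reducing annotation cost.

Insight: Combining SSL with multimodal fusion offers an efficient way to model videoconference experience, sharply lowering annotation cost with little loss in performance, especially when labeled data are scarce.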

Abstract: Group conversations over videoconferencing are a complex social behavior. However, the subjective moments of negative experience, where the conversation loses fluidity or enjoyment, remain understudied. These moments are infrequent in naturalistic data, and thus training a supervised learning (SL) model requires costly manual data annotation. We applied semi-supervised learning (SSL) to leverage targeted labeled and unlabeled clips for training multimodal (audio, facial, text) deep features to predict non-fluid or unenjoyable moments in holdout videoconference sessions. The modality-fused co-training SSL achieved an ROC-AUC of 0.9 and an F1 score of 0.6, outperforming SL models by up to 4% with the same amount of labeled data. Remarkably, the best SSL model with just 8% labeled data matched 96% of the SL model’s full-data performance. This shows an annotation-efficient framework for modeling videoconference experience.

cs.GR [Back]

[93] ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies cs.GR | cs.CV

Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma

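TL;DR: ImmerseGen is a lightweight, agent-guided framework for 3D scene generation. It represents scenes with simple geometric proxies and synthesizes RGBA textures onto them, achieving photorealism while supporting real-time VR rendering.

Details

Motivation: Existing methods rely on either high-poly mesh modeling or massive sets of 3D Gaussians, which leads to complex pipelines or limited visual realism. ImmerseGen aims to simplify the pipeline while improving visual quality.

Result: Experiments show that ImmerseGen outperforms prior methods in photorealism, spatial coherence, and rendering efficiency, and that it is suitable for mobile VR headsets.

Insight: Pairing lightweight proxies with photorealistic textures is an efficient and practical approach to 3D scene generation, particularly for real-time VR applications.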

Abstract: Automatic creation of 3D scenes for immersive VR presence has been a significant research focus for decades. However, existing methods often rely on either high-poly mesh modeling with post-hoc simplification or massive 3D Gaussians, resulting in a complex pipeline or limited visual realism. In this paper, we demonstrate that such exhaustive modeling is unnecessary for achieving a compelling immersive experience. We introduce ImmerseGen, a novel agent-guided framework for compact and photorealistic world modeling. ImmerseGen represents scenes as hierarchical compositions of lightweight geometric proxies, i.e., simplified terrain and billboard meshes, and generates photorealistic appearance by synthesizing RGBA textures onto these proxies. Specifically, we propose terrain-conditioned texturing for user-centric base world synthesis, and RGBA asset texturing for midground and foreground scenery. This reformulation offers several advantages: (i) it simplifies modeling by enabling agents to guide generative models in producing coherent textures that integrate seamlessly with the scene; (ii) it bypasses complex geometry creation and decimation by directly synthesizing photorealistic textures on proxies, preserving visual quality without degradation; (iii) it enables compact representations suitable for real-time rendering on mobile VR headsets. To automate scene creation from text prompts, we introduce VLM-based modeling agents enhanced with semantic grid-based analysis for improved spatial reasoning and accurate asset placement. ImmerseGen further enriches scenes with dynamic effects and ambient audio to support multisensory immersion. Experiments on scene generation and live VR showcases demonstrate that ImmerseGen achieves superior photorealism, spatial coherence and rendering efficiency compared to prior methods. Project webpage: https://immersegen.github.io.

eess.IV [Back]

[94] BraTS orchestrator: Democratizing and Disseminating state-of-the-art brain tumor image analysis eess.IV | cs.AI | cs.CV

Florian Kofler, Marcel Rosier, Mehdi Astaraki, Ujjwal Baid, Hendrik Möller

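TL;DR: BraTS orchestrator is an open-source Python package that simplifies the use of state-of-the-art brain tumor segmentation and synthesis algorithms from the BraTS challenges, so that researchers and clinicians can deploy them easily.

Details

Motivation: Although the BraTS challenges have substantially advanced brain tumor image analysis, the resulting algorithms and models have seen limited adoption in the scientific and clinical communities. BraTS orchestrator addresses this with an open-source package that lowers the technical barrier and accelerates dissemination.

Result: BraTS orchestrator has been released as open source on GitHub, giving researchers and clinicians direct access to these algorithms.

Insight: Packaging algorithms as an open-source toolkit can markedly improve their accessibility and practical use, especially in medical image analysis, and can help translate research results into practice.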

Abstract: The Brain Tumor Segmentation (BraTS) cluster of challenges has significantly advanced brain tumor image analysis by providing large, curated datasets and addressing clinically relevant tasks. However, despite its success and popularity, algorithms and models developed through BraTS have seen limited adoption in both scientific and clinical communities. To accelerate their dissemination, we introduce BraTS orchestrator, an open-source Python package that provides seamless access to state-of-the-art segmentation and synthesis algorithms for diverse brain tumors from the BraTS challenge ecosystem. Available on GitHub (https://github.com/BrainLesion/BraTS), the package features intuitive tutorials designed for users with minimal programming experience, enabling both researchers and clinicians to easily deploy winning BraTS algorithms for inference. By abstracting the complexities of modern deep learning, BraTS orchestrator democratizes access to the specialized knowledge developed within the BraTS community, making these advances readily available to broader neuro-radiology and neuro-oncology audiences.

[95] Reliable Noninvasive Glucose Sensing via CNN-Based Spectroscopy eess.IV | cs.CV | cs.LG

El Arbi Belfarsi, Henry Flores, Maria Valero

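TL;DR: The paper proposes a dual-modal AI framework based on short-wave infrared (SWIR) spectroscopy, combining CNNs and machine learning regressors, for noninvasive glucose monitoring that balances clinical accuracy and cost efficiency.

Details

Motivation: Existing noninvasive glucose monitoring methods face challenges in accuracy and cost, which motivates a reliable, wearable solution.

Result: The CNN achieved a MAPE of 4.82% at 650 nm, and the photodiode system reached 86.4% Zone A coverage in the Clarke Error Grid.

Insight: The dual-modal approach combines deep learning with classical machine learning, offering an accurate and practical route to noninvasive glucose monitoring on wearable devices.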

Abstract: In this study, we present a dual-modal AI framework based on short-wave infrared (SWIR) spectroscopy. The first modality employs a multi-wavelength SWIR imaging system coupled with convolutional neural networks (CNNs) to capture spatial features linked to glucose absorption. The second modality uses a compact photodiode voltage sensor and machine learning regressors (e.g., random forest) on normalized optical signals. Both approaches were evaluated on synthetic blood phantoms and skin-mimicking materials across physiological glucose levels (70 to 200 mg/dL). The CNN achieved a mean absolute percentage error (MAPE) of 4.82% at 650 nm with 100% Zone A coverage in the Clarke Error Grid, while the photodiode system reached 86.4% Zone A accuracy. This framework constitutes a state-of-the-art solution that balances clinical accuracy, cost efficiency, and wearable integration, paving the way for reliable continuous non-invasive glucose monitoring.
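As a rough illustration of the two metrics quoted above, the following Python sketch computes MAPE and the share of readings in Zone A of the Clarke Error Grid. It assumes the commonly used Zone A rule (prediction within 20% of the reference, or both values below 70 mg/dL); the readings are made up for illustration and are not data from the paper.

```python
def mape(reference, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - r) / r for r, p in zip(reference, predicted)]
    return 100.0 * sum(errors) / len(errors)


def in_clarke_zone_a(reference, predicted):
    """Zone A rule assumed here: within 20% of the reference,
    or both readings below 70 mg/dL."""
    if reference < 70 and predicted < 70:
        return True
    return abs(predicted - reference) <= 0.2 * reference


def zone_a_coverage(reference, predicted):
    """Percentage of (reference, predicted) pairs that fall in Zone A."""
    hits = sum(in_clarke_zone_a(r, p) for r, p in zip(reference, predicted))
    return 100.0 * hits / len(reference)


# Hypothetical glucose readings in mg/dL (illustration only).
reference = [70.0, 100.0, 150.0, 200.0]
predicted = [77.0, 95.0, 150.0, 250.0]
print(f"MAPE: {mape(reference, predicted):.2f}%")
print(f"Zone A coverage: {zone_a_coverage(reference, predicted):.1f}%")
```

In this toy data the last reading is off by 25%, so it falls outside Zone A while still contributing to the MAPE.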

[96] BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification with Swin-HAFNet eess.IV | cs.CV

Amirreza Fateh, Yasin Rezvani, Sara Moayedi, Sadjad Rezvani, Fatemeh Fateh

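TL;DR: The paper introduces BRISC, a high-quality MRI dataset of 6,000 annotated samples for brain tumor segmentation and classification, and proposes a transformer-based segmentation model that achieves a weighted mean IoU of 82.3%.

Details

Motivation: Brain tumor segmentation and classification remain challenging in medical image analysis, mainly because high-quality, balanced, and diverse datasets are lacking. The authors aim to improve model performance with a new dataset and to lay a foundation for future research.

Result: The model reaches a weighted mean IoU of 82.3%, with improvements across all tumor categories, providing a benchmark for future work.

Insight: High-quality datasets are essential to performance on medical imaging tasks, and the results indicate the potential of transformer architectures in medical image analysis.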

Abstract: Accurate segmentation and classification of brain tumors from Magnetic Resonance Imaging (MRI) remain key challenges in medical image analysis, largely due to the lack of high-quality, balanced, and diverse datasets. In this work, we present a new curated MRI dataset designed specifically for brain tumor segmentation and classification tasks. The dataset comprises 6,000 contrast-enhanced T1-weighted MRI scans annotated by certified radiologists and physicians, spanning three major tumor types (glioma, meningioma, and pituitary) as well as non-tumorous cases. Each sample includes high-resolution labels and is categorized across axial, sagittal, and coronal imaging planes to facilitate robust model development and cross-view generalization. To demonstrate the utility of the dataset, we propose a transformer-based segmentation model and benchmark it against established baselines. Our method achieves the highest weighted mean Intersection-over-Union (IoU) of 82.3%, with improvements observed across all tumor categories. Importantly, this study serves primarily as an introduction to the dataset, establishing foundational benchmarks for future research. We envision this dataset as a valuable resource for advancing machine learning applications in neuro-oncology, supporting both academic research and clinical decision-support development. Dataset link: https://www.kaggle.com/datasets/briscdataset/brisc2025/
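To make the headline metric concrete, here is a minimal Python sketch of IoU on binary masks and a weighted mean over classes. The abstract does not say how the weights are chosen, so weighting by per-class sample count is an assumption, and the per-class scores and counts below are invented for illustration.

```python
def iou(pred_mask, true_mask):
    """Intersection-over-Union of two flat binary masks (sequences of 0/1)."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return inter / union if union else 1.0  # two empty masks agree perfectly


def weighted_mean_iou(per_class_iou, class_weights):
    """Average per-class IoU, weighting each class by class_weights[c].
    Using sample counts as weights is an assumption of this sketch."""
    total = sum(class_weights.values())
    return sum(per_class_iou[c] * class_weights[c] for c in per_class_iou) / total


# Hypothetical per-class IoU values and sample counts (not from the paper).
per_class_iou = {"glioma": 0.8, "meningioma": 0.9, "pituitary": 0.7}
class_weights = {"glioma": 2, "meningioma": 1, "pituitary": 1}
print(iou([1, 1, 0, 0], [1, 0, 0, 0]))
print(weighted_mean_iou(per_class_iou, class_weights))
```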

[97] Compressed Video Super-Resolution based on Hierarchical Encoding eess.IV | cs.CV

Yuxuan Jiang, Siyue Teng, Qiang Zhu, Chen Feng, Chengxi Zeng

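TL;DR: The paper presents VSR-HE, a general-purpose video super-resolution method that enhances the perceptual quality of compressed video, with a focus on removing compression artifacts introduced by H.265/HEVC encoding. It is trained and evaluated under diverse compression settings to improve robustness and generalization, and it has been submitted to an ICME 2025 Grand Challenge.

Details

Motivation: To address the quality degradation of heavily compressed video, particularly the artifacts introduced by H.265/HEVC encoding, the authors design a super-resolution method that upscales low-resolution video while removing those artifacts.

Result: VSR-HE upscales compressed video and substantially reduces compression artifacts, performs consistently across a range of compression settings, and preserves visual fidelity.

Insight: In video super-resolution, combining compression artifact removal with a hierarchical encoding design can markedly improve the perceptual quality of heavily compressed video. The method is intended to adapt to real-world video content.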

Abstract: This paper presents a general-purpose video super-resolution (VSR) method, dubbed VSR-HE, specifically designed to enhance the perceptual quality of compressed content. Targeting scenarios characterized by heavy compression, the method upscales low-resolution videos by a ratio of four, from 180p to 720p or from 270p to 1080p. VSR-HE adopts hierarchical encoding transformer blocks and has been carefully optimized to eliminate a wide range of compression artifacts commonly introduced by H.265/HEVC encoding across various quantization parameter (QP) levels. To ensure robustness and generalization, the model is trained and evaluated under diverse compression settings, allowing it to effectively restore fine-grained details and preserve visual fidelity. The proposed VSR-HE has been officially submitted to the ICME 2025 Grand Challenge on VSR for Video Conferencing (Team BVI-VSR), under both Track 1 (General-Purpose Real-World Video Content) and Track 2 (Talking Head Videos).

[98] A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning eess.IV | cs.CV

Asbjørn Munk, Stefano Cerri, Jakob Ambsdorf, Julia Machnio, Sebastian Nørgaard Llambias

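TL;DR: The paper presents FOMO60K, a large-scale heterogeneous dataset of 60,529 brain MRI scans, intended to support the development and benchmarking of self-supervised learning methods in medical imaging.

Details

Motivation: Medical imaging currently lacks large, diverse datasets, particularly for self-supervised learning. FOMO60K is intended to fill this gap and support research in the area.

Result: FOMO60K contains 60,529 scans from 11,187 subjects and supports the development and evaluation of self-supervised learning methods.

Insight: The diversity and scale of the dataset make it an important resource for self-supervised learning in medical imaging, and it may help advance the field.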

Abstract: We present FOMO60K, a large-scale, heterogeneous dataset of 60,529 brain Magnetic Resonance Imaging (MRI) scans from 13,900 sessions and 11,187 subjects, aggregated from 16 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing barriers to entry for new users. Accompanying code for self-supervised pretraining and finetuning is provided. FOMO60K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.

[99] Plug-and-Play with 2.5D Artifact Reduction Prior for Fast and Accurate Industrial Computed Tomography Reconstruction eess.IV | cs.CV

Haley Duba-Sullivan, Aniket Pramanik, Venkatakrishnan Singanallur, Amirkoushyar Ziabari

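TL;DR: The paper proposes a plug-and-play (PnP) method with a 2.5D artifact reduction prior for fast and accurate industrial CT reconstruction. Using information from adjacent slices improves reconstruction quality and generalizability.

Details

Motivation: High-quality CT reconstruction typically requires many X-ray measurements, which is slow and expensive. Existing approaches to sparse-view reconstruction use 2D CNN priors that ignore inter-slice information, which limits their performance.

Result: Experiments on synthetic and experimental data show that the method outperforms 2D priors, better preserving structural details such as pore size and shape and improving defect detection accuracy.

Insight: A 2.5D prior trained entirely on simulated scans generalizes to experimental data, indicating potential for cross-domain generalization.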

Abstract: Cone-beam X-ray computed tomography (XCT) is an essential imaging technique for generating 3D reconstructions of internal structures, with applications ranging from medical to industrial imaging. Producing high-quality reconstructions typically requires many X-ray measurements; this process can be slow and expensive, especially for dense materials. Recent work incorporating artifact reduction priors within a plug-and-play (PnP) reconstruction framework has shown promising results in improving image quality from sparse-view XCT scans while enhancing the generalizability of deep learning-based solutions. However, this method uses a 2D convolutional neural network (CNN) for artifact reduction, which captures only slice-independent information from the 3D reconstruction, limiting performance. In this paper, we propose a PnP reconstruction method that uses a 2.5D artifact reduction CNN as the prior. This approach leverages inter-slice information from adjacent slices, capturing richer spatial context while remaining computationally efficient. We show that this 2.5D prior not only improves the quality of reconstructions but also enables the model to directly suppress commonly occurring XCT artifacts (such as beam hardening), eliminating the need for artifact correction pre-processing. Experiments on both experimental and synthetic cone-beam XCT data demonstrate that the proposed method better preserves fine structural details, such as pore size and shape, leading to more accurate defect detection compared to 2D priors. In particular, we demonstrate strong performance on experimental XCT data using a 2.5D artifact reduction prior trained entirely on simulated scans, highlighting the proposed method’s ability to generalize across domains.
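The idea of a 2.5D input can be sketched as follows: instead of feeding the network one slice at a time, each slice is stacked with its neighbors along a channel axis. The number of neighbors and the edge handling (repeating the first or last slice) are assumptions of this sketch; the abstract does not specify them.

```python
def stack_2p5d(volume, z, half_width=1):
    """Return slices z-half_width .. z+half_width of a 3D volume
    (a list of 2D slices), clamping indices at the volume edges."""
    last = len(volume) - 1
    return [volume[min(max(i, 0), last)]
            for i in range(z - half_width, z + half_width + 1)]


# Tiny hypothetical volume: four 1x1 slices with values 0..3.
volume = [[[0]], [[1]], [[2]], [[3]]]
print(stack_2p5d(volume, 0))  # first slice is repeated at the lower edge
print(stack_2p5d(volume, 2))  # interior slice with both neighbors
```

A 2D prior would see only `volume[z]`; the stacked version gives the network inter-slice context at the cost of a few extra input channels.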

cs.CR [Back]

[100] Busting the Paper Ballot: Voting Meets Adversarial Machine Learning cs.CR | cs.CV | cs.LG

Kaleel Mahmood, Caleb Manicke, Ethan Rathbun, Aayushi Verma, Sohaib Ahmad

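TL;DR: The paper examines the security risks of using machine learning classifiers in United States election tabulators, and proposes and validates adversarial attacks tailored to the election domain.

Details

Motivation: The work aims to expose potential security vulnerabilities of machine learning classifiers used in election tabulation, particularly their fragility under adversarial attack, and to offer a new perspective on election security.

Result: The study shows that even a 5% attack success rate can flip the outcome of certain elections, and demonstrates that adversarial attacks are feasible in the physical world.

Insight: Adversarial attacks in the election domain have their own characteristics. The conventional criterion of a high attack success rate does not apply, and the gradient masking observed in this setting stems from numerical instability and requires a dedicated method to overcome.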

Abstract: We show the security risk associated with using machine learning classifiers in United States election tabulators. The central classification task in election tabulation is deciding whether a mark does or does not appear on a bubble associated to an alternative in a contest on the ballot. Barretto et al. (E-Vote-ID 2021) reported that convolutional neural networks are a viable option in this field, as they outperform simple feature-based classifiers. Our contributions to election security can be divided into four parts. To demonstrate and analyze the hypothetical vulnerability of machine learning models on election tabulators, we first introduce four new ballot datasets. Second, we train and test a variety of different models on our new datasets. These models include support vector machines, convolutional neural networks (a basic CNN, VGG and ResNet), and vision transformers (Twins and CaiT). Third, using our new datasets and trained models, we demonstrate that traditional white box attacks are ineffective in the voting domain due to gradient masking. Our analyses further reveal that gradient masking is a product of numerical instability. We use a modified difference of logits ratio loss to overcome this issue (Croce and Hein, ICML 2020). Fourth, in the physical world, we conduct attacks with the adversarial examples generated using our new methods. In traditional adversarial machine learning, a high (50% or greater) attack success rate is ideal. However, for certain elections, even a 5% attack success rate can flip the outcome of a race. We show such an impact is possible in the physical domain. We thoroughly discuss attack realism, and the challenges and practicality associated with printing and scanning ballot adversarial examples.
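The claim that a low attack success rate can still matter is easy to check with arithmetic. The sketch below uses a deliberately simplified, hypothetical model that is not taken from the paper: every ballot for the leading candidate is targeted, and each successful attack causes that mark to be read as unmarked, so the leader loses one vote.

```python
def outcome_after_attack(leader_votes, runner_up_votes, targeted_ballots, success_rate):
    """Simplified model (an assumption of this sketch): each successful
    attack makes one of the leader's marks read as blank."""
    lost = round(targeted_ballots * success_rate)
    leader_after = leader_votes - lost
    flipped = leader_after < runner_up_votes
    return leader_after, runner_up_votes, flipped


# Hypothetical race decided 51% to 49% out of 10,000 ballots.
print(outcome_after_attack(5100, 4900, 5100, 0.05))
print(outcome_after_attack(5100, 4900, 5100, 0.01))
```

With these made-up numbers, a 5% success rate removes 255 votes from the leader, more than the 200-vote margin, while a 1% success rate does not change the winner.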

cs.AI [Back]

[101] ICE-ID: A Novel Historical Census Data Benchmark Comparing NARS against LLMs, & a ML Ensemble on Longitudinal Identity Resolution cs.AI | cs.CL | cs.LG | stat.AP

Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Kristinn R. Thórisson, Tangrui Li

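TL;DR: ICE-ID is a new historical census benchmark dataset covering 220 years of Icelandic records, used to evaluate and compare NARS against LLMs and an ML ensemble on identity resolution.

Details

Motivation: Prior work lacks a large-scale open dataset for long-term identity resolution, and conventional methods perform poorly on complex, multi-generational data.

Result: NARS performs strongly on the identity resolution tasks, reaching state-of-the-art (SOTA) results on the benchmark.

Insight: NARS handles complex historical data effectively even with limited resources, providing a new tool for cross-disciplinary research.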

Abstract: We introduce ICE-ID, a novel benchmark dataset for historical identity resolution, comprising 220 years (1703-1920) of Icelandic census records. ICE-ID spans multiple generations of longitudinal data, capturing name variations, demographic changes, and rich genealogical links. To the best of our knowledge, this is the first large-scale, open tabular dataset specifically designed to study long-term person-entity matching in a real-world population. We define identity resolution tasks (within and across census waves) with clearly documented metrics and splits. We evaluate a range of methods: handcrafted rule-based matchers, an ML ensemble as well as LLMs for structured data (e.g. transformer-based tabular networks) against a novel approach to tabular data called NARS (Non-Axiomatic Reasoning System), a general-purpose AI framework designed to reason with limited knowledge and resources. Its core is Non-Axiomatic Logic (NAL), a term-based logic. Our experiments show that NARS is surprisingly simple and competitive with other standard approaches, achieving SOTA at our task. By releasing ICE-ID and our code, we enable reproducible benchmarking of identity resolution approaches in longitudinal settings and hope that ICE-ID opens new avenues for cross-disciplinary research in data linkage and historical analytics.

[102] Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs cs.AI | cs.CL

Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye

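TL;DR: The paper identifies a flaw in the conventional Pass@K metric as a measure of reasoning in large language models (LLMs) and proposes a new metric, CoT-Pass@K, which requires both the reasoning path and the final answer to be correct. Through theory and experiments, it argues that reinforcement learning with verifiable rewards (RLVR) incentivizes logical integrity and thereby improves reasoning.

Details

Motivation: RLVR-tuned models often underperform their base models on Pass@K, which has raised doubts about whether RLVR actually improves reasoning. The paper aims to resolve this contradiction and to propose a more accurate evaluation method.

Result: Measured with CoT-Pass@K, RLVR improves correct reasoning for all values of K. The improvement emerges early in training and generalizes.

Insight: Pass@K can mask problems in a model's reasoning path, whereas RLVR, by incentivizing correct reasoning paths, can genuinely improve logical reasoning.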

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the $Pass@K$ metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using $CoT$-$Pass@K$, we observe that RLVR can incentivize the generalization of correct reasoning for all values of $K$. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
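The difference between the two metrics can be sketched in a few lines of Python. The sketch uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct; whether the paper uses exactly this estimator is an assumption, and the per-sample judgments of chain-of-thought correctness are taken as given booleans because the abstract does not say how they are obtained.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k samples
    is correct, given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cot_pass_at_k(samples, k):
    """samples: list of (answer_correct, cot_correct) booleans.
    A sample counts only if both the answer and the reasoning are correct."""
    n = len(samples)
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(n, c, k)


# Hypothetical judgments for four samples of one problem.
samples = [(True, True), (True, False), (False, False), (False, False)]
answers_correct = sum(1 for answer_ok, _ in samples if answer_ok)
print(pass_at_k(len(samples), answers_correct, 2))  # answer-only credit
print(cot_pass_at_k(samples, 2))                    # answer and reasoning
```

The second sample reaches the right answer through flawed reasoning, so it raises Pass@K but not CoT-Pass@K.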

[103] Optimizing Length Compression in Large Reasoning Models cs.AI | cs.CL

Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou

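TL;DR: To address overly verbose reasoning chains in Large Reasoning Models (LRMs), the paper proposes two new principles (Brevity and Sufficiency) and LC-R1, a GRPO-based post-training method that reduces reasoning sequence length by about 50% with only about a 2% drop in accuracy.

Details

Motivation: LRMs often produce redundant “invalid thinking” during reasoning: the model keeps re-checking its work after it has already reached the correct answer, which hurts efficiency.

Result: Across multiple reasoning benchmarks, LC-R1 reduces sequence length by about 50% with only about a 2% drop in accuracy, reaching a favorable trade-off point on the Pareto frontier.

Insight: 1. Redundant reasoning is a core cause of inefficiency in LRMs; 2. Explicitly distinguishing valid from invalid reasoning steps is important for optimizing these models.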

Abstract: Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking”: models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.

physics.optics [Back]

[104] MobileHolo: A Lightweight Complex-Valued Deformable CNN for High-Quality Computer-Generated Hologram physics.optics | cs.CV

Xie Shuyang, Zhou Jie, Xu Bo, Wang Jun, Xu Renjing

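TL;DR: MobileHolo is a lightweight method for high-quality computer-generated holograms (CGH) based on a complex-valued deformable convolutional neural network. By dynamically adjusting the shape of the convolution kernel, it makes the effective receptive field (ERF) more flexible and achieves state-of-the-art performance in both simulated and optical experiments.

Details

Motivation: Holographic displays have significant potential for virtual and augmented reality, but existing deep learning methods have an inadequate effective receptive field and therefore fail to model the diffraction process accurately, which limits performance.

Result: At a resolution of 1920×1072, the peak signal-to-noise ratio is 2.04 dB, 5.31 dB, and 9.71 dB higher than that of CCNN-CGH, HoloNet, and Holo-encoder, respectively.

Insight: Complex-valued deformable convolution extends the model's effective receptive field and improves feature extraction while keeping the model lightweight, which makes it suitable for practical use.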

Abstract: Holographic displays have significant potential in virtual reality and augmented reality owing to their ability to provide all the depth cues. Deep learning-based methods play an important role in computer-generated holograms (CGH). During the diffraction process, each pixel exerts an influence on the reconstructed image. However, previous works face challenges in capturing sufficient information to accurately model this process, primarily due to the inadequacy of their effective receptive field (ERF). Here, we designed a complex-valued deformable convolution for integration into the network, enabling dynamic adjustment of the convolution kernel’s shape to increase the flexibility of the ERF for better feature extraction. This approach allows us to utilize a single model while achieving state-of-the-art performance in both simulated and optical experiment reconstructions, surpassing existing open-source models. Specifically, our method has a peak signal-to-noise ratio that is 2.04 dB, 5.31 dB, and 9.71 dB higher than that of CCNN-CGH, HoloNet, and Holo-encoder, respectively, when the resolution is 1920$\times$1072. The number of parameters of our model is only about one-eighth of that of CCNN-CGH.
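For readers less familiar with the metric, the following sketch shows how PSNR is computed and how a gap in dB translates into a ratio of mean squared errors. It assumes intensities normalized to the range 0 to 1; the pixel values are invented for illustration.

```python
from math import log10


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two flat sequences of pixel values."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * log10(max_value ** 2 / mse)


def mse_ratio_from_db_gap(gap_db):
    """Factor by which the MSE shrinks when PSNR rises by gap_db."""
    return 10.0 ** (gap_db / 10.0)


# Hypothetical two-pixel example with an error of 0.1 on each pixel.
print(psnr([0.5, 0.5], [0.6, 0.4]))
print(mse_ratio_from_db_gap(2.04))
```

Under this definition, a PSNR that is 2.04 dB higher corresponds to a mean squared error roughly 1.6 times lower.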

cs.IR [Back]

[105] InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking cs.IR | cs.AI | cs.CL

Rahul Seetharaman, Kaustubh D. Dhole, Aman Bansal

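TL;DR: InsertRank is an LLM-based reranker that uses lexical signals such as BM25 scores to improve retrieval for complex queries, and it performs strongly on multi-domain and medical reasoning benchmarks.

Details

Motivation: As LLM chat interfaces become widespread, users pose more complex queries that require reasoning rather than simple keyword matching. Existing LLM reranking methods leave room for improvement in exploiting reasoning ability to improve retrieval.

Result: With Deepseek-R1, InsertRank scores 37.5 on BRIGHT and 51.1 on R2MED, surpassing previous methods, and it improves retrieval consistently across several LLM families.

Insight: Injecting traditional retrieval signals such as BM25 can markedly improve LLM performance on reasoning-intensive reranking, suggesting a way to combine lexical and LLM-based signals in information retrieval.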

Abstract: Large Language Models (LLMs) have demonstrated significant strides across various information retrieval tasks, particularly as rerankers, owing to their strong generalization and knowledge-transfer capabilities acquired from extensive pretraining. In parallel, the rise of LLM-based chat interfaces has raised user expectations, encouraging users to pose more complex queries that necessitate retrieval by “reasoning” over documents rather than through simple keyword matching or semantic similarity. While some recent efforts have exploited reasoning abilities of LLMs for reranking such queries, considerable potential for improvement remains. In that regard, we introduce InsertRank, an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance. InsertRank demonstrates improved retrieval effectiveness on BRIGHT, a reasoning benchmark spanning 12 diverse domains, and R2MED, a specialized medical reasoning retrieval benchmark spanning 8 different tasks. We conduct an exhaustive evaluation and several ablation studies and demonstrate that InsertRank consistently improves retrieval effectiveness across multiple families of LLMs, including GPT, Gemini, and Deepseek models. In addition, we conduct ablation studies on normalization, by varying the scale of the BM25 scores, and on positional bias, by shuffling the order of the documents. With Deepseek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark and 51.1 on the R2MED benchmark, surpassing previous methods.
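The core idea, exposing BM25 scores to the LLM inside a listwise reranking prompt, can be sketched as below. This is a minimal illustration under stated assumptions: the BM25 variant (Okapi with a smoothed IDF, k1 = 1.5, b = 0.75), the whitespace tokenizer, and the prompt wording are all choices made for this sketch and are not taken from the paper.

```python
from collections import Counter
from math import log


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_prompt(query, docs, scores):
    """Listwise prompt with the BM25 score inserted next to each passage.
    The wording is hypothetical."""
    lines = [f"Query: {query}",
             "Rank the passages from most to least relevant."]
    for i, (doc, score) in enumerate(zip(docs, scores), start=1):
        lines.append(f"[{i}] (BM25 score: {score:.2f}) {doc}")
    return "\n".join(lines)


docs = ["insulin lowers blood glucose", "the cat sat on the mat"]
scores = bm25_scores("blood glucose", docs)
print(build_prompt("blood glucose", docs, scores))
```

The resulting string would then be sent to whichever LLM performs the reranking; that call is omitted here because it depends on the provider.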

cs.RO [Back]

[106] GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation cs.RO | cs.CV

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Liangjun Xing

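TL;DR: The paper proposes the Gaussian Action Field (GAF) as a dynamic world model. Using a 4D representation with learnable motion attributes, it infers actions directly from motion-aware representations and substantially improves the accuracy and quality of robotic manipulation.

Details

Motivation: Existing Vision-to-Action (V-A) and Vision-to-3D-to-Action (V-3D-A) methods infer actions inaccurately in complex, dynamic manipulation scenes, so a framework that models dynamic scenes and manipulation actions together is needed.

Result: Experiments show that GAF substantially outperforms existing methods in reconstruction quality and task success: PSNR improves by 11.5385 dB, LPIPS decreases by 0.5574, and the average task success rate rises by 10.33%.

Insight: The results suggest that inferring actions directly from a 4D representation handles dynamic scenes better, and that a GAF-guided diffusion model can further refine the actions.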

Abstract: Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we propose a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing simultaneous modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF supports three key query types: reconstruction of the current scene, prediction of future frames, and estimation of initial action via robot motion. Furthermore, the high-quality current and future frames generated by GAF facilitate manipulation action refinement through a GAF-guided diffusion model. Extensive experiments demonstrate significant improvements, with GAF achieving +11.5385 dB PSNR and -0.5574 LPIPS improvements in reconstruction quality, while boosting the average success rate in robotic manipulation tasks by 10.33% over state-of-the-art methods. Project page: http://chaiying1.github.io/GAF.github.io/project_page/

[107] AMPLIFY: Actionless Motion Priors for Robot Learning from Videos cs.RO | cs.CV | cs.LG

Jeremy A. Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe

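TL;DR: AMPLIFY is a new framework that learns robot motion models from action-free video. By encoding visual dynamics as discrete motion tokens, it decouples visual motion prediction from action inference and substantially improves policy learning in low-data regimes.

Details

Motivation: Action-labeled data for robotics is scarce and expensive, while vast amounts of action-free video go underused. AMPLIFY aims to exploit this action-free data and address the difficulty of turning observations into effective policies.

Result: 1. Up to 3.7x better MSE in dynamics prediction and over 2.5x better pixel prediction accuracy; 2. A 1.2-2.2x improvement in policy learning in low-data regimes; 3. A 1.4x average improvement from learning on action-free human videos; 4. The first generalization to LIBERO tasks with zero in-distribution action data.

Insight: AMPLIFY shows that heterogeneous data sources (action-free video and action-labeled data) can be combined effectively, offering a new paradigm for building efficient, generalizable world models. Its dynamics model also improves video prediction quality.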

Abstract: Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at https://amplify-robotics.github.io/.

[108] GAMORA: A Gesture Articulated Meta Operative Robotic Arm for Hazardous Material Handling in Containment-Level Environments cs.RO | cs.AI | cs.CV

Farha Abdul Wasay, Mohammed Abdul Rahman, Hania Ghouse

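TL;DR: GAMORA is a virtual reality (VR)-guided robotic system that lets operators carry out hazardous tasks remotely using natural hand gestures, improving precision and safety in biosafety laboratory environments.

Details

Motivation: As biohazards grow more complex, it becomes essential to minimize direct human exposure while maintaining precision, which motivates a natural and safe remote operation solution.

Result: Reported results include a mean positional discrepancy of 2.2 mm, pipetting accuracy within 0.2 mL, and a 50% reduction in power output, indicating efficient and sustainable operation.

Insight: GAMORA offers a scalable solution for the safe automation of high-risk laboratory tasks.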

Abstract: The convergence of robotics and virtual reality (VR) has enabled safer and more efficient workflows in high-risk laboratory settings, particularly virology labs. As biohazard complexity increases, minimizing direct human exposure while maintaining precision becomes essential. We propose GAMORA (Gesture Articulated Meta Operative Robotic Arm), a novel VR-guided robotic system that enables remote execution of hazardous tasks using natural hand gestures. Unlike existing scripted automation or traditional teleoperation, GAMORA integrates the Oculus Quest 2, NVIDIA Jetson Nano, and Robot Operating System (ROS) to provide real-time immersive control, digital twin simulation, and inverse kinematics-based articulation. The system supports VR-based training and simulation while executing precision tasks in physical environments via a 3D-printed robotic arm. Inverse kinematics ensure accurate manipulation for delicate operations such as specimen handling and pipetting. The pipeline includes Unity-based 3D environment construction, real-time motion planning, and hardware-in-the-loop testing. GAMORA achieved a mean positional discrepancy of 2.2 mm (improved from 4 mm), pipetting accuracy within 0.2 mL, and repeatability of 1.2 mm across 50 trials. Integrated object detection via YOLOv8 enhances spatial awareness, while energy-efficient operation (50% reduced power output) ensures sustainable deployment. The system’s digital-physical feedback loop enables safe, precise, and repeatable automation of high-risk lab tasks. GAMORA offers a scalable, immersive solution for robotic control and biosafety in biomedical research environments.
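Inverse kinematics, which the abstract relies on for articulation, can be illustrated with the textbook closed-form solution for a planar two-link arm. This is only a simplified stand-in: the GAMORA arm's actual geometry and solver are not described in the abstract, and the link lengths and target below are hypothetical.

```python
from math import acos, atan2, cos, sin


def two_link_ik(x, y, l1, l2):
    """Joint angles (theta1, theta2), in radians, that place the tip of a
    planar two-link arm with link lengths l1 and l2 at the point (x, y)."""
    cos_t2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_t2 <= 1.0:
        raise ValueError("target is out of reach")
    theta2 = acos(cos_t2)  # one of the two mirror-image elbow solutions
    theta1 = atan2(y, x) - atan2(l2 * sin(theta2), l1 + l2 * cos(theta2))
    return theta1, theta2


def forward(theta1, theta2, l1, l2):
    """Forward kinematics, used here to check the IK result."""
    x = l1 * cos(theta1) + l2 * cos(theta1 + theta2)
    y = l1 * sin(theta1) + l2 * sin(theta1 + theta2)
    return x, y


# Hypothetical arm with two 0.15 m links reaching for the point (0.20, 0.10).
theta1, theta2 = two_link_ik(0.20, 0.10, 0.15, 0.15)
print(forward(theta1, theta2, 0.15, 0.15))
```

Feeding the solved angles back through the forward kinematics should reproduce the target position up to floating-point error.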

cs.SE [Back]

[109] CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios cs.SE | cs.CL

Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye

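TL;DR: The paper presents CRITICTOOL, a benchmark for evaluating the self-critique capabilities of large language models (LLMs) in tool-calling error scenarios. Built from a diverse dataset of tool-use errors, CRITICTOOL better reflects real-world scenarios and is used to analyze the tool reflection ability of different LLMs.

Details

Motivation: As tasks become more complex and long-horizon, LLMs can trigger a variety of errors when using external tools, and handling these errors effectively has become a key research direction.

Result: CRITICTOOL enables a more comprehensive evaluation of how well LLMs identify, diagnose, and recover from tool-use errors, and it offers a new perspective on tool learning.

Insight: Self-critique is key to LLM robustness in complex tool-use scenarios, and diverse benchmarks help advance tool learning.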

Abstract: The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL, and validate the generalization and effectiveness of our constructed benchmark strategy. We also provide an in-depth analysis of the tool reflection ability on various LLMs, offering a new perspective on the field of tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.

cs.LG

[110] Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models cs.LG | cs.AI | cs.CL

Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte

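TL;DR: The paper studies how reasoning models trained with reinforcement learning learn to solve new problems, and proposes Guide, an adaptive guidance algorithm that improves generalization.

Details

Motivation: The study explores how reasoning models learn to solve new problems through reinforcement learning on verifiable rewards (RLVR), with the aim of improving their learning efficiency and generalization.

Result: On 7B and 32B parameter models, Guide-GRPO achieves up to a 4% macro-average improvement across math benchmarks over vanilla GRPO.

Insight: Capability gain is driven primarily by self-distillation, while adaptive guidance significantly accelerates learning and improves generalization.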

Abstract: We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance through two main means: (1) by compressing pass@$k$ into pass@1 and (2) via “capability gain” in which models learn to solve new problems that they previously could not solve even at high $k$. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B on >500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@$k$ rates by leveraging natural language guidance for the model to consider within context while still requiring the model to derive a solution chain from scratch. Based on these insights, we derive Guide, a new class of online training algorithms. Guide adaptively incorporates hints into the model’s context on problems for which all rollouts were initially incorrect and adjusts the importance sampling ratio for the “off-policy” trajectories in order to optimize the policy for contexts in which the hints are no longer present. We describe variants of Guide for GRPO and PPO and empirically show that Guide-GRPO on 7B and 32B parameter models improves generalization over its vanilla counterpart with up to 4% macro-average improvement across math benchmarks. We include careful ablations to analyze Guide’s components and theoretically analyze Guide’s learning efficiency.
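
To make the phrase "compressing pass@$k$ into pass@1" concrete, the sketch below computes pass@$k$ with the widely used unbiased estimator from code-generation evaluation, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The abstract does not say which estimator the authors use, so this choice and the function name `pass_at_k` are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains a correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # A problem solved by 1 of 10 rollouts: rarely solved in one attempt,
    # always solved within ten attempts.
    print(pass_at_k(10, 1, 1), pass_at_k(10, 1, 10))
```

Under this reading, compression moves probability mass so that pass@1 approaches the existing pass@$k$, whereas capability gain concerns problems whose pass@$k$ was zero even at high $k$.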

[111] AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science cs.LG | cs.AI | cs.CL | stat.ME

An Luo, Xun Xian, Jin Du, Fangqiao Tian, Ganghua Wang

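TL;DR: AssistedDS is a benchmark for evaluating how external domain knowledge assists large language models (LLMs) in automated data science tasks. The study finds that LLMs tend to adopt external information uncritically and struggle to counteract the negative impact of adversarial information.

Details

Motivation: To examine whether LLMs can, like human data scientists, make effective use of external domain knowledge (such as data cleaning, feature engineering, and model selection) to improve performance on automated data science tasks.

Result: LLMs tend to adopt external information uncritically, and adversarial content significantly degrades their predictive performance; helpful information often cannot fully offset the negative impact of adversarial information; and LLMs handle time-series data and categorical variables poorly.

Insight: Current LLMs show clear shortcomings in critically evaluating and applying domain knowledge, so more robust, knowledge-aware automated data science systems are needed.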

Abstract: Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models’ ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.

[112] Improving LoRA with Variational Learning cs.LG | cs.AI | cs.CL | stat.ML

Bai Cong, Nico Daheim, Yuesong Shen, Rio Yokota, Mohammad Emtiyaz Khan

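TL;DR: This paper improves LoRA finetuning with the variational learning algorithm IVON, substantially improving accuracy and calibration while remaining computationally efficient.

Details

Motivation: Although Bayesian methods improve calibration in LoRA finetuning, their effect on other metrics such as accuracy is marginal and sometimes detrimental, and they add computational overhead. A more efficient and well-rounded approach is needed.

Result: On a Llama-3.2-3B model, IVON improves accuracy over AdamW by 1.3% and reduces ECE by 5.4%, outperforming AdamW and other Bayesian methods such as Laplace-LoRA and BLoB.

Insight: Variational learning methods such as IVON improve not only calibration but also other metrics in LoRA finetuning, at a computational cost comparable to AdamW, which makes them practical.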

Abstract: Bayesian methods have recently been used to improve LoRA finetuning and, although they improve calibration, their effect on other metrics (such as accuracy) is marginal and can sometimes even be detrimental. Moreover, Bayesian methods also increase computational overheads and require additional tricks for them to work well. Here, we fix these issues by using a recently proposed variational algorithm called IVON. We show that IVON is easy to implement and has similar costs to AdamW, and yet it can also drastically improve many metrics by using a simple posterior pruning technique. We present extensive results on billion-scale LLMs (Llama and Qwen series) going way beyond the scale of existing applications of IVON. For example, we finetune a Llama-3.2-3B model on a set of commonsense reasoning tasks and improve accuracy over AdamW by 1.3% and reduce ECE by 5.4%, outperforming AdamW and other recent Bayesian methods like Laplace-LoRA and BLoB. Overall, our results show that variational learning with IVON can effectively improve LoRA finetuning.
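
The abstract reports a reduction in ECE (expected calibration error) without defining it. The sketch below shows the common equal-width binned form: the sample-weighted average gap between accuracy and mean confidence in each bin. The bin count of 10 and the function name are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: 1 if the prediction was right, else 0 (same length).
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += hit
        count[idx] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            accuracy = hit_sum[b] / count[b]
            mean_conf = conf_sum[b] / count[b]
            ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them right.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A perfectly calibrated model scores 0; an overconfident model, whose confidence exceeds its accuracy, scores higher.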

[113] TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization cs.LG | cs.AI | cs.CL

Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao

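TL;DR: The paper proposes TGDPO, which decomposes sequence-level PPO into token-level PPO problems and uses token-level reward guidance to optimize DPO, substantially improving model performance.

Details

Motivation: DPO is formulated as a sequence-level bandit problem, which makes it difficult to use token-level reward guidance, even though token-level rewards have proven beneficial in PPO. Transferring them to DPO is the central challenge.

Result: TGDPO achieves win rate gains over DPO of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard.

Insight: Token-level reward guidance effectively improves DPO; the key is flexibly adjusting how far each token may deviate from the reference policy.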

Abstract: Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards. Experimental results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. Code is available at https://github.com/dvlab-research/TGDPO.
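
For context, the sketch below computes the standard sequence-level DPO loss that TGDPO starts from: under the Bradley-Terry model, the loss is the negative log-sigmoid of the scaled difference in log-probability ratios between the preferred and rejected responses. This is the baseline objective only. TGDPO's token-level reward guidance is not reproduced here because the abstract does not give its form, and the function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a sequence log-probability (sum of token log-probs)
    under the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy already favours the chosen response: the loss is smaller.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

Because the sequence log-probability is a single sum over tokens, every token is scaled by the same `beta` here, which is the sequence-level limitation the abstract sets out to relax.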

[114] Train Once, Forget Precisely: Anchored Optimization for Efficient Post-Hoc Unlearning cs.LG | cs.CV

Prabhav Sanga, Jaskaran Singh, Arun K. Dubey

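TL;DR: This paper proposes FAMR (Forget-Aligned Model Reconstruction), a theoretically grounded framework for efficient post-hoc unlearning in deep image classifiers that precisely removes the influence of specific data or classes without full retraining.

Details

Motivation: As machine learning systems increasingly rely on data subject to privacy regulation, precisely unlearning specific information from trained models has become essential.

Result: Experiments on CIFAR-10 and ImageNet-100 show that FAMR performs well on class forgetting tasks, with strong performance retention and minimal computational overhead.

Insight: FAMR applies beyond class forgetting and extends naturally to concept and style erasure, offering a scalable and certifiable route to efficient forgetting in vision models.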

Abstract: As machine learning systems increasingly rely on data subject to privacy regulation, selectively unlearning specific information from trained models has become essential. In image classification, this involves removing the influence of particular training samples, semantic classes, or visual styles without full retraining. We introduce Forget-Aligned Model Reconstruction (FAMR), a theoretically grounded and computationally efficient framework for post-hoc unlearning in deep image classifiers. FAMR frames forgetting as a constrained optimization problem that minimizes a uniform-prediction loss on the forget set while anchoring model parameters to their original values via an $\ell_2$ penalty. A theoretical analysis links FAMR’s solution to influence-function-based retraining approximations, with bounds on parameter and output deviation. Empirical results on class forgetting tasks using CIFAR-10 and ImageNet-100 demonstrate FAMR’s effectiveness, with strong performance retention and minimal computational overhead. The framework generalizes naturally to concept and style erasure, offering a scalable and certifiable route to efficient post-hoc forgetting in vision models.
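
The objective described above, a uniform-prediction loss on the forget set plus an $\ell_2$ anchor to the original parameters, can be sketched as follows. The abstract does not give the exact loss, so this sketch assumes cross-entropy against the uniform distribution and a penalty weight `lam`; both choices, along with every function name, are illustrative assumptions rather than the authors' formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def uniform_prediction_loss(logits_batch):
    """Mean cross-entropy between the uniform distribution and the model's
    predictions; it is smallest when every class is predicted equally."""
    total = 0.0
    for logits in logits_batch:
        log_probs = log_softmax(logits)
        total += -sum(log_probs) / len(log_probs)
    return total / len(logits_batch)


def anchored_forgetting_objective(forget_logits, params, anchor_params, lam=1.0):
    """Uniform-prediction loss on the forget set plus an l2 anchor that keeps
    the parameters close to their original (pre-unlearning) values."""
    anchor = sum((p - a) ** 2 for p, a in zip(params, anchor_params))
    return uniform_prediction_loss(forget_logits) + 0.5 * lam * anchor


if __name__ == "__main__":
    confident = [[10.0, 0.0, 0.0, 0.0]]  # model still recognises the class
    forgotten = [[0.0, 0.0, 0.0, 0.0]]   # model predicts uniformly
    print(uniform_prediction_loss(confident), uniform_prediction_loss(forgotten))
```

With four classes the loss bottoms out at log 4 for uniform predictions, and the anchor term adds a cost for any parameter movement, which is how the objective trades forgetting against retention.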

cs.DC

[115] Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse cs.DC | cs.CV

Jinwoo Hwang, Daeun Kim, Sangyeop Lee, Yoonsung Kim, Guseul Heo

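TL;DR: The paper presents Déjà Vu, a video-language query engine built on learning-based inter-frame computation reuse. Using a modified ViT model (ReuseViT) and memory-compute joint compaction, it substantially accelerates embedding generation for video-language models.

Details

Motivation: Video-language models (VideoLMs) show strong potential for video query systems, but generating embeddings for large-scale videos is computationally expensive, which hinders real-world deployment. An efficient way to reduce this cost is needed.

Result: Déjà Vu accelerates embedding generation by up to 2.64x within a 2% error bound, markedly improving the practicality of VideoLMs for large-scale video analytics.

Insight: Inter-frame computation reuse is an effective way to reduce the cost of video processing, and combining it with memory-compute optimizations further improves real-world performance.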

Abstract: Recently, Video-Language Models (VideoLMs) have demonstrated remarkable capabilities, offering significant potential for flexible and powerful video query systems. These models typically rely on Vision Transformers (ViTs), which process video frames individually to extract visual embeddings. However, generating embeddings for large-scale videos requires ViT inference across numerous frames, posing a major hurdle to real-world deployment and necessitating solutions for integration into scalable video data management systems. This paper introduces Déjà Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities, striking an effective balance between accuracy and reuse. Although ReuseViT significantly reduces computation, these savings do not directly translate into performance gains on GPUs. To overcome this, Déjà Vu integrates memory-compute joint compaction techniques that convert the FLOP savings into tangible performance gains. Evaluations on three VideoLM tasks show that Déjà Vu accelerates embedding generation by up to 2.64x within a 2% error bound, dramatically enhancing the practicality of VideoLMs for large-scale video analytics.

Table of Contents

cs.CV

[1] Non-planar Object Detection and Identification by Features Matching and Triangulation Growth cs.CV | cs.AI

[2] CDST: Color Disentangled Style Transfer for Universal Style Reference Customization cs.CV

[3] Hidden Bias in the Machine: Stereotypes in Text-to-Image Models cs.CV | cs.AI | cs.CY | cs.LG

[4] Fake it till You Make it: Reward Modeling as Discriminative Prediction cs.CV | cs.AI | cs.LG

[5] DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding cs.CV

[6] Intelligent Image Sensing for Crime Analysis: A ML Approach towards Enhanced Violence Detection and Investigation cs.CV | cs.AI

[7] HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment cs.CV | cs.AI

[8] Mapping Farmed Landscapes from Remote Sensing cs.CV | cs.LG

[9] FindMeIfYouCan: Bringing Open Set metrics to $\textit{near}$, $\textit{far}$ and $\textit{farther}$ Out-of-Distribution Object Detection cs.CV

[10] Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation cs.CV

[11] SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement cs.CV | cs.AI

[12] Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems cs.CV | cs.AI

[13] FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution cs.CV

[14] Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology cs.CV

[15] RadFabric: Agentic AI System with Reasoning Capability for Radiology cs.CV | cs.CL

[16] SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability cs.CV | cs.AI

[17] VideoMAR: Autoregressive Video Generation with Continuous Tokens cs.CV | cs.AI

[18] A multi-stage augmented multimodal interaction network for fish feeding intensity quantification cs.CV | cs.AI | cs.ET

[19] One-Shot Neural Architecture Search with Network Similarity Directed Initialization for Pathological Image Classification cs.CV

[20] Meta-SurDiff: Classification Diffusion Model Optimized by Meta Learning is Reliable for Online Surgical Phase Recognition cs.CV

[21] Egocentric Human-Object Interaction Detection: A New Benchmark and Method cs.CV

[22] Unified Representation Space for 3D Visual Grounding cs.CV

[23] Cross-Modal Geometric Hierarchy Fusion: An Implicit-Submap Driven Framework for Resilient 3D Place Recognition cs.CV

[24] Comparison of Two Methods for Stationary Incident Detection Based on Background Image cs.CV

[25] Exploring Non-contrastive Self-supervised Representation Learning for Image-based Profiling cs.CV

[26] Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment cs.CV

[27] FRIDU: Functional Map Refinement with Guided Image Diffusion cs.CV | cs.LG

[28] FGA-NN: Film Grain Analysis Neural Network cs.CV | eess.IV

[29] EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization cs.CV | cs.AI

[30] HydroChronos: Forecasting Decades of Surface Water Change cs.CV

[31] Discrete JEPA: Learning Discrete Token Representations without Reconstruction cs.CV

[32] DepthSeg: Depth prompting in remote sensing semantic segmentation cs.CV | cs.AI

[33] GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image Fusion cs.CV

[34] Causally Steered Diffusion for Automated Video Counterfactual Generation cs.CV | cs.AI

[35] Compositional Attribute Imbalance in Vision Datasets cs.CV | cs.AI

[36] Toward Rich Video Human-Motion2D Generation cs.CV

[37] MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models cs.CV | cs.LG

[38] Model compression using knowledge distillation with integrated gradients cs.CV | cs.AI | cs.LG

[39] Adapting Lightweight Vision Language Models for Radiological Visual Question Answering cs.CV | cs.AI

[40] Dense360: Dense Understanding from Omnidirectional Panoramas cs.CV

[41] I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs cs.CV

[42] SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks cs.CV

[43] VisLanding: Monocular 3D Perception for UAV Safe Landing via Depth-Normal Synergy cs.CV | cs.RO

[44] PoseGRAF: Geometric-Reinforced Adaptive Fusion for Monocular 3D Human Pose Estimation cs.CV | cs.AI

[45] Align Your Flow: Scaling Continuous-Time Flow Map Distillation cs.CV | cs.LG

[46] Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching cs.CV | cs.LG | eess.IV

[47] VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning cs.CV | cs.CL

[48] 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting cs.CV

[49] DDS-NAS: Dynamic Data Selection within Neural Architecture Search via On-line Hard Example Mining applied to Image Classification cs.CV

[50] Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models cs.CV

[51] FocalClick-XL: Towards Unified and High-quality Interactive Segmentation cs.CV

[52] YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework cs.CV

[53] SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting cs.CV

[54] Cost-Aware Routing for Efficient Text-To-Image Generation cs.CV | cs.LG

[55] Scaling-Up the Pretraining of the Earth Observation Foundation Model PhilEO to the MajorTOM Dataset cs.CV

[56] ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM cs.CV | cs.CL

[57] CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion cs.CV | cs.RO

cs.CL

[58] Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles cs.CL | cs.AI

[59] VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training cs.CL | cs.CV

[60] ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection cs.CL | cs.AI | cs.RO

[61] Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text cs.CL

[62] MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation cs.CL

[63] An Interdisciplinary Review of Commonsense Reasoning and Intent Detection cs.CL | cs.HC

[64] S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models cs.CL | cs.AI

[65] MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind cs.CL

[66] GRAM: A Generative Foundation Reward Model for Reward Generalization cs.CL | cs.AI

[67] CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation cs.CL

[68] Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation cs.CL

[69] Chaining Event Spans for Temporal Relation Grounding cs.CL

[70] Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team cs.CL | cs.AI

[71] A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs cs.CL

[72] Re-Initialization Token Learning for Tool-Augmented Large Language Models cs.CL | cs.AI

[73] A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis cs.CL | cs.IR

[74] ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection cs.CL | cs.HC

[75] Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding cs.CL

[76] ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge cs.CL | cs.AI

[77] How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison cs.CL