cs.CV

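[1] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam cs.CV | cs.AI | cs.LG | 68T07, 97D50 | I.2.7; I.4; K.3.1

Ruslan Khrulev

TL;DR: The paper proposes a new benchmark, the EGE-Math Solutions Assessment Benchmark, which evaluates how well vision-language models (VLMs) grade hand-written mathematical solutions, and reveals the limitations of current models in mathematical reasoning and alignment with human grading rubrics.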

Details

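Motivation: Existing benchmarks focus mainly on solving mathematical problems and do not evaluate the understanding of student solutions. This paper fills that gap by focusing on grading hand-written solutions, identifying mistakes, and assigning scores according to fixed criteria.

Result: Experiments show that existing models have significant limitations in mathematical reasoning and alignment with human grading rubrics, which opens new research directions in AI-assisted assessment.

Insight: The study reveals both the potential and the challenges of VLMs in mathematical grading tasks and stresses the importance of improving mathematical reasoning and rubric alignment.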

Abstract: This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. The code is available at https://github.com/Karifannaa/Auto-check-EGE-math


[2] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction cs.CV | cs.AI

Giuseppe Cartella, Vittorio Cuculo, Alessandro D’Amelio, Marcella Cornia, Giuseppe Boccignone

TL;DR: ScanDiff is a new architecture that combines diffusion models with Vision Transformers to generate diverse and realistic human gaze scanpaths. By explicitly modeling scanpath variability, it outperforms existing methods.

Details

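Motivation: Existing deep learning models for scanpath prediction usually generate averaged behavior and fail to capture the variability of human visual exploration.

Result: On benchmark datasets, ScanDiff outperforms existing methods in both free-viewing and task-driven settings, producing more diverse and accurate scanpaths.

Insight: The stochastic nature of diffusion models can effectively model the variability of human gaze behavior, and textual conditioning further improves adaptation to the task.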

Abstract: Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.


[3] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging cs.CV | cs.AI

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

TL;DR: The paper investigates whether super-resolution (SR) can improve classification accuracy on low-quality echocardiograms in resource-constrained settings, in support of AI-assisted diagnosis.

Details

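Motivation: In resource-poor regions, echocardiogram quality is often low, which degrades the performance of automated diagnostic models. SR has shown promise in other medical imaging modalities, but its application to echocardiography remains underexplored.

Result: Experiments show that SRResNet improves classification performance while also being computationally efficient, significantly recovering the diagnostic value of low-quality echocardiograms.

Insight: SR can effectively compensate for poor image quality in resource-constrained settings, offering a practical solution for AI-assisted diagnosis.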

Abstract: Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.


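[4] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields cs.CV | cs.NE

Ranxi Lin, Canming Yao, Jiayi Li, Weihang Liu, Xin Lou

TL;DR: The paper proposes PATA, which dynamically adjusts the time-step length to balance rendering quality and computational efficiency in an SNN-based NeRF framework, substantially reducing inference time steps and power consumption.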

Details

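Motivation: NeRF performs very well in 3D reconstruction and rendering, but its reliance on dense point sampling leads to high computational cost, limiting its use in resource-constrained scenarios. SNNs are a potential alternative because of their low energy consumption.

Result: Experiments show that PATA preserves rendering quality while reducing inference time steps by 64% and running power by 61.55%.

Insight: A dynamic time-step strategy can effectively balance efficiency and quality for SNNs in neural rendering, offering a practical solution for resource-constrained scenarios.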

Abstract: Neural Radiance Fields (NeRF)-based models have achieved remarkable success in 3D reconstruction and rendering tasks. However, during both training and inference, these models rely heavily on dense point sampling along rays from multiple viewpoints, resulting in a surge in floating-point operations and severely limiting their use in resource-constrained scenarios like edge computing. Spiking Neural Networks (SNNs), which communicate via binary spikes over discrete time steps, offer a promising alternative due to their energy-efficient nature. Given the inherent variability in scene scale and texture complexity in neural rendering and the prevailing practice of training separate models per scene, we propose a spike-based NeRF framework with a dynamic time step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). This approach automatically explores the trade-off between rendering quality and time step length during training. Consequently, it enables scene-adaptive inference with variable time steps and reduces the additional consumption of computational resources in the inference process. Anchoring to the established Instant-NGP architecture, we evaluate our method across diverse datasets. The experimental results show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.


[5] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving cs.CV | cs.AI | cs.LG | cs.MM | cs.RO | I.2.6; I.2.9; I.2.10; C.3.3

Santosh Patapati, Trisanth Srinivasan

TL;DR: NovaDrive is a real-time vision-language driving architecture that uses multimodal fusion and a lightweight cross-attention block, improving success rate and path efficiency in autonomous driving.

Details

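Motivation: Autonomous driving requires fast reactions in complex environments, and current methods fall short in real-time performance and multimodal fusion. NovaDrive aims to address both.

Result: On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, success rate rises by 4 percentage points to 84%, path efficiency (SPL) rises by 0.11 to 0.66, and collision frequency falls by 1.4 percentage points, from 2.6% to 1.2%.

Insight: Waypoint tokens, partial VLM fine-tuning, and cross-attention fusion contribute most to the gains; the smoothness loss also lowers fuel or battery usage.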

Abstract: Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion contribute the most to these gains. Beyond safety, NovaDrive’s shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
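Sketch: the abstract does not give the form of the smoothness loss, so the following is only one plausible reading, a mean squared change between consecutive steering and speed commands. The function name, the squared-difference form, and the two weights are assumptions for illustration, not the authors' definition.

```python
def smoothness_penalty(steering, speed, w_steer=1.0, w_speed=1.0):
    """Mean squared change between consecutive commands (hypothetical form)."""
    if len(steering) != len(speed):
        raise ValueError("steering and speed must have the same length")
    if len(steering) < 2:
        return 0.0  # a single command has no change to penalize
    steps = len(steering) - 1
    # Sum of squared differences between each command and the next one.
    d_steer = sum((b - a) ** 2 for a, b in zip(steering, steering[1:]))
    d_speed = sum((b - a) ** 2 for a, b in zip(speed, speed[1:]))
    return (w_steer * d_steer + w_speed * d_speed) / steps


smooth = smoothness_penalty([0.0, 0.0, 0.0], [5.0, 5.0, 5.0])
jerky = smoothness_penalty([0.0, 1.0, 0.0], [5.0, 5.0, 5.0])
```

Under this form a constant command sequence costs nothing, while a sequence that swings the steering back and forth is penalized in proportion to the squared size of each swing.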


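[6] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation cs.CV | cs.AI

Alexandru Buburuzan

TL;DR: The paper proposes two new methods, MObI and AnydoorMed, for multimodal synthetic data generation in autonomous driving and medical image analysis respectively; both build on diffusion models to achieve high realism and controllability.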

Details

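Motivation: Safety-critical applications such as autonomous driving and medical image analysis need large amounts of multimodal data for testing, but collecting real-world data is costly and complex, so realistic and controllable synthetic data methods are needed.

Result: The proposed methods achieve realistic, controllable multimodal data generation in autonomous driving and medical imaging, showing that foundation models for reference-guided inpainting adapt readily to different perceptual modalities.

Insight: Diffusion models show promise for cross-modal synthetic data generation and offer a new approach to building highly realistic counterfactual scenarios.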

Abstract: Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. Synthetic data methods are gaining prominence due to the cost and complexity of gathering real-world data, but they demand a high degree of realism and controllability to be useful. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively. MObI is a first-of-its-kind framework for Multimodal Object Inpainting that leverages a diffusion model to produce realistic and controllable object inpaintings across perceptual modalities, demonstrated simultaneously for camera and lidar. Given a single reference RGB image, MObI enables seamless object insertion into existing multimodal scenes at a specified 3D location, guided by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, this approach uses 3D bounding box conditioning to ensure accurate spatial positioning and realistic scaling. AnydoorMed extends this paradigm to the medical imaging domain, focusing on reference-guided inpainting for mammography scans. It leverages a diffusion-based model to inpaint anomalies with impressive detail preservation, maintaining the reference anomaly’s structural integrity while semantically blending it with the surrounding tissue. Together, these methods demonstrate that foundation models for reference-guided inpainting in natural images can be readily adapted to diverse perceptual modalities, paving the way for the next generation of systems capable of constructing highly realistic, controllable and multimodal counterfactual scenarios.


[7] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints cs.CV | cs.AI | cs.LG | cs.RO | I.4.8; I.2.10; I.2.6; C.3.3; I.4.9

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

TL;DR: The paper proposes XYZ-Drive, a single vision-language model that fuses camera, HD-map, and waypoint inputs through a goal-centered cross-attention layer, improving both real-time performance and accuracy in autonomous driving.

Details

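Motivation: Autonomous driving needs both geometric accuracy and semantic understanding, yet most existing stacks handle them separately. XYZ-Drive addresses this through multimodal fusion to make driving more efficient.

Result: On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive achieves a 95% success rate and an SPL of 0.80, surpassing PhysNav-DG by 15% and halving collisions.

Insight: Multimodal fusion and goal-centered attention are crucial for autonomous driving; fine-tuning the pretrained model and keeping the map at high resolution are key to performance.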

Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m $\times$ 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
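Sketch: Success weighted by Path Length (SPL) is commonly defined as the average over episodes of S_i * l_i / max(p_i, l_i), where S_i is 1 on success and 0 otherwise, l_i is the shortest-path length, and p_i is the length of the path actually taken. The function below follows that common definition; whether the MD-NEX benchmark computes it in exactly this way is an assumption, and the episode tuples are made-up illustration data.

```python
def spl(episodes):
    """Average SPL over (success, shortest_len, taken_len) tuples.

    Assumes shortest_len > 0 for every episode.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode scores 1.0 only if the agent's path
            # is no longer than the shortest path.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Made-up episodes: optimal success, success with a 2x detour, failure.
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)])
```

An SPL of 0.80 alongside a 95% success rate therefore means that successful runs also stay fairly close to the shortest route on average.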


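[8] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model cs.CV

Dmitry Demidov, Zaigham Zaheer, Omkar Thawakar, Salman Khan, Fahad Shahbaz Khan

TL;DR: The paper proposes E-FineR, a vocabulary-free method for fine-grained visual recognition that combines large language models with the enriched context of vision-language models to enable open-set recognition, and also performs well in zero-shot and few-shot classification.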

Details

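Motivation: Traditional fine-grained image classification relies on fixed vocabularies and closed-set classification, which copes poorly with the novel classes that frequently emerge in real-world settings. Methods that combine LLMs and VLMs enable open-set recognition, but they underuse the LLM at the classification stage and rely on guessed class names without thorough analysis and refinement.

Result: E-FineR achieves state-of-the-art results in fine-grained recognition and performs on par with existing SOTA in zero-shot and few-shot classification, while being training-free and requiring no human intervention.

Insight: Through flexible, language-driven understanding, E-FineR supports the shift in image classification from rigid label prediction to scalable, generalizable systems, suited to real-world scenarios where annotation is difficult.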

Abstract: Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it demonstrated performance on par with the existing SOTA while being training-free and not requiring human interventions. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on https://github.com/demidovd98/e-finer.


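[9] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation cs.CV

Sanghun Jung, Jingjing Zheng, Ke Zhang, Nan Qiao, Albert Y. C. Chen

TL;DR: The paper proposes a new open-vocabulary 3D instance segmentation framework that follows a two-stage scheme of 3D proposal generation and instance classification. It replaces CLIP with Alpha-CLIP and adds a standardized maximum similarity (SMS) score, achieving state-of-the-art performance on ScanNet200 and S3DIS.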

Details

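Motivation: Existing open-vocabulary 3D instance segmentation methods have introduced a variety of concepts, but these concepts are complementary rather than mutually exclusive. The authors combine and refine them to address key challenges and improve performance.

Result: The framework achieves state-of-the-art results on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.

Insight: 1. Careful attention to detail is key to improving open-vocabulary 3D instance segmentation; 2. Combining complementary concepts is more effective than using any single one; 3. Alpha-CLIP and the SMS score improve classification precision.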

Abstract: Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.


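[10] X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention cs.CV

Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang

TL;DR: X-NeMo is a zero-shot, diffusion-based portrait animation method that uses disentangled latent attention to address identity leakage and the difficulty of capturing expressions, producing high-quality animations.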

Details

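Motivation: Existing portrait animation methods suffer from identity leakage and struggle to capture subtle and extreme expressions. X-NeMo aims for more precise expression control by disentangling a latent motion descriptor from identity.

Result: In experiments, X-NeMo surpasses state-of-the-art baselines, producing more expressive animations with better identity resemblance.

Insight: The key innovation is a latent attention mechanism that disentangles motion from identity information, offering a new approach to portrait animation.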

Abstract: We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatial-aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.


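[11] Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space cs.CV

Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang

TL;DR: The paper proposes a multimodal motion retrieval framework that aligns four modalities (text, audio, video, and motion) in a joint embedding space, introducing audio to motion retrieval for the first time to improve user immersion and convenience.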

Details

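Motivation: Existing motion retrieval methods typically build a unified embedding space with contrastive learning, but they lack an intuitive mode of user interaction and overlook the sequential representation of the modalities.

Result: On the HumanML3D dataset, R@10 for text-to-motion retrieval improves by 10.16% and R@1 for video-to-motion retrieval improves by 25.43%.

Insight: The four-modality framework significantly outperforms its three-modality counterpart, confirming the potential of multimodal motion retrieval for advancing motion acquisition.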

Abstract: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities – text, audio, video, and motion – within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
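Sketch: R@K (Recall at K) in cross-modal retrieval is usually the fraction of queries whose ground-truth item appears among the K highest-scoring gallery items. The helper below assumes a square similarity matrix in which gallery item i is the single correct match for query i; that pairing convention and the toy scores are assumptions for illustration, not the paper's evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] is the score of query i against gallery item j;
    gallery item i is assumed to be the correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Gallery indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: query 0 ranks its match 1st, query 1 2nd, query 2 3rd.
toy = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.3, 0.2, 0.1],
]
r1, r2, r3 = recall_at_k(toy, 1), recall_at_k(toy, 2), recall_at_k(toy, 3)
```

R@1 is the strictest setting, since only a top-ranked match counts, while R@10 credits any match within the first ten results.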


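[12] A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery cs.CV | I.4.6; I.2.10; I.5.4

Youngsun Jang, Dongyoun Kim, Chulwoo Pack, Kwanghee Won

TL;DR: The paper introduces a new satellite image dataset for semantic segmentation of flooded areas, addressing the shortage of suitable datasets for this task, and evaluates existing models on it.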

Details

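Motivation: Existing satellite imagery datasets are inadequate for flooded-area segmentation, and the effect of seasonal changes on image characteristics has not been studied sufficiently, so a new dataset is needed to fill the gap.

Result: Existing models achieve only modest results on the dataset, suggesting that multimodal and temporal learning strategies need further development.

Insight: Seasonal changes may significantly affect flood detection in satellite imagery; future work should incorporate more modalities and temporal information to improve robustness.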

Abstract: This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available on https://github.com/youngsunjang/SDSU_MidWest_Flood_2019.


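[13] Adversarial-Guided Diffusion for Multimodal LLM Attacks cs.CV

Chengwei Xia, Fan Ma, Ruijie Quan, Kun Zhan, Yi Yang

TL;DR: The paper proposes AGD, a diffusion-based adversarial attack that uses adversarial-guided noise to deceive multimodal large language models (MLLMs) while avoiding significant image distortion. AGD injects target semantics into the noise of the reverse diffusion process, which gives the adversarial signal a full-spectrum property and makes it robust to a variety of defenses. Experiments show that AGD outperforms existing methods in both attack performance and robustness to defenses.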

Details

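Motivation: Security concerns about MLLMs are growing. Traditional adversarial attacks usually embed high-frequency perturbations in the clean image, which a simple low-pass filter can defend against. The paper aims for a more robust attack that exploits the properties of diffusion models.

Result: Experiments show that AGD outperforms existing methods when attacking MLLMs and holds up better against defenses such as low-pass filtering.

Insight: The noise component of a diffusion model can carry the adversarial signal; because this noise spans the full frequency spectrum, the attack is harder to defend against, which suggests a new direction for adversarial attack design.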

Abstract: This paper addresses the challenge of generating an adversarial image using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as simple low-pass filtering, which act independently on each component, the adversarial image within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses. Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses.
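Sketch: the frequency argument can be illustrated on toy 1D signals. A low-pass filter removes most of the energy of a perturbation confined to high frequencies, but leaves a component with low-frequency content largely intact. This is not AGD itself; the moving-average filter, the signal length, and the two test signals are assumptions chosen only to make the effect visible.

```python
import math


def moving_average(signal, width=3):
    """Simple low-pass filter: circular moving average over `width` samples."""
    n = len(signal)
    half = width // 2
    return [
        sum(signal[(i + d) % n] for d in range(-half, half + 1)) / width
        for i in range(n)
    ]


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


n = 64
# Highest-frequency perturbation: the sign flips at every sample.
high_freq = [(-1.0) ** i for i in range(n)]
# Low-frequency component: one slow sine cycle across the signal.
low_freq = [math.sin(2 * math.pi * i / n) for i in range(n)]

# Fraction of energy that survives the low-pass filter.
kept_high = energy(moving_average(high_freq)) / energy(high_freq)
kept_low = energy(moving_average(low_freq)) / energy(low_freq)
```

Here the alternating perturbation keeps about one ninth of its energy after filtering, while the slow component keeps more than 99%, which is the intuition behind spreading the adversarial signal across the full spectrum.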


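[14] Toward Safe, Trustworthy and Realistic Augmented Reality User Experience cs.CV

Yanming Xiu

TL;DR: The paper aims to make augmented reality (AR) safer and more trustworthy; it presents two systems, ViDDAR and VIM-Sense, for detecting harmful virtual content and proposes future research directions.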

Details

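Motivation: As AR becomes part of everyday life, ensuring the safety and trustworthiness of virtual content is critical, particularly preventing it from obstructing critical information or manipulating user perception.

Result: Through its systems and conceptual framework, the work provides initial solutions for safeguarding AR experiences and outlines directions for further improvement.

Insight: Safe AR experiences require multimodal attack detection combined with lightweight model deployment; future research should focus on perceptual alignment and user-centered design.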

Abstract: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.


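[15] Ambiguity-Guided Learnable Distribution Calibration for Semi-Supervised Few-Shot Class-Incremental Learning cs.CV

Fan Lyu, Linglan Zhao, Chengyan Liu, Yinying Mei, Zhang Zhang

TL;DR: The paper formulates the Generalized Semi-supervised Few-Shot Class-Incremental Learning (GSemi-FSCIL) problem and proposes an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy to address the difficulty existing methods have in distinguishing unlabeled samples of base classes from those of novel classes.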

Details

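Motivation: In practice, unlabeled data may come from base classes or from any previously seen novel class, whereas existing methods assume it comes only from the novel classes of the current session. The authors therefore redefine the problem as Generalized Semi-FSCIL and propose ALDC to calibrate feature distributions dynamically.

Result: Experiments show that ALDC outperforms existing methods on three benchmark datasets, setting new state-of-the-art results.

Insight: Generalized Semi-FSCIL matches real-world scenarios more closely, and ALDC's dynamic distribution calibration improves performance when few-shot and unlabeled data are mixed.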

Abstract: Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is only confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes. To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.


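[16] Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents cs.CV | cs.CL | cs.LG

Sungguk Cha, DongWook Kim, Taeseung Hahn, Mintae Kim, Youngsub Han

TL;DR: The paper proposes RL-QR, a reinforcement-learning-based query rewriting framework that optimizes queries for a specific retriever without human-annotated data and applies to both text-only and multimodal databases. Experiments show clear gains for multimodal and lexical retrievers, but no improvement for semantic and hybrid retrievers.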

Details

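Motivation: Query optimization in existing retrieval-augmented generation (RAG) systems relies on human-annotated data and adapts poorly to diverse, unstructured real-world documents, so a scalable solution that needs no human annotation is required.

Result: RL-QR achieves an 11% relative gain in NDCG@3 for multimodal RAG and a 9% gain for lexical retrievers, but yields no improvement for semantic and hybrid retrievers.

Insight: RL-QR offers a scalable query optimization approach for RAG systems, but its limitations in semantic retrieval point to training misalignment as an issue for further study.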

Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}_{\text{multi-modal}}$ achieving an 11% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}_{\text{lexical}}$ yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR’s potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
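Sketch: NDCG@k divides the discounted cumulative gain of the returned ranking by that of the ideal ranking. The version below uses the linear-gain variant with a log2 rank discount and assumes the relevance list covers every relevant document for the query; the abstract does not say which variant the authors use, so treat both choices as assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # Rank 0 is the top result, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


best = ndcg_at_k([1, 0, 0], 3)   # relevant document ranked first
worst = ndcg_at_k([0, 0, 1], 3)  # relevant document ranked third
```

With a single relevant document, moving it from third place to first raises NDCG@3 from 0.5 to 1.0, so a rewritten query that pulls the right document upward is rewarded directly. A relative gain of 11% means the new score is 1.11 times the baseline score, not 11 points higher.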


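[17] A Deep Dive into Generic Object Tracking: A Survey cs.CV

Fereshteh Aghaee Meibodi, Shadi Alijani, Homayoun Najjaran

TL;DR: The paper gives a comprehensive survey of generic object tracking, analyzing three categories of methods (Siamese-based, discriminative, and the more recent transformer-based trackers), with particular emphasis on the rapid development of transformer-based methods.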

Details

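Motivation: Generic object tracking is challenging because of complex spatio-temporal dynamics, occlusions, and similar distractors. Although survey papers exist, this one aims to cover all major tracking paradigms comprehensively, especially the rapidly evolving transformer-based methods.

Result: The survey finds that transformer-based methods, with their robust spatio-temporal modeling capabilities, have driven rapid progress in object tracking.

Insight: The paper highlights the potential of transformer-based methods for tracking and stresses that future research should address their robustness and efficiency in complex scenes.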

Abstract: Generic object tracking remains an important yet challenging task in computer vision due to complex spatio-temporal dynamics, especially in the presence of occlusions, similar distractors, and appearance variations. Over the past two decades, a wide range of tracking paradigms, including Siamese-based trackers, discriminative trackers, and, more recently, prominent transformer-based approaches, have been introduced to address these challenges. While a few existing survey papers in this field have either concentrated on a single category or widely covered multiple ones to capture progress, our paper presents a comprehensive review of all three categories, with particular emphasis on the rapidly evolving transformer-based methods. We analyze the core design principles, innovations, and limitations of each approach through both qualitative and quantitative comparisons. Our study introduces a novel categorization and offers a unified visual and tabular comparison of representative methods. Additionally, we organize existing trackers from multiple perspectives and summarize the major evaluation benchmarks, highlighting the fast-paced advancements in transformer-based tracking driven by their robust spatio-temporal modeling capabilities.


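[18] Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2 cs.CV | cs.AI

Solha Kang, Eugene Kim, Joris Vankerschaver, Utku Ozbulak

TL;DR: The paper explores using SAM2 (Segment Anything Model 2) for low-cost, minimal-input tumor segmentation in 3D breast MRI. Starting from a bounding box annotation on a single slice, it propagates segmentation predictions with three slice-wise tracking strategies (top-to-bottom, bottom-to-top, and center-outward). The center-outward strategy performs best, and SAM2 performs strongly under minimal supervision even though it was not trained on volumetric medical data.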

Details

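Motivation: High-resolution volumetric breast MRI is critical for tumor assessment, but manual analysis of 3D scans is time-consuming and subjective. Commercial AI products are hard to adopt in low- and middle-income countries because of their high cost and infrastructure demands, so a low-cost, accessible alternative is needed.

Result: Center-outward propagation yields the most consistent and accurate segmentations. Although SAM2 was not trained on volumetric medical data, it achieves strong segmentation performance under minimal supervision; the analysis also identifies key failure modes.

Insight: General-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an affordable alternative for resource-constrained settings.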

Abstract: Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
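Sketch: the three slice-wise tracking strategies differ only in the order in which slices are visited. The helper below lists that order as one or two passes of slice indices. It assumes index 0 is the top slice, that the prompted slice is the first slice of each pass, and that "center" means the middle index of the volume; the paper may define the center differently (for example, relative to the tumor), and the function name is hypothetical.

```python
def slice_order(num_slices, strategy):
    """Return propagation passes as lists of slice indices (0 = top slice)."""
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    if strategy == "top_to_bottom":
        return [list(range(num_slices))]
    if strategy == "bottom_to_top":
        return [list(range(num_slices - 1, -1, -1))]
    if strategy == "center_outward":
        center = num_slices // 2
        # Two passes that both start at the center slice and move away from it.
        upward = list(range(center, -1, -1))
        downward = list(range(center, num_slices))
        return [upward, downward]
    raise ValueError(f"unknown strategy: {strategy}")


passes = slice_order(5, "center_outward")
```

For a five-slice volume the center-outward order is two passes, 2, 1, 0 and then 2, 3, 4, so no slice is more than half the volume away from the prompted slice, whereas a single top-to-bottom pass has to carry the mask across the entire stack.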


[19] iLRM: An Iterative Large 3D Reconstruction Model cs.CVPDF

Gyeongjin Kang, Seungtae Nam, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed

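TL;DR: iLRM is an iterative large 3D reconstruction model that improves reconstruction quality and speed by decoupling the scene representation from the input views, decomposing multi-view attention, and injecting high-resolution information.

Details

Motivation: Current transformer-based 3D reconstruction methods rely on full attention, which becomes prohibitively expensive with many input views or high-resolution images and therefore scales poorly. iLRM aims to solve this problem.

Result: On datasets such as RE10K and DL3DV, iLRM outperforms existing methods in both reconstruction quality and speed, and it scales better.

Insight: Decoupling the representation and staging the attention are effective ways to make 3D reconstruction more efficient and scalable.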

Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.


[20] UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing cs.CVPDF

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li

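TL;DR: UniLIP extends CLIP so that it handles not only understanding and generation but also image editing. With a two-stage training scheme and a self-distillation strategy, it preserves the original comprehension performance while achieving effective image reconstruction. On generation and editing tasks it outperforms comparable unified models.

Details

Motivation: Existing CLIP-based unified methods usually need an additional diffusion decoder or quantization to support reconstruction and generation, which can degrade performance. UniLIP addresses this with a unified architecture in which the tasks work together efficiently.

Result: In text-to-image generation, UniLIP scores 0.87 on GenEval and 0.53 on WISE, surpassing previous unified models of similar scale. In image editing, it scores 3.62 on the ImgEdit Benchmark, surpassing recent models such as BAGEL and UniWorld-V1.

Insight: UniLIP shows how a unified architecture and training strategy can widen CLIP's range of applications while keeping its strength in understanding tasks, which suggests a new way to handle multimodal tasks jointly.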

Abstract: In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrate reconstruction capabilities into CLIP, allowing it to maintain original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM’s strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation tasks, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks, respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks.


[21] Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval cs.CVPDF

Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim

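TL;DR: This paper proposes the BLiM framework, which uses bidirectional likelihood estimation together with Candidate Prior Normalization (CPN) to remove candidate prior bias in text-video retrieval and substantially improve retrieval performance.

Details

Motivation: Existing MLLM-based methods for text-video retrieval suffer from candidate prior bias: they favor candidates with inherently high priors over candidates that are more relevant to the query.

Result: Across four benchmarks, BLiM with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average and markedly reduces candidate prior bias.

Insight: The CPN module applies broadly to multi-modal tasks beyond retrieval. It reduces reliance on textual priors and thereby strengthens visual understanding.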

Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
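
The abstract describes Candidate Prior Normalization only as a training-free score calibration, so the sketch below assumes one plausible form: subtracting a scaled candidate log-prior from the candidate log-likelihood. The function name `rank_candidates`, the `alpha` parameter, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
def rank_candidates(cond_logp, prior_logp, alpha=1.0):
    """Rank candidates by a prior-calibrated score.

    cond_logp[c]  : log-likelihood of candidate c given the query
    prior_logp[c] : log-likelihood of candidate c without any query
    alpha         : calibration strength (0 disables the correction)
    """
    scores = {c: cond_logp[c] - alpha * prior_logp[c] for c in cond_logp}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy values: the generic caption is likely under any query,
    # while the specific caption only becomes likely given this query.
    cond = {"generic caption": -2.0, "specific caption": -2.5}
    prior = {"generic caption": -1.0, "specific caption": -4.0}
    print(rank_candidates(cond, prior, alpha=0.0))  # raw likelihood ranking
    print(rank_candidates(cond, prior, alpha=1.0))  # calibrated ranking
```

With the raw likelihood the generic candidate ranks first. Once the prior is subtracted, the candidate whose likelihood rose most because of the query moves to the top, which is the kind of behavior the abstract attributes to CPN.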


[22] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis cs.CVPDF

Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung

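TL;DR: This paper proposes the Layout Error Detection (LED) benchmark for evaluating the structural robustness of document layout analysis. It defines eight standardized error types and builds a synthetic dataset, LED-Dataset. Experiments show that LED separates models by their structural understanding.

Details

Motivation: Existing metrics for document layout analysis, such as IoU and mAP, focus on spatial overlap and struggle to detect critical structural errors such as region merging, splitting, and missing content. A new evaluation method is needed to diagnose these errors.

Result: Experiments show that LED effectively differentiates models' structural understanding and exposes modality biases and performance trade-offs.

Insight: Traditional metrics do not fully reflect how models handle structural errors. LED provides a more comprehensive evaluation framework.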

Abstract: Recent advancements in Document Layout Analysis through Large Language Models and Multimodal Models have significantly improved layout detection. However, despite these improvements, challenges remain in addressing critical structural errors, such as region merging, splitting, and missing content. Conventional evaluation metrics like IoU and mAP, which focus primarily on spatial overlap, are insufficient for detecting these errors. To address this limitation, we propose Layout Error Detection (LED), a novel benchmark designed to evaluate the structural robustness of document layout predictions. LED defines eight standardized error types, and formulates three complementary tasks: error existence detection, error type classification, and element-wise error type classification. Furthermore, we construct LED-Dataset, a synthetic dataset generated by injecting realistic structural errors based on empirical distributions from DLA models. Experimental results across a range of LMMs reveal that LED effectively differentiates structural understanding capabilities, exposing modality biases and performance trade-offs not visible through traditional metrics.


[23] Training-free Geometric Image Editing on Diffusion Models cs.CVPDF

Hanshen Zhu, Zhen Zhu, Kaile Zhang, Yiming Gong, Yuliang Liu

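TL;DR: This paper proposes a decoupled pipeline for geometric image editing that separates object transformation, source region inpainting, and target region refinement, with inpainting and refinement handled by the training-free diffusion approach FreeFine. The separation improves the fidelity and precision of edits.

Details

Motivation: Existing diffusion-based image editing methods usually try to handle all relevant subtasks in a single step, which works poorly for large or structurally complex transformations. The paper sets out to solve this.

Result: On the GeoBench benchmark, FreeFine outperforms existing methods in image fidelity and edit precision, especially under demanding transformations.

Insight: Decoupling the editing steps markedly improves results on complex transformations, and a training-free diffusion approach offers advantages in both efficiency and quality.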

Abstract: We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine


[24] ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection cs.CVPDF

Xihang Hu, Fuming Sun, Jiazhe Liu, Feilong Xu, Xiaoli Zhang

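TL;DR: ST-SAM is a concise self-training framework that dynamically filters high-confidence pseudo-labels and exploits the potential of the SAM model, sharply reducing the dependence of semi-supervised camouflaged object detection on annotated data.

Details

Motivation: Existing semi-supervised camouflaged object detection methods rely on complex multi-network architectures and suffer from prediction bias and high computational overhead. ST-SAM aims to solve these problems by combining self-training with SAM.

Result: With only 1% labeled data, ST-SAM outperforms existing semi-supervised methods and even approaches fully supervised methods.

Insight: SAM's capabilities can effectively reduce error accumulation in semi-supervised learning, and a single-model architecture improves computational efficiency and scalability.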

Abstract: Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model’s potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.


[25] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving cs.CVPDF

Xuewei Tang, Mengmeng Yang, Tuopu Wen, Peijin Jia, Le Cui

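TL;DR: PriorFusion is a unified framework that integrates semantic, geometric, and generative priors to improve road element perception in autonomous driving. Its key contributions are an attention mechanism guided by shape priors and a diffusion model that generates accurate predictions.

Details

Motivation: In complex environments without high-definition map support, existing methods do not fully exploit the structured priors of road elements, which leads to irregular and inaccurate predictions.

Result: On large-scale datasets, PriorFusion significantly improves the perception accuracy of road elements and performs especially well under challenging conditions.

Insight: Integrating multiple priors (semantic, geometric, generative) effectively addresses inaccurate and fragmented road element predictions and gives autonomous driving more reliable perception.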

Abstract: With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.


[26] Forgetting of task-specific knowledge in model merging-based continual learning cs.CVPDF

Timm Hess, Gido M van de Ven, Tinne Tuytelaars

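TL;DR: This paper studies linear model merging in continual learning. It finds that merging mainly preserves or enhances shared knowledge while task-specific knowledge degrades rapidly, and that merging models from incremental training works better than merging models trained in parallel.

Details

Motivation: To examine how knowledge is retained and lost when models are merged in continual learning, in particular how shared and task-specific knowledge behave, in order to improve merging strategies.

Result: Merging enhances shared knowledge, but task-specific knowledge degrades rapidly. Merging incrementally trained models gives better results.

Insight: Merging strategies should favor incremental training to retain knowledge better and limit the loss of task-specific knowledge.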

Abstract: This paper investigates the linear merging of models in the context of continual learning (CL). Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.
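
Linear merging amounts to a weighted average of corresponding parameters. The sketch below is a minimal illustration that represents each model as a dictionary of flat parameter lists; this representation and the name `merge_linear` are assumptions made for the example, not the paper's implementation.

```python
def merge_linear(models, weights=None):
    """Merge models by a weighted average of their parameters.

    models  : list of dicts mapping parameter name -> list of floats
    weights : one coefficient per model (defaults to a uniform average)
    """
    if not models:
        raise ValueError("need at least one model")
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    if len(weights) != len(models):
        raise ValueError("one weight per model is required")

    merged = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(weights, models))
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    model_a = {"layer.weight": [1.0, 3.0]}
    model_b = {"layer.weight": [3.0, 5.0]}
    print(merge_linear([model_a, model_b]))
```

Because every parameter is averaged, a direction that only one model relies on is scaled down in the merge, which is consistent with the paper's observation that unshared, task-specific knowledge degrades quickly.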


[27] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models cs.CVPDF

Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti

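TL;DR: This paper studies how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. It finds that the models can separate content from style to some degree, but the degree of separation depends on the specific artistic prompt and the style requested.

Details

Motivation: Text-to-image diffusion models are strong at generating artistic content, yet how they internally represent concepts such as content and style remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training.

Result: Diffusion models show varying degrees of content-style separation when generating artworks. In many cases content tokens mainly influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction.

Insight: The study offers a new view of how large-scale generative models represent complex artistic concepts without explicit supervision and sheds light on their internal mechanisms in artistic generation.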

Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.


[28] Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification cs.CV | cs.AI | cs.LGPDF

Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Sarbajit Pal, Amitabha Das

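TL;DR: This paper studies how hyperparameter optimization affects the accuracy of lightweight deep learning models for real-time image classification, analyzes the performance of several models experimentally, and offers tuning recommendations.

Details

Motivation: Lightweight models are vital for resource-constrained real-time applications, but the effect of hyperparameter tuning on their performance has not been studied systematically. This paper aims to fill that gap.

Result: Cosine learning rate decay and an adjustable batch size can significantly improve accuracy and convergence speed. RepVGG-A2 performs best, exceeding 80% Top-1 accuracy while keeping inference efficient.

Insight: Hyperparameter optimization can significantly improve the real-time performance of lightweight models. Cosine learning rate decay and batch size adjustment are particularly effective at balancing accuracy against resource cost.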

Abstract: Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess appropriateness for real-time applications, each model is assessed not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping low latency and memory cost. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
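
Cosine learning rate decay, which the abstract singles out, is commonly written as lr(t) = lr_min + 0.5 (lr_max - lr_min) (1 + cos(pi t / T)). The sketch below implements that standard formula. The function name, the absence of warm-up, and the example values are assumptions, since the abstract does not give the exact schedule used.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Standard cosine decay from lr_max at step 0 to lr_min at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, round(cosine_lr(step, 100, lr_max=0.1, lr_min=0.001), 5))
```

The schedule falls slowly at first, fastest around the midpoint, and slowly again near the end, so the final epochs run at a small, nearly constant learning rate.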


[29] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning cs.CV | cs.AIPDF

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan

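TL;DR: FastDriveVLA is a reconstruction-based visual token pruning framework for efficient end-to-end autonomous driving. It uses MAE-style pixel reconstruction and an adversarial foreground-background reconstruction strategy to cut computational cost substantially.

Details

Motivation: Existing visual token pruning methods perform poorly in autonomous driving scenarios. Human drivers concentrate on relevant foreground areas, and existing methods do not take this into account.

Result: On the nuScenes closed-loop planning benchmark, the method achieves state-of-the-art results across different pruning ratios.

Insight: Foreground information is essential for driving decisions, and a reconstruction-based pruning strategy retains that key information effectively.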

Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.


[30] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models cs.CVPDF

Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng

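TL;DR: FASTopoWM is a fast-slow lane segment topology reasoning framework augmented with latent world models. It markedly improves lane segment detection and centerline perception.

Details

Motivation: Existing lane topology reasoning methods do not use temporal information effectively and are vulnerable to pose estimation failures. FASTopoWM addresses these problems with latent world models and parallel supervision.

Result: FASTopoWM outperforms existing methods in lane segment detection (37.4% mAP) and centerline perception (46.3% OLS).

Insight: Latent world models combined with parallel supervision significantly improve temporal perception, which is valuable for autonomous driving systems.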

Abstract: Lane segment topology reasoning provides comprehensive bird’s-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, the stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).


[31] Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation cs.CVPDF

Yingkai Wang, Yaoyao Zhu, Xiuding Cai, Yuhao Xiao, Haotian Wu

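TL;DR: This paper proposes a domain generalization framework for medical image segmentation. By introducing implicit feature perturbations and an adaptive consistency constraint, it improves segmentation performance on unseen clinical domains.

Details

Motivation: Medical image segmentation is crucial in clinical workflows, but variations in imaging conditions, scanner types, and acquisition protocols degrade model performance on unseen domains. The paper exploits the anatomical consistency of medical images to address this domain shift directly.

Result: On two public multi-center benchmarks, the method consistently outperforms existing domain generalization approaches and achieves robust segmentation across clinical domains.

Insight: The anatomical consistency of medical images is a distinctive advantage for domain generalization. Feature perturbation and adaptive constraints can handle domain variation while preserving task-relevant consistency.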

Abstract: Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.


[32] Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision cs.CVPDF

Qiang Lu, Waikit Xiu, Xiying Li, Shenyu Hu, Shengbo Sun

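TL;DR: This paper proposes a two-stage framework that combines open-vocabulary detection with cross-modal learning to address the long-tail distribution and small-target multi-scale feature extraction problems in traffic sign recognition. It achieves state-of-the-art performance on the TT100K dataset.

Details

Motivation: Traffic sign recognition currently faces two major challenges: long-tailed datasets and multi-scale feature extraction for small targets. Both reduce the recognition performance of traditional convolutional networks on low-frequency and out-of-distribution classes.

Result: On the TT100K dataset, the model reaches 78.4% mAP on the long-tail detection task, with 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms.

Insight: Cross-modal contrastive learning effectively reduces the class confusion caused by data imbalance, and combining visual and semantic features improves generalization.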

Abstract: Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.


[33] MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting cs.CVPDF

Xingyue Peng, Yuandong Lyu, Lang Zhang, Jian Zhu, Songtao Wang

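TL;DR: MagicRoad is a semantic-aware 3D road surface reconstruction framework. It uses obstacle inpainting and semantic-guided color enhancement to make road reconstruction more robust and consistent in complex urban environments.

Details

Motivation: Existing methods work well under clean, static conditions but struggle in real scenes with dynamic occlusions, static obstacles, and lighting changes, so a more robust reconstruction framework is needed.

Result: On urban-scale datasets, the method significantly outperforms prior methods in visual coherence and geometric accuracy.

Insight: Semantic information, such as segmentation and color correction, is key to better road reconstruction, especially in complex environments.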

Abstract: Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments. While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.


[34] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models cs.CV | 68T45, 68T07 | I.4.8; I.4.9; I.5.4PDF

Ahmet Can Ömercikoğlu, Mustafa Mansur Yönügül, Pakize Erdoğmuş

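TL;DR: This paper compares the face detection performance of MTCNN, YOLOv11, and YOLOv12 at different input resolutions. YOLOv11 is the most accurate, especially at higher resolutions, YOLOv12 has slightly better recall, and MTCNN lags in real-time inference speed.

Details

Motivation: Low-resolution imagery in real-world conditions degrades face detection performance, so the effect of resolution on model performance needs to be studied.

Result: YOLOv11 performs best at higher resolutions and YOLOv12 has slightly better recall. MTCNN is too slow for real-time use but remains competitive in landmark localization.

Insight: Resolution has a significant effect on model performance, so the model should be chosen to match the operational constraints.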

Abstract: Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model’s performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.


[35] IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025 cs.CVPDF

Radu-Andrei Bourceanu, Neil De La Fuente, Jan Grimm, Andrei Jardan, Andriy Manucharyan

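TL;DR: This report traces the evolution of key design patterns in computer vision through six influential papers, covering residual connections (ResNet), the Vision Transformer (ViT), Generative Adversarial Networks (GANs), Latent Diffusion Models (LDMs), and the self-supervised learning techniques DINO and MAE.

Details

Motivation: To explore how design patterns in computer vision have evolved, from convolutional networks to attention-based models and then to generative models and self-supervised learning, in order to understand the main drivers of progress.

Result: ResNet and ViT advanced visual representation learning, GANs and LDMs raised the quality of generative models, and DINO and MAE perform well at reducing dependence on labels.

Insight: 1. Residual connections and attention mechanisms are key to training deep networks. 2. Latent diffusion models are more computationally efficient for generation tasks. 3. Self-supervised learning opens a new direction for pre-training large-scale models.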

Abstract: This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.


[36] Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers cs.CVPDF

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

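TL;DR: This paper proposes Short-LVLM (SVL), a framework that compresses and accelerates large vision-language models (LVLMs) by pruning redundant layers. It addresses the finding that NLP layer pruning techniques are ineffective when applied directly, and achieves a favorable trade-off between performance and efficiency.

Details

Motivation: LVLMs perform impressively, but their parameter counts and computational costs limit practical use. The paper explores an efficient, training-free compression method.

Result: Short-LVLM achieves a superior trade-off between performance and efficiency. It needs no additional training, is model-agnostic, and is highly compatible.

Insight: The modality divergence in vision-language models makes direct transfer of NLP pruning techniques ineffective. Using the important vision-language tokens and mitigating inter-layer feature gaps makes layer pruning considerably more effective.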

Abstract: Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.
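
For context, a generic NLP-style layer-pruning baseline (the kind the abstract reports as ineffective when applied directly to LVLMs) can be sketched as follows. The redundancy criterion used here, cosine similarity between a layer's input and output hidden states, is an assumption chosen for illustration; it is not Short-LVLM's method, and `layers_to_prune` is a hypothetical name.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def layers_to_prune(hidden_states, num_prune):
    """hidden_states[i] is the input to layer i and hidden_states[i + 1] its output.

    A layer whose output is nearly parallel to its input changes the
    representation little, so it is treated as redundant (assumed criterion).
    """
    scores = [cosine(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_prune]), scores

# Toy hidden states for a three-layer model (illustrative values only).
states = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
pruned, scores = layers_to_prune(states, num_prune=1)
```

In this toy example layer 1 leaves its input unchanged, so it is the one selected for removal.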


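[37] Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation cs.CV

Haoran Chen, Zexiao Wang, Haidong Cao, Zuxuan Wu, Yu-Gang Jiang

TL;DR: Proposes MP²A, a CLIP-based multi-source unsupervised domain adaptation method that uses a progressive alignment strategy to reduce the influence of noisy samples and improve the learning of domain-invariant features.

Details

Motivation: Existing methods align all pseudo-labeled data at once, which leaves them vulnerable to noisy and hard-to-classify samples and leads to error propagation and poor learning. The problem is more severe in the multi-source setting.

Result: On the ImageCLEF, Office-Home, and DomainNet benchmarks, MP²A outperforms existing CLIP-based multi-source unsupervised domain adaptation methods.

Insight: Progressive alignment substantially reduces the negative impact of noisy samples and improves the stability and performance of domain adaptation.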

Abstract: Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is further amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, we propose a progressive alignment strategy for adapting CLIP to an unlabeled downstream task. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments show that MP^2A achieves state-of-the-art performance compared with the most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
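
The idea of starting from a high-confidence subset and gradually admitting harder samples can be sketched as a confidence-threshold curriculum. The linear schedule and the threshold values (0.9 down to 0.5) are assumptions for illustration; the abstract does not specify how MP^2A selects or schedules samples.

```python
def confidence_threshold(epoch, num_epochs, start=0.9, end=0.5):
    """Linearly relax the pseudo-label confidence threshold (assumed schedule)."""
    if num_epochs <= 1:
        return end
    frac = epoch / (num_epochs - 1)
    return start + (end - start) * frac

def select_samples(confidences, threshold):
    """Indices of target samples whose pseudo-label confidence passes the threshold."""
    return [i for i, c in enumerate(confidences) if c >= threshold]

# Hypothetical pseudo-label confidences for five target samples.
confidences = [0.95, 0.55, 0.80, 0.62, 0.91]
schedule = [select_samples(confidences, confidence_threshold(e, 3)) for e in range(3)]
```

Early epochs train only on the most reliable pseudo-labels, and later epochs bring in the remaining samples once the representation has stabilized.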


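[38] NeRF Is a Valuable Assistant for 3D Gaussian Splatting cs.CV

Shuangkang Fang, I-Chao Shen, Takeo Igarashi, Yufeng Wang, ZeSheng Wang

TL;DR: The paper proposes the NeRF-GS framework, which combines the strengths of NeRF and 3DGS and improves 3D scene representation through joint optimization.

Details

Motivation: To address limitations of 3D Gaussian Splatting (3DGS): sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations.

Result: NeRF-GS outperforms existing methods on benchmark datasets and achieves state-of-the-art performance.

Insight: NeRF and 3DGS are complementary rather than competing, which points to new directions for hybrid methods that combine the two.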

Abstract: We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.


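[39] AGA: An adaptive group alignment framework for structured medical cross-modal representation learning cs.CV | cs.AI | cs.LG

Wei Li, Xun Gong, Jiao Li, Xiaobin Sun

TL;DR: AGA is an adaptive group alignment framework that uses a bidirectional grouping mechanism and threshold gating modules to address two problems in medical cross-modal representation learning: capturing structured semantics and performing contrastive learning on small datasets.

Details

Motivation: Current medical vision-language pretraining methods often reduce clinical reports to single entities or fragmented tokens, ignoring their inherent structure. Contrastive learning also depends on large numbers of negative samples, which is impractical for small medical datasets.

Result: On public and private datasets, AGA performs strongly on image-text retrieval and classification tasks in both fine-tuning and zero-shot settings.

Insight: Fine-grained alignment of structured semantics is crucial in medical cross-modal learning, and dynamically learned thresholds adapt well to small datasets.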

Abstract: Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.
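
The grouping step, in which each token selects its best-matching patches and the group representation is a similarity-weighted average, can be sketched as below. This is a simplified illustration: the threshold is a fixed number here, whereas the abstract describes gates that learn it dynamically, and `visual_group` is a hypothetical name.

```python
def visual_group(similarities, patch_features, threshold):
    """Similarity-weighted average of the patches a token selects.

    similarities[j] is the token-to-patch similarity for patch j. A fixed
    threshold stands in for the learned threshold gate (assumption).
    """
    selected = [(s, f) for s, f in zip(similarities, patch_features) if s > threshold]
    if not selected:
        return None
    total = sum(s for s, _ in selected)
    dim = len(selected[0][1])
    return [sum(s * f[d] for s, f in selected) / total for d in range(dim)]

# Toy example: one text token, three image patches with 2-D features.
similarities = [0.75, 0.1, 0.25]
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
group = visual_group(similarities, patches, threshold=0.2)
```

The language group for a patch would be built the same way with the roles of tokens and patches swapped.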


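[40] Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories cs.CV

Lemar Abdi, Francisco Caetano, Amaan Valiuddin, Christiaan Viviers, Hamdi Joudeh

TL;DR: This paper proposes an unsupervised anomaly detection method based on a Stein score-based denoising diffusion model (SBDDM). Using forward diffusion trajectories with only five diffusion steps, it scores anomalies efficiently and accurately, sharply reduces computational cost, and achieves state-of-the-art performance on multiple medical imaging OOD detection benchmarks.

Details

Motivation: In medical imaging, pathological cases have extremely low incidence rates, and existing generative methods usually rely on likelihood estimation or reconstruction error, which are computationally expensive, unreliable, and require retraining when the data distribution changes. An efficient and robust unsupervised anomaly detection method is therefore needed.

Result: On medical imaging datasets, SBDDM achieves state-of-the-art performance in both Near-OOD and Far-OOD detection at a much lower computational cost, making it suitable for real-time computer-aided diagnosis.

Insight: The curvature of diffusion trajectories can serve as an efficient anomaly score, offering a new approach to unsupervised anomaly detection in medical imaging and showing that a single pre-trained model can generalize across multiple tasks.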

Abstract: In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.


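[41] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes cs.CV

Bin Xie, Congxuan Zhang, Fagan Wang, Peng Liu, Feng Lu

TL;DR: The paper introduces CST Anti-UAV, a new thermal infrared dataset for single object tracking (SOT) of tiny unmanned aerial vehicles (UAVs) in complex scenes, and evaluates 20 existing SOT methods on it. The results highlight the limitations of current techniques.

Details

Motivation: The widespread use of UAVs raises safety and privacy concerns, but existing UAV tracking datasets lack scene complexity and object diversity and fall short of real-world needs.

Result: The best existing method reaches a state accuracy of only 35.92%, far below the 67.69% observed on other datasets such as Anti-UAV410, which shows that tracking tiny UAVs in complex scenes remains challenging.

Insight: The dataset exposes the limitations of existing tracking techniques and encourages the development of more robust SOT methods to improve anti-UAV systems.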

Abstract: The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.


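[42] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding cs.CV

Ting Huang, Zeyu Zhang, Hao Tang

TL;DR: 3D-R1 improves the reasoning ability and scene-understanding generalization of 3D vision-language models through high-quality data synthesis, a reinforcement learning strategy, and dynamic view selection.

Details

Motivation: Existing 3D VLMs fall short in reasoning and generalization, mainly because of scarce high-quality spatial data and static viewpoint assumptions.

Result: An average improvement of 10% across multiple 3D scene benchmarks.

Insight: High-quality data and dynamic view selection are crucial for 3D reasoning tasks; reinforcement learning effectively improves the model's semantic precision and detection accuracy.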

Abstract: Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
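
How three reward terms might feed a GRPO-style update can be sketched as follows. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage; the equal weighting of the perception, semantic similarity, and format rewards is an assumption, since the abstract does not say how they are combined.

```python
import statistics

def total_reward(perception, semantic, fmt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms (equal weights are an assumption)."""
    return weights[0] * perception + weights[1] * semantic + weights[2] * fmt

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three hypothetical responses sampled for the same prompt.
rewards = [
    total_reward(0.2, 0.3, 0.5),
    total_reward(0.6, 0.9, 0.5),
    total_reward(1.0, 1.0, 1.0),
]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group average receive positive advantages and are reinforced, without the need for a separate value network.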


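[43] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning cs.CV

Julia Werner, Oliver Bause, Julius Oexle, Maxime Le Floch, Franz Brinkmann

TL;DR: The paper proposes a multi-task learning model for real-time localization and anomaly detection in capsule endoscopy, aimed at the device's short battery life and at data sparsity. The model needs only 1 million parameters and surpasses existing baselines.

Details

Motivation: Capsule endoscopes have limited battery life and sparse data, and traditional single-task models struggle to meet the needs of real-time decision-making.

Result: The model reaches 93.63% accuracy on the localization task and 87.48% on the anomaly detection task with only 1 million parameters.

Insight: Multi-task learning makes effective use of limited resources and improves intelligent decision-making in capsule endoscopy, suggesting a path for medical edge devices.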

Abstract: Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility of deploying such a model in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
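
Viterbi decoding over per-frame predictions can be sketched as follows. The two-section setup, the transition matrix, and the probabilities are hypothetical values chosen for illustration, and using the network's per-frame class probabilities as emission scores is an assumption about how such a decoder would be wired up.

```python
import math

def _log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(frame_probs, transition, prior):
    """Most likely state sequence given per-frame class probabilities."""
    n_states = len(prior)
    score = [_log(prior[s]) + _log(frame_probs[0][s]) for s in range(n_states)]
    back = []
    for frame in frame_probs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best = max(range(n_states), key=lambda p: score[p] + _log(transition[p][s]))
            pointers.append(best)
            new_score.append(score[best] + _log(transition[best][s]) + _log(frame[s]))
        score = new_score
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical setup: state 0 = stomach, state 1 = small intestine.
# The capsule only moves forward, so the 1 -> 0 transition has probability 0.
transition = [[0.9, 0.1], [0.0, 1.0]]
prior = [1.0, 0.0]
frame_probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
path = viterbi(frame_probs, transition, prior)
framewise = [max(range(2), key=lambda s: f[s]) for f in frame_probs]
```

The frame-wise argmax flickers between the two sections, while the decoded path respects the anatomical ordering and switches section only once.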


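[44] Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions cs.CV | cs.RO

Jinshan Zhen, Yuanyue Ge, Tianxiao Zhu, Hui Zhao, Ya Xiong

TL;DR: The paper proposes a vision-based method that combines RGB-D sensing and deep learning for real-time, online mass estimation of table-top grown strawberries under occlusion, addressing the limitations of traditional methods.

Details

Motivation: Under field conditions, strawberry mass estimation is challenging because of frequent occlusions and pose variations, so a non-destructive, real-time, and robust solution is needed.

Result: The mean mass estimation error is 8.11% for isolated strawberries and 10.47% for occluded ones; CycleGAN outperforms the LaMa model in occlusion recovery.

Insight: CycleGAN performs well at occlusion recovery, offering a new approach to automated harvesting and yield monitoring under complex occlusion.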

Abstract: Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time and online mass estimation. The method employed YOLOv8-Seg for instance segmentation, Cycle-consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
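
The two occlusion-recovery metrics can be sketched on binary masks stored as sets of pixel coordinates. IoU is standard; the definition of the pixel area ratio (PAR) as completed area divided by ground-truth area is an assumption inferred from the reported values lying near 1, and is not stated in the abstract.

```python
def iou(pred, truth):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0

def pixel_area_ratio(pred, truth):
    """Completed area divided by ground-truth area (assumed definition of PAR)."""
    return len(set(pred)) / len(set(truth))

# Toy masks: the completed mask over-fills the true 2x2 region by one pixel.
truth_mask = {(0, 0), (0, 1), (1, 0), (1, 1)}
completed_mask = truth_mask | {(1, 2)}
mask_iou = iou(completed_mask, truth_mask)
mask_par = pixel_area_ratio(completed_mask, truth_mask)
```

Under this definition a PAR above 1 indicates that the completion over-estimates the fruit area, which would in turn inflate a mass estimate derived from projected area.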


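[45] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion cs.CV

Timing Li, Bing Cao, Jiahe Feng, Haifang Cao, Qinghau Hu

TL;DR: This paper proposes Hy-CycleAlign, a cross-modal image alignment method based on hyperbolic space. With a dual-path cyclic registration framework and a hyperbolic hierarchy contrastive alignment module, it significantly improves the alignment and fusion quality of multi-modal images.

Details

Motivation: Existing registration methods based on Euclidean space handle cross-modal alignment poorly, which limits multi-source data fusion. To address this, the authors explore image alignment in non-Euclidean (hyperbolic) space.

Result: In experiments, Hy-CycleAlign significantly outperforms existing methods, producing higher-quality multi-modal image alignment and fusion.

Insight: Hyperbolic space is better suited than Euclidean space to handling the geometric and modality differences of cross-modal images, offering a new direction for multi-modal data alignment.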

Abstract: Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.
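
Mapping features into hyperbolic space can be illustrated with the Poincaré ball model of curvature -1. The abstract does not state which hyperbolic model or curvature H$^{2}$CA uses, so this choice is an assumption; the formulas themselves are the standard exponential map at the origin and the standard Poincaré distance.

```python
import math

def exp_map_origin(v):
    """Map a Euclidean (tangent) vector at the origin into the unit Poincaré ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit ball."""
    def sq_norm(u):
        return sum(a * a for a in u)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y))))

origin = [0.0, 0.0]
point = exp_map_origin([0.3, 0.4])      # Euclidean norm 0.5 before mapping
distance = poincare_distance(origin, point)
```

Distances grow rapidly as points approach the boundary of the ball, which is the property usually cited when hyperbolic embeddings are used for hierarchical structure.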


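[46] I Am Big, You Are Little; I Am Right, You Are Wrong cs.CV | cs.AI

David A. Kelly, Akchunya Chanchal, Nathan Blake

TL;DR: By analyzing the minimal sufficient pixels sets of image classification models, the study reveals differences in the pixel regions that different architectures rely on when making decisions (ConvNext and EVA differ markedly from the other models) and finds that misclassified images usually require larger pixel sets.

Details

Motivation: As image classifiers multiply in number and architecture, choosing the right model becomes critical, yet understanding of their decision mechanisms is limited. To gain insight into how different models decide, the study proposes using minimal sufficient pixels sets to analyze a model's 'concentration'.

Result: ConvNext and EVA models differ significantly from the other models in the size and position of their pixel sets, and the pixel sets of misclassified images are usually larger.

Insight: The results indicate that model architecture directly affects how a model attends to the key regions of an image, and that misclassification may arise when the model needs more information to reach a decision. This offers a new perspective on model selection and optimization.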

Abstract: Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model’s classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixels sets to gauge a model’s ‘concentration’: the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that images which are misclassified are associated with larger pixels sets than correct classifications.
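
A sufficient pixel set and the overlap between two such sets can be sketched as below. This is a simplified greedy approximation that assumes an importance ranking of pixels is already available; the abstract does not describe how the authors compute their sets, and `toy_classifier` is a purely hypothetical stand-in for a real model.

```python
def sufficient_prefix(pixels_by_importance, classify, target_label):
    """Smallest prefix of the ranked pixels for which the classifier returns target_label.

    The prefix is sufficient but not guaranteed to be globally minimal.
    """
    revealed = []
    for pixel in pixels_by_importance:
        revealed.append(pixel)
        if classify(frozenset(revealed)) == target_label:
            return set(revealed)
    return None

def jaccard_overlap(a, b):
    """Overlap between two pixel sets, used to compare two models' sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def toy_classifier(revealed):
    # Hypothetical model: predicts "cat" once both key pixels are visible.
    return "cat" if {(1, 1), (2, 2)} <= revealed else "unknown"

ranking = [(1, 1), (0, 0), (2, 2), (3, 3)]
pixel_set = sufficient_prefix(ranking, toy_classifier, "cat")
overlap = jaccard_overlap(pixel_set, {(1, 1), (2, 2)})
```

Comparing the size of such sets across models, and their overlap on the same image, is the kind of measurement the abstract refers to as position, overlap, and size.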


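[47] ART: Adaptive Relation Tuning for Generalized Relation Prediction cs.CV | cs.AI

Gopika Sudhakaran, Hikaru Shindo, Patrick Schramowski, Simone Schaub-Meyer, Kristian Kersting

TL;DR: ART is an adaptive relation tuning framework that adapts vision-language models (VLMs) to visual relation detection (VRD) through instruction tuning and strategic instance selection, improving generalization.

Details

Motivation: Prompt-tuning approaches to VRD rely on handcrafted prompts and struggle with novel or complex relations, which limits generalization. Instruction tuning adapts better to diverse relation data.

Result: ART clearly outperforms its baselines, can infer unseen relation concepts, and its predicted relations can be used to segment complex scenes.

Insight: Instruction tuning is a powerful tool for improving the generalization of VRD models, and adaptive sampling helps the model focus on informative relations while avoiding overfitting.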

Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART’s practical value by using the predicted relations for segmenting complex scenes.
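
Converting a relation-classification annotation into an instruction-tuning record might look like the sketch below. The prompt wording, the record fields, and the function name are all hypothetical; the abstract only states that subject-object boxes are given and the model predicts the predicate.

```python
def to_instruction_sample(image_id, subject, obj, predicate, candidate_predicates):
    """Build one instruction-tuning record from a VRD annotation.

    subject and obj are (name, box) pairs with box = (x1, y1, x2, y2).
    The template below is an illustrative assumption, not ART's format.
    """
    subject_name, subject_box = subject
    object_name, object_box = obj
    instruction = (
        f"In the image, the {subject_name} is at {list(subject_box)} and the "
        f"{object_name} is at {list(object_box)}. Which relation holds between them? "
        f"Options: {', '.join(candidate_predicates)}."
    )
    return {"image": image_id, "instruction": instruction, "response": predicate}

sample = to_instruction_sample(
    image_id="img_0001",
    subject=("person", (10, 20, 110, 220)),
    obj=("horse", (40, 120, 260, 300)),
    predicate="riding",
    candidate_predicates=["riding", "next to", "feeding"],
)
```

Records of this shape can be fed to a standard instruction-tuning pipeline, with the adaptive sampling step deciding which records are drawn more often.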


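[48] Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation cs.CV

Sobhan Asasi, Mohamed Ilyas Lakhal, Ozge Mercanoglu Sincan, Richard Bowden

TL;DR: The paper proposes BeyondGloss, a gloss-free framework that uses the spatio-temporal reasoning ability of Video Large Language Models (VideoLLMs), together with fine-grained textual descriptions of hand motion and an alignment module, to improve sign language translation. It achieves state-of-the-art results on multiple benchmarks.

Details

Motivation: Sign language translation (SLT) must bridge a modality gap and capture fine-grained hand motion, and existing VideoLLMs struggle to model long videos in detail. A new approach is therefore needed to generate fine-grained, temporally-aware textual descriptions.

Result: State-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the framework.

Insight: Gloss-free approaches better match practical needs. The fine-grained textual descriptions and the contrastive alignment module are the key innovations and may offer ideas for SLT and other temporal tasks.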

Abstract: Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
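
A contrastive loss between video representations and text embeddings can be sketched with a one-directional InfoNCE objective. The InfoNCE form and the temperature of 0.07 are assumptions; the abstract does not give the exact loss used by BeyondGloss.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(video_embs, text_embs, temperature=0.07):
    """Video-to-text InfoNCE: pair i is the positive, all other texts are negatives."""
    videos = [_normalize(v) for v in video_embs]
    texts = [_normalize(t) for t in text_embs]
    total = 0.0
    for i, video in enumerate(videos):
        logits = [sum(a * b for a, b in zip(video, text)) / temperature for text in texts]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(videos)

# Toy batch of two pairs: correctly paired versus swapped text embeddings.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each video is closest to its own text and large when the pairing is wrong, which is what pulls the two modalities together during pre-training.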


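[49] Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection cs.CV

Xin Li, Keren Fu, Qijun Zhao

TL;DR: The paper proposes Vcamba, an efficient Mamba-based spatio-frequency motion perception method for video camouflaged object detection (VCOD). By combining spatial and frequency features, Vcamba significantly improves detection accuracy and completeness.

Details

Motivation: Existing VCOD methods rely mainly on spatial appearance features to perceive motion cues, but because foreground and background are highly similar, these features have limited discriminability. Introducing frequency features and the Mamba model compensates for this and improves detection.

Result: Vcamba outperforms existing methods across 6 evaluation metrics on 2 datasets, with lower computation cost.

Insight: Combining frequency features with Mamba addresses the limited discriminability of spatial features in VCOD while providing efficient long-sequence modeling.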

Abstract: Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba, enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.


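[50] Medical Image De-Identification Benchmark Challenge cs.CV | cs.CR

Linmin Pei, Granger Sutton, Michael Rutherford, Ulrike Wagner, Tracy Nolan

TL;DR: The paper presents the Medical Image De-Identification Benchmark (MIDI-B) challenge, a standardized platform for evaluating de-identification tools against HIPAA-based rules, using a multi-center dataset with synthetic PHI/PII. Participants scored highly.

Details

Motivation: Sharing medical images requires compliance with patient privacy regulations while preserving non-PHI metadata to support AI research. The MIDI-B challenge provides a standardized evaluation platform for de-identification tools.

Result: Ten teams completed the test phase, with scores ranging from 97.91% to 99.93%, demonstrating the effectiveness of the rule-based approach.

Insight: 1. Standardized benchmarks are important for evaluating de-identification tools; 2. a variety of techniques (such as OCR) perform well on de-identification; 3. the challenge provides a reference point for future privacy-protection research.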

Abstract: The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge’s design, implementation, results, and lessons learned.
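
The scoring rule, the percentage of correct actions out of the total number of required actions, is simple enough to state directly. The function name and the example counts are illustrative and are not taken from the challenge.

```python
def deid_score(correct_actions, required_actions):
    """Percentage of required de-identification actions performed correctly."""
    if required_actions <= 0:
        raise ValueError("required_actions must be positive")
    return 100.0 * correct_actions / required_actions

# Illustrative counts only: 3 of 4 required actions handled correctly.
example_score = deid_score(3, 4)
```

An action here would be one required outcome for one data element, such as removing a patient name or retaining a research-critical attribute.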


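[51] Consistent Point Matching cs.CV | cs.DC | cs.LG

Halid Ziya Yerebakan, Gerardo Hermosillo Valadez

TL;DR: This paper incorporates a consistency heuristic into a point-matching algorithm, which markedly improves the robustness of matching anatomical structures in medical images and outperforms existing methods on several datasets.

Details

Motivation: Accurate matching of anatomical structures in medical images is critical for clinical decision-making, and existing point-matching methods leave room for improvement in robustness and efficiency.

Result: The method surpasses state-of-the-art results on the Deep Lesion Tracking dataset and performs well on datasets spanning multiple modalities.

Insight: The consistency heuristic markedly improves the robustness of point matching and requires no machine learning model, making it suitable for resource-constrained environments.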

Abstract: This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.


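[52] DivControl: Knowledge Diversion for Controllable Image Generation cs.CV | cs.LG

Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui

TL;DR: DivControl is a decomposable pretraining framework based on knowledge diversion for unified controllable image generation and efficient adaptation. It factorizes ControlNet via SVD and uses a dynamic gate to achieve zero-shot generalization and parameter-efficient adaptation.

Details

Motivation: Existing methods for controllable image generation either train a separate model for each condition or rely on unified but entangled architectures, which leads to poor generalization and high adaptation costs. DivControl aims to solve this problem.

Result: DivControl achieves state-of-the-art controllability with 36.4 times less training cost and shows strong zero-shot and few-shot performance on unseen conditions.

Insight: Knowledge diversion and modular disentanglement are key to improving the generalization and adaptation efficiency of controllable generation models.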

Abstract: Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
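
The decomposition into rank-one components with soft routing over tailors can be sketched as follows. How DivControl actually combines learngenes and tailors is not specified in the abstract; summing always-on learngene components with softmax-gated tailor components is an assumption, and the gate logits stand in for whatever the gate computes from the condition instruction.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_one(sigma, u, v):
    """One SVD-style component: sigma * u v^T as a nested list."""
    return [[sigma * a * b for b in v] for a in u]

def compose_weight(learngenes, tailors, gate_logits):
    """Sum shared components with softly routed condition-specific components.

    learngenes and tailors are lists of (sigma, u, v) triples.
    """
    gates = softmax(gate_logits)
    components = [rank_one(sigma, u, v) for sigma, u, v in learngenes]
    components += [rank_one(g * sigma, u, v) for g, (sigma, u, v) in zip(gates, tailors)]
    rows, cols = len(components[0]), len(components[0][0])
    return [[sum(c[r][k] for c in components) for k in range(cols)] for r in range(rows)]

# Toy 2x2 example: one shared component, two tailors routed with equal weight.
learngenes = [(2.0, [1.0, 0.0], [1.0, 0.0])]
tailors = [(1.0, [0.0, 1.0], [0.0, 1.0]), (1.0, [0.0, 1.0], [1.0, 0.0])]
weight = compose_weight(learngenes, tailors, gate_logits=[0.0, 0.0])
```

Under this reading, adapting to a new condition only requires new gate values or a few new tailors, while the shared components stay fixed.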


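[53] SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation cs.CV | cs.LG

Alfie Roddan, Tobias Czempiel, Chi Xu, Daniel S. Elson, Stamatia Giannarou

TL;DR: SAMSA is an interactive segmentation framework that combines an RGB foundation model with spectral analysis, addressing data limitations and hardware variations in hyperspectral medical image segmentation.

Details

Motivation: Hyperspectral imaging provides rich spectral information for medical imaging, but limited data and hardware variations make segmentation very challenging.

Result: On publicly available datasets, SAMSA reaches 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical dataset, and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine dataset.

Insight: SAMSA performs well in few-shot and zero-shot learning scenarios and works across datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.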

Abstract: Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA’s effectiveness in few-shot and zero-shot learning scenarios and using minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
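
The abstract does not spell out how the click-guided spectral similarity is computed. As a minimal sketch, assuming the classic spectral angle measure suggested by the paper's title, the snippet below compares every pixel's spectrum with the spectrum under a user click. The function names and the mapping of angles to a [0, 1] score are illustrative assumptions, and the fusion with the RGB foundation model is not shown.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra with the same number of bands."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero spectrum is treated as orthogonal to everything.
        return math.pi / 2
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)


def similarity_map(image, click):
    """Score each pixel against the clicked pixel.

    image: rows of pixels, each pixel a list of band intensities.
    click: (row, col) of the user click.
    Returns scores in [0, 1], where 1 means an identical spectral direction.
    """
    ref = image[click[0]][click[1]]
    return [[1.0 - spectral_angle(px, ref) / math.pi for px in row] for row in image]
```

Because the angle depends only on the direction of the spectrum vector, the same code runs unchanged for any number of bands, which is in the spirit of the band-count independence the abstract describes.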


[54] I2V-GS: Infrastructure-to-Vehicle View Transformation with Gaussian Splatting for Autonomous Driving Data Generation cs.CVPDF

Jialei Chen, Wuhao Xu, Sipeng He, Baoru Huang, Dongchun Ren

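TL;DR: The paper proposes I2V-GS, which uses Gaussian Splatting to transform infrastructure views into vehicle views for autonomous driving data generation, and introduces the RoadSight dataset. Experiments show it significantly outperforms existing methods in synthesis quality and metrics.

Details

Motivation: Autonomous driving systems need large amounts of high-quality data, but collecting it with vehicles is expensive and inefficient. Synthesizing vehicle-view data from infrastructure views is a potential solution.

Result: I2V-GS outperforms StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.

Insight: Infrastructure-view data can be used to synthesize vehicle-view data efficiently, offering a new route to autonomous driving data generation.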

Abstract: Vast and high-quality data are essential for end-to-end autonomous driving systems. However, current driving data is mainly collected by vehicles, which is expensive and inefficient. A potential solution lies in synthesizing data from real-world images. Recent advancements in 3D reconstruction demonstrate photorealistic novel view synthesis, highlighting the potential of generating driving data from images captured on the road. This paper introduces a novel method, I2V-GS, to transfer the Infrastructure view To the Vehicle view with Gaussian Splatting. Reconstruction from sparse infrastructure viewpoints and rendering under large view transformations is a challenging problem. We adopt the adaptive depth warp to generate dense training views. To further expand the range of views, we employ a cascade strategy to inpaint warped images, which also ensures inpainting content is consistent across views. To further ensure the reliability of the diffusion model, we utilize the cross-view information to perform a confidence-guided optimization. Moreover, we introduce RoadSight, a multi-modality, multi-view dataset from real scenarios in infrastructure views. To our knowledge, I2V-GS is the first framework to generate autonomous driving datasets with infrastructure-vehicle view transformation. Experimental results demonstrate that I2V-GS significantly improves synthesis quality under vehicle view, outperforming StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.


[55] Enhanced Velocity Field Modeling for Gaussian Video Reconstruction cs.CV | cs.AIPDF

Zhenyang Li, Xiaoyang Bai, Tongchen Zhang, Pengfei Shen, Weiwei Xu

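TL;DR: FlowGaussian-VR is an enhanced velocity field modeling scheme for Gaussian video reconstruction. Through optical flow-based optimization and adaptive densification of Gaussians, it notably improves the visual quality and trajectory tracking of dynamic scenes.

Details

Motivation: Deformation-field Gaussian reconstruction methods struggle in scenes with complex motion and large scale variations: Gaussian trajectories tend to overfit, and the gradient-based densification strategy designed for static scenes cannot cope with dynamic content.

Result: On multi-view dynamic reconstruction and novel view synthesis, PSNR improves by more than 2.5 dB, blur in dynamic textures is reduced, and per-Gaussian trajectories become more regular and trackable.

Insight: Combining optical flow with adaptive Gaussian densification addresses trajectory overfitting and missing content under complex motion, improving video reconstruction quality.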

Abstract: High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. It consists of two core components: a velocity field rendering (VFR) pipeline which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. We validate our model’s effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and fewer blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.


[56] DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching cs.CVPDF

Emery Pierson, Lei Li, Angela Dai, Maks Ovsjanikov

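TL;DR: DiffuMatch is a data-driven approach to non-rigid shape matching based on spectral diffusion priors. It trains a generative model of functional maps in the spectral domain, replacing conventional axiomatic regularization strategies.

Details

Motivation: Existing non-rigid shape matching methods rely on axiomatic modeling of functional maps, which limits their accuracy and applicability. The paper aims to learn priors on functional maps in the spectral domain in a data-driven way to make matching more robust.

Result: Experiments show the method outperforms axiomatic approaches on zero-shot non-rigid shape matching.

Insight: Learning spectral-domain priors on functional maps from data removes the dependence on axiomatic models and improves the generality and accuracy of matching.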

Abstract: Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
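
For context, the two axiomatic regularizers the abstract names as replaceable can be written down in a few lines. The sketch below is not DiffuMatch's method (which uses a learned diffusion prior); it only shows the conventional penalties, assuming a functional map `C` stored as a list of rows that maps source spectral coefficients to target ones, with `evals_src` and `evals_tgt` holding the Laplacian eigenvalues of the two shapes. The function names are hypothetical.

```python
def laplacian_commutativity(C, evals_src, evals_tgt):
    """Squared Frobenius norm of C @ diag(evals_src) - diag(evals_tgt) @ C.

    Entry (i, j) of that difference is C[i][j] * (evals_src[j] - evals_tgt[i]).
    """
    return sum(
        (C[i][j] * (evals_src[j] - evals_tgt[i])) ** 2
        for i in range(len(C))
        for j in range(len(C[0]))
    )


def orthogonality(C):
    """Squared Frobenius norm of C^T C - I."""
    k = len(C[0])
    total = 0.0
    for a in range(k):
        for b in range(k):
            gram = sum(C[i][a] * C[i][b] for i in range(len(C)))
            total += (gram - (1.0 if a == b else 0.0)) ** 2
    return total
```

Both penalties are zero for an identity map between shapes with identical spectra. A map that swaps two basis functions with different eigenvalues stays orthogonal but is penalized by the commutativity term.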


[57] RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping cs.CV | cs.ROPDF

Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia

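TL;DR: RAGNet is a large-scale, reasoning-based, grasping-oriented affordance segmentation benchmark with 273k images and 26k instructions. The paper also proposes AffordanceNet, a framework that combines a vision-language model with a grasping network to improve open-world generalization.

Details

Motivation: Current robotic grasping systems lack large-scale reasoning-based affordance data, which limits their open-world applicability. A benchmark covering diverse scenes and human instructions is needed.

Result: The model performs well on affordance segmentation benchmarks and real-robot tasks, showing strong open-world generalization.

Insight: Combining language instructions with affordance maps can substantially improve the open-world adaptability of robotic grasping systems.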

Abstract: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.


[58] Slot Attention with Re-Initialization and Self-Distillation cs.CVPDF

Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen

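TL;DR: The paper proposes DIAS, which improves Slot Attention through re-initialization and self-distillation, reducing redundancy and improving object representations, and achieves state-of-the-art results on object discovery and recognition.

Details

Motivation: In existing Object-Centric Learning (OCL) methods, Slot Attention reuses slots naively once they are initialized, so redundant slots compete with informative ones and objects are wrongly segmented into parts. In addition, supervision comes only from decoding slots to reconstruct the input, overlooking potential supervision from internal information.

Result: DIAS performs strongly on object discovery and recognition, and also improves advanced visual prediction and reasoning.

Insight: Re-initializing slots and self-distillation can substantially improve object-centric learning, reducing redundancy and erroneous segmentation.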

Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; (ii) We drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available on https://github.com/Genera1Z/DIAS.


[59] SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting cs.CVPDF

Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li

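TL;DR: The paper introduces SeqAffordSplat, a large-scale benchmark for long-horizon affordance reasoning in 3D Gaussian Splatting (3DGS) environments, and SeqSplatNet, a framework that combines a large language model with a conditional decoder to predict affordance masks for multi-step tasks.

Details

Motivation: Existing 3D affordance reasoning methods are limited to single-object, single-step interactions and cannot handle the multi-object, long-horizon demands of complex real-world tasks.

Result: Experiments show the method achieves state-of-the-art performance on the proposed benchmark, extending affordance reasoning from single-step interactions to scene-level multi-step tasks.

Insight: Combining an LLM with 3D vision models can handle long-horizon affordance reasoning in complex scenes; pre-training and semantic feature fusion are key to the performance gains.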

Abstract: 3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales. Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.


[60] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions cs.CVPDF

Li Siyao, Yao Feng, Omid Tehari, Chen Change Loy, Michael J. Black

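TL;DR: The paper proposes a “half-physics” approach that embeds the SMPL-X human model into dynamic physical interactions, solving the inability of conventional kinematic models to interact realistically with objects while avoiding penetration and unrealistic object dynamics.

Details

Motivation: Current general-purpose 3D human models such as SMPL-X represent shape and pose efficiently but cannot interact physically, which leads to penetration and unrealistic dynamics during interaction.

Result: The method runs in real time, generalizes to any body shape and motion, and preserves the fidelity of the original kinematic motion.

Insight: A lightweight, training-free approach to physical interaction offers a new way to combine kinematic models with physics simulation.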

Abstract: While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a “half-physics” mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.


[61] Phi-Ground Tech Report: Advancing Perception in GUI Grounding cs.CV | cs.AI | cs.MMPDF

Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu

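TL;DR: The paper presents an empirical study of training GUI grounding models and introduces the Phi-Ground model family, which achieves state-of-the-art performance on multiple benchmarks, and shares training details and lessons learned.

Details

Motivation: Current end-to-end grounding models achieve less than 65% accuracy on challenging benchmarks, far from ready for deployment, so training methods must improve.

Result: In the end-to-end setting, Phi-Ground achieves state-of-the-art scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision.

Insight: The training details and failure cases discussed apply beyond GUI grounding and are also useful for other perception tasks.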

Abstract: With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from “Iron Man”, are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/


[62] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion cs.CVPDF

Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan

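TL;DR: MonoFusion reconstructs dynamic scenes from sparse views by fusing monocular reconstructions from each camera, avoiding the high cost and limitations of dense multi-view camera systems.

Details

Motivation: Dense multi-view camera systems such as Panoptic Studio are expensive and cannot be used in the wild. The authors aim to reconstruct dynamic scenes, such as repairing a bike or dancing, from sparse-view cameras (e.g., four static cameras).

Result: Experiments on PanopticStudio and Ego-Exo4D show that MonoFusion yields higher-quality reconstructions than prior work, particularly when rendering novel views. Code and data are publicly available.

Insight: Sparse-view dynamic reconstruction can be achieved by fusing monocular results, offering a practical path to low-cost capture in the wild.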

Abstract: We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.


[63] Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis cs.CVPDF

Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao

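TL;DR: The paper presents a video-to-4D generation framework that directly encodes Gaussian Splats (GS) and their temporal variations, and pairs this with a temporal-aware Diffusion Transformer to generate high-quality dynamic 3D content.

Details

Motivation: Generating high-quality dynamic 3D content from a single video is challenging because data construction is costly and jointly representing 3D shape, appearance, and motion is high-dimensional.

Result: Trained on the Objaverse dataset, the model shows superior generation quality compared with existing methods and generalizes well to in-the-wild video inputs despite being trained only on synthetic data.

Insight: Compressing high-dimensional animation data into a compact latent space and combining it with a diffusion model offers a new approach to high-quality dynamic 3D content generation.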

Abstract: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.


cs.CL [Back]

[64] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing cs.CL | cs.AIPDF

Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Qinfeng Song

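TL;DR: The paper introduces ElectriQ, a benchmark for evaluating how well large language models respond in power marketing scenarios. It builds a dialogue dataset covering six service categories with four evaluation metrics, and adds a domain knowledge base and a knowledge augmentation method. Experiments show that smaller models such as LLama3-8B, once fine-tuned and augmented, can surpass GPT-4o.

Details

Motivation: Current power marketing customer service systems, such as China’s 95598 hotline, suffer from slow responses and inflexible procedures, while large language models lack domain expertise and empathy and need targeted optimization.

Result: Experiments show that fine-tuned and augmented smaller models such as LLama3-8B surpass GPT-4o in professionalism and user-friendliness.

Insight: With domain knowledge augmentation and fine-tuning, small models can outperform general-purpose large models on specific tasks, underscoring the value of domain customization.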

Abstract: Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China’s 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.


[65] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms cs.CL | cs.AI | cs.LG | 68T07, 68T50PDF

Navid Yazdanjue, Morteza Rakhshaninejad, Hossein Yazdanjouei, Mohammad Sadegh Khorshidi, Mikko S. Niemela

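TL;DR: The paper presents a framework that combines fine-tuned language models with semi-supervised ensemble learning to detect and classify illicit marketplace content on the deep web, the dark web, and social platforms, using two-stage classification and multiple kinds of features to achieve strong performance.

Details

Motivation: Illicit marketplace activity is increasingly common on the deep web, dark web, and social platforms. Detecting and classifying it is difficult because labeled data is scarce, the language is complex and evolving, and the platforms are highly heterogeneous.

Result: Experiments on multiple datasets show the model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, outperforming several baselines.

Insight: Combining language model embeddings with hand-engineered features improves performance; semi-supervised ensemble learning is robust when labeled data is sparse; hierarchical classification suits complex tasks.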

Abstract: Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents. The second stage further classifies these into drug, weapon, or credential sales. Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
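
The abstract mentions entropy-based weighted voting without defining the weights. One plausible reading, sketched below under that assumption, gives each classifier a weight of one minus the normalized entropy of its predicted class distribution, so that confident classifiers count for more. The function names are hypothetical, at least two classes are assumed, and the paper's actual weighting may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_vote(predictions):
    """Combine per-classifier class probabilities.

    predictions: one probability list per classifier, all over the same
    classes (two or more). Returns (winning class index, combined probabilities).
    """
    n_classes = len(predictions[0])
    max_entropy = math.log(n_classes)
    # Confident (low-entropy) classifiers receive larger weights.
    weights = [1.0 - entropy(p) / max_entropy for p in predictions]
    total = sum(weights)
    if total <= 0.0:
        # Every classifier is maximally uncertain: fall back to a plain average.
        weights = [1.0] * len(predictions)
        total = float(len(predictions))
    combined = [
        sum(w * p[c] for w, p in zip(weights, predictions)) / total
        for c in range(n_classes)
    ]
    return combined.index(max(combined)), combined
```

With inputs such as `[[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]`, a simple majority vote would pick class 1, whereas this scheme sides with the single confident classifier and picks class 0.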


[66] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs cs.CLPDF

Victor Eiti Yamamoto, Hideaki Takeda

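TL;DR: The paper proposes a method for integrating all triple elements across heterogeneous knowledge graphs, focusing on the underexplored problem of context matching.

Details

Motivation: Existing knowledge graph integration methods focus mainly on schema and identity matching, while context matching has received little study. Because real-world knowledge graphs vary widely in source, size, and information density, existing methods fall short when integrating complex contexts.

Result: The approach is competitive with leading systems in the OAEI competition and with supervised methods, achieving high accuracy across diverse test cases.

Insight: Context matching is an important direction for knowledge graph integration, and combining label matching with triple matching can improve entity-matching accuracy.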

Abstract: Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.
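
The label-matching step can be illustrated with a small sketch that covers only the string manipulation and fuzzy matching parts, using Python's standard `difflib`. The normalization rules, the 0.85 threshold, and the function names are assumptions made for illustration; the paper's vector-similarity component and the triple-matching step are not shown.

```python
import re
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase a label and split camelCase, underscores, and hyphens into words."""
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    label = re.sub(r"[_\-]+", " ", label)
    return " ".join(label.lower().split())


def match_labels(source, target, threshold=0.85):
    """Map each source label to its most similar target label above the threshold."""
    matches = {}
    for s in source:
        best, best_score = None, 0.0
        for t in target:
            score = SequenceMatcher(None, normalize(s), normalize(t)).ratio()
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            matches[s] = (best, best_score)
    return matches
```

For example, `match_labels(["birthPlace", "spouse"], ["birth place", "population"])` pairs `birthPlace` with `birth place` and leaves `spouse` unmatched, since no target label is similar enough.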


[67] Theoretical Foundations and Mitigation of Hallucination in Large Language Models cs.CL | cs.AIPDF

Esmail Gumaan

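TL;DR: The paper gives a systematic theoretical analysis of hallucination in large language models (LLMs), defines a hallucination risk, and surveys detection and mitigation strategies. It aims to provide a theoretical foundation and practical guidelines for reducing hallucination in LLMs.

Details

Motivation: Hallucination, the generation of content unfaithful to the input or to real-world facts, is a serious problem that limits the usefulness and reliability of LLMs. The author aims to address it systematically through theoretical analysis and a review of methods.

Result: The paper derives bounds on hallucination risk, proposes a unified detection and mitigation workflow, and outlines evaluation protocols with recommended datasets and metrics.

Insight: The theoretical framework offers a new perspective on hallucination, and the practical methods provide tools for deploying LLMs reliably. The work stresses integrating multiple strategies to address hallucination comprehensively.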

Abstract: Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a hallucination risk for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.


[68] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey cs.CL | cs.AIPDF

Jindong Li, Yali Fu, Jiahong Liu, Linxiao Cao, Wei Ji

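TL;DR: This is the first survey to systematically examine discrete tokenization for multimodal large language models (LLMs). It proposes a taxonomy, analyzes 8 representative vector quantization (VQ) techniques, and discusses their algorithmic principles, training dynamics, and integration challenges with LLM pipelines.

Details

Motivation: With the rapid development of LLMs, there is a growing need to convert continuous multimodal data into discrete representations suitable for language-based processing, yet a systematic survey of discrete tokenization techniques has been missing.

Result: The survey reviews how VQ techniques are applied in LLM-based systems and identifies key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.

Insight: Future directions include dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning, which would help build efficient and generalizable multimodal systems.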

Abstract: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: https://github.com/jindongli-Ai/LLM-Discrete-Tokenization-Survey.


[69] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents cs.CL | cs.AIPDF

Haoran Sun, Shaoning Zeng

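TL;DR: The paper proposes a Hierarchical Memory (H-MEM) architecture to improve the long-term reasoning of LLM Agents. It organizes memory by multi-level semantic abstraction and introduces an index-based routing mechanism, improving memory retrieval efficiency.

Details

Motivation: Memory mechanisms in existing LLM Agents fall short in structured organization and efficient retrieval, which limits long-term reasoning. The paper aims to address this with a hierarchical memory architecture.

Result: On five task settings from the LoCoMo dataset, H-MEM consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.

Insight: Multi-level memory organization and index-based routing can improve both the memory retrieval efficiency and the long-term reasoning of LLM Agents.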

Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graph, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
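
The layer-by-layer retrieval described above can be sketched in a few lines. This is a toy illustration rather than the H-MEM implementation: the `MemoryNode` class, the `route` function, the cosine scoring, and the top-`k` selection per layer are all assumptions. The point it shows is that each layer is scored only over the children referenced by the nodes selected in the layer above, not over the whole layer.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    text: str
    vector: list
    # Indices of semantically related sub-memories in the next layer.
    children: list = field(default_factory=list)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route(layers, query, k=1):
    """Retrieve from the last layer by following stored indices.

    layers[0] is the most abstract layer; each later layer is more specific.
    """
    candidates = list(range(len(layers[0])))
    for depth, layer in enumerate(layers):
        ranked = sorted(
            candidates, key=lambda i: cosine(layer[i].vector, query), reverse=True
        )[:k]
        if depth == len(layers) - 1:
            return [layer[i] for i in ranked]
        # Only the children of the selected nodes are scored in the next layer.
        candidates = sorted({c for i in ranked for c in layer[i].children})
    return []
```

With a fixed `k` and a bounded number of children per node, the number of similarity computations grows with the depth of the hierarchy rather than with the total number of stored memories, which is the efficiency argument the abstract makes.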


[70] Multi-Relation Extraction in Entity Pairs using Global Context cs.CL | cs.IRPDF

Nilesh, Atul Gupta, Avinash C Panday

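TL;DR: The paper proposes a novel input embedding approach that captures where entities appear throughout a document to build global context, so that relations between entities can be predicted more accurately in document-level relation extraction.

Details

Motivation: Existing methods focus only on the sentences where entities are mentioned and fail to capture the global document context, which makes relation extraction inaccurate.

Result: The method was validated on three benchmark datasets: DocRED, Re-DocRED, and REBEL.

Insight: Global context modeling and multi-sentence reasoning are important for document-level relation extraction.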

Abstract: In document-level relation extraction, entities may appear multiple times in a document, and their relationships can shift from one context to another. Accurate prediction of the relationship between two entities across an entire document requires building a global context spanning all relevant sentences. Previous approaches have focused only on the sentences where entities are mentioned, which fails to capture the complete document context necessary for accurate relation extraction. Therefore, this paper introduces a novel input embedding approach to capture the positions of mentioned entities throughout the document rather than focusing solely on the span where they appear. The proposed input encoding approach leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. The performance of the proposed method has been tested on three benchmark relation extraction datasets, namely DocRED, Re-DocRED, and REBEL. The experimental results demonstrated that the proposed method accurately predicts relationships between entities in a document-level setting. The proposed research also has theoretical and practical implications. Theoretically, it advances global context modeling and multi-sentence reasoning in document-level relation extraction. Practically, it enhances relationship detection, enabling improved performance in real-world NLP applications requiring comprehensive entity-level insights and interpretability.


[71] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation cs.CL

Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng

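TL;DR: The paper proposes the PRGB benchmark for fine-grained evaluation of language model capabilities in retrieval-augmented generation (RAG), using multi-dimensional analysis and a placeholder method to decouple the model's parametric knowledge from external knowledge.

Details

Motivation: Existing RAG benchmarks mostly focus on overall system performance and lack fine-grained evaluation of language model capabilities, especially document utilization.

Result: Experiments show that current language models have limited generation capabilities in RAG, falling short particularly in error resilience and context faithfulness.

Insight: PRGB provides a reproducible evaluation framework for developing more reliable and efficient RAG systems and highlights the value of fine-grained analysis of model capabilities.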

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM’s ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs’ roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM’s parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system’s generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available at https://github.com/Alipay-Med/PRGB.


[72] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding cs.CL | cs.AI

Xi Chen, Aske Plaat, Niki van Stein

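TL;DR: Using sparse autoencoders and activation patching, the paper studies the internal mechanisms of chain-of-thought (CoT) prompting in language models and finds that CoT features are more modular and interpretable in higher-capacity models such as Pythia-2.8B.

Details

Motivation: Although CoT prompting improves language model accuracy on multi-step tasks, it remains unclear whether the generated “thoughts” truly reflect the internal reasoning process; the paper aims to answer this through a causal study.

Result: In Pythia-2.8B, patching in CoT features significantly raises answer log-probabilities (from 1.2 to 4.3) and also increases activation sparsity and feature interpretability scores.

Insight: CoT prompting is more effective in high-capacity language models, where it induces more modular and interpretable internal structure, supporting its role as a structured prompting method.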

Abstract: Chain-of-thought (CoT) prompting boosts the accuracy of Large Language Models on multi-step tasks, yet whether the generated “thoughts” reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model’s confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
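
The feature swap and the patch-curve mentioned in this abstract can be pictured with a toy sketch. It is not the paper's pipeline: the feature vectors are made-up numbers, `score` is a hypothetical linear readout standing in for an answer log-probability, and ranking features by the size of the CoT versus noCoT difference is an assumed selection rule.

```python
def patch_features(no_cot: list[float], cot: list[float], indices: list[int]) -> list[float]:
    """Copy the selected feature activations from the CoT run into the noCoT run."""
    patched = list(no_cot)
    for i in indices:
        patched[i] = cot[i]
    return patched


def top_k_by_difference(no_cot: list[float], cot: list[float], k: int) -> list[int]:
    """Assumed selection rule: features whose activation differs most between the two runs."""
    order = sorted(range(len(cot)), key=lambda i: abs(cot[i] - no_cot[i]), reverse=True)
    return order[:k]


def patch_curve(no_cot, cot, score, ks):
    """Score the patched run for each patch size k, giving one point per k."""
    return [
        (k, score(patch_features(no_cot, cot, top_k_by_difference(no_cot, cot, k))))
        for k in ks
    ]


# Made-up sparse-feature activations for one prompt under each prompting style.
no_cot_feats = [0.0, 0.2, 0.0, 1.0]
cot_feats = [2.0, 0.2, 0.5, 1.0]

# Hypothetical linear readout; a real study would rerun the model on the patched activations.
readout = [1.0, 0.0, 0.5, 0.1]


def score(feats: list[float]) -> float:
    return sum(w * f for w, f in zip(readout, feats))


curve = patch_curve(no_cot_feats, cot_feats, score, ks=[0, 1, 2])
```

Replacing `top_k_by_difference` with a random choice of indices would give the kind of random-feature baseline the abstract compares against.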


[73] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow cs.CL | cs.CV | cs.MA

Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou

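TL;DR: The paper proposes EH-Benchmark, which focuses on evaluating hallucinations in MLLMs for ophthalmology, and a multi-stage agent-driven framework that significantly reduces hallucinations and improves diagnostic accuracy and reliability.

Details

Motivation: Existing ophthalmic MLLMs suffer from severe hallucinations caused by insufficient knowledge, limited visual localization and reasoning capabilities, and data scarcity, which undermines precise disease diagnosis; current medical benchmarks cannot effectively evaluate or address these problems.

Result: Experiments show that the framework significantly reduces both types of hallucinations and improves model accuracy, interpretability, and reliability.

Insight: Categorizing hallucinations by task and error type, combined with a multi-stage agent framework, offers a new approach to addressing MLLM hallucinations in the medical domain.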

Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs’ hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.


[74] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection cs.CL | cs.SI

Shalini Jangra, Suparna De, Nishanth Sastry, Saeed Fadaei

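TL;DR: The paper proposes a method for generating synthetic datasets for detecting self-disclosed personal information (PII) on social media, addressing the shortage of labeled data. Synthetic data is generated with three LLMs and validated for comparability with, unlinkability to, and indistinguishability from the original data.

Details

Motivation: Social media contains large amounts of self-disclosed personal information (PII), which can lead to privacy risks and online harms. Research is limited by the lack of open-source labeled datasets, so a method for generating synthetic data that can be shared safely is needed.

Result: The generated synthetic dataset performs well in utility tests and can replace the original data for model training while protecting user privacy.

Insight: Synthetic data generation offers a viable solution for privacy-sensitive research and can effectively support reproducible research, especially when labeled data is lacking.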

Abstract: Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users’ Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.


[75] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification cs.CL | q-fin.GN

Baptiste Lefort, Eric Benhamou, Beatrice Guez, Jean-Jacques Ohana, Ethan Setrouk

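TL;DR: This paper proposes a novel hierarchical framework for portfolio optimization that combines lightweight large language models (LLMs) with deep reinforcement learning (DRL), integrating sentiment signals from financial news with traditional market indicators.

Details

Motivation: To improve portfolio optimization performance by integrating financial news sentiment with traditional market data, while addressing the challenge of multimodal data fusion.

Result: On test data from 2018 to 2024, the framework achieves an annualized return of 26% and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks.

Insight: Combining sentiment analysis with market data can significantly improve investment performance, and the hierarchical reinforcement learning architecture helps stability and scalability.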

Abstract: This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
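
For readers unfamiliar with the two headline metrics, the sketch below shows one common way to compute an annualized return and a Sharpe ratio from a series of daily returns. The 252-day trading year, the zero default risk-free rate, and the function names are assumptions for illustration; the abstract does not state which conventions the authors used.

```python
from math import sqrt
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed number of trading days per year


def annualized_return(daily_returns: list[float]) -> float:
    """Compound the daily returns, then rescale the growth factor to one year."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth ** (TRADING_DAYS / len(daily_returns)) - 1.0


def sharpe_ratio(daily_returns: list[float], daily_risk_free: float = 0.0) -> float:
    """Mean excess return divided by its sample standard deviation, annualized."""
    excess = [r - daily_risk_free for r in daily_returns]
    return mean(excess) / stdev(excess) * sqrt(TRADING_DAYS)


# A steady 0.1% daily gain over one assumed trading year.
steady_year = annualized_return([0.001] * TRADING_DAYS)
```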


[76] Augmented Vision-Language Models: A Systematic Review cs.CL | cs.AI

Anthony C Davis, Burhan Sadiq, Tianmin Shu, Chien-Ming Huang

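TL;DR: This is a systematic review of augmented vision-language models, examining how integrating external symbolic information systems can improve vision-language understanding and address traditional models' limitations in interpretability, adaptability, and logical reasoning.

Details

Motivation: Traditional vision-language models perform well when trained on large-scale unsupervised data, but suffer from poor interpretability, difficulty in dynamically updating data, and weak logical reasoning. Combining them with neural-symbolic systems can give models stronger reasoning and memory abilities.

Result: The review finds that neural-symbolic models integrating external symbolic information systems can significantly improve interpretability, adaptability, and logical reasoning.

Insight: Integrating neural-symbolic systems offers a practical and efficient way to address the limitations of traditional vision-language models, particularly for dynamic information updates and complex reasoning tasks.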

Abstract: Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.


[77] Deep Learning Approaches for Multimodal Intent Recognition: A Survey cs.CL | cs.AI

Jingwei Zhao, Yuhua Wen, Qifei Li, Minchi Hu, Yingying Zhou

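TL;DR: This paper surveys deep learning for multimodal intent recognition (MIR), covering the shift from unimodal to multimodal techniques, datasets, methods, applications, and current challenges.

Details

Motivation: As demand for natural human-computer interaction grows, intent recognition has expanded from traditional text to multimodal data (such as audio, vision, and physiological signals), with deep learning, especially Transformer-based models, as a key driver.

Result: Summarizes existing techniques, datasets, and performance in multimodal intent recognition, and points out the limitations and open problems of current research.

Insight: Multimodal data fusion and Transformer models are key to advancing intent recognition, but cross-modal alignment and the scarcity of labeled data remain major challenges.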

Abstract: Intent recognition aims to identify users’ underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.


[78] Trusted Knowledge Extraction for Operations and Maintenance Intelligence cs.CL | cs.AI

Kathleen Mealey, Jonathan A. Karr Jr., Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman II

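TL;DR: The paper discusses the challenges of extracting operations and maintenance intelligence from organizational data, presents a knowledge graph construction approach, and evaluates NLP tools and large language models, focusing on trusted applications in the aircraft industry.

Details

Motivation: Organizational data faces a tension between confidentiality and integration, and NLP tools perform poorly in the operations and maintenance domain, which motivates research on trusted knowledge extraction.

Result: Existing tools are found to have significant performance limitations; the paper discusses the challenges of trusted NLP and LLM tools and their technical readiness.

Insight: In critical industries such as aviation, the technical readiness of trusted NLP and LLM tools still needs to improve, and further optimization is required to meet mission-critical needs.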

Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.


[79] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents cs.CL | cs.AI | 68T50 | I.2.7

Sumit Soman, H. G. Ranjani, Sujoy Roychowdhury, Venkata Dharma Surya Narayana Sastry, Akshat Jain

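TL;DR: The paper proposes a graph-based approach to multimodal question answering (QA) from flowcharts in telecom documents. By using graph representations of flowcharts generated by visual large language models (VLMs) and integrating them into a text-based RAG system, it enables image retrieval while reducing inference-time cost.

Details

Motivation: QA over technical documents often involves information in flowcharts, which traditional text-based retrieval-augmented generation (RAG) systems struggle to handle. A multimodal approach combining images and text is therefore needed.

Result: The graph representations have lower edit distance to the ground truth, demonstrating their robustness. In the QA task, text embedding models combined with the graph representations achieve good retrieval performance, validating the approach.

Insight: Multimodal representations, especially graph structures, can effectively capture the information in flowcharts and thereby improve QA performance. Reducing reliance on expensive VLM inference also provides a cost advantage for real-world deployment.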

Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark the same on a QA dataset created based on proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM model have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.


[80] PARROT: An Open Multilingual Radiology Reports Dataset cs.CL | cs.AI

Bastien Le Guellec, Kokou Adambounou, Lisa C Adams, Thibault Agripnidis, Sung Soo Ahn

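TL;DR: PARROT is a multilingual, open-access dataset of radiology reports for testing natural language processing (NLP) applications, containing 2,658 fictional reports covering 13 languages and multiple imaging modalities.

Details

Motivation: To address the lack of multilingual data and the privacy restrictions facing radiology NLP applications by providing an open testing resource.

Result: The dataset contains 2,658 reports covering multiple modalities and languages; participants distinguished human-written from AI-generated reports with 53.9% accuracy, with radiologists performing better.

Insight: Fictional reports can be used for NLP testing without violating privacy, and multilingual data supports the development of NLP applications worldwide.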

Abstract: Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.


[81] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes cs.CL | cs.AI

Rui Jiao, Yue Zhang, Jinku Li

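TL;DR: The RELIANCE framework uses a specialized fact-checking classifier, reinforcement learning with multi-dimensional rewards, and model activation analysis to significantly improve the factual accuracy of LLMs' intermediate reasoning steps while maintaining or improving performance.

Details

Motivation: LLMs make factual errors in intermediate reasoning steps even when the final answer is correct, which can lead to misleading decisions in high-stakes domains such as healthcare and law; improving the factual accuracy of reasoning is therefore urgent.

Result: RELIANCE improves factual accuracy by up to 49.90% while maintaining or improving performance on benchmarks such as Math-500.

Insight: Activation analysis reveals how factuality improvements change the model's reasoning trajectories, laying the groundwork for future training methods based on activation-guided optimization.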

Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.


[82] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology cs.CL | cs.CY | cs.LG | stat.AP

Paul Minchella, Loïc Verlingue, Stéphane Chrétien, Rémi Vaucher, Guillaume Metzler

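TL;DR: SigBERT combines medical reports with rough path signature theory in a temporal survival analysis framework that improves risk estimation by extracting text embeddings and path features.

Details

Motivation: Electronic medical reports contain rich information, but existing survival analysis methods struggle to handle their complex sequential structure; SigBERT aims to solve this problem.

Result: Achieves a C-index of 0.75 on a real-world oncology dataset, validating the method.

Insight: Rough path signature theory can effectively capture the temporal dynamics of medical text, offering a new approach to survival analysis.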

Abstract: Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
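
To make the signature step more concrete, here is a minimal sketch of a path signature truncated at depth 2 for a piecewise-linear path, such as a patient's sequence of sentence-embedding coordinates. The truncation depth, the function name, and the toy two-dimensional path are assumptions; the abstract does not say which depth or which library the authors used.

```python
def signature_depth2(path: list[list[float]]) -> tuple[list[float], list[list[float]]]:
    """Return the level-1 and level-2 signature terms of a piecewise-linear path.

    Level 1 is the total increment per coordinate. Level 2 holds the iterated
    integrals S[i][j], built up one segment at a time with Chen's identity.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for prev, curr in zip(path, path[1:]):
        step = [c - p for p, c in zip(prev, curr)]
        for i in range(dim):
            for j in range(dim):
                # Appending a linear segment adds (signature so far) x step
                # plus the segment's own level-2 term, step x step / 2.
                level2[i][j] += level1[i] * step[j] + 0.5 * step[i] * step[j]
        for i in range(dim):
            level1[i] += step[i]
    return level1, level2


# Toy path: move right, then up. The order of the moves shows up in level 2.
s1, s2 = signature_depth2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
levy_area = 0.5 * (s2[0][1] - s2[1][0])
```

Level 1 alone cannot tell "right then up" from "up then right", whereas the antisymmetric part of level 2 (the Lévy area) changes sign between the two. This order sensitivity is what makes signature terms usable as features of a temporal sequence.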


[83] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies cs.CL | stat.ME

Shirley V Wang, Georg Hahn, Sushama Kattinakere Sreedhara, Mufaddal Mahesri, Haritha S. Pillai

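TL;DR: The paper proposes a process that uses natural language processing (NLP) and multi-wave adaptive sampling to speed up validation of code-based algorithms, reducing manual annotation time in large database studies.

Details

Motivation: Manually annotating electronic health records (EHRs) requires substantial time and resources, limiting how efficiently code-based algorithms can be validated in large database studies.

Result: NLP-assisted annotation reduces review time by 40%, and the stopping rule would avoid reviewing 77% of charts, with limited impact on the precision of performance metrics.

Insight: The process can significantly improve the efficiency of validating code-based algorithms and provides a practical tool for assessing reliability in large database studies.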

Abstract: Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
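
The multi-wave stopping idea can be sketched as follows. The abstract only says that review stops once performance characteristics are known "with sufficient precision", so the concrete rule here is an assumption: review charts in fixed-size waves and stop once the 95% Wilson interval for the positive predictive value is narrower than a chosen width. The function names, wave size, width threshold, and simulated labels are all hypothetical.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def review_in_waves(labels: list[bool], wave_size: int, max_width: float):
    """Review charts wave by wave and stop once the interval is narrow enough.

    `labels[i]` is True when manual review confirms the code-based outcome
    for chart i. Returns (charts reviewed, PPV estimate, interval).
    """
    if not labels:
        raise ValueError("need at least one chart to review")
    reviewed = confirmed = 0
    low, high = 0.0, 1.0
    while reviewed < len(labels):
        wave = labels[reviewed:reviewed + wave_size]
        reviewed += len(wave)
        confirmed += sum(wave)
        low, high = wilson_interval(confirmed, reviewed)
        if high - low <= max_width:
            break  # assumed stopping criterion: interval width below threshold
    return reviewed, confirmed / reviewed, (low, high)


# Simulated review outcomes: 200 sampled charts, 90% of which confirm the outcome.
simulated = ([True] * 9 + [False]) * 20
n_reviewed, ppv, interval = review_in_waves(simulated, wave_size=50, max_width=0.15)
fraction_avoided = 1 - n_reviewed / len(simulated)
```

In this simulated run the rule stops after the second wave, so half of the sampled charts are never reviewed. The figure depends entirely on the made-up inputs and is unrelated to the 77% reported in the abstract.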


[84] Opacity as Authority: Arbitrariness and the Preclusion of Contestation cs.CL | cs.AI | cs.CY

Naomi Omeonga wa Kayembe

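TL;DR: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions.

Details

Motivation: Existing critical traditions conflate arbitrariness with injustice, whereas this article treats it as a semiotic trait and reveals its functional role in linguistic, legal, and social systems.

Result: Reveals the design logic of arbitrariness as structural opacity that protects authority from accountability, and offers a new perspective for AI explainability research.

Insight: Arbitrariness is a neutral instrument of control, used for maintaining authority as well as for interpersonal care; this finding opens a new path for cross-domain systems analysis.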

Abstract: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure’s concept of l’arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the “Motivation -> Constatability -> Contestability” chain, arguing that motivation functions as a crucial interface rendering an act’s logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like “immotivization” or “Conflict Lateralization” (exemplified by “the blur of the wolf drowned in the fish”), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon’s entropy model, the paper formalizes arbitrariness as A = H(L|M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
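
The formalization A = H(L|M) uses the standard conditional entropy, H(L|M) = -Σ p(l, m) log2 p(l | m). The abstract does not define L and M, so the sketch below simply treats them as two discrete random variables given by a joint probability table; the function name and the toy distributions are assumptions for illustration.

```python
from math import log2


def conditional_entropy(joint: dict[tuple[str, str], float]) -> float:
    """H(L|M) in bits, from a joint distribution {(l, m): probability}."""
    marginal_m: dict[str, float] = {}
    for (_, m), p in joint.items():
        marginal_m[m] = marginal_m.get(m, 0.0) + p
    entropy = 0.0
    for (_, m), p in joint.items():
        if p > 0:
            # p(l | m) = p(l, m) / p(m)
            entropy -= p * log2(p / marginal_m[m])
    return entropy


# M fully determines L: no uncertainty about L remains once M is known.
determined = conditional_entropy({("a", "x"): 0.5, ("b", "y"): 0.5})

# L is independent of M and uniform over two values: one full bit remains.
independent = conditional_entropy(
    {("a", "x"): 0.25, ("b", "x"): 0.25, ("a", "y"): 0.25, ("b", "y"): 0.25}
)
```

Under this reading, H(L|M) is zero when knowing M removes all uncertainty about L, and it grows as M tells an observer less about L.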


[85] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans cs.CL

Ananya Sadana, Yash Kumar Lal, Jiawei Zhou

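TL;DR: ISO-Bench is a new benchmark for evaluating multimodal models' ability to reason causally across vision and text. Existing frontier vision-language models perform poorly on this task: the best zero-shot F1 is only 0.57, far below human performance (0.98).

Details

Motivation: Understanding causal relationships across modalities is a core challenge for multimodal models in real-world environments. Current models' ability in this area has not been adequately evaluated, so a dedicated benchmark is needed to fill the gap.

Result: Current models perform poorly: the best zero-shot F1 is 0.57, and chain-of-thought reasoning raises it only to 0.62, far below the human score of 0.98.

Insight: Multimodal models still have substantial room for improvement in cross-modal causal reasoning; future work could focus on better combining visual and textual information to improve reasoning.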

Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.


[86] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks cs.CL | cs.AI | cs.LG

Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding

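TL;DR: The paper introduces the GEAK framework, which uses frontier large language models (LLMs) to generate high-performance Triton code for AMD GPUs and achieves significant performance gains through inference-time compute scaling.

Details

Motivation: As deep learning workloads grow in complexity and diversity, low-level kernel development needs to be automated to meet performance and productivity demands. AI-driven GPU code generation has become a focus of both industry and academia.

Result: GEAK reaches 63% correctness and an execution speedup of up to 2.59x, significantly outperforming the baselines of directly prompting LLMs or using Reflexion-based pipelines.

Insight: GEAK demonstrates the potential of agent-based code generation for accelerating the adoption of diverse hardware platforms and improving kernel performance.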

Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to 63% and execution speedup of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.


[87] Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs cs.CL | cs.LG | q-bio.QM

Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran

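TL;DR: The paper proposes a new framework called TAP-GPT, which uses the TableGPT2 model and few-shot learning to enable early diagnosis of Alzheimer’s disease (AD) from structured biomarker data.

Details

Motivation: Early diagnosis of Alzheimer’s disease depends on complex biomarker analysis; with their multimodal integration and few-shot reasoning capabilities, LLMs offer a new way to address this problem.

Result: TAP-GPT outperforms general-purpose LLMs and a specialized tabular foundation model (TFM) on the AD diagnosis task.

Insight: Demonstrates the potential of LLMs for analyzing structured biomedical data, paving the way for future multi-agent frameworks.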

Abstract: Early and accurate diagnosis of Alzheimer’s disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer’s Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.


[88] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication cs.CL

Sneha Oram, Pushpak Bhattacharyya

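TL;DR: The paper studies the pragmatic reasoning capability of large language models (LLMs) in the mental health domain, introduces the P-ReMe dataset, and proposes modified definitions of the pragmatic phenomena of implicature and presupposition. Experiments show that Mistral and Qwen perform well in this domain. It also studies how LLMs handle mental health stigma and finds that Claude-3.5-haiku behaves more responsibly.

Details

Motivation: Personalized chatbots and explainability techniques for mental health are developing rapidly, but pragmatic reasoning and reasoning over dialogue discourse have not been adequately studied. The paper aims to fill this gap.

Result: Mistral and Qwen stand out on the pragmatic reasoning tasks; Claude-3.5-haiku handles mental health stigma more responsibly than the other models.

Insight: LLMs have some pragmatic reasoning ability in the mental health domain, but performance varies considerably across models; handling stigma calls for more socially responsible model design.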

Abstract: There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce the P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models: Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.


[89] Unveiling Super Experts in Mixture-of-Experts Large Language Models cs.CLPDF

Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie

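TL;DR: This paper identifies and studies a class of critical experts in MoE large language models, termed Super Experts (SEs), and shows their important role in forward inference and their large effect on model performance.

Details

Motivation: Existing expert-level compression techniques for MoE LLMs mostly rely on empirical criteria and lack a deeper understanding of how expert importance varies. This study sets out to find the critical experts in model inference and explain how they work.

Result: Pruning SEs causes a significant drop in model performance (mathematical reasoning is hit particularly hard) and disrupts the distribution of attention scores; SEs are critical to task performance.

Insight: MoE LLMs rely on SEs for key mechanisms such as attention sinks, which shape how attention scores are distributed. The heterogeneous importance of experts offers a new angle on model compression and optimization.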

Abstract: Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model’s forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.


[90] What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content cs.CLPDF

Alfio Ferrara, Sergio Picascia, Laura Pinnavaia, Vojimir Ranitovic, Elisabetta Rocchetti

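TL;DR: This paper empirically analyzes the implicit filtering behavior of GPT-4o-mini when paraphrasing sensitive content, finds that it systematically tones such content down, and evaluates the zero-shot ability of LLMs to classify sentence sensitivity.

Details

Motivation: Prior work has focused on explicitly training models to filter sensitive content; whether LLMs implicitly sanitize language without explicit instructions has received little attention. This paper aims to fill that gap.

Result: GPT-4o-mini systematically shifts content toward less sensitive classes, with substantial reductions in derogatory and taboo language. The paper also compares the zero-shot sensitivity classification of LLMs against traditional methods.

Insight: LLMs can implicitly filter sensitive content without explicit training, showing a form of "self-censorship" that suggests a new direction for content moderation.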

Abstract: Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.


[91] Text-to-SQL Task-oriented Dialogue Ontology Construction cs.CL | cs.AI | cs.DB | cs.IRPDF

Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin

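TL;DR: The paper proposes TeQoDO, which uses the SQL programming ability of large language models to build task-oriented dialogue ontologies without supervision, improving explainability and controllability.

Details

Motivation: Existing methods need manual labels or supervised training to build ontologies, which limits scalability and efficiency. The parametric knowledge of LLMs lacks explainability and trustworthiness and needs to be paired with an external database structure.

Result: TeQoDO outperforms transfer learning approaches, and its ontology is competitive on a downstream dialogue state tracking task. Extended experiments on Wikipedia and ArXiv datasets show that it scales to much larger ontologies.

Insight: Dialogue theory supplied in the prompt is key to ontology construction, which points to a new way of improving LLM explainability.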

Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.


[92] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models cs.CL | cs.AI | cs.CV | I.2.8; I.2.10PDF

Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin

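TL;DR: The paper introduces MPCC, the first benchmark to systematically evaluate the planning ability of multimodal large language models under complex constraints. Experiments show that existing models perform poorly across constraint types, underscoring the importance of constraint-aware reasoning.

Details

Motivation: Current benchmarks cannot directly assess multimodal planning ability and lack complex cross-modal constraints. MPCC targets both problems to advance multimodal planning research.

Result: Closed-source models produce feasible plans only 21.3% of the time, and open-source models average below 11%. Models are sensitive to constraint complexity, and traditional multimodal prompting strategies fail in multi-constraint scenarios.

Insight: Real-world applications require better constraint-aware reasoning in MLLMs; MPCC provides a standardized evaluation framework for multimodal planning research.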

Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs’ ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.


[93] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models cs.CL | cs.AIPDF

Ailiang Lin, Zhuoyun Li, Kotaro Funakoshi

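TL;DR: Causal2Vec is an embedding method for decoder-only large language models (LLMs). It pre-encodes contextual information and improves hidden-state pooling, raising semantic embedding performance while lowering computational overhead.

Details

Motivation: Existing methods either remove the causal attention mask or depend on extra input text, which can weaken semantic extraction or raise computational cost. A solution is needed that stays efficient while improving embedding quality.

Result: State-of-the-art performance on MTEB among models trained solely on publicly available retrieval datasets, with up to 85% shorter sequences and up to 82% less inference time than the best-performing methods.

Insight: Lightweight pre-encoding combined with a better pooling strategy can substantially improve the embedding ability of decoder-only LLMs without sacrificing efficiency or adding computational burden.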

Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model’s ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM’s input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
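
The pooling step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the final-layer hidden states are available as a list of per-token vectors, that the Contextual token sits at position 0, and that the EOS token is last. The function name `causal2vec_embedding` and both index parameters are hypothetical.

```python
from typing import List, Sequence


def causal2vec_embedding(
    hidden_states: Sequence[Sequence[float]],
    contextual_index: int = 0,   # assumption: Contextual token is prepended at position 0
    eos_index: int = -1,         # assumption: EOS is the last token in the sequence
) -> List[float]:
    """Concatenate the last hidden states of the Contextual and EOS tokens."""
    contextual_state = list(hidden_states[contextual_index])
    eos_state = list(hidden_states[eos_index])
    # The final embedding has twice the hidden size of the LLM.
    return contextual_state + eos_state


# Toy hidden states for a three-token sequence with hidden size 2.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
embedding = causal2vec_embedding(hidden)
print(embedding)
```

Compared with plain last-token pooling, which would keep only the EOS state, the concatenation gives the Contextual token's state a direct path into the embedding.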


[94] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration cs.CLPDF

Ante Wang, Yujie Lin, Jingyao Liu, Suhang Wu, Hao Liu

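TL;DR: The paper proposes a paradigm called "proactive critical thinking", in which AI models actively ask users for missing or clarifying information in order to solve their problems better. To evaluate it, the authors build two new benchmarks, GSM-MC and GSM-MCE, and show that reinforcement learning substantially improves model performance on this task.

Details

Motivation: Prior work on critical thinking focuses on passively rejecting problematic queries and neglects constructive steps toward solving the user's problem. The authors propose proactive critical thinking to support more effective human-AI collaboration.

Result: Reinforcement learning substantially raises accuracy on GSM-MC; for example, Qwen3-1.7B improves from 0.15% to 73.98%.

Insight: Proactive critical thinking is a key direction for improving collaboration between AI and humans, and reinforcement learning shows strong potential on this task.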

Abstract: Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting the Qwen3-1.7B’s accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.


[95] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations cs.CL | cs.AIPDF

Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera

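TL;DR: This paper explores fine-tuning large language models (LLMs) to enforce access control based on user roles in organizational settings. It proposes three modeling strategies and builds two datasets to evaluate model performance and robustness to security threats.

Details

Motivation: Existing LLM safety methods usually assume uniform access and ignore role-specific access constraints. In organizations, role-based access control requires model behavior that is both secure and context-aware.

Result: Models are evaluated across organizational structures and under security threats such as prompt injection, role mismatch, and jailbreak attempts, with an analysis of each strategy's strengths and weaknesses.

Insight: Fine-tuned LLMs can implement role-based access control, but robustness and generalization still need improvement. Role-conditioned generation is the more flexible approach but may need stronger defenses on the security side.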

Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.


[96] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains cs.CLPDF

Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu

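TL;DR: The paper introduces the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB) for evaluating the safety and effectiveness of medical large language models (LLMs). It defines 30 criteria developed through clinical expert consensus; testing shows that domain-specific medical LLMs outperform general-purpose models on both safety and effectiveness.

Details

Motivation: LLMs hold promise for clinical decision support, but evaluating their safety and effectiveness remains a major challenge, and standardized benchmarks are lacking.

Result: The six tested LLMs average 57.2% overall, 54.7% on safety, and 62.3% on effectiveness; performance drops by 13.3% in high-risk scenarios, and domain-specific models perform better.

Insight: Domain-specific medical LLMs are more consistent in clinical applications, and CSEDB gives a basis for identifying risks and directions for improvement when deploying medical LLMs.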

Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.


[97] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning cs.CLPDF

Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan

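TL;DR: This paper proposes Med-R$^3$, a medical retrieval-augmented reasoning framework driven by progressive reinforcement learning. By jointly optimizing retrieval and reasoning, it substantially improves LLM performance in the medical domain.

Details

Motivation: In medical scenarios, existing methods tend to optimize retrieval or reasoning in isolation rather than jointly, and they rely heavily on supervised fine-tuning (SFT), which limits generalization. In addition, reinforcement learning methods designed for general domains do not adequately capture the specific demands of medicine.

Result: Med-R$^3$ substantially improves model performance: LLaMA3.1-8B-Instruct + Med-R$^3$ surpasses GPT-4o-mini by 3.93%, and Qwen2.5-14B + Med-R$^3$ gains 13.53%.

Insight: Retrieval-augmented reasoning in medicine requires joint optimization of retrieval and reasoning, and progressive reinforcement learning adapts better to the domain's complexity and specific demands.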

Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored to improve retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce Med-R$^3$, a Medical Retrieval-augmented Reasoning framework driven by progressive Reinforcement learning. In this framework, we first develop the model’s ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model’s retrieval and reasoning coordination. Extensive experiments indicate that Med-R$^3$ could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-sourced GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.


[98] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text cs.CLPDF

Alva West, Luodan Zhang, Liuliu Zhang, Minjun Zhu, Yixuan Weng

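TL;DR: T-Detect is a method for detecting adversarial machine-generated text. It replaces conventional Gaussian normalization with the heavy-tailed Student's t-distribution, which makes it more robust to statistical outliers, and it performs well on several benchmarks.

Details

Motivation: Existing zero-shot detectors implicitly assume Gaussian distributions and struggle with the heavy-tailed statistics of adversarial or non-native English text, which degrades detection performance.

Result: On the RAID benchmark, AUROC improves by up to 3.9% in targeted domains; when integrated into a two-dimensional detection framework (CT), the method reaches a state-of-the-art AUROC of 0.926 on the Books domain.

Insight: Adversarial text is markedly leptokurtic (sharply peaked with heavy tails), so the conventional Gaussian assumption does not hold and heavy-tailed statistical models suit this detection task better.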

Abstract: The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student’s t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically-justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at https://github.com/ResearAI/t-detect.
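
To make the contrast between Gaussian and heavy-tailed normalization concrete, here is a rough sketch. The abstract does not give the exact T-Detect formula, so this is only one plausible reading: it rescales a standard z-score by the standard deviation of a Student's t-distribution, which is sqrt(nu / (nu - 2)) for nu > 2 degrees of freedom. The function names, the `nu` default, and the input statistics are all hypothetical.

```python
import math


def gaussian_score(log_lik: float, expected: float, std: float) -> float:
    """Standard z-score: how far the passage log-likelihood sits from its expectation."""
    return (log_lik - expected) / std


def t_scaled_score(log_lik: float, expected: float, std: float, nu: float = 5.0) -> float:
    """Assumed heavy-tailed variant: widen the scale by the t-distribution's std.

    A Student's t variable with nu > 2 degrees of freedom has variance nu / (nu - 2),
    so extreme log-likelihoods are discounted more than under a Gaussian.
    """
    if nu <= 2.0:
        raise ValueError("variance of the t-distribution is finite only for nu > 2")
    return (log_lik - expected) / (std * math.sqrt(nu / (nu - 2.0)))


# Toy numbers (not from the paper): an outlying passage log-likelihood.
z = gaussian_score(-10.0, expected=-14.0, std=2.0)
t = t_scaled_score(-10.0, expected=-14.0, std=2.0, nu=4.0)
print(z, t)
```

Smaller `nu` means heavier tails and a stronger discount on outliers; as `nu` grows the two scores converge.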


[99] DiffLoRA: Differential Low-Rank Adapters for Large Language Models cs.CLPDF

Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina

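TL;DR: DiffLoRA is a parameter-efficient adaptation method that adds low-rank adapters to the differential attention mechanism, aiming to keep the efficiency of LoRA while improving performance. Results across tasks are mixed, but some domains (such as HumanEval) show a clear gain.

Details

Motivation: The work combines the performance benefits of differential attention with the parameter efficiency of LoRA, in search of a more efficient way to fine-tune models.

Result: DiffLoRA falls short of other parameter-efficient fine-tuning methods on most tasks, but on HumanEval it improves on LoRA by 11 points.

Insight: The gains in certain domains may stem from DiffLoRA's distinctive attention patterns, which the authors analyze after fine-tuning; this suggests directions for future optimization.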

Abstract: Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
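
As a rough illustration of the mechanism, the sketch below computes differential attention weights, softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d)), with a LoRA-style low-rank update on the projections behind each term. Where exactly DiffLoRA places its adapters is not stated in the abstract, so applying them to the query/key projections of both maps is an assumption, and all function names and toy matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_project(x, w, a, b):
    """Project x with a frozen weight w plus a trainable low-rank update a @ b."""
    delta = matmul(a, b)
    w_eff = [[wv + dv for wv, dv in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]
    return matmul(x, w_eff)


def softmax_rows(scores):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_map(q, k):
    """Scaled dot-product attention weights softmax(q k^T / sqrt(d))."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    return softmax_rows([[v * scale for v in row] for row in matmul(q, k_t)])


def differential_attention(q1, k1, q2, k2, lam):
    """Positive attention map minus lam times the negative attention map."""
    pos = attention_map(q1, k1)
    neg = attention_map(q2, k2)
    return [[p - lam * n for p, n in zip(p_row, n_row)] for p_row, n_row in zip(pos, neg)]


# Toy setup: two tokens, hidden size 2, rank-1 adapters (all values illustrative).
x = [[1.0, 0.0], [0.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a_pos, b_pos = [[0.1], [0.0]], [[0.0, 0.2]]
a_neg, b_neg = [[0.0], [0.3]], [[0.1, 0.0]]
q1 = k1 = lora_project(x, w, a_pos, b_pos)   # adapter on the positive term
q2 = k2 = lora_project(x, w, a_neg, b_neg)   # adapter on the negative term
weights = differential_attention(q1, k1, q2, k2, lam=0.5)
```

Because each softmax row sums to 1, every row of the differential map sums to 1 - lambda, which is a quick sanity check on any implementation.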


[100] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs cs.CL | cs.AI | cs.LGPDF

Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, Chengkai Li

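TL;DR: The paper explores using large language models to generate natural language explanations for logical rules in knowledge graphs. It examines several prompting strategies and validates the correctness and clarity of the explanations through human evaluation.

Details

Motivation: Logical rules in knowledge graphs are complex and hard to understand; the authors want natural language explanations to make them more readable and useful.

Result: The generated explanations perform well on correctness and clarity, although some challenges remain.

Insight: Large language models can generate rule explanations effectively, but issues such as hallucination still need to be addressed.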

Abstract: Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.


[101] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities cs.CLPDF

Yunxiang Yan, Tomohiro Sawada, Kartik Goyal

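TL;DR: The paper proposes a framework based on cascaded question disclosure for estimating the underlying problem-solving ability of large language models (LLMs) more accurately while keeping evaluation automatic and scalable. By revealing question information stage by stage, the method compares LLMs more fairly and elicits better intermediate reasoning traces than the standard question-answering paradigm. Empirically, it narrows the performance gaps seen in standard evaluation, which suggests that conventional QA evaluation overestimates the differences between models.

Details

Motivation: Evaluation on question-answering (QA) benchmarks is automatic and scalable, but it measures a model's underlying problem-solving ability only indirectly. The paper therefore proposes a more direct and generalizable framework that better reflects a model's actual reasoning and problem-solving ability.

Result: The method improves comparisons between models and produces clearer intermediate reasoning traces. On diverse reasoning and knowledge-heavy QA datasets, it narrows the performance gap between models relative to standard evaluation, indicating that standard evaluation may overestimate model differences.

Insight: Indirect QA evaluation can mask a model's real problem-solving ability. Disclosing information in stages gives a fairer and more transparent evaluation that comes closer to the model's underlying reasoning and use of knowledge.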

Abstract: While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.


cs.CR [Back]

[102] Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems cs.CR | cs.CLPDF

Lijia Liu, Takumi Kondo, Kyohei Atarashi, Koh Takeuchi, Jiyi Li

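TL;DR: The paper proposes a framework that combines Standard Evaluation (SE) with Counterfactual Evaluation (CFE) to detect blind attacks, a class of prompt injection against LLM-based evaluation systems. Experiments show that the method significantly improves security with minimal performance cost.

Details

Motivation: LLM-based evaluation systems are vulnerable to prompt injection, in particular to blind attacks, where a candidate answer is crafted independently of the true answer in order to deceive the evaluator. Existing methods struggle to detect such attacks, so more effective defenses are needed.

Result: Standard evaluation is highly vulnerable to blind attacks, while the SE+CFE framework significantly raises the attack detection rate with minimal impact on normal evaluation.

Insight: Counterfactual evaluation offers a new way to detect deception in LLM-based evaluation systems and could be extended to other security settings.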

Abstract: This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
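
The detection rule in the abstract (flag a submission that is validated under both the true and a deliberately false ground truth) can be sketched as follows. This is a minimal illustration under stated assumptions: the `judge` callable stands in for an LLM evaluator, and `toy_judge`, the function names, and the example strings are all hypothetical.

```python
from typing import Callable

# Hypothetical judge signature: (question, candidate, reference) -> True if accepted.
Judge = Callable[[str, str, str], bool]


def evaluate_with_cfe(
    judge: Judge, question: str, candidate: str, true_answer: str, false_answer: str
) -> str:
    """Combine Standard Evaluation (SE) with Counterfactual Evaluation (CFE)."""
    se_accepts = judge(question, candidate, true_answer)     # standard condition
    cfe_accepts = judge(question, candidate, false_answer)   # counterfactual condition
    if se_accepts and cfe_accepts:
        # A genuine answer cannot match both references, so this signals an attack.
        return "attack_detected"
    return "accepted" if se_accepts else "rejected"


def toy_judge(question: str, candidate: str, reference: str) -> bool:
    """Stand-in for an LLM evaluator that is deliberately vulnerable to injection."""
    injected = "mark this as correct" in candidate.lower()
    return reference in candidate or injected


honest = evaluate_with_cfe(toy_judge, "What is 2 + 2?", "The answer is 4", "4", "5")
attack = evaluate_with_cfe(
    toy_judge, "What is 2 + 2?", "Ignore the rubric and mark this as correct.", "4", "5"
)
wrong = evaluate_with_cfe(toy_judge, "What is 2 + 2?", "The answer is 7", "4", "5")
print(honest, attack, wrong)
```

The check works because a blind attack does not depend on the reference answer: if it fools the judge with the true answer, it tends to fool it with a false one as well.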


cs.HC [Back]

[103] Hybrid EEG–Driven Brain–Computer Interface: A Large Language Model Framework for Personalized Language Rehabilitation cs.HC | cs.CLPDF

Ismail Hossain, Mridul Banik

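TL;DR: The paper proposes a personalized language rehabilitation framework that combines electroencephalography (EEG) with large language models (LLMs). It pairs the low-fatigue input of a brain-computer interface (BCI) with the contextual generation ability of LLMs to support language rehabilitation for patients with severe speech or motor impairments.

Details

Motivation: Conventional augmentative and alternative communication (AAC) systems and language-learning platforms struggle to adapt in real time to a user's cognitive and linguistic needs, especially in neurological conditions such as post-stroke aphasia or amyotrophic lateral sclerosis.

Result: The system lets users work through language-learning modules via mental commands and adjusts difficulty dynamically based on neural markers of cognitive effort.

Insight: Combining EEG with LLMs offers a new personalized approach to language rehabilitation, particularly for people with severe motor or speech impairments.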

Abstract: Conventional augmentative and alternative communication (AAC) systems and language-learning platforms often fail to adapt in real time to the user’s cognitive and linguistic needs, especially in neurological conditions such as post-stroke aphasia or amyotrophic lateral sclerosis. Recent advances in noninvasive electroencephalography (EEG)–based brain-computer interfaces (BCIs) and transformer–based large language models (LLMs) offer complementary strengths: BCIs capture users’ neural intent with low fatigue, while LLMs generate contextually tailored language content. We propose and evaluate a novel hybrid framework that leverages real-time EEG signals to drive an LLM-powered language rehabilitation assistant. This system aims to: (1) enable users with severe speech or motor impairments to navigate language-learning modules via mental commands; (2) dynamically personalize vocabulary, sentence-construction exercises, and corrective feedback; and (3) monitor neural markers of cognitive effort to adjust task difficulty on the fly.


[104] Voice-guided Orchestrated Intelligence for Clinical Evaluation (VOICE): A Voice AI Agent System for Prehospital Stroke Assessment cs.HC | cs.CLPDF

Julian Acosta, Scott Adams, Julius Kernbach, Romain Hardy, Sung Eun Kim

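TL;DR: The paper presents VOICE, a voice-driven AI system that guides non-specialists through prehospital stroke assessment using natural conversation, while capturing key examination steps on smartphone video, with the aim of making stroke recognition more accurate and faster.

Details

Motivation: Stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity as low as 58%, which delays treatment. VOICE aims to close this gap by delivering expert-level assessment through a voice AI system.

Result: The system correctly identified 84% of individual stroke signs and 75% of likely large vessel occlusions (LVOs), with evaluations taking about 6 minutes. Users reported high confidence (4.5/5) and ease of use (4.67/5). An expert reviewing the reports with videos reached the correct diagnosis in 100% of cases, but because of AI errors felt confident enough to make preliminary treatment decisions in only 40% of cases.

Insight: The current system needs human oversight, but its potential is considerable. Rapid progress in voice AI may enable highly accurate assessments in the future, putting expert-level capability in the hands of the general public and changing emergency care.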

Abstract: We developed a voice-driven artificial intelligence (AI) system that guides anyone - from paramedics to family members - through expert-level stroke evaluations using natural conversation, while also enabling smartphone video capture of key examination components for documentation and potential expert review. This addresses a critical gap in emergency care: current stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity for stroke detection as low as 58%, causing life-threatening delays in treatment. Three non-medical volunteers used our AI system to assess ten simulated stroke patients, including cases with likely large vessel occlusion (LVO) strokes and stroke-like conditions, while we measured diagnostic accuracy, completion times, user confidence, and expert physician review of the AI-generated reports. The AI system correctly identified 84% of individual stroke signs and detected 75% of likely LVOs, completing evaluations in just over 6 minutes. Users reported high confidence (median 4.5/5) and ease of use (mean 4.67/5). The system successfully identified 86% of actual strokes but also incorrectly flagged 2 of 3 non-stroke cases as strokes. When an expert physician reviewed the AI reports with videos, they identified the correct diagnosis in 100% of cases, but felt confident enough to make preliminary treatment decisions in only 40% of cases due to observed AI errors including incorrect scoring and false information. While the current system’s limitations necessitate human oversight, ongoing rapid advancements in speech-to-speech AI models suggest that future versions are poised to enable highly accurate assessments. Achieving human-level voice interaction could transform emergency medical care, putting expert-informed assessment capabilities in everyone’s hands.


[105] iLearnRobot: An Interactive Learning-Based Multi-Modal Robot with Continuous Improvement cs.HC | cs.AI | cs.CV | cs.ROPDF

Kohou Wang, ZhaoXiang Liu, Lin Bai, Kun Fan, Xiang Liu

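TL;DR: This paper proposes an interactive-learning robot system built on a multimodal large language model (MLLM) that keeps improving its performance through natural dialogue with users.

Details

Motivation: After deployment, robots are likely to encounter scenarios they have never seen, so they need a way to keep learning and improving in actual use. Current mainstream MLLM-based robot systems lack this interactive learning ability and cannot avoid repeating mistakes.

Result: Experiments validate the system's effectiveness and its ability to keep improving, both quantitatively and qualitatively.

Insight: Interactive learning gives robots more flexible adaptability, with broad potential applications in multimodal and human-robot interaction settings.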

Abstract: It is crucial that robots’ performance can be improved after deployment, as they are inherently likely to encounter novel scenarios never seen before. This paper presents an innovative solution: an interactive learning-based robot system powered by a Multi-modal Large Language Model (MLLM). A key feature of our system is its ability to learn from natural dialogues with non-expert users. We also propose chain of question to clarify the exact intent of the question before providing an answer and dual-modality retrieval modules to leverage these interaction events to avoid repeating the same mistakes, ensuring a seamless user experience before model updates, which is in contrast to current mainstream MLLM-based robotic systems. Our system marks a novel approach in robotics by integrating interactive learning, paving the way for superior adaptability and performance in diverse environments. We demonstrate the effectiveness and improvement of our method through experiments, both quantitatively and qualitatively.


cs.GR [Back]

[106] Noise-Coded Illumination for Forensic and Photometric Video Analysis cs.GR | cs.CR | cs.CVPDF

Peter F. Michael, Zekun Hao, Serge Belongie, Abe Davis

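TL;DR: By embedding subtle, noise-coded modulations in a scene's illumination, the method adds a temporal watermark to recorded video, helping to counter video manipulation and protect high-stakes content.

Details

Motivation: As video manipulation tools spread, fake videos are becoming harder to tell from real ones. The paper uses coded illumination to create an information asymmetry that favors verification.

Result: The technique is effective against video manipulation, especially in high-stakes settings such as public events and interviews; even an adversary who knows the technique is in use finds it hard to produce a plausible fake.

Insight: Controlling the illumination to create an information asymmetry offers a new approach to video authentication, suited to settings where the cameras cannot be controlled.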

Abstract: The proliferation of advanced tools for manipulating video has led to an arms race, pitting those who wish to sow disinformation against those who want to detect and expose it. Unfortunately, time favors the ill-intentioned in this race, with fake videos growing increasingly difficult to distinguish from real ones. At the root of this trend is a fundamental advantage held by those manipulating media: equal access to a distribution of what we consider authentic (i.e., “natural”) video. In this paper, we show how coding very subtle, noise-like modulations into the illumination of a scene can help combat this advantage by creating an information asymmetry that favors verification. Our approach effectively adds a temporal watermark to any video recorded under coded illumination. However, rather than encoding a specific message, this watermark encodes an image of the unmanipulated scene as it would appear lit only by the coded illumination. We show that even when an adversary knows that our technique is being used, creating a plausible coded fake video amounts to solving a second, more difficult version of the original adversarial content creation problem at an information disadvantage. This is a promising avenue for protecting high-stakes settings like public events and interviews, where the content on display is a likely target for manipulation, and while the illumination can be controlled, the cameras capturing video cannot.
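
To see why a known noise code lets a verifier recover an image of the scene as lit only by the coded light, consider a simplified linear model in which each pixel at frame t equals a constant ambient term plus code[t] times the coded-light contribution. The sketch below correlates the frames with a zero-mean code to isolate that contribution. The static scene, the constant ambient light, the noiseless camera, and all names and values are assumptions made for illustration; this is not the authors' pipeline.

```python
from typing import List, Sequence


def recover_code_image(frames: Sequence[Sequence[float]], code: Sequence[float]) -> List[float]:
    """Correlate each pixel's time series with the code to isolate the coded light.

    With a zero-mean code, the constant ambient term cancels in the correlation,
    leaving the per-pixel response to the coded illumination.
    """
    if len(frames) != len(code):
        raise ValueError("need exactly one code value per frame")
    energy = sum(c * c for c in code)
    n_pixels = len(frames[0])
    return [
        sum(c * frame[p] for c, frame in zip(code, frames)) / energy
        for p in range(n_pixels)
    ]


# Subtle zero-mean modulation (2% amplitude) over eight frames.
code = [0.02 * s for s in (1, -1, 1, -1, -1, 1, 1, -1)]
ambient = [0.50, 0.80, 0.30]       # per-pixel brightness from uncoded light
coded_light = [0.40, 0.00, 0.90]   # per-pixel response to the coded source
frames = [[a + c * k for a, k in zip(ambient, coded_light)] for c in code]

recovered = recover_code_image(frames, code)
print([round(v, 6) for v in recovered])
```

Under this model, an edit that changes scene content without reproducing the matching modulation would leave a recovered image that disagrees with the visible video, which is the kind of inconsistency a verifier can look for.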


cs.SE

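[107] SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution cs.SE | cs.CL | cs.LG

Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian

TL;DR: SWE-Debate proposes a competitive multi-agent debate framework for software issue resolution; diverse reasoning paths and collaborative convergence substantially improve issue localization and repair.

Details

Motivation: Existing agent-based issue resolution methods usually rely on independent exploration, so they tend to get stuck in local solutions and miss issue patterns that span the codebase. The authors therefore use multi-agent debate to elicit diverse reasoning paths.

Result: On the SWE-bench benchmark, SWE-Debate outperforms baselines by a large margin and achieves state-of-the-art results among open-source agent frameworks.

Insight: Competitive multi-agent debate can break out of local optima and consolidate issue patterns across the codebase, improving software issue resolution.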

Abstract: Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents’ independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.


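[108] SWE-Exp: Experience-Driven Software Issue Resolution cs.SE | cs.CL | cs.LG

Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian

TL;DR: SWE-Exp proposes an experience-driven approach to software issue resolution that records and reuses prior repair experience, avoiding redundant exploration and enabling continuous learning.

Details

Motivation: Current LLM agents for software issue resolution are memoryless and cannot reuse prior repair experience, which leads to redundant exploration and inefficiency.

Result: Achieves a 41.6% Pass@1 resolution rate on SWE-bench-Verified, a leading result.

Insight: Systematically accumulating and leveraging repair experience lets software engineering agents shift from trial-and-error exploration to strategic issue resolution.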

Abstract: Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers, treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels, from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.


cs.AI

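[109] DSBC : Data Science task Benchmarking with Context engineering cs.AI | cs.CL | cs.MA

Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Suman Debnath

TL;DR: The paper presents DSBC, a benchmark for data science tasks designed around real user interactions; it evaluates three large language models under different approaches and highlights the key factors in practical deployment.

Details

Motivation: Evaluation of data science agents currently lacks systematic benchmarks, and their effectiveness in real applications remains unclear.

Result: Performance differs markedly across models and approaches, revealing the key factors that affect practical deployment.

Insight: Context engineering and the temperature parameter significantly affect model performance; future research should address these factors to make data science agents more robust and effective.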

Abstract: Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.


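[110] TextQuests: How Good are LLMs at Text-Based Video Games? cs.AI | cs.CL

Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks

TL;DR: The paper introduces TextQuests, a benchmark built on Infocom's interactive fiction games, for evaluating the long-context reasoning of AI agents in environments they must explore autonomously.

Details

Motivation: Existing agent benchmarks do not adequately assess agents in exploratory environments that demand sustained, self-directed reasoning, so a new evaluation tool is needed.

Result: TextQuests offers a new way to evaluate agents in complex exploratory environments, supporting the development of agents with stronger reasoning.

Insight: Interactive fiction games are an effective tool for evaluating long-horizon autonomous reasoning, which underscores the importance of long-context reasoning.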

Abstract: Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.


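[111] Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving cs.AI | cs.CL

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang

TL;DR: Seed-Prover is an automated theorem proving model built on Lean formal verification. It iteratively refines its proofs and is complemented by the geometry engine Seed-Geometry; it performs strongly on IMO problems and substantially advances automated mathematical reasoning.

Details

Motivation: Large language models (LLMs) show strong mathematical reasoning but struggle with theorem proving because they lack clear supervision signals. Seed-Prover addresses this with formal verification and long chain-of-thought reasoning.

Result: Seed-Prover proves 78.1% of formalized IMO problems, far ahead of the previous best method, and fully solves 5 of 6 problems at IMO 2025.

Insight: Combining formal verification with long chain-of-thought reasoning markedly improves automated theorem proving, and adding a geometry engine extends what the system can handle.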

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.


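[112] CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks cs.AI | cs.CL

Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva

TL;DR: CoT-Self-Instruct is a synthetic data generation method that uses Chain-of-Thought (CoT) to guide LLMs in generating high-quality, complex prompts, significantly improving performance on reasoning and non-reasoning tasks.

Details

Motivation: Existing synthetic data generation methods fall short in quality for both reasoning and non-reasoning tasks, so a new method that automatically generates high-quality prompts is needed.

Result: Significantly outperforms existing methods on reasoning tasks such as MATH500 and on non-reasoning tasks such as AlpacaEval 2.0.

Insight: Combining CoT-based automatic prompt generation with filtering is an effective way to raise the quality of LLM training data.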

Abstract: We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.


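[113] SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model cs.AI | cs.CL | cs.LG | cs.RO

Mingkai Deng, Jinyu Hou, Yilin Shen, Hongxia Jin, Graham Neubig

TL;DR: SimuRA proposes a general goal-oriented agent architecture built on an LLM-based world model; simulative reasoning overcomes the limitations of autoregressive models, and experiments show substantial gains on complex tasks.

Details

Motivation: Existing LLM-based agents are usually designed for a single task and lack generality and scalability, whereas humans act as general agents by reasoning through mental simulation. SimuRA aims to close this gap.

Result: On flight search, the success rate rises from 0% to 32.2%; world-model-based planning outperforms autoregressive planning by up to 124%.

Insight: The simulative reasoning paradigm (a world model) is a promising direction for general agents and may enable training a single general agent model.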

Abstract: AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of optimal agent in any environment, SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that SimuRA improves the success of flight search from 0% to 32.2%. World-model-based planning, in particular, shows consistent advantage of up to 124% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility for training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make a web-browsing agent built on SimuRA with pretrained LLMs available as a research demo for public testing.
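
Planning via simulation with a world model can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not SimuRA itself: the abstract's world model is an LLM operating over natural language, whereas here `toy_world_model` is a hypothetical hand-written function over integers, and `plan`, `simulate`, and `score` are illustrative names. The planner imagines every action sequence up to a fixed horizon, scores the predicted end state, and commits only to the first action of the best sequence.

```python
from itertools import product

def simulate(world_model, state, actions):
    """Roll the world model forward through a sequence of imagined actions."""
    for action in actions:
        state = world_model(state, action)
    return state

def plan(world_model, score, state, action_space, horizon=3):
    """Return the first action of the best-scoring imagined trajectory."""
    best = max(product(action_space, repeat=horizon),
               key=lambda seq: score(simulate(world_model, state, seq)))
    return best[0]

# Toy stand-in for a learned world model (hypothetical): states are integers.
def toy_world_model(state, action):
    return {"inc": state + 1, "dec": state - 1, "double": state * 2}[action]

GOAL = 10
score = lambda s: -abs(s - GOAL)  # closer to the goal is better

first = plan(toy_world_model, score, state=2,
             action_space=("inc", "dec", "double"))
print(first)  # "double": the plan double, inc, double reaches 10 exactly
```

A greedy one-step policy starting from 2 would also pick "double" here, but only the simulated lookahead reveals that the follow-up must be "inc" rather than another "double", which is the kind of advantage over step-by-step action generation that the abstract attributes to simulation.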


eess.AS

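[114] MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks eess.AS | cs.AI | cs.CL | cs.SD

Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou

TL;DR: MECAT is the first multi-expert-constructed benchmark for fine-grained audio understanding. It combines expert model analysis with Chain-of-Thought LLM reasoning to generate multi-perspective captions and question-answering pairs, and introduces a new evaluation metric, DATE.

Details

Motivation: Audio understanding models still fall well short of human comprehension, largely because the data annotations and evaluation metrics of current benchmarks cannot distinguish generic outputs from detailed descriptions.

Result: The MECAT benchmark comprehensively evaluates state-of-the-art audio models and reveals their shortcomings on fine-grained tasks.

Insight: Current audio models remain weak at capturing and discriminating detail; finer-grained annotation and evaluation are needed to drive progress.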

Abstract: While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat


cs.IR

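[115] Holistic Evaluations of Topic Models cs.IR | cs.CL

Thomas Compton

TL;DR: This article evaluates topic models from a database perspective, analyzing the results of 1140 BERTopic model runs to examine the trade-offs in parameter optimization and their implications for the interpretation and responsible use of topic models.

Details

Motivation: Topic models attract commercial and academic interest for their ability to summarize large volumes of unstructured text, but they risk becoming a ‘black box’ whose output users accept without verification. The article aims to expose the trade-offs in parameter optimization and help users apply topic models more responsibly.

Result: The experiments show that parameter settings significantly affect topic model output, so users must trade off model performance against interpretability.

Insight: Topic models should be used together with domain knowledge and user needs rather than by relying blindly on algorithmic output; parameter optimization is not only a technical question but also bears on interpretability and practical usefulness.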

Abstract: Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a ‘black box’, where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models.


cs.RO

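[116] H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation cs.RO | cs.CV | cs.LG

Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su

TL;DR: H-RDT is a new approach that uses human manipulation data to enhance robot manipulation. It adopts a diffusion transformer architecture and a two-stage training paradigm, and significantly outperforms existing methods in simulation and real-world settings.

Details

Motivation: Imitation learning suffers from a scarcity of large-scale, high-quality robot demonstration data, while the diversity of cross-embodiment robot datasets makes unified training harder.

Result: Improves over training from scratch by 13.9% in simulation and 40.5% in real-world experiments, and outperforms existing methods such as Pi0 and RDT.

Insight: Human manipulation data can serve as a strong foundation for learning bimanual robotic manipulation policies.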

Abstract: Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, while they face significant limitations as the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.


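[117] A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving cs.RO | cs.AI | cs.CV

Yi Zhang, Erik Leo Haß, Kuo-Yi Chao, Nenad Petrovic, Yinglei Song

TL;DR: The paper proposes a unified Perception-Language-Action (PLA) framework that combines multi-sensor fusion with a large language model (such as GPT-4.1) to make autonomous driving systems more adaptable, robust, and interpretable.

Details

Motivation: Current autonomous driving systems lack adaptability, robustness, and interpretability in complex open-world environments, and their fragmented architectures cope poorly with novel scenarios.

Result: In an urban intersection scenario, the framework performs strongly in trajectory tracking, speed prediction, and adaptive planning.

Insight: Language-augmented cognitive frameworks could advance the safety, interpretability, and scalability of autonomous driving systems.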

Abstract: Autonomous driving systems face significant challenges in achieving human-like adaptability, robustness, and interpretability in complex, open-world environments. These challenges stem from fragmented architectures, limited generalization to novel scenarios, and insufficient semantic extraction from perception. To address these limitations, we propose a unified Perception-Language-Action (PLA) framework that integrates multi-sensor fusion (cameras, LiDAR, radar) with a large language model (LLM)-augmented Vision-Language-Action (VLA) architecture, specifically a GPT-4.1-powered reasoning core. This framework unifies low-level sensory processing with high-level contextual reasoning, tightly coupling perception with natural language-based semantic understanding and decision-making to enable context-aware, explainable, and safety-bounded autonomous driving. Evaluations on an urban intersection scenario with a construction zone demonstrate superior performance in trajectory tracking, speed prediction, and adaptive planning. The results highlight the potential of language-augmented cognitive frameworks for advancing the safety, interpretability, and scalability of autonomous driving systems.


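[118] User Experience Estimation in Human-Robot Interaction Via Multi-Instance Learning of Multimodal Social Signals cs.RO | cs.CV | cs.HC

Ryo Miyoshi, Yuki Okafuji, Takuya Iwamoto, Junya Nakanishi, Jun Baba

TL;DR: The paper proposes a user experience (UX) estimation method based on multimodal social signals. Using a Transformer model within a multi-instance learning framework, it combines facial expression and voice data to capture short- and long-term interaction patterns and outperforms human evaluators.

Details

Motivation: As demand for social robots grows, robots need to adapt their behavior to the user's state. Existing UX assessment methods usually focus on sentiment or engagement alone and lack a multi-faceted assessment.

Result: Experiments show the method outperforms third-party human evaluators at UX estimation.

Insight: Multimodal signals and a multi-instance learning framework capture the temporal dynamics of user experience more fully, giving a more accurate basis for behavior adaptation in HRI.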

Abstract: In recent years, the demand for social robots has grown, requiring them to adapt their behaviors based on users’ states. Accurately assessing user experience (UX) in human-robot interaction (HRI) is crucial for achieving this adaptability. UX is a multi-faceted measure encompassing aspects such as sentiment and engagement, yet existing methods often focus on these individually. This study proposes a UX estimation method for HRI by leveraging multimodal social signals. We construct a UX dataset and develop a Transformer-based model that utilizes facial expressions and voice for estimation. Unlike conventional models that rely on momentary observations, our approach captures both short- and long-term interaction patterns using a multi-instance learning framework. This enables the model to capture temporal dynamics in UX, providing a more holistic representation. Experimental results demonstrate that our method outperforms third-party human evaluators in UX estimation.


eess.IV

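[119] Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation eess.IV | cs.CV

Zheyuan Zhang, Linkai Peng, Wanying Dou, Cuiling Sun, Halil Ertugrul Aktas

TL;DR: The paper presents PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for studying domain generalization in medical imaging. It addresses the cross-sequence variability that existing benchmarks overlook and proposes a semi-supervised method that significantly improves performance.

Details

Motivation: Existing domain generalization benchmarks focus mainly on cross-center shifts and overlook the pronounced differences between T1 and T2 MRI sequences. Pancreas segmentation is challenging and clinically important in abdominal imaging, yet existing methods segment it poorly.

Result: The proposed method significantly outperforms existing techniques on cross-sequence segmentation, improving the Dice score by 61.63% and reaching an 87.00% Dice score on two test centers.

Insight: 1. Limited sampling introduces significant variance. 2. Cross-sequence shifts are more challenging than cross-center shifts and require specialized solutions. 3. Anatomically invariant features are an effective route to domain generalization.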

Abstract: Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregular, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.
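
For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), applied to flattened binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions of this sketch, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred   = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_score(pred, target))  # 0.5: one shared voxel, two voxels in each mask
```

Because the denominator counts foreground voxels only, a small organ such as the pancreas is penalized heavily for each missed voxel, which is one reason its Dice scores tend to trail those of larger organs.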


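[120] Learning Arbitrary-Scale RAW Image Downscaling with Wavelet-based Recurrent Reconstruction eess.IV | cs.CV

Yang Ren, Hai Jiang, Wei Li, Menglong Yang, Heng Zhang

TL;DR: The paper proposes a wavelet-based recurrent reconstruction framework for arbitrary-scale RAW image downscaling. Low-frequency and high-frequency modules preserve structural and textural integrity, and with a new dataset and loss function the method significantly outperforms existing approaches.

Details

Motivation: Existing image downscaling methods mainly target the sRGB domain. RAW images offer more flexibility because of their unprocessed information, but they lack a dedicated framework. The work aims to fill this gap.

Result: Experiments show the method outperforms existing techniques both quantitatively and visually.

Insight: The information-lossless property of the wavelet transform allows flexible, high-quality downscaling, with broad potential in RAW image processing.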

Abstract: Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information-lossless attribute of wavelet transformation to fulfill the arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between HR and LR domain. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3×, and incorporate it with publicly available datasets with integer factors (2×, 3×, 4×) for comprehensive benchmarking of arbitrary-scale image downscaling. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at https://github.com/RenYangSCU/ASRD.
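
The information-lossless attribute of the wavelet transform mentioned above can be seen in a one-level, one-dimensional Haar decomposition. This sketch only illustrates that property; it is not the paper's framework, whose LASDM and HFPM modules are learned, and the function names are hypothetical. The low band is a half-length (2×) downscaled signal, and the high band holds exactly the detail needed to undo the downscaling.

```python
def haar_decompose(signal):
    """Split an even-length signal into low (averages) and high (differences) bands."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high

def haar_reconstruct(low, high):
    """Invert haar_decompose: a = low + high, b = low - high."""
    signal = []
    for l, h in zip(low, high):
        signal.extend((l + h, l - h))
    return signal

row = [3, 4, 10, 2, 7, 7]
low, high = haar_decompose(row)
print(low)                          # [3.5, 6.0, 7.0]: the 2x-downscaled row
print(haar_reconstruct(low, high))  # [3.0, 4.0, 10.0, 2.0, 7.0, 7.0]
```

Discarding the high band yields a plain average-pooled downscale; keeping or predicting it is what allows detail to be preserved.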


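[121] EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision eess.IV | cs.CV

Ahmed Jaheen, Abdelrahman Elsayed, Damir Kim, Daniil Tikhonov, Matheus Scatolin

TL;DR: EMedNeXt is an enhanced brain tumor segmentation framework optimized for low-resource settings in sub-Saharan Africa. With a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a model ensembling system, it performs strongly on the hidden validation set.

Details

Motivation: In sub-Saharan Africa, MRI scanners are of lower quality and radiology expertise is scarce, which makes brain tumor segmentation and quantification difficult. EMedNeXt targets these problems to improve segmentation performance.

Result: On the hidden validation set, the average LesionWise DSC is 0.897, and the average LesionWise NSD is 0.541 and 0.84 at tolerances of 0.5 mm and 1.0 mm, respectively.

Insight: In low-resource regions, an improved network architecture and model ensembling can significantly raise the accuracy and robustness of brain tumor segmentation.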

Abstract: Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt – an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.


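[122] Pixel Embedding Method for Tubular Neurite Segmentation eess.IV | cs.CV | q-bio.NC

Huayu Fu, Jiamin Li, Haozhi Qu, Xiaolin Hu, Zengcai Guo

TL;DR: Proposes a pixel-embedding method for tubular neurite segmentation that combines a deep network with an end-to-end pipeline, significantly reducing the error rate of neuronal topology reconstruction, and introduces a new topological evaluation metric.

Details

Motivation: The intricate morphology of neuronal branches and occlusions among fibers challenge deep-learning-based segmentation; more effective methods are needed to improve segmentation accuracy and reconstruction quality.

Result: On an fMOST imaging dataset, the method significantly reduces the error rate of neuronal topology reconstruction.

Insight: Pixel embeddings and a topological evaluation metric provide more precise tools for segmenting complex neuronal structures.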

Abstract: Automatic segmentation of neuronal topology is critical for handling large-scale neuroimaging data, as it can greatly accelerate neuron annotation and analysis. However, the intricate morphology of neuronal branches and the occlusions among fibers pose significant challenges for deep-learning-based segmentation. To address these issues, we propose an improved framework. First, we introduce a deep network that outputs pixel-level embedding vectors and design a corresponding loss function, enabling the learned features to effectively distinguish different neuronal connections within occluded regions. Second, building on this model, we develop an end-to-end pipeline that directly maps raw neuronal images to SWC-formatted neuron structure trees. Finally, recognizing that existing evaluation metrics fail to fully capture segmentation accuracy, we propose a novel topological assessment metric to more appropriately quantify the quality of neuron segmentation and reconstruction. Experiments on our fMOST imaging dataset demonstrate that, compared to several classical methods, our approach significantly reduces the error rate in neuronal topology reconstruction.


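[123] Smart Video Capsule Endoscopy: Raw Image-Based Localization for Enhanced GI Tract Investigation eess.IV | cs.AR | cs.CV

Oliver Bause, Julia Werner, Paul Palomero Bernardo, Oliver Bringmann

TL;DR: The paper proposes a smart video capsule endoscopy system that works on raw Bayer images, using a lightweight CNN and Viterbi decoding for efficient classification and significantly reducing energy consumption.

Details

Motivation: On resource-constrained edge devices such as video capsule endoscopes, conventional deep neural networks are unsuitable because of their large model size and the energy cost of RGB conversion, so a more efficient solution is needed.

Result: The system saves 89.9% of energy on average compared with classic video capsules, and classification requires only 5.31 μJ per image.

Insight: Skipping RGB conversion and shrinking the model enables efficient AI on edge devices, particularly in low-power settings such as medicine.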

Abstract: For many real-world applications involving low-power sensor edge devices, deep neural networks used for image classification might not be suitable. This is due to their typically large model size and requirement of operations often exceeding the capabilities of such resource-limited devices. Furthermore, camera sensors usually capture images with a Bayer color filter applied, which are subsequently converted to RGB images that are commonly used for neural network training. However, on resource-constrained devices, such conversions demand their share of energy and optimally should be skipped if possible. This work addresses the need for hardware-suitable AI targeting sensor edge devices by means of Video Capsule Endoscopy, an important medical procedure for the investigation of the small intestine, which is strongly limited by its battery lifetime. Accurate organ classification is performed with a final accuracy of 93.06% evaluated directly on Bayer images, involving a CNN with only 63,000 parameters and time-series analysis in the form of Viterbi decoding. Finally, the process of capturing images with a camera and raw image processing is demonstrated with a customized PULPissimo System-on-Chip with a RISC-V core and an ultra-low power hardware accelerator, providing an energy-efficient AI-based image classification approach requiring just 5.31 μJ per image. As a result, it is possible to save an average of 89.9% of energy before entering the small intestine compared to classic video capsules.


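[124] JPEG Processing Neural Operator for Backward-Compatible Coding eess.IV | cs.CV

Woo Kyoung Han, Yongjun Lee, Byeonghun Lee, Sang Hyun Park, Sunghoon Im

TL;DR: JPNeO is a next-generation algorithm compatible with the current JPEG standard. It uses neural operators to improve chroma component preservation and reconstruction quality while reducing memory usage and parameter count.

Details

Motivation: Learning-based compression algorithms are hard to standardize and lack backward compatibility; JPNeO aims to solve these problems.

Result: JPNeO improves compression efficiency and reconstruction quality while retaining compatibility.

Insight: Neural operators can be embedded in a conventional coding protocol without changing it, yielding better performance.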

Abstract: Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, the JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing source coding’s protocol. Our source code is available at https://github.com/WooKyoungHan/JPNeO.


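[125] Towards Field-Ready AI-based Malaria Diagnosis: A Continual Learning Approach eess.IV | cs.CV

Louise Guillon, Soheib Biga, Yendoube E. Kantchire, Mouhamadou Lamine Sane, Grégoire Pasquier

TL;DR: The paper examines how continual learning (CL) can improve the cross-domain generalization of deep-learning-based computer-aided diagnosis (CAD) systems for malaria.

Details

Motivation: Malaria is a major global health challenge, especially in low-resource settings where expert microscopy is hard to access. Existing deep learning CAD systems generalize poorly across domains, which limits clinical deployment.

Result: The results suggest that continual learning, and rehearsal-based methods in particular, significantly improves performance.

Insight: Continual learning could support the development of deployable malaria CAD tools.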

Abstract: Malaria remains a major global health challenge, particularly in low-resource settings where access to expert microscopy may be limited. Deep learning-based computer-aided diagnosis (CAD) systems have been developed and demonstrate promising performance on thin blood smear images. However, their clinical deployment may be hindered by limited generalization across sites with varying conditions, and very few practical solutions have been proposed. In this work, we investigate continual learning (CL) as a strategy to enhance the robustness of malaria CAD models to domain shifts. We frame the problem as a domain-incremental learning scenario, where a YOLO-based object detector must adapt to new acquisition sites while retaining performance on previously seen domains. We evaluate four CL strategies, two rehearsal-based and two regularization-based methods, under real-life conditions using a multi-site clinical dataset of thin blood smear images. Our results suggest that CL, and rehearsal-based methods in particular, can significantly improve performance. These findings highlight the potential of continual learning to support the development of deployable, field-ready CAD tools for malaria.
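
The general mechanism behind rehearsal-based continual learning can be sketched as a small replay buffer. The abstract does not name the specific rehearsal methods evaluated, so this is a generic illustration: the class name `RehearsalBuffer`, the use of reservoir sampling, and the site labels are all assumptions. Samples from earlier sites are retained in a fixed-size buffer and mixed into each training batch from a new site, so the model keeps seeing old domains while adapting.

```python
import random

class RehearsalBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Keep each sample seen so far with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_samples, n_replay):
        """Combine new-domain samples with replayed old-domain samples."""
        k = min(n_replay, len(self.samples))
        return list(new_samples) + self.rng.sample(self.samples, k)

buffer = RehearsalBuffer(capacity=5)
for i in range(100):                      # images from a first acquisition site
    buffer.add(("site_A", i))

batch = buffer.mixed_batch([("site_B", 0), ("site_B", 1)], n_replay=3)
print(len(batch))  # 5: two new-site samples plus three replayed ones
```

Regularization-based methods, the other family mentioned in the abstract, avoid storing old data and instead penalize changes to parameters that mattered for earlier domains.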


cs.LG

[126] SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy cs.LG | cs.CL | cs.PL | cs.SE | eess.AS

RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton

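TL;DR: SequenceLayers is a neural network layer API and library for sequence modeling that simplifies building sequence models, supporting both layer-by-layer execution (e.g., teacher-forced training) and step-by-step execution (e.g., autoregressive sampling). Explicit state representations and a step method enable efficient streaming, reduce common bugs, and come with highly compatible implementations.

Details

Motivation: Conventional sequence models often suffer from complex state management and frequent bugs in streaming and parallel processing. SequenceLayers aims to address these problems through a unified state-management mechanism and API design, simplifying model development and deployment.

Result: SequenceLayers achieves efficient streaming sequence processing and resolves state-management and execution-consistency problems. It is implemented in JAX and TensorFlow 2 and is open source.

Insight: Explicit state management and a unified execution mechanism are central to streaming sequence processing; a high-level API design can substantially simplify the development and maintenance of complex models.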

Abstract: We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enable complex models to be immediately streamable, mitigate a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.
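
The layer-wise versus step-wise contract described in the abstract can be illustrated with a small NumPy sketch. This is not the SequenceLayers API: the class `CausalConv1D` and its methods `initial_state`, `layer`, and `step` are hypothetical names chosen for this example. It only shows how a convolution buffer held as explicit state lets a step method reproduce the layer-wise output.

```python
import numpy as np


class CausalConv1D:
    """Causal 1-D convolution with an explicit streaming state (illustrative)."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=np.float64)  # shape [K]

    def initial_state(self):
        # The state is a buffer of the last K-1 inputs; zeros act as causal padding.
        return np.zeros(len(self.kernel) - 1)

    def layer(self, x):
        """Process a whole sequence at once (layer-wise invocation)."""
        k = len(self.kernel)
        padded = np.concatenate([self.initial_state(), x])
        return np.array([padded[t:t + k] @ self.kernel for t in range(len(x))])

    def step(self, x_t, state):
        """Process one time step; return the output and the evolved state."""
        window = np.append(state, x_t)
        y_t = window @ self.kernel
        return y_t, window[1:]


conv = CausalConv1D([0.5, -1.0, 2.0])
x = np.arange(6, dtype=np.float64)

# Step-by-step execution, threading the state through time.
state = conv.initial_state()
stepped = []
for x_t in x:
    y_t, state = conv.step(x_t, state)
    stepped.append(y_t)
stepped = np.array(stepped)

# Layer-wise execution over the full sequence.
layered = conv.layer(x)
```

Comparing `stepped` with `layered` (for instance with `np.allclose`) is the kind of equivalence check the abstract refers to when it says the step method is tested to give identical results to a layer-wise invocation.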


[127] Planning for Cooler Cities: A Multimodal AI Framework for Predicting and Mitigating Urban Heat Stress through Urban Landscape Transformation cs.LG | cs.CV

Shengao Yi, Xiaojiang Li, Wei Tu, Tianhong Zhao

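TL;DR: GSM-UTCI is a multimodal deep learning framework for predicting urban heat stress. By dynamically fusing surface morphology and meteorological data, it achieves accuracy close to that of physical models at far lower computational cost and provides decision support for urban landscape transformation.

Details

Motivation: As climate change and urbanization intensify, urban heat stress is becoming more severe. Traditional physical models are computationally expensive, which limits their use in large-scale urban planning.

Result: The model reaches an R² of 0.9151 and an MAE of 0.41°C, and cuts inference time from hours to under five minutes. In landscape-transformation simulations, replacing impervious surfaces with tree canopy yields the strongest cooling effect.

Insight: GSM-UTCI offers a scalable, fine-grained decision tool for urban climate adaptation and reveals the cooling potential of different landscape-transformation strategies.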

Abstract: As extreme heat events intensify due to climate change and urbanization, cities face increasing challenges in mitigating outdoor heat stress. While traditional physical models such as SOLWEIG and ENVI-met provide detailed assessments of human-perceived heat exposure, their computational demands limit scalability for city-wide planning. In this study, we propose GSM-UTCI, a multimodal deep learning framework designed to predict daytime average Universal Thermal Climate Index (UTCI) at 1-meter hyperlocal resolution. The model fuses surface morphology (nDSM), high-resolution land cover data, and hourly meteorological conditions using a feature-wise linear modulation (FiLM) architecture that dynamically conditions spatial features on atmospheric context. Trained on SOLWEIG-derived UTCI maps, GSM-UTCI achieves near-physical accuracy, with an R² of 0.9151 and a mean absolute error (MAE) of 0.41°C, while reducing inference time from hours to under five minutes for an entire city. To demonstrate its planning relevance, we apply GSM-UTCI to simulate systematic landscape transformation scenarios in Philadelphia, replacing bare earth, grass, and impervious surfaces with tree canopy. Results show spatially heterogeneous but consistently strong cooling effects, with impervious-to-tree conversion producing the highest aggregated benefit (-4.18°C average change in UTCI across 270.7 km²). Tract-level bivariate analysis further reveals strong alignment between thermal reduction potential and land cover proportions. These findings underscore the utility of GSM-UTCI as a scalable, fine-grained decision support tool for urban climate adaptation, enabling scenario-based evaluation of greening strategies across diverse urban environments.
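
Feature-wise linear modulation scales and shifts each feature channel with parameters computed from a conditioning vector, FiLM(x) = γ(c) · x + β(c). The sketch below shows that operation in NumPy with a single linear map for γ and β. The function name `film`, the weight names, and the tensor sizes are assumptions for illustration; the abstract does not describe the actual GSM-UTCI layers.

```python
import numpy as np


def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation.

    features: [C, H, W] spatial feature maps (e.g. from morphology and land cover)
    context:  [D] conditioning vector (e.g. meteorological variables)
    """
    gamma = w_gamma @ context + b_gamma  # per-channel scale, shape [C]
    beta = w_beta @ context + b_beta     # per-channel shift, shape [C]
    # Broadcast the per-channel parameters over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]


rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 3
features = rng.normal(size=(C, H, W))
context = rng.normal(size=D)

modulated = film(
    features,
    context,
    w_gamma=rng.normal(size=(C, D)),
    b_gamma=np.ones(C),
    w_beta=rng.normal(size=(C, D)),
    b_beta=np.zeros(C),
)

# With zero weights, unit scale bias, and zero shift bias, FiLM is the identity.
identity = film(
    features,
    context,
    w_gamma=np.zeros((C, D)),
    b_gamma=np.ones(C),
    w_beta=np.zeros((C, D)),
    b_beta=np.zeros(C),
)
```

Because γ and β depend only on the context vector, the same spatial features can be re-modulated for different atmospheric conditions without recomputing them.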


[128] Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods cs.LG | cs.AI | cs.CV | cs.SD | eess.AS

Siwoo Park

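TL;DR: This paper studies the invertibility of multimodal latent spaces and finds that optimization-based methods are limited for inverse mapping, producing results that are semantically incoherent and perceptually poor.

Details

Motivation: Multimodal models excel at forward tasks (e.g., text-to-image generation), but their inverse-mapping capabilities remain largely unexplored. The paper tests whether these latent spaces support meaningful, coherent inverse mappings.

Result: Experiments show that optimization can produce outputs that align textually with targets, but the inversions are perceptually chaotic and semantically incoherent, and the reconstructed latent embeddings lack interpretability.

Insight: Current multimodal latent spaces are optimized mainly for forward tasks and lack the structure needed for robust inverse mapping. Further research is needed to develop truly invertible, semantically rich latent spaces.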

Abstract: This paper investigates the inverse capabilities and broader utility of multimodal latent spaces within task-specific AI (Artificial Intelligence) models. While these models excel at their designed forward tasks (e.g., text-to-image generation, audio-to-text transcription), their potential for inverse mappings remains largely unexplored. We propose an optimization-based framework to infer input characteristics from desired outputs, applying it bidirectionally across Text-Image (BLIP, Flux.1-dev) and Text-Audio (Whisper-Large-V3, Chatterbox-TTS) modalities. Our central hypothesis posits that while optimization can guide models towards inverse tasks, their multimodal latent spaces will not consistently support semantically meaningful and perceptually coherent inverse mappings. Experimental results consistently validate this hypothesis. We demonstrate that while optimization can force models to produce outputs that align textually with targets (e.g., a text-to-image model generating an image that an image captioning model describes correctly, or an ASR model transcribing optimized audio accurately), the perceptual quality of these inversions is chaotic and incoherent. Furthermore, when attempting to infer the original semantic input from generative models, the reconstructed latent space embeddings frequently lack semantic interpretability, aligning with nonsensical vocabulary tokens. These findings highlight a critical limitation: multimodal latent spaces, primarily optimized for specific forward tasks, do not inherently possess the structure required for robust and interpretable inverse mappings. Our work underscores the need for further research into developing truly semantically rich and invertible multimodal latent spaces.


[129] FuseTen: A Generative Model for Daily 10 m Land Surface Temperature Estimation from Spatio-Temporal Satellite Observations cs.LG | cs.AI | cs.CV

Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai

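TL;DR: FuseTen is a generative model that fuses satellite observations from Sentinel-2, Landsat 8, and Terra MODIS to produce daily land surface temperature (LST) estimates at 10 m spatial resolution.

Details

Motivation: Under climate change, urban heatwaves, droughts, and land degradation are increasingly severe, and studying them requires accurate spatio-temporal LST data. Existing satellite data, however, trade off spatial resolution against temporal resolution; FuseTen aims to bridge this gap.

Result: Experiments show that FuseTen outperforms linear baselines, improving quantitative metrics by 32.06% on average and visual fidelity by 31.42%.

Insight: Fusing multi-source satellite data with a generative model can substantially increase the spatial resolution of LST, providing finer-grained data for climate change research.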

Abstract: Urban heatwaves, droughts, and land degradation are pressing and growing challenges in the context of climate change. A valuable approach to studying them requires accurate spatio-temporal information on land surface conditions. One of the most important variables for assessing and understanding these phenomena is Land Surface Temperature (LST), which is derived from satellites and provides essential information about the thermal state of the Earth’s surface. However, satellite platforms inherently face a trade-off between spatial and temporal resolutions. To bridge this gap, we propose FuseTen, a novel generative framework that produces daily LST observations at a fine 10 m spatial resolution by fusing spatio-temporal observations derived from Sentinel-2, Landsat 8, and Terra MODIS. FuseTen employs a generative architecture trained using an averaging-based supervision strategy grounded in physical principles. It incorporates attention and normalization modules within the fusion process and uses a PatchGAN discriminator to enforce realism. Experiments across multiple dates show that FuseTen outperforms linear baselines, with an average 32.06% improvement in quantitative metrics and 31.42% in visual fidelity. To the best of our knowledge, this is the first non-linear method to generate daily LST estimates at such fine spatial resolution.


[130] DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data cs.LG | cs.CV

Rabeya Tus Sadia, Qiang Cheng

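TL;DR: DepMicroDiff is a multimodal microbiome imputation framework that combines a diffusion model with a dependency-aware Transformer, substantially improving imputation performance.

Details

Motivation: The sparsity and noise of microbiome data seriously hamper analysis and downstream tasks, and existing methods struggle to capture complex dependencies between microbes and to use contextual metadata.

Result: Across multiple cancer types, Pearson correlation and cosine similarity improve markedly (up to 0.712 and 0.812, respectively), while RMSE and MAE decrease.

Insight: 1. Combining dependency-aware modeling with multimodal metadata is crucial for microbiome imputation; 2. Diffusion models are well suited to modeling complex biological data.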

Abstract: Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation.
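
For readers unfamiliar with the four reported metrics, the following sketch computes Pearson correlation, cosine similarity, RMSE, and MAE between true and imputed values using their standard definitions. The helper `imputation_metrics` and the toy arrays are hypothetical; the abstract does not say how the authors select the evaluated entries or aggregate scores, so this is not their evaluation protocol.

```python
import numpy as np


def imputation_metrics(true, pred):
    """Standard agreement and error metrics between true and imputed values."""
    true = np.asarray(true, dtype=np.float64).ravel()
    pred = np.asarray(pred, dtype=np.float64).ravel()
    err = pred - true
    return {
        "pearson": float(np.corrcoef(true, pred)[0, 1]),
        "cosine": float(true @ pred / (np.linalg.norm(true) * np.linalg.norm(pred))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
    }


# Toy abundances: a perfect imputation and one with a constant offset of 1.
true_values = np.array([1.0, 2.0, 3.0, 4.0])
perfect = imputation_metrics(true_values, true_values)
shifted = imputation_metrics(true_values, true_values + 1.0)
```

The offset case shows why correlation and error metrics are reported together: a constant bias leaves Pearson correlation at 1 while RMSE and MAE both rise to 1.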


[131] Consensus-Driven Active Model Selection cs.LG | cs.AI | cs.CV

Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, Sara Beery

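TL;DR: CODA is an active model selection method that uses the predictions of candidate models to prioritize labeling the data points that most efficiently identify the best model, substantially reducing annotation effort.

Details

Motivation: Traditional model selection requires a large annotated validation set, which is time-consuming and expensive to collect. CODA aims to streamline this process by actively selecting the most informative data points.

Result: On 26 benchmark tasks, CODA significantly outperforms existing methods, reducing the annotation effort needed to find the best model by upwards of 70%.

Insight: Consensus and disagreement among models can guide the labeling process efficiently, greatly improving the efficiency of model selection.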

Abstract: The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset – a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state-of-the-art. Code and data are available at https://github.com/justinkay/coda.
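
The general idea of disagreement-guided labeling with Bayesian updates can be sketched as follows. This is a deliberately simplified stand-in, not CODA: it ranks points once by vote entropy and keeps an independent Beta posterior over each model's accuracy, whereas the abstract describes a probabilistic framework that models relationships between classifiers, categories, and data points. The names `vote_entropy`, `select_model`, `oracle`, and the toy predictions are all assumptions.

```python
import numpy as np


def vote_entropy(predictions):
    """Entropy of the candidate models' votes for each data point.

    predictions: [n_models, n_points] integer class predictions.
    """
    n_models, n_points = predictions.shape
    entropies = np.zeros(n_points)
    for i in range(n_points):
        _, counts = np.unique(predictions[:, i], return_counts=True)
        p = counts / n_models
        entropies[i] = -np.sum(p * np.log(p))
    return entropies


def select_model(predictions, oracle, budget):
    """Label the most contested points and update Beta posteriors on accuracy."""
    n_models, _ = predictions.shape
    alpha = np.ones(n_models)  # pseudo-count of correct predictions (uniform prior)
    beta = np.ones(n_models)   # pseudo-count of incorrect predictions
    order = np.argsort(-vote_entropy(predictions), kind="stable")[:budget]
    for i in order:
        label = oracle(i)  # one annotation request
        correct = predictions[:, i] == label
        alpha += correct
        beta += ~correct
    posterior_mean = alpha / (alpha + beta)
    return int(np.argmax(posterior_mean)), posterior_mean


# Toy example: three candidate models, six unlabeled points, three labels allowed.
labels = np.array([0, 1, 1, 0, 1, 0])
predictions = np.array([
    [0, 1, 1, 0, 1, 0],  # model 0: always right
    [0, 1, 0, 1, 1, 0],  # model 1: wrong on points 2 and 3
    [0, 0, 0, 1, 0, 0],  # model 2: wrong on points 1 to 4
])
best, posterior_mean = select_model(predictions, lambda i: labels[i], budget=3)
```

Points on which all models agree carry no information about which model is better, so the budget is spent only on contested points; here three labels suffice to separate the three candidates.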