cs.CV

[1] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models cs.CV | cs.AI | cs.CL | cs.GR | cs.MM

Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj

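TL;DR: This paper proposes a multimodal cinematic video synthesis method that combines text-to-image and audio generation models to produce high-quality 60-second cinematic videos.

Details

Motivation: As generative AI advances, demand for automated multimedia creation is growing, particularly for generating professional-quality cinematic video from text input.

Result: The experiments show strong visual quality, narrative coherence, and efficiency, making the method suitable for creative, educational, and industrial applications.

Insight: Combining multiple generative models with post-processing enables efficient, high-quality multimodal video synthesis and offers a new approach to automated text-to-video creation.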

Abstract: Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
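The linear frame interpolation mentioned in the abstract can be illustrated with a minimal sketch. The function name `interpolate_frames` and the uint8 frame layout are assumptions for illustration; the abstract does not describe the authors' implementation.

```python
import numpy as np

def interpolate_frames(frame_a, frame_b, n_intermediate):
    """Return n_intermediate frames linearly blended between two uint8 frames.

    Hypothetical helper: the endpoints themselves are not included in the output.
    """
    a = frame_a.astype(np.float32)
    b = frame_b.astype(np.float32)
    frames = []
    for i in range(1, n_intermediate + 1):
        t = i / (n_intermediate + 1)  # blend weight strictly between 0 and 1
        blended = (1.0 - t) * a + t * b
        # Round and clip back to the valid 8-bit pixel range.
        frames.append(np.clip(np.rint(blended), 0, 255).astype(np.uint8))
    return frames
```

Inserting frames this way raises the effective frame rate between generated keyframes, at the cost of ghosting when the two keyframes differ a lot.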


[2] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning cs.CV

Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang

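TL;DR: LoRA-Edit proposes a LoRA fine-tuning method that uses masks to give flexible control over video editing, avoiding large-scale pretraining while preserving the background and improving edit propagation.

Details

Motivation: Existing video editing methods depend on large-scale pretraining, which limits flexibility, and first-frame-guided editing cannot flexibly control subsequent frames. To address this, the authors propose a mask-aware LoRA fine-tuning method.

Result: Experiments show that the method outperforms existing techniques on video editing tasks and delivers high-quality, flexible edits.

Insight: Combining masks with LoRA gives an efficient and flexible solution for video editing and demonstrates the value of multiple reference inputs.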

Abstract: Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edit propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.
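As a rough illustration of the two ingredients named above, the sketch below shows a generic LoRA forward pass (frozen weight plus a low-rank update) and a loss restricted to a masked region. Both functions are assumptions for illustration: the abstract does not give the paper's objective or how the mask enters training.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Linear layer with a low-rank update: y = x @ (W + scale * B @ A).T.

    W (out, in) stays frozen; only A (r, in) and B (out, r) would be trained.
    """
    return x @ (W + scale * (B @ A)).T

def masked_mse(pred, target, mask):
    """Mean squared error over pixels where mask == 1 (hypothetical region loss)."""
    mask = mask.astype(np.float32)
    # max(..., 1.0) avoids division by zero for an empty mask.
    return float(((pred - target) ** 2 * mask).sum() / max(mask.sum(), 1.0))
```

With `B` initialised to zeros, as is common for LoRA, the layer starts out identical to the pretrained one, so training only moves it as far as the masked objective requires.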


[3] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding cs.CV

Bin Guo, John H. L. Hansen

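TL;DR: DeepTraverse is a new vision architecture inspired by depth-first search that builds more structured, interpretable feature representations through recursive exploration and adaptive calibration modules, and performs strongly on image classification.

Details

Motivation: Conventional vision models lack explicit pathways for adaptive, iterative refinement when building features. Can principles from classical search algorithms yield a more structured and logical processing flow?

Result: Across diverse image classification tasks, DeepTraverse often outperforms conventional models with similar or larger parameter counts.

Insight: Integrating algorithmic priors into vision model design can improve efficiency, performance, and structure.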

Abstract: Conventional vision backbones, despite their success, often construct features through a largely uniform cascade of operations, offering limited explicit pathways for adaptive, iterative refinement. This raises a compelling question: can principles from classical search algorithms instill a more algorithmic, structured, and logical processing flow within these networks, leading to representations built through more interpretable, perhaps reasoning-like decision processes? We introduce DeepTraverse, a novel vision architecture directly inspired by algorithmic search strategies, enabling it to learn features through a process of systematic elucidation and adaptive refinement distinct from conventional approaches. DeepTraverse operationalizes this via two key synergistic components: recursive exploration modules that methodically deepen feature analysis along promising representational paths with parameter sharing for efficiency, and adaptive calibration modules that dynamically adjust feature salience based on evolving global context. The resulting algorithmic interplay allows DeepTraverse to intelligently construct and refine feature patterns. Comprehensive evaluations across a diverse suite of image classification benchmarks show that DeepTraverse achieves highly competitive classification accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts. Our work demonstrates that integrating such algorithmic priors provides a principled and effective strategy for building more efficient, performant, and structured vision backbones.


[4] Test-Time Adaptation for Generalizable Task Progress Estimation cs.CV | cs.AI | I.2.6; I.2.9; I.2.10

Christos Ziakas, Alessandra Russo

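TL;DR: The paper proposes a test-time adaptation method that optimizes a self-supervised objective so that a progress estimation model can adapt online to the visual and temporal context of test trajectories.

Details

Motivation: To improve how progress estimation models generalize to out-of-distribution tasks, environments, and embodiments, the authors propose a test-time adaptation method trained on expert visual trajectories and their natural language task descriptions.

Result: On out-of-distribution tasks, environments, and embodiments, the method outperforms the state-of-the-art in-context learning approach based on autoregressive vision-language models.

Insight: Test-time adaptation combined with a preference for semantic content over temporal order markedly improves the generalization of progress estimation models, especially in out-of-distribution settings.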

Abstract: We propose a test-time adaptation method that enables a progress estimation model to adapt online to the visual and temporal context of test trajectories by optimizing a learned self-supervised objective. To this end, we introduce a gradient-based meta-learning strategy to train the model on expert visual trajectories and their natural language task descriptions, such that test-time adaptation improves progress estimation relying on semantic content over temporal order. Our test-time adaptation method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art in-context learning approach using autoregressive vision-language models.


[5] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models cs.CV

Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou

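TL;DR: EfficientVLA proposes a training-free acceleration framework that speeds up VLA model inference through three strategies (layer pruning, visual token selection, and caching of intermediate features) while largely preserving performance.

Details

Motivation: Existing VLA models, such as diffusion-based architectures, have high compute and memory demands that limit practical deployment. Existing acceleration methods usually target isolated inefficiencies and fail to address redundancy across the whole pipeline.

Result: On CogACT, the method achieves a 1.93x speedup and reduces FLOPs to 28.9%, with only a 0.6% drop in task success rate.

Insight: Training-free acceleration can substantially improve efficiency while maintaining performance, making practical deployment of VLA models more feasible.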

Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to CogACT, a standard VLA model, yielding a 1.93x inference speedup and reducing FLOPs to 28.9%, with only a 0.6% success rate drop on the SIMPLER benchmark.


[6] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild cs.CV | cs.ET

Klim Kireev, Ana-Maria Creţu, Raphael Meier, Sarah Adel Bargal, Elissa Redmiles

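TL;DR: The paper releases ICCWD, a multimodal image-caption dataset for detecting depictions of minors, and demonstrates its utility by benchmarking three detectors.

Details

Motivation: No dataset currently exists for detecting content depicting minors in a multimodal setting. The paper aims to fill this gap to support the development and evaluation of machine learning tools.

Result: The experiments indicate that detecting minors is a challenging task: the best method achieves a true positive rate of 75.3%.

Insight: ICCWD supports the design of better minor detection methods, especially in multimodal settings.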

Abstract: Platforms and the law regulate digital content depicting minors (defined as individuals under 18 years of age) differently from other types of content. Given the sheer amount of content that needs to be assessed, machine learning-based automation tools are commonly used to detect content depicting minors. To our knowledge, no dataset or benchmark currently exists for detecting these identification methods in a multi-modal environment. To fill this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an image-caption dataset aimed at benchmarking tools that detect depictions of minors. Our dataset is richer than previous child image datasets, containing images of children in a variety of contexts, including fictional depictions and partially visible bodies. ICCWD contains 10,000 image-caption pairs manually labeled to indicate the presence or absence of a child in the image. To demonstrate the possible utility of our dataset, we use it to benchmark three different detectors, including a commercial age estimation system applied to images. Our results suggest that child detection is a challenging task, with the best method achieving a 75.3% true positive rate. We hope the release of our dataset will aid in the design of better minor detection methods in a wide range of scenarios.
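For readers unfamiliar with the headline metric, the true positive rate is TP / (TP + FN): the share of images that actually contain a child that the detector flags. The sketch below computes it for binary labels; the function name and the 0/1 label encoding are assumptions for illustration.

```python
def true_positive_rate(y_true, y_pred):
    """TPR (recall) for binary labels where 1 = child present, 0 = absent."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)
```

A TPR of 75.3% therefore means roughly one in four positive images goes undetected, independent of how many false alarms the detector raises.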


[7] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers cs.CV | cs.AI | cs.LG

Natanael Lucena, Fábio S. da Silva, Ricardo Rios

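TL;DR: The paper compares CNNs and Vision Transformers (ViTs) on classifying images of psoriasis lesions and finds that ViTs perform better at smaller model sizes; DaViT-B, with an F1-score of 96.4%, is the most effective architecture for automated psoriasis detection.

Details

Motivation: The study explores the potential of different deep learning architectures, especially ViTs, for medical image classification, with the aim of improving the efficiency and accuracy of automated psoriasis detection.

Result: ViTs, and DaViT-B in particular, achieve the best F1-score (96.4%), outperforming the CNN models.

Insight: ViTs show clear potential for medical image classification, particularly where lightweight, efficient models are needed.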

Abstract: This paper presents a comparison of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying images containing lesions of psoriasis and diseases similar to it. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an f1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.
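The F1-score reported for a multi-class task depends on how per-class scores are averaged, which the abstract does not state. The sketch below assumes macro averaging (an unweighted mean of per-class F1) purely for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging weights rare lesion classes as heavily as common ones, which matters when psoriasis-like diseases are unevenly represented.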


[8] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs cs.CV | cs.LG

Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou

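TL;DR: The paper proposes ViCrit, a verifiable reinforcement learning proxy task that improves the visual perception of vision-language models (VLMs) by having them localize visual hallucination errors in text, and validates it on multiple benchmarks.

Details

Motivation: Reinforcement learning (RL) works well for large language models (LLMs), but VLMs lack vision tasks that are both verifiable and challenging. ViCrit aims to fill this gap.

Result: Models trained with ViCrit improve substantially across a range of VL tasks and generalize to abstract image reasoning and visual math.

Insight: The gains suggest that ViCrit teaches models to perceive rather than merely memorize seen objects, pointing to a new direction for improving VLMs.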

Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model to pinpoint the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
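The binary, exact-match reward described in the abstract could look like the minimal sketch below. The function name and the case/whitespace normalisation are assumptions; the abstract does not say how spans are compared.

```python
def exact_match_reward(predicted_span, corrupted_span):
    """Return 1.0 if the predicted span equals the injected error span, else 0.0.

    Lowercasing and whitespace collapsing are illustrative assumptions.
    """
    def normalize(s):
        return " ".join(s.lower().split())
    return 1.0 if normalize(predicted_span) == normalize(corrupted_span) else 0.0
```

Because the reward needs no learned judge, it is cheap to compute and hard to game, which is the property that makes a task usable as an RL signal.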


[9] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context cs.CV | physics.ao-ph

Yael Frischholz, Devis Tuia, Michael Lehning

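TL;DR: This paper proposes an attention-based model that implicitly learns clear-sky surface reflectance from temporal satellite imagery, enabling accurate retrieval of surface solar radiation (SSR) without hand-crafted features such as albedo maps or cloud masks.

Details

Motivation: Conventional SSR retrieval algorithms estimate background reflectance from monthly statistics, which performs poorly in mountainous regions with complex terrain and dynamic snow cover. The paper aims to solve this problem.

Result: Given a sufficiently long temporal context, the model matches the performance of albedo-informed models, with the effect strongest in mountainous regions.

Insight: Temporal information is essential for implicitly learning surface reflectance dynamics and improves generalization, especially in complex terrain.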

Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont’s SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided a sufficiently long temporal context, the model matches the performance of albedo-informed models, highlighting the model’s ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at https://github.com/frischwood/HeMu-dev.git


[10] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling cs.CV

Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis

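TL;DR: The paper proposes efficient probing (EP), an attentive probing method based on multi-query cross-attention that substantially reduces trainable parameters and compute while outperforming existing methods.

Details

Motivation: As self-supervised learning (SSL) is widely adopted, standard linear probing (LP) fails to reflect the potential of models trained with masked image modeling (MIM), so a more efficient attentive probing method is needed.

Result: EP outperforms LP and other attentive probing methods across seven benchmarks and performs well in low-shot and layer-wise settings.

Insight: Efficient attentive probing improves both the accuracy and the efficiency of model evaluation, for masked image modeling and for other pre-training paradigms.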

Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
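To make the idea of pooling patch tokens with several learnable queries concrete, here is a minimal sketch. The design (each query attends over all tokens but aggregates only its own channel chunk, with no key/value projections) is an assumption for illustration and may differ from the EP module in the linked repository.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_pool(tokens, queries):
    """Pool (n, d) patch tokens into one d-dim vector using m queries of shape (m, d).

    Hypothetical sketch: the tokens serve directly as keys and values.
    """
    n, d = tokens.shape
    m = queries.shape[0]
    if d % m != 0:
        raise ValueError("feature dim must be divisible by the number of queries")
    attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)  # (m, n)
    chunks = tokens.reshape(n, m, d // m)                     # split channels per query
    pooled = np.einsum("mn,nmc->mc", attn, chunks)            # (m, d/m)
    return pooled.reshape(d), attn
```

With all-zero queries the attention is uniform and the output reduces to plain average pooling, so the learned queries can be read as a reweighting of that baseline.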


[11] Improving Personalized Search with Regularized Low-Rank Parameter Updates cs.CV

Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M. Rehg

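TL;DR: The paper improves personalized vision-language retrieval through regularized low-rank parameter updates to the language encoder, balancing personal and general knowledge and achieving state-of-the-art results on DeepFashion2 and ConCon-Chi.

Details

Motivation: Personalized vision-language retrieval requires learning new concepts (e.g. "my dog Fido") from a few examples while integrating personal and general knowledge. Existing methods such as textual inversion have limitations, so the paper explores a more effective approach.

Result: On two personalized image retrieval benchmarks (DeepFashion2 and ConCon-Chi), the method outperforms prior work by 4%-22%.

Insight: Low-rank parameter updates are an efficient route to personalized retrieval; parameter addition is an effective way to combine multiple personal concepts; and VLM-generated captions can be used to evaluate how well general knowledge is preserved.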

Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g. “my dog Fido”) from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaption of a small set of parameters in the language encoder’s final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.
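The parameter-addition strategy for combining personal concepts can be sketched as summing per-concept low-rank updates onto a shared frozen weight. The function names and shapes below are hypothetical, and the regularizer used during training is not shown because the abstract does not specify it.

```python
import numpy as np

def low_rank_delta(B, A):
    """Per-concept update of rank at most r: B is (out, r), A is (r, in)."""
    return B @ A

def combine_concepts(W, deltas):
    """Add the updates of several learned concepts to the frozen base weight W."""
    # sum() starts from 0, so an empty list leaves W unchanged.
    return W + sum(deltas)
```

Addition keeps each concept's update independent, so a new concept can be learned later and merged without retraining the earlier ones.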


[12] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators cs.CV | cs.AI | cs.LG

Parsa Rahimi, Sebastien Marcel

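TL;DR: ScoreMix generates challenging synthetic samples by mixing scores in diffusion models, significantly improving discriminator performance, especially when labeled data is limited.

Details

Motivation: The work addresses the shortcomings of data augmentation when training discriminative models with limited labeled data, using the score-compositional properties of diffusion models to generate more effective synthetic samples.

Result: ScoreMix significantly improves discriminator performance across the benchmarks studied, particularly with limited data.

Insight: Combining classes that are distant in the discriminator's embedding space works better than combining classes that are close in the generator's condition space, and the correlation between the two learned spaces is minimal.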

Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator’s embedding space, rather than close in the generator’s condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator’s learned condition space and the discriminator’s embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: https://parsa-ra.github.io/scoremix
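The convex mixing step can be written in one line: at a given sampling step, the scores predicted under two class conditions are blended with a weight between 0 and 1. The name `mix_scores` and the weight `lam` are hypothetical; how the score arrays are produced by the diffusion model is outside this sketch.

```python
import numpy as np

def mix_scores(score_a, score_b, lam):
    """Convex combination of two class-conditioned scores: lam*a + (1-lam)*b."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex mixture")
    return lam * np.asarray(score_a) + (1.0 - lam) * np.asarray(score_b)
```

Applying the mixed score at every denoising step steers the sample toward a blend of the two classes, which is what makes the resulting images hard examples for the discriminator.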


[13] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops cs.CV

Hamid Kamangir, Mona Hajiesmaeeli, Mason Earles

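TL;DR: The paper presents a comprehensive California crop yield benchmark dataset combining satellite imagery, climate, evapotranspiration, and soil data, and develops a multimodal deep learning model that forecasts county-level yields for over 70 crops with an overall $R^2$ of 0.76.

Details

Motivation: California is a global leader in agricultural production, yet accurate and timely crop yield forecasting remains difficult because of the complex interplay of environmental, climatic, and soil factors.

Result: The model reaches an overall $R^2$ of 0.76 on an unseen test set, indicating strong predictive performance.

Insight: The work provides a valuable framework for agricultural forecasting, climate adaptation, and precision farming.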

Abstract: California is a global leader in agricultural production, contributing 12.5% of the United States' total output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a time-series encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall $R^2$ score of 0.76 across all crops on an unseen test dataset, highlighting strong predictive performance across California's diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.
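The reported score is the coefficient of determination, which is 1 minus the ratio of residual to total sum of squares. The sketch below is the standard definition in plain Python; how the authors pool crops and counties before computing it is not stated in the abstract.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot
```

A value of 0.76 means the model explains about three quarters of the variance in observed yield; always predicting the mean yield would score 0.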


[14] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos cs.CV

Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli

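TL;DR: DySS proposes an efficient method for 3D object detection from multi-camera video based on dynamic queries and state-space learning, using sparse queries and a state-space model to improve accuracy and efficiency.

Details

Motivation: Existing multi-camera 3D detection methods rely on dense BEV features or large numbers of queries, which is computationally expensive and hard to scale to many video frames. DySS aims to address this with dynamic queries and state-space learning.

Result: On the nuScenes test split, DySS reaches 65.31 NDS and 57.4 mAP, outperforming prior methods; on the val split it reaches 56.2 NDS and 46.2 mAP at a real-time inference speed of 33 FPS.

Insight: Sparse queries and state-space learning can improve both the efficiency and the accuracy of multi-camera video 3D detection, and the dynamic query mechanism helps reduce compute.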

Abstract: Camera-based 3D object detection in Bird’s Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
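Sequential processing of per-frame features with a state-space model can be illustrated with the simplest discrete linear form, h_t = A h_{t-1} + B x_t. This is a generic textbook recurrence under assumed fixed matrices `A` and `B`; the SSM used in DySS is likely more elaborate and is not described in the abstract.

```python
import numpy as np

def ssm_scan(xs, A, B, h0=None):
    """Run h_t = A @ h_{t-1} + B @ x_t over a sequence xs of shape (T, d_in).

    Returns the stacked states, shape (T, d_state).
    """
    h = np.zeros(A.shape[0]) if h0 is None else h0
    states = []
    for x in xs:
        h = A @ h + B @ x  # the state summarises everything seen so far
        states.append(h)
    return np.stack(states)
```

The cost grows linearly with the number of frames and the state has a fixed size, which is why such a summary stays cheap as more video frames are added.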


[15] HalLoc: Token-level Localization of Hallucinations for Vision Language Models cs.CV

Eunkyu Park, Minyeong Kim, Gunhee Kim

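TL;DR: HalLoc introduces a new dataset and baseline model for efficient, probabilistic hallucination detection, improving the reliability of vision-language models.

Details

Motivation: Current hallucination detection methods demand substantial compute and cannot handle real-world cases where the line between hallucinated and truthful content is unclear.

Result: The HalLoc dataset and model are publicly released, providing new tools for improving the reliability of vision-language models.

Insight: A probabilistic hallucination detection module could become a practical plug-in for improving model trustworthiness in real-world settings.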

Abstract: Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.


[16] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation cs.CV | cs.AI

Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya

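TL;DR: This paper presents a comprehensive evaluation of skin cancer classification using transfer learning and uncertainty quantification (UQ), finding that a CLIP-based vision transformer combined with an SVM performs best and that ensemble methods strike a good balance between accuracy and uncertainty handling.

Details

Motivation: Accurate skin cancer diagnosis is critical for early treatment, but existing deep learning methods face data scarcity and a lack of uncertainty awareness.

Result: CLIP-based ViT-H/14 combined with an SVM performs best; ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions.

Insight: Uncertainty quantification is essential in deep-learning-based medical diagnosis and can improve model trustworthiness and practical value.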

Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, but their performance can be limited by data scarcity and a lack of uncertainty awareness. In this study, we present a comprehensive evaluation of DL-based skin lesion classification using transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the first phase, we benchmarked several pre-trained feature extractors (including Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50 (ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual Geometry Group network (VGG16), and EfficientNet-V2-Large) combined with a range of traditional classifiers such as Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and logistic regression. Our results show that CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, deliver the highest classification performance. In the second phase, we incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) to assess not only prediction accuracy but also the reliability of model outputs. We evaluated these models using uncertainty-aware metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen), uncertainty specificity (USpe), and uncertainty precision (UPre). The results demonstrate that ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions. This study highlights the importance of integrating UQ into DL-based medical diagnosis to enhance both performance and trustworthiness in real-world clinical applications.
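Monte Carlo Dropout and ensembles both yield several probability vectors per image, from stochastic forward passes or from different models. One common way to turn them into an uncertainty value is the entropy of the averaged prediction, sketched below. The function name and the choice of predictive entropy are assumptions; the abstract does not say which uncertainty measure feeds UAcc, USen, USpe, and UPre.

```python
import numpy as np

def predictive_entropy(prob_samples):
    """Average (T, C) class probabilities over T passes/members and return
    the mean prediction together with its entropy (in nats)."""
    mean_probs = np.asarray(prob_samples, dtype=float).mean(axis=0)
    # A small epsilon keeps log() finite for zero probabilities.
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return mean_probs, float(entropy)
```

Predictions whose entropy exceeds a chosen threshold can be flagged as uncertain and routed to a clinician instead of being acted on automatically.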


[17] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework cs.CV | cs.AI | cs.LG

Sadia Kamal, Tim Oates, Joy Wan

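TL;DR: The paper proposes a weakly supervised multimodal framework that generates structured SOAP (Subjective, Objective, Assessment, and Plan) notes from limited inputs such as lesion images and sparse clinical text, aiming to reduce clinician burden and reliance on large annotated datasets.

Details

Motivation: Skin cancer is the most common cancer globally and incurs high annual healthcare costs. Clinicians must manually write detailed SOAP notes, which is time-consuming and adds to their workload. The paper aims to address this with a weakly supervised approach.

Result: The method performs comparably to GPT-4o, Claude, and DeepSeek Janus Pro on key clinical relevance metrics, supporting its clinical utility and scalability.

Insight: Weakly supervised learning can produce high-quality clinical notes when annotated data is limited, while reducing physician workload.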

Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.


[18] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video cs.CV | eess.IV

Fei Zhao, Da Pan, Zelu Qi, Ping Shi

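TL;DR: The paper addresses audio-visual quality assessment (AVQA) for user-generated omnidirectional video (ODV), constructing a dataset and proposing a baseline model based on feature extraction and fusion.

Details

Motivation: With the rise of the Metaverse, omnidirectional video is shifting from professional content toward user-generated content (UGC), but audio-visual quality assessment for UGC omnidirectional video remains little studied.

Result: Experiments show that the model achieves the best performance on the proposed dataset.

Insight: 1. Audio-visual quality assessment for user-generated omnidirectional video is an emerging research direction; 2. Feature fusion is key to improving AVQA model performance.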

Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset, consisting of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.
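A Mean Opinion Score is the average of the ratings that subjects gave to one sequence. The sketch below assumes a complete subjects-by-videos rating matrix and applies no outlier screening or score normalisation, either of which the authors may have used.

```python
import numpy as np

def mean_opinion_scores(ratings):
    """Per-video MOS from a (num_subjects, num_videos) array of ratings."""
    return np.asarray(ratings, dtype=float).mean(axis=0)
```

These per-video averages are the regression targets that an AVQA model is trained to predict from the audio and video streams.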


[19] Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions cs.CV | cs.AI

Deliang Wang, Chao Yang, Gaowei Chen

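TL;DR: This paper explores detecting students’ academic emotions with vision-language models (VLMs) via zero-shot prompting, as an alternative to traditional supervised learning. Qwen2.5-VL-7B-Instruct performs relatively well at recognizing confusion but still falls short at detecting distracted behavior.

Details

Motivation: Students’ academic emotions significantly affect learning performance. Traditional supervised methods generalize poorly, while VLMs open new possibilities for cross-task generalization, which motivates studying their use in emotion recognition.

Result: Qwen2.5-VL-7B-Instruct outperforms Llama-3.2-11B-Vision-Instruct, particularly at recognizing confusion, but performs poorly at detecting distracted behavior.

Insight: VLMs show potential for academic emotion recognition, and zero-shot prompting in particular avoids the need for data annotation and fine-tuning. Performance still has room to improve, especially for specific categories such as distracted behavior.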

Abstract: Students’ academic emotions significantly influence their social behavior and learning performance. Traditional approaches to automatically and accurately analyze these emotions have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students’ academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel in identifying students’ happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students’ confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.


[20] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting cs.CV

Lintao Xiang, Hongpei Zheng, Yating Huang, Qijun Yang, Hujun Yin

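TL;DR: PointGS combines a point attention mechanism with Gaussian splatting to achieve real-time, high-quality rendering from sparse views, addressing the overfitting of 3DGS under sparse inputs.

Details

Motivation: Existing 3DGS methods need a large number of calibrated views to produce a consistent scene representation. With sparse views they tend to overfit the training views, which degrades rendering quality. The authors aim to improve 3DGS so that it also renders efficiently under sparse inputs.

Result: Experiments show that PointGS significantly outperforms NeRF-based methods on diverse datasets and is competitive with state-of-the-art 3DGS methods in few-shot settings.

Insight: Introducing point attention effectively improves the generalization of Gaussian splatting under sparse views, indicating the importance of local feature interaction for 3D rendering quality.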

Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training views, leading to noticeable degradation in rendering quality. To address this limitation, we propose a Point-wise Feature-Aware Gaussian Splatting framework that enables real-time, high-quality rendering from sparse training views. Specifically, we first employ the latest stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization. We then encode the colour attributes of each 3D Gaussian by sampling and aggregating multiscale 2D appearance features from sparse inputs. To enhance point-wise appearance representation, we design a point interaction network based on a self-attention mechanism, allowing each Gaussian point to interact with its nearest neighbors. These enriched features are subsequently decoded into Gaussian parameters through two lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive experiments on diverse benchmarks demonstrate that our method significantly outperforms NeRF-based approaches and achieves competitive performance under few-shot settings compared to the state-of-the-art 3DGS methods.


[21] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models cs.CV | cs.AI

Jun Yin, Jing Zhong, Peilin Li, Pengyu Zeng, Miao Zhang

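TL;DR: This paper proposes UrbanSense, a vision-language-model-based framework for quantitatively analyzing stylistic differences in urban streetscapes, and builds the UrbanDiffBench dataset. Experiments show that it effectively captures style differences.

Details

Motivation: Urban streetscape styles vary with geographical, historical, and socio-political factors. Traditional methods that rely on expert interpretation and historical documentation are hard to standardize, so an objective, data-driven, automated analysis framework is needed.

Result: Over 80% of generated descriptions pass the t-test (p < 0.05), and subjective evaluation yields high Phi scores (0.912 for cities, 0.833 for periods), demonstrating the framework’s ability to capture style differences.

Insight: The framework provides a scientific basis for quantitative analysis of urban style evolution and can be used for objective evaluation of future designs.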

Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method’s ability to capture subtle stylistic differences. These results highlight the method’s potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.


[22] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration cs.CV

Mina C. Moghadam, Alan Q. Wang, Omer Taub, Martin R. Prince, Mert R. Sabuncu

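TL;DR: RealKeyMorph (RKM) is a resolution-agnostic image registration method that outputs keypoints in real-world coordinates, avoiding the artifacts that resampling introduces in conventional methods.

Details

Motivation: In medical image registration, differing acquisition parameters lead to differing resolutions, and conventional methods that resample to a fixed resolution introduce interpolation artifacts. RKM aims to solve this problem.

Result: Experiments show that RKM performs well on registration of orthogonal 2D stacks of abdominal MRIs and of 3D volumes from brain datasets with varying resolutions.

Insight: Handling keypoints in real-world coordinates sidesteps resolution constraints and improves registration quality, which suits multi-resolution medical imaging scenarios.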

Abstract: Many real-world settings require registration of a pair of medical images that differ in spatial resolution, which may arise from differences in image acquisition parameters like pixel spacing, slice thickness, and field-of-view. However, all previous machine learning-based registration techniques resample images onto a fixed resolution. This is suboptimal because resampling can introduce artifacts due to interpolation. To address this, we present RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is an extension of KeyMorph, a registration framework which works by training a network to learn corresponding keypoints for a given pair of images, after which a closed-form keypoint matching step is used to derive the transformation that aligns them. To avoid resampling and enable operating on the raw data, RKM outputs keypoints in real-world coordinates of the scanner. To do this, we leverage the affine matrix produced by the scanner (e.g., MRI machine) that encodes the mapping from voxel coordinates to real world coordinates. By transforming keypoints into real-world space and integrating this into the training process, RKM effectively enables the extracted keypoints to be resolution-agnostic. In our experiments, we demonstrate the advantages of RKM on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as 3D volumes with varying resolutions in brain datasets.
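
The voxel-to-world mapping that the abstract relies on can be illustrated with a minimal sketch. The two affine matrices below are made-up values, not taken from the paper, and `voxel_to_world` is a hypothetical helper name. The point is only that keypoints found at different voxel indices in two differently sampled images can land on the same world-space location.

```python
def voxel_to_world(affine, ijk):
    """Map a voxel index (i, j, k) to world coordinates using a 4x4 affine."""
    i, j, k = ijk
    homogeneous = (i, j, k, 1.0)
    # Only the first three rows are needed; the last row is (0, 0, 0, 1).
    return tuple(
        sum(affine[row][col] * homogeneous[col] for col in range(4))
        for row in range(3)
    )

# Hypothetical affines: same scanner origin, different voxel spacing (mm).
affine_fine = [
    [1.0, 0.0, 0.0, -10.0],
    [0.0, 1.0, 0.0, -20.0],
    [0.0, 0.0, 1.0, 5.0],
    [0.0, 0.0, 0.0, 1.0],
]
affine_coarse = [
    [2.0, 0.0, 0.0, -10.0],
    [0.0, 2.0, 0.0, -20.0],
    [0.0, 0.0, 5.0, 5.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Different voxel indices, same physical location.
p_fine = voxel_to_world(affine_fine, (20, 40, 10))
p_coarse = voxel_to_world(affine_coarse, (10, 20, 2))
print(p_fine, p_coarse)
```

Because both keypoints are expressed in millimetres in scanner space, a downstream matching step can compare them directly without resampling either image.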


[23] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation cs.CV

Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu

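TL;DR: This paper proposes Motion-R1, a framework that combines Chain-of-Thought reasoning and reinforcement learning to improve semantic understanding and motion quality in text-to-motion generation.

Details

Motivation: Existing text-to-motion methods usually rely on end-to-end mapping strategies that fail to capture deep linguistic structure and logical reasoning, so the generated motions lack controllability, consistency, and diversity.

Result: The method performs strongly on multiple benchmark datasets and outperforms existing methods especially in scenarios requiring nuanced semantic understanding and long-term temporal coherence.

Insight: Chain-of-Thought reasoning can significantly improve semantic guidance and logical consistency in text-to-motion generation, while reinforcement learning further optimizes motion quality.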

Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model’s ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
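
As a rough illustration of the "group relative" idea in Group Relative Policy Optimization, the sketch below normalizes the rewards of several sampled outputs for one prompt by the group mean and standard deviation. This is the commonly described form of the advantage computation. The reward values are hypothetical, and the abstract does not specify how Motion-R1 computes its motion quality feedback.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in the same group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical quality scores for four motions sampled from one text prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above the group average receive a positive advantage and are reinforced; samples below it receive a negative one. No separate value network is needed for this baseline.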


[24] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device cs.CV

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

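TL;DR: FaceLiVT is a lightweight yet powerful face recognition model that combines a hybrid CNN-Transformer architecture with a Multi-Head Linear Attention mechanism, reducing computational complexity while maintaining high accuracy.

Details

Motivation: To achieve efficient, real-time face recognition on mobile devices while reducing computational resource consumption.

Result: The model performs strongly on multiple benchmarks, with inference 8.6 times faster than EdgeFace and 21.2 times faster than a pure ViT-based model.

Insight: A hybrid architecture with a lightweight attention mechanism is an effective solution for efficient face recognition on mobile devices.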

Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.


[25] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion cs.CV

Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu

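TL;DR: FSATFusion is an infrared and visible image fusion network based on a frequency-spatial attention Transformer, which improves fusion performance through an improved Transformer module and attention mechanism.

Details

Motivation: Existing deep learning methods for infrared and visible image fusion (IVIF) rely on convolutional neural networks, but convolution struggles to capture global context, which causes information loss and limits fusion performance.

Result: Experiments show that FSATFusion outperforms existing methods in fusion quality and efficiency and generalizes well to downstream tasks such as object detection.

Insight: A Transformer architecture that combines frequency and spatial attention shows promise for image fusion, better preserving and fusing key information from multimodal images.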

Abstract: Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer’s ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications, to verify the excellent generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.


[26] Revisiting Transformers with Insights from Image Filtering cs.CV | cs.LG

Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen

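TL;DR: This paper revisits the Transformer self-attention mechanism through an image processing framework and proposes a unified theoretical interpretation that explains not only the self-attention computation but also the role of components such as positional encoding and residual connections. In addition, two proposed architectural modifications improve interpretability while also notably improving task accuracy and robustness.

Details

Motivation: Self-attention is highly effective, but its theoretical explanation remains incomplete. Prior work has tried to understand self-attention from the perspectives of image denoising and nonparametric regression, but lacks a deeper mechanistic interpretation of the architectural components. This paper aims to fill that gap.

Result: Experiments show that the proposed modifications notably improve accuracy and robustness against data contamination across multiple tasks, and in particular improve long-sequence understanding.

Insight: Connecting self-attention with image processing theory not only gives the model theoretical grounding but also inspires new architectural designs, suggesting that cross-domain transfer of theory can advance deep learning.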

Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.
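
One way to see why self-attention invites a filtering interpretation: each output is a weighted average of the value vectors, with weights set by query-key similarity, much as a similarity-based image filter averages pixels. The sketch below is plain softmax attention in pure Python, not the paper's framework or its proposed modifications. The function name and inputs are illustrative assumptions.

```python
import math

def attention_filter(queries, keys, values, scale=1.0):
    """Softmax attention: each output row is a weighted average of the values."""
    outputs = []
    for q in queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs

# A query that is similar to the first key mostly copies the first value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
smoothed = attention_filter([[5.0, 0.0]], keys, values)
print(smoothed)
```

When all keys are identical the weights become uniform and the operation reduces to a plain mean filter over the values.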


[27] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial cs.CV

Jerry Yan, Chinmay Talegaonkar, Nicholas Antipa, Eric Terrill, Sophia Merrifield

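TL;DR: This paper proposes PoseIDON, a computer vision method that combines deep foundation models with multiview photogrammetry to estimate the six-degrees-of-freedom pose of seafloor objects and the orientation of the surrounding seafloor. Burial depth is inferred from these estimates, enabling accurate mapping of the burial state of seafloor objects.

Details

Motivation: Accurately estimating the burial state of anthropogenic objects on the seafloor is critical for studying sedimentation dynamics and for assessing ecological risk and pollutant transport, but partial occlusion, poor visibility, and object degradation make precise measurement from remote imagery difficult.

Result: In experiments at a historic ocean dumpsite in the San Pedro Basin, the model achieves a mean burial depth error of about 10 centimeters and captures spatial patterns that reflect sediment transport processes.

Insight: The method offers a new route to non-invasive, scalable mapping of seafloor burial, applicable to environmental contamination assessment and related applications.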

Abstract: The burial state of anthropogenic objects on the seafloor provides insight into localized sedimentation dynamics and is also critical for assessing ecological risks, potential pollutant transport, and the viability of recovery or mitigation strategies for hazardous materials such as munitions. Accurate burial depth estimation from remote imagery remains difficult due to partial occlusion, poor visibility, and object degradation. This work introduces a computer vision pipeline, called PoseIDON, which combines deep foundation model features with multiview photogrammetry to estimate six degrees of freedom object pose and the orientation of the surrounding seafloor from ROV video. Burial depth is inferred by aligning CAD models of the objects with observed imagery and fitting a local planar approximation of the seafloor. The method is validated using footage of 54 objects, including barrels and munitions, recorded at a historic ocean dumpsite in the San Pedro Basin. The model achieves a mean burial depth error of approximately 10 centimeters and resolves spatial burial patterns that reflect underlying sediment transport processes. This approach enables scalable, non-invasive mapping of seafloor burial and supports environmental assessment at contaminated sites.
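
The geometric step described above, fitting a local plane to the seafloor and measuring how far part of an aligned object lies below it, can be sketched as follows. This is a simplified illustration under stated assumptions: the plane is parameterized as z = a*x + b*y + c and fitted by ordinary least squares, burial depth is taken as the perpendicular distance of a single object point below the plane, and all coordinates are invented. The paper's actual definition and fitting procedure may differ.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c via the normal equations."""
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)
    normal_matrix = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(points))]]
    rhs = [sxz, syz, sz]
    denom = det3(normal_matrix)
    coeffs = []
    for col in range(3):  # Cramer's rule, one unknown per column
        m = [row[:] for row in normal_matrix]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / denom)
    return tuple(coeffs)

def depth_below_plane(plane, point):
    """Perpendicular distance of a point below the plane (0 if above it)."""
    a, b, c = plane
    x, y, z = point
    signed = (z - (a * x + b * y + c)) / math.sqrt(a * a + b * b + 1.0)
    return max(0.0, -signed)

# Hypothetical seafloor samples (metres) on a gently sloping plane.
seafloor = [(0.0, 0.0, -100.0), (1.0, 0.0, -99.9), (0.0, 1.0, -100.0), (1.0, 1.0, -99.9)]
plane = fit_plane(seafloor)
# Hypothetical lowest point of an object model after pose alignment.
depth = depth_below_plane(plane, (0.5, 0.5, -100.25))
print(plane, depth)
```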


[28] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba cs.CV

Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin

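TL;DR: DART proposes a dynamic adaptive region tokenizer that gives Vision Transformer and Mamba models content-dependent, variable-size image patches, notably improving performance while reducing computational overhead.

Details

Motivation: Existing non-convolutional models such as ViT and Vim rely on fixed-size patches, which leads to redundant encoding of background regions or loss of critical local detail. An adaptive method is needed to address this.

Result: Accuracy on DeiT improves by 2.1% and FLOPs are reduced by 45%, with consistently strong results on DeiT, Vim, and VideoMamba.

Insight: Dynamic adaptive token allocation outperforms uniformly increasing token density, notably improving both efficiency and performance.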

Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
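
The idea of allocating denser tokens to information-rich areas through quantiles can be illustrated in one dimension. In the sketch below, region boundaries are placed so that each region holds an equal share of the total score mass, which makes regions narrow where scores are high. This is a non-learned illustration with made-up scores, not DART's differentiable implementation. `quantile_boundaries` is a hypothetical helper.

```python
def quantile_boundaries(scores, num_regions):
    """Place region boundaries at equal-mass quantiles of the score profile."""
    total = sum(scores)
    cdf = [0.0]
    for s in scores:
        cdf.append(cdf[-1] + s / total)
    boundaries = []
    for k in range(1, num_regions):
        target = k / num_regions
        for i in range(len(scores)):
            if cdf[i + 1] >= target:
                # Linear interpolation inside the cell where the CDF crosses.
                frac = (target - cdf[i]) / (cdf[i + 1] - cdf[i])
                boundaries.append(i + frac)
                break
    return boundaries

# Hypothetical per-column importance scores; the middle columns matter most.
scores = [1, 1, 1, 1, 4, 4, 1, 1, 1, 1]
boundaries = quantile_boundaries(scores, num_regions=4)
print(boundaries)  # [4.0, 5.0, 6.0]
```

With four regions, the ten columns are split into widths 4, 1, 1, and 4: the two high-score columns each get their own region, while the low-score background is covered by wide ones.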


[29] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion cs.CV

Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang

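TL;DR: ReconMOST proposes a data-driven guided diffusion framework for multi-layer sea temperature reconstruction, addressing the challenges that conventional methods face from sparse data, algorithmic complexity, and high computational cost.

Details

Motivation: Conventional sea temperature reconstruction is limited by data sparsity and algorithmic complexity, while existing machine learning methods are mostly confined to the sea surface and local regions and struggle with issues such as cloud occlusion. ReconMOST aims to solve these problems with a diffusion model, achieving high-accuracy reconstruction of global multi-layer ocean temperature.

Result: Experiments on CMIP6 and EN4 analysis data show mean squared error (MSE) values of 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total. The method handles 92.5% missing data while maintaining reconstruction accuracy and spatial resolution.

Insight: 1. Diffusion models can effectively combine the strengths of numerical simulation and observational data to achieve high-accuracy sea temperature reconstruction; 2. The physically consistent distribution patterns learned in pre-training are crucial for reconstructing regions without observations; 3. The method offers a new approach to global multi-layer sea temperature reconstruction.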

Abstract: Accurate reconstruction of the ocean is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the increasing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based sea surface temperature (SST) reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) reaches 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at https://github.com/norsheep/ReconMOST.
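
The three reported MSE figures can be read as errors over different subsets of grid points. The sketch below assumes, without confirmation from the abstract, that "guidance" means points where observations were supplied, "reconstruction" means points without observations, and "total" means all points. The tiny arrays are invented for illustration.

```python
def masked_mse(pred, truth, mask):
    """Mean squared error over the positions where mask is True."""
    squared = [(p - t) ** 2 for p, t, m in zip(pred, truth, mask) if m]
    return sum(squared) / len(squared)

# Hypothetical temperatures at four grid points (degrees Celsius).
truth = [10.0, 12.0, 14.0, 16.0]
pred = [10.0, 12.5, 13.0, 16.0]
observed = [True, False, False, True]  # where guidance data was available

mse_guidance = masked_mse(pred, truth, observed)
mse_reconstruction = masked_mse(pred, truth, [not m for m in observed])
mse_total = masked_mse(pred, truth, [True] * len(truth))
print(mse_guidance, mse_reconstruction, mse_total)  # 0.0 0.625 0.3125
```

Under this reading, a low guidance error alongside a higher reconstruction error is expected, because unobserved points are constrained only by the patterns learned in pre-training.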


[30] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation cs.CV | cs.AI

Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang

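TL;DR: Pisces is an auto-regressive multimodal foundation model that, through a decoupled visual encoding architecture and tailored training techniques, performs strongly in both image understanding and image generation.

Details

Motivation: Although multimodal foundation models have progressed in image understanding and generation, unified models still lag behind specialized ones. The main challenges are the differences in the visual features each task needs and in the training processes they require.

Result: Pisces performs strongly on more than 20 image understanding benchmarks and on the GenEval image generation benchmark, validating its multi-task capability.

Insight: The study shows a synergy between image understanding and generation, and shows that using separate visual encoders further advances unified multimodal models.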

Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.


[31] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment cs.CV

Shuo Wang, Jihao Zhang

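TL;DR: MF2Summ is a multimodal-fusion video summarization method that combines visual and auditory information with cross-modal Transformers to overcome the shortcomings of traditional single-modality methods, notably improving performance.

Details

Motivation: The rapid growth of online video content calls for efficient video summarization. Traditional methods usually rely on a single modality (e.g., visual) and struggle to capture a video's full semantic richness. The paper therefore proposes a multimodal fusion method that combines visual and auditory information.

Result: On the SumMe and TVSum datasets, F1-scores improve by 1.9% and 0.6% respectively over the DSNet model, and the method compares favorably with other existing methods.

Insight: Multimodal fusion effectively improves video summarization, with the cross-modal Transformer and temporal alignment being key.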

Abstract: The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
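
The key shot selection step mentions Non-Maximum Suppression (NMS) over predicted segments. A minimal sketch of temporal NMS is given below: segments are visited in descending score order and a segment is dropped if it overlaps an already kept one too strongly. The segments, scores, and the 0.5 threshold are illustrative assumptions, not values from the paper.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Return indices of segments kept after greedy non-maximum suppression."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical predicted segments (seconds) with importance scores.
segments = [(0.0, 10.0), (1.0, 11.0), (20.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = temporal_nms(segments, scores)
print(kept)  # [0, 2]
```

The second segment overlaps the first with an IoU of 9/11, so it is suppressed, while the non-overlapping third segment survives.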


[32] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts cs.CV | cs.CL | cs.LG | cs.MM

Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen

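TL;DR: The paper proposes CIDer, a robust multimodal emotion recognition framework that uses self-distillation and causal inference modules to address missing modalities and distribution shift simultaneously, and introduces a new task and new datasets.

Details

Motivation: Multimodal emotion recognition (MER) faces challenges from missing modalities and out-of-distribution (OOD) data. Existing methods rely on specific models or introduce excessive parameters, which limits their practicality.

Result: CIDer is robust in both RMFM and OOD scenarios, with fewer parameters and faster training than existing methods.

Insight: Combining self-distillation and causal inference effectively addresses missing modalities and distribution shift, offering a practical and efficient solution for MER.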

Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.


[33] Rethinking Generative Human Video Coding with Implicit Motion Transformation cs.CV | eess.IV

Bolin Chen, Ru-Ling Liao, Jie Chen, Yan Ye

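TL;DR: The paper proposes a generative human video coding method based on Implicit Motion Transformation (IMT), addressing the distortion and inaccurate motion that conventional explicit motion guidance causes in complex human body videos.

Details

Motivation: When conventional generative video coding handles complex human motion, explicit motion fields lead to distorted reconstructions and inaccurate motion, so a new approach is needed.

Result: Experiments show that IMT achieves high-efficiency compression and high-fidelity synthesis in generative human video coding.

Insight: Implicit motion transformation captures complex human motion patterns more flexibly and avoids the limitations of explicit motion fields, offering a new direction for generative video coding.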

Abstract: Beyond traditional hybrid video codecs, generative video codecs can achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns: when explicit motion guidance is used for Generative Human Video Coding (GHVC), the reconstruction results can suffer severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation (IMT). In particular, we propose to characterize complex human body signals as compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which can help GHVC achieve high-efficiency compression and high-fidelity synthesis.


[34] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models cs.CV

Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang

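TL;DR: This paper proposes a new medical image segmentation task, medical image reasoning segmentation, and achieves precise segmentation driven by complex clinical questions through the MedSeg-R framework, which builds on multimodal large language models (MLLMs). The authors also introduce the MedSeg-QA dataset to support the task.

Details

Motivation: Existing medical image segmentation models depend on explicit instructions and lack the reasoning ability to handle complex clinical questions, while MLLMs do well on medical question answering but struggle to generate precise segmentation masks. The paper aims to close this gap.

Result: Experiments show that MedSeg-R performs strongly on several benchmarks, achieving high segmentation accuracy and interpretable textual analysis.

Insight: Leveraging the reasoning ability of MLLMs makes it possible to segment under complex medical instructions while producing interpretable results, advancing automatic medical diagnosis.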

Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also producing the corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R’s superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.


[35] LLMs Are Not Yet Ready for Deepfake Image Detection cs.CV

Shahroz Tariq, David Nguyen, M. A. P. Chamikara, Tingmin Wu, Alsharif Abuadbba

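TL;DR: This paper evaluates four mainstream vision-language models (VLMs) on deepfake image detection in a zero-shot setting, finding that although they can produce plausible explanations and spot surface-level anomalies, they are not yet suitable as standalone detection systems.

Details

Motivation: Advances in deepfake technology threaten media integrity and public trust, and VLMs are seen as potentially suited to deepfake detection thanks to their multimodal capabilities. The study aims to assess how VLMs actually perform on this task.

Result: VLMs show significant limitations as standalone detectors, but their strengths in contextual analysis and interpretability make them useful as complementary tools in hybrid or human-in-the-loop detection frameworks.

Insight: Although general-purpose models cannot yet perform deepfake detection fully autonomously, they have the potential to strengthen human expert review workflows, especially by providing interpretability and contextual analysis.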

Abstract: The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model’s classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.


[36] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation cs.CV | cs.AIPDF

Shuyang Li, Shuang Wang, Zhuangzhuang Sun, Jing Xiao

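TL;DR: PSLG-SAM is a two-stage framework that tackles the Reference Remote Sensing Image Segmentation (RRSIS) task through coarse localization followed by fine segmentation, substantially reducing annotation requirements and outperforming current state-of-the-art models.

Details

Motivation: Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads, but they require dense annotations and struggle to interpret complex scenes.

Result: Validation on two datasets shows that PSLG-SAM significantly outperforms existing state-of-the-art models.

Insight: Decomposing the task avoids interference from complex scenes, and the second stage can be training-free, which significantly lowers the annotation burden.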

Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, which has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges like dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text-described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be training-free, significantly reducing the annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows for focusing on specific region segmentation, avoiding interference from complex scenes. We further contribute a high-quality, multi-category manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art models. Our code will be made publicly available.


[37] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft cs.CVPDF

Jin Huang, Mingqiang Wei, Zikuan Li, Hangyu Qu, Wei Zhao

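TL;DR: J-DDL is an intelligent system for detecting and localizing surface damage on fighter aircraft. It combines 2D images with 3D point cloud data and uses an optimized YOLO architecture with a new loss function to improve detection accuracy and efficiency.

Details

Motivation: Traditional manual inspection of fighter aircraft surfaces suffers from limited scalability, efficiency, and consistency; J-DDL aims to address these challenges through automation.

Result: Experiments confirm that J-DDL is effective and can accurately detect and localize surface damage on fighter aircraft.

Insight: Combining 2D and 3D data shows promise for damage detection on complex surfaces; lightweight design and attention mechanisms are key to improving model efficiency.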

Abstract: Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.
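The 2D-to-3D mapping step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how J-DDL performs the mapping, so everything here is an assumption: a simple pinhole camera model, points already expressed in camera coordinates, no occlusion handling, and hypothetical function names (`project_point`, `points_in_box`). It only shows the general idea of selecting the 3D points whose projection falls inside a detected 2D box.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a 3D point (camera coordinates, z pointing forward) to pixels.

    Assumes an ideal pinhole camera with focal lengths fx, fy and
    principal point (cx, cy); lens distortion is ignored.
    """
    x, y, z = point
    if z <= 0:
        return None  # point is behind the camera and cannot be seen
    return (fx * x / z + cx, fy * y / z + cy)


def points_in_box(points, box, fx, fy, cx, cy):
    """Return indices of 3D points whose projection lies inside a 2D box.

    `box` is (u_min, v_min, u_max, v_max) in pixels, for example the
    output of a 2D damage detector. Occlusion is not checked.
    """
    u_min, v_min, u_max, v_max = box
    hits = []
    for i, p in enumerate(points):
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        if u_min <= u <= u_max and v_min <= v <= v_max:
            hits.append(i)
    return hits


# Hypothetical usage: four points and one detected box.
cloud = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0), (0.1, 0.1, 1.0)]
damaged = points_in_box(cloud, (40, 40, 70, 70), fx=100, fy=100, cx=50, cy=50)
```

A real system would also need camera-to-scanner calibration and a visibility test, since points on the far side of the aircraft can project into the same box.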


[38] CogStream: Context-guided Streaming Video Question Answering cs.CV | cs.AIPDF

Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin

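TL;DR: This paper proposes a new task, CogStream, focused on context-guided reasoning over streaming video, and presents a densely annotated dataset and a baseline model, CogReasoner, to address the heavy computational burden and the distraction from irrelevant context in existing methods.

Details

Motivation: Existing Video Large Language Models (Vid-LLMs) rely on all available historical context when processing streaming video, which imposes a heavy computational burden and can distract the model with irrelevant information. This paper aims to solve that problem.

Result: Experiments show that the method is both efficient and effective on streaming video reasoning tasks.

Insight: Relying only on relevant context can significantly improve model performance and reduce computational overhead, offering a new direction for streaming video reasoning.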

Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.
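The idea of retrieving only the most relevant historical context can be sketched as a top-k similarity search. This is not CogReasoner's actual retrieval mechanism, which the abstract does not specify; the embedding vectors, the cosine-similarity scoring, and the names `cosine` and `retrieve_context` are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (na * nb)


def retrieve_context(query_vec, history, k=2):
    """Keep the k history entries most similar to the current question.

    `history` is a list of (text, embedding) pairs in chronological order.
    The selected entries are returned in their original order so that the
    temporal structure of the dialogue is preserved.
    """
    ranked = sorted(
        range(len(history)),
        key=lambda i: cosine(query_vec, history[i][1]),
        reverse=True,
    )
    keep = sorted(ranked[:k])
    return [history[i][0] for i in keep]


# Hypothetical usage with toy 2D embeddings.
past = [("qa about the chef", [1.0, 0.0]),
        ("qa about the weather", [0.0, 1.0]),
        ("qa about the kitchen", [0.9, 0.1])]
selected = retrieve_context([1.0, 0.0], past, k=2)
```

Passing only `selected` to the model, instead of the full history, is what reduces both the compute cost and the chance of distraction described above.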


[39] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations cs.CV | cs.AI | cs.ETPDF

Yutong Zhou, Masahiro Ryo

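TL;DR: The paper proposes an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights about habitat preference, combining species recognition, global occurrence retrieval, and related steps, and generating human-readable explanations.

Details

Motivation: Existing ecological workflows are fragmented and unfriendly to non-experts; a more intuitive way to explain why species prefer particular habitats is needed.

Result: The framework's potential is demonstrated on a bee species and a flower species, producing statistically grounded, human-readable explanations.

Insight: A multimodal AI assistant combined with ecological modeling practice gives non-experts a new way to understand species habitat preferences.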

Abstract: Explaining why the species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of the multimodal AI assistant backed up by a recommended ecological modeling practice for describing species habitat in human-understandable language.


[40] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics cs.CV | cs.AIPDF

Imanol Solano, Julian Fierrez, Aythami Morales, Alejandro Peña, Ruben Tolosana

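TL;DR: The paper proposes the Comprehensive Equity Index (CEI) and its automated version CEI^A for detecting tail-distribution bias in face recognition systems that traditional metrics struggle to capture; experiments show it outperforms existing methods.

Details

Motivation: Existing metrics struggle to detect bias in the distribution tails of high-performance face recognition systems, especially subtle unfairness.

Result: Experiments show that CEI outperforms existing methods, detects subtle biases, and applies to a range of distribution-comparison problems.

Insight: Focusing on distribution tails captures bias more sensitively, and the automated version simplifies practical use.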

Abstract: Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI’s superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.


[41] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers cs.CV | cs.AIPDF

Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang

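TL;DR: The paper proposes DreamActor-H1, a Diffusion Transformer (DiT)-based framework for generating high-fidelity human-product demonstration videos that preserves both human identity and product details, with precise motion guidance from a 3D body mesh and product bounding boxes.

Details

Motivation: In e-commerce, high-quality human-product demonstration videos are essential for product presentation, but existing methods struggle to preserve the identity details of both the person and the product and lack an understanding of human-product spatial relationships.

Result: Trained on a hybrid dataset, the method outperforms existing techniques, preserving identity better and generating more natural motion.

Insight: Combining 3D spatial information with structured text encoding can significantly improve the realism and consistency of human-product interaction videos.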

Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.


[42] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration cs.CVPDF

Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, Yulan He

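TL;DR: The PLACE framework improves medical visual representation learning through pathological-level cross-modal alignment and correlation exploration, requires no additional human annotation, and achieves state-of-the-art performance on multiple downstream tasks.

Details

Motivation: The work addresses data scarcity in the medical domain while handling the complex semantics and pathological relations in lengthy reports; existing methods mostly neglect pathological-level consistency.

Result: The framework reaches state-of-the-art results on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.

Insight: Pathological-level alignment effectively improves medical visual representation learning, and an unsupervised correlation exploration task further enriches fine-grained detail.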

Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.


[43] DanceChat: Large Language Model-Guided Music-to-Dance Generation cs.CV | cs.MM | cs.SD | eess.ASPDF

Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh

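TL;DR: DanceChat is a large language model (LLM)-based music-to-dance generation method that combines textual motion guidance with music features to generate diverse dance movements consistent with the musical style.

Details

Motivation: Music-to-dance generation faces a semantic gap and a one-to-many mapping problem; existing methods learn dance motion from music alone, which leads to limited diversity and weak style alignment.

Result: On the AIST++ dataset and in human evaluations, DanceChat outperforms existing methods both qualitatively and quantitatively.

Insight: Textual guidance from an LLM explicitly improves the diversity of generated dance and its alignment with musical style, compensating for the limits of purely music-driven generation.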

Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model's ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.


[44] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning cs.CVPDF

Chun-Mei Feng, Kai Yu, Xinxing Xu, Salman Khan, Rick Siow Mong Goh

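TL;DR: The paper proposes T2I-PAL, a new method that uses text-to-image generation models to reduce the modality gap and combines prompt tuning with adapter learning, significantly improving multi-label image recognition performance.

Details

Motivation: Pre-trained vision-language models such as CLIP can be fine-tuned efficiently thanks to image-text contrastive learning, but the modality gap limits their performance, especially in multi-label image recognition.

Result: On benchmarks including MS-COCO, VOC2007, and NUS-WIDE, T2I-PAL improves recognition performance by 3.47% on average over the best existing methods.

Insight: Generating images is an effective way to bridge the modality gap, and combining local feature enhancement with efficient fine-tuning strategies can further improve performance.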

Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP allow texts to be leveraged directly as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.


[45] Rethinking Random Masking in Self Distillation on ViT cs.CVPDF

Jihyeon Seong, Hyunkyung Han

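TL;DR: The paper examines the role of random masking in ViT self-distillation frameworks such as DINO and proposes an asymmetric masking strategy that applies masking only to the student's global view while keeping the local views and the teacher's view intact, improving attention robustness and downstream performance.

Details

Motivation: In ViT self-distillation frameworks, random masking may inadvertently destroy critical semantic information, so the paper explores a more principled masking strategy that balances training efficiency and semantic preservation.

Result: Experiments on the mini-ImageNet dataset show that the strategy yields noticeably more fine-grained and robust attention and further improves downstream task performance.

Insight: The effectiveness of random masking in self-distillation depends on how it is applied; an asymmetric masking design can improve robustness and performance without destroying critical information.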

Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student’s global view, while preserving the student’s local views and the teacher’s global view in their original, unmasked forms. This design leverages DINO’s multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
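The asymmetric setup described in this abstract can be sketched in a few lines. This is a toy illustration rather than the authors' code: patches are stand-in values, masked patches are replaced by `None`, and the mask ratio, seed, and function names (`random_mask`, `build_views`) are assumptions. A real implementation might instead drop masked tokens or substitute a learnable mask embedding.

```python
import random


def random_mask(patches, ratio, rng, mask_token=None):
    """Replace a random fraction `ratio` of patches with `mask_token`."""
    n_mask = int(len(patches) * ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    return [mask_token if i in masked_idx else p
            for i, p in enumerate(patches)]


def build_views(global_view, local_views, ratio=0.5, seed=0):
    """Build the inputs for one self-distillation step.

    Only the student's global view is masked. The teacher's global view
    and the student's local views are passed through unchanged, so the
    teacher still provides clean supervision.
    """
    rng = random.Random(seed)
    return {
        "teacher_global": list(global_view),
        "student_global": random_mask(global_view, ratio, rng),
        "student_local": [list(v) for v in local_views],
    }


# Hypothetical usage: a 16-patch global view and one 4-patch local view.
g = list(range(16))
views = build_views(g, [[1, 2, 3, 4]], ratio=0.25)
```

The design choice worth noting is that the asymmetry lives entirely in the data pipeline; the student and teacher networks themselves need no modification.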


[46] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement cs.CVPDF

Jin Huang, Honghua Chen, Mingqiang Wei

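TL;DR: The paper proposes HEA-MM, a hierarchical error assessment framework for evaluating the quality of aircraft CAD models within a manufacturing-and-measurement platform.

Details

Motivation: High quality is critical for aviation equipment, and manufacturing errors must be assessed with high precision to ensure performance, stability, and reliability.

Result: Experiments on a variety of aircraft CAD models demonstrate the method's effectiveness and its ability to assess manufacturing errors accurately.

Insight: Hierarchical error assessment captures manufacturing errors more comprehensively; the optimization-based primitive refinement and the two-stage feature detection method significantly improve the precision and efficiency of the analysis.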

Abstract: The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on the surface patches underlying the point clouds. We propose a novel optimization-based primitive refinement method to obtain a set of meaningful patches of point clouds. Two basic operations, splitting and merging, are introduced to refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate it, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
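The feature-level step, fitting circles to detected edge points and measuring their deviation, can be illustrated with a plain algebraic least-squares circle fit. This is a swapped-in textbook technique (often called the Kasa fit), not the paper's tensor-voting or hypothesize-and-clusterize procedure, and it assumes the edge points of a single hole have already been isolated and projected to a 2D plane. The function names are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(A, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("degenerate input (for example, collinear points)")
    out = []
    for col in range(3):
        M = [list(row) for row in A]
        for r in range(3):
            M[r][col] = rhs[r]
        out.append(det3(M) / d)
    return out


def fit_circle(points):
    """Least-squares fit of x^2 + y^2 + D*x + E*y + F = 0 to 2D points.

    Returns (center_x, center_y, radius).
    """
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxz = sum(x * (x * x + y * y) for x, y in points)
    syz = sum(y * (x * x + y * y) for x, y in points)
    sz = sum(x * x + y * y for x, y in points)
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, len(points)]]
    D, E, F = solve3(A, [-sxz, -syz, -sz])
    cx, cy = -D / 2.0, -E / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - F)


def radial_errors(points, circle):
    """Absolute distance of each point from the fitted circle."""
    cx, cy, r = circle
    return [abs(math.hypot(x - cx, y - cy) - r) for x, y in points]


# Hypothetical usage: four edge points on a hole of radius 3 at (2, -1).
edge = [(5.0, -1.0), (2.0, 2.0), (-1.0, -1.0), (2.0, -4.0)]
hole = fit_circle(edge)
errors = radial_errors(edge, hole)
```

Comparing the fitted center and radius against the nominal values in the CAD model then gives a per-hole manufacturing error.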


[47] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection cs.CVPDF

Xinyuan Liu, Hang Xu, Yike Ma, Yucheng Zhang, Feng Dai

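TL;DR: The paper proposes SSP (Semantic-decoupled Spatial Partition), a unified framework that addresses sample assignment and instance confusion in point-supervised oriented object detection. By combining rule-driven prior injection with data-driven label purification, SSP significantly improves detection performance.

Details

Motivation: The rapid growth of remote sensing imagery calls for efficient oriented object detection, but manual annotation of high-density scenes is costly. Existing point-supervised methods perform poorly because of inadequate sample assignment and instance confusion, and need improvement.

Result: On DOTA-v1.0 and other datasets, SSP reaches 45.78% mAP, surpassing the state-of-the-art method PointOBB-v2 by 4.10%; combined with ORCNN and ReDet, it reaches 47.86% and 48.50% mAP, respectively.

Insight: By decoupling semantic and spatial information, SSP effectively mitigates the sample assignment and instance extraction problems under point supervision, providing an efficient, low-cost solution for remote sensing object detection in high-density scenes.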

Abstract: Recent advances in remote sensing technology have driven rapid growth in imagery and in oriented object detection, yet progress is hindered by labor-intensive annotation of high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.


[48] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model cs.CVPDF

Eshan Ramesh, Nishio Takayuki

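TL;DR: LatentCSI is a new method for generating high-resolution images from WiFi channel state information (CSI) that uses a pretrained latent diffusion model (LDM) and bypasses the complexity of conventional pixel-space generation.

Details

Motivation: Existing methods rely on complex, computationally intensive techniques such as GANs; LatentCSI uses a lightweight network to map CSI into the LDM's latent space, significantly improving efficiency and generation quality.

Result: Validated on two datasets, LatentCSI outperforms baselines in computational efficiency and perceptual quality and additionally supports text-guided control.

Insight: By using a pretrained LDM and a lightweight mapping, LatentCSI balances efficiency and quality in image generation while supporting user-controlled generation.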

Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM’s denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM’s pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.


[49] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling cs.CVPDF

Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu

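TL;DR: MSTAR is a multi-query scene text retrieval method with attention recycling that requires no bounding box annotations. It improves text representation and alignment through progressive vision embedding and a multi-instance matching module, and significantly outperforms existing methods on the new MQTR benchmark dataset.

Details

Motivation: Existing scene text retrieval methods depend on costly bounding box annotations and struggle to unify different query types to meet diverse needs.

Result: MAP improves by 6.4% on the Total-Text dataset and by 8.5% on average on the MQTR dataset.

Insight: Box-free methods show promise for text retrieval, and multi-query support and vision-language alignment are directions for future development.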

Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.


[50] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models cs.CVPDF

Konstantinos Vilouras, Ilias Stogiannidis, Junyu Yan, Alison Q. O’Neil, Sotirios A. Tsaftaris

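TL;DR: The paper proposes an anatomy-grounded, weakly supervised prompt tuning method that improves the multi-modal alignment of chest X-ray latent diffusion models so they adapt better to downstream tasks.

Details

Motivation: Existing text-to-image latent diffusion models have weak multi-modal alignment in medical imaging (for example, chest X-rays), mainly because of limited data. This paper aims to address that.

Result: The method sets a new state of the art on the MS-CXR dataset and performs robustly on VinDr-CXR data.

Insight: Weakly supervised prompt tuning can effectively improve multi-modal alignment in data-limited medical imaging and support downstream tasks.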

Abstract: Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.


[51] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models cs.CV | cs.AIPDF

Francisco Caetano, Christiaan Viviers, Peter H. N. De With, Fons van der Sommen

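TL;DR: The paper proposes Symmetrical Flow Matching (SymmFlow), a new method that unifies image generation, semantic segmentation, and classification in a single model, achieving efficient sampling and preserving semantic structure through a symmetric learning objective and a new training objective.

Details

Motivation: Existing Flow Matching methods excel at high-fidelity generative modeling but do not handle generation, segmentation, and classification in a unified way. SymmFlow aims to solve this through a symmetric learning objective and bi-directional consistency.

Result: SymmFlow reaches FID 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps, and performs well on segmentation and classification.

Insight: SymmFlow's unified framework and symmetric learning objective demonstrate the potential of multi-task modeling, while efficient sampling makes practical use easier.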

Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.


[52] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning cs.CVPDF

Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen

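TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that optimizes pre-trained video diffusion models through automatic feedback, needs no human annotation or massive compute, and improves generation quality across multiple dimensions with only 4 GPU-hours.

Details

Motivation: Fine-tuning existing video generation models usually depends on human annotation and large-scale compute, which limits practicality. GigaVideo-1 aims to improve video generation quality efficiently through an automatic feedback mechanism.

Result: On the VBench-2.0 benchmark, GigaVideo-1 improves performance by about 4% on average across 17 dimensions using only 4 GPU-hours.

Insight: Automatic feedback and efficient data use can significantly reduce dependence on human labor and large-scale compute, offering a new route to practical video generation models.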

Abstract: Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
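The reward-guided sample weighting mentioned in this abstract can be sketched as a weighted average of per-sample losses. The abstract does not give the weighting rule or the form of the realism constraint, so the softmax over rewards, the `temperature` parameter, and the function names below are all assumptions chosen only to show the general shape of such a scheme.

```python
import math


def reward_weights(rewards, temperature=1.0):
    """Turn per-sample rewards into weights that sum to 1 (softmax).

    Higher-reward samples receive larger weights. Subtracting the maximum
    reward before exponentiating keeps the computation numerically stable.
    """
    m = max(rewards)
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_loss(losses, rewards, temperature=1.0):
    """Combine per-sample losses using reward-derived weights."""
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical usage: rewards would come from a vision-language model
# scoring each generated training sample.
sample_losses = [1.0, 3.0]
sample_rewards = [0.0, 0.0]
loss = weighted_loss(sample_losses, sample_rewards)
```

With equal rewards the result reduces to the ordinary mean loss; lowering `temperature` concentrates the weight on the highest-reward samples.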


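[53] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis cs.CV | cs.AI

Marzieh Oghbaie, Teresa Araújoa, Hrvoje Bogunović

TL;DR: The paper proposes PiPViT, a prototype-based model built on a vision transformer that learns long-range dependencies among image patches to produce interpretable prototypes for retinal image analysis.

Details

Motivation: The visualizations produced by existing prototype methods on medical images are not consistent with human-understandable biomarkers, and the prototypes are too fine-grained to convey the extent of a lesion. PiPViT addresses these issues by learning interpretable prototypes that provide a transparent basis for decisions.

Result: On four retinal OCT datasets, PiPViT performs on par with state-of-the-art methods while providing more clinically meaningful explanations. Quantitative evaluation confirms that the learned prototypes are semantically and clinically relevant.

Insight: By leveraging the multi-scale processing capability of ViT, PiPViT produces interpretable prototypes better suited to medical needs and offers transparent support for clinical decisions.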

Abstract: Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent using only image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. GitHub page: https://github.com/marziehoghbaie/PiPViT


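[54] Enhancing Deepfake Detection using SE Block Attention with CNN cs.CV

Subhram Dasgupta, Janelle Mason, Xiaohong Yuan, Olusola Odeyomi, Kaushik Roy

TL;DR: The paper proposes a lightweight Deepfake detection model that combines a CNN with SE attention. Dynamic channel-wise feature recalibration improves detection efficiency, and the model reaches high accuracy with low computational resources.

Details

Motivation: The high realism of Deepfake technology threatens information security and authenticity, and traditional detection methods struggle to cope. Existing detection models are typically large and computationally expensive.

Result: The model achieves 94.14% classification accuracy and an AUC-ROC score of 0.985 on the Style GAN dataset.

Insight: SE attention performs well in lightweight models, providing an efficient and scalable solution for Deepfake detection.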

Abstract: In the digital age, Deepfakes present a formidable challenge by using advanced artificial intelligence to create highly convincing manipulated content, undermining information authenticity and security. These sophisticated fabrications surpass traditional detection methods in complexity and realism. To address this issue, we aim to harness cutting-edge deep learning methodologies to engineer an innovative deepfake detection model. However, most of the models designed for deepfake detection are large, causing heavy storage and memory consumption. In this research, we propose a lightweight convolutional neural network (CNN) with squeeze-and-excitation (SE) block attention for Deepfake detection. The SE block module is designed to perform dynamic channel-wise feature recalibration. The SE block allows the network to emphasize informative features and suppress less useful ones, which leads to a more efficient and effective learning module. This module is integrated with a simple sequential model to perform Deepfake detection. The model is smaller in size and achieves accuracy competitive with existing models for deepfake detection tasks. The model achieved an overall classification accuracy of 94.14% and an AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse Fake Face Dataset. Our proposed approach presents a promising avenue for combating the Deepfake challenge with minimal computational resources, developing efficient and scalable solutions for digital content verification.
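
The channel-wise recalibration performed by an SE block can be sketched in plain Python as three steps: squeeze (global average pooling), excitation (a small bottleneck MLP ending in a sigmoid), and scale. This is the generic squeeze-and-excitation pattern rather than the authors' implementation; the weights, the reduction ratio, and the omission of biases are assumptions chosen to keep the example small.

```python
import math

def se_block(feature_maps, w1, w2):
    """Apply squeeze-and-excitation to a list of C channels, each an H x W grid.

    w1: (C/r) x C weights of the reduction layer; w2: C x (C/r) weights of the
    expansion layer. Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: bottleneck MLP with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates

# Two 2x2 channels with hypothetical weights (reduction ratio r = 2).
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[0.5, 0.5]]      # 1 x 2 reduction weights
w2 = [[1.0], [-1.0]]   # 2 x 1 expansion weights
y, gates = se_block(x, w1, w2)
print(gates)
```

With these toy weights the first channel receives the larger gate, so its activations are emphasized relative to the second channel.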


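[55] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework cs.CV | cs.CR

Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun

TL;DR: The paper proposes Unsourced Adversarial CAPTCHA (UAC), a new framework that generates high-quality adversarial examples from text prompts, increases CAPTCHA diversity, and supports both targeted and untargeted attacks.

Details

Motivation: With the rapid development of deep learning, traditional CAPTCHA schemes are increasingly vulnerable to DNN-based automated attacks. Existing adversarial attack methods rely on features of the original image, which distorts the output and leaves scenarios without an initial image unsupported.

Result: Experiments show that BP-UAC achieves high attack success rates across a variety of systems, and that the generated CAPTCHAs are indistinguishable to both humans and DNNs.

Insight: Text prompts can effectively guide adversarial example generation, and the bi-path optimization strategy markedly improves attack efficiency in black-box scenarios.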

Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on original image characteristics, resulting in distortions that hinder human interpretation and limit applicability in scenarios lacking initial input images. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel framework generating high-fidelity adversarial examples guided by attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC enhances CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for superior image quality. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-UAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show BP-UAC achieves high attack success rates across diverse systems, generating natural CAPTCHAs indistinguishable to humans and DNNs.


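[56] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery cs.CV

Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral

TL;DR: The paper proposes a multi-task architecture that combines age regression with binary classification to detect minors in unconstrained images. An improved loss function and sampling method raise performance, and the model's generalization is validated on a new benchmark.

Details

Motivation: Minors are under-represented in public data, and distribution shift causes underage detection models to perform poorly on unconstrained images.

Result: The model lowers the root-mean-square error on the ASORES-39k test set from 5.733 to 5.656. On the ASWIFT-20k test set it sustains recall near 0.99 while raising the F2 score from 0.742 to 0.833.

Insight: Multi-task learning and improved loss design effectively improve performance and generalization in underage detection.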

Abstract: Accurate automatic screening of minors in unconstrained images demands models that are robust to distribution shift and resilient to the under-representation of children in publicly available data. To overcome these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing on the legally critical age range. To address the severe class imbalance, we introduce an α-reweighted focal-style loss and age-balanced mini-batch sampling, which equalizes twelve age bins during stochastic optimization. Further improvement is achieved with an age gap that removes edge cases from the loss. Moreover, we set a rigorous evaluation by proposing the Overall Under-Age Benchmark, with 303k cleaned training images and 110k test images, defining both the “ASORES-39k” restricted overall test, which removes the noisiest domains, and the age estimation wild shifts test “ASWIFT-20k” of 20k images, stressing extreme pose (>45°), expression, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model “F” lowers the root-mean-square error on the ASORES-39k restricted test from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from an F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline, demonstrating strong generalization under distribution shift. For the under-12 and under-15 tasks, the boosts in F2 are from 0.666 to 0.955 and from 0.689 to 0.916, respectively.
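
To make the idea of a focal-style loss concrete, the sketch below implements the widely used alpha-weighted binary focal loss for a single prediction. The abstract does not give the exact form of the paper's α-reweighted variant, so the formula, the default `alpha` and `gamma` values, and the function name are assumptions for illustration only.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Alpha-weighted binary focal loss for one prediction.

    p: predicted probability of the positive class; y: label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

print(focal_loss(0.9, 1))  # easy positive: small loss
print(focal_loss(0.1, 1))  # hard positive: much larger loss
```

Setting `gamma` to 0 recovers a class-weighted cross-entropy, which shows that the focusing term is what shifts attention toward hard examples.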


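[57] Continual Hyperbolic Learning of Instances and Classes cs.CV | cs.AI | cs.LG

Melika Ayoughi, Mina Ghadimi Atigh, Mohammad Mahdi Derakhshani, Cees G. M. Snoek, Pascal Mettes

TL;DR: The paper proposes HyperCLIC, a continual learning method covering both instances and classes. It models hierarchical relations in hyperbolic space, embeds them continually through hyperbolic classification and distillation objectives, and is validated on the EgoObjects dataset.

Details

Motivation: Real-world applications such as robotics and self-driving cars require models to learn instances and classes continually at the same time, which traditional methods do not address. The authors observe that instances and classes naturally form a hierarchy and therefore model this relationship in hyperbolic space.

Result: Experiments on EgoObjects show that HyperCLIC operates effectively at multiple levels of granularity and improves hierarchical generalization.

Insight: Hyperbolic space is well suited to modeling hierarchical data, offering a new approach to modeling hierarchical relations in continual learning.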

Abstract: Continual learning has traditionally focused on classifying either instances or classes, but real-world applications, such as robotics and self-driving cars, require models to handle both simultaneously. To mirror real-life scenarios, we introduce the task of continual learning of instances and classes, at the same time. This task challenges models to adapt to multiple levels of granularity over time, which requires balancing fine-grained instance recognition with coarse-grained class generalization. In this paper, we identify that classes and instances naturally form a hierarchical structure. To model these hierarchical relationships, we propose HyperCLIC, a continual learning algorithm that leverages hyperbolic space, which is uniquely suited for hierarchical data due to its ability to represent tree-like structures with low distortion and compact embeddings. Our framework incorporates hyperbolic classification and distillation objectives, enabling the continual embedding of hierarchical relations. To evaluate performance across multiple granularities, we introduce continual hierarchical metrics. We validate our approach on EgoObjects, the only dataset that captures the complexity of hierarchical object recognition in dynamic real-world environments. Empirical results show that HyperCLIC operates effectively at multiple granularities with improved hierarchical generalization.
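
The claim that hyperbolic space represents tree-like structures with low distortion can be illustrated with the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Whether HyperCLIC uses this particular model is an assumption; the points below are hypothetical embeddings chosen by hand.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Valid only for points strictly inside the unit ball (norm < 1).
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

root = [0.0, 0.0]      # hypothetical class-level node near the origin
child_a = [0.9, 0.0]   # hypothetical instance-level nodes near the boundary
child_b = [0.0, 0.9]
print(poincare_distance(root, child_a))
print(poincare_distance(child_a, child_b))
```

Points near the boundary are far from each other but comparatively close to the origin, so the distance between two leaves is close to the length of the path through their shared parent, as in a tree.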


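[58] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement cs.CV

Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu

TL;DR: The paper proposes the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, a generative framework dedicated to post-processing refinement for camouflaged object detection (COD). An uncertainty-guided masking mechanism lets UMBD target regions with poor segmentation quality while preserving correctly segmented parts. Experiments show notable gains on multiple COD benchmarks.

Details

Motivation: COD is challenging because of the subtle visual differences between targets and backgrounds, and existing methods leave room for improvement in refinement. The paper aims to fill this gap with a generative refinement framework.

Result: Experiments show that UMBD achieves average gains of 5.5% in MAE and 3.2% in weighted F-measure across multiple COD benchmarks, with modest computational overhead.

Insight: Generative methods can be used for COD post-processing refinement, and uncertainty-guided masking improves segmentation quality while keeping the design lightweight.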

Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.
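
The uncertainty-guided masking idea can be sketched as a single selective resampling step: pixels whose uncertainty exceeds a threshold are redrawn from a Bernoulli distribution, and all other pixels keep their original label. This is a heavily simplified, hypothetical stand-in for the full diffusion process; the flat pixel lists, the `threshold` value, and the function name are assumptions, and the probabilities would come from a refinement network in practice.

```python
import random

def masked_bernoulli_refine(coarse_mask, refined_prob, uncertainty,
                            threshold=0.5, seed=0):
    """Resample only uncertain pixels; keep confident pixels unchanged.

    coarse_mask: binary predictions from an existing COD model.
    refined_prob: per-pixel foreground probabilities from a refinement model.
    uncertainty: per-pixel uncertainty scores in [0, 1].
    """
    rng = random.Random(seed)
    output = []
    for c, p, u in zip(coarse_mask, refined_prob, uncertainty):
        if u > threshold:
            # Uncertain pixel: draw a Bernoulli sample with probability p.
            output.append(1 if rng.random() < p else 0)
        else:
            # Confident pixel: preserve the original segmentation.
            output.append(c)
    return output

coarse = [1, 0, 1, 0]
prob = [0.0, 1.0, 0.0, 1.0]
unc = [0.9, 0.9, 0.1, 0.1]
print(masked_bernoulli_refine(coarse, prob, unc))
```

Only the first two pixels are resampled here; the last two are copied from the coarse mask regardless of the refinement probabilities.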


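[59] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain cs.CV

Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu

TL;DR: IQE-CLIP is a CLIP-based framework for zero-/few-shot anomaly detection that combines textual and instance-aware visual information to generate more effective anomaly-indicating embeddings. It is tailored to the medical domain and achieves SOTA performance.

Details

Motivation: Existing CLIP-based anomaly detection methods rely on prior knowledge of categories and scenario-specific text prompts, yet they struggle to separate normal from anomalous instances in the joint embedding space, and the medical domain remains little explored.

Result: Experiments on six medical datasets show that IQE-CLIP achieves SOTA performance in both zero-shot and few-shot settings.

Insight: Fusing textual and instance-aware visual information separates anomalies more effectively and significantly outperforms existing methods in the medical domain.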

Abstract: Recent advances in vision-language models, such as CLIP, have significantly improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based methods assume prior knowledge of categories and rely on carefully designed prompts tailored to specific scenarios. While these text prompts capture semantic information in the textual space, they often fail to distinguish normal and anomalous instances in the joint embedding space. Moreover, most ZFSAD approaches focus on industrial domains, with limited exploration in medical tasks. To address these limitations, we propose IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query embeddings integrating both textual and instance-aware visual information serve as more effective indicators of anomalies. Specifically, we introduce class-based and learnable prompting tokens to better adapt CLIP to the medical setting. Furthermore, we design an instance-aware query module that extracts region-level contextual information from both modalities, enabling the generation of anomaly-sensitive embeddings. Extensive experiments on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Code and data are available at https://github.com/hongh0/IQE-CLIP/.


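[60] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework cs.CV

SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen

TL;DR: PosterCraft is a unified framework for generating high-quality posters. Multi-stage optimization integrates text rendering seamlessly with artistic content, and the framework significantly outperforms open-source baselines.

Details

Motivation: Existing poster generation techniques usually use modular pipelines and fixed layouts, making it hard to harmonize artistic content with text. PosterCraft aims to solve this with a generation framework that allows more freedom.

Result: Experiments show that PosterCraft significantly outperforms open-source baselines in text rendering accuracy, layout coherence, and overall aesthetics, approaching commercial SOTA systems.

Insight: A unified framework with multi-stage optimization raises both the freedom and the quality of poster generation, while a fully automated data-construction pipeline supports training without complex architectural modifications.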

Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated on multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: https://ephemeral182.github.io/PosterCraft


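[61] SlotPi: Physics-informed Object-centric Reasoning Models cs.CV | cs.AI | cs.LG

Jian Li, Wan Han, Ning Lin, Yu-Liang Zhan, Ruizhi Chengze

TL;DR: SlotPi is a model that combines physical knowledge with object-centric reasoning. It addresses challenges in dynamic simulation through Hamiltonian principles and a spatio-temporal prediction module, and performs well across a range of tasks and datasets.

Details

Motivation: Existing object-centric dynamic simulation methods make insufficient use of physical knowledge and lack validation across diverse scenarios. Humans acquire physical knowledge by observing the world and use it for dynamic reasoning; SlotPi aims to fill this gap.

Result: The model performs well on prediction and VQA tasks on benchmark and fluid datasets, confirming its strong adaptability.

Insight: Integrating physical knowledge and validating across scenarios are critical for dynamic simulation, and SlotPi lays a foundation for developing more advanced world models.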

Abstract: Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical insights by observing the world and apply this knowledge to accurately reason about various dynamic scenarios; 2) the validation of model adaptability across diverse scenarios. Real-world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot-based physics-informed object-centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Our experiments highlight the model’s strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model’s capabilities. The model’s robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.
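
As background for the Hamiltonian principles mentioned above, the sketch below integrates Hamilton's equations for a one-dimensional harmonic oscillator with a symplectic Euler step and checks that the total energy stays nearly constant. It is a generic textbook example, not SlotPi's physical module; the oscillator, the step size, and the function names are assumptions for illustration.

```python
def symplectic_euler(q, p, dt, steps, mass=1.0, stiffness=1.0):
    """Integrate dq/dt = dH/dp, dp/dt = -dH/dq for H = p^2/(2m) + k q^2/2."""
    for _ in range(steps):
        p = p - dt * stiffness * q   # dp/dt = -dH/dq = -k q
        q = q + dt * p / mass        # dq/dt =  dH/dp = p / m (uses updated p)
    return q, p

def energy(q, p, mass=1.0, stiffness=1.0):
    """Total energy (the Hamiltonian) of the oscillator."""
    return p * p / (2.0 * mass) + stiffness * q * q / 2.0

q0, p0 = 1.0, 0.0
q1, p1 = symplectic_euler(q0, p0, dt=0.01, steps=10_000)
print(energy(q0, p0), energy(q1, p1))
```

Building this kind of conservation structure into a learned dynamics model is one way physical knowledge can constrain long-horizon predictions.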


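[62] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning cs.CV

Ignacio Bugueno-Cordova, Javier Ruiz-del-Solar, Rodrigo Verschae

TL;DR: The paper proposes a robot navigation controller that combines event cameras with reinforcement learning for real-time human-centered navigation and obstacle avoidance, addressing the latency and motion blur of conventional image-based controllers.

Details

Motivation: Conventional image-based controllers operate at fixed frame rates and suffer from latency, so they struggle to meet real-time navigation needs. The asynchronous nature of event cameras offers a new solution.

Result: Robust navigation, pedestrian following, and obstacle avoidance are achieved in simulated environments, demonstrating the method's effectiveness.

Insight: The asynchronous nature of event cameras opens a new approach to real-time navigation, and combining reinforcement learning with imitation learning significantly improves sample efficiency.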

Abstract: This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.


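[63] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization cs.CV

Mario Barbara, Alaa Maalouf

TL;DR: The paper proposes a zero-shot video summarization method driven by natural-language queries. By combining pretrained video-language models with large language models, it delivers user-controllable video summaries without domain-specific training data.

Details

Motivation: With the explosive growth of video data, there is an urgent need for video summarization tools that require no domain-specific training data and respond flexibly to user intent expressed in natural language. Existing methods either rely on datasets, which limits generalization, or cannot incorporate user intent.

Result: The method surpasses all unsupervised methods on SumMe and TVSum and is on par with supervised methods; it also performs well on the QFVS benchmark. The authors also release the VidSum-Reason dataset as a new benchmark.

Insight: With well-designed prompts and a score propagation mechanism, pretrained multimodal models already provide a strong foundation for universal, text-queryable video summarization, demonstrating the potential of zero-shot approaches.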

Abstract: The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without using any training data, beating all unsupervised methods and matching supervised ones. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally, (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data while the competing methods require supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.


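[64] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing cs.CV | eess.IV | eess.SP

Hang Zhang, Xiang Chen, Renjiu Hu, Rongguang Wang, Jinwei Zhang

TL;DR: The paper proposes SmoothProper, a module that addresses sparse features and large displacements in unsupervised deformable image registration. It enforces smoothness through an integrated optimization layer and is validated on a retinal vessel dataset.

Details

Motivation: Traditional unsupervised deformable image registration methods perform poorly with sparse features and large displacements, so a module that can enforce smoothness and structural consistency is needed.

Result: On a retinal vessel dataset with 2912x2912 images, registration error is reduced to 1.88 pixels.

Insight: Enforcing smoothness constraints within the network's forward pass effectively addresses the challenges of unsupervised registration and requires no regularizer hyperparameter tuning.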

Abstract: Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network’s forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.


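[65] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders cs.CV

Hui Yang, Wei Sun, Jian Liu, Jin Zheng, Jian Xiao

TL;DR: The paper proposes HOMAE, an occlusion-aware 3D hand-object pose estimation method based on masked autoencoders. A target-focused masking strategy and multi-scale feature fusion significantly improve pose estimation under occlusion.

Details

Motivation: When estimating hand-object pose from monocular RGB images, existing methods face severe occlusion and do not sufficiently explore global structural perception and reasoning, which limits their effectiveness under occlusion.

Result: HOMAE achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks.

Insight: Structured occlusion design and representation fusion effectively improve pose estimation under occlusion.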

Abstract: Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.


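[66] VideoDeepResearch: Long Video Understanding With Agentic Tool Using cs.CV | cs.AI | cs.CL

Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou

TL;DR: VideoDeepResearch is a new agentic framework that relies only on a text-only large reasoning model (LRM) and a multi-modal toolkit. It tackles long video understanding (LVU) by selectively accessing video content and significantly outperforms existing methods.

Details

Motivation: Current multi-modal large language models (MLLMs) struggle with LVU tasks because of task complexity and context window limits. The paper questions the assumption that LVU requires powerful MLLMs with extended context windows.

Result: On the MLVU, LVBench, and LongVideoBench benchmarks, it surpasses the previous state of the art by 9.6%, 6.6%, and 3.9%, respectively.

Insight: Agentic systems with modular tools and selective content access can effectively handle the complexity and context window limits of LVU tasks, pointing to a new direction for future research.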

Abstract: Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraints. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.


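[67] Post-Training Quantization for Video Matting cs.CV | cs.AI

Tianrui Zhu, Houyuan Chen, Ruihao Gong, Michele Magno, Haotong Qin

TL;DR: The paper proposes a post-training quantization (PTQ) framework designed for video matting. A two-stage strategy, statistically driven global affine calibration, and an optical-flow assistance component significantly improve accuracy and temporal coherence under low-bit quantization.

Details

Motivation: Video matting is crucial in film production and virtual reality, but deploying its computationally intensive models on resource-constrained devices is challenging. Quantization is a key technique for model compression and acceleration, yet its application to video matting is still at an early stage.

Result: Under 4-bit quantization, PTQ4VM performs close to the full-precision model while saving 8x in FLOPs, reaching state-of-the-art results.

Insight: Combining local optimization with global calibration and introducing spatio-temporal priors can significantly improve video matting models under low-bit quantization.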

Abstract: Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model’s ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8x FLOP savings.
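
For readers unfamiliar with low-bit quantization, the sketch below shows uniform quantize-dequantize at a given bit-width, followed by a simple affine correction fitted by least squares against full-precision reference values. The least-squares fit is only an assumed stand-in for the idea of a global affine calibration; the abstract does not describe how PTQ4VM's GAC is actually computed, and all function names are hypothetical.

```python
def quantize_dequantize(values, num_bits=4):
    """Uniform asymmetric quantization followed by dequantization."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        q = round((v - lo) / scale)        # integer code
        q = min(max(q, 0), levels)         # clamp to the representable range
        out.append(q * scale + lo)         # map back to real values
    return out

def affine_calibrate(quantized, reference):
    """Least-squares gamma, beta so that gamma * quantized + beta ~ reference."""
    n = len(quantized)
    mean_q = sum(quantized) / n
    mean_r = sum(reference) / n
    var_q = sum((x - mean_q) ** 2 for x in quantized)
    cov = sum((x - mean_q) * (r - mean_r) for x, r in zip(quantized, reference))
    gamma = cov / var_q if var_q > 0 else 1.0
    beta = mean_r - gamma * mean_q
    return gamma, beta

weights = [-1.0, -0.3, 0.1, 0.45, 1.0]
print(quantize_dequantize(weights, num_bits=4))
print(affine_calibrate([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))
```

With 4 bits there are only 16 representable levels, so the rounding error per value is bounded by half of one quantization step; the affine correction then rescales and shifts outputs to better match the reference statistics.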


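[68] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos cs.CV | cs.AI | cs.MM

Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang

TL;DR: VRBench is a benchmark for evaluating multi-step reasoning in long narrative videos. It fills the gap left by existing evaluations that overlook temporal reasoning and procedural validity, and includes extensive annotations and a comprehensive evaluation method.

Details

Motivation: Existing video benchmarks fall short on multi-step reasoning and temporal modeling, so a more comprehensive evaluation tool is needed to advance large models' capabilities in this area.

Result: Evaluations of 12 LLMs and 16 VLMs show that VRBench effectively assesses multi-step reasoning ability and yields valuable analysis.

Insight: Temporal and multi-step reasoning are key challenges in long video understanding, and VRBench provides an important tool and reference for further research.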

Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models’ multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.


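[69] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation cs.CV

Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi

TL;DR: CreatiPoster is a framework for generating editable, multi-layer graphic designs. It accepts natural-language instructions or user-provided assets and produces editable designs that meet professional aesthetic standards.

Details

Motivation: Graphic design matters to both businesses and individuals, but high-quality, editable designs take considerable time and skill. Existing AI tools fall short of user needs, especially in incorporating assets and preserving editability.

Result: CreatiPoster surpasses open-source and commercial systems on graphic design generation and supports a range of applications such as canvas editing and multilingual adaptation.

Insight: Separating foreground and background generation and incorporating user input makes high-quality, editable designs possible, advancing the democratization of AI-assisted graphic design.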

Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on these rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: https://github.com/graphic-design-ai/creatiposter


[70] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement cs.CV | cs.AIPDF

Guimeng Liu, Milad Abdollahzadeh, Ngai-Man Cheung

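TL;DR: This paper proposes AIR, a zero-shot generative model adaptation method that uses iterative refinement to address the misalignment between image and text offsets in the CLIP embedding space, significantly improving generated image quality.

Details

Motivation: Existing zero-shot generative model adaptation methods assume that image and text offsets are fully aligned in the CLIP embedding space, which degrades generated image quality. The paper analyzes this offset misalignment and proposes an improved method.

Result: Across 26 experimental setups, AIR achieves state-of-the-art performance in qualitative, quantitative, and user-study evaluations.

Insight: Offset misalignment correlates with concept distance, and iterative refinement can effectively address it, improving the adaptability of generative models.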

Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches is the directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets between these two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in quality degradation in generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user-study results in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. Additional experiments are in Supp.
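The directional loss mentioned in this abstract can be illustrated with a minimal sketch. It assumes the embeddings are plain vectors (in practice they would come from a CLIP image and text encoder), and the function names are hypothetical. It shows only the generic offset-alignment objective, not AIR's iterative refinement.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """Return 1 - cos(image offset, text offset).

    The loss is 0 when the two offsets point in the same direction
    and grows as they become misaligned.
    """
    image_offset = [t - s for s, t in zip(img_src, img_tgt)]
    text_offset = [t - s for s, t in zip(txt_src, txt_tgt)]
    return 1.0 - cosine_similarity(image_offset, text_offset)

# Toy 3-d embeddings, made up for illustration (not real CLIP outputs).
aligned = directional_loss([0, 0, 0], [1, 0, 0], [0, 1, 0], [2, 1, 0])
orthogonal = directional_loss([0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0])
```

Minimizing this loss forces the image offset to follow the text offset exactly, which is the complete-alignment assumption the abstract identifies as a limitation.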


[71] M4V: Multi-Modal Mamba for Text-to-Video Generation cs.CV | cs.AI | cs.LGPDF

Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian

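TL;DR: M4V is a Mamba-based multi-modal text-to-video generation framework. Its multi-modal diffusion Mamba (MM-DiM) block integrates multi-modal information and models space and time efficiently, significantly reducing compute cost while improving video quality.

Details

Motivation: Text-to-video generation must model a complex spatiotemporal space, and the quadratic complexity of conventional Transformers limits practical use. Mamba, a linear-time sequence-modeling alternative, is more efficient, but its plain design is hard to apply directly to multi-modal video generation.

Result: When generating 768×1280 video, M4V uses 45% fewer FLOPs than the attention-based alternative and demonstrates high-quality generation on text-to-video benchmarks.

Insight: The work shows Mamba's potential for multi-modal video generation, cutting compute cost through its block design, and proposes a reward learning strategy to mitigate visual degradation in long-sequence generation.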

Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V’s ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.


[72] VINCIE: Unlocking In-context Image Editing from Video cs.CV | cs.AI | cs.CL | cs.LG | cs.MMPDF

Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin

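TL;DR: This paper proposes VINCIE, a method that learns in-context image editing directly from video. Using a block-causal diffusion transformer trained on multiple proxy tasks, it achieves strong image editing capability and leading results on multi-turn editing benchmarks.

Details

Motivation: Existing in-context image editing methods depend on task-specific pipelines and expert models, limiting scalability and flexibility. The paper explores whether in-context image editing can be learned directly from video, sidestepping the data-annotation bottleneck.

Result: Achieves state-of-the-art results on multiple benchmarks and performs well on tasks such as multi-concept composition and story generation.

Insight: Learning in-context editing directly from video is feasible, and video data provides rich multimodal information that supports diverse editing applications.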

Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.


[73] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning cs.CV | cs.CLPDF

Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue

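TL;DR: This paper introduces a new task, knowledge image generation, and the MMMG benchmark for evaluating the reasoning capability of image generation models. MMMG comprises an expert-validated dataset spanning multiple disciplines and educational levels, and the paper proposes MMMG-Score as the evaluation metric. Experiments show that current models reason poorly; the paper also presents FLUX-Reason, an open baseline model.

Details

Motivation: Knowledge images are central to human learning and civilization, yet the performance of existing text-to-image models on such tasks has not been adequately evaluated. The authors aim to fill this gap with the MMMG benchmark.

Result: Evaluation of 16 state-of-the-art models shows limited reasoning ability: GPT-4o reaches an MMMG-Score of only 50.20, and the proposed FLUX-Reason scores 34.45.

Insight: Current text-to-image models still fall well short in reasoning for knowledge image generation; multimodal reasoning and knowledge fusion are the key challenges.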

Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning, a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits (low entity fidelity, weak relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark’s difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.


[74] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs cs.CV | cs.AIPDF

Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang

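TL;DR: This paper proposes CDPruner, a new visual token pruning method that maximizes conditional diversity to address redundant tokens in MLLMs, significantly reducing compute cost while maintaining high accuracy.

Details

Motivation: In MLLMs, visual tokens typically far outnumber text tokens, making inference expensive. Existing methods (attention-based or similarity-based pruning) either retain many duplicate tokens or overlook instruction relevance, leading to suboptimal performance.

Result: Validated on multiple MLLMs, CDPruner substantially reduces computation (on LLaVA, 95% fewer FLOPs and 78% lower latency) while retaining 94% of the original accuracy, outperforming existing methods.

Insight: Maximizing conditional diversity makes the retained tokens more representative while keeping them closely aligned with the user instruction, offering a new approach to efficient MLLM inference.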

Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
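To make the DPP-based selection concrete, here is a minimal greedy sketch. It assumes a kernel of the form `relevance[i] * similarity(i, j) * relevance[j]`, which is the standard quality-diversity decomposition of a DPP kernel; whether CDPruner uses exactly this form is not stated in the abstract. All names and the toy token features are hypothetical.

```python
import math

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_dpp_select(tokens, relevance, keep):
    """Greedily pick `keep` indices that maximize the kernel sub-determinant."""
    n = len(tokens)
    # Assumed kernel: instruction relevance scales pairwise token similarity.
    kernel = [[relevance[i] * cosine(tokens[i], tokens[j]) * relevance[j]
               for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(keep):
        best, best_det = None, float("-inf")
        for cand in range(n):
            if cand in selected:
                continue
            idx = selected + [cand]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = cand, d
        selected.append(best)
    return selected

# Tokens 0 and 1 are near-duplicates; token 2 is distinct but less relevant.
tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = [1.0, 0.9, 0.8]
kept = greedy_dpp_select(tokens, relevance, keep=2)
```

In this toy case the near-duplicate token 1 is dropped even though its relevance is higher than that of token 2, which is the behavior the abstract attributes to maximizing conditional diversity. Recomputing determinants from scratch is chosen for readability; a practical implementation would use incremental updates.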


[75] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos cs.CVPDF

Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou

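TL;DR: GenWorld is a large-scale, high-quality real-world simulation dataset for AI-generated video detection. Its diversity across prompt modalities and generators supports learning more generalizable forensic features. The study also finds that existing methods fail on high-quality videos, and therefore proposes SpannDetector, a simple model based on multi-view consistency, which experiments show to perform better.

Details

Motivation: As video generation advances, the credibility of real-world information is under threat, and reliable AI-generated video detection is urgently needed. Existing datasets, however, are of insufficient quality and lack real-world simulation scenarios, which holds back detector development.

Result: Experiments show that SpannDetector performs strongly on GenWorld, particularly on high-quality AI-generated videos.

Insight: Physical plausibility (e.g., multi-view consistency) can serve as a key feature for detecting AI-generated video, pointing to a new direction for explainable detection methods.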

Abstract: The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld


[76] Fine-Grained Perturbation Guidance via Attention Head Selection cs.CV | cs.AI | cs.LGPDF

Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min

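TL;DR: This paper studies fine-grained perturbation guidance in diffusion models via attention head selection. It proposes the HeadHunter framework and the SoftPAG technique, enabling fine-grained control over generation quality and visual attributes.

Details

Motivation: Existing attention perturbation methods lack a systematic way to decide where to perturb, especially in Diffusion Transformer architectures, where quality-relevant computation is spread across many layers.

Result: The method is validated on models such as Stable Diffusion 3 and FLUX.1, improving generation quality and enabling style-specific guidance.

Insight: Attention heads in diffusion models specialize in distinct visual concepts, so targeted perturbation gives fine-grained control over the output.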

Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
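The interpolation that the abstract attributes to SoftPAG can be written as A' = (1 - s) A + s I for a selected head's attention map A. The sketch below illustrates only that formula on a plain nested list; the function name and the `strength` parameter are assumptions, and the head selection and guidance steps are not shown.

```python
def soft_perturb_attention(attn, strength):
    """Blend a square attention map with the identity matrix.

    attn: list of rows, each a list of attention weights summing to 1.
    strength: 0.0 leaves the map unchanged, 1.0 yields the identity.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    n = len(attn)
    return [
        [(1.0 - strength) * attn[i][j] + strength * (1.0 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

# Toy 2x2 attention map for a single head.
attn = [[0.5, 0.5], [0.2, 0.8]]
half = soft_perturb_attention(attn, 0.5)
```

Because both the attention map and the identity matrix have rows that sum to 1, every interpolated row also sums to 1, so the result remains a valid attention distribution at any strength.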


[77] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model cs.CVPDF

Junqi You, Chieh Hubert Lin, Weijie Lyu, Zhengbo Zhang, Ming-Hsuan Yang

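TL;DR: InstaInpaint is a reference-based feed-forward framework that completes 3D scene inpainting within 0.4 seconds, a 1000x speed-up, while maintaining state-of-the-art performance.

Details

Motivation: Existing 3D scene inpainting methods rely on time-consuming optimization and cannot meet real-time interaction requirements, so a fast, efficient solution is needed.

Result: Achieves state-of-the-art performance on standard benchmarks with a 1000x speed-up, and applies flexibly to object insertion and multi-region inpainting.

Insight: Key design choices (such as masked fine-tuning and the model architecture) markedly improve generalization, textural consistency, and geometric correctness.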

Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on the large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000x speed-up from prior methods while maintaining a state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting. More video results are available at our project page: https://dhmbb2.github.io/InstaInpaint_page/.


cs.CL [Back]

[78] TaskCraft: Automated Generation of Agentic Tasks cs.CLPDF

Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li

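TL;DR: TaskCraft is an automated workflow for generating multi-tool, verifiable agentic tasks, addressing the lack of tool interaction in existing instruction data and the high cost of human annotation.

Details

Motivation: Existing agentic task data lacks tool interaction and depends on human annotation, which limits scalability, so an automated task generation method is needed.

Result: Experiments show that the generated tasks improve prompt optimization in the workflow and enhance supervised fine-tuning of foundation models.

Insight: Automated task generation can relieve the data-annotation bottleneck and opens a new research direction for agent tuning and evaluation.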

Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.


[79] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information cs.CLPDF

Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Dhaval Patel

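TL;DR: This paper presents Chat-of-Thought, a multi-agent system: a collaborative LLM-agent framework that streamlines the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets.

Details

Motivation: Generating FMEA documents for industrial equipment monitoring is complex and time-consuming, requiring multi-role collaboration and iterative refinement, and traditional approaches are inefficient.

Result: Demonstrates the system's ability to generate FMEA documents efficiently for industrial equipment monitoring and shows the benefits of multi-agent collaboration.

Insight: Multi-agent collaboration can significantly improve generation efficiency and quality for complex domain-specific tasks; dynamic routing and iterative refinement are key.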

Abstract: This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.


[80] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering cs.CLPDF

Caijun Jia, Nan Xu, Jingxuan Wei, Qingli Wang, Lei Wang

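TL;DR: ChartReasoner is a two-stage, code-driven framework for long-chain reasoning in chart question answering that preserves the chart's original details and enables precise reasoning. It comprises a high-fidelity chart-to-code model, a data synthesis pipeline that automatically generates reasoning trajectories, and a training strategy combining supervised fine-tuning and reinforcement learning. In experiments it performs strongly, approaching proprietary models such as GPT-4o.

Details

Motivation: Large language models excel at long-chain reasoning, but extending this to visual reasoning tasks such as chart question answering remains challenging. Existing methods convert images to text, which tends to lose the structural and semantic detail of the visual information.

Result: Performs strongly on four public benchmarks, approaching proprietary systems such as GPT-4o while using fewer parameters.

Insight: A code-driven representation preserves the structure and semantic detail of visual information more precisely, suggesting a new approach to visual reasoning tasks.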

Abstract: Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.


[81] Unsupervised Elicitation of Language Models cs.CL | cs.AIPDF

Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks

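TL;DR: The paper proposes Internal Coherence Maximization (ICM), an unsupervised algorithm that fine-tunes pretrained language models on their own generated labels without external supervision, and outperforms human supervision on several tasks.

Details

Motivation: The current post-training paradigm relies on humans to specify desired behavior, but for models with superhuman capabilities, high-quality human supervision is hard to obtain. An unsupervised method for eliciting model capabilities is therefore needed.

Result: On several tasks, ICM matches training on golden supervision and outperforms crowdsourced human supervision. On tasks where models are superhuman, ICM is significantly better than training on human labels. In addition, the reward model and assistant trained without supervision outperform their human-supervised counterparts.

Insight: Unsupervised methods have a clear advantage once model capabilities exceed human ones, and ICM offers an efficient alternative for training frontier language models.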

Abstract: To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs’ capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.


[82] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective cs.CL | cs.AIPDF

Yi Wang, Max Kreminski

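TL;DR: This paper examines LLMs' story generation ability, focusing on narrative planning. It finds that GPT-4-tier LLMs can generate causally sound stories at small scales but still struggle with character intentionality and dramatic conflict.

Details

Motivation: Story generation is an important LLM application, but understanding of LLMs' ability to generate high-quality stories is limited, partly because of weak automatic evaluation methods and the cost and subjectivity of human evaluation.

Result: Experiments show that LLMs can generate causally sound stories at small scales, but character intentionality and dramatic conflict remain difficult and call for LLMs trained with reinforcement learning for complex reasoning.

Insight: The study reveals both the potential and the challenges of LLMs in narrative planning, with implications for applications in game environments.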

Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding of LLMs’ ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which have been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs’ story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights into the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.


[83] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval cs.CLPDF

Shubhashis Roy Dipta, Francis Ferraro

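TL;DR: Q2E is a zero-shot multilingual text-to-video retrieval method that decomposes queries and draws on the latent knowledge of LLMs and VLMs, significantly improving video retrieval for complex real-world events.

Details

Motivation: Existing methods lack a bridge between understanding complex queries and video content, especially in multilingual and multimodal settings. Q2E aims to address this by decomposing queries and fusing multimodal knowledge.

Result: Q2E outperforms existing methods across multiple datasets and retrieval metrics, and integrating audio information further improves text-to-video retrieval.

Insight: 1. Query decomposition can significantly improve retrieval precision for complex events. 2. Integrating multimodal information, especially audio, is crucial for video retrieval.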

Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.


[84] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games cs.CL | cs.AIPDF

Prakamya Mishra, Jiang Liu, Jialian Wu, Xiaodong Yu, Zicheng Liu

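TL;DR: The paper introduces TTT-Bench, a benchmark that evaluates the basic strategic, spatial, and logical reasoning of large reasoning models (LRMs) through simple Tic-Tac-Toe-style games. It finds that LRMs, despite excelling at complex math problems, perform poorly on these simple tasks.

Details

Motivation: Most current reasoning benchmarks focus on STEM, while LRMs' reasoning in broader task domains remains underexplored. The paper aims to fill this gap with a simple yet challenging benchmark.

Result: LRMs score lower on the simple reasoning games than on hard math problems and do worse still on long-term strategic reasoning, while larger models achieve higher scores with shorter reasoning traces.

Insight: LRMs may rely on data rather than genuine reasoning on complex tasks, and their failures on simple tasks expose limits in strategic and spatial reasoning.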

Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board’s spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench compared to MATH 500 and AIME 2024, respectively, with larger models achieving higher performance using shorter reasoning traces, while most of the models struggle in long-term strategic reasoning situations on simple and new TTT-Bench tasks.
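A programmatically verifiable game problem of the kind described here can be illustrated with standard 3x3 Tic-Tac-Toe. This is only a sketch under that assumption: the abstract does not specify the rules of the four TTT-Bench games, and the board encoding and function name are hypothetical.

```python
# Board cells are indexed 0..8 row by row; "." marks an empty cell.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winning_moves(board, player):
    """Return the empty cells where `player` completes a line immediately."""
    moves = []
    for cell in range(9):
        if board[cell] != ".":
            continue
        trial = board[:cell] + player + board[cell + 1:]
        if any(all(trial[i] == player for i in line) for line in WIN_LINES):
            moves.append(cell)
    return moves

# X holds cells 0 and 1, O holds cells 3 and 4.
board = "XX." "OO." "..."
x_wins_at = winning_moves(board, "X")
o_wins_at = winning_moves(board, "O")
```

A model's answer to "where should X play to win?" can then be checked mechanically against `winning_moves(board, "X")`, which is what makes such problems verifiable without human grading.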


[85] Classifying Unreliable Narrators with Large Language Models cs.CLPDF

Anneliese Brei, Katharine Henry, Abhisheik Sharma, Shashank Srivastava, Snigdha Chaturvedi

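TL;DR: This paper proposes using large language models (LLMs) to identify unreliable narrators and releases the TUNa dataset; classification tasks on text from multiple domains show both the potential and the difficulty of the task for LLMs.

Details

Motivation: In texts such as literature and social media, narrators may unintentionally give inaccurate information, so identifying this unreliability automatically by computational means is valuable.

Result: The task proves very challenging, but LLMs show potential for it.

Insight: Drawing on literary theory, the work extends unreliable-narrator identification to real-world text, giving later research new ideas and resources.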

Abstract: Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.


[86] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages cs.CL | cs.AIPDF

Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi

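TL;DR: Flick is an efficient approach to few-label text classification in low-resource languages that improves model performance through pseudo-label refinement and multi-task learning.

Details

Motivation: Addresses few-label text classification in low-resource languages, where existing methods are vulnerable to noisy pseudo-labels and domain adaptation issues.

Result: Effectiveness and adaptability are demonstrated on 14 diverse datasets, including low-resource languages such as Arabic and Urdu.

Insight: Distilling high-confidence pseudo-labels from multiple clusters is key to improving few-label classification, especially in low-resource languages.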

Abstract: Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo-labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component that departs from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick’s efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.


[87] “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context cs.CL | cs.CYPDF

Chuck Arvin

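TL;DR: This paper studies how user-provided suggestions affect large language models (LLMs) in a simulated educational setting, where sycophancy poses particular risks. Across five LLMs, response quality varies markedly with how the query is framed: when the student mentions an incorrect answer, accuracy can drop by as much as 15 percentage points, and when the student mentions the correct answer, it rises by the same margin. The bias is stronger in smaller models.

Details

Motivation: Examine sycophancy in LLMs used in education, which may reinforce misunderstanding among less knowledgeable students and so undermine educational equity.

Result: Response quality varies substantially with how the student's answer is mentioned, with accuracy shifting by up to 15 percentage points; the effect is larger in smaller models (up to 30% for GPT-4.1-nano, versus 8% for GPT-4o).

Insight: The study exposes potential sycophancy in educational uses of LLMs, which may widen educational inequality, and underlines the need to understand and mitigate this bias.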

Abstract: This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.


[88] Code Execution as Grounded Supervision for LLM Reasoning cs.CL | cs.AIPDF

Dongwon Jung, Wenxuan Zhou, Muhao Chen

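TL;DR: The paper proposes generating high-quality chain-of-thought (CoT) supervision data from code execution, replacing human annotation and LLM-generated CoT, and reports improved LLM reasoning.

Details

Motivation: Existing ways of producing CoT supervision rely on human annotation, which is costly, or on LLM generation, which is error-prone. Code execution is deterministic and yields verifiable reasoning steps, which makes it a good source of high-quality supervision.

Result: Experiments show that the method improves LLM reasoning on tasks across multiple domains and reduces token length at inference time.

Insight: Code execution is a reliable source of high-quality supervision data; its determinism gives it a clear advantage over human-written or LLM-generated reasoning.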

Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural-language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
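
The idea of turning an execution trace into step-by-step reasoning can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `trace_changes` and `render_cot` and the toy function `sum_of_squares` are hypothetical, and the rendering into sentences is a simple template rather than the natural-language conversion the abstract describes.

```python
import sys

def trace_changes(func, *args):
    """Run func(*args); return its result and the ordered variable updates it made."""
    code = func.__code__
    snapshots = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace any other function
        if event in ("line", "return"):
            # Copy the locals as they are just before the line runs (or at return).
            snapshots.append(dict(frame.f_locals))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)

    # A variable update is any name whose value differs between consecutive snapshots.
    changes = []
    for before, after in zip(snapshots, snapshots[1:]):
        for name, value in after.items():
            if name not in before or before[name] != value:
                changes.append((name, value))
    return result, changes

def render_cot(changes):
    """Hypothetical template that turns variable updates into numbered steps."""
    return [f"Step {k}: {name} becomes {value!r}." for k, (name, value) in enumerate(changes, 1)]

def sum_of_squares(n):
    # Toy program whose execution we want to verbalize.
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total
```

Because every step comes from an actual run, the final answer and each intermediate value are correct by construction, which is the property the abstract relies on.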


[89] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning cs.CL | cs.IRPDF

Xiaohan Yu, Pu Jian, Chong Chen

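TL;DR: TableRAG is a retrieval-augmented generation framework designed for heterogeneous documents that mix text and tables. It addresses the limitations of existing methods on tabular data through an iterative four-step process and achieves state-of-the-art results on the new HeteQA benchmark.

Details

Motivation: Existing RAG methods handle heterogeneous documents (mixed text and tables) poorly: flattening tables and chunking break the tabular structure, causing information loss and weakening multi-hop reasoning.

Result: TableRAG outperforms existing baselines on both public datasets and the new HeteQA benchmark, setting a new state of the art.

Insight: TableRAG's results suggest that preserving table structure and introducing SQL operations are key to better reasoning over heterogeneous documents.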

Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
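
To make the "SQL programming and execution" step concrete, the sketch below keeps a table intact and answers a global aggregation query with SQL instead of flattened text chunks. It is an illustration under assumptions, not TableRAG's code: the function `run_sql_over_table`, the table name `t`, and the sample data are hypothetical, and in the real system the query would be written by an LLM.

```python
import sqlite3

def run_sql_over_table(columns, rows, query):
    """Load a table into in-memory SQLite as table `t` and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        column_defs = ", ".join(f'"{name}"' for name in columns)
        conn.execute(f"CREATE TABLE t ({column_defs})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()

# Hypothetical table extracted from a document, kept in its original row/column form.
columns = ["region", "year", "revenue"]
rows = [("EU", 2023, 10), ("EU", 2024, 15), ("US", 2023, 20), ("US", 2024, 5)]

# A global question such as "What is the total revenue per region?" maps to one query.
totals = run_sql_over_table(
    columns, rows, "SELECT region, SUM(revenue) FROM t GROUP BY region ORDER BY region"
)
```

An aggregation like this touches every row, which is exactly the case where chunked retrieval tends to miss part of the table.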


[90] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier cs.CL | cs.AI | cs.LGPDF

Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu

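TL;DR: PAG is a multi-turn self-correction method that unifies the policy and verifier roles within a single reinforcement learning framework. The model revises its answer only when it detects an error, which improves both the reasoning and the verification abilities of LLMs.

Details

Motivation: Existing LLM self-verification methods often depend on separate verifiers or multi-stage training pipelines, which limits scalability. PAG aims for more efficient self-correction through a unified reinforcement learning framework.

Result: PAG performs strongly across multiple reasoning benchmarks and outperforms existing methods both as a policy and as a verifier.

Insight: Unifying the policy and verifier roles improves LLM self-correction more efficiently, and selective revision cuts unnecessary revisions.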

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG’s dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
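
The verify-then-revise control flow described in the abstract can be sketched as a small inference loop. This is only an illustration of selective revision: the callables `generate`, `verify`, and `revise` are hypothetical stand-ins for one model prompted in different roles, the `max_turns` parameter is an assumption, and the multi-turn RL training is not shown.

```python
def verify_then_revise(question, generate, verify, revise, max_turns=2):
    """Answer a question, revising only when the verification step reports an error."""
    answer = generate(question)
    turns = []
    for _ in range(max_turns):
        ok = verify(question, answer)
        turns.append((answer, ok))
        if ok:
            break  # selective revision: a verified answer is never rewritten
        answer = revise(question, answer)
    return answer, turns

# Toy stand-ins: the "question" is a pair of integers to add.
def toy_generate(question):
    return question[0] * question[1]  # deliberately wrong first attempt

def toy_verify(question, answer):
    return answer == question[0] + question[1]

def toy_revise(question, answer):
    return question[0] + question[1]

final_answer, history = verify_then_revise((2, 3), toy_generate, toy_verify, toy_revise)
```

When the first attempt already passes verification, the loop exits after a single turn and `revise` is never called, which is the difference from approaches that always produce a second attempt.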


[91] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? cs.CL | cs.CVPDF

Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt

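TL;DR: This paper introduces the TempVS benchmark, which evaluates how well multimodal large language models (MLLMs) understand and reason about the temporal order of events in image sequences, and finds that current models fall far short of human performance.

Details

Motivation: Investigate whether MLLMs truly capture the temporal relations between events in image sequences, filling a gap left by existing benchmarks.

Result: Existing MLLMs perform poorly on TempVS, with a substantial gap to human capability.

Insight: Future work could focus on improving multimodal temporal reasoning, in particular the integration of the visual and linguistic modalities.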

Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.


[92] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty cs.CLPDF

Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo

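TL;DR: The paper proposes a dynamic length penalty that makes language model reasoning more efficient, shortening outputs on simple problems while preserving reasoning depth on complex ones.

Details

Motivation: Large language models reason well, but common techniques such as chain-of-thought prompting produce long outputs and increase computational latency. Existing shortening methods ignore problem complexity and give suboptimal results.

Result: On GSM8K, MATH500, and AIME2024, the method shortens outputs on the simpler tasks while preserving or improving accuracy, and improves accuracy on the harder task.

Insight: A dynamic length penalty balances reasoning efficiency against accuracy and suits tasks of varying complexity.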

Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem’s complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones for accuracy, thus improving the model’s overall performance. Specifically, we manage the model’s reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.


[93] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers cs.CLPDF

Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu

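TL;DR: The paper reframes table-text alignment as an explanation task in which models must identify the table cells essential for verifying a scientific claim, and supports this with a newly built dataset and experiments.

Details

Motivation: Conventional scientific claim verification predicts only the final label, which offers little interpretability and reveals little about the model's reasoning; a finer-grained, cell-level alignment analysis is needed.

Result: Experiments show that adding table alignment information improves claim verification, but most large language models fail to recover the human-annotated rationales.

Insight: LLM predictions may not stem from faithful reasoning; future work should attend to model interpretability and alignment ability.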

Abstract: Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model’s reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
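
One way to quantify whether a model "recovers human-aligned rationales" is set overlap between predicted and annotated cells. The sketch below computes precision, recall, and F1 over (row, column) pairs; it is an assumed scoring scheme for illustration, since the abstract does not state the paper's exact metric, and `cell_prf` is a hypothetical helper.

```python
def cell_prf(predicted_cells, gold_cells):
    """Precision, recall, and F1 between predicted and gold sets of (row, col) cells."""
    predicted, gold = set(predicted_cells), set(gold_cells)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical example: annotators marked two cells; the model highlighted four.
gold = [(1, 2), (3, 2)]
predicted = [(1, 2), (2, 2), (3, 2), (4, 2)]
precision, recall, f1 = cell_prf(predicted, gold)
```

Here the model finds both annotated cells but also highlights two irrelevant ones, so recall is perfect while precision is only one half. A correct label with low cell overlap is the pattern the abstract reads as unfaithful reasoning.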


[94] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs cs.CL | cs.AIPDF

Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li

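TL;DR: The paper proposes the RRP framework, which combines the semantic strengths of LLMs with the structural information in knowledge graphs to extract high-quality reasoning paths and improve LLM reasoning.

Details

Motivation: Existing knowledge-graph-enhanced LLMs struggle with complex questions, mainly because they do not make effective use of the relationships among facts or of logically consistent reasoning paths.

Result: RRP achieves state-of-the-art performance on two public datasets and can enhance the reasoning of various LLMs in a plug-and-play manner.

Insight: High-quality reasoning paths are as important as supplementary factual knowledge: they give LLMs more effective guidance and improve their ability to solve complex questions.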

Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is as important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses the following challenges: the complexity of graph structures and the existence of multiple generated paths, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.


[95] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors cs.CL | cs.AI | I.2.7PDF

Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal

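TL;DR: The paper presents a retrieval-augmented prompting system for judging whether an AI tutor correctly identifies mistakes in a student's mathematical reasoning, comparing several approaches and showing the effectiveness of LLMs.

Details

Motivation: Provide an efficient solution for assessing AI tutors, in particular for identifying mistakes in students' mathematical reasoning.

Result: The final system outperforms all baselines, showing the potential of LLMs for assessing pedagogical feedback.

Insight: Combining retrieval-augmented prompting with LLM reasoning is an effective approach to complex tasks in education.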

Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor’s response correctly identifies a mistake in a student’s mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.


[96] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models cs.CL | cs.AI | cs.LGPDF

Ye Yu, Yaoning Yu, Haohan Wang

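TL;DR: PREMISE is a prompt optimization framework that leaves model weights untouched. By combining trace-level diagnostics with gradient-inspired prompt optimization, it removes much of the redundant computation in mathematical reasoning, sharply reducing token usage and cost while preserving or improving accuracy.

Details

Motivation: Large reasoning models such as Claude 3.7 Sonnet and OpenAI o1 perform well on mathematical tasks, but their verbose reasoning inflates token usage and cost, which limits deployment in latency-sensitive or API-constrained settings.

Result: On GSM8K, SVAMP, and Math500, PREMISE matches or exceeds baseline accuracy (e.g., 96% → 96% with Claude) while reducing reasoning tokens by up to 87.5% and cutting cost by 69-82%.

Insight: Prompt optimization is an important route to efficient inference with large models: it improves efficiency without modifying model weights and can be applied directly to commercial LLMs.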

Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy ($96\% \rightarrow 96\%$ with Claude, $91\% \rightarrow 92\%$ with Gemini) while reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by $69$–$82\%$. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.


[97] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims cs.CL | cs.IRPDF

Priyanka Kargupta, Runchu Tian, Jiawei Han

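TL;DR: The paper proposes ClaimSpect, a framework that uses retrieval-augmented generation to analyze nuanced claims hierarchically, breaking them into verifiable aspects and sub-aspects and enriching these with perspectives found in a corpus, to give a fuller reading of scientific and political claims.

Details

Motivation: Many claims, particularly scientific and political ones, cannot be simply labeled "true" or "false" and call for finer-grained analysis.

Result: ClaimSpect's robustness and accuracy are validated on real-world claims, where it outperforms multiple baselines.

Insight: Hierarchical analysis combined with multiple perspectives gives a more complete understanding of nuanced claims and avoids the limits of binary classification.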

Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false” – as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.


[98] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs cs.CLPDF

Alberto Testoni, Iacer Calixto

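TL;DR: This paper presents a fine-grained evaluation of uncertainty and calibration for LLMs in clinical QA, comparing ten open-source LLMs across medical specialties and question types, and explores lightweight single-generation estimators.

Details

Motivation: In high-stakes domains such as clinical decision support, accurate and well-calibrated uncertainty estimates are essential, yet fine-grained evaluations across question and model types are lacking.

Result: Performance varies substantially across medical specialties and question types; the lightweight methods approach the performance of Semantic Entropy while requiring only one generation.

Insight: Model selection should take into account both the nature of the question and model-specific strengths.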

Abstract: Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.
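
As background on what "well-calibrated" means in practice, the sketch below computes expected calibration error (ECE), a widely used calibration metric. The abstract does not say which metrics the paper reports, so treating ECE as the measure is an assumption; the function name, the equal-width binning, and the sample numbers are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece

# Hypothetical answers to six multiple-choice questions.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, correct)
```

In this example the model claims 95% confidence on four answers but gets only three right, and claims 65% on two answers but gets one right, so its stated confidence overstates its accuracy in both bins.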


[99] Improving Named Entity Transcription with Contextual LLM-based Revision cs.CL | cs.AIPDF

Viet Anh Trinh, Xinlu He, Jacob Whitehill

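TL;DR: This paper proposes an LLM-based revision mechanism that uses the LLM's reasoning ability together with local context (e.g., lecture notes) to correct misrecognized named entities in ASR output, substantially lowering the word error rate (WER) on named entities.

Details

Motivation: Current ASR systems perform well on general speech but still have high error rates on named entities, which hurts downstream applications.

Result: On the NER-MIT-OpenCourseWare dataset, the method achieves up to a 30% relative reduction in named-entity WER.

Insight: Combining LLM reasoning with contextual information effectively improves the accuracy of named entities in ASR output.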

Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.
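
The headline number is a relative WER reduction, which is easy to confuse with an absolute one. The sketch below computes word error rate via word-level edit distance and then the relative reduction between two systems. It is a generic illustration: it scores whole sentences rather than only named entities, and the function names and example values are hypothetical, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            current_row.append(min(
                previous_row[j] + 1,                           # deletion
                current_row[j - 1] + 1,                        # insertion
                previous_row[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)

def relative_reduction(baseline_wer, revised_wer):
    """Fraction of the baseline error removed by the revised system."""
    return (baseline_wer - revised_wer) / baseline_wer

baseline = word_error_rate("gauss proved the theorem", "gos proved the theorem")
revised = word_error_rate("gauss proved the theorem", "gauss proved the theorem")
```

For instance, going from a WER of 0.20 to 0.14 is a 30% relative reduction even though the absolute drop is only 6 points.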


[100] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints cs.CLPDF

Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens

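TL;DR: The paper proposes LangEdit, a framework that uses null-space constraints to isolate language-specific knowledge updates, avoiding parameter interference and improving the efficiency and consistency of knowledge editing in multilingual LLMs.

Details

Motivation: Sequential knowledge updates across languages cause parameter interference in multilingual LLMs, degrading knowledge consistency and generalization. Existing alternatives, such as maintaining a separately edited model per language, are costly, so an efficient, unified solution is needed.

Result: Experiments across three model architectures, six languages, and four tasks show that LangEdit reduces interference and improves knowledge accuracy relative to existing methods.

Insight: Isolating parameter updates through mathematical constraints is an effective way to address interference in multilingual knowledge editing and offers a new approach to updating multilingual models efficiently.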

Abstract: Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previously updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.
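
The projection onto an orthogonal complement can be illustrated with plain vectors. The sketch below treats each earlier update as a flat vector, builds an orthonormal basis of their span with Gram-Schmidt, and removes that component from a new update. This is a simplified geometric illustration under those assumptions, not LangEdit's implementation, which operates on model weight matrices; all function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tolerance=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for vector in vectors:
        residual = list(vector)
        for q in basis:
            coefficient = dot(residual, q)
            residual = [r - coefficient * qi for r, qi in zip(residual, q)]
        norm = dot(residual, residual) ** 0.5
        if norm > tolerance:  # skip vectors already inside the span
            basis.append([r / norm for r in residual])
    return basis

def project_to_complement(update, previous_updates):
    """Remove from `update` every component lying in the span of previous updates."""
    projected = list(update)
    for q in orthonormal_basis(previous_updates):
        coefficient = dot(projected, q)
        projected = [p - coefficient * qi for p, qi in zip(projected, q)]
    return projected

# Two earlier updates span the x-y plane; only the z component of a new update survives.
previous = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = project_to_complement([2.0, 3.0, 4.0], previous)
```

The projected update has zero inner product with every earlier update, which is the sense in which later edits cannot overwrite the directions used by earlier ones.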


[101] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization cs.CLPDF

Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu

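TL;DR: ReCUT balances reasoning length and accuracy in LLMs through stepwise exploration and preference optimization, cutting reasoning length by roughly 30-50% while maintaining or improving accuracy.

Details

Motivation: CoT prompting methods often overthink and produce redundant reasoning traces, and existing remedies are limited by the quality of the generated training data and by overfitting.

Result: Across multiple math reasoning datasets and backbone models, ReCUT reduces reasoning length by roughly 30-50% while maintaining or improving accuracy.

Insight: By balancing reasoning length against accuracy, ReCUT offers a new approach to efficient LLM reasoning, particularly for tasks that need reasoning that is both concise and accurate.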

Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and they are prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs): one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via https://github.com/NEUIR/ReCUT.
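
The final merging step, interpolating the parameters of the two specialized models, can be sketched as a weighted average. The abstract does not give the interpolation scheme or coefficient, so the linear form, the `alpha` parameter, and the dictionary-of-lists model format below are assumptions made for illustration.

```python
def interpolate_params(accuracy_model, short_model, alpha):
    """Linear interpolation: alpha * accuracy_model + (1 - alpha) * short_model."""
    if accuracy_model.keys() != short_model.keys():
        raise ValueError("models must have the same parameter names")
    merged = {}
    for name, accuracy_values in accuracy_model.items():
        short_values = short_model[name]
        if len(accuracy_values) != len(short_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * a + (1 - alpha) * s for a, s in zip(accuracy_values, short_values)
        ]
    return merged

# Toy "models": each parameter tensor is flattened to a list of floats.
accuracy_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
short_model = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}
merged = interpolate_params(accuracy_model, short_model, alpha=0.5)
```

Setting `alpha` closer to 1 keeps the merged model nearer the accuracy-optimized weights, and closer to 0 nearer the length-optimized ones, so a single scalar trades accuracy against brevity in this simplified view.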


[102] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training cs.CL | cs.IRPDF

Alireza Salemi, Mukta Maddipatla, Hamed Zamani

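TL;DR: The paper presents mRAG, a multi-agent retrieval-augmented generation framework that optimizes collaboration among agents through self-training and performed well in the LiveRAG 2025 competition.

Details

Motivation: Conventional retrieval-augmented generation (RAG) has limited performance on complex tasks; the authors aim to improve it with multi-agent collaboration and a self-training paradigm.

Result: In the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines, and case studies demonstrate its effectiveness on complex tasks.

Insight: Multi-agent collaboration combined with self-training can substantially improve retrieval-augmented generation.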

Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework’s strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.


[103] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles cs.CL | cs.AI | cs.LGPDF

Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang

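TL;DR: The paper proposes SlowFast Sampling, a dynamic sampling strategy that alternates between exploratory and accelerated decoding stages to speed up inference in diffusion-based language models. Combined with dLLM-Cache to cut redundant computation, it achieves up to a 34.22x speedup.

Details

Motivation: Existing sampling strategies for diffusion-based language models behave statically, which limits efficiency and flexibility. SlowFast Sampling aims to improve speed and performance by adjusting decoding stages dynamically.

Result: The method achieves up to a 15.63x speedup on LLaDA and up to 34.22x when combined with caching, and it exceeds autoregressive baselines such as LLaMA3 8B in throughput.

Insight: A dynamic sampling strategy can unlock the parallel-generation potential of diffusion-based language models for fast, high-quality generation.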

Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.


[104] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models cs.CL | eess.ASPDF

Michele Gubian, Ioana Krehan, Oli Liu, James Kirby, Sharon Goldwater

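TL;DR: This paper examines how the self-supervised speech model wav2vec2, pretrained on different languages, encodes phonetic, tonal, and speaker information, and finds that the structure of its representations is largely independent of the pretraining language.

Details

Motivation: Almost all prior analyses have focused on English; this paper investigates how wav2vec2 models pretrained on different languages encode phones, lexical tones, and speaker information.

Result: The subspaces encoding phones, tones, and speakers are largely orthogonal, and layerwise probing accuracy follows similar patterns, with only a small advantage for matched-language phone and tone probes in the later layers.

Insight: The structure of the representations learned by self-supervised speech models appears largely unaffected by the language of the pretraining speech, suggesting a high degree of generality.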

Abstract: Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.


[105] Slimming Down LLMs Without Losing Their Minds cs.CL | cs.AIPDF

Qingda, Mai

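TL;DR: This paper studies how parameter-efficient fine-tuning methods (LoRA and QLoRA) affect large language model (LLM) performance, evaluating them on commonsense reasoning, mathematical reasoning, and multi-domain knowledge tasks.

Details

Motivation: As LLMs grow, the need for efficient fine-tuning becomes more pressing. The paper tests whether parameter-efficient methods such as LoRA and QLoRA can improve model performance on practical tasks while remaining computationally efficient.

Result: LoRA-based methods improve task-specific performance while remaining computationally efficient; performance depends strongly on how well the fine-tuning dataset aligns with the target task.

Insight: Under suitable conditions, parameter-efficient methods can stand in for full-parameter fine-tuning, giving developers with limited resources a workable option.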

Abstract: This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.


[106] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers cs.CL | cs.LGPDF

Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi

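TL;DR: This paper studies two behaviors that large language models (LLMs) show when learning new knowledge through fine-tuning: generalization and hallucination. The authors argue that both stem from a single mechanism, out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, whether or not those concepts are causally linked.

Details

Motivation: Fine-tuned LLMs exhibit a puzzling duality of generalization and hallucination whose underlying mechanism is poorly understood. The authors aim to identify the root cause and give model behavior a theoretical footing.

Result: Experiments confirm that OCR drives both generalization and hallucination in five prominent LLMs. Theoretical analysis shows that matrix factorization plays a crucial role in the OCR capability.

Insight: Generalization and hallucination are not distinct behaviors but the same mechanism operating under different conditions. The implicit bias of gradient descent is what lets the model learn associations sample-efficiently, whether the association is causal or merely spurious.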

Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.


[107] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP cs.CL | cs.AIPDF

Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard

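TL;DR: BioClinical ModernBERT is a domain-adapted encoder built on ModernBERT for biomedical and clinical NLP. Large-scale continued pretraining combined with long-context processing significantly improves task performance.

Details

Motivation: Encoder development for biomedical and clinical NLP has lagged behind decoder models, limiting domain adaptation. The authors propose an improved encoder to address this.

Result: BioClinical ModernBERT outperforms existing encoder models on multiple biomedical and clinical tasks.

Insight: Multi-source datasets and long-context processing are key to improving performance on biomedical and clinical NLP tasks.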

Abstract: Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.


[108] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning cs.CLPDF

Lan Zhang, Marco Valentino, Andre Freitas

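TL;DR: This paper proposes a systematic, automatic method based on large language models (LLMs) for evaluating autoformalization. It introduces multi-dimensional criteria (logical preservation, mathematical consistency, formal validity, and formal quality) to make the evaluation more transparent and reliable.

Details

Motivation: In advanced mathematics, evaluating autoformalization requires domain experts and is time-consuming. Existing LLM-as-a-judge methods usually apply coarse-grained, generic criteria that miss the nuances of complex mathematical reasoning.

Result: Experiments show that the EFG ensemble aligns more closely with human assessments than a coarse-grained model, with notably stronger correlation on formal quality.

Insight: When guided by well-defined atomic properties, LLM judges can provide scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.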

Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.


[109] Magistral cs.CLPDF

Mistral-AI, Abhinav Rastogi, Albert Q. Jiang, Andy Lo

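TL;DR: Magistral is Mistral's first reasoning model, trained with Mistral's own reinforcement learning (RL) pipeline. The paper demonstrates the potential of pure RL training and presents a simple method to force the model's reasoning language.

Details

Motivation: The goal is to explore the potential of pure RL for training large language models (LLMs) without relying on existing implementations or RL traces distilled from prior models, and to test what RL on text data alone can achieve.

Result: RL on text data alone maintains or improves multimodal understanding, instruction following, and function calling.

Insight: Pure RL training is promising: it can match or exceed existing capabilities without relying on RL traces from prior models.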

Abstract: We introduce Magistral, Mistral’s first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint’s capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.


[110] Dynamic Epistemic Friction in Dialogue cs.CLPDF

Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky

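TL;DR: This paper examines "epistemic friction", which is overlooked when aligning large language models (LLMs) with humans. It defines dynamic epistemic friction, models it within the framework of Dynamic Epistemic Logic, and applies the model to a real dialogue task to predict belief updates.

Details

Motivation: Despite progress in aligning LLMs with humans, neglecting epistemic friction (the resistance to updating beliefs in response to new information) limits model performance in real dialogue scenarios.

Result: The model effectively predicts belief updates in dialogue and can be made more sophisticated to accommodate the complexity of real-world dialogue.

Insight: Epistemic friction is a key factor in dialogue, and modeling it can help improve LLM performance in real interactions.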

Abstract: Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of “epistemic friction,” or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent’s current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.


[111] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training cs.CL | cs.AI | cs.LGPDF

Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu

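TL;DR: Domain2Vec vectorizes datasets in terms of meta-domains to find the optimal data mixture without training. It validates the Distribution Alignment Assumption, substantially reduces compute cost, and improves downstream task performance.

Details

Motivation: Existing methods for finding the optimal data mixture require large amounts of training compute. Domain2Vec offers a training-free alternative that improves efficiency through vectorization and the Distribution Alignment Assumption.

Result: On Pile-CC, it reaches the same validation loss with only 51.5% of the compute; under an equal compute budget, downstream performance improves by 2.83% on average.

Insight: The Distribution Alignment Assumption provides a theoretical basis for data mixture optimization, and vectorization improves efficiency and scalability.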

Abstract: We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
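
The distribution alignment idea can be sketched as a small search: given a domain vector for each training dataset and one for the validation set, pick the mixture weights whose blended vector lies closest to the validation vector. The abstract does not specify the distance measure or the optimizer, so the Euclidean distance, the grid search, and all names and vectors below are assumptions made for illustration only.

```python
from itertools import product


def mix(weights, domain_vectors):
    """Blend domain vectors with the given mixture weights."""
    dim = len(domain_vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, domain_vectors)) for i in range(dim)]


def l2_distance(p, q):
    """Euclidean distance (an assumed choice of alignment measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def best_mixture(domain_vectors, target, steps=10):
    """Grid-search mixture weights on the simplex in increments of 1/steps."""
    best_w, best_d = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(domain_vectors)):
        if sum(combo) != steps:
            continue  # keep only weights that sum to 1
        w = [c / steps for c in combo]
        d = l2_distance(mix(w, domain_vectors), target)
        if d < best_d:
            best_w, best_d = w, d
    return best_w, best_d


# Hypothetical domain vectors over three meta-domains.
train_vectors = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
validation_vector = [0.45, 0.45, 0.10]

weights, distance = best_mixture(train_vectors, validation_vector)
print(weights, distance)
```

No model is trained at any point in this search, which is the sense in which such a procedure is training-free.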


[112] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? cs.CLPDF

Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel

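TL;DR: This paper studies how well reasoning models identify and recover from unhelpful thoughts (such as irrelevant, misleading, or incorrect content). Models can identify unhelpful thoughts but recover from them poorly, and larger models in particular struggle more to recover from short unhelpful thoughts. The authors call for better self-reevaluation in reasoning models.

Details

Motivation: To examine the self-reflection ability of reasoning models, specifically how well they identify and recover from unhelpful thoughts, with the aim of improving reasoning and safety.

Result: Models identify most unhelpful thoughts but recover poorly; larger models do worse when short unhelpful thoughts are injected; the smallest models are the most resistant to harmful-response-triggering thoughts.

Insight: The self-reevaluation ability of current reasoning models is still inadequate, and larger models may do worse under such interference. Their "meta-cognitive" ability needs to improve to strengthen safety and robustness.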

Abstract: Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general “meta-cognitive” awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.


cs.SD [Back]

[113] PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs cs.SD | cs.AI | cs.CL | eess.ASPDF

Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson

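TL;DR: This paper systematically studies how information is transferred from audio encoders to large language models (LLMs) and improves audio-LLM performance through architectural changes.

Details

Motivation: Although application-focused audio-LLM development has progressed rapidly, the underlying information transfer mechanisms remain under-explored, in particular how audio encoders can efficiently pass rich semantic information to LLMs.

Result: Trained on a dataset of 5.6 million audio-text pairs, the final architecture achieves relative improvements of 10% to 60% over the baseline.

Insight: The LLM's initial textual context strengthens its ability to probe audio representations; the attention submodule alone is sufficient for probing; and an ensemble of multiple encoders provides richer audio information.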

Abstract: The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities (e.g., audio reasoning), have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM’s ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation of how architectural design choices affect that ability. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM’s initial layers have established textual context enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through the LLM layer’s attention submodule, without requiring propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM’s capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10% to 60% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/


cs.MM [Back]

[114] Multimodal Large Language Models: A Survey cs.MM | cs.AI | cs.CLPDF

Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker

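TL;DR: This survey reviews the development of multimodal large language models (MLLMs). It covers diverse output modalities such as text, images, and music, analyzes key techniques (e.g., SSL, MoE) and architectural innovations (e.g., transformers, diffusion models), and outlines future challenges.

Details

Motivation: With the rapid development of MLLMs, unifying architectures and achieving cross-modal capabilities have become key problems. The survey aims to systematically review MLLM progress, techniques, and challenges to guide future research.

Result: It summarizes successful cases and synergies in cross-modal generation and identifies the roles and limitations of key techniques.

Insight: Future MLLM development should focus on evaluation standardization, modular design, and stronger structured reasoning to achieve more general-purpose, adaptive, and interpretable multimodal systems.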

Abstract: Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Architectural innovations like transformers and diffusion models underpin this convergence, enabling cross-modal transfer and modular specialization. We highlight emerging patterns of synergy, and identify open challenges in evaluation, modularity, and structured reasoning. This survey offers a unified perspective on MLLM development and identifies critical paths toward more general-purpose, adaptive, and interpretable multimodal systems.


[115] EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis cs.MM | cs.AI | cs.CV | cs.ROPDF

Jianwu Fang, Lei-Lei Li, Zhedong Zheng, Hongkai Yu, Jianru Xue

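TL;DR: This paper proposes EQ-TAA, which uses a diffusion-based accident video generation model (AVD) to synthesize accident video clips and an equivariant triple loss to improve traffic accident anticipation, addressing background confounding and annotation difficulty.

Details

Motivation: Current traffic accident anticipation (TAA) methods are hampered by the need to annotate accident durations. In addition, the long-tailed, uncertain, and fast-evolving nature of traffic scenes makes the causal parts of accidents hard to identify and prone to data bias.

Result: Experiments show that AVD and EQ-TAA achieve performance competitive with state-of-the-art methods while addressing the background confounding issue.

Insight: Synthesizing causal video clips and using contrastive learning effectively reduce the impact of data bias and improve the robustness of traffic accident anticipation.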

Abstract: Traffic Accident Anticipation (TAA) in traffic scenes is a challenging problem for achieving zero fatalities in the future. Current approaches typically treat TAA as a supervised learning task needing the laborious annotation of accident occurrence duration. However, the inherent long-tailed, uncertain, and fast-evolving nature of traffic scenes has the problem that real causal parts of accidents are difficult to identify and are easily dominated by data bias, resulting in a background confounding issue. Thus, we propose an Attentive Video Diffusion (AVD) model that synthesizes additional accident video clips by generating the causal part in dashcam videos, i.e., from normal clips to accident clips. AVD aims to generate causal video frames based on accident or accident-free text prompts while preserving the style and content of frames for TAA after video generation. This approach can be trained using datasets collected from various driving scenes without any extra annotations. Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant triple loss for an anchor accident-free video clip, along with the generated pair of contrastive pseudo-normal and pseudo-accident clips. Extensive experiments have been conducted to evaluate the performance of AVD and EQ-TAA, and competitive performance compared to state-of-the-art methods has been obtained.
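
The abstract does not define the equivariant triple loss. As a loosely related reference point only, the sketch below shows a standard triplet margin loss over clip embeddings, under the assumption that the accident-free clip acts as the anchor, the pseudo-normal clip as the positive, and the pseudo-accident clip as the negative. The embeddings, the margin, and the function names are hypothetical, and this is not the paper's loss.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings (illustrative values only).
anchor = [0.0, 0.0]           # accident-free clip
pseudo_normal = [0.0, 1.0]    # distance 1 from the anchor
pseudo_accident = [3.0, 4.0]  # distance 5 from the anchor

well_separated = triplet_margin_loss(anchor, pseudo_normal, pseudo_accident)
swapped_roles = triplet_margin_loss(anchor, pseudo_accident, pseudo_normal)
print(well_separated, swapped_roles)
```

The loss is zero once the pseudo-accident clip sits at least one margin farther from the anchor than the pseudo-normal clip, and grows when the two are confused.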


[116] HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction cs.MM | cs.AI | cs.CV | cs.LGPDF

Jie Qin, Wei Yang, Yan Su, Yiran Zhu, Weizhen Li

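TL;DR: The paper proposes a multimodal HER2 expression prediction framework based on dynamic bidirectional reconstruction. Flexible choice of input modality improves accuracy and reduces IHC costs in resource-limited settings.

Details

Motivation: Existing HER2 assessment models typically analyze H&E or IHC images in isolation, whereas clinical practice relies on interpreting both together. However, the cost and workflow complexity of acquiring both modalities limit their use.

Result: Single-modality H&E prediction accuracy rises from 71.44% to 94.25%, dual-modality accuracy reaches 95.09%, and reliability with IHC-only input is 90.28%. Cross-modal reconstruction also raises F1-scores to 0.9609 (H&E to IHC) and 0.9251 (IHC to H&E).

Insight: The dynamic, elastic architecture is advantageous in resource-limited settings: it achieves near-bimodal performance while reducing IHC infrastructure costs. The reconstruction pathways mitigate the performance degradation caused by missing data.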

Abstract: Current HER2 assessment models for breast cancer predominantly analyze H&E or IHC images in isolation, despite clinical reliance on their synergistic interpretation. However, concurrent acquisition of both modalities is often hindered by workflow complexity and cost constraints. We propose an adaptive bimodal framework enabling flexible single-/dual-modality HER2 prediction through three innovations: 1) A dynamic branch selector that activates either single-modality reconstruction or dual-modality joint inference based on input completeness; 2) A bidirectional cross-modal GAN performing context-aware feature-space reconstruction of missing modalities; 3) A hybrid training protocol integrating adversarial learning and multi-task optimization. This architecture elevates single-modality H&E prediction accuracy from 71.44% to 94.25% while achieving 95.09% dual-modality accuracy, maintaining 90.28% reliability with sole IHC inputs. The framework’s “dual-preferred, single-compatible” design delivers near-bimodal performance without requiring synchronized acquisition, particularly benefiting resource-limited settings through IHC infrastructure cost reduction. Experimental validation confirms 22.81%/12.90% accuracy improvements over H&E/IHC baselines respectively, with cross-modal reconstruction enhancing F1-scores to 0.9609 (HE to IHC) and 0.9251 (IHC to HE). By dynamically routing inputs through reconstruction-enhanced or native fusion pathways, the system mitigates performance degradation from missing data while preserving computational efficiency (78.55% parameter reduction in lightweight variant). This elastic architecture demonstrates significant potential for democratizing precise HER2 assessment across diverse healthcare settings.
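
The routing described for the dynamic branch selector can be sketched as a small function that chooses a pathway from input completeness. This covers the routing decision only; the branch names are hypothetical labels, and the reconstruction GAN and the inference models are not modeled here.

```python
from typing import Optional


def select_branch(he_image: Optional[object], ihc_image: Optional[object]) -> str:
    """Pick a processing pathway based on which modalities are present."""
    if he_image is not None and ihc_image is not None:
        return "dual_modality_joint_inference"
    if he_image is not None:
        return "reconstruct_ihc_from_he"  # H&E only: the missing IHC is reconstructed
    if ihc_image is not None:
        return "reconstruct_he_from_ihc"  # IHC only: the missing H&E is reconstructed
    raise ValueError("at least one of H&E or IHC input is required")


# The reported H&E gain is a percentage-point difference: 94.25 - 71.44.
he_gain_points = round(94.25 - 71.44, 2)
print(select_branch("he.png", None), he_gain_points)
```

Note that the 22.81% improvement over the H&E baseline quoted in the abstract equals the difference between the two reported H&E accuracies, so it is an absolute gain in percentage points.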


[117] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space cs.MM | cs.AI | cs.CVPDF

Kangwei Liu, Junwu Liu, Xiaowei Yi, Jinlin Guo, Yun Cao

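TL;DR: This paper proposes a diffusion-based framework for 3D facial animation. A multimodal emotion binding strategy and an attention-based latent diffusion model address the limitations of single-modal control and deterministic regression.

Details

Motivation: Existing audio-driven 3D facial animation methods rely on single-modal control signals, and their deterministic regression limits the diversity of emotional expressions and behaviors, so the complementary strengths of multimodal signals go unused.

Result: Experiments show a 21.6% improvement in emotion similarity while preserving natural facial dynamics.

Insight: Combining multimodal signals with diffusion models can significantly improve the expressiveness and controllability of 3D facial animation.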

Abstract: Audio-driven emotional 3D facial animation encounters two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mapping that constrains the stochastic nature of emotional expressions and non-verbal behaviors, limiting the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics. Project Page: https://kangweiiliu.github.io/Control_3D_Animation.


[118] Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics cs.MM | cs.AI | cs.CVPDF

Yi-Chun Chen

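TL;DR: This paper proposes a hierarchical knowledge graph framework for the structured understanding of visual narratives such as comics, with support for multimodal reasoning.

Details

Motivation: Visual narratives such as comics carry complex multimodal information (visual and textual), and traditional single-level analysis struggles to capture their semantic, spatial, and temporal relationships.

Result: Experiments on the Manga109 dataset show high precision and recall on action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction.

Insight: The work provides a scalable foundation for narrative analysis, interactive storytelling, and multimodal reasoning in visual media, and underscores the value of hierarchical graphs for understanding complex narratives.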

Abstract: This paper presents a hierarchical knowledge graph framework for the structured understanding of visual narratives, focusing on multimodal media such as comics. The proposed method decomposes narrative content into multiple levels, from macro-level story arcs to fine-grained event segments. It represents them through integrated knowledge graphs that capture semantic, spatial, and temporal relationships. At the panel level, we construct multimodal graphs that link visual elements such as characters, objects, and actions with corresponding textual components, including dialogue and captions. These graphs are integrated across narrative levels to support reasoning over story structure, character continuity, and event progression. We apply our approach to a manually annotated subset of the Manga109 dataset and demonstrate its ability to support symbolic reasoning across diverse narrative tasks, including action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction. Evaluation results show high precision and recall across tasks, validating the coherence and interpretability of the framework. This work contributes a scalable foundation for narrative-based content analysis, interactive storytelling, and multimodal reasoning in visual media.


[119] WDMIR: Wavelet-Driven Multimodal Intent Recognition cs.MM | cs.AI | cs.CV | eess.SPPDF

Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun

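TL;DR: The paper proposes WDMIR, a wavelet-driven multimodal intent recognition framework that improves the understanding of non-verbal information through frequency-domain analysis and achieves state-of-the-art performance on the MIntRec dataset.

Details

Motivation: Existing methods focus mostly on text analysis and overlook the rich semantics of non-verbal information. WDMIR aims to fill this gap through frequency-domain analysis.

Result: Accuracy on MIntRec improves by 1.13%, and the wavelet-driven fusion module yields a 0.41% gain in recognition accuracy on subtle emotional cues.

Insight: Frequency-domain analysis captures the temporal dynamics of non-verbal information at a finer granularity, and cross-modal interaction helps bridge the semantic gap between verbal and non-verbal information.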

Abstract: Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. To be more specific, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% on accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
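
The abstract does not say which wavelet the fusion module uses. As a minimal sketch of what a wavelet decomposition does to a temporal feature sequence, the example below applies a single-level Haar transform (an assumed choice) to a toy 1-D sequence, splitting it into low-frequency approximation and high-frequency detail coefficients and then reconstructing it. All names and values are illustrative.

```python
import math


def haar_decompose(signal):
    """Single-level Haar transform: pairwise averages and differences."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert haar_decompose."""
    s = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / s)
        signal.append((a - d) / s)
    return signal


# Toy feature sequence: a flat pair followed by a sharp change.
features = [4.0, 4.0, 2.0, 6.0]
approx, detail = haar_decompose(features)
restored = haar_reconstruct(approx, detail)
print(approx, detail, restored)
```

The flat pair produces a zero detail coefficient while the sharp change produces a large one, which is how the detail band isolates rapid temporal dynamics.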


cs.GR [Back]

[120] Edit360: 2D Image Edits to 3D Assets from Any Angle cs.GR | cs.CVPDF

Junchao Huang, Xinting Hu, Zhuotao Tian, Shaoshuai Shi, Li Jiang

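TL;DR: Edit360 is a tuning-free framework that extends 2D image edits to multi-view consistent 3D editing. It builds on video diffusion models and an Anchor-View Editing Propagation mechanism to enable high-quality 3D asset reconstruction.

Details

Motivation: Existing methods typically restrict editing to predetermined viewing angles, which limits flexibility and makes fine-grained, multi-view consistent edits hard to achieve.

Result: Edit360 reconstructs high-quality 3D assets and supports customizable 3D content creation.

Insight: Aligning multi-view information within the latent and attention spaces of diffusion models offers an efficient and flexible approach to 3D editing.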

Abstract: Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.


eess.IV [Back]

[121] Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective eess.IV | cs.CVPDF

Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai

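TL;DR: This paper proposes HFF-Net, a brain tumor segmentation method built on a frequency-domain perspective. Frequency domain decomposition and an adaptive Laplacian convolution module substantially improve segmentation of contrast-enhancing tumor regions.

Details

Motivation: Current brain tumor segmentation methods underperform on contrast-enhancing regions, largely because they give insufficient consideration to MRI-specific features such as complex textures and directional variations.

Result: On four public datasets, HFF-Net achieves an average relative improvement of 4.48% in mean Dice score across the major tumor subregions and 7.33% on contrast-enhancing regions.

Insight: Frequency-domain analysis better captures the texture and directional features of MRI images, and dynamically updated convolution kernels increase sensitivity to boundary details.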

Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39% to 7.72%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: https://github.com/VinyehShaw/HFF.
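
The results above are reported as relative improvements in Dice score. For readers less familiar with the metric, the sketch below computes the Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two masks given as sets of pixel coordinates, and a relative improvement between two scores. The masks and the scores passed to `relative_improvement` are made-up values, not numbers from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two masks given as sets of pixel coordinates."""
    pred_set, target_set = set(pred), set(target)
    if not pred_set and not target_set:
        return 1.0  # two empty masks agree perfectly
    overlap = len(pred_set & target_set)
    return 2 * overlap / (len(pred_set) + len(target_set))


def relative_improvement(new, baseline):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Toy masks: three pixels each, two of them shared.
predicted = {(0, 0), (0, 1), (1, 1)}
ground_truth = {(0, 1), (1, 1), (1, 2)}

dice = dice_score(predicted, ground_truth)
gain = relative_improvement(0.84, 0.80)  # hypothetical Dice scores
print(dice, gain)
```

A relative improvement divides the gain by the baseline score, so it is larger than the absolute difference in Dice whenever the baseline is below 1.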


[122] Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation eess.IV | cs.CVPDF

Emerson P. Grabke, Masoom A. Haider, Babak Taati

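TL;DR: The paper proposes CCELLA, which combines a dual-head conditioning strategy, a joint loss function, and a data-efficient LDM training framework to improve 3D prostate MRI generation and address data scarcity in medical image synthesis.

Details

Motivation: In medical image synthesis, latent diffusion models (LDMs) typically rely on short-prompt text encoders or non-medical pretrained models, or require fine-tuning on large data volumes, which limits performance and scientific accessibility. The paper aims to address these limitations.

Result: On a size-limited 3D prostate MRI dataset, the method achieves an FID of 0.025, well below the 0.071 of the baseline foundation model. Adding synthetic images to the training set raises classifier accuracy from 69% to 74%, and a classifier trained only on synthetic images performs comparably to one trained on real images alone.

Insight: Conditioning strategies and an efficient training framework enable high-quality medical image synthesis in low-data settings, easing data scarcity and improving the model's practicality and accessibility.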

Abstract: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM training typically relies on performance- or scientific accessibility-limiting strategies including a reliance on short-prompt text encoders, the reuse of non-medical LDMs, or a requirement for fine-tuning with large data volumes. We propose a Class-Conditioned Efficient Large Language model Adapter (CCELLA) to address these limitations. CCELLA is a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with non-medical large language model-encoded text features through cross-attention and with pathology classification through the timestep embedding. We also propose a joint loss function and a data-efficient LDM training framework. In combination, these strategies enable pathology-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility. Our method achieves a 3D FID score of 0.025 on a size-limited prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.071. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method to the training dataset improves classifier accuracy from 69% to 74%. Training a classifier solely on our method’s synthetic images achieved comparable performance to training on real images alone.


[123] DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction eess.IV | cs.AI | cs.CVPDF

Yuliang Zhu, Jing Cheng, Qi Xie, Zhuo-Xu Cui, Qingyong Zhu

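TL;DR: This paper proposes DUN-SRE, a deep unrolling network with spatiotemporal rotation equivariance for dynamic MRI reconstruction. Incorporating spatiotemporal symmetry priors significantly improves image quality.

Details

Motivation: Dynamic MRI has symmetry priors in both the spatial and temporal dimensions, but existing methods fail to model them effectively. DUN-SRE aims to fill this gap, particularly for temporal symmetry.

Result: On cardiac CINE MRI datasets, DUN-SRE excels at preserving rotation-symmetric structures and shows strong generalization.

Insight: Spatiotemporal symmetry priors are crucial for dynamic MRI reconstruction, and the equivariant design of DUN-SRE enables more physically accurate modeling.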

Abstract: Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries, including spatial rotation symmetry within individual frames and temporal symmetry along the time dimension. Explicit incorporation of these symmetry priors in the reconstruction model can significantly improve image quality, especially under aggressive undersampling scenarios. Recently, Equivariant convolutional neural network (ECNN) has shown great promise in exploiting spatial symmetry priors. However, existing ECNNs critically fail to model temporal symmetry, arguably the most universal and informative structural prior in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance through a (2+1)D equivariant convolutional architecture. In particular, it integrates both the data consistency and proximal mapping module into a unified deep unrolling framework. This architecture ensures rigorous propagation of spatiotemporal rotation symmetry constraints throughout the reconstruction process, enabling more physically accurate modeling of cardiac motion dynamics in cine MRI. In addition, a high-fidelity group filter parameterization mechanism is developed to maintain representation precision while enforcing symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in preserving rotation-symmetric structures, offering strong generalization capability to a broad range of dynamic MRI reconstruction tasks.
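
Rotation equivariance means that rotating the input and then applying an operator gives the same result as applying the operator and then rotating the output. The sketch below checks this property for 90-degree spatial rotations on a toy grid, using a rotation-symmetric 4-neighbor filter and, for contrast, a one-sided filter that breaks it. It illustrates the property only; it is not the paper's (2+1)D equivariant convolution, and all names are hypothetical.

```python
def rot90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def neighbor_sum(grid):
    """Sum of the 4-connected neighbors of each cell, with zero padding."""
    n, m = len(grid), len(grid[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n and 0 <= nj < m:
                    out[i][j] += grid[ni][nj]
    return out


def right_neighbor(grid):
    """Value of the cell to the right (zero at the border): a one-sided filter."""
    return [row[1:] + [0] for row in grid]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

symmetric_ok = neighbor_sum(rot90(image)) == rot90(neighbor_sum(image))
one_sided_ok = right_neighbor(rot90(image)) == rot90(right_neighbor(image))
print(symmetric_ok, one_sided_ok)
```

The symmetric filter commutes with the rotation because its kernel looks the same after a quarter turn, whereas the one-sided filter does not.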


[124] ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation eess.IV | cs.CVPDF

Xi Chen, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane

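TL;DR: The paper proposes ConStyX, a new domain randomization method that improves the generalization of medical image segmentation models by augmenting both content and style, addressing existing methods' exclusive reliance on style perturbation and their neglect of over-augmentation.

Details

Motivation: Medical images are collected from multiple domains, and the resulting domain shift degrades model performance. Existing domain generalization methods rely only on style perturbation and ignore the negative effects of over-augmented images.

Result: Experiments show that ConStyX outperforms existing methods across multiple domains, with stronger generalization.

Insight: Augmenting content and style together covers a wider range of data domains, while mitigating over-augmented features during training avoids the harm caused by over-augmentation.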

Abstract: Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations: 1) constrained efficiency of domain randomization due to their exclusive dependence on image style perturbation, and 2) neglect of the adverse effects of over-augmented images on model training. To address these issues, we propose a novel domain randomization-based DG method, called content style augmentation (ConStyX), for generalizable medical image segmentation. Specifically, ConStyX 1) augments the content and style of training data, allowing the augmented training data to better cover a wider range of data domains, and 2) leverages well-augmented features while mitigating the negative effects of over-augmented features during model training. Extensive experiments across multiple domains demonstrate that our ConStyX achieves superior generalization performance. The code is available at https://github.com/jwxsp1/ConStyX.


[125] Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches eess.IV | cs.AI | cs.CV | A.1; I.2.0; I.4.6PDF

Andrea Moglia, Matteo Leccardi, Matteo Cavicchioli, Alice Maccarini, Marco Marcon

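TL;DR: This survey systematically examines generalist models (especially SAM and its variants) for medical image segmentation, compares their performance with task-specific models, and discusses future directions and open challenges.

Details

Motivation: Inspired by the success of large language models and natural-image segmentation models such as SAM, the survey examines the potential of generalist models in medical image segmentation and whether they can surpass task-specific models.

Result: Generalist models perform well on some tasks but still trail task-specific models, particularly given the complexity and diversity of medical images.

Insight: Future directions include synthetic data, early fusion, lessons learned from generalist models in natural language processing, agentic AI and physical AI, and clinical translation; regulatory compliance, privacy and security, budget, and trustworthy AI remain open challenges.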

Abstract: Following the successful paradigm shift of large language models, leveraging pre-training on a massive corpus of data and fine-tuning on different downstream tasks, generalist models have made their foray into computer vision. The introduction of Segment Anything Model (SAM) set a milestone on segmentation of natural images, inspiring the design of a multitude of architectures for medical image segmentation. In this survey we offer a comprehensive and in-depth investigation of generalist models for medical image segmentation. We start with an introduction to the fundamental concepts underpinning their development. Then, we provide a taxonomy of the different variants of SAM in terms of zero-shot, few-shot, fine-tuning, adapters, on the recent SAM 2, on other innovative models trained on images alone, and others trained on both text and images. We thoroughly analyze their performance at the level of both primary research and best-in-literature, followed by a rigorous comparison with the state-of-the-art task-specific models. We emphasize the need to address challenges in terms of compliance with regulatory frameworks, privacy and security laws, budget, and trustworthy artificial intelligence (AI). Finally, we share our perspective on future directions concerning synthetic data, early fusion, lessons learnt from generalist models in natural language processing, agentic AI and physical AI, and clinical translation.


[126] Med-URWKV: Pure RWKV With ImageNet Pre-training For Medical Image Segmentation eess.IV | cs.CVPDF

Zhenhuan Zhou

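TL;DR: Med-URWKV is, to the authors' knowledge, the first pure-RWKV model for medical image segmentation that reuses an ImageNet-pretrained encoder; it performs well across multiple datasets.

Details

Motivation: Existing medical image segmentation methods (CNN, Transformer, or hybrid architectures) suffer from limited receptive fields or high computational complexity. RWKV combines linear complexity with long-range modeling, but the benefit of pretraining it has not been explored.

Result: Across seven datasets, Med-URWKV matches or outperforms RWKV models trained from scratch, demonstrating the value of pretraining.

Insight: A pretrained RWKV encoder can significantly improve medical image segmentation, pointing to a direction for lightweight, efficient long-range modeling.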

Abstract: Medical image segmentation is a fundamental and key technology in computer-aided diagnosis and treatment. Previous methods can be broadly classified into three categories: convolutional neural network (CNN) based, Transformer based, and hybrid architectures that combine both. However, each of them has its own limitations, such as restricted receptive fields in CNNs or the computational overhead caused by the quadratic complexity of Transformers. Recently, the Receptance Weighted Key Value (RWKV) model has emerged as a promising alternative for various vision tasks, offering strong long-range modeling capabilities with linear computational complexity. Some studies have also adapted RWKV to medical image segmentation tasks, achieving competitive performance. However, most of these studies focus on modifications to the Vision-RWKV (VRWKV) mechanism and train models from scratch, without exploring the potential advantages of leveraging pre-trained VRWKV models for medical image segmentation tasks. In this paper, we propose Med-URWKV, a pure RWKV-based architecture built upon the U-Net framework, which incorporates ImageNet-based pretraining to further explore the potential of RWKV in medical image segmentation tasks. To the best of our knowledge, Med-URWKV is the first pure RWKV segmentation model in the medical field that can directly reuse a large-scale pre-trained VRWKV encoder. Experimental results on seven datasets demonstrate that Med-URWKV achieves comparable or even superior segmentation performance compared to other carefully optimized RWKV models trained from scratch. This validates the effectiveness of using a pretrained VRWKV encoder in enhancing model performance. The codes will be released.


cs.IR [Back]

[127] Conversational Search: From Fundamentals to Frontiers in the LLM Era cs.IR | cs.CLPDF

Fengran Mo, Chuan Meng, Mohammad Aliannejadi, Jian-Yun Nie

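TL;DR: This tutorial, "Conversational Search: From Fundamentals to Frontiers in the LLM Era", covers the fundamentals and frontiers of conversational search in the era of large language models (LLMs): how multi-turn interaction serves complex information needs, and the opportunities and challenges that LLMs bring.

Details

Motivation: Conversational search satisfies users' complex information needs through multi-turn interaction, and LLMs' instruction following, content generation, and reasoning abilities create new opportunities and challenges for building intelligent conversational search systems.

Result: The tutorial gives participants from academia and industry a comprehensive knowledge framework for understanding and advancing next-generation conversational search systems.

Insight: LLMs bring stronger context understanding and dynamic interaction to conversational search, but deployment still has to address challenges such as intent understanding and information accuracy.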

Abstract: Conversational search enables multi-turn interactions between users and systems to fulfill users’ complex information needs. During this interaction, the system should understand the users’ search intent within the conversational context and then return the relevant information through a flexible, dialogue-based interface. The recent powerful large language models (LLMs) with capacities of instruction following, content generation, and reasoning, attract significant attention and advancements, providing new opportunities and challenges for building up intelligent conversational search systems. This tutorial aims to introduce the connection between fundamentals and the emerging topics revolutionized by LLMs in the context of conversational search. It is designed for students, researchers, and practitioners from both academia and industry. Participants will gain a comprehensive understanding of both the core principles and cutting-edge developments driven by LLMs in conversational search, equipping them with the knowledge needed to contribute to the development of next-generation conversational search systems.


cs.LG [Back]

[128] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs cs.LG | cs.AI | cs.CL | cs.CVPDF

Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu

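TL;DR: Omni-DPO proposes a dual-perspective optimization framework that dynamically adjusts the learning weight of each preference pair, significantly improving DPO's performance in RLHF.

Details

Motivation: Existing DPO methods typically treat all preference pairs equally, ignoring differences in their quality and learning utility, which leads to poor data utilization and performance.

Result: On textual understanding, Gemma-2-9b-it fine-tuned with Omni-DPO beats Claude 3 Opus by 6.7 points on Arena-Hard; on mathematical reasoning, Omni-DPO outperforms the baselines on all benchmarks.

Insight: Dynamically weighting preference pairs is key to improving DPO; data quality and the model's learning dynamics are the two core perspectives.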

Abstract: Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model’s evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model’s learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
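
The abstract describes weighting each preference pair by its inherent quality and by the model's current performance on it, but does not give the formulas. The sketch below is therefore only an illustration under stated assumptions: the standard DPO loss is multiplied by a per-pair `quality` score in [0, 1] (assumed to come from an external judge) and by a focal-style factor `(1 - sigmoid(margin)) ** gamma` that down-weights pairs the policy already ranks correctly. Both weighting forms and all names are hypothetical, not Omni-DPO's actual objective.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(pairs, beta: float = 0.1, gamma: float = 2.0) -> float:
    """Mean DPO loss with an assumed quality weight and an assumed focal-style weight.

    Each pair is a dict holding sequence log-probabilities under the policy
    (pi_*) and the frozen reference model (ref_*), plus a quality score.
    """
    total = 0.0
    for p in pairs:
        # Standard DPO margin: scaled difference of implicit rewards.
        margin = beta * (
            (p["pi_chosen"] - p["ref_chosen"])
            - (p["pi_rejected"] - p["ref_rejected"])
        )
        log_sig = log_sigmoid(margin)
        # Assumption: well-fitted pairs (sigmoid near 1) contribute less.
        performance_weight = (1.0 - math.exp(log_sig)) ** gamma
        # Assumption: quality is a precomputed score in [0, 1].
        total += -p["quality"] * performance_weight * log_sig
    return total / len(pairs)


# Illustrative inputs (made up): one pair the policy gets right, one it gets wrong.
base = {"pi_chosen": 0.0, "ref_chosen": 0.0, "pi_rejected": 0.0, "ref_rejected": 0.0, "quality": 1.0}
easy_pair = dict(base, pi_chosen=50.0)
hard_pair = dict(base, pi_rejected=50.0)
print(weighted_dpo_loss([easy_pair]), weighted_dpo_loss([hard_pair]))
```

With `gamma=0.0` and `quality=1.0` the function reduces to the plain DPO loss, which makes the effect of each weight easy to isolate.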


[129] Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning cs.LG | cs.AI | cs.CL | stat.MLPDF

Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang

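TL;DR: The paper proposes a causal representation learning framework that explains benchmark performance through latent capability factors of language models, uncovers the causal structure among those capabilities, and stresses the importance of controlling for base-model variation.

Details

Motivation: Evaluating language model capabilities involves complex confounding effects and high computational cost, and traditional methods struggle to reveal the causal relationships among latent capabilities.

Result: On a dataset of more than 1,500 models, the method identifies a three-node linear causal structure and reveals the causal direction among capabilities.

Insight: The causal direction runs from general problem solving, through instruction following, to mathematical reasoning, underscoring the key role of controlling for base-model variation.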

Abstract: Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.


[130] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series cs.LG | cs.AI | cs.CLPDF

Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng

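TL;DR: Time-IMM is a dataset designed for irregular multimodal multivariate time series; together with the IMM-TSF benchmark library, it narrows the gap between research and real-world deployment.

Details

Motivation: Real-world time series (e.g., in healthcare, climate modeling, and finance) are usually irregular, multimodal, and messy, whereas existing benchmarks assume clean, regularly sampled, unimodal data, which does not match practical needs.

Result: Experiments show that explicitly modeling multimodality on irregular time series substantially improves forecasting performance.

Insight: The work provides tools for analyzing irregular multimodal time series under real-world conditions and advances research in this area.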

Abstract: Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
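
The abstract mentions "recency-aware averaging" as one fusion strategy for asynchronous modalities without defining it. One plausible reading, sketched here purely as an assumption, is an exponentially decayed average of past embeddings, where observations closer to the query time get larger weights. The function name, the decay constant `tau`, and the exponential form are all hypothetical and are not taken from IMM-TSF.

```python
import math


def recency_weighted_average(timestamps, vectors, query_time, tau=1.0):
    """Average past vectors with weights exp(-(query_time - t) / tau).

    Observations later than query_time are ignored so that no future
    information leaks into the fused representation.
    Returns None when there is no past observation.
    """
    past = [(t, v) for t, v in zip(timestamps, vectors) if t <= query_time]
    if not past:
        return None
    weights = [math.exp(-(query_time - t) / tau) for t, _ in past]
    total = sum(weights)
    dim = len(past[0][1])
    return [
        sum(w * v[i] for w, (_, v) in zip(weights, past)) / total
        for i in range(dim)
    ]


# Illustrative use (made-up numbers): two irregularly timed text embeddings.
fused = recency_weighted_average(
    timestamps=[0.0, 9.0],
    vectors=[[0.0, 1.0], [10.0, 3.0]],
    query_time=10.0,
    tau=1.0,
)
print(fused)
```

A smaller `tau` makes the fusion rely almost entirely on the latest observation, while a large `tau` approaches a plain mean; an attention-based variant would learn these weights instead of fixing them.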


[131] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering cs.LG | cs.CLPDF

Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai

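TL;DR: The paper presents Neural, a data-driven prompt optimization approach to question answering over clinical electronic health records (EHRs); it separates evidence retrieval from answer generation and adds self-consistency voting, substantially improving performance.

Details

Motivation: Automated QA over clinical EHRs requires precise evidence retrieval and faithful answer generation, but conventional methods do poorly when supervision is limited. The paper therefore proposes an efficient solution based on prompt optimization.

Result: On the hidden test set the method scores 51.5, outperforming zero-shot and few-shot prompting by more than 20 and 10 points respectively, and places second.

Insight: Data-driven prompt optimization is a cost-effective alternative to fine-tuning, especially in high-stakes domains such as clinical QA, and improves the reliability of AI assistants.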

Abstract: Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
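
The self-consistency voting step for sentence-level evidence can be illustrated with a short sketch. The abstract does not specify the voting rule, so this assumes a simple one: sample the evidence-identification stage several times and keep every sentence ID selected in at least a given fraction of the runs. The function name `vote_evidence` and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter


def vote_evidence(samples, min_fraction=0.5):
    """Keep sentence IDs chosen in at least min_fraction of the sampled runs.

    samples: list of runs, each a list of sentence IDs the model marked
    as supporting evidence. Duplicates within one run count once.
    """
    if not samples:
        return []
    counts = Counter(sid for run in samples for sid in set(run))
    threshold = min_fraction * len(samples)
    return sorted(sid for sid, n in counts.items() if n >= threshold)


# Illustrative runs (made up): five sampled outputs over a clinical note.
runs = [[1, 2], [2, 3], [2, 3, 5], [2], [3]]
print(vote_evidence(runs))
```

Lowering `min_fraction` favors recall, since a sentence found by only a few runs is kept, while raising it favors precision.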


[132] Robustly Improving LLM Fairness in Realistic Settings via Interpretability cs.LG | cs.AI | cs.CLPDF

Adam Karvonen, Samuel Marks

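TL;DR: The paper uses interpretability to robustly improve LLM fairness in realistic settings: it finds that conventional anti-bias prompts fail once realistic context is added and proposes an internal bias mitigation strategy.

Details

Motivation: LLMs are increasingly deployed in high-stakes hiring applications, but the biases they show in realistic, complex contexts are under-studied, and an effective internal intervention is needed.

Result: The method reduces bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance; the underlying biases were observed across leading commercial and open-source models.

Insight: Realistic context can induce biases that simple evaluations miss, which makes internal interventions important and offers guidance for deploying LLMs fairly in practice.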

Abstract: Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people’s careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model’s chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
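
The affine concept editing mentioned in the abstract can be sketched on a single activation vector. This assumes the common formulation in which the component of the activation along a concept direction is replaced by the component of a reference point (for example, the mean activation over a neutral dataset), leaving everything orthogonal to the direction untouched. How the paper extracts directions and which layers it edits are not stated in the abstract; the names below are hypothetical.

```python
import math


def affine_concept_edit(activation, direction, reference):
    """Set the activation's coordinate along `direction` to that of `reference`.

    activation, direction, reference: equal-length lists of floats.
    The direction need not be normalized; it is normalized here.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coord = sum(a * u for a, u in zip(activation, unit))
    target = sum(r * u for r, u in zip(reference, unit))
    # Shift only along the concept direction; orthogonal parts are unchanged.
    return [a + (target - coord) * u for a, u in zip(activation, unit)]


# Illustrative 2D example (made up): the concept direction is the first axis.
edited = affine_concept_edit(activation=[3.0, 4.0], direction=[2.0, 0.0], reference=[1.0, 7.0])
print(edited)
```

Compared with plain directional ablation, which zeroes the coordinate, the affine variant moves it to a data-derived baseline, so two inputs that differ only along the sensitive direction end up with the same edited activation.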


[133] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models cs.LG | cs.AI | cs.CLPDF

Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang

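TL;DR: GUARD is a lightweight data attribution framework for LLM unlearning that assigns adaptive weights to counter unintended forgetting, markedly improving what the model retains after unlearning.

Details

Motivation: Regulatory compliance, copyright protection, and privacy needs make LLM unlearning increasingly important. Existing methods, however, are prone to unintended forgetting when removing high-impact data, which degrades the model. GUARD addresses this through data-level optimization.

Result: On the TOFU benchmark, when forgetting 10% of the training data, GUARD reduces utility sacrifice on the retain set by up to 194.92% in terms of Truth Ratio, while keeping forgetting performance comparable to prior methods.

Insight: Data-level factors matter for LLM unlearning; by quantifying data attribution and assigning adaptive weights, GUARD improves reliability in practical use.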

Abstract: Unlearning in large language models (LLMs) is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the “alignment” between the forget and retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended losses in retention. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU benchmark across multiple LLM architectures demonstrate that GUARD substantially improves utility preservation while ensuring effective unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.
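
The weighting rule stated in the abstract (unlearning weights inversely proportional to proxy attribution scores) can be sketched as follows. The proxy metric itself is not defined in the abstract, so the scores here are taken as given, assumed strictly positive, and the weights are normalized to average 1 so that the overall unlearning strength stays comparable to a uniform baseline. Those choices and the function names are illustrative assumptions.

```python
def unlearning_weights(attribution_scores):
    """Weights proportional to 1 / score, rescaled so their mean is 1.

    Assumes every score is strictly positive; a forget sample that is
    strongly aligned with the retain set (high score) is unlearned more gently.
    """
    if any(s <= 0 for s in attribution_scores):
        raise ValueError("scores must be strictly positive in this sketch")
    inverse = [1.0 / s for s in attribution_scores]
    total = sum(inverse)
    n = len(inverse)
    return [n * v / total for v in inverse]


def weighted_forget_loss(per_sample_losses, attribution_scores):
    """Weighted mean of per-sample forget losses under the weights above."""
    weights = unlearning_weights(attribution_scores)
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)


# Illustrative scores (made up) for three forget-set samples.
scores = [1.0, 2.0, 4.0]
print(unlearning_weights(scores))
print(weighted_forget_loss([0.9, 0.9, 0.9], scores))
```

With equal scores the weights are all 1 and the objective falls back to the uniform forget loss.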


[134] Build the web for agents, not agents for the web cs.LG | cs.CLPDF

Xing Han Lù, Gaurav Kamath, Marius Mosbach, Siva Reddy

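TL;DR: This position paper argues for a paradigm shift: build web interfaces designed specifically for AI agents (Agentic Web Interfaces, AWIs) rather than forcing agents to adapt to human-designed interfaces, in order to improve efficiency and reliability.

Details

Motivation: Current AI agents struggle with web tasks because the interfaces are designed for humans rather than optimized for agents. This mismatch limits agents' capability and efficiency.

Result: The AWI concept offers a new direction for web agent research and may overcome the limitations of existing approaches on complex web tasks.

Insight: Developing standardized, optimized, agent-friendly interfaces will require a collaborative effort with the broader machine learning community.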

Abstract: Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents – AI systems capable of autonomously navigating and completing tasks within web environments. While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions. This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities. To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website. We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders. This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community.


[135] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems cs.LG | cs.AI | cs.CVPDF

Aayush Karan, Kulin Shah, Sitan Chen

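TL;DR: ReGuidance is a simple wrapper around diffusion models for boosting sample quality on hard inverse problems. It runs the unconditional probability flow ODE in reverse from a candidate solution and uses the resulting latent to initialize DPS, markedly improving sample realism and the reward achieved.

Details

Motivation: Existing methods (DPS and its variants) tend to veer off the data manifold on hard inverse problems with low signal-to-noise ratio, producing unrealistic outputs. ReGuidance targets this failure.

Result: On hard inverse problems such as large-box inpainting and super-resolution with high upscaling, applying ReGuidance on top of state-of-the-art baselines significantly improves sample quality and measurement consistency.

Insight: The accompanying theory proves that, on certain multimodal data distributions, ReGuidance both raises the reward and brings the candidate solution closer to the data manifold; the authors describe this as the first rigorous algorithmic guarantee for DPS.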

Abstract: There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models. Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs. In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods. Given a candidate solution $\hat{x}$ produced by an algorithm of the user’s choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS. We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling. Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold. To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.


cs.CR [Back]

[136] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models cs.CR | cs.CLPDF

Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma

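TL;DR: GenBreak fine-tunes a large language model (LLM) to generate adversarial prompts for systematically evaluating safety vulnerabilities in text-to-image (T2I) models; the prompts both evade safety mechanisms and elicit highly toxic images.

Details

Motivation: Safety problems in T2I models are increasingly prominent. Existing methods either fail to bypass safety mechanisms or fail to produce highly toxic images, and a comprehensive evaluation tool is lacking.

Result: The generated adversarial prompts are highly effective in black-box attacks and expose safety weaknesses in commercial T2I models.

Insight: GenBreak shows the potential of LLM-driven red teaming for safety evaluation and suggests directions for the safety design of future T2I models.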

Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical and concerning safety weaknesses.


[137] Secure Data Access in Cloud Environments Using Quantum Cryptography cs.CR | cs.CVPDF

S. Vasavi Venkata Lakshmi, Ziaul Haque Choudhury

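TL;DR: The paper proposes using quantum cryptography (the BB84 protocol and the Quantum One-Time Pad) to secure data in cloud environments against the threat of future quantum computers.

Details

Motivation: Cloud computing makes storing and accessing data convenient, but traditional encryption may not be secure in the quantum computing era. Quantum cryptography offers a new way to address this.

Result: The approach is presented as resistant to quantum-computing attacks, providing strong protection for cloud data in a future quantum computing environment.

Insight: Quantum cryptography is an important direction for future data security, especially in cloud computing; combining QKD and QOTP can give existing systems long-term protection.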

Abstract: Cloud computing has made storing and accessing data easier, but keeping it secure remains a major challenge. Traditional methods of securing data may not be strong enough once powerful quantum computers become available. To solve this problem, this study uses quantum cryptography to protect data in the cloud environment. Quantum Key Distribution (QKD) creates secure keys by sending information using quantum particles like photons. Specifically, we use the BB84 protocol, a simple and reliable way to make secure keys that cannot be stolen without detection. To protect the data, we use the Quantum One-Time Pad (QOTP) for encryption and decryption, ensuring the data stays completely private. This study shows how these quantum methods can be applied in cloud systems to provide a strong defense against hackers, even if they have access to quantum computers. The combination of QKD, BB84, and QOTP creates a safe and reliable way to keep data secure when it is stored or shared in the cloud. Using quantum cryptography, this paper provides a way to ensure data security now and in the future, making cloud storage safer for everyone.
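
The claim that BB84 keys "cannot be stolen without detection" can be illustrated with a purely classical simulation. This is a sketch under simplifying assumptions (ideal channel, an intercept-and-resend eavesdropper, no error correction or privacy amplification), and it substitutes a classical XOR one-time pad for the Quantum One-Time Pad named in the abstract, which operates on qubits. All function names are hypothetical.

```python
import random


def bb84_sift(n_bits, rng, eavesdrop=False):
    """Simulate BB84 and return (alice_key, bob_key) after basis sifting.

    Bases are encoded as 0 (rectilinear) or 1 (diagonal). Measuring in the
    wrong basis yields a uniformly random bit.
    """
    alice_bits = [rng.randint(0, 1) for _ in range(n_bits)]
    alice_bases = [rng.randint(0, 1) for _ in range(n_bits)]
    bob_bases = [rng.randint(0, 1) for _ in range(n_bits)]
    bob_bits = []
    for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases):
        photon_bit, photon_basis = bit, a_basis
        if eavesdrop:
            # Intercept-and-resend: Eve measures in a random basis and
            # re-emits the photon in that basis.
            eve_basis = rng.randint(0, 1)
            if eve_basis != photon_basis:
                photon_bit = rng.randint(0, 1)
            photon_basis = eve_basis
        if b_basis == photon_basis:
            bob_bits.append(photon_bit)
        else:
            bob_bits.append(rng.randint(0, 1))
    # Sifting: keep only positions where Alice's and Bob's bases agree.
    keep = [i for i in range(n_bits) if alice_bases[i] == bob_bases[i]]
    return [alice_bits[i] for i in keep], [bob_bits[i] for i in keep]


def error_rate(key_a, key_b):
    """Fraction of sifted positions where the two keys disagree."""
    return sum(a != b for a, b in zip(key_a, key_b)) / len(key_a)


def xor_pad(bits, key):
    """Classical one-time pad on bit lists; the key must be at least as long."""
    if len(key) < len(bits):
        raise ValueError("key shorter than message")
    return [b ^ k for b, k in zip(bits, key)]


clean_a, clean_b = bb84_sift(2000, random.Random(0))
tapped_a, tapped_b = bb84_sift(2000, random.Random(1), eavesdrop=True)
print(error_rate(clean_a, clean_b), error_rate(tapped_a, tapped_b))
```

Without an eavesdropper the sifted keys agree exactly, while intercept-and-resend pushes the expected error rate to about 25%, which Alice and Bob can detect by publicly comparing a sample of key bits before using the rest as a pad.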


cs.RO [Back]

[138] A Navigation Framework Utilizing Vision-Language Models cs.RO | cs.AI | cs.CVPDF

Yicheng Duan, Kaiyu tang

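TL;DR: The paper proposes a modular navigation framework that decouples vision-language understanding from action planning, pairing lightweight planning with a frozen vision-language model for efficient, flexible navigation. Results in unseen environments remain challenging but point to directions for improvement.

Details

Motivation: Vision-and-Language Navigation (VLN) is an important problem in embodied AI. Existing vision-language models (such as CLIP) have improved multimodal understanding, but their computational cost and real-time deployment remain unsolved. The paper tackles these issues with a modular design.

Result: On the Room-to-Room benchmark in VLN-CE, the system struggles to generalize to unseen environments, but the modular approach lays a foundation for future improvements such as stronger environmental priors and richer multimodal input.

Insight: 1. The modular design reduces computational cost; 2. the two-frame input strategy improves decision continuity; 3. performance could be improved further by integrating stronger environmental priors and more multimodal input.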

Abstract: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.


[139] EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence cs.RO | cs.CVPDF

Wang Xinjie, Liu Liu, Cao Yu, Wu Ruiqi, Qin Wenkang

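TL;DR: EmbodiedGen is a generative 3D world engine platform that produces high-quality, controllable, photorealistic 3D assets to support training and evaluation of embodied intelligence tasks, improving data diversity and scalability.

Details

Motivation: Embodied intelligence tasks currently rely on manually created 3D assets, which are costly and limited in realism, restricting the scalability of data-driven approaches.

Result: The generated 3D assets can be imported directly into physics simulators, supporting efficient training and evaluation of embodied intelligence tasks.

Insight: Generative AI can be an effective tool for addressing the scarcity and limited diversity of 3D data while improving the realism and interactivity of simulated environments.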

Abstract: Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low-cost accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the generalization and evaluation challenges of embodied intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.


[140] Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop cs.RO | cs.CVPDF

Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi

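TL;DR: EyeRobot is a robotic system that combines behavior cloning (BC) and reinforcement learning (RL), jointly training hand and eye behavior to achieve task-driven active visual perception.

Details

Motivation: Humans complete tasks by actively looking at their environment, while conventional robotic systems lack this kind of dynamic visual perception. The authors train the robot to actively direct its gaze in order to manipulate more effectively.

Result: On five panoramic-workspace tasks, EyeRobot shows effective hand-eye coordination, fixating stably on targets and ignoring distractors.

Insight: Active visual perception such as gaze control can substantially improve task performance in complex environments, and a foveal-inspired policy architecture provides high resolution on a small compute budget.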

Abstract: Humans do not passively observe the visual world – we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by first collecting teleoperated demonstrations paired with a 360 camera. This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of robot demonstrations. We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. EyeRobot implements a foveal-inspired policy architecture allowing high resolution with a small compute budget, which we find also leads to the emergence of more stable fixation as well as improved ability to track objects and ignore distractors. We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring manipulation in an arc surrounding the robot arm. Our experiments suggest EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate manipulation over large workspaces with a single camera. See project site for videos: https://www.eyerobot.net/


physics.med-ph

[141] Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation physics.med-ph | cs.CV

Nicholas Summerfield, Qisheng He, Alex Kuo, Ahmed I. Ghanem, Simeng Zhu

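TL;DR: The paper proposes MAGIC, a multi-modality cardiac substructure segmentation method built on the nnU-Net framework. Replicated encoding and decoding branches make it modality-agnostic, and it performs well on Sim-CT, MR-Linac, and CCTA.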

Details

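Motivation: Cardiac substructure segmentation is essential for radiation therapy planning, but existing deep learning methods generalize poorly across modalities and overlapping structures.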

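Result: Measured by the Dice Similarity Coefficient (DSC), MAGIC outperforms the comparison models in 57% of cases while remaining computationally lightweight.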

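Insight: MAGIC shows the potential of a single model for multi-modality tasks, although its statistical differences from the comparison models are limited and further optimization is needed.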

Abstract: Cardiac substructures are essential in thoracic radiation therapy planning to minimize risk of radiation-induced heart disease. Deep learning (DL) offers efficient methods to reduce contouring burden but lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and multi-modal cardiac substructure segmentation. MAGIC is implemented through replicated encoding and decoding branches of an nnU-Net-based, U-shaped backbone conserving the function of a single model. Twenty cardiac substructures (heart, chambers, great vessels (GVs), valves, coronary arteries (CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac, and cardiac CT angiography (CCTA) modalities were manually delineated and used to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison models (four segmentation subgroups across three modalities) were equivalently trained. All methods were compared for training efficiency and against reference contours using the Dice Similarity Coefficient (DSC) and two-tailed Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were 0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC outperforms the comparison in 57% of cases, with limited statistical differences. MAGIC offers an effective and accurate segmentation solution that is lightweight and capable of segmenting multiple modalities and overlapping structures in a single model. MAGIC further enables clinical implementation by simplifying the computational requirements and offering unparalleled flexibility for clinical settings.
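
For readers unfamiliar with the metric, the Dice Similarity Coefficient used above is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted region A and a reference region B. The sketch below computes it over sets of voxel indices. The function name and the choice to return 1.0 for two empty masks are assumptions for this illustration and are not taken from the paper's evaluation code.

```python
def dice_coefficient(pred, ref):
    """Dice Similarity Coefficient between two collections of voxel indices.

    DSC = 2 * |pred ∩ ref| / (|pred| + |ref|), ranging from 0 (no overlap)
    to 1 (identical regions).
    """
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2 * len(pred & ref) / (len(pred) + len(ref))
```

For example, two three-voxel regions that share two voxels give 2·2 / (3 + 3) ≈ 0.67.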


cs.AI

[142] One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence cs.AI | cs.CL

Michelle M. Li, Ben Y. Reis, Adam Rodman, Tianxi Cai, Noa Dagan

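TL;DR: This Perspective outlines a vision for context-switching medical AI. It targets the limited ability of current medical foundation models to adapt dynamically to new settings, with the goal of reducing errors caused by overlooking critical contextual information.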

Details

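Motivation: Current medical AI models require fine-tuning or careful prompting for new populations, specialties, or settings and cannot adapt dynamically to different contexts, so their predictions overlook critical information and produce errors.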

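Result: Context-switching AI is envisioned to diagnose, manage, and treat a wide range of diseases across specialties and regions, and to expand access to medical care.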

Insight: The ability to adjust dynamically to context is key to making medical AI more practical and more generalizable.

Abstract: Medical foundation models, including language models trained on clinical notes, vision-language models on medical images, and multimodal models on electronic health records, can summarize clinical notes, answer medical questions, and assist in decision-making. Adapting these models to new populations, specialties, or settings typically requires fine-tuning, careful prompting, or retrieval from knowledge bases. This can be impractical, and limits their ability to interpret unfamiliar inputs and adjust to clinical situations not represented during training. As a result, models are prone to contextual errors, where predictions appear reasonable but fail to account for critical patient-specific or contextual information. These errors stem from a fundamental limitation that current models struggle with: dynamically adjusting their behavior across evolving contexts of medical care. In this Perspective, we outline a vision for context-switching in medical AI: models that dynamically adapt their reasoning without retraining to new specialties, populations, workflows, and clinical roles. We envision context-switching AI to diagnose, manage, and treat a wide range of diseases across specialties and regions, and expand access to medical care.


[143] Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning cs.AI | cs.CL

Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li

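TL;DR: The paper presents the Scientists’ First Exam (SFE) benchmark for evaluating the perception, understanding, and reasoning abilities of multimodal large language models (MLLMs) in scientific domains. Spanning 66 multimodal tasks and 830 expert-verified VQA pairs, SFE exposes the shortcomings of current state-of-the-art models such as GPT-o3 and InternVL-3 in scientific cognition.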

Details

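Motivation: Current scientific benchmarks mainly evaluate the knowledge understanding of MLLMs and neglect their perception and reasoning abilities. SFE addresses this limitation by assessing the scientific cognitive capacities of MLLMs comprehensively.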

Result: GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, respectively, indicating substantial room for MLLMs to improve in scientific cognition.

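Insight: SFE provides a new evaluation standard for AI in science, underscores the importance of multimodal perception and reasoning, and may help advance AI-enhanced scientific discovery.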

Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.


[144] TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving cs.AI | cs.CL

Vincenzo Colle, Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed

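TL;DR: TeleMath is the first benchmark dataset designed specifically to evaluate large language models on mathematical problem solving in telecommunications. It contains 500 question-answer pairs covering a wide range of topics. The evaluation finds that models explicitly designed for mathematical or logical reasoning perform best, while general-purpose models often struggle even with large parameter counts.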

Details

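Motivation: AI adoption in telecommunications is growing, yet the ability of LLMs to handle domain-specific, mathematically intensive tasks, in areas such as signal processing and network optimization, remains largely unexplored.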

Result: Models explicitly designed for mathematical or logical reasoning perform best; general-purpose models perform worse.

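Insight: Reasoning-specialized models outperform general-purpose models on these mathematically intensive telecom tasks, which points to the importance of specialization.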

Abstract: The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnAs generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease result reproducibility and support future research.


[145] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? cs.AI | cs.CL

Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang

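TL;DR: The paper introduces ToxiMol, the first benchmark task for molecular toxicity repair with general-purpose multimodal large language models (MLLMs), together with ToxiEval, an automated evaluation framework. Experiments show that current MLLMs still face significant challenges on the task but begin to show promise in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.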

Details

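Motivation: Toxicity is a leading cause of early-stage drug development failure, yet molecular toxicity repair has not been systematically defined or benchmarked. The study fills this gap by assessing how well general-purpose MLLMs handle molecular toxicity repair.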

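Result: Current MLLMs still face significant challenges in molecular toxicity repair but show promise in toxicity understanding, semantic constraint adherence, and structure-aware editing.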

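Insight: MLLMs show potential for molecular toxicity repair, but evaluation criteria, candidate diversity, and failure attribution analysis need further refinement.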

Abstract: Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.


cs.MA

[146] AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation cs.MA | cs.CV

Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu

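TL;DR: AniMaker is a multi-agent framework that generates globally consistent, story-coherent animation from text input through MCTS-driven clip generation and storytelling-aware clip selection.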

Details

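Motivation: Existing video generation methods struggle to produce coherent narrative videos spanning multiple scenes and characters, which results in disjointed narratives and pacing issues. AniMaker is designed to address these problems.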

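Result: AniMaker achieves superior quality on VBench and the proposed AniEval metrics while significantly improving the efficiency of multi-candidate generation.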

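Insight: Through collaboration among specialized agents and MCTS-inspired optimization, AniMaker balances global consistency with resource efficiency in text-to-animation generation.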

Abstract: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation’s logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker’s approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
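
The abstract describes MCTS-Gen only as MCTS-inspired and does not state its selection rule. For orientation, standard Monte Carlo Tree Search commonly chooses which node to expand next with the UCB1 score, which trades off a candidate's average reward against how rarely it has been tried. The sketch below shows that standard rule. The function names, the dictionary layout, and the constant `c` are illustrative assumptions and are not AniMaker's implementation.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # untried candidates are explored first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children):
    """Return the index of the child with the highest UCB1 score.

    `children` is a list of dicts with 'reward' (sum of rewards so far)
    and 'visits' (number of times the child was tried); this layout is
    hypothetical.
    """
    parent_visits = sum(ch["visits"] for ch in children) or 1
    scores = [ucb1(ch["reward"], ch["visits"], parent_visits) for ch in children]
    return max(range(len(children)), key=scores.__getitem__)
```

In a clip-generation setting, each child could stand for a candidate clip and the reward for a reviewer score, so compute is spent on promising candidates while untried ones still get sampled.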


eess.SY

[147] Energy Aware Camera Location Search Algorithm for Increasing Precision of Observation in Automated Manufacturing eess.SY | cs.CV | cs.SY | 93C85 (Primary), 93B52 (Secondary)

Rongfei Li, Francis Assadian

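TL;DR: The paper proposes an energy-aware camera location search algorithm that improves observation precision in automated manufacturing by finding the camera location that minimizes image noise under a limited energy budget.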

Details

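Motivation: In automated manufacturing environments, observation quality varies significantly with camera location, yet little research has examined the effect of camera placement. The paper addresses this by optimizing the camera location to improve observation precision.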

Result: Simulation results show that the algorithm improves observation precision with limited energy.

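Insight: Camera placement is critical to observation precision in automated manufacturing, and an energy-aware search strategy can optimize it efficiently when resources are constrained.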

Abstract: Visual servoing technology has been well developed and applied in many automated manufacturing tasks, especially in tools’ pose alignment. To access a full global view of tools, most applications adopt eye-to-hand configuration or eye-to-hand/eye-in-hand cooperation configuration in an automated manufacturing environment. Most research papers mainly put efforts into developing control and observation architectures in various scenarios, but few of them have discussed the importance of the camera’s location in eye-to-hand configuration. In a manufacturing environment, the quality of camera estimations may vary significantly from one observation location to another, as the combined effects of environmental conditions result in different noise levels of a single image shot at different locations. In this paper, we propose an algorithm for the camera’s moving policy so that it explores the camera workspace and searches for the optimal location where the images’ noise level is minimized. Also, this algorithm ensures the camera ends up at a suboptimal (if the optimal one is unreachable) location among the locations already searched, with limited energy available for moving the camera. Unlike a simple brute force approach, the algorithm enables the camera to explore space more efficiently by adapting the search policy from learning the environment. With the aid of an image averaging technique, this algorithm, in use of a solo camera, achieves the observation accuracy in eye-to-hand configurations to a desirable extent without filtering out high-frequency information in the original image. An automated manufacturing application has been simulated and the results show the success of this algorithm’s improvement of observation precision with limited energy.
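
The image averaging mentioned in the abstract rests on a standard fact: averaging N frames whose noise is independent and zero-mean reduces the noise standard deviation by roughly a factor of √N, without the blurring a spatial low-pass filter would cause. The sketch below demonstrates this on a synthetic one-dimensional "image". The Gaussian noise model, the helper names, and the frame sizes are assumptions for illustration and do not come from the paper.

```python
import random
import statistics

def noisy_frame(clean, sigma, rng):
    """Return a copy of `clean` with independent Gaussian noise added per pixel."""
    return [p + rng.gauss(0.0, sigma) for p in clean]

def average_frames(frames):
    """Pixel-wise mean of several equally sized frames."""
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]

def residual_std(frame, clean):
    """Standard deviation of the per-pixel error relative to the clean image."""
    return statistics.pstdev([f - c for f, c in zip(frame, clean)])
```

With 16 frames of unit-variance noise, the residual standard deviation of the averaged frame should come out near 1/4 of that of a single frame, subject to sampling variation.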


[148] Semi-Tensor-Product Based Convolutional Neural Networks eess.SY | cs.AI | cs.CV | cs.SY

Daizhan Cheng

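TL;DR: The paper proposes a new convolutional neural network (CNN) based on the semi-tensor product (STP). By combining a domain-based convolutional product (CP) with the STP of vectors, it avoids the junk information that padding introduces in conventional CNNs.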

Details

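Motivation: Convolution in conventional CNNs requires the vectors involved to have matching dimensions, and padding introduces useless information. The paper uses the flexibility of the STP to avoid padding and improve CNN performance.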

Result: The paper applies the method to image and third-order signal identification, where it avoids the interference introduced by padding.

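Insight: The flexibility of the semi-tensor product offers a new approach to convolution that handles inputs of different dimensions without padding, broadening the design space for CNNs.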

Abstract: The semi-tensor product (STP) of vectors is a generalization of the conventional inner product of vectors, which allows the factor vectors to be of different dimensions. This paper proposes a domain-based convolutional product (CP). Combining domain-based CP with STP of vectors, a new CP is proposed. Since there is no zero or any other padding, it can avoid the junk information caused by padding. Using it, the STP-based convolutional neural network (CNN) is developed. Its application to image and third-order signal identification is considered.
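
The abstract does not spell out how the STP of vectors is defined. One construction found in the literature on vectors of different dimensions, assumed here, expands both vectors to the common length t = lcm(m, n) by repeating each entry (a Kronecker product with an all-ones vector) and then takes the ordinary inner product. Some formulations also divide by t; that normalization is omitted in this sketch, and the paper's exact definition may differ. The function names are illustrative.

```python
from math import lcm

def expand(vec, factor):
    """Kronecker product of `vec` with an all-ones vector of length `factor`:
    each entry is repeated `factor` times."""
    return [v for v in vec for _ in range(factor)]

def cross_dim_inner(x, y):
    """Inner product of vectors with possibly different dimensions
    (assumed construction, unnormalized).

    Both vectors are expanded to length lcm(len(x), len(y)) before the
    ordinary inner product is taken, so no padding values are introduced.
    """
    t = lcm(len(x), len(y))
    xe = expand(x, t // len(x))
    ye = expand(y, t // len(y))
    return sum(a * b for a, b in zip(xe, ye))
```

When the two vectors have equal length, the expansion factor is 1 and the result reduces to the conventional inner product, which matches the abstract's description of the STP as a generalization.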