Table of Contents

cs.CV

[1] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models cs.CV | cs.AI | cs.CL | cs.GR | cs.MM

Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj

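TL;DR: This paper proposes a multimodal cinematic video synthesis method that combines text-to-image and audio generation models to automatically generate 60-second, high-quality cinematic videos, supporting resolutions up to 1024x768 and frame rates of 15-30 FPS.

Details

Motivation: Advances in generative AI have pushed the automation of multimedia creation, but an end-to-end cinematic video generation method that integrates text, image, and audio synthesis is still lacking.

Result: Experiments show that the method performs well in visual quality, narrative coherence, and efficiency, and is suited to creative, educational, and industrial applications.

Insight: The work demonstrates the potential of generative AI for multimodal video synthesis, achieving end-to-end cinematic content generation by integrating different models and optimizing the pipeline.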

Abstract: Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
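
The linear frame interpolation mentioned in the abstract can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists of floats; the names `lerp_frames` and `interpolate` are hypothetical and are not taken from the paper's code.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def lerp_frames(a: Frame, b: Frame, t: float) -> Frame:
    """Blend two equally sized frames: (1 - t) * a + t * b."""
    return [
        [(1.0 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def interpolate(a: Frame, b: Frame, n_between: int) -> List[Frame]:
    """Return a, then n_between evenly spaced blends, then b."""
    steps = n_between + 1
    return [lerp_frames(a, b, i / steps) for i in range(steps + 1)]


# Two toy 2x2 key frames (all black, all white).
key_a = [[0.0, 0.0], [0.0, 0.0]]
key_b = [[1.0, 1.0], [1.0, 1.0]]
clip = interpolate(key_a, key_b, n_between=3)  # 5 frames in total
```

Inserting more in-between frames raises the effective frame rate without generating additional images, which is presumably why a pipeline of this kind would use it.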


[2] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning cs.CV

Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang

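TL;DR: This paper proposes LoRA-Edit, a video editing method based on LoRA (Low-Rank Adaptation) fine-tuning. Mask-aware LoRA fine-tuning enables controllable first-frame-guided video editing, avoids the limitations of large-scale pretraining, preserves the background, and supports flexible edit propagation.

Details

Motivation: Current diffusion-based video editing methods rely on large-scale pretraining and lack flexibility for specific edits. First-frame-guided editing controls the first frame but offers too little control over subsequent frames. The paper aims to address this problem.

Result: Experiments show that the method outperforms existing methods on video editing tasks and produces high-quality edits.

Insight: By combining masks with LoRA, the method gives flexible control over video editing without modifying the model architecture, offering a new approach to efficient video editing.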

Abstract: Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edit propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.
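
As background for the LoRA tuning the abstract refers to, the following is a minimal sketch of a generic low-rank update on a single linear layer, y = W x + (alpha / r) B (A x). It does not reproduce the paper's mask-aware training; the function names and toy values are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Frozen W plus a scaled low-rank update: W x + (alpha / r) * B (A x)."""
    rank = len(a)  # A is r x d_in, B is d_out x r
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / rank) * d for y, d in zip(base, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2 x 2)
a = [[1.0, 1.0]]               # trainable factor, rank 1 (1 x 2)
b_zero = [[0.0], [0.0]]        # B initialised to zero: the update is a no-op
b_tuned = [[0.5], [-0.5]]      # hypothetical values after tuning
x = [2.0, 4.0]

y_init = lora_forward(w, a, b_zero, x, alpha=1.0)    # equals W x
y_tuned = lora_forward(w, a, b_tuned, x, alpha=1.0)  # W x shifted by the update
```

Only `a` and `b` would be trained, so the pretrained weight `w` and the model architecture stay unchanged, which matches the abstract's claim of adapting without altering the architecture.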


[3] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding cs.CV

Bin Guo, John H. L. Hansen

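TL;DR: DeepTraverse proposes a vision network architecture inspired by depth-first search. Recursive exploration and adaptive calibration modules build and refine features systematically, improving classification accuracy and feature discrimination.

Details

Motivation: Conventional vision backbones build features in a largely uniform way and lack the capacity for adaptive, iterative refinement. Inspired by classical search algorithms, the authors explore bringing an algorithmic, structured processing flow into vision networks to improve feature interpretability and reasoning ability.

Result: On multiple image classification benchmarks, DeepTraverse shows higher classification accuracy and feature discrimination, often outperforming conventional models with similar or larger parameter counts.

Insight: Introducing algorithmic priors such as search strategies into vision network design can yield more efficient, better-performing, and more structured vision backbones, and offers a new perspective on network interpretability and reasoning.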

Abstract: Conventional vision backbones, despite their success, often construct features through a largely uniform cascade of operations, offering limited explicit pathways for adaptive, iterative refinement. This raises a compelling question: can principles from classical search algorithms instill a more algorithmic, structured, and logical processing flow within these networks, leading to representations built through more interpretable, perhaps reasoning-like decision processes? We introduce DeepTraverse, a novel vision architecture directly inspired by algorithmic search strategies, enabling it to learn features through a process of systematic elucidation and adaptive refinement distinct from conventional approaches. DeepTraverse operationalizes this via two key synergistic components: recursive exploration modules that methodically deepen feature analysis along promising representational paths with parameter sharing for efficiency, and adaptive calibration modules that dynamically adjust feature salience based on evolving global context. The resulting algorithmic interplay allows DeepTraverse to intelligently construct and refine feature patterns. Comprehensive evaluations across a diverse suite of image classification benchmarks show that DeepTraverse achieves highly competitive classification accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts. Our work demonstrates that integrating such algorithmic priors provides a principled and effective strategy for building more efficient, performant, and structured vision backbones.


[4] Test-Time Adaptation for Generalizable Task Progress Estimation cs.CV | cs.AI | I.2.6; I.2.9; I.2.10

Christos Ziakas, Alessandra Russo

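TL;DR: This paper proposes a test-time adaptation method that optimizes a self-supervised objective so that a progress estimation model can adapt online to the visual and temporal context of test trajectories. The method uses a gradient-based meta-learning strategy, trained on expert visual trajectories and natural language task descriptions, to improve progress estimation that relies on semantic content rather than temporal order.

Details

Motivation: Conventional progress estimation methods perform poorly on out-of-distribution tasks, environments, and embodiments. The paper aims to provide a generalizable adaptation method that lets the model adapt dynamically to new scenarios at test time.

Result: Experiments show that the method outperforms the state-of-the-art in-context learning approach based on autoregressive vision-language models on out-of-distribution tasks, environments, and embodiments.

Insight: Combining meta-learning with test-time adaptation lets the model capture semantic content without relying on temporal order, improving generality and adaptability.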

Abstract: We propose a test-time adaptation method that enables a progress estimation model to adapt online to the visual and temporal context of test trajectories by optimizing a learned self-supervised objective. To this end, we introduce a gradient-based meta-learning strategy to train the model on expert visual trajectories and their natural language task descriptions, such that test-time adaptation improves progress estimation relying on semantic content over temporal order. Our test-time adaptation method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art in-context learning approach using autoregressive vision-language models.


[5] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models cs.CV

Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou

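TL;DR: EfficientVLA is a training-free inference acceleration framework that reduces the compute and memory overhead of VLA models. Through layer pruning, visual token selection, and feature caching, it markedly speeds up inference while maintaining performance.

Details

Motivation: Current VLA models, such as diffusion-based architectures, carry redundancy in their compute and memory demands, which limits practical deployment. Existing acceleration methods usually address only isolated inefficiencies and cannot optimize the whole VLA pipeline.

Result: On the CogACT model, the method achieves a 1.93x inference speedup and reduces FLOPs to 28.9%, with only a 0.6% drop in success rate on the SIMPLER benchmark.

Insight: Structured, training-free methods can markedly improve the efficiency of VLA models, making practical deployment more feasible.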

Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to CogACT, a standard VLA model, yielding a 1.93x inference speedup and reducing FLOPs to 28.9%, with only a 0.6% drop in success rate on the SIMPLER benchmark.
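
Strategy (3), caching and reusing intermediate features across iterative denoising steps, can be sketched as follows. The fixed refresh interval and the `expensive` / `cheap` callables are assumptions made for illustration; the abstract does not say how EfficientVLA decides when to refresh its cache.

```python
from typing import Callable, List, Optional


def run_with_cache(
    steps: int,
    refresh_every: int,
    expensive: Callable[[int], float],
    cheap: Callable[[float, int], float],
) -> List[float]:
    """Recompute the costly feature every `refresh_every` steps; reuse it otherwise."""
    cached: Optional[float] = None
    outputs: List[float] = []
    for t in range(steps):
        if cached is None or t % refresh_every == 0:
            cached = expensive(t)  # costly path, run only on refresh steps
        outputs.append(cheap(cached, t))
    return outputs


calls: List[int] = []


def expensive(t: int) -> float:
    calls.append(t)  # record when the costly path actually runs
    return float(t)


def cheap(feature: float, t: int) -> float:
    return feature + 0.1 * t  # stand-in for the lightweight per-step work


outputs = run_with_cache(steps=6, refresh_every=3, expensive=expensive, cheap=cheap)
```

In this toy run the costly function executes on two of the six steps, which is the kind of temporal redundancy the abstract describes removing.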


[6] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild cs.CV | cs.ET

Klim Kireev, Ana-Maria Creţu, Raphael Meier, Sarah Adel Bargal, Elissa Redmiles

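TL;DR: This paper releases ICCWD, which the authors believe to be the first image-caption dataset for detecting minors in a multimodal setting. It contains 10,000 manually annotated image-caption pairs for evaluating detection tools. Experiments show that existing methods still find the task challenging.

Details

Motivation: Datasets for detecting minors in a multimodal setting are lacking, while the law and platforms have a pressing need to regulate content depicting minors.

Result: The best detector achieves a true positive rate of 75.3%, indicating that the task is challenging.

Insight: A multimodal dataset helps in designing better minor-detection methods and fills a gap in existing research.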

Abstract: Platforms and the law regulate digital content depicting minors (defined as individuals under 18 years of age) differently from other types of content. Given the sheer amount of content that needs to be assessed, machine learning-based automation tools are commonly used to detect content depicting minors. To our knowledge, no dataset or benchmark currently exists for detecting these identification methods in a multi-modal environment. To fill this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an image-caption dataset aimed at benchmarking tools that detect depictions of minors. Our dataset is richer than previous child image datasets, containing images of children in a variety of contexts, including fictional depictions and partially visible bodies. ICCWD contains 10,000 image-caption pairs manually labeled to indicate the presence or absence of a child in the image. To demonstrate the possible utility of our dataset, we use it to benchmark three different detectors, including a commercial age estimation system applied to images. Our results suggest that child detection is a challenging task, with the best method achieving a 75.3% true positive rate. We hope the release of our dataset will aid in the design of better minor detection methods in a wide range of scenarios.
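
For reference, the true positive rate quoted above is TP / (TP + FN) over the images that actually contain a child. The sketch below uses made-up labels rather than ICCWD data, and the function name is hypothetical.

```python
from typing import Sequence


def true_positive_rate(labels: Sequence[int], preds: Sequence[int]) -> float:
    """TPR = TP / (TP + FN), with 1 = child present and 0 = child absent."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive examples in labels")
    return tp / (tp + fn)


labels = [1, 1, 1, 1, 0, 0]  # toy ground truth
preds = [1, 1, 1, 0, 0, 1]   # toy detector output
tpr = true_positive_rate(labels, preds)  # 3 of 4 positives found
```

A 75.3% true positive rate therefore means that roughly one in four images containing a child went undetected by the best method.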


[7] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers cs.CV | cs.AI | cs.LG

Natanael Lucena, Fábio S. da Silva, Ricardo Rios

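TL;DR: This paper compares Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on multi-class classification of images showing psoriasis and similar lesions. It finds that ViTs perform better with smaller models and recommends the DaViT-B architecture for automated detection.

Details

Motivation: The study explores the performance difference between CNNs and ViTs in medical image classification, specifically psoriasis recognition, to advance efficient automated diagnostic tools.

Result: ViTs outperform CNNs. DaViT-B performs best, with an F1 score of 96.4%.

Insight: Vision Transformers perform strongly with smaller models, indicating their potential for medical image classification.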

Abstract: This paper presents a comparison of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying images containing lesions of psoriasis and diseases similar to it. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an f1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.


[8] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs cs.CV | cs.LG

Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou

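TL;DR: The paper proposes the ViCrit task for fine-tuning vision-language models (VLMs) with reinforcement learning (RL). The task requires the model to localize an artificially injected visual description error, which improves visual perception. ViCrit balances task difficulty with verifiability and performs well on multiple VL benchmarks.

Details

Motivation: Existing RL tasks mostly target pure language models (e.g., math reasoning or code generation). Visual perception in VLMs lacks proxy tasks that are both challenging and unambiguously verifiable, which has hindered the use of RL in the visual domain.

Result: Experiments show that ViCrit-trained models improve substantially on multiple VL benchmarks and generalize beyond natural images to abstract reasoning and visual math tasks.

Insight: ViCrit promotes genuine visual perception rather than mere memorization of seen objects, indicating that fine-grained hallucination criticism is an effective objective for improving visual perception in VLMs.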

Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
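
The binary, exact-match reward described in the abstract can be sketched as below. The case and whitespace normalization is an assumption added for robustness; the abstract does not specify how spans are compared, and the function names are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match_reward(predicted_span: str, corrupted_span: str) -> int:
    """Return 1 if the predicted span equals the injected error, else 0."""
    return int(normalize(predicted_span) == normalize(corrupted_span))


# Toy example: the injected error is the phrase "red umbrella".
reward_hit = exact_match_reward("Red  umbrella", "red umbrella")
reward_miss = exact_match_reward("umbrella", "red umbrella")
```

Because the reward is a simple equality check, it needs no learned judge, which is what makes the task easy to verify.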


[9] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context cs.CV | physics.ao-ph

Yael Frischholz, Devis Tuia, Michael Lehning

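TL;DR: This paper proposes an attention-based method for surface solar radiation retrieval that implicitly learns to infer background reflectance from satellite image sequences, addressing the limitations of conventional methods in mountainous regions with dynamic snow cover.

Details

Motivation: Conventional retrieval algorithms estimate background reflectance from monthly statistics, but their performance degrades in mountainous regions because of dynamic snow cover. The paper aims to improve retrieval accuracy by learning background reflectance implicitly.

Result: Experiments show that, given sufficient temporal context, the model matches albedo-informed methods, with the effect strongest in mountainous regions.

Insight: Temporal context is crucial for implicitly learning background reflectance, especially where snow cover is dynamic. The model can capture and exploit latent surface reflectance dynamics, which improves generalization.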

Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont’s SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided a sufficiently long temporal context, the model matches the performances of albedo-informed models, highlighting the model’s ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at https://github.com/frischwood/HeMu-dev.git


[10] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling cs.CV

Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis

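TL;DR: This paper proposes efficient probing (EP), an attentive probing method that uses a multi-query cross-attention mechanism to cut redundant parameters and improve computational efficiency. It outperforms linear probing and conventional attentive probing on multiple benchmarks.

Details

Motivation: Because fine-tuning is impractical at scale, probing has become an important evaluation protocol for self-supervised learning. However, standard linear probing (LP) fails to reflect the potential of Masked Image Modeling (MIM) because of the distributed nature of patch tokens. This motivates attentive probing, but existing methods suffer from excessive parameterization and poor computational efficiency.

Result: EP outperforms LP and prior attentive probing methods across seven benchmarks, performs well in low-shot and layer-wise settings, and produces interpretable attention maps.

Insight: Attentive probing has more potential than linear probing for MIM-style models, and an efficient design can markedly improve both performance and computational efficiency.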

Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10x speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.


[11] Improving Personalized Search with Regularized Low-Rank Parameter Updates cs.CV

Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M. Rehg

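TL;DR: This paper improves personalized vision-language retrieval through regularized low-rank parameter updates, learning new concepts from a few examples while integrating personal and general knowledge. It achieves the best results on the DeepFashion2 and ConCon-Chi benchmarks.

Details

Motivation: Personalized vision-language retrieval must learn new concepts from a few examples while integrating personal and general knowledge. Existing methods struggle to balance the two.

Result: On the DeepFashion2 and ConCon-Chi benchmarks, personalized retrieval accuracy improves by 4%-22% over prior methods.

Insight: Low-rank parameter updates effectively balance learning new concepts against retaining general knowledge, and parameter addition is a viable strategy for combining multiple concepts.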

Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g. “my dog Fido”) from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaption of a small set of parameters in the language encoder’s final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.


[12] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators cs.CV | cs.AI | cs.LG

Parsa Rahimi, Sebastien Marcel

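TL;DR: ScoreMix proposes a diffusion-based data augmentation method that mixes the scores of different classes to generate challenging samples, markedly improving discriminator performance, especially when data is scarce.

Details

Motivation: The work addresses weak discriminative model performance in low-data settings by using the score compositional properties of diffusion models to generate challenging synthetic data.

Result: The method markedly improves discriminative model performance on multiple benchmarks, especially when data is limited.

Insight: The correlation between the discriminator's embedding space and the generator's condition space is low, and mixing classes that are distant in the discriminator's embedding space is the more effective strategy.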

Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator’s embedding space, rather than close in the generator’s condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator’s learned condition space and the discriminator’s embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: https://parsa-ra.github.io/scoremix
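
The convex mixing of class-conditioned scores can be sketched as a weighted sum applied at each sampling step. The function name, the weight `lam`, and the toy score vectors are assumptions; how ScoreMix schedules or selects the weight is not stated in the abstract.

```python
from typing import List


def mix_scores(score_a: List[float], score_b: List[float], lam: float) -> List[float]:
    """Convex combination lam * score_a + (1 - lam) * score_b, with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * sa + (1.0 - lam) * sb for sa, sb in zip(score_a, score_b)]


# Toy score estimates for two class conditions at one denoising step.
score_class_a = [1.0, -2.0]
score_class_b = [3.0, 2.0]
mixed = mix_scores(score_class_a, score_class_b, lam=0.5)
only_a = mix_scores(score_class_a, score_class_b, lam=1.0)
```

Setting `lam` to 0 or 1 recovers ordinary single-class sampling, so the mixed case can be read as an interpolation between two class-conditioned trajectories.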


[13] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops cs.CV

Hamid Kamangir, Mona Hajiesmaeeli, Mason Earles

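TL;DR: This paper proposes a multimodal deep learning model that combines satellite imagery, climate, evapotranspiration, and soil data for county-level yield forecasting of over 70 crops in California. The model reaches an R² score of 0.76 on the test dataset.

Details

Motivation: California is a global leader in agricultural production, but accurate crop yield forecasting remains challenging because of the complex environmental, climatic, and soil factors involved. Existing data is abundant but lacks effective integration and forecasting methods.

Result: On the test dataset, the model achieves an overall R² score of 0.76, showing strong predictive capability.

Insight: The study provides a useful tool for agricultural forecasting, climate adaptation, and precision farming, and the public dataset and code are valuable for further research.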

Abstract: California is a global leader in agricultural production, contributing 12.5% of the United States' total output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a timeseries encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall R² score of 0.76 across all crops on an unseen test dataset, highlighting strong predictive performance across California's diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.
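
The reported score is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it on made-up yields that are not drawn from the benchmark; the function name is hypothetical.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]   # made-up yields, not benchmark data
predicted = [1.0, 2.0, 3.0, 5.0]  # one prediction is off by 1.0
score = r2_score(observed, predicted)
perfect = r2_score(observed, observed)
```

A score of 0.76 thus means the model's squared error is 24% of the squared error of always predicting the mean yield.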


[14] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos cs.CV

Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli

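TL;DR: DySS proposes an efficient 3D object detection method based on state-space learning and dynamic queries. A state-space model and a dynamic query update mechanism deliver strong detection performance with efficient inference.

Details

Motivation: Conventional 3D object detection based on dense BEV features is computationally expensive, while sparse query-based methods still need many queries when processing multi-frame video, which makes them inefficient. DySS aims to solve these problems through state-space learning and dynamic queries.

Result: DySS reaches 65.31 NDS and 57.4 mAP on the nuScenes test split, and 56.2 NDS and 46.2 mAP on the val split, with an inference speed of 33 FPS.

Insight: With state-space learning and dynamic queries, DySS improves 3D object detection performance while cutting compute cost, offering an efficient solution for real-time perception tasks.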

Abstract: Camera-based 3D object detection in Bird’s Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.


[15] HalLoc: Token-level Localization of Hallucinations for Vision Language Models cs.CV

Eunkyu Park, Minyeong Kim, Gunhee Kim

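TL;DR: HalLoc is a token-level localization dataset for hallucinations in vision-language models (VLMs) that supports efficient, probabilistic hallucination detection. It contains 150K annotated samples and comes with a low-overhead baseline model.

Details

Motivation: Existing hallucination detection methods are computationally expensive and struggle with cases where the line between truthful and hallucinated information is unclear. HalLoc aims to address these problems.

Result: HalLoc supports the development of probabilistic detection models, and the baseline model can be integrated seamlessly into VLMs to improve reliability.

Insight: HalLoc offers a new direction for improving the trustworthiness of VLMs in real-world settings, achieving efficient hallucination localization through probabilistic detection and a low-overhead model.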

Abstract: Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.


[16] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation cs.CV | cs.AI

Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya

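TL;DR: This paper presents a comprehensive evaluation of skin cancer classification that combines transfer learning with uncertainty quantification (UQ), aiming to improve classification accuracy and the reliability of model outputs.

Details

Motivation: Early, accurate diagnosis of skin cancer is critical for treatment, but existing deep learning models are limited by data scarcity and a lack of uncertainty awareness, so both performance and trustworthiness need improvement.

Result: CLIP-based vision transformers (such as LAION CLIP ViT-H/14) combined with an SVM perform best. Ensemble methods balance accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions.

Insight: The paper stresses the importance of integrating UQ into medical diagnosis, which both improves performance and strengthens trust in the model in clinical use.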

Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, but their performance can be limited by data scarcity and a lack of uncertainty awareness. In this study, we present a comprehensive evaluation of DL-based skin lesion classification using transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the first phase, we benchmarked several pre-trained feature extractors, including Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50 (ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range of traditional classifiers such as Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and logistic regression. Our results show that CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, deliver the highest classification performance. In the second phase, we incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) to assess not only prediction accuracy but also the reliability of model outputs. We evaluated these models using uncertainty-aware metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen), uncertainty specificity (USpe), and uncertainty precision (UPre). The results demonstrate that ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions. This study highlights the importance of integrating UQ into DL-based medical diagnosis to enhance both performance and trustworthiness in real-world clinical applications.
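
Both Monte Carlo Dropout and ensembling produce several class-probability vectors per input, which are typically averaged before an uncertainty score is computed. The sketch below uses predictive entropy as that score. This is a common choice assumed here for illustration; it is not the paper's UAcc, USen, USpe, or UPre metrics, and the probabilities are made up.

```python
import math
from typing import List


def mean_probs(passes: List[List[float]]) -> List[float]:
    """Average class probabilities over stochastic passes or ensemble members."""
    n = len(passes)
    return [sum(p[c] for p in passes) / n for c in range(len(passes[0]))]


def predictive_entropy(probs: List[float]) -> float:
    """Entropy of the averaged distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two toy inputs, each with two passes over two classes.
agreeing = [[0.9, 0.1], [0.9, 0.1]]     # passes agree: low uncertainty
disagreeing = [[0.9, 0.1], [0.1, 0.9]]  # passes disagree: high uncertainty

h_agree = predictive_entropy(mean_probs(agreeing))
h_disagree = predictive_entropy(mean_probs(disagreeing))
```

Predictions whose entropy exceeds a chosen threshold could then be flagged for clinician review instead of being reported as confident diagnoses.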


[17] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework cs.CV | cs.AI | cs.LG

Sadia Kamal, Tim Oates, Joy Wan

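TL;DR: The paper proposes a weakly supervised multimodal framework that generates SOAP notes from lesion images and sparse clinical text, easing clinician burden and reducing reliance on large annotated datasets, with performance comparable to advanced models.

Details

Motivation: Skin cancer is the most common cancer worldwide. Writing SOAP notes by hand is time-consuming and contributes to clinician burnout, so automated solutions are needed.

Result: Performance on clinical relevance metrics is comparable to GPT-4o, Claude, and DeepSeek Janus Pro.

Insight: Combining weak supervision with multimodal inputs can be applied to structured text generation in healthcare and effectively reduces annotation cost.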

Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics MedConceptEval and Clinical Coherence Score (CCS) which assess semantic alignment with expert medical concepts and input features, respectively.


[18] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video cs.CV | eess.IV

Fei Zhao, Da Pan, Zelu Qi, Ping Shi

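TL;DR: The paper addresses audio-visual quality assessment (AVQA) for user-generated omnidirectional (360-degree) video (UGC-ODV) in the Metaverse, building a dataset and proposing a baseline model.

Details

Motivation: With the rise of the Metaverse, 360-degree video is shifting from professional-generated content (PGC) to user-generated content (UGC). However, research on AVQA for UGC-ODV remains limited.

Result: The proposed baseline model achieves the best performance on the constructed dataset.

Insight: AVQA for UGC-ODV needs to combine multimodal features, and quality assessment of user-generated content differs from that of professional content.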

Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset, consisting of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves the best performance on the proposed dataset.


[19] Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions cs.CV | cs.AIPDF

Deliang Wang, Chao Yang, Gaowei Chen

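TL;DR: This paper investigates using vision-language models (VLMs) with zero-shot prompting to detect students’ academic emotions in online learning environments. It finds that Qwen2.5-VL-7B-Instruct performs relatively well at recognizing confused expressions but poorly at detecting distracted behavior.

Details

Motivation: Traditional approaches to analyzing students’ academic emotions rely on supervised learning and generalize poorly. The emergence of vision-language models offers a new way to address this.

Result: Qwen2.5-VL-7B-Instruct performs relatively well at recognizing confused expressions, while both models perform poorly at detecting distracted behavior. Happy emotions are recognized best.

Insight: VLMs show some potential for academic emotion detection, but recognition of certain emotions (such as distraction) needs further improvement.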

Abstract: Students’ academic emotions significantly influence their social behavior and learning performance. Traditional approaches to automatically and accurately analyze these emotions have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students’ academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel in identifying students’ happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students’ confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.


[20] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting cs.CVPDF

Lintao Xiang, Hongpei Zheng, Yating Huang, Qijun Yang, Hujun Yin

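TL;DR: PointGS uses point attention-aware Gaussian splatting to achieve high-quality real-time rendering from sparse views.

Details

Motivation: Existing 3DGS methods need a large number of calibrated views to produce a consistent scene representation. With sparse inputs they tend to overfit the training views, which degrades rendering quality.

Result: The method significantly outperforms NeRF-based approaches across diverse benchmarks and is competitive with state-of-the-art 3DGS methods under few-shot settings.

Insight: A point attention mechanism and multiscale feature fusion are key to improving sparse-view rendering quality.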

Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training views, leading to noticeable degradation in rendering quality. To address this limitation, we propose a Point-wise Feature-Aware Gaussian Splatting framework that enables real-time, high-quality rendering from sparse training views. Specifically, we first employ the latest stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization. We then encode the colour attributes of each 3D Gaussian by sampling and aggregating multiscale 2D appearance features from sparse inputs. To enhance point-wise appearance representation, we design a point interaction network based on a self-attention mechanism, allowing each Gaussian point to interact with its nearest neighbors. These enriched features are subsequently decoded into Gaussian parameters through two lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive experiments on diverse benchmarks demonstrate that our method significantly outperforms NeRF-based approaches and achieves competitive performance under few-shot settings compared to the state-of-the-art 3DGS methods.


[21] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models cs.CV | cs.AIPDF

Jun Yin, Jing Zhong, Peilin Li, Pengyu Zeng, Miao Zhang

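TL;DR: The paper proposes UrbanSense, a multimodal research framework based on vision-language models for automated, scalable analysis of style differences in urban streetscapes, and demonstrates its potential for quantifying urban style evolution.

Details

Motivation: Urban culture and architectural styles vary significantly with geographical, historical, and socio-political factors, and conventional research methods are hard to standardize. The study aims to make urban form research more objective through a data-driven approach.

Result: Over 80% of generated descriptions pass the t-test (p < 0.05), and high Phi scores in subjective evaluations (0.912 for cities, 0.833 for periods) confirm the method’s ability to capture stylistic differences.

Insight: The framework provides a scientific basis for quantifying and interpreting urban style and offers data support for future design.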

Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method’s ability to capture subtle stylistic differences. These results highlight the method’s potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.


[22] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration cs.CVPDF

Mina C. Moghadam, Alan Q. Wang, Omer Taub, Martin R. Prince, Mert R. Sabuncu

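TL;DR: RealKeyMorph (RKM) is a resolution-agnostic image registration method that outputs keypoints in real-world coordinates, avoiding the resampling that conventional methods require.

Details

Motivation: In medical image registration, image resolutions can differ because of differences in acquisition parameters (such as pixel spacing and slice thickness). Conventional machine learning methods resample to a fixed resolution, which can introduce interpolation artifacts and hurt registration quality.

Result: Experiments show that RKM performs well on registration of orthogonal 2D stacks of abdominal MRIs and of 3D volumes in brain datasets.

Insight: By operating directly on the raw data (avoiding resampling), RKM improves registration accuracy and robustness, especially when resolutions are inconsistent.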

Abstract: Many real-world settings require registration of a pair of medical images that differ in spatial resolution, which may arise from differences in image acquisition parameters like pixel spacing, slice thickness, and field-of-view. However, all previous machine learning-based registration techniques resample images onto a fixed resolution. This is suboptimal because resampling can introduce artifacts due to interpolation. To address this, we present RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is an extension of KeyMorph, a registration framework which works by training a network to learn corresponding keypoints for a given pair of images, after which a closed-form keypoint matching step is used to derive the transformation that aligns them. To avoid resampling and enable operating on the raw data, RKM outputs keypoints in real-world coordinates of the scanner. To do this, we leverage the affine matrix produced by the scanner (e.g., MRI machine) that encodes the mapping from voxel coordinates to real world coordinates. By transforming keypoints into real-world space and integrating this into the training process, RKM effectively enables the extracted keypoints to be resolution-agnostic. In our experiments, we demonstrate the advantages of RKM on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as 3D volumes with varying resolutions in brain datasets.
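The voxel-to-world mapping described in the abstract can be illustrated with a minimal Python sketch. It only shows how a 4x4 affine turns voxel indices into real-world coordinates, so that the same anatomical point in two scans with different voxel spacing lands at the same location. The function name `voxel_to_world` and the two affine matrices are hypothetical values chosen for illustration, not taken from the paper or its code.

```python
def voxel_to_world(affine, voxel_points):
    """Map (i, j, k) voxel coordinates to real-world (x, y, z) using a 4x4 affine."""
    world = []
    for i, j, k in voxel_points:
        homog = (i, j, k, 1.0)  # homogeneous coordinates
        # Only the first three rows are needed; the last row is (0, 0, 0, 1).
        x, y, z = (sum(row[c] * homog[c] for c in range(4)) for row in affine[:3])
        world.append((x, y, z))
    return world


# Hypothetical affines: same origin, different voxel spacing (1 mm vs. 2x2x4 mm).
affine_fine = [[1.0, 0, 0, -10.0], [0, 1.0, 0, -20.0], [0, 0, 1.0, 5.0], [0, 0, 0, 1.0]]
affine_coarse = [[2.0, 0, 0, -10.0], [0, 2.0, 0, -20.0], [0, 0, 4.0, 5.0], [0, 0, 0, 1.0]]

# The same physical point, expressed in each scan's own voxel grid.
kp_fine = [(8.0, 12.0, 4.0)]
kp_coarse = [(4.0, 6.0, 1.0)]

print(voxel_to_world(affine_fine, kp_fine))      # [(-2.0, -8.0, 9.0)]
print(voxel_to_world(affine_coarse, kp_coarse))  # [(-2.0, -8.0, 9.0)]
```

Because both keypoints map to the same world coordinate, a keypoint-matching step working in this space would not need either image to be resampled onto a common grid.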


[23] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation cs.CVPDF

Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu

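TL;DR: Motion-R1 combines chain-of-thought reasoning with reinforcement learning to improve text-to-motion generation quality, addressing shortcomings of existing methods in semantic alignment and motion synthesis.

Details

Motivation: Existing methods fail to capture deep linguistic structure and logical reasoning, so generated motions lack controllability, consistency, and diversity. Motion-R1 aims to address these problems.

Result: Motion-R1 performs strongly on multiple benchmark datasets, outperforming existing methods particularly in semantic understanding and long-term temporal coherence.

Insight: Combining a chain-of-thought mechanism with reinforcement learning effectively improves semantic understanding and execution for complex motion generation.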

Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model’s ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
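For readers unfamiliar with Group Relative Policy Optimization, the sketch below shows only its group-relative advantage step as it is commonly described: several outputs are sampled for one prompt, and each reward is normalized by the group mean and standard deviation. The reward values are made up, and the use of the population standard deviation and the `eps` term are assumptions; the abstract does not specify Motion-R1's reward design or normalization details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical motion-quality rewards for four sampled outputs of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above the group mean receive positive advantages and are reinforced, while those below it are discouraged, so no separate value network is needed for the baseline.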


[24] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device cs.CVPDF

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

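TL;DR: FaceLiVT is a lightweight yet powerful face recognition model that combines a hybrid CNN-Transformer architecture with a novel lightweight Multi-Head Linear Attention (MHLA) mechanism, reducing computational complexity while maintaining high accuracy.

Details

Motivation: Achieve efficient, low-latency face recognition on mobile devices while balancing computational resources and model performance.

Result: FaceLiVT performs strongly on benchmarks such as LFW, with inference 8.6× faster than EdgeFace and 21.2× faster than a pure ViT model.

Insight: A hybrid architecture combined with a lightweight attention mechanism can significantly improve real-time face recognition on mobile devices.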

Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.


[25] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion cs.CVPDF

Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu

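TL;DR: The paper proposes FSATFusion, an end-to-end network for infrared and visible image fusion that incorporates a frequency-spatial attention Transformer module, significantly improving fusion quality and downstream task performance.

Details

Motivation: Existing methods are mostly CNN-based and limited in capturing global context, which leads to information loss and degrades fusion quality. The authors aim to improve this capability with Transformers and attention mechanisms.

Result: Experiments show that FSATFusion outperforms existing methods in fusion quality and efficiency and performs well in downstream tasks such as object detection.

Insight: Attention over both the frequency and spatial domains extracts features more effectively, and the improved Transformer module further strengthens global information capture.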

Abstract: Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications to verify the generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.


[26] Revisiting Transformers with Insights from Image Filtering cs.CV | cs.LGPDF

Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen

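TL;DR: This paper reinterprets the self-attention mechanism in Transformers from an image processing perspective. It proposes a unifying framework that explains the computation and the role of its components, and introduces two architectural modifications that improve both interpretability and task performance.

Details

Motivation: The success of self-attention lacks a theoretical explanation. Existing work has tried to understand it through image denoising and nonparametric regression but has not analyzed in depth the mechanisms of the components that enhance it. This paper aims to close that gap with an image processing framework.

Result: Experiments show that the image-processing-inspired modifications not only improve interpretability but also raise accuracy and robustness on language and vision tasks, with notably better long-sequence understanding.

Insight: Some design ideas in self-attention may trace back to image processing, and this cross-domain perspective helps in understanding and improving Transformer architectures.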

Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. Building upon our framework, we also pinpoint potential distinctions between the two concepts and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long sequence understanding.


[27] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial cs.CVPDF

Jerry Yan, Chinmay Talegaonkar, Nicholas Antipa, Eric Terrill, Sophia Merrifield

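TL;DR: PoseIDON is a computer vision pipeline that combines deep foundation model features with multiview photogrammetry to estimate the six-degrees-of-freedom pose of seafloor objects, and infers burial depth by aligning CAD models.

Details

Motivation: Accurately estimating the burial state of anthropogenic objects on the seafloor is critical for studying local sedimentation dynamics and for assessing ecological risk and pollutant transport, but it is difficult with conventional methods because of occlusion, poor visibility, and object degradation.

Result: On a test of 54 objects, the mean burial depth error is about 10 centimeters, and the method captures spatial patterns of the underlying sediment transport.

Insight: The method offers a non-invasive, scalable approach to mapping seafloor burial, suited to environmental contamination assessment and to planning recovery strategies for hazardous objects.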

Abstract: The burial state of anthropogenic objects on the seafloor provides insight into localized sedimentation dynamics and is also critical for assessing ecological risks, potential pollutant transport, and the viability of recovery or mitigation strategies for hazardous materials such as munitions. Accurate burial depth estimation from remote imagery remains difficult due to partial occlusion, poor visibility, and object degradation. This work introduces a computer vision pipeline, called PoseIDON, which combines deep foundation model features with multiview photogrammetry to estimate six degrees of freedom object pose and the orientation of the surrounding seafloor from ROV video. Burial depth is inferred by aligning CAD models of the objects with observed imagery and fitting a local planar approximation of the seafloor. The method is validated using footage of 54 objects, including barrels and munitions, recorded at a historic ocean dumpsite in the San Pedro Basin. The model achieves a mean burial depth error of approximately 10 centimeters and resolves spatial burial patterns that reflect underlying sediment transport processes. This approach enables scalable, non-invasive mapping of seafloor burial and supports environmental assessment at contaminated sites.


[28] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba cs.CVPDF

Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin

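TL;DR: The paper proposes DART, a dynamic adaptive region tokenizer that adaptively partitions images into patches of varying sizes, improving the performance of Vision Transformer and Mamba models while reducing computational cost.

Details

Motivation: Non-convolutional models such as Vision Transformer and Mamba rely on fixed-size image patches, which over-encodes background regions and misses critical local details, especially when information is sparsely distributed.

Result: Accuracy improves by 2.1% on DeiT (ImageNet-1K), and a 45% reduction in floating-point operations (FLOPs) is reported relative to uniformly increasing token density. Experiments on DeiT, Vim, and VideoMamba confirm its consistency and efficiency.

Insight: DART offers a more efficient alternative to uniformly increasing token density, improving performance while reducing computational cost, and suits scenes with unevenly distributed information.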

Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
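To make the idea of score-driven, variable-size patches concrete, here is a deliberately simplified 1D sketch in Python. It places patch boundaries at equal-mass quantiles of a per-cell score, so regions with high scores receive narrower patches. This is a non-differentiable illustration under stated assumptions, not DART's piecewise differentiable formulation; the function name `quantile_boundaries` and the score values are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def quantile_boundaries(scores, num_patches):
    """Split cells into `num_patches` spans holding roughly equal total score.

    Returns boundary indices; patch p covers cells [b[p], b[p + 1]).
    Note: if one cell dominates the total, adjacent boundaries can coincide.
    """
    cumulative = list(accumulate(scores))
    total = cumulative[-1]
    inner = [
        bisect_left(cumulative, total * k / num_patches) + 1
        for k in range(1, num_patches)
    ]
    return [0] + inner + [len(scores)]


# Hypothetical region scores: low on background cells, high on an object.
scores = [1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 1, 1]
boundaries = quantile_boundaries(scores, num_patches=4)
print(boundaries)  # [0, 7, 8, 9, 12]

widths = [b - a for a, b in zip(boundaries, boundaries[1:])]
print(widths)      # [7, 1, 1, 3]
```

In this toy case the low-score background is covered by one wide patch, while the high-score cells get single-cell patches, which mirrors the abstract's goal of spending tokens where the information is.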


[29] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion cs.CVPDF

Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang

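TL;DR: ReconMOST is a data-driven guided diffusion framework for multi-layer sea temperature reconstruction. It pre-trains the model on historical numerical simulation data and uses sparse in-situ observations to guide the reverse diffusion process, yielding accurate and physically consistent reconstructions.

Details

Motivation: Conventional ocean temperature reconstruction is limited by sparse data, algorithmic complexity, and high computational cost, while existing machine learning methods typically apply only to the sea surface or local regions and struggle with issues such as cloud occlusion. A more efficient, global, multi-layer reconstruction method is needed.

Result: On CMIP6 and EN4 analysis data, MSE reaches 0.049 (guidance), 0.680 (reconstruction), and 0.633 (total), and reconstruction accuracy and resolution are maintained even with 92.5% of the data missing.

Insight: Diffusion models show promise for ocean science: combining physically consistent pre-training with sparse observations enables accurate global reconstruction while overcoming limitations of conventional ML methods.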

Abstract: Accurate reconstruction of the ocean is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the growing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based SST reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) reaches 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at https://github.com/norsheep/ReconMOST.
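The three MSE figures can be read as errors over different subsets of grid points. The Python sketch below assumes, as one plausible interpretation that the abstract does not spell out, that "guidance" covers observed points, "reconstruction" covers unobserved points, and "total" covers all points. The temperature values and the helper `masked_mse` are hypothetical.

```python
def masked_mse(pred, truth, mask):
    """Mean squared error over the entries where `mask` is True."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, truth, mask) if m]
    return sum(errs) / len(errs)


# Toy temperature field with one observed point out of four (75% missing).
truth = [10.0, 12.0, 14.0, 16.0]
pred = [10.0, 12.5, 13.0, 17.0]
observed = [True, False, False, False]
unobserved = [not m for m in observed]
everything = [True] * len(truth)

mse_guidance = masked_mse(pred, truth, observed)          # 0.0
mse_reconstruction = masked_mse(pred, truth, unobserved)  # 0.75
mse_total = masked_mse(pred, truth, everything)           # 0.5625

# Under this reading, the total is the point-weighted mix of the two subsets.
frac_observed = sum(observed) / len(observed)
mixed = frac_observed * mse_guidance + (1 - frac_observed) * mse_reconstruction
print(mse_guidance, mse_reconstruction, mse_total, mixed)
```

Under the same assumption, a 7.5% observed and 92.5% missing split gives 0.075 × 0.049 + 0.925 × 0.680 ≈ 0.633, which agrees with the reported total, although the abstract does not state that the figures were computed this way.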


[30] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation cs.CV | cs.AIPDF

Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang

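TL;DR: Pisces is an auto-regressive multimodal foundation model that uses a decoupled visual encoding architecture and training techniques tailored to multimodal generation to close the performance gap between image understanding and generation, performing well on both.

Details

Motivation: Existing unified multimodal models often underperform specialized models on image understanding and generation, mainly because of differences in the visual features and training processes each task requires. Pisces aims to provide a unified framework that addresses this.

Result: Pisces performs strongly on more than 20 image understanding benchmarks and on the GenEval image generation benchmark, confirming its competitiveness on both tasks.

Insight: Image understanding and generation are synergistic, and a decoupled visual encoding architecture effectively improves unified multimodal models.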

Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.


[31] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment cs.CVPDF

Shuo Wang, Jihao Zhang

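TL;DR: MF2Summ is a multimodal fusion approach to video summarization that combines visual and auditory information and improves summarization through a cross-modal Transformer and a temporal alignment mechanism. It outperforms existing methods on the SumMe and TVSum datasets.

Details

Motivation: Traditional video summarization methods usually rely on visual information alone and cannot fully exploit a video's multimodal semantics. This paper aims to improve summarization performance and semantic richness by fusing visual and auditory information.

Result: On SumMe and TVSum, F1 scores improve by 1.9% and 0.6% respectively over DSNet, and the model compares favorably with other state-of-the-art methods.

Insight: Multimodal fusion and temporal alignment are key to better video summarization. Auditory information supplies complementary semantics to the visual stream and makes summaries more complete.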

Abstract: The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
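The Non-Maximum Suppression step mentioned in the abstract can be sketched for temporal segments as follows. This is a generic greedy 1D NMS in Python, not MF2Summ's implementation; the segment boundaries, scores, and the 0.5 IoU threshold are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring segments, dropping heavy overlaps."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical predicted segments (in frames) with importance scores.
segments = [(0, 10), (2, 11), (20, 30), (21, 29)]
scores = [0.9, 0.8, 0.6, 0.7]
print(temporal_nms(segments, scores))  # [0, 3]
```

Segment 1 is suppressed because it overlaps the higher-scoring segment 0, and segment 2 is suppressed by segment 3, leaving one candidate per region for the subsequent key shot selection.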


[32] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts cs.CV | cs.CL | cs.LG | cs.MMPDF

Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen

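TL;DR: The paper proposes CIDer, a robust multimodal emotion recognition framework that tackles missing modalities and distribution shift through a model-specific self-distillation module and a model-agnostic causal inference module.

Details

Motivation: Multimodal emotion recognition performs poorly with missing modalities and out-of-distribution data, and existing methods depend too heavily on specific models or add too many parameters. CIDer aims to address these problems.

Result: CIDer performs well on both RMFM and OOD tasks, with fewer parameters and faster training.

Insight: Combining self-distillation and causal inference can address missing modalities and distribution shift together without many additional parameters.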

Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.


[33] Rethinking Generative Human Video Coding with Implicit Motion Transformation cs.CV | eess.IVPDF

Bolin Chen, Ru-Ling Liao, Jie Chen, Yan Ye

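TL;DR: This paper proposes a generative human video coding approach based on Implicit Motion Transformation (IMT), addressing the distortion and inaccurate motion that conventional explicit motion fields produce for complex human motion.

Details

Motivation: Generative human video coding based on explicit motion handles complex, diverse motion patterns poorly, causing distorted reconstructions and inaccurate motion.

Result: Experiments show that IMT achieves high-efficiency compression and high-fidelity synthesis.

Insight: Implicit motion transformation handles complex human motion more effectively and avoids the limitations of explicit motion fields.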

Abstract: Beyond traditional hybrid video codecs, generative video codecs can achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns: when explicit motion guidance is used for Generative Human Video Coding (GHVC), the reconstruction results can suffer from severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation (IMT). In particular, we propose to characterize the complex human body signal as compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which enables GHVC to achieve high-efficiency compression and high-fidelity synthesis.


[34] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models cs.CVPDF

Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang

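TL;DR: MedSeg-R is an end-to-end framework built on multimodal large language models (MLLMs) for medical image reasoning segmentation, able to produce precise segmentation masks and understand complex clinical instructions.

Details

Motivation: Existing medical image segmentation models rely on explicit human instructions and lack active reasoning, so they cannot handle complex clinical questions. Multimodal large language models perform well on medical QA tasks but struggle to generate precise segmentation masks.

Result: Experiments show that MedSeg-R performs strongly on several benchmarks, with high segmentation accuracy and support for interpretable textual analysis of medical images.

Insight: Combining the reasoning ability of large language models with pixel-level segmentation enables both understanding of complex instructions and precise segmentation of medical images.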

Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also producing the corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R’s superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.


[35] LLMs Are Not Yet Ready for Deepfake Image Detection cs.CVPDF

Shahroz Tariq, David Nguyen, M. A. P. Chamikara, Tingmin Wu, Alsharif Abuadbba

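TL;DR: Through zero-shot evaluation, the paper studies how four large vision-language models (VLMs) perform at deepfake image detection. Although they can produce plausible explanations and spot surface-level anomalies, they are not yet suitable as standalone detection systems. The models are easily misled by style but show promise in interpretability and contextual analysis, making them suitable as part of hybrid or human-in-the-loop frameworks.

Details

Motivation: As deepfake technology advances rapidly, maintaining media credibility and public trust faces major challenges. Vision-language models (VLMs) are seen as a possible solution because of their potential across many domains, but their actual performance in deepfake detection is unclear. This study aims to fill that gap.

Result: 1. VLMs can produce plausible explanations and identify surface-level anomalies but cannot serve as standalone detection tools; 2. the models perform poorly when misled by style; 3. they stand out in interpretability and contextual analysis and are suited to assisting human experts.

Insight: Although general-purpose models cannot handle deepfake detection on their own, their strengths in interpretability and contextual analysis make them suitable components of hybrid or human-in-the-loop frameworks, and performance may improve when combined with domain expertise.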

Abstract: The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model’s classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.


[36] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation cs.CV | cs.AIPDF

Shuyang Li, Shuang Wang, Zhuangzhuang Sun, Jing Xiao

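TL;DR: The paper proposes PSLG-SAM, a framework that decomposes reference remote sensing image segmentation into two stages, coarse localization and fine segmentation, using a visual grounding network and the Segment Anything Model (SAM); it significantly improves performance while reducing the annotation burden.

Details

Motivation: Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads, but they face dense annotation requirements and difficulty interpreting complex scenes. The paper addresses these problems through a two-stage decomposition.

Result: On two datasets, PSLG-SAM significantly outperforms existing methods, validating its effectiveness.

Insight: Combining task decomposition with SAM can significantly reduce the annotation burden and improve segmentation accuracy in complex scenes.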

Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, which has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges like dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text-described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be train-free, significantly reducing the annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows for focusing on specific region segmentation, avoiding interference from complex scenes. We further contribute a high-quality, multi-category manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art models. Our code will be made publicly available.


[37] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft cs.CVPDF

Jin Huang, Mingqiang Wei, Zikuan Li, Hangyu Qu, Wei Zhao

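TL;DR: J-DDL is an intelligent system for detecting and localizing surface damage on fighter aircraft. It combines 2D images with 3D point clouds, optimizes the YOLO architecture, and introduces new modules, significantly improving detection efficiency and accuracy.

Details

Motivation: The scale and complexity of fighter aircraft surface inspection make manual inspection inefficient and inconsistent, so an automated solution is needed.

Result: Experiments validate J-DDL's effectiveness for damage detection and localization, significantly outperforming traditional methods.

Insight: Combining 2D and 3D data enables more comprehensive surface inspection; lightweight modules and a new loss function are key to improving detection efficiency.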

Abstract: Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.
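The abstract names an Inner-CIOU loss but does not define it. The sketch below is only an illustration of the general "inner" idea as it is commonly described: overlap is measured on auxiliary boxes shrunk (or enlarged) around each box center by a ratio. The function name `inner_iou`, the `ratio` parameter, and its default are assumptions, and the additional CIoU penalty terms (center distance, aspect ratio) are deliberately left out.

```python
def inner_iou(box_a, box_b, ratio=0.75):
    """IoU measured on auxiliary boxes scaled by `ratio` around each box center.

    Boxes are (x1, y1, x2, y2). This is a hypothetical sketch, not the
    J-DDL implementation.
    """
    def scaled(box):
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        half_w = (x2 - x1) * ratio / 2.0
        half_h = (y2 - y1) * ratio / 2.0
        return cx - half_w, cy - half_h, cx + half_w, cy + half_h

    ax1, ay1, ax2, ay2 = scaled(box_a)
    bx1, by1, bx2, by2 = scaled(box_b)

    # Intersection of the two auxiliary boxes (zero if they do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

With `ratio=1.0` the function reduces to ordinary IoU; a training loss would typically use `1 - inner_iou(...)` plus whatever penalty terms the chosen variant adds.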


[38] CogStream: Context-guided Streaming Video Question Answering cs.CV | cs.AIPDF

Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin

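TL;DR: The paper introduces a new task, CogStream, which targets multimodal reasoning in streaming video scenarios, together with an efficient method and an accompanying dataset.

Details

Motivation: Conventional video large language models incur a heavy computational burden and suffer interference from irrelevant context when processing streaming video.

Result: Experiments demonstrate the effectiveness of the method.

Insight: Multimodal reasoning over streaming video requires efficient use of relevant context while avoiding interference from irrelevant data.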

Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.
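The idea of retrieving only the most relevant historical context, rather than feeding everything to the model, can be illustrated with a simple embedding lookup. This is a minimal sketch that assumes each past dialogue turn and the current question are already embedded as vectors; the function name `retrieve_relevant_context` and the use of cosine similarity are assumptions, not CogReasoner's actual retrieval mechanism.

```python
import numpy as np

def retrieve_relevant_context(query_vec, history_vecs, k=2):
    """Return indices and cosine similarities of the k history items closest to the query."""
    q = np.asarray(query_vec, dtype=float)
    h = np.asarray(history_vecs, dtype=float)

    # Normalize so that the dot product equals cosine similarity.
    q = q / (np.linalg.norm(q) + 1e-12)
    h = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-12)

    sims = h @ q
    order = np.argsort(-sims)[:k]  # most similar first
    return order.tolist(), sims[order].tolist()
```

Only the selected turns would then be passed to the video language model alongside the current stream, which keeps the prompt short and drops unrelated history.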


[39] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations cs.CV | cs.AI | cs.ETPDF

Yutong Zhou, Masahiro Ryo

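TL;DR: The paper proposes an end-to-end visual-to-causal framework that turns a species image into an interpretable causal analysis of its habitat preference, combining species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction.

Details

Motivation: Understanding why a species lives in a particular place is essential for ecological research and biodiversity conservation, but existing ecological workflows are fragmented and hard for non-experts to use.

Result: Case studies on a bee and a flower species show the framework's potential, indicating that it can generate statistically grounded, human-readable habitat explanations.

Insight: Combining multimodal AI with causal inference can give ecological research more intuitive and interpretable tools, and in particular helps non-specialist users understand species habitat preferences.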

Abstract: Explaining why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of the multimodal AI assistant backed up by a recommended ecological modeling practice for describing species habitat in human-understandable language.


[40] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics cs.CV | cs.AIPDF

Imanol Solano, Julian Fierrez, Aythami Morales, Alejandro Peña, Ruben Tolosana

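TL;DR: The paper proposes the Comprehensive Equity Index (CEI), a new metric for detecting demographic bias in face recognition systems. By analyzing genuine and impostor score distributions separately and focusing on the distribution tails, CEI markedly improves the ability to detect subtle bias.

Details

Motivation: Existing metrics struggle to detect subtle demographic bias in high-performance face recognition systems, especially in the tails of the score distributions. A more sensitive measure is therefore needed to expose these hidden biases.

Result: Experiments show that CEI effectively detects subtle biases that earlier methods miss, especially in the distribution tails. CEI^A further improves practicality and objectivity.

Insight: The core innovation of CEI is its separate analysis of the distribution tails, which is especially important for detecting subtle bias. The approach applies beyond face recognition to other problems that require analyzing distribution tails.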

Abstract: Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI’s superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.
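The abstract does not give the CEI formula, so the following is not CEI. It is only a toy illustration of why looking at tails separately can matter: two groups can have nearly identical bulk score distributions while differing in how much probability mass falls in the low tail. The function name `tail_gap` and the pooled-quantile threshold are assumptions made for this sketch.

```python
import numpy as np

def tail_gap(scores_a, scores_b, q=0.05):
    """Difference in lower-tail mass between two groups of scores.

    The tail threshold is the q-quantile of the pooled scores. A positive
    value means group A has more mass in the low tail than group B.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    threshold = np.quantile(np.concatenate([a, b]), q)
    return float(np.mean(a <= threshold) - np.mean(b <= threshold))
```

For genuine comparison scores, for example, a group with a heavier low tail would see more false rejections at a strict operating threshold even if its mean score is close to that of other groups.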


[41] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers cs.CV | cs.AIPDF

Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang

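TL;DR: This paper proposes DreamActor-H1, a Diffusion Transformer (DiT)-based framework for generating high-fidelity human-product demonstration videos, addressing the failure of existing methods to preserve human and product identities and their spatial relationships.

Details

Motivation: In e-commerce and digital marketing, high-quality human-product demonstration videos are essential for product presentation. Existing methods often fail to preserve the identities of both the human and the product, or lack an understanding of their spatial relationships.

Result: Trained on a hybrid dataset, the method outperforms existing techniques in identity preservation and motion generation, producing more realistic demonstrations.

Insight: Combining a 3D template with semantic encoding can significantly improve the fidelity of generated videos and the naturalness of interactions, offering a practical solution for e-commerce applications.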

Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.


[42] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration cs.CVPDF

Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, Yulan He

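TL;DR: The paper proposes PLACE, a new framework that improves medical visual representation learning through pathological-level cross-modal alignment and correlation exploration, without additional human annotations.

Details

Motivation: Learning from image-report pairs in the medical domain is challenged by complex semantics and lengthy reports; existing methods mostly focus on instance-level or token-level alignment and neglect pathological-level semantic consistency.

Result: The framework achieves state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.

Insight: Pathological-level alignment and correlation exploration are key to improving medical visual representation learning, and they do not depend on external disease annotations.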

Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.


[43] DanceChat: Large Language Model-Guided Music-to-Dance Generation cs.CV | cs.MM | cs.SD | eess.ASPDF

Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh

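TL;DR: DanceChat is a large language model (LLM)-guided music-to-dance generation method that provides explicit dance guidance through textual instructions, addressing the semantic gap between music and dance.

Details

Motivation: Music offers only abstract cues (such as melody, rhythm, and emotion) that are hard to map directly to concrete dance movements, and paired music-dance data are scarce, which limits a model's ability to learn diverse dance patterns.

Result: On the AIST++ dataset and in human evaluations, DanceChat outperforms existing methods both qualitatively and quantitatively.

Insight: Textual instructions from an LLM can explicitly bridge the semantic gap between music and dance, significantly improving the diversity and alignment of generated motion.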

Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model's ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.


[44] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning cs.CVPDF

Chun-Mei Feng, Kai Yu, Xinxing Xu, Salman Khan, Rick Siow Mong Goh

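TL;DR: The paper proposes T2I-PAL, a new method that uses text-to-image generation models to reduce the modality gap and combines prompt tuning with adapter learning to improve multi-label image recognition.

Details

Motivation: Existing CLIP-based text-image contrastive learning methods suffer from a modality gap that limits multi-label image recognition performance. T2I-PAL aims to solve this by generating diverse, realistic images and aggregating local features.

Result: On multiple benchmarks (MS-COCO, VOC2007, and NUS-WIDE), T2I-PAL improves recognition performance by 3.47% on average over the previous best methods.

Insight: Generating realistic images and refining local feature representations effectively reduces the modality gap and improves multi-label recognition, while reducing reliance on fully annotated data.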

Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average above the top-ranked state-of-the-art methods.


[45] Rethinking Random Masking in Self Distillation on ViT cs.CVPDF

Jihyeon Seong, Hyunkyung Han

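TL;DR: The paper examines the role of random masking in self-distillation frameworks such as DINO and proposes an asymmetric masking strategy that masks only the student's global view while leaving the student's local views and the teacher's global view unmasked.

Details

Motivation: Studies have shown that random masking may inadvertently remove critical semantic information, so more informed masking strategies are needed.

Result: Experiments on the mini-ImageNet dataset show that the method yields more robust and fine-grained attention maps and improves downstream performance.

Insight: An asymmetric masking strategy can balance training efficiency against preservation of semantic information, improving the performance of self-distillation frameworks.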

Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student’s global view, while preserving the student’s local views and the teacher’s global view in their original, unmasked forms. This design leverages DINO’s multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
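The asymmetric setup described in the abstract can be sketched at the level of patch tokens: only the student's copy of the global view is masked, while the teacher's global view and the student's local views pass through untouched. The function names, the 30% mask ratio, and the choice to zero out masked tokens (rather than drop them or substitute a learned mask token) are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mask_patches(tokens, mask_ratio, rng):
    """Zero out a random subset of patch tokens; return the masked copy and the mask."""
    n = tokens.shape[0]
    n_mask = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    masked = tokens.copy()
    masked[mask] = 0.0
    return masked, mask

def build_views(global_tokens, local_tokens_list, mask_ratio=0.3, seed=0):
    """Assemble teacher and student inputs with masking applied only to the student's global view."""
    rng = np.random.default_rng(seed)
    student_global, mask = mask_patches(global_tokens, mask_ratio, rng)
    return {
        "teacher_global": global_tokens,             # unmasked: clean supervision
        "student_global": student_global,            # masked: robustness signal
        "student_locals": list(local_tokens_list),   # unmasked local crops
        "mask": mask,
    }
```

The teacher therefore always sees the full global view, so the distillation target stays clean while the student has to match it from partial input.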


[46] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement cs.CVPDF

Jin Huang, Honghua Chen, Mingqiang Wei

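TL;DR: The paper proposes HEA-MM, a hierarchical error assessment framework for CAD models within an aircraft manufacturing-and-measurement platform. It analyzes errors at three levels (global, part, and feature) and, using structured light scanners and an optimization-based method, achieves high-precision assessment of aircraft workpieces.

Details

Motivation: Aviation equipment demands very high quality (high performance, high stability, and high reliability). Existing methods for assessing manufacturing errors against CAD models lack hierarchical analysis, so a more comprehensive error assessment framework is needed.

Result: Experiments show that HEA-MM is efficient and accurate on a variety of aircraft CAD models and provides reliable error assessment for manufacturing and measurement.

Insight: Hierarchical error analysis captures manufacturing errors more comprehensively; the optimization-based method offers a new approach to analyzing point cloud regions; and the two-stage circular hole detection algorithm improves the precision of feature-level analysis.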

Abstract: The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on these patches underlying the point clouds. We propose a novel optimization-based primitive refinement method to obtain a set of meaningful patches of point clouds. Two basic operations, splitting and merging, are introduced to refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate it, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
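Feature-level error analysis on a circular hole ultimately compares a fitted circle with its nominal CAD definition. The sketch below uses a plain algebraic least-squares circle fit on 2D edge points as a stand-in; it does not implement the paper's tensor-voting edge detection or its hypothesize-and-clusterize fitting, and the names `fit_circle` and `hole_error` are hypothetical.

```python
import numpy as np

def fit_circle(points):
    """Algebraic least-squares circle fit; returns (cx, cy, r).

    Solves x^2 + y^2 = 2*cx*x + 2*cy*y + c, where c = r^2 - cx^2 - cy^2.
    """
    p = np.asarray(points, dtype=float)
    A = np.column_stack([2.0 * p[:, 0], 2.0 * p[:, 1], np.ones(len(p))])
    rhs = (p ** 2).sum(axis=1)
    (cx, cy, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    return float(cx), float(cy), float(r)

def hole_error(points, nominal_center, nominal_radius):
    """Center offset and radius deviation of a measured hole against its nominal values."""
    cx, cy, r = fit_circle(points)
    center_dev = float(np.hypot(cx - nominal_center[0], cy - nominal_center[1]))
    return center_dev, r - nominal_radius
```

In practice the edge points would first be projected into the plane of the hole, and a robust fitting scheme would be needed when several holes or outliers are present in the same region.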


[47] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection cs.CVPDF

Xinyuan Liu, Hang Xu, Yike Ma, Yucheng Zhang, Feng Dai

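TL;DR: The paper proposes SSP (Semantic-decoupled Spatial Partition), a unified framework that addresses sample assignment and instance confusion in point-supervised oriented object detection, significantly improving detection performance.

Details

Motivation: Advances in remote sensing have sharply increased the volume of imagery, but oriented object detection in high-density scenes is limited by the need for extensive manual annotation. Point supervision is cheap, but existing methods suffer from inadequate sample assignment and instance confusion because of their fixed rule-based designs.

Result: On DOTA-v1.0 and other datasets, SSP reaches 45.78% mAP under point supervision, 4.10% higher than the SOTA method PointOBB-v2. Combined with ORCNN and ReDet, it reaches 47.86% and 48.50% mAP, respectively.

Insight: By combining rule-driven and data-driven approaches, SSP effectively solves the sample assignment problem under point supervision and offers an efficient solution for oriented object detection in high-density scenes.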

Abstract: Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, which nevertheless remains hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.


[48] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model cs.CVPDF

Eshan Ramesh, Nishio Takayuki

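TL;DR: LatentCSI is a new method that uses a pretrained latent diffusion model (LDM) to generate images of the physical environment from WiFi CSI measurements; by mapping CSI amplitudes directly into the latent space, it improves computational efficiency and image quality.

Details

Motivation: Traditional approaches such as GANs suffer from high computational complexity and poor image quality when generating images from WiFi CSI. LatentCSI aims to solve these problems with a pretrained LDM, enabling efficient, high-quality image generation.

Result: Validated on a self-collected dataset of WiFi device and camera data and on the MM-Fi dataset, LatentCSI outperforms baselines in both computational efficiency and perceptual quality, and supports text-guided control.

Insight: By bypassing conventional pixel-space generation and the explicit encoding stage, LatentCSI shows the potential of latent diffusion models for cross-modal tasks such as WiFi CSI to image.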

Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM’s denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM’s pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.


[49] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling cs.CVPDF

Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu

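TL;DR: MSTAR is a scene text retrieval method that needs no bounding box annotations and supports multiple query types. It unifies free-style text queries through dynamic multi-grained text representations and style-aware instructions, significantly improving performance while cutting annotation cost.

Details

Motivation: Existing scene text retrieval methods depend on costly bounding box annotations and struggle to unify diverse query types. MSTAR aims to solve these problems.

Result: On seven public datasets and the MQTR benchmark, MSTAR outperforms previous methods (for example, a 6.4% MAP gain on Total-Text), with an average improvement of 8.5% on MQTR.

Insight: Retrieval without bounding box annotations is feasible and efficient; a unified multi-query design better serves diverse needs.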

Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
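The retrieval results above are reported in MAP (mean average precision). As a reminder of how that metric is usually computed for a ranked list of images per query, here is a small sketch; it assumes every relevant image appears somewhere in the ranked list, and the function names are illustrative rather than taken from the MSTAR code base.

```python
def average_precision(ranked_relevance):
    """Average precision for one query, given 0/1 relevance flags in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    """Mean of per-query average precision values."""
    scores = [average_precision(r) for r in per_query_relevance]
    return sum(scores) / len(scores)
```

A query whose relevant images are ranked first and third out of three scores (1 + 2/3) / 2, so pushing relevant images toward the top of the list is what raises MAP.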


[50] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models cs.CVPDF

Konstantinos Vilouras, Ilias Stogiannidis, Junyu Yan, Alison Q. O’Neil, Sotirios A. Tsaftaris

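TL;DR: The paper proposes a fine-tuning framework for latent diffusion models on chest X-rays that improves multi-modal alignment through weakly supervised prompt tuning, addressing poor text-image alignment in medical imaging; it performs well on a standard dataset and on out-of-distribution data.

Details

Motivation: Data privacy concerns in medical imaging limit the available data, leaving latent diffusion models with weak text-image alignment and hindering their use in multimodal medical imaging tasks.

Result: The method sets a new SOTA on the MS-CXR dataset and is robust on out-of-distribution data such as VinDr-CXR.

Insight: Even in the data-limited medical imaging domain, weakly supervised learning can effectively improve multimodal models, suggesting new directions for medical image analysis.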

Abstract: Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.


[51] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models cs.CV | cs.AIPDF

Francisco Caetano, Christiaan Viviers, Peter H. N. De With, Fons van der Sommen

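TL;DR: The paper proposes Symmetrical Flow Matching (SymmFlow), a framework that unifies image generation, semantic segmentation, and classification through a symmetric learning objective, yielding a single high-performing multi-task model.

Details

Motivation: Existing Flow Matching frameworks excel at generation but are limited when it comes to unifying multiple tasks. The authors seek unified modeling of image generation, segmentation, and classification through a symmetric learning objective.

Result: On semantic image synthesis, FID reaches 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff; the model also performs well on segmentation and classification.

Insight: With symmetry and a semantics-preserving mechanism, unified multi-task modeling is not only feasible but can also improve performance on individual tasks, offering a new direction for multimodal generative learning.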

Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.
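For readers unfamiliar with Flow Matching, the basic training signal can be sketched with the commonly used linear interpolation path between a source sample and a target sample. This shows only the generic one-directional objective; SymmFlow's symmetric, bi-directional objective and its semantic-retention term are not specified in the abstract and are not reproduced here. The names `flow_matching_pair` and `flow_matching_loss` are hypothetical.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its constant velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a model's predicted velocity and the path velocity."""
    x_t, target_velocity = flow_matching_pair(x0, x1, t)
    return float(np.mean((predict_velocity(x_t, t) - target_velocity) ** 2))
```

Here `predict_velocity` stands in for a neural network; at sampling time the learned velocity field is integrated from the source distribution toward the target, which is where the number of inference steps comes in.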


[52] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning cs.CVPDF

Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen

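TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that relies on automatic feedback rather than large amounts of human annotation and data, significantly improving generation quality with only 4 GPU-hours.

Details

Motivation: Current video generation models need fine-tuning to improve specific dimensions (such as instance preservation and motion rationality), but conventional approaches depend on human annotation and heavy computation, which limits their practicality.

Result: Performance improves on almost all of the 17 evaluation dimensions, with an average gain of about 4%, at very low computational cost.

Insight: Automatic feedback can effectively replace human annotation, and low-cost fine-tuning has practical potential.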

Abstract: Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
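The abstract says samples are adaptively weighted using feedback from pre-trained vision-language models but does not give the weighting rule. One simple way such weighting could look, purely as an assumption, is a softmax over per-sample reward scores applied to the per-sample training losses; the function name, the `temperature` parameter, and the omission of the realism constraint are all choices made for this sketch.

```python
import numpy as np

def reward_weighted_loss(per_sample_loss, rewards, temperature=1.0):
    """Combine per-sample losses using softmax weights derived from reward scores.

    Returns the weighted loss and the weights. Hypothetical illustration only.
    """
    losses = np.asarray(per_sample_loss, dtype=float)
    scaled = np.asarray(rewards, dtype=float) / temperature
    weights = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    weights = weights / weights.sum()
    return float(np.sum(weights * losses)), weights
```

With equal rewards this reduces to the ordinary mean loss; a lower temperature concentrates the update on the samples the reward model scores highest.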


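[53] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis cs.CV | cs.AI

Marzieh Oghbaie, Teresa Araújo, Hrvoje Bogunović

TL;DR: PiPViT is a vision-Transformer-based prototype method that learns interpretable prototypes through contrastive learning and multi-resolution input processing for retinal image analysis, delivering competitive performance together with meaningful explanations.

Details

Motivation: Visualizations from existing prototype-based methods in medical imaging are often inconsistent with human-understandable biomarkers, and the prototypes are typically too fine-grained to convey the extent and presence of biomarkers.

Result: Achieves performance competitive with state-of-the-art methods on retinal OCT image classification across four datasets; the learned prototypes are clinically relevant and semantically meaningful.

Insight: PiPViT not only classifies accurately but also supports clinical interpretation of diagnoses through transparent prototypes.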

Abstract: Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT


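[54] Enhancing Deepfake Detection using SE Block Attention with CNN cs.CV

Subhram Dasgupta, Janelle Mason, Xiaohong Yuan, Olusola Odeyomi, Kaushik Roy

TL;DR: The paper proposes a lightweight CNN with a squeeze-and-excitation (SE) attention module for Deepfake detection, reducing model size and compute requirements while reaching competitive accuracy.

Details

Motivation: Rapid advances in Deepfake technology make forged content increasingly realistic and hard for traditional detection methods to handle. Existing deep detection models are typically large, with high storage and compute costs, so efficient lightweight solutions are needed.

Result: Achieves 94.14% classification accuracy and an AUC-ROC score of 0.985 on the Style GAN dataset, with accuracy competitive with existing models.

Insight: SE attention modules can effectively boost the detection performance of lightweight models, offering a practical option for Deepfake detection in resource-constrained settings.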

Abstract: In the digital age, Deepfakes present a formidable challenge by using advanced artificial intelligence to create highly convincing manipulated content, undermining information authenticity and security. These sophisticated fabrications surpass traditional detection methods in complexity and realism. To address this issue, we aim to harness cutting-edge deep learning methodologies to engineer an innovative Deepfake detection model. However, most of the models designed for Deepfake detection are large, causing heavy storage and memory consumption. In this research, we propose a lightweight convolutional neural network (CNN) with squeeze-and-excitation (SE) block attention for Deepfake detection. The SE block module is designed to perform dynamic channel-wise feature recalibration. The SE block allows the network to emphasize informative features and suppress less useful ones, which leads to a more efficient and effective learning module. This module is integrated with a simple sequential model to perform Deepfake detection. The model is smaller in size and achieves accuracy competitive with existing models for Deepfake detection tasks. The model achieved an overall classification accuracy of 94.14% and an AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse Fake Face Dataset. Our proposed approach presents a promising avenue for combating the Deepfake challenge with minimal computational resources, developing efficient and scalable solutions for digital content verification.
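The channel-wise recalibration performed by an SE block can be sketched in plain Python. This follows the standard squeeze-and-excitation recipe (global average pooling, a bottleneck layer with ReLU, a second layer with sigmoid, then per-channel scaling); it is not the authors' exact architecture. Biases are omitted, and the function name `se_block` and the weight arguments `w1` and `w2` are hypothetical.

```python
import math


def se_block(feature_map, w1, w2):
    """Standard squeeze-and-excitation over a list of C channels.

    feature_map: C channels, each an H x W list of lists of floats.
    w1: bottleneck weights, shape (C // r) x C (biases omitted for brevity).
    w2: expansion weights, shape C x (C // r).
    Returns the rescaled feature map and the per-channel scales.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    out = [[[s * x for x in row] for row in ch] for ch, s in zip(feature_map, scales)]
    return out, scales
```

Each gate lies in (0, 1), so informative channels can be kept near full strength while less useful ones are attenuated.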


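[55] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework cs.CV | cs.CR

Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun

TL;DR: The paper proposes UAC, an adversarial CAPTCHA framework that uses a large language model (LLM) and a diffusion model to generate high-quality adversarial examples. It supports targeted attacks and black-box attacks; experiments show high attack success rates, and the generated CAPTCHAs are hard for both humans and DNNs to distinguish.

Details

Motivation: Advances in deep learning have made traditional CAPTCHA schemes easy to break with automated attacks, and existing adversarial attack methods depend on features of an original image, which limits their use where no initial input image is available.

Result: BP-UAC achieves high attack success rates across diverse systems and generates natural, hard-to-distinguish CAPTCHAs.

Insight: Combining an LLM with a diffusion model improves the diversity and quality of adversarial examples; multimodal gradients and bi-path optimization are effective for black-box attacks.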

Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on original image characteristics, resulting in distortions that hinder human interpretation and limit applicability in scenarios lacking initial input images. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel framework generating high-fidelity adversarial examples guided by attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC enhances CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for superior image quality. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-UAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show BP-UAC achieves high attack success rates across diverse systems, generating natural CAPTCHAs indistinguishable to humans and DNNs.


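[56] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery cs.CV

Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral

TL;DR: The paper proposes a multi-task, multi-age approach to detecting minors in unconstrained images; a shared-feature architecture, an improved loss function, and age-balanced sampling improve the accuracy of both age estimation and under-age detection.

Details

Motivation: Automatic detection of minors suffers from too few child samples in public data and from distribution shift, so robust methods are needed.

Result: On ASORES-39k, age-estimation RMSE drops from 5.733 to 5.656 and the under-18 detection F2 score rises from 0.801 to 0.857; on ASWIFT-20k, the F2 score rises from 0.742 to 0.833.

Insight: Multi-task learning and age-balanced sampling are essential for under-age detection, especially under distribution shift, where the model remains robust.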

Abstract: Accurate automatic screening of minors in unconstrained images demands models that are robust to distribution shift and resilient to the under-representation of children in publicly available data. To overcome these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing on the legally critical age range. To address the severe class imbalance, we introduce an $\alpha$-reweighted focal-style loss and age-balanced mini-batch sampling, which equalizes twelve age bins during stochastic optimization. Further improvement is achieved with an age gap that removes edge cases from the loss. Moreover, we set a rigorous evaluation by proposing the Overall Under-Age Benchmark, with 303k cleaned training images and 110k test images, defining both the “ASORES-39k” restricted overall test, which removes the noisiest domains, and the age estimation wild shifts test “ASWIFT-20k” of 20k images, stressing extreme pose ($>45^\circ$), expression, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model “F” lowers the root-mean-square error on the ASORES-39k restricted test from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from an F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline, demonstrating strong generalization under distribution shift. For the under-12 and under-15 tasks, the boosts in F2 are from 0.666 to 0.955 and from 0.689 to 0.916, respectively.
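As a rough illustration of what an $\alpha$-reweighted focal-style loss looks like for one binary under-age head, the sketch below uses the widely known binary focal loss, $-\alpha_t (1 - p_t)^\gamma \log p_t$. The paper's exact formulation may differ; the function name and the default values of `alpha` and `gamma` are assumptions for illustration only.

```python
import math


def alpha_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample (standard form, assumed here).

    p: predicted probability of the positive class (e.g. "under 18").
    y: ground-truth label, 1 for positive and 0 for negative.
    alpha: weight on the positive class; (1 - alpha) weights the negative class.
    gamma: focusing parameter; larger values down-weight easy samples more.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to keep log() finite
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the expression reduces to class-weighted cross-entropy, which makes the role of the focusing term easy to isolate.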


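[57] Continual Hyperbolic Learning of Instances and Classes cs.CV | cs.AI | cs.LG

Melika Ayoughi, Mina Ghadimi Atigh, Mohammad Mahdi Derakhshani, Cees G. M. Snoek, Pascal Mettes

TL;DR: The paper proposes HyperCLIC, a continual learning framework that learns instances and classes at the same time, uses hyperbolic space to model their hierarchy, and is validated on the EgoObjects dataset.

Details

Motivation: Real-world applications such as robotics and autonomous driving need models that handle continual learning of both instances and classes, whereas traditional methods address only one of the two.

Result: Validated on EgoObjects, HyperCLIC performs well at multiple granularities and improves hierarchical generalization.

Insight: Hyperbolic space is well suited to modeling hierarchical relations and offers a new approach to multi-granularity tasks in continual learning.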

Abstract: Continual learning has traditionally focused on classifying either instances or classes, but real-world applications, such as robotics and self-driving cars, require models to handle both simultaneously. To mirror real-life scenarios, we introduce the task of continual learning of instances and classes, at the same time. This task challenges models to adapt to multiple levels of granularity over time, which requires balancing fine-grained instance recognition with coarse-grained class generalization. In this paper, we identify that classes and instances naturally form a hierarchical structure. To model these hierarchical relationships, we propose HyperCLIC, a continual learning algorithm that leverages hyperbolic space, which is uniquely suited for hierarchical data due to its ability to represent tree-like structures with low distortion and compact embeddings. Our framework incorporates hyperbolic classification and distillation objectives, enabling the continual embedding of hierarchical relations. To evaluate performance across multiple granularities, we introduce continual hierarchical metrics. We validate our approach on EgoObjects, the only dataset that captures the complexity of hierarchical object recognition in dynamic real-world environments. Empirical results show that HyperCLIC operates effectively at multiple granularities with improved hierarchical generalization.
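To make the appeal of hyperbolic space concrete, the sketch below computes the geodesic distance in the Poincaré ball model, one common model of hyperbolic space. The abstract does not state which model HyperCLIC uses, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    Both points must have Euclidean norm strictly below 1.
    """
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Distances grow rapidly near the boundary of the ball, so general classes can sit near the origin while many specific instances spread out toward the edge, which is how tree-like hierarchies embed with low distortion.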


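[58] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement cs.CV

Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu

TL;DR: The paper proposes Uncertainty-Masked Bernoulli Diffusion (UMBD), the first generative refinement framework designed specifically for camouflaged object detection (COD), which improves segmentation by selectively refining low-quality regions.

Details

Motivation: Existing COD methods leave room for improvement in poorly segmented regions, but a dedicated post-processing framework has been lacking.

Result: Across multiple COD benchmarks, UMBD yields average gains of 5.5% in MAE and 3.2% in weighted F-measure with modest computational overhead.

Insight: Combining generative refinement with discriminative models helps address the subtle target-background differences in COD; uncertainty-guided local refinement is the key.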

Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.


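[59] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain cs.CV

Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu

TL;DR: IQE-CLIP is a CLIP-based zero-/few-shot anomaly detection framework for the medical domain that combines textual and instance-aware visual information to produce anomaly-sensitive embeddings.

Details

Motivation: Existing CLIP-based methods rely on pre-designed, scenario-specific prompts and fail to separate normal from anomalous instances in the joint embedding space; exploration in the medical domain is also limited.

Result: Achieves SOTA performance in zero- and few-shot settings on six medical datasets.

Insight: Combining textual and instance-level visual information indicates anomalies more effectively; ZFSAD in the medical domain calls for finer-grained embedding generation.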

Abstract: Recent advances in vision-language models, such as CLIP, have significantly improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based methods assume prior knowledge of categories and rely on carefully designed prompts tailored to specific scenarios. While these text prompts capture semantic information in the textual space, they often fail to distinguish normal and anomalous instances in the joint embedding space. Moreover, most ZFSAD approaches focus on industrial domains, with limited exploration in medical tasks. To address these limitations, we propose IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query embeddings integrating both textual and instance-aware visual information serve as more effective indicators of anomalies. Specifically, we introduce class-based and learnable prompting tokens to better adapt CLIP to the medical setting. Furthermore, we design an instance-aware query module that extracts region-level contextual information from both modalities, enabling the generation of anomaly-sensitive embeddings. Extensive experiments on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Code and data are available at https://github.com/hongh0/IQE-CLIP/.


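[60] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework cs.CV

SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen

TL;DR: PosterCraft is a unified framework for generating high-quality aesthetic posters; a multi-stage optimization workflow and automated data construction substantially improve rendering quality and visual appeal.

Details

Motivation: Generating aesthetic posters is harder than generating simple design images, requiring text rendering, content integration, and layout harmony. Existing methods are mostly modular pipelines, which limits generative freedom.

Result: Substantially outperforms open-source baselines in rendering accuracy, layout coherence, and visual appeal, approaching the quality of SOTA commercial systems.

Insight: Automated data construction and multi-stage optimization are key to generation quality; a unified framework suits complex aesthetic tasks better.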

Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: https://ephemeral182.github.io/PosterCraft


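[61] SlotPi: Physics-informed Object-centric Reasoning Models cs.CV | cs.AI | cs.LG

Jian Li, Wan Han, Ning Lin, Yu-Liang Zhan, Ruizhi Chengze

TL;DR: SlotPi is a physics-informed object-centric reasoning model that combines a module based on Hamiltonian principles with a spatio-temporal prediction module, addressing the limited integration of physical knowledge and the untested cross-scenario adaptability of existing methods.

Details

Motivation: Existing object-centric dynamics simulation methods overlook the integration of physical knowledge and the validation of model adaptability across diverse scenarios, whereas humans acquire physical knowledge by observing the world and use it for dynamic reasoning.

Result: Experiments show that SlotPi performs strongly on prediction and VQA tasks and adapts well to a newly created real-world dataset.

Insight: Integrating physical knowledge improves both dynamic reasoning and cross-scenario adaptability, laying a foundation for more advanced world models.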

Abstract: Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical insights by observing the world and apply this knowledge to accurately reason about various dynamic scenarios; 2) the validation of model adaptability across diverse scenarios. Real-world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot-based physics-informed object-centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Our experiments highlight the model’s strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model’s capabilities. The model’s robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.


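[62] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning cs.CV

Ignacio Bugueno-Cordova, Javier Ruiz-del-Solar, Rodrigo Verschae

TL;DR: This work proposes a robot navigation controller that combines event cameras and other sensors with reinforcement learning for real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, it exploits the asynchronous nature of event cameras to enable adaptive inference and control.

Details

Motivation: Conventional image-based navigation controllers suffer from fixed frame rates, motion blur, and latency, whereas event cameras capture visual information asynchronously, offering a way around these problems.

Result: Achieves robust navigation, pedestrian following, and obstacle avoidance in simulated environments.

Insight: The asynchronous nature of event cameras offers a new sensing modality for robot navigation; combined with reinforcement learning, it can improve adaptability in dynamic environments.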

Abstract: This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.


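[63] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization cs.CV

Mario Barbara, Alaa Maalouf

TL;DR: The paper proposes a zero-shot, text-queryable video summarization method that uses video-language models (VidLMs) and large language models (LLMs) to produce user-guided video summaries without training data, outperforming unsupervised methods and matching supervised ones.

Details

Motivation: The explosive growth of video data calls for flexible, user-controllable summarization tools, but existing methods either depend on domain-specific data and fail to generalize or cannot incorporate user intent expressed in natural language.

Result: Surpasses unsupervised methods on SumMe and TVSum and is competitive on the QFVS benchmark, all without training data. The new VidSum-Reason dataset provides a challenging baseline for future work.

Insight: Pretrained multimodal models, combined with carefully designed prompts and score propagation, can serve as a strong foundation for general, text-queryable video summarization.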

Abstract: The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without any training data, beating all unsupervised and matching supervised methods. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally, (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data while the competing methods require supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.


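[64] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing cs.CV | eess.IV | eess.SP

Hang Zhang, Xiang Chen, Renjiu Hu, Rongguang Wang, Jinwei Zhang

TL;DR: The paper proposes SmoothProper, a module for unsupervised deformable image registration that incorporates a nonparametric smoothing optimization layer, addressing the challenges of sparse features and large displacements.

Details

Motivation: Sparse features and large displacements are difficult for conventional unsupervised deformable image registration (DIR) methods; because neural networks predict deformation fields in a single forward pass, the fields are left unconstrained, making these cases hard to handle.

Result: On a retinal vessel dataset, registration error drops to 1.88 pixels (2912x2912 images), making this the first unsupervised DIR approach to effectively address both sparse features and large displacements.

Insight: Combining an optimization layer with a neural network compensates for the weak structural consistency of unsupervised DIR and offers a new approach to registration in complex scenes.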

Abstract: Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network’s forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.


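[65] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders cs.CV

Hui Yang, Wei Sun, Jian Liu, Jin Zheng, Jian Xiao

TL;DR: The paper proposes HOMAE, an occlusion-aware 3D hand-object pose estimation method based on masked autoencoders; a target-focused masking strategy and multi-scale feature fusion address occlusion in hand-object interaction, achieving state-of-the-art results on the DexYCB and HO3Dv2 benchmarks.

Details

Motivation: Existing hand-object pose estimation methods perform poorly under occlusion and lack global structural perception and reasoning. The authors introduce masked autoencoders and multi-scale feature fusion to improve performance in occluded scenes.

Result: HOMAE achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks, supporting the effectiveness of its occlusion-aware design and geometric fusion.

Insight: Combining the strengths of implicit (SDF) and explicit (point cloud) representations handles occlusion better; multi-scale feature extraction and global reasoning are key to better pose estimation.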

Abstract: Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.


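[66] VideoDeepResearch: Long Video Understanding With Agentic Tool Using cs.CV | cs.AI | cs.CL

Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou

TL;DR: The paper proposes VideoDeepResearch, a new agentic framework for long video understanding (LVU) that relies only on a text-only reasoning model and a multimodal toolkit, and substantially outperforms existing MLLM baselines.

Details

Motivation: Current multimodal large language models (MLLMs) struggle with long video understanding (LVU) because of task complexity and context-window limits; the paper aims to overcome these limits with an agentic system.

Result: On benchmarks such as MLVU, Video-MME, and LVBench, VideoDeepResearch substantially outperforms existing MLLM baselines, with gains of up to 9.6%.

Insight: Through dynamic tool calls and multimodal collaboration, agentic systems can address the complexity and context limits of long video understanding.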

Abstract: Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.


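[67] Post-Training Quantization for Video Matting cs.CV | cs.AI

Tianrui Zhu, Houyuan Chen, Ruihao Gong, Michele Magno, Haotong Qin

TL;DR: This paper proposes PTQ4VM, a post-training quantization (PTQ) framework designed for video matting. A two-stage strategy, a Global Affine Calibration (GAC) method, and an Optical Flow Assistance (OFA) component improve the accuracy and temporal coherence of quantized models while greatly reducing compute, reaching near full-precision performance at 4 bits.

Details

Motivation: Video matting models are hard to deploy on resource-constrained devices, and existing PTQ methods fall short on accuracy and temporal coherence, limiting their use in this domain.

Result: PTQ4VM achieves state-of-the-art accuracy across bit-widths; the 4-bit model approaches full-precision performance with 8x FLOP savings.

Insight: Capturing local dependencies and global statistics, together with temporal information, can substantially improve the quantization of video matting models.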

Abstract: Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, even reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model’s ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8x FLOP savings.
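For readers unfamiliar with the basic operation that PTQ methods calibrate, here is a minimal sketch of uniform affine quantization to a low bit-width, followed by dequantization. It illustrates only the generic quantize/dequantize step with a min-max range; it does not implement PTQ4VM's block reconstruction, GAC, or OFA, and the function name `quantize_dequantize` is hypothetical.

```python
def quantize_dequantize(values, num_bits=4):
    """Uniform affine quantization of a list of floats (generic sketch).

    Maps the observed [min, max] range onto the integers 0 .. 2**num_bits - 1,
    then maps the integers back to floats so the rounding error can be inspected.
    Returns (quantized ints, dequantized floats, scale, zero_point).
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    # Step size between adjacent quantization levels; guard against a flat input.
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    # Integer that represents the real value closest to the bottom of the range.
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    deq = [(qi - zero_point) * scale for qi in q]
    return q, deq, scale, zero_point
```

At 4 bits there are only 16 levels, so every value in range is reproduced to within about half a step (`scale / 2`); calibration methods aim to choose ranges and corrections that keep this error from accumulating through the network.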


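[68] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos cs.CV | cs.AI | cs.MM

Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang

TL;DR: VRBench is the first benchmark for multi-step reasoning over long narrative videos, comprising 1,010 long videos and 9,468 human-labeled multi-step question-answer pairs, and provides a standardized tool for evaluating large models' temporal reasoning and procedural validity.

Details

Motivation: Existing evaluations overlook temporal reasoning and procedural validity and lack systematic testing of multi-step reasoning over long videos; VRBench fills this gap.

Result: Extensive evaluation of 12 LLMs and 16 VLMs shows that VRBench effectively differentiates model performance on multi-step reasoning tasks.

Insight: Multi-step reasoning over long videos demands stronger temporal modeling, and reasoning-chain quality should be assessed along multiple dimensions.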

Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models’ multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.


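[69] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation cs.CV

Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi

TL;DR: CreatiPoster is a framework for editable, controllable multi-layer graphic design generation; from natural-language instructions or user-provided assets it produces a JSON specification and a multi-layer design, surpassing existing open-source and commercial tools.

Details

Motivation: Current AI tools struggle to incorporate user-supplied assets accurately while maintaining editability and visual appeal, and commercial systems rely on template libraries, limiting creativity and practicality.

Result: CreatiPoster surpasses existing open-source methods and commercial systems and supports a range of applications (such as canvas editing and multilingual adaptation).

Insight: Separating the generation of the layered design from background synthesis improves editability and visual coherence, advancing the democratization of AI-assisted graphic design.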

Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on these rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: https://github.com/graphic-design-ai/creatiposter


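[70] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement cs.CV | cs.AI

Guimeng Liu, Milad Abdollahzadeh, Ngai-Man Cheung

TL;DR: This paper proposes AIR, a zero-shot generative model adaptation method that uses iterative refinement to address the misalignment between image offsets and text offsets in existing methods. Experiments show state-of-the-art performance across 26 experimental setups.

Details

Motivation: Existing zero-shot generative model adaptation methods assume that image offsets and text offsets are perfectly aligned in the CLIP embedding space, which degrades generated image quality. Through an empirical study, the paper finds that offset misalignment correlates with concept distance and proposes an improved method.

Result: Across 26 experimental setups, AIR achieves state-of-the-art performance in qualitative and quantitative evaluations and in a user study.

Insight: Offset misalignment in the CLIP embedding space correlates with concept distance: closer concepts show less misalignment. This finding can guide the optimization of generative models.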

Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches is the directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets between the two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in quality degradation in generated images. Our work makes two main contributions. Inspired by offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in the CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in the CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user study results in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. Additional experiments are in the supplementary material.
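
The directional loss mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common formulation, one minus the cosine similarity between the image offset and the text offset; the short vectors stand in for CLIP embeddings and are made up for illustration. This shows the baseline objective the abstract criticizes, not the AIR refinement itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def offset(source, target):
    """Element-wise difference target - source."""
    return [t - s for s, t in zip(source, target)]

def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """Assumed form: 1 - cos(image offset, text offset)."""
    return 1.0 - cosine(offset(img_src, img_tgt), offset(txt_src, txt_tgt))

# Toy embeddings (hypothetical values, not real CLIP outputs).
aligned = directional_loss([1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 2, 1])
orthogonal = directional_loss([0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0])
```

Here `aligned` is zero because both offsets point along the same axis, while `orthogonal` is one. The abstract's point is that forcing this loss to zero is only appropriate when the two offsets really are aligned, which holds less well for distant concepts.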


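[71] M4V: Multi-Modal Mamba for Text-to-Video Generation cs.CV | cs.AI | cs.LG

Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian

TL;DR: M4V is a Mamba-based multi-modal text-to-video generation framework. With multi-modal diffusion Mamba blocks and a reward learning strategy, it significantly lowers computational cost and improves the quality of generated videos.

Details

Motivation: Conventional Transformer-based video generation has quadratic computational complexity, which limits practical use. The Mamba architecture is efficient, but its design does not directly apply to multi-modal and spatiotemporal modeling tasks. M4V aims to solve these problems.

Result: On text-to-video benchmarks, M4V generates high-quality videos while significantly lowering computational cost.

Insight: 1. Once adapted, the Mamba architecture performs well on multi-modal tasks. 2. Reward learning is an effective way to improve generation quality for long sequences.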

Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V’s ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.


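[72] VINCIE: Unlocking In-context Image Editing from Video cs.CV | cs.AI | cs.CL | cs.LG | cs.MM

Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin

TL;DR: VINCIE proposes learning in-context image editing directly from videos. Using interleaved multimodal sequence annotation and a block-causal diffusion transformer, it enables multi-task learning and performs strongly on several tasks.

Details

Motivation: Existing methods depend on task-specific pipelines and expert models. VINCIE explores whether in-context image editing can be learned directly from videos, simplifying the pipeline and improving flexibility.

Result: The model performs strongly on several tasks, including multi-concept composition, story generation, and chain-of-editing applications, and reaches SOTA on multi-turn image editing benchmarks.

Insight: Learning in-context image editing from videos is feasible, and multi-task learning can effectively improve a model's generalization and performance.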

Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
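
To make the term "block-causal" concrete, here is a small sketch of a block-causal attention mask. It assumes the usual meaning: a token may attend to every token in its own block and in all earlier blocks, but not to later blocks. The block sizes are hypothetical, and the abstract does not specify how VINCIE partitions its sequences.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j."""
    block_of = []
    for block_index, size in enumerate(block_sizes):
        # Record which block each token position belongs to.
        block_of.extend([block_index] * size)
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]

# Two blocks: a 2-token segment followed by a 3-token segment.
mask = block_causal_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Within a block the mask is fully bidirectional, which suits image tokens that have no natural left-to-right order, while causality is kept across blocks so earlier context never sees later content.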


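[73] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning cs.CV | cs.CL

Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue

TL;DR: The paper proposes knowledge image generation as a new task and releases the MMMG benchmark to evaluate the reasoning ability of image generation models. MMMG contains 4,456 expert-validated knowledge image-prompt pairs covering multiple disciplines, educational levels, and knowledge formats. Using a unified KG representation and the MMMG-Score metric, it exposes reasoning deficits in current models and provides an open baseline, FLUX-Reason.

Details

Motivation: Knowledge images play an important role in human civilization and learning, but generating them requires multimodal reasoning, and no dedicated task or benchmark exists to evaluate this ability.

Result: An evaluation of 16 SOTA text-to-image generation models reveals widespread reasoning deficits. GPT-4o reaches an MMMG-Score of only 50.20, and the released baseline FLUX-Reason scores 34.45.

Insight: Current models still fall well short in multimodal reasoning; future work needs to integrate factual knowledge with visual generation more deeply.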

Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning, a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits (low entity fidelity, weak relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark’s difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
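
The idea of scoring a generated image by comparing knowledge graphs can be sketched as follows. Everything specific here is an assumption made for illustration: the label-matched set difference is a simplified stand-in for graph-edit distance, and the normalization and the multiplication by a clarity value in [0, 1] are hypothetical choices, since the abstract does not give the actual MMMG-Score formula.

```python
def kg_edit_distance(ref, gen):
    """Count node and edge insertions/deletions between label-matched graphs."""
    node_diff = ref["nodes"] ^ gen["nodes"]   # symmetric difference
    edge_diff = ref["edges"] ^ gen["edges"]
    return len(node_diff) + len(edge_diff)

def fidelity(ref, gen):
    """Map the distance to [0, 1]; 1.0 means the graphs are identical."""
    total = sum(len(g["nodes"]) + len(g["edges"]) for g in (ref, gen))
    if total == 0:
        return 1.0
    return 1.0 - kg_edit_distance(ref, gen) / total

def combined_score(ref, gen, clarity):
    """Hypothetical combination: fidelity scaled by a clarity value in [0, 1]."""
    return fidelity(ref, gen) * clarity

# Toy graphs: the generated image omits the moon and its relation.
reference = {"nodes": {"sun", "earth", "moon"},
             "edges": {("earth", "sun"), ("moon", "earth")}}
generated = {"nodes": {"sun", "earth"},
             "edges": {("earth", "sun")}}
score = combined_score(reference, generated, clarity=0.8)
```

In this toy case two elements are missing out of eight counted across both graphs, so the fidelity term is 0.75 before the clarity factor is applied.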


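[74] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs cs.CV | cs.AI

Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang

TL;DR: The paper proposes CDPruner, a new visual token pruning method that maximizes conditional diversity to prune visual tokens in multimodal large language models (MLLMs), substantially improving performance while reducing computational overhead.

Details

Motivation: In MLLMs, visual tokens typically far outnumber text tokens, making inference expensive. Existing pruning methods either retain duplicate tokens (attention-based) or overlook instruction relevance (similarity-based), which hurts model performance.

Result: When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78% while maintaining 94% of the original accuracy. It reaches SOTA performance on multiple vision-language benchmarks.

Insight: Maximizing conditional diversity not only reduces redundant tokens but also keeps the retained set representative of the input image and closely aligned with the user instruction, preserving strong performance even at high pruning ratios.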

Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
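
A rough sketch of how a determinantal point process (DPP) can trade off relevance against redundancy is shown below. It assumes the standard quality-diversity kernel, `L[i][j] = r_i * S[i][j] * r_j`, and a naive greedy selection that recomputes determinants; the similarity matrix and relevance scores are invented, and the actual CDPruner kernel and inference procedure may differ.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return result

def greedy_dpp(similarity, relevance, keep):
    """Greedily pick `keep` items that maximize the kernel sub-determinant."""
    n = len(relevance)
    keep = min(keep, n)
    # Assumed kernel: relevance-weighted similarity.
    kernel = [[relevance[i] * similarity[i][j] * relevance[j] for j in range(n)]
              for i in range(n)]
    selected = []
    while len(selected) < keep:
        best, best_det = None, -1.0
        for cand in range(n):
            if cand in selected:
                continue
            idx = selected + [cand]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = cand, d
        selected.append(best)
    return selected

# Tokens 0 and 1 are near-duplicates; token 2 is distinct but less relevant.
similarity = [[1.0, 0.99, 0.1],
              [0.99, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
relevance = [1.0, 0.9, 0.8]
kept = greedy_dpp(similarity, relevance, keep=2)  # selects tokens 0 and 2
```

Although token 1 is more relevant than token 2, it is skipped because it nearly duplicates token 0, which is the behavior the abstract attributes to maximizing conditional diversity.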


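[75] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos cs.CV

Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou

TL;DR: GenWorld is a large-scale, high-quality real-world simulation dataset for AI-generated video detection. The paper also proposes the SpannDetector model, which uses multi-view consistency to improve detection performance.

Details

Motivation: As video generation advances, AI-generated videos threaten the credibility of real-world information, creating an urgent need for reliable detection. However, the lack of high-quality real-world simulation datasets hinders detector development.

Result: Experiments show that SpannDetector performs strongly at detecting high-quality generated videos, pointing to a new direction for explainable detection based on physical plausibility.

Insight: Real-world clues are essential for AI-generated video detection, and multi-view consistency is an effective detection criterion.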

Abstract: The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld


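[76] Fine-Grained Perturbation Guidance via Attention Head Selection cs.CV | cs.AI | cs.LG

Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min

TL;DR: The paper proposes HeadHunter and SoftPAG, which select and perturb attention heads at a fine granularity to improve the visual quality and controllability of images generated by diffusion models.

Details

Motivation: Existing attention perturbation methods lack a principled way to decide where to apply perturbations, especially in DiT architectures, where quality-relevant computation is distributed across layers.

Result: The method is validated on Stable Diffusion 3 and FLUX.1, improving generation quality and enabling style-specific control.

Insight: Attention heads specialize in distinct visual concepts (such as structure and style) and can be used to control generation precisely.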

Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
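
The interpolation that the abstract attributes to SoftPAG can be written in a few lines. This is a minimal sketch for a single head's square attention map; the parameter name `strength` and its range (0 leaves the map unchanged, 1 replaces it with the identity) are assumptions for illustration.

```python
def soft_pag(attention, strength):
    """Linearly blend a square attention map with the identity matrix."""
    n = len(attention)
    return [
        [(1.0 - strength) * attention[i][j] + strength * (1.0 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

# Toy 2-token attention map whose rows sum to 1.
attention = [[0.5, 0.5],
             [0.2, 0.8]]
perturbed = soft_pag(attention, strength=0.5)
```

Because both the original rows and the identity rows sum to 1, every blended row still sums to 1, so the result remains a valid attention distribution at any strength in between.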


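[77] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model cs.CV

Junqi You, Chieh Hubert Lin, Weijie Lyu, Zhengbo Zhang, Ming-Hsuan Yang

TL;DR: InstaInpaint is a reference-based, fast feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. It trains a custom large reconstruction model (LRM) with a self-supervised masked-finetuning strategy, achieving a 1000x speed-up while maintaining state-of-the-art performance on two standard benchmarks.

Details

Motivation: Current 3D scene inpainting methods rely on lengthy, computationally intensive optimization, which makes them unsuitable for real-time or online applications. An efficient solution is needed to support interactive operations.

Result: Inpainting completes within 0.4 seconds, a 1000x speed-up; performance is strong on two standard benchmarks; and the method applies to downstream tasks such as object insertion and multi-region inpainting.

Insight: The self-supervised masked-finetuning strategy and large-scale training data are key to improving generalization, textural consistency, and geometric correctness; a fast framework broadens the practical applications of 3D inpainting.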

Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on the large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000x speed-up from prior methods while maintaining a state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting. More video results are available at our project page: https://dhmbb2.github.io/InstaInpaint_page/.


cs.CL [Back]

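[78] TaskCraft: Automated Generation of Agentic Tasks cs.CL

Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li

TL;DR: TaskCraft is a framework for automatically generating agentic tasks with multi-tool interaction, scalable difficulty, and verifiable execution trajectories. It addresses the lack of tool interaction in existing datasets and the high cost of human annotation.

Details

Motivation: Existing instruction data lacks tool interaction, and agentic task benchmarks rely mainly on human annotation, which is costly and hard to scale. An automated method for generating diverse agentic tasks with controllable difficulty is therefore needed.

Result: Experiments show that the generated tasks improve prompt optimization and supervised fine-tuning, supporting performance gains in agentic models.

Insight: Automated task generation is an effective way to address the scarcity of agentic task data and high annotation costs, and task difficulty can be controlled by adjusting how tasks are extended.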

Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.


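[79] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information cs.CL

Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Dhaval Patel

TL;DR: This paper proposes Chat-of-Thought, a multi-agent system for generating FMEA documents for industrial assets. The system uses collaborating LLM agents with distinct roles and dynamic task routing to optimize content generation and validation.

Details

Motivation: Generating FMEA documents for industrial equipment monitoring poses efficiency and accuracy challenges that traditional methods struggle to meet.

Result: The paper demonstrates the potential of Chat-of-Thought in industrial equipment monitoring, where it can generate and validate FMEA documents efficiently.

Insight: Multi-agent collaboration and dynamic discussion can significantly improve the quality and efficiency of generating domain-specific information.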

Abstract: This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.


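[80] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering cs.CL

Caijun Jia, Nan Xu, Jingxuan Wei, Qingli Wang, Lei Wang

TL;DR: ChartReasoner is a two-stage, code-driven framework for long-chain reasoning in chart question answering. With high-fidelity chart conversion and automatically synthesized data, it improves the precision and interpretability of multimodal reasoning.

Details

Motivation: Current visual reasoning approaches usually convert visual information into text before reasoning, which loses the structural and semantic information in charts. In chart question answering in particular, this causes key details to be missed.

Result: On four public benchmarks, ChartReasoner performs strongly in preserving chart details and in reasoning, approaching GPT-4o performance with fewer parameters.

Insight: Code-driven modality conversion and automatic data synthesis allow efficient multimodal reasoning while preserving visual details, offering a new approach to visual reasoning tasks.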

Abstract: Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts codes, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.


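[81] Unsupervised Elicitation of Language Models cs.CL | cs.AI

Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks

TL;DR: The paper proposes ICM, an unsupervised algorithm that fine-tunes pretrained language models on their own generated labels without external supervision and outperforms conventional human supervision.

Details

Motivation: For language models with superhuman capabilities, high-quality human supervision is hard to obtain; the paper aims to address this problem.

Result: The method outperforms human supervision on several tasks and can improve the training of frontier language models.

Insight: Unsupervised methods can surpass human supervision, especially on tasks where model capability far exceeds human capability.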

Abstract: To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs’ capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.


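[82] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective cs.CL | cs.AI

Yi Wang, Max Kreminski

TL;DR: The paper studies the story generation ability of large language models (LLMs), focusing on narrative planning problems. Using a proposed evaluation benchmark based on literature examples, it finds that GPT-4 tier LLMs can generate causally sound stories at small scales but still struggle with character intentionality and dramatic conflict.

Details

Motivation: Story generation is an important application of LLMs, but understanding of their ability is limited, mainly because automatic evaluation methods are inadequate and manual evaluation is costly and subjective.

Result: GPT-4 tier LLMs perform well on small-scale stories, but character intentionality and dramatic conflict still require complex reasoning from models trained with reinforcement learning.

Insight: LLM narrative planning ability is limited by scale and complexity; future work should combine techniques such as reinforcement learning to improve complex reasoning.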

Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs’ ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs’ story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.


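[83] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval cs.CL

Shubhashis Roy Dipta, Francis Ferraro

TL;DR: Q2E proposes a method for zero-shot multilingual text-to-video retrieval that decomposes queries using the latent knowledge of LLMs and VLMs, improving video retrieval for complex events.

Details

Motivation: The work uses the latent parametric knowledge of LLMs and VLMs to improve video retrieval for complex events and to address overly simplified human queries.

Result: Q2E outperforms existing methods across multiple datasets and retrieval metrics, and integrating audio information significantly improves performance.

Insight: Complex event retrieval can be improved through query decomposition and multimodal fusion; audio information is indispensable in multimodal retrieval.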

Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.


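[84] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games cs.CL | cs.AI

Prakamya Mishra, Jiang Liu, Jialian Wu, Xiaodong Yu, Zicheng Liu

TL;DR: TTT-Bench is a new benchmark that evaluates the basic strategic, spatial, and logical reasoning abilities of large reasoning models (LRMs) through four simple Tic-Tac-Toe-style games. It finds that these models excel at complex math problems yet perform poorly on simple reasoning games.

Details

Motivation: Existing benchmarks focus mainly on the STEM domain, and the reasoning ability of LRMs in broader task domains remains underexplored. Evaluating basic reasoning with simple games fills this gap.

Result: LRMs score on average 41% and 5% lower on TTT-Bench than on MATH 500 and AIME 2024, respectively, and perform especially poorly on long-term strategic reasoning.

Insight: The strong performance of large reasoning models on complex tasks may mask weaknesses in basic reasoning, which points to a direction for further model improvement.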

Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board’s spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench than on MATH 500 and AIME 2024, respectively, with larger models achieving higher performance using shorter reasoning traces, where most of the models struggle on long-term strategic reasoning situations on simple and new TTT-Bench tasks.
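
What "verifiable" can mean for such game problems is sketched below using standard 3x3 Tic-Tac-Toe as a stand-in. The four TTT-Bench variants are not described in this abstract, so the board encoding and the checker are illustrative assumptions rather than the benchmark's own generator.

```python
# All eight winning lines on a 3x3 board indexed 0..8, row by row.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winning_moves(board, player):
    """Return every empty cell where `player` wins immediately."""
    moves = []
    for cell in range(9):
        if board[cell] != ".":
            continue
        trial = board[:cell] + player + board[cell + 1:]
        if any(all(trial[i] == player for i in line) for line in WIN_LINES):
            moves.append(cell)
    return moves

def is_correct(board, player, answer):
    """Verify a model's proposed move against the exhaustive solution set."""
    return answer in winning_moves(board, player)

# "." marks an empty cell; X holds the top-left pair, O the middle-left pair.
board = "XX." "OO." "..."
x_wins = winning_moves(board, "X")  # [2]
o_wins = winning_moves(board, "O")  # [5]
```

Because the solution set is computed exhaustively, any answer a model gives can be checked automatically, and new positions can be generated at scale without human labeling.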


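[85] Classifying Unreliable Narrators with Large Language Models cs.CL

Anneliese Brei, Katharine Henry, Abhisheik Sharma, Shashank Srivastava, Snigdha Chaturvedi

TL;DR: The paper uses large language models (LLMs) to identify unreliable narrators, introducing the TUNa dataset and classification tasks. Experiments show that the task is very challenging but that LLMs show potential.

Details

Motivation: The study aims to identify computationally whether a narrator is reliable, bridging literary theory and its application to real-world text data.

Result: The task is very challenging, but LLMs show potential for identifying unreliable narrators.

Insight: Literary theory can inform real-world text classification; future research can further improve models and datasets.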

Abstract: Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.


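[86] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages cs.CL | cs.AI

Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi

TL;DR: Flick proposes a few-label text classification method for low-resource languages that substantially improves pseudo-label reliability through high-quality pseudo-label distillation and an adaptive selection mechanism.

Details

Motivation: The work addresses the difficulties of few-label text classification in low-resource language settings, particularly noisy pseudo-labels and domain adaptation.

Result: Flick's superior performance is validated on 14 diverse datasets, including low-resource languages such as Arabic and Urdu.

Insight: By focusing on generating high-quality pseudo-labels, Flick achieves more robust model fine-tuning in low-resource settings with only a handful of true labels.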

Abstract: Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick’s efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.


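[87] “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context cs.CL | cs.CY

Chuck Arvin

TL;DR: The paper studies how user-provided suggestions affect large language models (LLMs) in a simulated educational setting, and in particular how model sycophancy may harm educational equity.

Details

Motivation: LLMs are increasingly used in education, but their sensitivity to user input can lead to sycophantic behavior, which may worsen educational inequality.

Result: Model accuracy shifts substantially with the answer the student mentions (±15 percentage points), and the sycophancy effect is stronger in smaller models (up to 30% vs. 8%).

Insight: LLMs may widen knowledge gaps in education, underscoring the need to understand and mitigate this bias.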

Abstract: This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their answer, and an investigation into token-level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism behind such bias, and ways to mitigate it, in the educational context.
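
The “flip” analysis mentioned in the abstract can be illustrated with a small sketch. The `Trial` record and the `flip_rate_toward_suggestion` function are hypothetical names introduced here for illustration, not the paper's code; the sketch only counts how often a model abandons its baseline answer for the option a student mentioned.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    baseline_answer: str   # model's answer with no student suggestion (assumed field)
    suggested_answer: str  # answer option mentioned by the student (assumed field)
    prompted_answer: str   # model's answer after seeing the suggestion (assumed field)

def flip_rate_toward_suggestion(trials):
    """Share of eligible trials where the model drops its baseline answer
    and adopts the option the student mentioned."""
    # Only trials where the suggestion differs from the baseline can "flip".
    eligible = [t for t in trials if t.baseline_answer != t.suggested_answer]
    if not eligible:
        return 0.0
    flips = sum(1 for t in eligible if t.prompted_answer == t.suggested_answer)
    return flips / len(eligible)

trials = [
    Trial("A", "B", "B"),  # flipped to the suggestion
    Trial("A", "B", "A"),  # held its answer
    Trial("C", "C", "C"),  # suggestion matched the baseline, so not eligible
    Trial("D", "A", "A"),  # flipped to the suggestion
]
rate = flip_rate_toward_suggestion(trials)
```

Restricting the denominator to trials where the suggestion differs from the baseline is a design choice of this sketch; the paper may define its flip statistic differently.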


[88] Code Execution as Grounded Supervision for LLM Reasoning cs.CL | cs.AI

Dongwon Jung, Wenxuan Zhou, Muhao Chen

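TL;DR: The paper uses the determinism of code execution to generate high-quality chain-of-thought (CoT) supervision data, replacing costly human annotation and error-prone LLM-generated data, and shows that this improves LLM reasoning.

Details

Motivation: Existing methods for generating CoT supervision rely on expensive human annotation or error-prone LLM generation, so reliability and accuracy are hard to guarantee. The paper exploits the determinism of program execution to build a scalable method for generating high-quality supervision data.

Result: On reasoning benchmarks across multiple domains, the method significantly improves LLM reasoning. Ablation studies confirm that the generated data is accurate and that inference becomes more efficient.

Insight: Generating reasoning supervision from deterministic code execution is efficient and scalable, reducing reliance on human annotation while improving reasoning accuracy.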

Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into a natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
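
To make the idea of turning an execution trace into natural-language reasoning concrete, here is a minimal hand-instrumented sketch. The paper extracts traces from code execution automatically; the function `gcd_with_trace` and its sentence templates are assumptions made purely for illustration.

```python
def gcd_with_trace(a, b):
    """Euclid's algorithm that also records each executed step as a sentence.

    Every sentence is produced from values the program actually computed,
    so the resulting chain of steps is verifiable by construction.
    """
    trace = []
    while b != 0:
        q, r = divmod(a, b)
        trace.append(f"{a} divided by {b} gives quotient {q} and remainder {r}.")
        a, b = b, r
    trace.append(f"The remainder is 0, so the answer is {a}.")
    return a, trace

result, steps = gcd_with_trace(48, 18)
cot_text = " ".join(steps)  # a CoT-style explanation assembled from the trace
```

Because each step comes from real execution, no step can contradict the final answer, which is the property that LLM-written CoT data cannot guarantee.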


[89] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning cs.CL | cs.IR

Xiaohan Yu, Pu Jian, Chong Chen

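TL;DR: TableRAG is a retrieval-augmented generation framework for reasoning over heterogeneous documents that contain both text and tables. By combining text retrieval with table operations, it addresses the limitations of existing methods in preserving table structure and handling multi-hop reasoning.

Details

Motivation: Existing retrieval-augmented generation (RAG) methods handle documents that mix text and tables poorly: they break table structure and lose information, which hurts performance on multi-hop and global reasoning tasks.

Result: TableRAG outperforms existing methods on public datasets and on HeteQA, setting a new state of the art for heterogeneous document question answering.

Insight: Reasoning over heterogeneous documents requires structured operations over both text and tables rather than simple concatenation, and dynamic multi-step reasoning improves performance significantly.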

Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practices of table flattening and chunking disrupt the intrinsic tabular structure, lead to information loss, and undermine the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state of the art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
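
The “SQL programming and execution” step can be sketched with Python's built-in `sqlite3` module. The `sales` table, its schema, and the `run_sql_step` helper are hypothetical and are not taken from the TableRAG repository; the point is only that an aggregate query runs against the intact table instead of flattened text chunks.

```python
import sqlite3

# Toy table standing in for a table extracted from a heterogeneous document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", 2023, 120.0), ("North", 2024, 150.0),
     ("South", 2023, 90.0), ("South", 2024, 80.0)],
)

def run_sql_step(connection, query):
    """Execute a generated SQL query and return its rows as an intermediate answer."""
    return connection.execute(query).fetchall()

# A global, aggregate sub-query that needs every row of the table.
rows = run_sql_step(
    conn,
    "SELECT region, SUM(revenue) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC",
)
top_region = rows[0][0]
```

In a full pipeline the query string would come from the model after query decomposition, and `rows` would feed the intermediate answer generation step.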


[90] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier cs.CL | cs.AI | cs.LG

Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu

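TL;DR: PAG is a multi-turn reinforcement learning framework that unifies generation and verification: the model switches between generator and verifier roles and revises its answer selectively, which improves self-correction.

Details

Motivation: Large language models (LLMs) perform well on complex reasoning tasks but struggle to reliably verify the correctness of their own outputs. Existing solutions depend on separate verifier modules or multi-stage training, which limits scalability.

Result: Across diverse reasoning benchmarks, PAG as a policy improves both direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.

Insight: Unifying verification and generation in a single framework, with selective revision, alleviates the model collapse associated with unnecessary repeated revision and jointly improves reasoning and verification.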

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG’s dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
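
The verify-then-revise control flow described in the abstract can be sketched as a short loop. This is an inference-time illustration only, not the RL training procedure; `verify_then_revise` and the toy callables are hypothetical, and in PAG a single model would play all three roles.

```python
from typing import Callable, Tuple

def verify_then_revise(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    revise: Callable[[str, str], str],
    max_turns: int = 2,
) -> Tuple[str, int]:
    """Return the final answer and the number of revisions actually made."""
    answer = generate(question)
    revisions = 0
    for _ in range(max_turns):
        # Selective revision: only revise when verification flags an error.
        if verify(question, answer):
            break
        answer = revise(question, answer)
        revisions += 1
    return answer, revisions

# Toy stand-ins for the model in its policy and verifier roles.
def toy_generate(question):
    return "5"

def toy_verify(question, answer):
    return answer == "4"

def toy_revise(question, answer):
    return "4"

final, n_rev = verify_then_revise("2 + 2 = ?", toy_generate, toy_verify, toy_revise)
kept, n_kept = verify_then_revise("2 + 2 = ?", lambda q: "4", toy_verify, toy_revise)
```

The second call shows the contrast with always-revise schemes: an answer that passes verification is returned untouched.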


[91] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? cs.CL | cs.CV

Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt

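TL;DR: The paper introduces the TempVS benchmark for evaluating how well multimodal large language models (MLLMs) understand temporal order in image sequences, and finds a significant gap between current models and human performance.

Details

Motivation: To test whether MLLMs truly understand the order of events in image sequences, and to expose weaknesses in their temporal reasoning and grounding.

Result: Current MLLMs perform poorly on TempVS and fall well short of human performance.

Insight: The findings point to directions for future work, including better temporal reasoning and better multimodal fusion.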

Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.


[92] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty cs.CL

Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo

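TL;DR: The paper improves the reasoning efficiency of large language models (LLMs) by adapting the output-length penalty to problem difficulty: shorter outputs on simple problems to cut latency, and sufficient reasoning on complex problems to preserve accuracy.

Details

Motivation: Methods such as Chain-of-Thought prompting improve LLM reasoning but often produce overly long outputs, increasing computational latency. Uniform length penalties ignore problem complexity and hurt performance.

Result: The method is validated on three datasets (GSM8K, MATH500, AIME2024): output length is reduced on the simpler datasets without losing accuracy, and accuracy improves on the harder dataset.

Insight: Adapting the reasoning strategy to the problem markedly improves LLM efficiency and performance, showing that problem complexity should inform how reasoning behavior is shaped.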

Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem’s complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness on simpler problems while preserving sufficient reasoning on more complex ones to maintain accuracy, thus improving the model’s overall performance. Specifically, we manage the model’s reasoning efficiency by dividing the reward function and including a novel penalty for output length. We evaluate the approach on three benchmark datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method effectively shortens output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach improves accuracy.


[93] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers cs.CL

Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu

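TL;DR: The paper reframes table-text alignment as an explanation task in which models must identify the table cells needed to support or refute a scientific claim, and builds a new dataset with human-annotated cell-level rationales to improve the interpretability and performance of scientific claim verification.

Details

Motivation: Models that only predict a verification label lack transparency and interpretability and reveal little about their reasoning. By introducing a table cell alignment task, the paper aims to improve both interpretability and performance.

Result: Adding table alignment information improves claim verification, but most language models, while often predicting the correct label, fail to recover human-aligned rationales, which suggests that their predictions do not stem from faithful reasoning.

Insight: Correct predictions do not imply faithful reasoning. Current language models still fall short on explanation tasks and need further work on interpretability.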

Abstract: Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model’s reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.


[94] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs cs.CL | cs.AI

Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li

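TL;DR: The paper proposes RRP, a framework that combines knowledge graphs with large language models to address the reliability and redundancy of reasoning paths in complex reasoning tasks and thereby improve reasoning.

Details

Motivation: Large language models struggle with knowledge-intensive tasks because they lack background knowledge and tend to hallucinate. Knowledge graphs can supply facts, but producing logically consistent reasoning paths remains difficult.

Result: RRP achieves state-of-the-art performance on two public datasets and integrates flexibly with different LLMs.

Insight: Reliable, logically consistent reasoning paths are essential to LLM reasoning, and combining structural information with semantic ability offers an effective approach to complex questions.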

Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is as important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses two challenges: graph structures are complex, and multiple candidate paths are generated, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.


[95] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors cs.CL | cs.AI | I.2.7

Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal

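TL;DR: The paper explores four approaches for assessing whether an AI tutor correctly identifies mistakes in a student's mathematical reasoning. The best is a retrieval-augmented few-shot prompting system combined with LLM reasoning.

Details

Motivation: To improve AI tutors' ability to identify mistakes in pedagogical feedback, and thereby improve their teaching effectiveness.

Result: The retrieval-augmented prompting system outperforms all baselines, demonstrating its advantage for pedagogical feedback assessment.

Insight: Combining example-driven prompting with LLM reasoning effectively improves mistake identification for AI tutors.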

Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor’s response correctly identifies a mistake in a student’s mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
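
The retrieve-then-prompt pattern of the final system can be sketched as follows. This is a simplified illustration under stated assumptions: bag-of-words cosine similarity stands in for the semantic embeddings the system presumably uses, and `retrieve_examples`, `build_prompt`, the example texts, and the labels are all hypothetical rather than taken from the authors' repository.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve_examples(query, labelled_examples, k=2):
    """Return the k labelled examples most similar to the query."""
    q = Counter(query.lower().split())
    return sorted(
        labelled_examples,
        key=lambda ex: cosine(q, Counter(ex["text"].lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query, examples):
    """Assemble a few-shot prompt that ends with the unlabelled query."""
    parts = ["Does the tutor response identify the student's mistake?"]
    for ex in examples:
        parts.append(f"Response: {ex['text']}\nLabel: {ex['label']}")
    parts.append(f"Response: {query}\nLabel:")
    return "\n\n".join(parts)

pool = [
    {"text": "check the sign when you subtract", "label": "Yes"},
    {"text": "great job keep going", "label": "No"},
    {"text": "you forgot to carry the one when you add", "label": "Yes"},
]
query = "look again at the sign when you subtract the terms"
top = retrieve_examples(query, pool, k=2)
prompt = build_prompt(query, top)
```

The prompt would then be sent to the LLM, with the model's reply parsed against a fixed label schema.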


[96] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models cs.CL | cs.AI | cs.LG

Ye Yu, Yaoning Yu, Haohan Wang

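TL;DR: PREMISE is a prompt optimization framework that requires no changes to model weights. Using multi-objective textual search and gradient-inspired optimization, it substantially reduces redundant computation and token usage in large reasoning models while maintaining high accuracy.

Details

Motivation: Existing large reasoning models (LRMs) use lengthy chain-of-thought (CoT) reasoning on mathematical tasks, which drives up token usage and cost and limits deployment in latency-sensitive or API-constrained settings. PREMISE aims to reduce reasoning overhead by optimizing prompts.

Result: On GSM8K, SVAMP, and Math500, PREMISE matches or exceeds baseline accuracy (Claude 96%→96%, Gemini 91%→92%) while reducing token usage by up to 87.5% and cost by 69–82%.

Insight: Prompt-level optimization is a viable route to efficient LRM inference. It needs no model modification and applies to commercial LLMs, and it could be extended to other reasoning tasks and settings.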

Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy ($96\% \rightarrow 96\%$ with Claude, $91\% \rightarrow 92\%$ with Gemini) while reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by $69$–$82\%$. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.


[97] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims cs.CL | cs.IR

Priyanka Kargupta, Runchu Tian, Jiawei Han

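TL;DR: The paper proposes ClaimSpect, a retrieval-augmented generation framework that automatically builds a hierarchical analysis of nuanced claims, deconstructing each claim and surfacing the different perspectives found in a corpus.

Details

Motivation: Many real-world claims, such as scientific or political ones, are nuanced and cannot be fully assessed as simply “true” or “false”. The paper aims to offer a more complete view through hierarchical analysis and multi-perspective validation.

Result: Experiments on a dataset of scientific and political claims show that ClaimSpect deconstructs nuanced claims effectively and outperforms multiple baselines.

Insight: Hierarchical analysis and multi-perspective retrieval offer a more complete, structured way to assess nuanced claims and help readers focus on the specific aspects that interest them.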

Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false” – as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.


[98] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs cs.CL

Alberto Testoni, Iacer Calixto

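TL;DR: The paper presents a fine-grained evaluation of uncertainty estimation methods for large language models (LLMs) on clinical question answering, comparing models and methods across medical specialties and question types, and explores lightweight single-generation estimators.

Details

Motivation: Accurate, well-calibrated uncertainty estimates are essential when LLMs are used in high-stakes domains such as clinical decision support. Existing work lacks a detailed analysis of how LLMs differ across medical specialties and question types, which calls for a systematic evaluation.

Result: Results vary substantially across medical specialties and question types, underscoring the importance of choosing a model based on the nature of the question and the model's strengths. The lightweight methods approach the performance of Semantic Entropy while requiring only one generation.

Insight: 1. The diversity of medical specialties and question types has a significant effect on how well LLM uncertainty estimation performs.
2. Lightweight methods lower computational cost while retaining performance, which makes them suitable for practical deployment.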

Abstract: Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.
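
As a rough illustration of the sampling-based baseline, the sketch below computes entropy over repeated answers to one multiple-choice question. This is a simplification and an assumption of this note: Semantic Entropy proper groups free-text generations by meaning, whereas here the chosen option letter stands in for the semantic cluster, and `answer_entropy` is a hypothetical helper.

```python
import math
from collections import Counter

def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Ten samples that all agree: the model is consistent, so entropy is zero.
confident = answer_entropy(["B"] * 10)

# Samples spread evenly over four options: entropy reaches log(4).
uncertain = answer_entropy(["A", "B", "C", "D", "A", "B", "C", "D"])
```

The cost is the reason single-pass estimators are attractive: this baseline needs many generations per question, while the lightweight methods in the paper need one.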


[99] Improving Named Entity Transcription with Contextual LLM-based Revision cs.CL | cs.AI

Viet Anh Trinh, Xinlu He, Jacob Whitehill

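TL;DR: The paper proposes a revision method based on large language models (LLMs) that uses the LLM's reasoning ability and contextual information (e.g., lecture notes) to correct named-entity transcription errors in ASR output, achieving up to a 30% relative WER reduction for named entities on a newly introduced dataset.

Details

Motivation: Current ASR systems perform well on general speech, but error rates on named entities remain high, which affects downstream applications. A method dedicated to improving named-entity transcription is therefore needed.

Result: On the NER-MIT-OpenCourseWare dataset, WER on named entities is reduced by up to 30% in relative terms.

Insight: The contextual reasoning ability of LLMs can effectively correct named-entity errors in ASR output, and adding domain knowledge such as lecture notes improves results further.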

Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.
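
For readers unfamiliar with the metric, the sketch below shows how word error rate and a relative reduction are computed. The sentences are invented for illustration, and the sketch scores whole sentences, whereas the paper reports WER restricted to named entities.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())

def relative_reduction(baseline, revised):
    """Fraction of the baseline error rate that the revision removed."""
    return (baseline - revised) / baseline

reference = "fourier analysis uses the gaussian kernel"
asr_output = "for your analysis uses the gauss in kernel"      # two entities misrecognized
revised_output = "fourier analysis uses the gauss in kernel"   # one entity corrected

baseline_wer = wer(reference, asr_output)
revised_wer = wer(reference, revised_output)
gain = relative_reduction(baseline_wer, revised_wer)
```

Here correcting one of two misrecognized entities halves the error count, so the relative reduction is 50% even though the absolute WER only drops from 4/6 to 2/6.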


[100] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints cs.CL

Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens

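TL;DR: LangEdit is a new framework that uses null-space constraints to mitigate negative interference in multilingual sequential knowledge editing, keeping language-specific knowledge updates independent while preserving multilingual generalization.

Details

Motivation: When multilingual large language models (LLMs) are updated, sequential edits across languages cause parameter interference that degrades multilingual generalization and the accuracy of the edited knowledge. This challenge needs to be addressed.

Result: Evaluated on three model architectures, six languages, and four downstream tasks, LangEdit substantially reduces parameter interference and outperforms existing editing methods.

Insight: Null-space constraints offer an efficient and mathematically interpretable solution for multilingual knowledge updates, opening a new direction for multilingual LLM editing.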

Abstract: Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previous updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.
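
The core projection can be sketched on toy vectors. This is a simplified illustration under stated assumptions, not the LangEdit implementation (which operates on model weight updates and may construct the subspace differently): earlier updates are treated as plain vectors, and `project_onto_complement` is a hypothetical helper that removes from a new update every component lying in their span.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis

def project_onto_complement(update, previous_updates):
    """Remove from `update` every component lying in the span of earlier updates."""
    projected = list(update)
    for b in orthonormal_basis(previous_updates):
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected

previous = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # updates from earlier languages
new_update = [2.0, 3.0, 4.0]                   # raw update for the next language
safe_update = project_onto_complement(new_update, previous)
```

The projected update is orthogonal to every earlier update direction, which is the sense in which later edits cannot overwrite what earlier edits changed.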


[101] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization cs.CL

Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu

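TL;DR: ReCUT balances reasoning length and accuracy in LLMs through stepwise exploration and a long-short switched sampling strategy, substantially shortening reasoning while maintaining accuracy.

Details

Motivation: Existing CoT prompting methods suffer from overthinking, which produces lengthy or redundant reasoning traces, and existing remedies are limited by data quality and overfitting.

Result: On multiple math reasoning datasets, reasoning length drops by 30-50% while accuracy is maintained or improved.

Insight: By balancing reasoning length against accuracy, ReCUT reduces computational overhead while preserving reasoning quality, offering a new approach to efficient LLM reasoning.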

Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue by curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data, and they are prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of the reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs), one optimized for reasoning accuracy and the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via https://github.com/NEUIR/ReCUT.
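
The final merging step can be sketched as a weighted average of matching parameters. The abstract only says the two models' parameters are interpolated; the single coefficient `alpha`, the function `interpolate_parameters`, and the dictionary-of-lists layout are assumptions made here for illustration.

```python
def interpolate_parameters(accuracy_params, short_params, alpha=0.5):
    """Blend two models: alpha weights the accuracy-optimized model,
    (1 - alpha) weights the brevity-optimized one."""
    if accuracy_params.keys() != short_params.keys():
        raise ValueError("both models must have the same parameter names")
    merged = {}
    for name in accuracy_params:
        a, s = accuracy_params[name], short_params[name]
        if len(a) != len(s):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, s)]
    return merged

# Toy parameter dictionaries standing in for two fine-tuned checkpoints.
accuracy_model = {"w": [1.0, 2.0], "b": [0.0]}
short_model = {"w": [3.0, 0.0], "b": [1.0]}
merged = interpolate_parameters(accuracy_model, short_model, alpha=0.25)
```

Moving `alpha` toward 1 favors the accuracy-optimized model and moving it toward 0 favors shorter reasoning, so one coefficient exposes the length versus accuracy trade-off.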


[102] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training cs.CL | cs.IR

Alireza Salemi, Mukta Maddipatla, Hamed Zamani

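TL;DR: The paper presents mRAG, a multi-agent retrieval-augmented generation framework that optimizes inter-agent collaboration through self-training and reward-guided trajectory sampling, improving generation on complex tasks and performing well in the competition.

Details

Motivation: Conventional retrieval-augmented generation (RAG) methods are limited on complex tasks and lack multi-agent collaboration. To overcome this, the authors propose the multi-agent framework mRAG to optimize task decomposition and collaboration.

Result: mRAG outperforms conventional RAG baselines in the SIGIR 2025 LiveRAG competition, demonstrating its generation ability on complex tasks.

Insight: 1. Multi-agent collaboration can significantly improve generation quality on complex tasks.
2. Self-training with reward-guided trajectory sampling is an effective way to optimize agent collaboration.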

Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework’s strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.


[103] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles cs.CL | cs.AI | cs.LG

Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang

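TL;DR: The paper proposes SlowFast Sampling, a dynamic sampling strategy that alternates between exploratory and accelerated decoding stages to substantially improve the inference efficiency of diffusion language models, and combines it with dLLM-Cache to reduce redundant computation.

Details

Motivation: Existing sampling strategies for diffusion language models, such as confidence-based or semi-autoregressive decoding, behave statically, which limits efficiency and flexibility. A more efficient dynamic sampling method is therefore needed.

Result: SlowFast Sampling achieves up to a 15.63x speedup on LLaDA, and up to 34.22x when combined with caching, with higher throughput than the autoregressive baseline LLaMA3 8B.

Insight: A well-designed sampling strategy can unlock the potential of diffusion language models for fast, high-quality text generation.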

Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.


[104] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models cs.CL | eess.AS

Michele Gubian, Ioana Krehan, Oli Liu, James Kirby, Sharon Goldwater

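TL;DR: The paper studies how wav2vec2 self-supervised speech models pretrained on different languages represent phonetic, tonal, and speaker information, showing that these are encoded in largely orthogonal subspaces and that the patterns are shared across languages.

Details

Motivation: Analyses of self-supervised speech models have focused mainly on English. The paper examines how wav2vec2 models pretrained on different languages encode phones, lexical tones, and speaker information, filling a gap in multilingual analysis.

Result: For all pretraining languages, the subspaces encoding phones, tones, and speakers are largely orthogonal and layerwise probing patterns are similar, with only a small advantage in the later layers for matched-language phone and tone probes.

Insight: The structure of the representations learned by wav2vec2 is largely independent of the speech material used in pretraining and may generalize across languages.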

Abstract: Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.


[105] Slimming Down LLMs Without Losing Their Minds cs.CL | cs.AI

Qingda, Mai

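TL;DR: The paper studies how parameter-efficient fine-tuning methods (LoRA and QLoRA) affect the performance of large language models (LLMs), validating them on commonsense reasoning, mathematical reasoning, and multi-domain knowledge tasks, and stresses the importance of aligning the fine-tuning dataset with the target task.

Details

Motivation: As large language models grow, demand for parameter-efficient fine-tuning grows with them. The author aims to validate how well these methods improve task-specific performance and how computationally efficient they are.

Result: LoRA-based methods improve task-specific performance efficiently, and target-task performance depends strongly on the choice of fine-tuning dataset.

Insight: The study offers theoretical grounding and practical guidance for developers adapting LLMs with limited resources, and highlights the key role of data alignment in fine-tuning.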

Abstract: This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.


[106] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers cs.CL | cs.LG

Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi

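TL;DR: The paper investigates the root of the generalization and hallucination behaviors that large language models (LLMs) exhibit when fine-tuned on new knowledge, attributes both to out-of-context reasoning (OCR), and supports this with experiments and theoretical analysis.

Details

Motivation: LLMs can acquire new knowledge through fine-tuning, but the puzzling coexistence of generalization and hallucination has not been well explained. The paper aims to identify the mechanism common to both behaviors.

Result: Experiments show that OCR drives both generalization and hallucination. Theoretical analysis shows that gradient descent favors solutions that minimize the nuclear norm of the combined output-value matrix, which explains the model's sample-efficient learning.

Insight: Generalization and hallucination during knowledge injection stem from the same underlying mechanism (OCR), and the implicit bias of gradient descent is the core reason the model learns to associate facts and implications.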

Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
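
For reference, the nuclear norm mentioned in the abstract has a standard definition, written out below. The symbols $W_O$ and $W_V$ for the output and value matrices, and the ordering of their product, are notational assumptions of this note rather than the paper's own notation. The second line is a standard matrix identity, not a result quoted from the abstract; it suggests why a factorized parameterization can behave differently from a single combined weight matrix under gradient descent.

```latex
\[
  \|W\|_{*} \;=\; \sum_{i=1}^{\operatorname{rank}(W)} \sigma_i(W),
  \qquad W = W_O W_V ,
\]
\[
  \|W\|_{*} \;=\; \min_{A,\,B \,:\, AB = W} \tfrac{1}{2}\left( \|A\|_F^{2} + \|B\|_F^{2} \right).
\]
```

Here $\sigma_i(W)$ are the singular values of $W$, so a small nuclear norm favors low-rank solutions.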


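[107] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP cs.CL | cs.AI

Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard

TL;DR: BioClinical ModernBERT is a long-context encoder for biomedical and clinical NLP tasks that improves speed and performance through large-scale domain adaptation and long-context processing.

Details

Motivation: Encoders for biomedical and clinical NLP have developed more slowly than decoder models, which has limited their domain adaptation; an efficient, domain-adapted encoder is needed.

Result: The model outperforms existing encoders on four downstream tasks, and base and large versions are released along with training checkpoints to support further research.

Insight: Multi-source data and large-scale pretraining are key to improving biomedical and clinical NLP models.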

Abstract: Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.


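[108] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning cs.CL

Lan Zhang, Marco Valentino, Andre Freitas

TL;DR: The paper proposes a systematic automatic evaluation method based on an ensemble of LLM judges (EFG) for autoformalization in formal mathematical reasoning; multi-dimensional criteria make the assessment more transparent and effective.

Details

Motivation: Current LLM-judge evaluation of autoformalization is too coarse-grained to capture the nuanced, multi-dimensional quality that advanced mathematical reasoning requires, so a more systematic evaluation framework is needed.

Result: Experiments show that the EFG ensemble correlates more strongly with human assessments than a coarse-grained model when evaluating formal mathematical reasoning, especially on formal quality.

Insight: Guiding LLM judges with a well-defined set of atomic properties can yield a scalable, interpretable, and reliable automatic evaluation framework, particularly suited to complex formal mathematical reasoning.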

Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.


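[109] Magistral cs.CL

Mistral-AI: Abhinav Rastogi, Albert Q. Jiang, Andy Lo

TL;DR: Magistral is Mistral's first reasoning model, built on a scalable reinforcement learning (RL) pipeline. The work explores the limits of pure RL training of large language models (LLMs) and presents a simple method to force the model's reasoning language.

Details

Motivation: Existing approaches typically rely on existing implementations or on RL traces distilled from prior models. Magistral takes a ground-up approach, using only Mistral's own models and infrastructure, to explore pure RL training.

Result: RL on text data alone maintains most of the initial checkpoint's capabilities, and it maintains or improves multimodal understanding, instruction following, and function calling.

Insight: Pure RL training is a viable LLM training strategy that can improve model performance without relying on traces distilled from prior models.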

Abstract: We introduce Magistral, Mistral’s first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint’s capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.


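[110] Dynamic Epistemic Friction in Dialogue cs.CL

Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky

TL;DR: The paper examines "dynamic epistemic friction" in collaboration between large language models (LLMs) and humans, that is, the resistance that arises when new information conflicts with existing beliefs, and proposes a model grounded in Dynamic Epistemic Logic to predict belief updates in dialogue.

Details

Motivation: Despite progress in aligning LLMs with human preferences, existing approaches neglect the epistemic friction involved in belief updating, which leads to weak performance under conflicting or ambiguous information.

Result: The model effectively predicts belief updates in dialogue, and the authors discuss how it can be refined into a finer-grained measure of belief alignment for complex real-world dialogue scenarios.

Insight: Dynamic epistemic friction offers a new perspective on belief conflict and belief updating in human-AI collaboration and may help improve LLM behavior in adversarial settings.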

Abstract: Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of “epistemic friction,” or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent’s current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.


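[111] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training cs.CL | cs.AI | cs.LG

Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu

TL;DR: Domain2Vec is a training-free approach that optimizes the data mixture by decomposing datasets into linear combinations of meta-domains, substantially reducing computational overhead.

Details

Motivation: Choosing the optimal data mixture is an important problem in language model pretraining. Conventional methods require many training runs and are computationally expensive. This paper proposes a training-free method that finds the best mixture by vectorizing datasets.

Result: On Pile-CC, the method reaches the same validation loss as the original mixture with only 51.5% of the computation; under the same compute budget, downstream task performance improves by 2.83% on average.

Insight: The Distribution Alignment Assumption provides a theoretical basis for data-mixture optimization; Domain2Vec's vectorization improves the efficiency and scalability of prior methods.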

Abstract: We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
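
One way to read the distribution-alignment idea is as a small optimization problem: given a domain vector per training dataset and one for the validation set, look for mixture weights on the simplex whose mixed domain vector is close to the validation vector. The sketch below is an illustration under assumptions, not the paper's procedure: the squared L2 distance, the exponentiated-gradient update, the function name `mix_weights`, and all numbers are hypothetical.

```python
import numpy as np

def mix_weights(domain_vecs, target, steps=2000, lr=0.5):
    """Exponentiated-gradient search for simplex weights w such that
    w @ domain_vecs is close to target in squared L2 distance."""
    k = domain_vecs.shape[0]
    w = np.full(k, 1.0 / k)  # start from the uniform mixture
    for _ in range(steps):
        residual = w @ domain_vecs - target
        grad = 2.0 * domain_vecs @ residual
        w = w * np.exp(-lr * grad)  # multiplicative update keeps w > 0
        w /= w.sum()                # renormalize onto the simplex
    return w

# Hypothetical domain vectors: 3 training datasets over 4 meta-domains.
train_vecs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
true_w = np.array([0.5, 0.3, 0.2])
val_vec = true_w @ train_vecs  # validation vector built from a known mixture
w = mix_weights(train_vecs, val_vec)
```

Because the validation vector here is constructed from a known mixture, the search should recover weights close to `true_w`. No language model is trained at any point, which is the sense in which such a search is training-free.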


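[112] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? cs.CL

Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel

TL;DR: This paper studies how well reasoning models identify and recover from four types of unhelpful thoughts. Models identify them effectively but struggle to recover, with larger models doing worse in some cases, and the authors call for better self-reevaluation.

Details

Motivation: The study asks whether reasoning models can self-reevaluate effectively, that is, identify and correct unhelpful thoughts, which matters for reasoning accuracy and safety.

Result: (1) Models identify unhelpful thoughts effectively but struggle to recover from them; (2) larger models do worse when short irrelevant thoughts are injected; (3) the smallest models are the least distracted by harmful-response-triggering thoughts.

Insight: Models' self-reevaluation abilities are still immature and need improvement to achieve more reliable and safer reasoning systems; scale does not always improve performance.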

Abstract: Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general “meta-cognitive” awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.


cs.MM

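[113] Multimodal Large Language Models: A Survey cs.MM | cs.AI | cs.CL

Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker

TL;DR: This survey covers the development of multimodal large language models (MLLMs), analyzing how they have expanded from text generation alone to outputs such as images, music, and video. It examines how core techniques, namely self-supervised learning, mixture of experts, reinforcement learning from human feedback, and chain-of-thought prompting, enable cross-modal capabilities, and it summarizes architectural trends, cross-modal synergies, and open challenges.

Details

Motivation: MLLMs have developed rapidly and now reach well beyond text generation to images, audio, and other modalities. A systematic categorization and analysis of existing work is needed to understand their technical foundations and future directions.

Result: The survey summarizes the current state of MLLMs, points out the potential of cross-modal synergy, and identifies directions for future research, such as more general-purpose and interpretable models.

Insight: Cross-modal capabilities depend on combining foundational techniques, for example self-supervised learning together with reinforcement learning. The remaining challenge is to further improve scalability and structured reasoning.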

Abstract: Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Architectural innovations like transformers and diffusion models underpin this convergence, enabling cross-modal transfer and modular specialization. We highlight emerging patterns of synergy, and identify open challenges in evaluation, modularity, and structured reasoning. This survey offers a unified perspective on MLLM development and identifies critical paths toward more general-purpose, adaptive, and interpretable multimodal systems.


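[114] EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis cs.MM | cs.AI | cs.CV | cs.RO

Jianwu Fang, Lei-Lei Li, Zhedong Zheng, Hongkai Yu, Jianru Xue

TL;DR: EQ-TAA proposes a diffusion-based accident video synthesis method that generates causal video frames to improve accident anticipation, mitigates data bias, and trains without extra annotations.

Details

Motivation: Current traffic accident anticipation methods depend on large amounts of annotated data and are easily affected by data bias. To address this, the authors synthesize the causal part of accident videos.

Result: Experiments show that EQ-TAA performs competitively with state-of-the-art methods; the synthesis is designed to mitigate data bias.

Insight: Synthesizing the causal part of a video can substantially reduce annotation requirements while improving anticipation in real accident scenes.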

Abstract: Traffic Accident Anticipation (TAA) in traffic scenes is a challenging problem for achieving zero fatalities in the future. Current approaches typically treat TAA as a supervised learning task needing the laborious annotation of accident occurrence duration. However, the inherent long-tailed, uncertain, and fast-evolving nature of traffic scenes has the problem that real causal parts of accidents are difficult to identify and are easily dominated by data bias, resulting in a background confounding issue. Thus, we propose an Attentive Video Diffusion (AVD) model that synthesizes additional accident video clips by generating the causal part in dashcam videos, i.e., from normal clips to accident clips. AVD aims to generate causal video frames based on accident or accident-free text prompts while preserving the style and content of frames for TAA after video generation. This approach can be trained using datasets collected from various driving scenes without any extra annotations. Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant triple loss for an anchor accident-free video clip, along with the generated pair of contrastive pseudo-normal and pseudo-accident clips. Extensive experiments have been conducted to evaluate the performance of AVD and EQ-TAA, and competitive performance compared to state-of-the-art methods has been obtained.


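[115] HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction cs.MM | cs.AI | cs.CV | cs.LG

Jie Qin, Wei Yang, Yan Su, Yiran Zhu, Weizhen Li

TL;DR: The paper proposes a flexible multi-modal input framework based on dynamic bidirectional reconstruction for HER2 expression prediction, markedly improving accuracy and adaptability for both single- and dual-modality inputs.

Details

Motivation: Existing HER2 assessment models usually analyze H&E or IHC images in isolation, while clinical practice relies on interpreting both together; acquiring both modalities at once is hindered by workflow complexity and cost.

Result: Single-modality H&E accuracy rises from 71.44% to 94.25%, dual-modality accuracy reaches 95.09%, and IHC-only input maintains 90.28% reliability.

Insight: The "dual-preferred, single-compatible" design achieves near-bimodal performance without synchronized acquisition, which especially suits resource-limited healthcare settings.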

Abstract: Current HER2 assessment models for breast cancer predominantly analyze H&E or IHC images in isolation, despite clinical reliance on their synergistic interpretation. However, concurrent acquisition of both modalities is often hindered by workflow complexity and cost constraints. We propose an adaptive bimodal framework enabling flexible single-/dual-modality HER2 prediction through three innovations: 1) A dynamic branch selector that activates either single-modality reconstruction or dual-modality joint inference based on input completeness; 2) A bidirectional cross-modal GAN performing context-aware feature-space reconstruction of missing modalities; 3) A hybrid training protocol integrating adversarial learning and multi-task optimization. This architecture elevates single-modality H&E prediction accuracy from 71.44% to 94.25% while achieving 95.09% dual-modality accuracy, maintaining 90.28% reliability with sole IHC inputs. The framework’s “dual-preferred, single-compatible” design delivers near-bimodal performance without requiring synchronized acquisition, particularly benefiting resource-limited settings through IHC infrastructure cost reduction. Experimental validation confirms 22.81%/12.90% accuracy improvements over H&E/IHC baselines respectively, with cross-modal reconstruction enhancing F1-scores to 0.9609 (HE to IHC) and 0.9251 (IHC to HE). By dynamically routing inputs through reconstruction-enhanced or native fusion pathways, the system mitigates performance degradation from missing data while preserving computational efficiency (78.55% parameter reduction in lightweight variant). This elastic architecture demonstrates significant potential for democratizing precise HER2 assessment across diverse healthcare settings.
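
The routing rule and the reported gains can be made concrete with a short sketch. The function name `select_branch` and the branch labels are hypothetical; only the routing logic (dual-modality joint inference when both inputs are present, reconstruction of the missing modality otherwise) and the percentages come from the abstract. The implied IHC baseline assumes the reported improvements are absolute percentage points.

```python
def select_branch(has_he: bool, has_ihc: bool) -> str:
    """Pick a processing path from input completeness (labels are illustrative)."""
    if has_he and has_ihc:
        return "dual_modality_joint_inference"
    if has_he:
        return "reconstruct_ihc_from_he"
    if has_ihc:
        return "reconstruct_he_from_ihc"
    raise ValueError("at least one modality is required")

# Consistency check on the reported numbers (absolute percentage points).
he_gain = round(94.25 - 71.44, 2)        # matches the reported 22.81 gain
ihc_baseline = round(90.28 - 12.90, 2)   # baseline implied by the 12.90 gain
```

The H&E figures are internally consistent: 94.25 minus 71.44 gives the stated 22.81-point gain. Under the same reading, the IHC-only baseline would be 77.38%, a value the abstract does not state directly.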


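[116] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space cs.MM | cs.AI | cs.CV

Kangwei Liu, Junwu Liu, Xiaowei Yi, Jinlin Guo, Yun Cao

TL;DR: This paper proposes a diffusion-based method for generating 3D facial animation that uses a unified representation of multimodal signals (text, audio, emotion labels) and a diffusion model to increase the diversity and controllability of emotional expression.

Details

Motivation: Current audio-driven 3D facial animation methods mostly rely on single-modal control signals and deterministic mappings, which limits the diversity and flexibility of emotional expression.

Result: Experiments show a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics.

Insight: A unified multimodal representation combined with a diffusion model markedly improves the diversity and controllability of emotional expression, offering a new approach to 3D facial animation.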

Abstract: Audio-driven emotional 3D facial animation encounters two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mapping that constrains the stochastic nature of emotional expressions and non-verbal behaviors, limiting the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics. Project Page: https://kangweiiliu.github.io/Control_3D_Animation.
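
The abstract says the modalities are aligned through contrastive learning but does not name the loss. A common choice for this kind of alignment is a symmetric InfoNCE objective, sketched below in NumPy purely as an assumption; the function name `info_nce`, the temperature, and the random embeddings are all hypothetical.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss; row i of `a` and row i of `b` form a positive pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # cosine similarities, scaled

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))    # positives sit on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))                        # hypothetical audio embeddings
text_aligned = audio + 0.05 * rng.normal(size=(8, 16))  # near-matching text embeddings
text_shuffled = text_aligned[::-1]                      # pairs deliberately mismatched

aligned_loss = info_nce(audio, text_aligned)
shuffled_loss = info_nce(audio, text_shuffled)
```

The loss is low when matching rows are close in the shared space and high when the pairing is broken, which is the property a multimodal binding stage relies on.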


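[117] Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics cs.MM | cs.AI | cs.CV

Yi-Chun Chen

TL;DR: This paper proposes a hierarchical knowledge graph framework for the structured understanding of multimodal media such as comics. It decomposes narrative content at multiple levels and builds integrated knowledge graphs that support symbolic reasoning tasks.

Details

Motivation: Visual narratives such as comics carry complex multimodal information (images and text), and a structured method is needed to capture their semantic, spatial, and temporal relationships.

Result: The framework is validated on a manually annotated subset of the Manga109 dataset, with high precision and recall on tasks such as action retrieval and dialogue tracing.

Insight: Hierarchical knowledge graphs can effectively model the complexity of visual narratives and provide a scalable foundation for narrative-based analysis, interactive storytelling, and multimodal reasoning.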

Abstract: This paper presents a hierarchical knowledge graph framework for the structured understanding of visual narratives, focusing on multimodal media such as comics. The proposed method decomposes narrative content into multiple levels, from macro-level story arcs to fine-grained event segments. It represents them through integrated knowledge graphs that capture semantic, spatial, and temporal relationships. At the panel level, we construct multimodal graphs that link visual elements such as characters, objects, and actions with corresponding textual components, including dialogue and captions. These graphs are integrated across narrative levels to support reasoning over story structure, character continuity, and event progression. We apply our approach to a manually annotated subset of the Manga109 dataset and demonstrate its ability to support symbolic reasoning across diverse narrative tasks, including action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction. Evaluation results show high precision and recall across tasks, validating the coherence and interpretability of the framework. This work contributes a scalable foundation for narrative-based content analysis, interactive storytelling, and multimodal reasoning in visual media.


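[118] WDMIR: Wavelet-Driven Multimodal Intent Recognition cs.MM | cs.AI | cs.CV | eess.SP

Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun

TL;DR: The paper proposes a wavelet-driven multimodal intent recognition framework (WDMIR) that uses frequency-domain analysis to improve semantic extraction from non-verbal information and raises intent recognition accuracy.

Details

Motivation: Existing methods rely too heavily on text and overlook the rich semantic content in non-verbal information such as video and audio, so intent recognition is incomplete.

Result: The method achieves state-of-the-art performance on the MIntRec dataset, with accuracy up by 1.13%; in ablations, the wavelet-driven fusion module adds 0.41% recognition accuracy on subtle emotional cues.

Insight: Frequency-domain analysis offers a new perspective on multimodal intent recognition; the wavelet transform can capture fine-grained temporal dynamics in non-verbal information.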

Abstract: Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. To be more specific, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% on accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
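
To make "synchronized decomposition and integration in the frequency domain" more concrete, the sketch below applies a one-level Haar wavelet transform along the time axis of two feature sequences, fuses matching sub-bands, and inverts the transform. The Haar wavelet, the plain averaging used as a stand-in for a learned fusion, and the feature shapes are all assumptions; the abstract does not specify the wavelet family or the fusion rule.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar transform along time (axis 0); the length must be even."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-frequency band
    detail = (even - odd) / np.sqrt(2.0)  # high-frequency band
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even and odd samples."""
    x = np.empty((2 * approx.shape[0],) + approx.shape[1:])
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 4))  # hypothetical: 8 time steps, 4 features
audio = rng.normal(size=(8, 4))

v_lo, v_hi = haar_dwt(video)
a_lo, a_hi = haar_dwt(audio)

# Placeholder fusion: average matching sub-bands, then return to the time domain.
fused = haar_idwt(0.5 * (v_lo + a_lo), 0.5 * (v_hi + a_hi))
```

With plain averaging the result equals the time-domain average, because the transform is linear. A learned module would instead weight the low- and high-frequency bands differently, which is where per-band fusion becomes useful.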


cs.AI

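[119] One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence cs.AI | cs.CL

Michelle M. Li, Ben Y. Reis, Adam Rodman, Tianxi Cai, Noa Dagan

TL;DR: This Perspective argues that medical AI must adapt dynamically as clinical context changes, in order to avoid predictions that fail because of contextual errors.

Details

Motivation: Current medical foundation models (language models on clinical notes, vision-language models on medical images, multimodal models on electronic health records) typically need fine-tuning, careful prompting, or knowledge-base retrieval to handle new settings, which can be impractical and limits dynamic adaptation to clinical situations.

Result: The authors envision context-switching AI that diagnoses, manages, and treats diseases across specialties and regions and expands access to medical care.

Insight: Future progress in medical AI depends on solving dynamic context adaptation, which would improve generalization and reliability in clinical use.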

Abstract: Medical foundation models, including language models trained on clinical notes, vision-language models on medical images, and multimodal models on electronic health records, can summarize clinical notes, answer medical questions, and assist in decision-making. Adapting these models to new populations, specialties, or settings typically requires fine-tuning, careful prompting, or retrieval from knowledge bases. This can be impractical, and limits their ability to interpret unfamiliar inputs and adjust to clinical situations not represented during training. As a result, models are prone to contextual errors, where predictions appear reasonable but fail to account for critical patient-specific or contextual information. These errors stem from a fundamental limitation that current models struggle with: dynamically adjusting their behavior across evolving contexts of medical care. In this Perspective, we outline a vision for context-switching in medical AI: models that dynamically adapt their reasoning without retraining to new specialties, populations, workflows, and clinical roles. We envision context-switching AI to diagnose, manage, and treat a wide range of diseases across specialties and regions, and expand access to medical care.


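[120] Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning cs.AI | cs.CL

Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li

TL;DR: The paper introduces the Scientists’ First Exam (SFE) benchmark to evaluate the perception, understanding, and reasoning abilities of multimodal large language models (MLLMs) in scientific domains. Experiments show that current state-of-the-art models still have substantial room for improvement on these tasks.

Details

Motivation: Existing scientific benchmarks mainly evaluate MLLMs' knowledge understanding and neglect perception and reasoning, so a new benchmark is needed to assess scientific cognitive abilities comprehensively.

Result: The state-of-the-art GPT-o3 and InternVL-3 score only 34.08% and 26.52% on SFE, indicating considerable room for MLLMs to improve in scientific domains.

Insight: SFE exposes the shortcomings of MLLMs in scientific cognition and points to directions for further development and application of AI in scientific discovery.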

Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.


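[121] TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving cs.AI | cs.CL

Vincenzo Colle, Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed

TL;DR: The paper introduces TeleMath, the first benchmark dataset designed specifically to evaluate LLMs on mathematical problem solving in telecommunications. It contains 500 question-answer pairs covering a wide range of telecom topics. Experiments show that LLMs explicitly designed for mathematical or logical reasoning perform best, while general-purpose models often struggle even with many parameters.

Details

Motivation: LLMs have improved at general mathematical reasoning, but their ability to solve mathematical problems in specialized domains such as telecommunications remains largely unexplored. TeleMath is proposed to fill this gap.

Result: Models explicitly designed for mathematical or logical reasoning perform best on TeleMath, while general-purpose models often struggle even with many parameters. The dataset and evaluation code have been released.

Insight: On domain-specific mathematical problems, general-purpose LLMs may underperform reasoning-specialized models, which underlines the importance of domain adaptation.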

Abstract: The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnAs generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease result reproducibility and support future research.


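[122] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? cs.AI | cs.CL

Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang

TL;DR: The paper introduces ToxiMol, the first benchmark task for general-purpose multimodal large language models (MLLMs) on molecular toxicity repair, together with the ToxiEval evaluation framework. A systematic assessment of nearly 30 mainstream MLLMs finds promising capabilities in toxicity understanding and molecule editing.

Details

Motivation: Toxicity is a leading cause of failure in early-stage drug development, yet the task of molecular toxicity repair has not been systematically defined or benchmarked.

Result: Current MLLMs still face significant challenges on this task but begin to show promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.

Insight: MLLM capabilities for molecular toxicity repair are still immature, but initial results suggest potential for further development in this area.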

Abstract: Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.


eess.SY

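[123] Energy Aware Camera Location Search Algorithm for Increasing Precision of Observation in Automated Manufacturing eess.SY | cs.CV | cs.SY | 93C85 (Primary), 93B52 (Secondary)

Rongfei Li, Francis Assadian

TL;DR: The paper proposes a camera location search algorithm for visual servoing tasks in automated manufacturing environments, which improves observation precision by moving the camera to locations where image noise is lower.

Details

Motivation: Existing research mostly focuses on designing control and observation architectures and rarely discusses how camera location affects observation quality; in a manufacturing environment, camera location strongly affects image noise levels and therefore observation precision.

Result: Simulations show that the algorithm improves observation precision under a limited energy budget.

Insight: Optimizing camera location is important for the observation precision of visual servoing systems, especially in manufacturing environments where noise varies by location.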

Abstract: Visual servoing technology is well developed and has been applied in many automated manufacturing tasks, especially tool pose alignment. To obtain a full global view of the tools, most applications adopt an eye-to-hand configuration or a cooperative eye-to-hand/eye-in-hand configuration. Most research has focused on developing control and observation architectures for various scenarios, but few papers have discussed the importance of the camera’s location in the eye-to-hand configuration. In a manufacturing environment, the quality of camera estimates may vary significantly from one observation location to another, because the combined effects of environmental conditions produce different noise levels in images shot at different locations. In this paper, we propose an algorithm for the camera’s moving policy so that it explores the camera workspace and searches for the optimal location, where the image noise level is minimized. The algorithm also ensures that, with limited energy available for moving the camera, the camera ends up at a suboptimal location among those already searched if the optimal one is unreachable. Unlike a simple brute-force approach, the algorithm explores the space more efficiently by adapting the search policy as it learns the environment. With the aid of an image averaging technique, the algorithm achieves the desired observation accuracy in eye-to-hand configurations with a single camera, without filtering out high-frequency information in the original image. An automated manufacturing application has been simulated, and the results show that the algorithm improves observation precision with limited energy.
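
The image averaging step rests on a basic statistical fact: averaging N shots with independent zero-mean noise shrinks the noise standard deviation by roughly a factor of √N, without applying any spatial low-pass filter. The sketch below illustrates this with synthetic data; the image size, noise level, and shot count are hypothetical, and real camera noise may not be independent across shots.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(64, 64))  # hypothetical noise-free image
sigma, n_shots = 0.2, 16

# n_shots noisy captures of the same static scene.
shots = clean + rng.normal(0.0, sigma, size=(n_shots, 64, 64))

single_err = np.std(shots[0] - clean)          # noise level of one shot
avg_err = np.std(shots.mean(axis=0) - clean)   # noise level after averaging
ratio = single_err / avg_err                   # expected to be near sqrt(n_shots)
```

Because each pixel is averaged only with itself across shots, sharp edges and fine texture in the scene are preserved. A spatial smoothing filter would blur them instead.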


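[124] Semi-Tensor-Product Based Convolutional Neural Networks eess.SY | cs.AI | cs.CV | cs.SY

Daizhan Cheng

TL;DR: The paper proposes a convolutional neural network (CNN) based on the semi-tensor product (STP). The STP-based convolutional product (CP) of vectors avoids the junk information introduced by padding, and the network is applied to image and third-order signal identification.

Details

Motivation: In conventional convolution, padding can introduce useless information and hurt model performance. This paper addresses the problem by combining the STP with a domain-based CP.

Result: The proposed STP-CNN is applied to image and third-order signal identification.

Insight: Generalizing vector operations through the STP handles vectors of different dimensions and avoids the adverse effects of padding in convolution.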

Abstract: The semi-tensor product (STP) of vectors is a generalization of the conventional inner product of vectors, which allows the factor vectors to be of different dimensions. This paper proposes a domain-based convolutional product (CP). Combining the domain-based CP with the STP of vectors, a new CP is proposed. Since it uses no zero padding or any other padding, it avoids the junk information that padding introduces. Using this CP, an STP-based convolutional neural network (CNN) is developed. Its application to image and third-order signal identification is considered.
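
One construction from the dimension-free STP literature extends the inner product to vectors of different lengths by stretching both to a common length t = lcm(m, n) with Kronecker products against all-ones vectors. The sketch below assumes that construction; the abstract does not give the exact definition used in the paper, so the function name `stp_inner` and the optional normalization by t are assumptions.

```python
import math
import numpy as np

def stp_inner(x, y, normalize=False):
    """Inner product of vectors with possibly different lengths.

    Both vectors are stretched to length t = lcm(len(x), len(y)) by repeating
    each entry, then the ordinary dot product is taken. Some formulations
    divide the result by t; that is exposed here as an option.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m, n = len(x), len(y)
    t = math.lcm(m, n)
    x_ext = np.kron(x, np.ones(t // m))  # [x1,...,x1, x2,...,x2, ...]
    y_ext = np.kron(y, np.ones(t // n))
    value = float(x_ext @ y_ext)
    return value / t if normalize else value

same_dim = stp_inner([1, 2, 3], [4, 5, 6])  # reduces to the usual dot product
mixed_dim = stp_inner([1, 2, 3], [1, 0])    # lengths 3 and 2, stretched to 6
```

When the two vectors have the same length no stretching happens and the ordinary dot product is recovered, which is the sense in which the operation generalizes the conventional inner product. `math.lcm` requires Python 3.9 or later.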


cs.RO [Back]

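[125] A Navigation Framework Utilizing Vision-Language Models cs.RO | cs.AI | cs.CV

Yicheng Duan, Kaiyu Tang

TL;DR: This paper proposes a modular navigation framework that decouples vision-language understanding from action planning. It combines a frozen vision-language model with lightweight planning logic, aiming for flexible, fast, and adaptable navigation.

Details

Motivation: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate unfamiliar environments. Existing large vision-language models perform well on multimodal understanding but are computationally expensive and hard to deploy in real time.

Result: Generalization to new environments remains challenging, but the modular design lays a foundation for efficient and scalable navigation systems.

Insight: Stronger environmental priors and broader multimodal input integration are promising ways to further improve navigation performance.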

Abstract: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.


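[126] EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence cs.RO | cs.CV

Wang Xinjie, Liu Liu, Cao Yu, Wu Ruiqi, Qin Wenkang

TL;DR: EmbodiedGen is a platform for generating high-quality, controllable, and photorealistic 3D assets for embodied intelligence tasks. It uses generative AI to address the high cost and limited realism of traditional 3D assets.

Details

Motivation: The high production cost and limited realism of traditional 3D assets hold back the scalability that embodied intelligence tasks require. EmbodiedGen aims to provide low-cost, diverse 3D world generation through generative AI.

Result: The generated 3D assets are highly realistic, have accurate physical properties, and can be used directly for training and evaluating embodied intelligence tasks. The code is open source.

Insight: EmbodiedGen shows the potential of generative AI to relieve the scarcity of 3D assets and offers a new tool for scalability and generalization in embodied intelligence research.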

Abstract: Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low-cost accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets that are manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable, and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation, and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the generalization and evaluation challenges faced by embodied-intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.


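[127] Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop cs.RO | cs.CV

Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi

TL;DR: EyeRobot is a robotic system that combines a mechanical eyeball with reinforcement learning, jointly training hand and eye actions to complete real-world tasks.

Details

Motivation: Humans complete tasks by actively observing their environment. Inspired by this, the authors built EyeRobot to mimic human active gaze and improve a robot's manipulation ability over large workspaces.

Result: Experiments show that EyeRobot performs well on five panoramic workspace tasks, tracking targets effectively while ignoring distractors, and manipulates efficiently over large workspaces.

Insight: 1) A high-resolution, foveal-inspired gaze policy helps produce stable fixation and object tracking. 2) Hand-eye coordination that emerges naturally can markedly improve task completion.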

Abstract: Humans do not passively observe the visual world – we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by first collecting teleoperated demonstrations paired with a 360 camera. This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of robot demonstrations. We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. EyeRobot implements a foveal-inspired policy architecture allowing high resolution with a small compute budget, which we find also leads to the emergence of more stable fixation as well as improved ability to track objects and ignore distractors. We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring manipulation in an arc surrounding the robot arm. Our experiments suggest EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate manipulation over large workspaces with a single camera. See project site for videos: https://www.eyerobot.net/


cs.CR [Back]

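[128] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models cs.CR | cs.CL

Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma

TL;DR: GenBreak is a framework that fine-tunes a large language model (LLM) to systematically explore vulnerabilities in text-to-image (T2I) generators. It combines supervised fine-tuning with reinforcement learning to produce adversarial prompts that both bypass safety mechanisms and yield highly toxic images.

Details

Motivation: Current T2I models such as Stable Diffusion can be misused to generate harmful content, but existing adversarial-attack research is limited: the prompts are either easily detected or fail to produce genuinely harmful outputs. GenBreak aims to fill this gap with a reliable tool for evaluating the safety of T2I models.

Result: The generated adversarial prompts are highly effective in black-box attacks against commercial T2I models, revealing serious safety weaknesses.

Insight: Generating adversarial prompts with an LLM is an efficient way to discover T2I vulnerabilities, and reinforcement learning with multiple reward signals can markedly improve the effectiveness and diversity of the prompts.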

Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical and concerning safety weaknesses.


cs.SD [Back]

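[129] PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs cs.SD | cs.AI | cs.CL | eess.AS

Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson

TL;DR: This paper studies how audio encoders interact with LLMs. It proposes delaying audio integration, routing audio through the attention submodule only, and integrating an ensemble of diverse encoders to optimize cross-modal information transfer, yielding substantial performance gains.

Details

Motivation: Although audio-LLM applications are advancing quickly, the mechanism by which semantic representations transfer from audio encoders to LLMs is still unclear. The study aims to optimize this interaction and improve the LLM's ability to probe audio information.

Result: The final architecture achieves relative improvements of 10% to 60% over the baseline, supporting the proposed optimization of cross-modal information transfer.

Insight: Delayed audio integration and the simplified attention-only design are the key elements; an ensemble of diverse encoders markedly broadens the range of audio information the LLM can handle.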

Abstract: The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities (e.g., audio reasoning), have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM’s ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation of how architectural design choices affect that ability. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM’s initial layers have established textual context enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through the LLM layer’s attention submodule, without requiring propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM’s capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10% to 60% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/


cs.GR [Back]

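[130] Learning-based density-equalizing map cs.GR | cs.CV | cs.LG

Yanwen Huang, Lok Ming Lui, Gary P. T. Choi

TL;DR: This paper proposes a learning-based density-equalizing mapping framework (LDEM) that uses deep neural networks to improve on traditional density-equalizing map methods, addressing their limits in accuracy, overlaps, and extension from 2D to 3D.

Details

Motivation: Traditional density-equalizing map methods rely on numerical solvers or handcrafted energy functionals. They have limited accuracy, produce overlaps in extreme cases, and are hard to extend from 2D to 3D. The paper addresses these problems with a learning-based approach.

Result: LDEM shows better density equalization and bijectivity than traditional methods on both simple and complex density distributions. It also applies directly to 3D settings, making it easier to extend.

Insight: Deep learning can replace traditional numerical methods for geometric problems, giving more efficient and robust solutions, especially when no explicitly designed energy functional is required.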

Abstract: Density-equalizing map (DEM) serves as a powerful technique for creating shape deformations with the area changes reflecting an underlying density function. In recent decades, DEM has found widespread applications in fields such as data visualization, geometry processing, and medical imaging. Traditional approaches to DEM primarily rely on iterative numerical solvers for diffusion equations or optimization-based methods that minimize handcrafted energy functionals. However, these conventional techniques often face several challenges: they may suffer from limited accuracy, produce overlapping artifacts in extreme cases, and require substantial algorithmic redesign when extended from 2D to 3D, due to the derivative-dependent nature of their energy formulations. In this work, we propose a novel learning-based density-equalizing mapping framework (LDEM) using deep neural networks. Specifically, we introduce a loss function that enforces density uniformity and geometric regularity, and utilize a hierarchical approach to predict the transformations at both the coarse and dense levels. Our method demonstrates superior density-equalizing and bijectivity properties compared to prior methods for a wide range of simple and complex density distributions, and can be easily applied to surface remeshing with different effects. Also, it generalizes seamlessly from 2D to 3D domains without structural changes to the model architecture or loss formulation. Altogether, our work opens up new possibilities for scalable and robust computation of density-equalizing maps for practical applications.


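[131] Edit360: 2D Image Edits to 3D Assets from Any Angle cs.GR | cs.CV

Junchao Huang, Xinting Hu, Zhuotao Tian, Shaoshuai Shi, Li Jiang

TL;DR: Edit360 is a tuning-free framework that extends 2D image edits to multi-view consistent 3D editing. It removes the viewing-angle restrictions of existing methods and reconstructs high-quality 3D content using video diffusion models and an anchor-view editing propagation mechanism.

Details

Motivation: Existing multi-view 3D editing methods are restricted to certain viewing angles and lack consistency, which limits their flexibility in practice.

Result: Edit360 produces multi-view consistent 3D edits that can be viewed and modified from arbitrary viewpoints.

Insight: Video diffusion models show promise for 3D content generation, and the anchor-view mechanism offers a new approach to multi-view editing.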

Abstract: Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.


cs.LG [Back]

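[132] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs cs.LG | cs.AI | cs.CL | cs.CV

Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu

TL;DR: Omni-DPO is a dual-perspective dynamic preference learning framework that improves DPO by accounting for both the data quality of each preference pair and the model's learning dynamics.

Details

Motivation: Existing DPO methods treat all preference pairs as equally important and ignore differences in their data quality and learning utility, which leads to under-used data and lower performance.

Result: On textual understanding and mathematical reasoning tasks, Omni-DPO outperforms the baseline methods. Gemma-2-9b-it fine-tuned with Omni-DPO beats Claude 3 Opus by 6.7 points on Arena-Hard.

Insight: Dynamically adjusting the weights of preference pairs uses the data more effectively and improves model performance.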

Abstract: Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model’s evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model’s learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
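
The idea of weighting preference pairs by both data quality and the model's current performance can be sketched on top of the standard DPO loss. This is an illustration only: the abstract does not give Omni-DPO's actual weighting functions, so the per-pair `quality` field and the focal-style factor `(1 - p) ** gamma` below are assumptions, and all names are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin(pair, beta):
    """Implicit reward margin of standard DPO: beta * (chosen log-ratio - rejected log-ratio)."""
    chosen = pair["logp_w"] - pair["ref_logp_w"]
    rejected = pair["logp_l"] - pair["ref_logp_l"]
    return beta * (chosen - rejected)


def weighted_dpo_loss(pairs, beta=0.1, gamma=2.0):
    """Mean DPO loss with two assumed per-pair weights (not the paper's exact form)."""
    total = 0.0
    for pair in pairs:
        margin = dpo_margin(pair, beta)
        p_correct = math.exp(log_sigmoid(margin))    # model's current preference probability
        quality_w = pair["quality"]                  # assumed data-quality weight in [0, 1]
        dynamics_w = (1.0 - p_correct) ** gamma      # assumed: down-weight pairs already fit well
        total += quality_w * dynamics_w * (-log_sigmoid(margin))
    return total / len(pairs)
```

Setting every `quality` to 1 and `gamma` to 0 recovers the plain, uniformly weighted DPO objective, which makes the effect of each weight easy to isolate.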


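[133] Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning cs.LG | cs.AI | cs.CL | stat.ML

Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang

TL;DR: This paper proposes a causal representation learning framework that controls for the base model as a confounder in order to identify the latent capability factors of language models and their causal structure, enabling better evaluation and understanding of model performance.

Details

Motivation: Existing language model evaluation methods face complex confounding effects and high computational costs, which makes it hard to uncover the underlying causal relationships among model capabilities.

Result: The study finds a concise three-node linear causal structure that reveals a causal path from general problem-solving ability to mathematical reasoning ability.

Insight: Differences between base models strongly affect evaluation results, and controlling for them helps uncover the true causal relationships among latent capabilities.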

Abstract: Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.


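[134] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series cs.LG | cs.AI | cs.CL

Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng

TL;DR: The paper introduces the Time-IMM dataset and the IMM-TSF benchmark library for irregular multimodal multivariate time series, closing a gap between current research and real-world data.

Details

Motivation: Real-world time series, such as those in healthcare, climate, and finance, are usually irregular, multimodal, and messy, whereas existing benchmarks mostly assume clean, regular, unimodal data.

Result: Experiments show that explicitly modeling multimodality yields substantial gains in forecasting performance on irregular time series.

Insight: Multimodal and asynchronous fusion strategies are key to better forecasting on irregular time series.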

Abstract: Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
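
One plausible reading of "recency-aware averaging" is a weighted mean in which older asynchronous observations count less. The sketch below assumes an exponential decay in the time gap and scalar values; neither the decay form nor the function name comes from the abstract, and IMM-TSF's actual fusion module may differ.

```python
import math


def recency_weighted_average(observations, query_time, decay=1.0):
    """Average (timestamp, value) pairs, weighting each by exp(-decay * age).

    Assumes every timestamp is at or before query_time and the list is non-empty.
    """
    weights = [math.exp(-decay * (query_time - t)) for t, _ in observations]
    weighted_sum = sum(w * v for w, (_, v) in zip(weights, observations))
    return weighted_sum / sum(weights)
```

Observations with equal timestamps reduce to a plain mean, while a large `decay` makes the result track the most recent observation almost exclusively.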


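[135] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering cs.LG | cs.CL

Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai

TL;DR: This paper presents a prompt optimization method based on DSPy's MIPROv2 optimizer to improve evidence retrieval and answer generation in question answering over clinical electronic health records (EHRs).

Details

Motivation: Question answering over clinical EHRs requires precise evidence retrieval and reliable answer generation, but existing methods perform poorly under limited supervision. The paper addresses this with prompt optimization.

Result: On the hidden test set the method reaches an overall score of 51.5 and ranks second, outperforming zero-shot and few-shot prompting by over 20 and 10 points, respectively.

Insight: Data-driven prompt optimization is an efficient alternative to model fine-tuning and can improve the reliability of high-stakes clinical question answering.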

Abstract: Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
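
A self-consistency voting scheme for sentence-level evidence can be sketched as follows. The abstract does not state the number of samples or the vote threshold, so the strict-majority default and the function name here are assumptions: each run is the set of sentence IDs that one sampled LLM call marked as evidence.

```python
from collections import Counter


def vote_evidence(runs, min_votes=None):
    """Keep sentence IDs selected by at least min_votes of the sampled runs.

    runs: list of iterables of sentence IDs, one per independent sample.
    min_votes: defaults to a strict majority of the runs (an assumed threshold).
    """
    if min_votes is None:
        min_votes = len(runs) // 2 + 1
    # Count each ID at most once per run so duplicates inside a run do not inflate votes.
    counts = Counter(sid for run in runs for sid in set(run))
    return sorted(sid for sid, votes in counts.items() if votes >= min_votes)
```

For example, `vote_evidence([[1, 2, 3], [2, 3], [3, 4]])` keeps sentences 2 and 3, which were chosen by at least two of the three runs. Lowering `min_votes` trades precision for recall.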


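[136] Robustly Improving LLM Fairness in Realistic Settings via Interpretability cs.LG | cs.AI | cs.CL

Adam Karvonen, Samuel Marks

TL;DR: This paper proposes an internal bias mitigation method that effectively reduces racial and gender bias when large language models (LLMs) are used in realistic settings.

Details

Motivation: Simple anti-bias prompts can remove demographic bias from LLMs in controlled settings, but they stop working once realistic context is introduced. The paper addresses this failure so that LLMs make fair decisions in high-stakes applications such as hiring.

Result: Across a range of commercial and open-source models, the method consistently reduces bias to very low levels (typically under 1%, always below 2.5%).

Insight: Realistic context markedly amplifies LLM bias, and internal bias mitigation is an effective solution that generalizes well to real deployment scenarios.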

Abstract: Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people’s careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., “only accept candidates in the top 10%”) induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model’s chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
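
The core of an affine edit along a concept direction can be sketched with plain lists standing in for activation vectors. The sketch assumes the direction is a difference of group means and that the edit pins the activation's component along that direction to a fixed target value; the abstract does not say how the paper extracts directions or chooses the target, so both are assumptions and all names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def concept_direction(acts_group_a, acts_group_b):
    """Unit vector along the difference of group means (assumed extraction method)."""
    diff = [x - y for x, y in zip(mean_vector(acts_group_a), mean_vector(acts_group_b))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def affine_edit(activation, direction, target_projection):
    """Set the component along `direction` to target_projection; leave the rest unchanged."""
    current = dot(activation, direction)
    return [a + (target_projection - current) * d for a, d in zip(activation, direction)]
```

After the edit, every activation has the same projection onto the direction, so a downstream layer can no longer read the attribute from that component. A natural choice for `target_projection`, again an assumption, is the midpoint of the two groups' mean projections.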


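[137] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models cs.LG | cs.AI | cs.CL

Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang

TL;DR: This paper proposes GUARD, a framework that uses data attribution to address the unintended forgetting that occurs when large language models (LLMs) unlearn specific data, substantially improving retention of important information.

Details

Motivation: Unlearning in LLMs is becoming critical as regulatory compliance, copyright protection, and privacy grow in importance. Existing methods, however, often degrade retention when forgetting high-impact data, so a more effective data-level solution is needed.

Result: On the TOFU benchmark, GUARD performs well across multiple LLM architectures. When forgetting 10% of the training data, it reduces the utility sacrifice on the retain set by up to 194.92% in terms of Truth Ratio, clearly outperforming existing methods.

Insight: Data-level attribution plays an important role in unlearning. Adaptive weight allocation balances forgetting against retention and offers a new approach to keeping LLMs both compliant and useful.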

Abstract: Unlearning in large language models (LLMs) is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the “alignment” between the forget and retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended losses in retention. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU benchmark across multiple LLM architectures demonstrate that GUARD substantially improves utility preservation while ensuring effective unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.
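
The phrase "unlearning weights inversely proportional to proxy attribution scores" can be made concrete with a small sketch. It assumes strictly positive scores and a simple normalization so the weights sum to one; the paper's actual attribution metric and normalization are not given in the abstract, and the function names are hypothetical.

```python
def unlearning_weights(attribution_scores, eps=1e-8):
    """Normalized weights proportional to 1 / score (assumes positive scores)."""
    inverse = [1.0 / (s + eps) for s in attribution_scores]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_forget_loss(per_sample_losses, attribution_scores):
    """Weighted sum of per-sample forget losses using the weights above."""
    weights = unlearning_weights(attribution_scores)
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))
```

Forget samples whose proxy score indicates strong alignment with the retain set receive less unlearning pressure, which is the reallocation the abstract describes.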


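[138] Build the web for agents, not agents for the web cs.LG | cs.CL

Xing Han Lù, Gaurav Kamath, Marius Mosbach, Siva Reddy

TL;DR: This position paper argues for a new paradigm in web agent research: develop an Agentic Web Interface (AWI) optimized for agent capabilities instead of making agents adapt to interfaces designed for humans.

Details

Motivation: Current web agent approaches face major challenges because of the mismatch between human-designed interfaces and LLM capabilities. To address this, the paper proposes interfaces designed specifically for agents.

Result: By introducing the AWI concept, the paper outlines a path to more efficient, reliable, and transparent web agent design.

Insight: Web agent research needs a collaborative effort to redesign interfaces so they match agent capabilities, instead of trying to fit agents to existing human interfaces.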

Abstract: Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents – AI systems capable of autonomously navigating and completing tasks within web environments. While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions. This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities. To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website. We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders. This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community.


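[139] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems cs.LG | cs.AI | cs.CV

Aayush Karan, Kulin Shah, Sitan Chen

TL;DR: ReGuidance is a simple wrapper around diffusion models that improves sample quality on hard inverse problems by inverting a candidate solution and using the result to re-initialize an existing method.

Details

Motivation: On hard inverse problems with a low signal-to-noise ratio, existing methods tend to drift off the data manifold and produce unrealistic outputs. ReGuidance aims to improve sample quality and reward consistency with a simple procedure.

Result: On tasks such as large box in-painting and super-resolution with high upscaling, ReGuidance substantially improves sample quality. Theory shows that on certain multimodal data distributions it simultaneously boosts the reward and moves the solution closer to the data manifold.

Insight: According to the authors, ReGuidance provides the first rigorous algorithmic guarantee for DPS, showing that a simple operation can substantially improve sample quality.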

Abstract: There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models. Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs. In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods. Given a candidate solution $\hat{x}$ produced by an algorithm of the user’s choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS. We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling. Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold. To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.
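
The two-stage procedure in the abstract (invert a candidate $\hat{x}$ with the unconditional probability flow ODE, then start a guided sampler from the resulting latent) can be traced on a one-dimensional toy problem. Everything specific below is an assumption made for illustration: the data distribution is a single Gaussian so the score is available in closed form, the noise schedule is $\sigma(t) = t$, the guided sampler is a deterministic DPS-style Euler scheme, and the measurement is the scalar $y = a x$. A unimodal toy shows only the mechanics, not the realism gains the paper reports.

```python
MU, S = 0.0, 1.0        # toy data distribution N(MU, S^2) (assumed)
T, STEPS = 10.0, 1000   # final noise level and number of Euler steps (assumed)


def score(x, t):
    """Exact score of p_t = N(MU, S^2 + t^2) for the assumed schedule sigma(t) = t."""
    return -(x - MU) / (S ** 2 + t ** 2)


def pf_ode_drift(x, t):
    """Probability flow ODE drift for sigma(t) = t: dx/dt = -t * score(x, t)."""
    return -t * score(x, t)


def tweedie(x, t):
    """Posterior-mean estimate of the clean sample at noise level t."""
    return x + t ** 2 * score(x, t)


def invert(x_hat):
    """Stage 1: integrate the unconditional ODE from data (t=0) to noise (t=T)."""
    h, x = T / STEPS, x_hat
    for k in range(STEPS):
        x += h * pf_ode_drift(x, k * h)
    return x


def guided_sample(latent, y, a=1.0, zeta=5.0):
    """Stage 2: DPS-style sampler from noise to data, guided by (y - a * tweedie(x))^2."""
    h, x = T / STEPS, latent
    for k in range(STEPS, 0, -1):
        t = k * h
        residual = y - a * tweedie(x, t)
        grad = -2.0 * a * residual * S ** 2 / (S ** 2 + t ** 2)  # d/dx of the squared residual
        x = x - h * pf_ode_drift(x, t) - zeta * h * grad
    return x


def reguidance(x_hat, y, a=1.0, zeta=5.0):
    """Wrapper: invert the candidate, then run the guided sampler from that latent."""
    return guided_sample(invert(x_hat), y, a, zeta)
```

With `zeta=0.0` the second stage simply undoes the inversion, returning the candidate up to Euler discretization error; with guidance switched on, the output moves toward agreement with the measurement `y`.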


cs.MA [Back]

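[140] AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation cs.MA | cs.CV

Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu

TL;DR: AniMaker is a multi-agent framework that generates globally consistent, story-coherent animation from text input, using clip generation driven by Monte Carlo Tree Search (MCTS) and storytelling-aware clip selection.

Details

Motivation: Current video generation methods struggle to produce coherent story videos that span multiple scenes and characters. Existing methods usually generate only fixed-length clips, which leads to disjointed narratives and pacing problems, and the models themselves are unstable.

Result: Experiments show that AniMaker performs well on metrics including VBench and AniEval, and significantly improves the efficiency of multi-candidate clip generation.

Insight: Through division of labor among agents and MCTS-based optimization, AniMaker highlights the importance of global consistency and story coherence in text-to-video generation.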

Abstract: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation’s logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker’s approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.


physics.med-ph [Back]

[141] Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation physics.med-ph | cs.CVPDF

Nicholas Summerfield, Qisheng He, Alex Kuo, Ahmed I. Ghanem, Simeng Zhu

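TL;DR: The paper proposes MAGIC, a multi-modality cardiac substructure segmentation method that segments data from several modalities with a single nnU-Net-based model, outperforming the comparison models in 57% of cases across three modalities.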

Details

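Motivation: Cardiac substructure segmentation is essential for radiation therapy planning, but existing deep learning methods generalize poorly across modalities and overlapping structures.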

Result: Average DSC scores were 0.75, 0.68, and 0.80 on Sim-CT, MR-Linac, and CCTA, respectively; MAGIC outperformed the comparison models in 57% of cases.

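Insight: MAGIC simplifies computational requirements and increases flexibility for clinical use, offering an efficient solution for multi-modality cardiac substructure segmentation.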

Abstract: Cardiac substructures are essential in thoracic radiation therapy planning to minimize risk of radiation-induced heart disease. Deep learning (DL) offers efficient methods to reduce contouring burden but lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and multi-modal cardiac substructure segmentation. MAGIC is implemented through replicated encoding and decoding branches of an nnU-Net-based, U-shaped backbone conserving the function of a single model. Twenty cardiac substructures (heart, chambers, great vessels (GVs), valves, coronary arteries (CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac, and cardiac CT angiography (CCTA) modalities were manually delineated and used to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison models (four segmentation subgroups across three modalities) were equivalently trained. All methods were compared for training efficiency and against reference contours using the Dice Similarity Coefficient (DSC) and two-tailed Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were 0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC outperforms the comparison in 57% of cases, with limited statistical differences. MAGIC offers an effective and accurate segmentation solution that is lightweight and capable of segmenting multiple modalities and overlapping structures in a single model. MAGIC further enables clinical implementation by simplifying the computational requirements and offering unparalleled flexibility for clinical settings.
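The Dice Similarity Coefficient used above to compare predicted and reference contours can be illustrated with a minimal sketch. The function name `dice_similarity` and the convention of returning 1.0 when both masks are empty are assumptions made here for illustration; the abstract does not describe the authors' evaluation code.

```python
import numpy as np

def dice_similarity(pred, ref):
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2 * |pred AND ref| / (|pred| + |ref|)
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    intersection = np.logical_and(pred, ref).sum()
    total = pred.sum() + ref.sum()
    if total == 0:
        # Both masks are empty. Returning 1.0 is a convention assumed
        # here, not something stated in the abstract.
        return 1.0
    return 2.0 * intersection / total

# Toy 2x3 masks: 3 predicted voxels, 3 reference voxels, 2 shared.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
ref = np.array([[1, 0, 0],
                [0, 1, 1]])
print(dice_similarity(pred, ref))  # 2*2 / (3+3) = 2/3
```

For a multi-structure model such as the one described, a score like this would typically be computed per substructure and then averaged, which is presumably how the per-modality averages are obtained.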


eess.IV [Back]

[142] Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective eess.IV | cs.CVPDF

Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai

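TL;DR: The paper rethinks brain tumor segmentation from a frequency-domain perspective. Its HFF-Net combines frequency-domain decomposition with adaptive Laplacian convolution and markedly improves segmentation of contrast-enhancing regions.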

Details

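Motivation: Current methods degrade when segmenting enhancing brain tumor regions, largely because they do not adequately account for the complex textures and directional variations of MRI images. A more comprehensive way of capturing these features is needed.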

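Result: On four public datasets, HFF-Net achieves an average relative improvement of 4.48% in mean Dice score across the three major subregions and an average relative improvement of 7.33% in contrast-enhancing region segmentation.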

Insight: A frequency-domain perspective characterizes tumor regions more comprehensively, with a clear advantage on complex textures and boundaries.

Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. Current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours, and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39% to 7.72%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: https://github.com/VinyehShaw/HFF.
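The idea of splitting an image into low-frequency and high-frequency components can be sketched with a plain FFT low-pass mask. This is only an illustration of the general concept: the abstract does not say which transform the FDD module uses, and the function `split_frequencies` and its `radius` cutoff are hypothetical names introduced here.

```python
import numpy as np

def split_frequencies(image, radius):
    """Split a 2D image into low- and high-frequency parts.

    A circular mask of the given radius (in frequency bins) around the
    zero-frequency bin keeps the low frequencies; the high-frequency
    part is the residual, so low + high reproduces the input.
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    # Move the zero-frequency bin to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = image - low
    return low, high

# Toy input: a smooth ramp plus a fine checkerboard texture.
ramp = np.linspace(0.0, 1.0, 16)[None, :] * np.ones((16, 1))
checker = 0.1 * (np.indices((16, 16)).sum(axis=0) % 2)
low, high = split_frequencies(ramp + checker, radius=3)
print(low.shape, high.shape)
```

In this toy input the checkerboard sits at the highest representable frequency, so it ends up in `high`, while the mean intensity and the coarse ramp stay in `low`.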


[143] Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation eess.IV | cs.CVPDF

Emerson P. Grabke, Masoom A. Haider, Babak Taati

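TL;DR: The paper proposes CCELLA, a dual conditioning approach that combines large language model text features with pathology classification to address data scarcity in medical LDM training, markedly improving synthetic image quality and classifier performance.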

Details

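Motivation: Scarce medical imaging data limits machine learning development. Existing LDM training approaches rely on short-prompt text encoders or non-medical pretrained models, or require fine-tuning on large data volumes, which limits performance and scientific accessibility.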

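Result: On a size-limited prostate MRI dataset, the method reaches a 3D FID of 0.025, significantly better than a recent foundation model (FID 0.071). Adding synthetic images raises prostate cancer classifier accuracy from 69% to 74%, and a classifier trained only on synthetic images performs comparably to one trained on real images.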

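Insight: Dual conditioning on text and pathology makes it possible to train an LDM efficiently when data is scarce, producing high-quality medical images that benefit downstream tasks.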

Abstract: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM training typically relies on performance- or scientific accessibility-limiting strategies including a reliance on short-prompt text encoders, the reuse of non-medical LDMs, or a requirement for fine-tuning with large data volumes. We propose a Class-Conditioned Efficient Large Language model Adapter (CCELLA) to address these limitations. CCELLA is a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with non-medical large language model-encoded text features through cross-attention and with pathology classification through the timestep embedding. We also propose a joint loss function and a data-efficient LDM training framework. In combination, these strategies enable pathology-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility. Our method achieves a 3D FID score of 0.025 on a size-limited prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.071. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method to the training dataset improves classifier accuracy from 69% to 74%. Training a classifier solely on our method’s synthetic images achieved comparable performance to training on real images alone.
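Conditioning "through the timestep embedding" can be pictured with a small sketch. It assumes the pattern common in class-conditional diffusion models, where a class embedding is added to the sinusoidal timestep embedding; the abstract does not confirm that CCELLA does exactly this. The names `EMBED_DIM`, `NUM_CLASSES`, and `class_table` are hypothetical, and the random table stands in for learned weights.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep (dim even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

EMBED_DIM = 8     # hypothetical embedding width
NUM_CLASSES = 2   # hypothetical: e.g. two pathology classes
rng = np.random.default_rng(0)
# Stand-in for a learned class-embedding table (random here).
class_table = rng.normal(size=(NUM_CLASSES, EMBED_DIM))

def conditioned_embedding(t, label):
    """Timestep embedding shifted by the embedding of the class label."""
    return timestep_embedding(t, EMBED_DIM) + class_table[label]

emb = conditioned_embedding(t=10, label=1)
print(emb.shape)
```

In a setup like this, the text features would enter separately through cross-attention, so the two conditioning signals travel along different paths into the U-Net.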


[144] DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction eess.IV | cs.AI | cs.CVPDF

Yuliang Zhu, Jing Cheng, Qi Xie, Zhuo-Xu Cui, Qingyong Zhu

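TL;DR: The paper proposes DUN-SRE, a deep unrolling network that uses spatiotemporal rotation equivariance to exploit symmetry in dynamic MRI reconstruction, with strong results on cardiac cine MRI.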

Details

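Motivation: Dynamic MRI exhibits transformation symmetries in both space and time, but existing methods fail to model temporal symmetry, which limits reconstruction quality.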

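Result: The method achieves state-of-the-art performance on cardiac cine MRI datasets, is particularly good at preserving rotation-symmetric structures, and generalizes well.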

Insight: Explicitly modeling spatiotemporal symmetry constraints improves dynamic MRI reconstruction quality, especially under heavy undersampling.

Abstract: Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries, including spatial rotation symmetry within individual frames and temporal symmetry along the time dimension. Explicit incorporation of these symmetry priors in the reconstruction model can significantly improve image quality, especially under aggressive undersampling scenarios. Recently, equivariant convolutional neural networks (ECNNs) have shown great promise in exploiting spatial symmetry priors. However, existing ECNNs critically fail to model temporal symmetry, arguably the most universal and informative structural prior in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance through a (2+1)D equivariant convolutional architecture. In particular, it integrates both the data consistency and proximal mapping modules into a unified deep unrolling framework. This architecture ensures rigorous propagation of spatiotemporal rotation symmetry constraints throughout the reconstruction process, enabling more physically accurate modeling of cardiac motion dynamics in cine MRI. In addition, a high-fidelity group filter parameterization mechanism is developed to maintain representation precision while enforcing symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in preserving rotation-symmetric structures, offering strong generalization capability to a broad range of dynamic MRI reconstruction tasks.
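Rotation equivariance means that rotating the input and then applying the layer gives the same result as applying the layer and then rotating the output. The sketch below checks this for the simplest special case: a 2D kernel made invariant under 90-degree rotations. It is not the (2+1)D group-equivariant architecture or the filter parameterization from the paper, and `correlate2d_same` and `c4_symmetrize` are hypothetical helper names.

```python
import numpy as np

def correlate2d_same(image, kernel):
    """Zero-padded 'same' cross-correlation with a square, odd-sized kernel."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)  # constant zero padding on every side
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def c4_symmetrize(kernel):
    """Average a kernel over its four 90-degree rotations."""
    return sum(np.rot90(kernel, r) for r in range(4)) / 4.0

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = c4_symmetrize(rng.normal(size=(3, 3)))

# Equivariance check: rotate-then-filter versus filter-then-rotate.
rotate_first = correlate2d_same(np.rot90(image), kernel)
filter_first = np.rot90(correlate2d_same(image, kernel))
print(np.allclose(rotate_first, filter_first))
```

A generic, unsymmetrized kernel would in general fail this check, which is why equivariant networks constrain or parameterize their filters rather than learning them freely.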


[145] ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation eess.IV | cs.CVPDF

Xi Chen, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane

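TL;DR: The paper proposes ConStyX, a domain generalization method that augments both content and style to improve the generalization of medical image segmentation models.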

Details

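Motivation: Medical images usually come from multiple domains, and the resulting domain shift degrades segmentation performance. Existing domain randomization methods rely only on image style perturbation, which limits their efficiency, and they neglect the adverse effect of over-augmented images on training.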

Result: Experiments show that ConStyX achieves superior generalization performance across multiple domains.

Insight: Jointly augmenting content and style is key to improving domain generalization in medical image segmentation.

Abstract: Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations: 1) constrained efficiency of domain randomization due to their exclusive dependence on image style perturbation, and 2) neglect of the adverse effects of over-augmented images on model training. To address these issues, we propose a novel domain randomization-based DG method, called content style augmentation (ConStyX), for generalizable medical image segmentation. Specifically, ConStyX 1) augments the content and style of training data, allowing the augmented training data to better cover a wider range of data domains, and 2) leverages well-augmented features while mitigating the negative effects of over-augmented features during model training. Extensive experiments across multiple domains demonstrate that our ConStyX achieves superior generalization performance. The code is available at https://github.com/jwxsp1/ConStyX.


[146] Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches eess.IV | cs.AI | cs.CV | A.1; I.2.0; I.4.6PDF

Andrea Moglia, Matteo Leccardi, Matteo Cavicchioli, Alice Maccarini, Marco Marcon

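TL;DR: This survey covers generalist models for medical image segmentation (such as SAM and its variants), compares their performance, and examines how they differ from task-specific models.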

Details

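Motivation: Inspired by large language models, generalist models have emerged in computer vision, notably in medical image segmentation. The survey investigates their performance and the challenges they face.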

Result: Generalist models perform well in medical image segmentation, but challenges around regulatory compliance, privacy, and budget remain.

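Insight: Future directions include synthetic data, early fusion, agentic AI, and clinical translation, drawing on lessons from natural language processing.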

Abstract: Following the successful paradigm shift of large language models, leveraging pre-training on a massive corpus of data and fine-tuning on different downstream tasks, generalist models have made their foray into computer vision. The introduction of Segment Anything Model (SAM) set a milestone on segmentation of natural images, inspiring the design of a multitude of architectures for medical image segmentation. In this survey we offer a comprehensive and in-depth investigation on generalist models for medical image segmentation. We start with an introduction on the fundamental concepts underpinning their development. Then, we provide a taxonomy on the different variants of SAM in terms of zero-shot, few-shot, fine-tuning, adapters, on the recent SAM 2, on other innovative models trained on images alone, and others trained on both text and images. We thoroughly analyze their performances at the level of both primary research and best-in-literature, followed by a rigorous comparison with the state-of-the-art task-specific models. We emphasize the need to address challenges in terms of compliance with regulatory frameworks, privacy and security laws, budget, and trustworthy artificial intelligence (AI). Finally, we share our perspective on future directions concerning synthetic data, early fusion, lessons learnt from generalist models in natural language processing, agentic AI and physical AI, and clinical translation.


[147] Med-URWKV: Pure RWKV With ImageNet Pre-training For Medical Image Segmentation eess.IV | cs.CVPDF

Zhenhuan Zhou

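TL;DR: Med-URWKV is a pure RWKV-based medical image segmentation model and the first to reuse an ImageNet-pretrained VRWKV encoder; it matches or outperforms RWKV models trained from scratch.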

Details

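Motivation: Existing medical image segmentation methods (CNN, Transformer, or hybrid architectures) have limitations such as the restricted receptive fields of CNNs and the high computational complexity of Transformers. RWKV offers long-range modeling at linear complexity, but the potential of pretraining it for the medical domain had not been explored.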

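Result: On seven datasets, Med-URWKV achieves segmentation performance comparable or superior to RWKV models trained from scratch, validating the effectiveness of pretraining.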

Insight: Reusing an ImageNet-pretrained RWKV encoder can improve medical image segmentation performance and opens a direction for future work.

Abstract: Medical image segmentation is a fundamental and key technology in computer-aided diagnosis and treatment. Previous methods can be broadly classified into three categories: convolutional neural network (CNN) based, Transformer based, and hybrid architectures that combine both. However, each of them has its own limitations, such as restricted receptive fields in CNNs or the computational overhead caused by the quadratic complexity of Transformers. Recently, the Receptance Weighted Key Value (RWKV) model has emerged as a promising alternative for various vision tasks, offering strong long-range modeling capabilities with linear computational complexity. Some studies have also adapted RWKV to medical image segmentation tasks, achieving competitive performance. However, most of these studies focus on modifications to the Vision-RWKV (VRWKV) mechanism and train models from scratch, without exploring the potential advantages of leveraging pre-trained VRWKV models for medical image segmentation tasks. In this paper, we propose Med-URWKV, a pure RWKV-based architecture built upon the U-Net framework, which incorporates ImageNet-based pretraining to further explore the potential of RWKV in medical image segmentation tasks. To the best of our knowledge, Med-URWKV is the first pure RWKV segmentation model in the medical field that can directly reuse a large-scale pre-trained VRWKV encoder. Experimental results on seven datasets demonstrate that Med-URWKV achieves comparable or even superior segmentation performance compared to other carefully optimized RWKV models trained from scratch. This validates the effectiveness of using a pretrained VRWKV encoder in enhancing model performance. The codes will be released.


cs.IR [Back]

[148] Conversational Search: From Fundamentals to Frontiers in the LLM Era cs.IR | cs.CLPDF

Fengran Mo, Chuan Meng, Mohammad Aliannejadi, Jian-Yun Nie

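TL;DR: This tutorial introduces the fundamentals of conversational search and its emerging developments, especially those driven by large language models (LLMs), to help researchers and practitioners master the core techniques and advance next-generation conversational search systems.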

Details

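Motivation: LLMs have become markedly more capable at instruction following, content generation, and reasoning, which creates new opportunities and challenges for building intelligent conversational search systems and prompts researchers to revisit and advance the field.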

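Result: Participants will understand the fundamentals of conversational search and the LLM-driven frontier, and will be equipped to contribute to next-generation intelligent conversational search systems.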

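Insight: The capabilities of LLMs bring new energy to conversational search, but they also raise the challenge of using and fine-tuning models efficiently for specific needs.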

Abstract: Conversational search enables multi-turn interactions between users and systems to fulfill users’ complex information needs. During this interaction, the system should understand the users’ search intent within the conversational context and then return the relevant information through a flexible, dialogue-based interface. The recent powerful large language models (LLMs) with capacities of instruction following, content generation, and reasoning, attract significant attention and advancements, providing new opportunities and challenges for building up intelligent conversational search systems. This tutorial aims to introduce the connection between fundamentals and the emerging topics revolutionized by LLMs in the context of conversational search. It is designed for students, researchers, and practitioners from both academia and industry. Participants will gain a comprehensive understanding of both the core principles and cutting-edge developments driven by LLMs in conversational search, equipping them with the knowledge needed to contribute to the development of next-generation conversational search systems.