Table of Contents
- cs.CV [Total: 77]
- cs.CL [Total: 35]
- cs.MA [Total: 1]
- physics.med-ph [Total: 1]
- cs.IR [Total: 1]
- eess.IV [Total: 6]
- cs.CR [Total: 2]
- cs.LG [Total: 8]
- cs.GR [Total: 1]
- cs.MM [Total: 6]
- cs.SD [Total: 1]
- cs.AI [Total: 4]
- cs.RO [Total: 3]
- eess.SY [Total: 2]
cs.CV
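[1] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models cs.CV | cs.AI | cs.CL | cs.GR | cs.MM
Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj
TL;DR: This paper proposes a multimodal cinematic video synthesis method that combines text-to-image and audio generation models, using Stable Diffusion, GPT-2, and a hybrid audio pipeline to generate high-fidelity video.
Motivation: As generative AI advances, efficiently synthesizing cinematic video with narrative coherence and professional quality has become a research focus.
Result: Experiments show outstanding visual quality, narrative coherence, and efficiency, suited to creative, educational, and industrial settings.
Insight: Combining multimodal generative models (text, image, audio) offers a new approach to cinematic video synthesis, and optimizations such as CUDA memory management improve reliability.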
Abstract: Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
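The abstract mentions linear frame interpolation between generated scene images. A minimal sketch of that idea follows; it is not the authors' code. The function name `interpolate_frames` and the flat-list frame representation are assumptions made to keep the example dependency-free.

```python
def interpolate_frames(frame_a, frame_b, n_between):
    """Return n_between frames blended linearly between frame_a and frame_b.

    Frames are flat lists of pixel intensities of equal length (an assumption
    of this sketch; a real pipeline would use image arrays).
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    frames = []
    for k in range(1, n_between + 1):
        t = k / (n_between + 1)  # blend weight strictly between 0 and 1
        frames.append([(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)])
    return frames


# Two tiny 4-pixel "keyframes"; the single in-between frame is their midpoint.
middle = interpolate_frames([0, 0, 100, 200], [100, 50, 100, 0], 1)
print(middle)  # [[50.0, 25.0, 100.0, 100.0]]
```

Inserting more in-between frames raises the effective frame rate without generating additional images, at the cost of cross-fade artifacts when the two keyframes differ strongly.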
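[2] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning cs.CV
Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang
TL;DR: The paper proposes a mask-aware LoRA fine-tuning method that enables controllable, first-frame-guided video editing, addressing existing methods' reliance on large-scale pretraining and their limited editing flexibility.
Motivation: Current diffusion-based video editing methods rely on large-scale pretraining, and first-frame-guided editing lacks flexible control over subsequent frames.
Result: Experiments show the method outperforms existing techniques in video editing performance.
Insight: Combining masks with LoRA gives an efficient solution for controllable video editing, enabling flexible edits without changing the model architecture.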
Abstract: Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edits propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.
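[3] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding cs.CV
Bin Guo, John H. L. Hansen
TL;DR: Proposes DeepTraverse, a vision architecture inspired by depth-first search, which uses recursive exploration and dynamic calibration modules for adaptive, iterative feature refinement and performs strongly on image classification.
Motivation: Feature extraction in conventional vision models lacks explicit adaptive iteration and reasoning. The authors explore whether the logic of classical search algorithms can yield a more structured, interpretable feature extraction process.
Result: Across multiple image classification benchmarks, DeepTraverse matches or outperforms conventional models while using a similar or smaller number of parameters.
Insight: Building algorithmic priors such as depth-first search into vision model design can improve efficiency, performance, and interpretability.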
Abstract: Conventional vision backbones, despite their success, often construct features through a largely uniform cascade of operations, offering limited explicit pathways for adaptive, iterative refinement. This raises a compelling question: can principles from classical search algorithms instill a more algorithmic, structured, and logical processing flow within these networks, leading to representations built through more interpretable, perhaps reasoning-like decision processes? We introduce DeepTraverse, a novel vision architecture directly inspired by algorithmic search strategies, enabling it to learn features through a process of systematic elucidation and adaptive refinement distinct from conventional approaches. DeepTraverse operationalizes this via two key synergistic components: recursive exploration modules that methodically deepen feature analysis along promising representational paths with parameter sharing for efficiency, and adaptive calibration modules that dynamically adjust feature salience based on evolving global context. The resulting algorithmic interplay allows DeepTraverse to intelligently construct and refine feature patterns. Comprehensive evaluations across a diverse suite of image classification benchmarks show that DeepTraverse achieves highly competitive classification accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts. Our work demonstrates that integrating such algorithmic priors provides a principled and effective strategy for building more efficient, performant, and structured vision backbones.
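[4] Test-Time Adaptation for Generalizable Task Progress Estimation cs.CV | cs.AI | I.2.6; I.2.9; I.2.10
Christos Ziakas, Alessandra Russo
TL;DR: The paper proposes a test-time adaptation method that optimizes a self-supervised objective so that a progress estimation model can adapt online to the visual and temporal context of test trajectories.
Motivation: Existing task progress estimation methods perform poorly on out-of-distribution tasks and environments, so a method that adapts dynamically at test time is needed.
Result: The method performs well across out-of-distribution tasks, environments, and embodiments, outperforming an in-context learning approach based on autoregressive vision-language models.
Insight: Test-time adaptation is an effective way to improve generalization, especially for dynamic, varied real-world tasks.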
Abstract: We propose a test-time adaptation method that enables a progress estimation model to adapt online to the visual and temporal context of test trajectories by optimizing a learned self-supervised objective. To this end, we introduce a gradient-based meta-learning strategy to train the model on expert visual trajectories and their natural language task descriptions, such that test-time adaptation improves progress estimation relying on semantic content over temporal order. Our test-time adaptation method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art in-context learning approach using autoregressive vision-language models.
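[5] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models cs.CV
Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou
TL;DR: EfficientVLA is a training-free inference acceleration framework that jointly exploits several kinds of redundancy to substantially improve the efficiency and deployability of Vision-Language-Action models.
Motivation: VLA models, especially diffusion-based ones, hold great promise for embodied intelligence, but high compute and memory demands limit practical use. Existing methods tend to target isolated inefficiencies and lack holistic optimization.
Result: On CogACT, it achieves a 1.93x speedup and reduces FLOPs to 28.9% of the original, with only a 0.6% drop in task success rate.
Insight: Holistic redundancy optimization is more effective than piecemeal optimization, and it can substantially improve efficiency without additional training.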
Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to a standard VLA model, CogACT, yielding a 1.93x inference speedup and reducing FLOPs to 28.9%, with only a 0.6% success rate drop on the SIMPLER benchmark.
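The two headline numbers measure different things, and a short calculation makes the gap visible. The sketch below only rearranges the figures reported in the abstract; the helper names are hypothetical.

```python
speedup = 1.93          # reported inference speedup on CogACT
flops_fraction = 0.289  # reported FLOPs remaining, as a fraction of the baseline


def latency_fraction(speedup_factor):
    """Fraction of the baseline latency that remains after a given speedup."""
    return 1.0 / speedup_factor


def flops_reduction(fraction_remaining):
    """Fraction of the baseline FLOPs that was removed."""
    return 1.0 - fraction_remaining


latency = latency_fraction(speedup)      # about 0.518 of the baseline latency
saved = flops_reduction(flops_fraction)  # about 0.711 of the baseline FLOPs removed
print(round(latency, 3), round(saved, 3))
```

FLOPs fall by a factor of about 3.46 (1 / 0.289) while latency falls by a factor of 1.93, which is a reminder that wall-clock time does not generally scale in proportion to FLOPs.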
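[6] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild cs.CV | cs.ET
Klim Kireev, Ana-Maria Creţu, Raphael Meier, Sarah Adel Bargal, Elissa Redmiles
TL;DR: The paper introduces ICCWD, a multimodal dataset for detecting minors in images, filling a gap in existing research, and uses benchmarks to show the limitations of current methods.
Motivation: Digital content depicting minors is subject to regulation, yet no dataset exists for detecting minors in a multimodal setting, so the authors release ICCWD to support this research.
Result: Experiments show that detecting minors is challenging; the best method achieves a 75.3% true positive rate.
Insight: Minor detection in multimodal settings still needs improvement, and a public dataset should help drive better methods.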
Abstract: Platforms and the law regulate digital content depicting minors (defined as individuals under 18 years of age) differently from other types of content. Given the sheer amount of content that needs to be assessed, machine learning-based automation tools are commonly used to detect content depicting minors. To our knowledge, no dataset or benchmark currently exists for detecting these identification methods in a multi-modal environment. To fill this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an image-caption dataset aimed at benchmarking tools that detect depictions of minors. Our dataset is richer than previous child image datasets, containing images of children in a variety of contexts, including fictional depictions and partially visible bodies. ICCWD contains 10,000 image-caption pairs manually labeled to indicate the presence or absence of a child in the image. To demonstrate the possible utility of our dataset, we use it to benchmark three different detectors, including a commercial age estimation system applied to images. Our results suggest that child detection is a challenging task, with the best method achieving a 75.3% true positive rate. We hope the release of our dataset will aid in the design of better minor detection methods in a wide range of scenarios.
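The headline metric here is the true positive rate: the share of images that actually contain a child which a detector flags. A minimal sketch of the computation on binary labels follows; the function name and the toy labels are illustrative assumptions, not data from ICCWD.

```python
def true_positive_rate(labels, predictions):
    """TPR = true positives / all actual positives, for 0/1 labels."""
    positives = sum(1 for y in labels if y)
    if positives == 0:
        raise ValueError("TPR is undefined without positive examples")
    true_positives = sum(1 for y, p in zip(labels, predictions) if y and p)
    return true_positives / positives


# Toy data: 4 images contain a child, and the detector flags 3 of them.
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 1, 0]
tpr = true_positive_rate(labels, predictions)
print(tpr)  # 0.75
```

The false positive on the fifth image does not affect the TPR, so the metric should be read alongside a false positive rate or precision.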
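[7] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers cs.CV | cs.AI | cs.LG
Natanael Lucena, Fábio S. da Silva, Ricardo Rios
TL;DR: The paper compares convolutional neural networks (CNNs) and Vision Transformers (ViTs) on multi-class classification of images of psoriasis and similar diseases, finding that ViTs perform better at smaller model sizes, with DaViT-B reaching an F1-score of 96.4%.
Motivation: To explore the potential of Vision Transformers for medical image classification tasks such as psoriasis detection, and to compare them with conventional CNN approaches.
Result: ViTs perform better at smaller model sizes; DaViT-B achieves an F1-score of 96.4% and is recommended as the best architecture for automated psoriasis detection.
Insight: ViTs show strong potential for medical image classification, maintaining high performance even at small model sizes.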
Abstract: This paper presents a comparison of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying images containing lesions of psoriasis and diseases similar to it. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an f1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.
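[8] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs cs.CV | cs.LG
Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou
TL;DR: The paper introduces ViCrit, a task in which a model localizes subtle visual hallucinations, to strengthen the perception of vision-language models (VLMs), and shows substantial gains across a range of vision benchmarks.
Motivation: Vision tasks are often hard to verify unambiguously, which limits extending reinforcement learning (RL) to VLMs. The paper aims to design a vision task that is both challenging and verifiable.
Result: Models trained on the ViCrit task improve substantially on a range of vision benchmarks, and the gains generalize to abstract image reasoning and visual math.
Insight: Fine-grained hallucination criticism improves visual perception itself rather than relying on memorization of seen objects, showing potential for visual understanding.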
Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
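The binary, exact-match reward described in the abstract can be sketched in a few lines. This is an illustration rather than the authors' implementation: the function name is hypothetical, and the case and whitespace normalization is an assumption that the abstract does not specify.

```python
def exact_match_reward(predicted_span, corrupted_span):
    """Return 1.0 if the predicted span equals the injected error span, else 0.0.

    Lowercasing and whitespace collapsing are assumptions of this sketch.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    return 1.0 if normalize(predicted_span) == normalize(corrupted_span) else 0.0


# The caption was corrupted by replacing "two dogs" with "three dogs".
reward_hit = exact_match_reward("Three  dogs", "three dogs")
reward_miss = exact_match_reward("red car", "three dogs")
print(reward_hit, reward_miss)  # 1.0 0.0
```

Because the reward needs only a string comparison against the known injected span, it can be computed without a learned judge, which is what makes the task verifiable.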
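[9] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context cs.CV | physics.ao-ph
Yael Frischholz, Devis Tuia, Michael Lehning
TL;DR: The paper proposes an attention-based method for surface solar radiation (SSR) retrieval that implicitly learns to infer clear-sky surface reflectance from satellite image sequences, with no hand-crafted features.
Motivation: Conventional SSR retrieval methods perform poorly in mountainous regions because of dynamic snow cover, and need improvement.
Result: Given a sufficiently long temporal context, the model matches albedo-informed methods, with the strongest effect in mountainous regions.
Insight: A long temporal context captures surface reflectance dynamics effectively and improves generalization.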
Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont’s SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided a sufficiently long temporal context, the model matches the performances of albedo-informed models, highlighting the model’s ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at https://github.com/frischwood/HeMu-dev.git
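[10] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling cs.CV
Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis
TL;DR: The paper revisits attentive probing for masked image modeling (MIM) and proposes efficient probing (EP), which uses a multi-query cross-attention mechanism to improve both efficiency and performance.
Motivation: Because information is distributed across patch tokens, standard linear probing (LP) cannot fully reflect the potential of MIM models, so a more efficient attentive probing method is needed.
Result: EP outperforms LP and prior attentive probing methods on seven benchmarks and performs well in low-shot and layer-wise settings.
Insight: An efficient attention mechanism not only improves performance but also yields interpretable attention maps, and it applies to a range of pre-training paradigms.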
Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
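To make the idea of multi-query cross-attention pooling concrete, here is a dependency-free sketch. It is not the EP implementation from the linked repository: the function names are hypothetical, and the choice to let each query pool its own slice of channels with no key or value projections is an assumption used to illustrate "eliminating redundant projections".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_pool(patches, queries):
    """Pool N patch tokens (each D floats) with M learnable queries (each D floats).

    Query m attends over all patches and aggregates channel slice m, so the
    output has D values. M must divide D (an assumption of this sketch).
    """
    dim = len(patches[0])
    n_queries = len(queries)
    if dim % n_queries != 0:
        raise ValueError("number of queries must divide the feature dimension")
    chunk = dim // n_queries
    pooled = []
    for m, query in enumerate(queries):
        # Scaled dot-product scores between this query and every patch token.
        scores = [sum(q * x for q, x in zip(query, patch)) / math.sqrt(dim)
                  for patch in patches]
        weights = softmax(scores)
        # Weighted average of this query's channel slice across patches.
        for d in range(m * chunk, (m + 1) * chunk):
            pooled.append(sum(w * patch[d] for w, patch in zip(weights, patches)))
    return pooled


patches = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
queries = [[0.0] * 4, [0.0] * 4]  # zero queries give uniform attention
pooled = multi_query_pool(patches, queries)
print(pooled)  # [2.0, 3.0, 4.0, 5.0], the plain average of the patches
```

With trained, non-zero queries, each slice of the pooled feature can focus on different patches; the pooled vector would then feed a linear classifier.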
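[11] Improving Personalized Search with Regularized Low-Rank Parameter Updates cs.CV
Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M. Rehg
TL;DR: The paper improves personalized vision-language retrieval through regularized low-rank parameter updates, substantially improving performance on few-shot learning tasks.
Motivation: Personalized vision-language retrieval must learn a new concept (e.g. "my dog Fido") from a few examples and combine it with general knowledge to recognize the concept in different contexts, which is very challenging.
Result: On the DeepFashion2 and ConCon-Chi benchmarks, personalized retrieval accuracy improves by 4%-22% over prior methods.
Insight: Regularized low-rank adaptation is an effective way to learn personal concepts, and the preservation of general knowledge can be evaluated using VLM-generated captions.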
Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g. “my dog Fido”) from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaption of a small set of parameters in the language encoder’s final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.
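The abstract reports that parameters learned for several personal concepts can be combined by addition. The sketch below shows what that looks like for low-rank updates of a single weight matrix. It is an illustration under stated assumptions: all names are hypothetical, the matrices are toy values, and the regularization used during training is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def low_rank_delta(b_factor, a_factor, scale=1.0):
    """Low-rank update scale * (B @ A), with B of shape (out, r) and A of shape (r, in)."""
    return [[scale * value for value in row] for row in matmul(b_factor, a_factor)]


def add_matrices(x, y):
    """Elementwise sum of two equally shaped matrices."""
    return [[xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def personalize(base_weight, deltas):
    """Add every concept's low-rank update to the frozen base weight."""
    weight = base_weight
    for delta in deltas:
        weight = add_matrices(weight, delta)
    return weight


base = [[1.0, 0.0], [0.0, 1.0]]                         # frozen pretrained weight
concept_1 = low_rank_delta([[1.0], [0.0]], [[0.0, 2.0]])  # rank-1 update for concept 1
concept_2 = low_rank_delta([[0.0], [1.0]], [[3.0, 0.0]])  # rank-1 update for concept 2
combined = personalize(base, [concept_1, concept_2])
print(combined)  # [[1.0, 2.0], [3.0, 1.0]]
```

Because each update is stored as two thin factors, adding a new concept costs far fewer parameters than storing a full copy of the layer.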
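[12] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators cs.CV | cs.AI | cs.LG
Parsa Rahimi, Sebastien Marcel
TL;DR: ScoreMix is a data augmentation method based on score composition in diffusion models: it mixes the scores of diffusion trajectories conditioned on different classes to generate challenging synthetic samples, substantially improving discriminative model performance.
Motivation: When labeled data is limited, how to use generative models to strengthen discriminative models is a key question. ScoreMix addresses it by exploiting the score-composition property of diffusion models to generate high-quality synthetic samples.
Result: ScoreMix substantially improves discriminative model performance on multiple benchmarks without elaborate hyperparameter search.
Insight: The generator's condition space and the discriminator's embedding space are only weakly correlated, and mixing distant classes improves performance more than mixing nearby ones, suggesting a new direction for data augmentation.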
Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator’s embedding space, rather than close in the generator’s condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator’s learned condition space and the discriminator’s embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: https://parsa-ra.github.io/scoremix
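The core operation in the abstract, convexly mixing the scores of two class-conditioned trajectories at a sampling step, can be sketched as below. The function name and the flat-list score vectors are assumptions for illustration; a real sampler would apply this to the network's score (or noise) predictions at every denoising step.

```python
def mix_scores(score_a, score_b, lam):
    """Convex combination lam * score_a + (1 - lam) * score_b, elementwise."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex mixture")
    if len(score_a) != len(score_b):
        raise ValueError("score vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(score_a, score_b)]


# Toy score vectors predicted under two different class conditions.
mixed = mix_scores([1.0, -2.0], [3.0, 2.0], 0.25)
print(mixed)  # [2.5, 1.0]
```

Setting `lam` to 0 or 1 recovers ordinary single-class sampling, so the mixing weight controls how far the synthetic sample sits between the two classes.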
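[13] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops cs.CV
Hamid Kamangir, Mona Hajiesmaeeli, Mason Earles
TL;DR: The paper presents a deep learning model that combines multi-source data (satellite imagery, climate, evapotranspiration, and soil data) for county-level yield forecasting of over 70 crops in California, reaching an overall R² of 0.76.
Motivation: California is a global leader in agricultural production, but complex environmental, climatic, and soil factors make yield forecasting challenging. Existing data have not fully exploited multi-source information for accurate prediction.
Result: The model achieves an overall R² of 0.76 on unseen test data.
Insight: Integrating multi-source data and modeling spatiotemporal dynamics is essential for crop yield forecasting, and provides a new tool for climate adaptation and precision farming.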
Abstract: California is a global leader in agricultural production, contributing 12.5% of total United States output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a time-series encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall R² score of 0.76 across all crops on an unseen test dataset, highlighting strong predictive performance across California's diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.
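For readers unfamiliar with the reported metric, the coefficient of determination compares the model's squared error with that of always predicting the mean. A minimal sketch with toy numbers follows; the values are illustrative and are not taken from the benchmark.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Toy yields (arbitrary units) and predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
score = r2_score(observed, predicted)
print(round(score, 3))  # 0.9
```

An R² of 0.76 therefore means the model's squared error is 24% of the squared error of a constant mean prediction on the test set.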
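[14] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos cs.CV
Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli
TL;DR: DySS is an efficient method for 3D object detection from multi-camera videos based on dynamic queries and state-space learning; using a state-space model (SSM) and dynamic query update operations, it achieves strong detection performance and real-time inference.
Motivation: Earlier methods rely on dense BEV features, which are computationally expensive; sparse query-based methods improve on this but still need many queries and become inefficient on multi-frame video. DySS aims to improve both detection efficiency and performance through dynamic queries and state-space learning.
Result: On the nuScenes test split, DySS reaches 65.31 NDS and 57.4 mAP; on the val split, it reaches 56.2 NDS and 46.2 mAP at 33 FPS, outperforming existing methods.
Insight: By combining temporal modeling with dynamic query optimization, DySS makes 3D object detection markedly more efficient while keeping performance high, offering a new approach to real-time perception for autonomous driving.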
Abstract: Camera-based 3D object detection in Bird’s Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
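[15] HalLoc: Token-level Localization of Hallucinations for Vision Language Models cs.CV
Eunkyu Park, Minyeong Kim, Gunhee Kim
TL;DR: The paper introduces the HalLoc dataset and a baseline model for efficient, probabilistic hallucination detection, with the goal of making vision-language models more reliable in critical applications.
Motivation: Hallucinations in large vision-language models seriously undermine reliability, while existing detection methods are computationally expensive and cannot handle ambiguous cases.
Result: The HalLoc dataset and baseline model are publicly available, providing a practical tool for improving VLM reliability.
Insight: The work shifts hallucination detection from binary decisions toward probabilistic scoring, improving model transparency and usefulness.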
Abstract: Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.
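[16] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation cs.CV | cs.AI
Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya
TL;DR: Through a comprehensive evaluation on the HAM10000 dataset, this paper studies deep learning-based skin cancer classification with transfer learning and uncertainty quantification (UQ), showing the superior performance of CLIP-based vision transformers and the balance that ensemble methods strike between accuracy and uncertainty handling.
Motivation: Automated skin cancer classification is critical for early treatment and better patient outcomes, but existing deep learning methods are limited by data scarcity and a lack of uncertainty awareness. The paper aims to improve performance and trustworthiness through transfer learning and UQ.
Result: CLIP-based vision transformers (such as LAION CLIP ViT-H/14) combined with an SVM perform best; ensemble methods balance accuracy and uncertainty handling, and EMCD is more sensitive to uncertain predictions.
Insight: In medical diagnosis, integrating uncertainty quantification can make deep learning more trustworthy and more useful in practice, giving clinical decisions a more reliable basis.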
Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, but their performance can be limited by data scarcity and a lack of uncertainty awareness. In this study, we present a comprehensive evaluation of DL-based skin lesion classification using transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the first phase, we benchmarked several pre-trained feature extractors, including Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50 (ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range of traditional classifiers such as Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and logistic regression. Our results show that CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, deliver the highest classification performance. In the second phase, we incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) to assess not only prediction accuracy but also the reliability of model outputs. We evaluated these models using uncertainty-aware metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen), uncertainty specificity (USpe), and uncertainty precision (UPre). The results demonstrate that ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions. This study highlights the importance of integrating UQ into DL-based medical diagnosis to enhance both performance and trustworthiness in real-world clinical applications.
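Monte Carlo Dropout and ensembles both produce several probability vectors per image, which are then summarized into a prediction and an uncertainty value. A common summary, sketched below, averages the vectors and takes the entropy of the mean. Whether the paper uses exactly this measure is an assumption; the function names and toy probabilities are illustrative.

```python
import math


def mean_probabilities(passes):
    """Average class probabilities over stochastic passes or ensemble members."""
    n_passes = len(passes)
    n_classes = len(passes[0])
    return [sum(p[c] for p in passes) / n_passes for c in range(n_classes)]


def predictive_entropy(probs):
    """Entropy (in nats) of a probability vector; zero-probability terms are skipped."""
    return sum(-p * math.log(p) for p in probs if p > 0)


# Passes that disagree give high entropy; passes that agree give low entropy.
disagreeing = mean_probabilities([[0.9, 0.1], [0.1, 0.9]])
agreeing = mean_probabilities([[1.0, 0.0], [1.0, 0.0]])
high = predictive_entropy(disagreeing)  # ln(2), about 0.693
low = predictive_entropy(agreeing)      # 0.0
print(round(high, 3), low)
```

Thresholding this entropy is one way to split predictions into "certain" and "uncertain" groups, which is the kind of split that uncertainty-aware metrics such as UAcc are computed over.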
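[17] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework cs.CV | cs.AI | cs.LG
Sadia Kamal, Tim Oates, Joy Wan
TL;DR: The paper proposes a weakly supervised multimodal framework that automatically generates clinically structured SOAP notes from limited inputs such as lesion images and sparse clinical text, aiming to reduce clinician burden and reliance on large annotated datasets.
Motivation: Skin cancer is among the most common cancers worldwide, and manually writing SOAP notes is time-consuming and contributes to clinician burnout, so an automated solution is needed.
Result: On clinical relevance metrics, performance is close to that of advanced models such as GPT-4o, Claude, and DeepSeek Janus Pro.
Insight: Combining weak supervision with multimodal inputs can reduce annotation requirements and make clinical documentation more efficient.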
Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics MedConceptEval and Clinical Coherence Score (CCS) which assess semantic alignment with expert medical concepts and input features, respectively.
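[18] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video cs.CV | eess.IV
Fei Zhao, Da Pan, Zelu Qi, Ping Shi
TL;DR: Addressing audio-visual quality assessment for user-generated omnidirectional video (UGC-ODV) in the Metaverse, the paper builds a dataset and proposes a baseline model that combines video feature extraction, audio feature extraction, and audio-visual fusion modules. Experiments show the model performs well.
Motivation: With the rise of the Metaverse, UGC-ODV is increasingly important, but audio-visual quality assessment for it remains little studied and needs supporting datasets and methods.
Result: The baseline model performs best on the proposed dataset, validating its effectiveness.
Insight: The study provides a dataset and model foundation for UGC-ODV audio-visual quality assessment, advancing Metaverse-related technology.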
Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset, consisting of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.
[19] Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions cs.CV | cs.AIPDF
Deliang Wang, Chao Yang, Gaowei Chen
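TL;DR: This study explores using vision-language models (VLMs) with zero-shot prompting to detect students’ academic emotions. Qwen2.5-VL-7B-Instruct performs relatively well at recognizing confused expressions, but the models detect distracted behavior poorly.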
Details
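Motivation: Students’ academic emotions strongly affect their learning performance and behavior, while traditional supervised learning methods generalize poorly and need large amounts of annotated data. Vision-language models offer a new way to address this.
Result: Qwen2.5-VL-7B-Instruct performs better at recognizing confused expressions, but neither model detects distracted behavior effectively. Detection accuracy for happy emotions is high.
Insight: Vision-language models show moderate performance in academic emotion recognition and are particularly suited to detecting student confusion, but their recognition of distracted behavior needs further improvement.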
Abstract: Students’ academic emotions significantly influence their social behavior and learning performance. Traditional approaches to automatically and accurately analyze these emotions have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students’ academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel in identifying students’ happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students’ confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.
[20] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting cs.CVPDF
Lintao Xiang, Hongpei Zheng, Yating Huang, Qijun Yang, Hujun Yin
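TL;DR: PointGS is a point attention-aware sparse view synthesis framework based on Gaussian splatting that achieves high-quality real-time rendering from sparse training views.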
Details
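Motivation: Existing 3D Gaussian splatting (3DGS) methods need many calibrated views to produce a complete scene representation; sparse inputs lead to overfitting and degraded rendering quality. PointGS aims to remove this limitation.
Result: Across diverse benchmarks, PointGS significantly outperforms NeRF-based methods and is competitive with state-of-the-art 3DGS methods in few-shot settings.
Insight: PointGS shows the potential of point-wise feature enhancement combined with Gaussian splatting for high-quality rendering from sparse views, offering a new direction for 3D reconstruction and rendering.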
Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training views, leading to noticeable degradation in rendering quality. To address this limitation, we propose a Point-wise Feature-Aware Gaussian Splatting framework that enables real-time, high-quality rendering from sparse training views. Specifically, we first employ the latest stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization. We then encode the colour attributes of each 3D Gaussian by sampling and aggregating multiscale 2D appearance features from sparse inputs. To enhance point-wise appearance representation, we design a point interaction network based on a self-attention mechanism, allowing each Gaussian point to interact with its nearest neighbors. These enriched features are subsequently decoded into Gaussian parameters through two lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive experiments on diverse benchmarks demonstrate that our method significantly outperforms NeRF-based approaches and achieves competitive performance under few-shot settings compared to the state-of-the-art 3DGS methods.
[21] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models cs.CV | cs.AIPDF
Jun Yin, Jing Zhong, Peilin Li, Pengyu Zeng, Miao Zhang
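TL;DR: This paper proposes UrbanSense, a framework built on vision large language models that uses a multimodal approach for automated, scalable analysis of urban streetscape styles, and demonstrates its effectiveness in quantifying differences in urban style.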
Details
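Motivation: Urban culture and architectural style differ markedly with geographical, historical, and socio-political factors. Traditional research relies on expert interpretation and is hard to standardize, so an objective, data-driven method is needed to quantify urban streetscape style.
Result: Over 80% of generated descriptions pass the t-test, and Phi scores from subjective evaluation are high (0.912 for cities, 0.833 for periods), indicating that the method captures subtle stylistic differences.
Insight: UrbanSense offers a scientific, quantitative tool for studying the evolution of urban style, provides data support for future design, and shows the potential of multimodal methods in urban research.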
Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method’s ability to capture subtle stylistic differences. These results highlight the method’s potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.
[22] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration cs.CVPDF
Mina C. Moghadam, Alan Q. Wang, Omer Taub, Martin R. Prince, Mert R. Sabuncu
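TL;DR: RealKeyMorph (RKM) is a resolution-agnostic medical image registration method. It trains a network to learn keypoints for an image pair and operates in real-world coordinates, avoiding the artifacts that resampling introduces in traditional methods.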
Details
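Motivation: In medical image registration, differences in image resolution (such as pixel spacing and slice thickness) lead traditional methods to resample, which introduces artifacts. RKM aims to remove this limitation by operating directly on the raw data.
Result: Experiments demonstrate the advantages of RKM on registration of orthogonal 2D stacks of abdominal MRIs and on 3D brain datasets with varying resolutions.
Insight: By operating in real-world coordinates, RKM avoids the problems that resampling causes in traditional registration and offers a more robust approach to medical image processing.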
Abstract: Many real-world settings require registration of a pair of medical images that differ in spatial resolution, which may arise from differences in image acquisition parameters like pixel spacing, slice thickness, and field-of-view. However, all previous machine learning-based registration techniques resample images onto a fixed resolution. This is suboptimal because resampling can introduce artifacts due to interpolation. To address this, we present RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is an extension of KeyMorph, a registration framework which works by training a network to learn corresponding keypoints for a given pair of images, after which a closed-form keypoint matching step is used to derive the transformation that aligns them. To avoid resampling and enable operating on the raw data, RKM outputs keypoints in real-world coordinates of the scanner. To do this, we leverage the affine matrix produced by the scanner (e.g., MRI machine) that encodes the mapping from voxel coordinates to real world coordinates. By transforming keypoints into real-world space and integrating this into the training process, RKM effectively enables the extracted keypoints to be resolution-agnostic. In our experiments, we demonstrate the advantages of RKM on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as 3D volumes with varying resolutions in brain datasets.
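The voxel-to-world mapping mentioned in this abstract can be illustrated with a minimal sketch. It assumes the scanner affine is a 4x4 homogeneous matrix; the function name `voxel_to_world` and both example matrices are hypothetical and are not taken from the RKM code. The point is only that two voxel indices on grids with different spacing can land on the same physical location once mapped to scanner coordinates.

```python
def voxel_to_world(affine, ijk):
    """Map a voxel index (i, j, k) to scanner coordinates using a 4x4 affine."""
    i, j, k = ijk
    homogeneous = (i, j, k, 1.0)
    # Only the first three rows yield x, y, z; the last row is [0, 0, 0, 1].
    return tuple(sum(a * v for a, v in zip(row, homogeneous)) for row in affine[:3])


# Hypothetical affines: same origin, different voxel spacing (in mm).
affine_fine = [
    [1.0, 0.0, 0.0, -90.0],
    [0.0, 1.0, 0.0, -120.0],
    [0.0, 0.0, 1.0, -70.0],
    [0.0, 0.0, 0.0, 1.0],
]
affine_coarse = [
    [2.0, 0.0, 0.0, -90.0],
    [0.0, 2.0, 0.0, -120.0],
    [0.0, 0.0, 5.0, -70.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Different voxel indices on the two grids, same point in scanner space.
p_fine = voxel_to_world(affine_fine, (40, 60, 20))
p_coarse = voxel_to_world(affine_coarse, (20, 30, 4))
print(p_fine, p_coarse)
```

Because both keypoints are expressed in millimetres rather than voxel indices, they can be compared directly without resampling either image.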
[23] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation cs.CVPDF
Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu
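TL;DR: Motion-R1 combines Chain-of-Thought reasoning with reinforcement learning. By decomposing complex textual instructions into logical action paths, it improves semantic understanding and consistency in text-to-motion generation.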
Details
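Motivation: Existing text-to-motion methods mostly rely on end-to-end mapping and fail to capture deep linguistic structure and logical reasoning, which limits the diversity, controllability, and consistency of generated motions.
Result: Motion-R1 achieves competitive or superior performance on multiple benchmark datasets, particularly in scenarios that require fine-grained semantic understanding and long-term temporal coherence.
Insight: Combining explicit logical decomposition with reinforcement learning can markedly improve semantic understanding and execution in text-to-motion generation, offering a new approach to executing complex instructions.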
Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model’s ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
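Group Relative Policy Optimization is usually described as scoring each sampled output against the other outputs drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step. The reward values are hypothetical, the abstract does not specify how motion quality feedback is turned into a reward, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    # eps avoids division by zero when all samples receive the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical motion-quality rewards for four sampled reasoning chains.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged.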
[24] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device cs.CVPDF
Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh
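TL;DR: FaceLiVT is a lightweight yet powerful face recognition model that combines a CNN-Transformer architecture with a novel Multi-Head Linear Attention mechanism, significantly reducing computational complexity while maintaining high accuracy.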
Details
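Motivation: Face recognition on mobile devices must be lightweight and efficient, but existing models struggle to balance computational complexity and latency. The authors combine the strengths of CNNs and Transformers to design a solution better suited to mobile devices.
Result: On benchmarks such as LFW and CFP-FP, FaceLiVT outperforms existing lightweight models, with inference 8.6× faster than EdgeFace and 21.2× faster than a pure ViT-based model.
Insight: Combining the local feature extraction of CNNs with the global modeling of Transformers, together with an optimized attention mechanism, can significantly improve the efficiency of face recognition on mobile devices.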
Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.
[25] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion cs.CVPDF
Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu
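TL;DR: FSATFusion is an infrared and visible image fusion network based on a frequency-spatial attention Transformer. By improving the Transformer module and attention mechanism, it significantly improves fusion performance.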
Details
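Motivation: In infrared and visible image fusion, existing deep learning methods lose information because convolution operations struggle to capture global context, which limits fusion performance.
Result: Experiments show that FSATFusion outperforms existing methods in fusion quality and efficiency, generalizes well, and performs well on downstream tasks.
Insight: A Transformer with a frequency-spatial attention mechanism can effectively address information loss in image fusion and improve global feature extraction.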
Abstract: Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications to verify the generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.
[26] Revisiting Transformers with Insights from Image Filtering cs.CV | cs.LGPDF
Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen
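TL;DR: This paper reinterprets the Transformer self-attention mechanism through an image processing framework, improving its interpretability and, through image processing-inspired modifications, the model’s performance and robustness.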
Details
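Motivation: The success of self-attention lacks a solid theoretical foundation, especially regarding the role of various architectural components. The paper aims to fill this gap with an image processing framework.
Result: The modified models show higher accuracy and robustness on language and vision tasks, along with better long-sequence understanding.
Insight: Combining image processing theory with self-attention can improve both performance and interpretability, suggesting new directions for future research.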
Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. Building upon our framework, we also pinpoint potential distinctions between the two concepts and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.
[27] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial cs.CVPDF
Jerry Yan, Chinmay Talegaonkar, Nicholas Antipa, Eric Terrill, Sophia Merrifield
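TL;DR: This paper presents PoseIDON, a computer vision pipeline that combines deep foundation model features with multiview photogrammetry to estimate, from ROV video, the 6DoF pose of seafloor objects and the orientation of the surrounding seafloor, and infers burial depth by aligning CAD models.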
Details
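Motivation: The burial state of anthropogenic objects on the seafloor matters for studying local sedimentation dynamics, ecological risk assessment, and pollutant transport. However, accurately estimating burial depth from remote imagery remains challenging because of partial occlusion, poor visibility, and object degradation.
Result: Validated on 54 objects, the method achieves a mean burial depth error of about 10 centimeters and reflects spatial patterns of sediment transport.
Insight: PoseIDON offers a scalable, non-invasive solution for mapping seafloor burial and supports environmental assessment at contaminated sites.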
Abstract: The burial state of anthropogenic objects on the seafloor provides insight into localized sedimentation dynamics and is also critical for assessing ecological risks, potential pollutant transport, and the viability of recovery or mitigation strategies for hazardous materials such as munitions. Accurate burial depth estimation from remote imagery remains difficult due to partial occlusion, poor visibility, and object degradation. This work introduces a computer vision pipeline, called PoseIDON, which combines deep foundation model features with multiview photogrammetry to estimate six degrees of freedom object pose and the orientation of the surrounding seafloor from ROV video. Burial depth is inferred by aligning CAD models of the objects with observed imagery and fitting a local planar approximation of the seafloor. The method is validated using footage of 54 objects, including barrels and munitions, recorded at a historic ocean dumpsite in the San Pedro Basin. The model achieves a mean burial depth error of approximately 10 centimeters and resolves spatial burial patterns that reflect underlying sediment transport processes. This approach enables scalable, non-invasive mapping of seafloor burial and supports environmental assessment at contaminated sites.
[28] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba cs.CVPDF
Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin
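TL;DR: DART is a differentiable dynamic adaptive region tokenizer. It replaces fixed-size patches with content-dependent patches of varying size, significantly improving ViT and Mamba models while reducing computational overhead.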
Details
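Motivation: Existing ViT and Mamba models rely on fixed-size image patches, which over-encode background regions and lose critical local details, especially when informative content is sparsely distributed.
Result: DART improves accuracy by 2.1% on DeiT and achieves a 45% FLOPs reduction relative to approaches that uniformly increase token density; its effectiveness is confirmed on several models.
Insight: Dynamically adjusting patch size is more efficient than uniformly increasing token density, improving performance while reducing computational overhead.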
Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
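The idea of allocating denser tokens to information-rich areas through quantiles of region scores can be sketched in one dimension. This is a hard, non-differentiable simplification for illustration only, not DART's piecewise differentiable operation; the function name `quantile_boundaries` and the score values are hypothetical. Span boundaries are placed where the cumulative score crosses equally spaced quantile levels, so high-score positions end up in narrower spans.

```python
def quantile_boundaries(scores, num_tokens):
    """Split positions 0..len(scores) into spans of roughly equal total score."""
    total = sum(scores)
    boundaries = [0]
    running = 0.0
    next_k = 1
    for idx, score in enumerate(scores):
        running += score
        # Close a span each time the cumulative score passes the next quantile level.
        while next_k < num_tokens and running >= total * next_k / num_tokens:
            if idx + 1 > boundaries[-1]:  # skip zero-width spans
                boundaries.append(idx + 1)
            next_k += 1
    if boundaries[-1] != len(scores):
        boundaries.append(len(scores))
    return boundaries


# Hypothetical per-column scores: low in the background, high in the middle.
scores = [1, 1, 1, 1, 6, 6, 6, 6, 1, 1, 1, 1]
boundaries = quantile_boundaries(scores, num_tokens=4)
widths = [b - a for a, b in zip(boundaries, boundaries[1:])]
print(boundaries, widths)
```

With uniform scores the same function falls back to equal-width spans, which corresponds to the fixed-size patching that the abstract contrasts against.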
[29] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion cs.CVPDF
Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang
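TL;DR: ReconMOST is a data-driven diffusion model framework for multi-layer sea temperature reconstruction. Pre-trained on historical simulation data and guided by observations, it addresses the sparse data and high computational cost that limit traditional methods.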
Details
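Motivation: Traditional sea temperature reconstruction is limited by data sparsity, algorithmic complexity, and high computational cost, while existing machine learning methods focus on the sea surface or local regions and struggle with issues such as cloud occlusion.
Result: In experiments on CMIP6 and EN4 data, the MSE is 0.049 on guidance, 0.680 on reconstruction, and 0.633 overall, indicating the accuracy and generalization capability of the method.
Insight: By combining a data-driven diffusion model with physically consistent patterns learned in pre-training, ReconMOST shows potential for complex ocean data reconstruction, especially when data are sparse or missing.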
Abstract: Accurate ocean reconstruction is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the growing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, and struggles with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based sea surface temperature (SST) reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) is 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at https://github.com/norsheep/ReconMOST.
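The three MSE figures can be related with a short arithmetic check. It rests on two assumptions that the abstract does not state: that the total MSE is a point-count weighted average of the guidance and reconstruction MSEs, and that guidance points make up the roughly 7.5% of data that is not missing. Under those assumptions the reported total is reproduced to three decimals.

```python
# Reported values from the abstract.
mse_guidance = 0.049
mse_reconstruction = 0.680
missing_fraction = 0.925  # "over 92.5% missing data"

# Assumption: observed (guidance) points are the complement of the missing data.
observed_fraction = 1.0 - missing_fraction

# Assumption: the total MSE is a weighted average over both sets of points.
expected_total = (
    observed_fraction * mse_guidance
    + (1.0 - observed_fraction) * mse_reconstruction
)
print(round(expected_total, 3))  # 0.633
```

This is only a consistency check on the reported numbers, not a description of how the paper computes its metrics.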
[30] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation cs.CV | cs.AIPDF
Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang
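TL;DR: Pisces is an auto-regressive multimodal foundation model that unifies image understanding and generation through a decoupled visual encoding architecture and tailored training techniques, and performs strongly on public benchmarks.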
Details
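Motivation: Current multimodal foundation models unify image understanding and generation, but often underperform models specialized for a single task. The main reasons are the differing visual features each task needs and their distinct training processes.
Result: Pisces performs strongly on over 20 public image understanding benchmarks and shows robust generative capability on the GenEval image generation benchmark.
Insight: The study reveals a synergistic relationship between image understanding and generation and shows that separate visual encoders benefit unified multimodal models.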
Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
[31] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment cs.CVPDF
Shuo wang, Jihao Zhang
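TL;DR: MF2Summ is a video summarization model based on multimodal fusion. It combines visual and auditory information and uses a cross-modal Transformer and an alignment-guided self-attention mechanism; on SumMe and TVSum it compares favorably with existing methods.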
Details
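Motivation: Existing video summarization methods usually rely on a single modality (such as vision) and struggle to capture the full semantic richness of videos. This paper therefore proposes multimodal fusion of visual and auditory information to improve summarization.
Result: On SumMe and TVSum, MF2Summ improves F1-score over DSNet by 1.9% and 0.6% respectively and compares favorably with other state-of-the-art methods.
Insight: Multimodal fusion can markedly improve video summarization; cross-modal attention and temporal alignment are key to modeling inter-modal dependencies; NMS and KTS effectively select key shots.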
Abstract: The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
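The Non-Maximum Suppression step named in this abstract can be sketched for temporal segments. This is a generic greedy NMS over (start, end) intervals, not MF2Summ's implementation; the segments, scores, and the 0.5 IoU threshold are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Keep the highest-scoring segments, dropping those that overlap a kept one."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # A candidate survives only if it does not overlap any kept segment too much.
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical predicted segments (in frames) with importance scores.
segments = [(0, 10), (2, 11), (20, 30), (21, 29)]
scores = [0.9, 0.8, 0.6, 0.7]
kept = temporal_nms(segments, scores)
print(kept)  # [0, 3]
```

Each pair of heavily overlapping candidates is reduced to its higher-scoring member, leaving one segment per region of the video.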
[32] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts cs.CV | cs.CL | cs.LG | cs.MMPDF
Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen
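TL;DR: The paper proposes CIDer, a novel multimodal emotion recognition framework that uses a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module to handle missing modalities and out-of-distribution (OOD) data, while using fewer parameters and training faster than existing methods.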
Details
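Motivation: In practice, multimodal emotion recognition (MER) often faces missing modalities and distribution shifts. Existing methods typically depend on specific models or introduce excessive parameters, which limits their practicality.
Result: Experiments show that CIDer performs robustly in both RMFM and OOD scenarios, with fewer parameters and faster training.
Insight: 1) Combining self-distillation with causal inference can effectively improve the robustness of multimodal tasks; 2) lightweight design has practical advantages in complex tasks.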
Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
[33] Rethinking Generative Human Video Coding with Implicit Motion Transformation cs.CV | eess.IVPDF
Bolin Chen, Ru-Ling Liao, Jie Chen, Yan Ye
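TL;DR: This paper improves Generative Human Video Coding (GHVC) with Implicit Motion Transformation (IMT), addressing the reconstruction distortion and inaccurate motion caused by conventional explicit motion guidance and achieving high-efficiency compression and high-fidelity synthesis.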
Details
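Motivation: Conventional generative video coding relies on explicit motion fields as intermediate supervision, but with complex and diverse human body motion this leads to poor reconstruction quality and inaccurate motion. The paper investigates how implicit motion transformation can improve GHVC performance.
Result: Experiments show that IMT improves the compression efficiency and reconstruction quality of GHVC.
Insight: Implicit motion transformation outperforms explicit motion fields, especially for human video with complex motion patterns, offering a new direction for generative video coding.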
Abstract: Beyond traditional hybrid-based video codec, generative video codec could achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns, i.e., when using explicit motion guidance for Generative Human Video Coding (GHVC), the reconstruction results could suffer severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation, namely IMT. In particular, we propose to characterize complex human body signal into compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency compression and high-fidelity synthesis.
[34] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models cs.CVPDF
Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang
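TL;DR: MedSeg-R introduces a new task, medical image reasoning segmentation, which uses the reasoning ability of multimodal large language models (MLLMs) to generate precise segmentation masks, implemented as an end-to-end framework with global context understanding and pixel-level grounding modules.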
Details
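Motivation: Existing medical image segmentation models depend on explicit human instructions and lack active reasoning, which limits their use in automatic diagnosis. MLLMs do well on medical question answering but struggle to generate precise segmentation masks.
Result: MedSeg-R performs strongly on several benchmarks, with high segmentation accuracy and interpretable analysis of medical images.
Insight: By combining the reasoning ability of MLLMs with pixel-level grounding, MedSeg-R offers a more flexible and intelligent solution for medical image segmentation.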
Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also producing the corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R’s superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.
[35] LLMs Are Not Yet Ready for Deepfake Image Detection cs.CVPDF
Shahroz Tariq, David Nguyen, M. A. P. Chamikara, Tingmin Wu, Alsharif Abuadbba
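TL;DR: This paper runs a zero-shot evaluation of four mainstream vision-language models (VLMs) on deepfake image detection. Although the models give coherent explanations and spot surface-level anomalies, they are not yet reliable as standalone detection systems.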
Details
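Motivation: As deepfakes grow more sophisticated, media integrity and public trust face serious challenges. Meanwhile, VLMs have attracted attention for their potential across domains, but their applicability to deepfake detection remains unclear.
Result: VLMs can generate plausible explanations and identify surface-level anomalies, but they are easily swayed by misleading visual patterns (such as vintage aesthetics) and cannot reliably detect deepfakes on their own.
Insight: Although general-purpose models are currently unsuitable for autonomous deepfake detection, their strengths in interpretability and contextual analysis suggest potential in hybrid or human-in-the-loop detection frameworks.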
Abstract: The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model’s classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.
[36] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation cs.CV | cs.AI
Shuyang Li, Shuang Wang, Zhuangzhuang Sun, Jing Xiao
TL;DR: PSLG-SAM tackles the dense-annotation and complex-scene problems of the RRSIS task with a two-stage approach (coarse localization, then fine segmentation), substantially reducing the annotation burden and improving performance.
Details
Motivation: RRSIS requires segmenting specified objects in remote sensing images from text descriptions. Existing methods rely on dense annotation and multi-modal fusion, and struggle with complex scenes and heavy annotation costs.
Result: On the RRSIS-D and RRSIS-M datasets, PSLG-SAM performs strongly and surpasses existing state-of-the-art models.
Insight: Task decomposition and modular design can substantially lower annotation requirements and make the model more robust to complex scenes.
Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions and has attracted widespread research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges such as dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text-described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be train-free, significantly reducing the annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows for focusing on specific region segmentation, avoiding interference from complex scenes. We further contribute a high-quality, multi-category manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art models. Our code will be made publicly available.
[37] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft cs.CV
Jin Huang, Mingqiang Wei, Zikuan Li, Hangyu Qu, Wei Zhao
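TL;DR: J-DDL is an intelligent system for detecting and localizing surface damage on fighter aircraft. It combines 2D images with 3D point clouds and uses an optimized YOLO architecture with a new loss function for high-precision detection.
Details
Motivation: Manual inspection of fighter aircraft surfaces is inefficient and inconsistent, so an automated solution is needed.
Result: Experiments validate the effectiveness of J-DDL and show its potential for automated aircraft inspection.
Insight: Combining 2D and 3D data can improve damage detection accuracy; lightweight design and attention-mechanism optimization are crucial for detection in complex scenes.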
Abstract: Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.
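The abstract does not say how 2D detections are mapped onto the 3D point cloud. One common way to do this, shown here only as a minimal sketch, is to project each 3D point into the image with a pinhole camera model and keep the points that land inside a detected damage box. The function names, the camera intrinsics (`focal`, `cx`, `cy`), and the sample data are all assumptions made for illustration, not details from the paper.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return focal * x / z + cx, focal * y / z + cy


def points_in_box(points, box, focal, cx, cy):
    """Return the 3D points in front of the camera that project inside a 2D box."""
    u_min, v_min, u_max, v_max = box
    hits = []
    for p in points:
        if p[2] <= 0:  # behind the camera, cannot be visible
            continue
        u, v = project(p, focal, cx, cy)
        if u_min <= u <= u_max and v_min <= v <= v_max:
            hits.append(p)
    return hits


# Hypothetical point cloud (camera frame, metres) and one detected damage box (pixels).
cloud = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (-1.0, 0.2, 4.0), (0.1, 0.1, -1.0)]
damage_box = (300, 220, 340, 260)
located = points_in_box(cloud, damage_box, focal=500.0, cx=320.0, cy=240.0)
```

A real system would also need the calibrated pose between the laser scanner and each camera, plus an occlusion check, both of which are left out here.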
[38] CogStream: Context-guided Streaming Video Question Answering cs.CV | cs.AI
Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin
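TL;DR: CogStream introduces a challenging task, context-guided streaming video reasoning, and contributes a densely annotated dataset and a baseline model, CogReasoner, which handles the task efficiently through visual stream compression and historical dialogue retrieval.
Details
Motivation: Existing methods for streaming video reasoning carry a heavy computational burden and give too much weight to irrelevant context. CogStream simulates real-world scenarios and requires models to identify the most relevant historical context to answer questions.
Result: Experiments demonstrate the effectiveness of the method.
Insight: Streaming video reasoning requires efficient filtering of irrelevant context, and CogReasoner's design offers a workable approach.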
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.
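To make the idea of historical dialogue retrieval concrete, the sketch below ranks past question-answer turns by cosine similarity to the current question and keeps the top k. This is a generic retrieval pattern, not CogReasoner's actual mechanism, which the abstract does not describe. The embeddings are hand-written placeholders standing in for the output of some text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_context(query_vec, history, k=2):
    """Return the k past QA turns most similar to the current question.

    `history` is a list of (turn_text, embedding) pairs.
    """
    ranked = sorted(history, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical dialogue history with placeholder embeddings.
history = [
    ("Q1: What is the chef holding? A: A knife.", [0.9, 0.1, 0.0]),
    ("Q2: What colour is the car outside? A: Red.", [0.0, 0.2, 0.9]),
    ("Q3: What is being chopped? A: Onions.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # placeholder embedding of a question about the chef
selected = retrieve_context(query, history, k=2)
```

Only the selected turns would then be passed to the video language model, which is how retrieval keeps the context short as the stream grows.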
[39] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations cs.CV | cs.AI | cs.ET
Yutong Zhou, Masahiro Ryo
TL;DR: The paper proposes an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights about habitat preference, and uses large language models to generate human-readable explanations.
Details
Motivation: Understanding why a species lives in a particular location is essential for understanding ecosystems and conserving biodiversity, but existing ecological workflows are not accessible to non-specialists.
Result: Case studies on a bee and a flower species show the framework's potential to generate statistically grounded, human-readable explanations of species habitat.
Insight: A multimodal AI assistant combined with ecological modeling practice gives non-specialists an intuitive tool for ecological insight.
Abstract: Explaining why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of the multimodal AI assistant backed up by a recommended ecological modeling practice for describing species habitat in human-understandable language.
[40] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics cs.CV | cs.AI
Imanol Solano, Julian Fierrez, Aythami Morales, Alejandro Peña, Ruben Tolosana
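TL;DR: The paper proposes a new metric, CEI, for detecting demographic bias in face recognition systems, especially subtle differences in the distribution tails. CEI analyzes genuine and impostor score distributions separately with a configurable focus on tail probabilities, and outperforms existing methods.
Details
Motivation: Existing metrics struggle to detect subtle demographic bias in high-performance face recognition systems, particularly in the tails of the score distributions.
Result: Experiments confirm that CEI is better at detecting subtle bias and is more sensitive in the tails.
Insight: CEI applies beyond face recognition to other statistical problems that call for analysis of distribution tails.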
Abstract: Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI’s superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.
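The abstract does not give the CEI formula, so the sketch below is not CEI. It is a toy illustration of why a tail-aware comparison can catch what a mean-based one misses: two groups with the same mean score can still differ in how much mass falls below a decision threshold. The function names, the `tail_weight` blend, and the scores are invented for the example.

```python
def tail_probability(scores, threshold):
    """Empirical probability that a score falls below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)


def mean(scores):
    return sum(scores) / len(scores)


def tail_weighted_gap(group_a, group_b, threshold, tail_weight=0.8):
    """Blend the gap in lower-tail mass with the gap in means.

    tail_weight in [0, 1] sets how much of the result comes from the tail.
    """
    tail_gap = abs(tail_probability(group_a, threshold) - tail_probability(group_b, threshold))
    mean_gap = abs(mean(group_a) - mean(group_b))
    return tail_weight * tail_gap + (1.0 - tail_weight) * mean_gap


# Hypothetical genuine-comparison scores for two demographic groups.
group_a = [0.70] * 10
group_b = [0.30, 0.74, 0.74, 0.74, 0.74, 0.74, 0.75, 0.75, 0.75, 0.75]
gap = tail_weighted_gap(group_a, group_b, threshold=0.5)
```

Both groups average 0.70, so a mean-only comparison reports no difference, while one in ten of group B's genuine scores falls below the threshold and would be falsely rejected.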
[41] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers cs.CV | cs.AI
Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang
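TL;DR: DreamActor-H1 is a Diffusion Transformer (DiT)-based framework for generating high-fidelity human-product demonstration videos, addressing the challenges of identity preservation and spatial relationship understanding.
Details
Motivation: In e-commerce and digital marketing, high-fidelity human-product demonstration videos are essential for product presentation, but existing methods either fail to preserve the identities of both the human and the product or lack an understanding of spatial relationships.
Result: The method outperforms prior work in identity integrity and motion realism, producing more realistic human-product interaction videos.
Insight: Combining 3D geometric information with semantic encoding effectively addresses identity preservation and spatial alignment in human-product interaction.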
Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
[42] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration cs.CV
Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, Yulan He
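TL;DR: The paper proposes a new framework, PLACE, which improves medical visual representation learning through pathological-level cross-modal alignment and correlation exploration, without additional human annotation.
Details
Motivation: Data scarcity is severe in the medical domain. Existing methods focus on instance-level or token-level cross-modal alignment and overlook pathological-level consistency; this work aims to fill that gap.
Result: Achieves new state-of-the-art performance on classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.
Insight: Pathological-level alignment and correlation exploration can substantially improve medical visual representation learning, particularly under data scarcity, with good generalization and robustness.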
Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.
[43] DanceChat: Large Language Model-Guided Music-to-Dance Generation cs.CV | cs.MM | cs.SD | eess.AS
Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh
TL;DR: DanceChat is a large language model (LLM)-guided music-to-dance generation method in which the LLM provides textual guidance, producing dance motion that is diverse and aligned with musical style.
Details
Motivation: Existing music-to-dance methods struggle to generate diverse and accurate dance motion because of the semantic gap between music and motion and the scarcity of data.
Result: On the AIST++ dataset and in human evaluations, DanceChat outperforms existing methods in both quality and diversity.
Insight: High-level textual guidance from an LLM effectively bridges the semantic gap between music and motion, improving the diversity and style alignment of generated dance.
Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model's ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.
[44] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning cs.CV
Chun-Mei Feng, Kai Yu, Xinxing Xu, Salman Khan, Rick Siow Mong Goh
TL;DR: The paper proposes T2I-PAL, which combines text-to-image generation models with the CLIP framework to address the modality gap in multi-label image recognition and significantly improves performance.
Details
Motivation: Vision-language pre-trained models such as CLIP align image and text features through contrastive learning, but the modality gap still limits their use in multi-label image recognition. The paper aims to reduce this gap while also reducing reliance on semantically annotated data.
Result: On benchmarks including MS-COCO, T2I-PAL improves on the best existing methods by 3.47% on average.
Insight: 1. Text-to-image generation models can effectively bridge the modality gap. 2. Local feature enhancement is essential for multi-label recognition. 3. Joint prompt and adapter learning offers a new approach to CLIP fine-tuning.
Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP allow texts to be leveraged directly as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average above the top-ranked state-of-the-art methods.
[45] Rethinking Random Masking in Self Distillation on ViT cs.CV
Jihyeon Seong, Hyunkyung Han
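TL;DR: The paper examines the role of random masking in self-distillation frameworks such as DINO and proposes an asymmetric masking strategy that masks only the student's global view, preserving key semantic information and improving performance.
Details
Motivation: Random masking in current self-distillation frameworks such as DINO may inadvertently destroy key semantic information, so more informed masking strategies are needed to improve training.
Result: Evaluated with DINO-Tiny on the mini-ImageNet dataset, the method yields more robust and fine-grained attention maps and improves downstream performance.
Insight: In self-distillation, a sensible masking strategy can significantly improve model performance by preserving key semantic information, and asymmetric masking is one effective way to achieve this.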
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student’s global view, while preserving the student’s local views and the teacher’s global view in their original, unmasked forms. This design leverages DINO’s multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
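The asymmetric setup described above can be sketched in a few lines: the student's global view has a random subset of patches masked, while the teacher's global view and the student's local views pass through untouched. This is a minimal illustration only. Patches are represented as plain list items, `None` stands in for a mask token, and the 50% mask ratio is an arbitrary assumption rather than a value from the paper.

```python
import random


def random_mask(patches, mask_ratio, rng):
    """Replace a random subset of patches with None (a stand-in for a mask token)."""
    n_mask = int(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    return [None if i in masked_idx else p for i, p in enumerate(patches)]


def build_views(global_patches, local_views, mask_ratio=0.5, seed=0):
    """Asymmetric setup: only the student's global view is masked."""
    rng = random.Random(seed)
    return {
        "teacher_global": list(global_patches),  # unmasked, gives clean supervision
        "student_global": random_mask(global_patches, mask_ratio, rng),  # masked
        "student_local": [list(v) for v in local_views],  # unmasked
    }


# A hypothetical 4x4 grid of patches (identified by index) and one local crop.
views = build_views(global_patches=list(range(16)), local_views=[[0, 1, 2, 3]])
```

Because the teacher always sees the full global view, the distillation target stays clean even when the student's input is heavily masked.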
[46] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement cs.CV
Jin Huang, Honghua Chen, Mingqiang Wei
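TL;DR: The paper proposes HEA-MM, a hierarchical error assessment framework for quality evaluation of aircraft CAD models within a manufacturing-and-measurement platform, analyzing error at three levels: global, part, and feature.
Details
Motivation: The high quality demanded of aviation equipment (high performance, high stability, and high reliability) motivates a systematic method for assessing CAD model errors in manufacturing.
Result: Experiments show that HEA-MM performs error assessment effectively on a variety of aircraft CAD models.
Insight: Hierarchical analysis captures manufacturing error more comprehensively, particularly in complex geometry, and combining optimization with feature detection algorithms improves assessment precision.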
Abstract: The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on these patches underlying the point clouds. We propose a novel optimization-based primitive refinement method to obtain a set of meaningful patches of point clouds. Two basic operations, splitting and merging, are introduced to refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate it, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
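For the feature-level step, once edge points of a hole have been found, a circle has to be fitted to them before its radius can be compared with the CAD value. The paper uses a hypothesize-and-clusterize framework for this. The sketch below swaps in a much simpler technique, a single algebraic least-squares fit to one set of 2D edge points, purely to show what the fit-then-compare step looks like. All function names and the sample data are assumptions.

```python
import math


def solve3(m, b):
    """Solve a 3x3 linear system with Gaussian elimination and partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (a[i][3] - sum(a[i][j] * x[j] for j in range(i + 1, 3))) / a[i][i]
    return x


def fit_circle(points):
    """Algebraic least-squares fit of x^2 + y^2 + D*x + E*y + F = 0."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for x, y in points:
        row = (x, y, 1.0)
        rhs = -(x * x + y * y)
        for i in range(3):
            atb[i] += row[i] * rhs
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    d, e, f = solve3(ata, atb)
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)


def radius_error(points, nominal_radius):
    """Deviation of the fitted hole radius from its nominal CAD value."""
    return fit_circle(points)[2] - nominal_radius


# Hypothetical edge points sampled from a hole centred at (2, 3) with radius 5.
edge = [(2 + 5 * math.cos(t), 3 + 5 * math.sin(t)) for t in (0.0, 1.0, 2.0, 3.5, 5.0)]
cx, cy, r = fit_circle(edge)
```

With measured data the points would be noisy and 3D, so a practical pipeline would first project them onto the hole's plane and would need outlier handling, which is part of what a hypothesize-and-cluster approach provides.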
[47] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection cs.CV
Xinyuan Liu, Hang Xu, Yike Ma, Yucheng Zhang, Feng Dai
TL;DR: The paper proposes SSP (Semantic-decoupled Spatial Partition), a unified framework that addresses inadequate sample assignment and instance confusion in point-supervised object detection, significantly improving detection in dense scenes.
Details
Motivation: Dense object scenes in remote sensing imagery require extensive manual annotation. Point-supervised oriented object detection is cheap but suffers from inadequate sample assignment and instance confusion, which SSP is designed to address.
Result: On DOTA-v1.0, SSP reaches 45.78% mAP under point supervision, 4.10% higher than the current best method (PointOBB-v2). Combined with ORCNN and ReDet, mAP reaches 47.86% and 48.50%, respectively.
Insight: By combining rule-driven and data-driven approaches, SSP addresses the core problems of point-supervised object detection and offers a new approach to efficient annotation in dense scenes.
Abstract: Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, yet progress is hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.
[48] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model cs.CV
Eshan Ramesh, Nishio Takayuki
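TL;DR: LatentCSI is a new method for generating images of the environment from WiFi CSI measurements, using a pretrained latent diffusion model for efficient, high-resolution image synthesis.
Details
Motivation: Prior approaches rely on complex techniques such as GANs, which are computationally expensive and limited in quality. This paper aims to simplify the pipeline with a lightweight network and a latent diffusion model, improving generation efficiency and quality.
Result: Validated on self-collected data and the MM-Fi dataset, LatentCSI outperforms baselines in computational efficiency and perceptual quality, and supports text guidance.
Insight: Generating images directly in latent space avoids the complexity of pixel-level encoding; combined with a pretrained model it achieves high-quality results efficiently, and text guidance adds practical value.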
Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM’s denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM’s pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
[49] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling cs.CV
Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu
TL;DR: MSTAR proposes a multi-query scene text retrieval method that needs no bounding box annotations; it dynamically captures multi-grained text representations and incorporates style-aware instructions, significantly improving retrieval performance.
Details
Motivation: Existing scene text retrieval methods depend on costly bounding box annotations and struggle to unify multiple query types; MSTAR aims to solve both problems.
Result: MAP on Total-Text exceeds the previous state of the art by 6.4%, and the average improvement on MQTR is 8.5%.
Insight: 1. The box-free design significantly reduces annotation cost. 2. The unified multi-query strategy accommodates diverse retrieval needs. 3. Dynamic multi-grained representation improves text understanding.
Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
[50] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models cs.CV
Konstantinos Vilouras, Ilias Stogiannidis, Junyu Yan, Alison Q. O’Neil, Sotirios A. Tsaftaris
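TL;DR: This paper proposes an anatomy-informed, weakly supervised prompt tuning framework that improves multi-modal alignment in a pre-trained chest X-ray latent diffusion model, so that it performs well on downstream tasks such as phrase grounding.
Details
Motivation: In medical imaging, the multi-modal alignment of latent diffusion models is limited by data availability, partly due to privacy concerns. The paper addresses the poor alignment between clinically relevant free text in chest X-ray reports and the corresponding image regions.
Result: Sets a new state of the art on the MS-CXR dataset and shows robust performance on the external dataset VinDr-CXR.
Insight: Incorporating anatomical information offers a new direction for improving multi-modal alignment in medical imaging, significantly improving model performance without large amounts of annotated data.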
Abstract: Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.
[51] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models cs.CV | cs.AI
Francisco Caetano, Christiaan Viviers, Peter H. N. De With, Fons van der Sommen
TL;DR: Symmetrical Flow Matching (SymmFlow) is a novel framework unifying image generation, segmentation, and classification through a symmetric learning objective, ensuring bi-directional consistency and preserving semantic information.
Details
Motivation: Existing methods often separate generative modeling, segmentation, and classification tasks. SymmFlow aims to unify these tasks within a single model, leveraging flow matching for improved consistency and efficiency.
Result: Achieves state-of-the-art FID scores (11.9 on CelebAMask-HQ, 7.0 on COCO-Stuff) with only 25 inference steps. It also shows competitive segmentation and promising classification performance.
Insight: SymmFlow demonstrates that unifying generative, segmentation, and classification tasks is feasible through symmetric flow matching, offering a more efficient and consistent framework compared to task-specific models.
Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.
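As background for the abstract above, the sketch below shows the standard flow matching training target on a straight path between a source sample and a data sample. It does not implement SymmFlow's symmetric objective, whose exact form the abstract does not give. The `oracle` model and the sample vectors are placeholders for illustration.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, which is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a model's predicted velocity and the target."""
    xt = interpolate(x0, x1, t)
    pred = predict_velocity(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


x0 = [0.0, 0.0]   # e.g. a noise sample
x1 = [2.0, -4.0]  # e.g. a data sample


def oracle(xt, t):
    """Hypothetical perfect model that always returns the true velocity."""
    return target_velocity(x0, x1)


loss = flow_matching_loss(oracle, x0, x1, t=0.25)
```

In training, `t` is sampled at random and the model is a neural network. A symmetric formulation would, roughly speaking, learn transport in both directions between images and masks within one model rather than the single direction shown here.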
[52] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning cs.CV
Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen
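TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that improves quality through automatic feedback, without human annotation or large-scale compute, needing only 4 GPU-hours to deliver significant gains across 17 evaluation dimensions.
Details
Motivation: Fine-tuning of existing video diffusion models depends on human annotation and large-scale compute, which limits practicality. The authors aim to improve video generation quality through automatic feedback and efficient optimization.
Result: On the VBench-2.0 benchmark, GigaVideo-1 improves performance by about 4% on average while using only 4 GPU-hours.
Insight: Automatic feedback can effectively replace human annotation and efficiently unlock the potential of pre-trained models, offering a low-cost optimization route for video generation.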
Abstract: Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
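The abstract says samples are weighted adaptively using feedback from vision-language models but does not state the weighting rule. One simple possibility, shown here as an assumption rather than the paper's method, is to turn per-sample rewards into weights with a softmax and use them to weight the per-sample training losses. The `temperature` parameter and all numbers are illustrative.

```python
import math


def reward_weights(rewards, temperature=1.0):
    """Softmax over rewards: higher-reward samples get larger weights."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_loss(losses, rewards, temperature=1.0):
    """Reward-weighted sum of per-sample losses."""
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical per-sample losses and rewards from an automatic scorer.
losses = [0.9, 0.5, 0.7]
rewards = [0.2, 0.5, 0.9]
weights = reward_weights(rewards)
batch_loss = weighted_loss(losses, rewards)
```

A lower temperature concentrates the weight on the highest-reward samples, while a higher one moves toward uniform weighting.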
[53] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis cs.CV | cs.AIPDF
Marzieh Oghbaie, Teresa Araújoa, Hrvoje Bogunović
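TL;DR: PiPViT is a prototype-learning method built on a vision transformer (ViT). Using contrastive learning and multi-resolution input processing, it learns interpretable lesion prototypes for retinal image analysis.
Details
Motivation: Existing prototype methods struggle in medical imaging to produce visualized prototypes consistent with human-understandable biomarkers, and their prototypes are too fine-grained, whereas in medical imaging both the presence and the extent of lesions matter.
Result: On four retinal OCT datasets, PiPViT performs on par with state-of-the-art methods while providing more intuitive explanations. The learned prototypes are also semantically and clinically relevant.
Insight: By combining ViT with prototype learning, PiPViT offers a transparent, interpretable approach to medical image diagnosis that helps clinicians understand diagnostic outcomes.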
Abstract: Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT
[54] Enhancing Deepfake Detection using SE Block Attention with CNN cs.CVPDF
Subhram Dasgupta, Janelle Mason, Xiaohong Yuan, Olusola Odeyomi, Kaushik Roy
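TL;DR: The paper proposes a lightweight deepfake detection model based on a CNN with squeeze-and-excitation (SE) block attention. Dynamic channel-wise feature recalibration improves efficiency and accuracy; the model reaches 94.14% classification accuracy and an AUC-ROC of 0.985 on the Style GAN dataset.
Details
Motivation: The highly realistic synthetic content produced by deepfake techniques threatens information authenticity and security. Traditional detection methods struggle to cope, and most existing models are large networks with heavy computational cost.
Result: Strong results on the Style GAN dataset: 94.14% classification accuracy and an AUC-ROC of 0.985.
Insight: The SE block's dynamic feature recalibration effectively boosts lightweight models, providing an efficient solution for settings with limited compute.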
Abstract: In the digital age, deepfakes present a formidable challenge by using advanced artificial intelligence to create highly convincing manipulated content, undermining information authenticity and security. These sophisticated fabrications surpass traditional detection methods in complexity and realism. To address this issue, we aim to harness cutting-edge deep learning methodologies to engineer an innovative deepfake detection model. However, most of the models designed for deepfake detection are large, causing heavy storage and memory consumption. In this research, we propose a lightweight convolutional neural network (CNN) with squeeze-and-excitation (SE) block attention for deepfake detection. The SE block performs dynamic channel-wise feature recalibration, allowing the network to emphasize informative features and suppress less useful ones, which leads to a more efficient and effective learning module. This module is integrated with a simple sequential model to perform deepfake detection. The model is smaller in size and achieves accuracy competitive with existing models for deepfake detection tasks. The model achieved an overall classification accuracy of 94.14% and an AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse Fake Face Dataset. Our proposed approach presents a promising avenue for combating the deepfake challenge with minimal computational resources, developing efficient and scalable solutions for digital content verification.
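The channel-wise recalibration performed by an SE block follows three steps: squeeze (global average pooling), excitation (a small bottleneck network ending in a sigmoid), and scale. The sketch below shows the generic mechanism in plain Python and is not the authors' model; the weight matrices `w1` and `w2` are hypothetical, and biases are omitted for brevity.

```python
import math

def se_block(feature_map, w1, w2):
    """Generic squeeze-and-excitation step (illustrative, biases omitted).

    feature_map: list of C channels, each an H x W nested list of floats.
    w1: (C // r) x C weights of the bottleneck layer (hypothetical values).
    w2: C x (C // r) weights of the expansion layer (hypothetical values).
    """
    # Squeeze: one scalar per channel via global average pooling.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: bottleneck with ReLU, then expansion with sigmoid gates.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]
```

Each gate lies in (0, 1), so a channel is attenuated rather than removed, and the output keeps the input's shape.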
[55] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework cs.CV | cs.CRPDF
Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun
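TL;DR: The paper proposes the Unsourced Adversarial CAPTCHA (UAC) framework, which generates high-fidelity adversarial examples from text prompts, increases CAPTCHA diversity, supports both targeted and untargeted attacks, and effectively resists DNN-based automated attacks.
Details
Motivation: With rapid progress in deep learning, traditional CAPTCHAs are increasingly vulnerable to DNN-driven automated attacks. Existing adversarial attack methods rely on original image characteristics, producing distortions that hinder human interpretation, and have limited applicability when no initial input image is available.
Result: Experiments show that BP-UAC achieves high attack success rates across diverse systems, and the generated CAPTCHAs are indistinguishable to both humans and DNNs.
Insight: By combining text prompts with multimodal optimization, UAC offers a new approach to CAPTCHA design that balances adversarial effectiveness with human readability.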
Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on original image characteristics, resulting in distortions that hinder human interpretation and limit applicability in scenarios lacking initial input images. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel framework generating high-fidelity adversarial examples guided by attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC enhances CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for superior image quality. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-UAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show BP-UAC achieves high attack success rates across diverse systems, generating natural CAPTCHAs indistinguishable to humans and DNNs.
[56] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery cs.CVPDF
Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral
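TL;DR: This work proposes a multi-task, multi-age framework that combines a reweighted loss with age-balanced sampling, significantly improving the accuracy and robustness of underage detection in unconstrained images.
Details
Motivation: Minors are under-represented in public data and distribution shift is severe, so robust models are needed.
Result: The model significantly raises F2 scores on several under-age classification tasks and maintains high recall under distribution shift.
Insight: Joint multi-task optimization and balanced sampling are key, and the new benchmark offers a more rigorous test scenario for practical applications.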
Abstract: Accurate automatic screening of minors in unconstrained images demands models that are robust to distribution shift and resilient to the under-representation of children in publicly available data. To overcome these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing on the legally critical age range. To address the severe class imbalance, we introduce an $\alpha$-reweighted focal-style loss and age-balanced mini-batch sampling, which equalizes twelve age bins during stochastic optimization. Further improvement is achieved with an age gap that removes edge cases from the loss. Moreover, we set a rigorous evaluation by proposing the Overall Under-Age Benchmark, with 303k cleaned training images and 110k test images, defining both the “ASORES-39k” restricted overall test, which removes the noisiest domains, and the age estimation wild shifts test “ASWIFT-20k” of 20k images, stressing extreme pose (>45°), expression, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model “F” lowers the root-mean-square-error on the ASORES-39k restricted test from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline, demonstrating strong generalization under distribution shift. For the under-12 and under-15 tasks, the boosts in F2 are from 0.666 to 0.955 and from 0.689 to 0.916, respectively.
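Two quantities in this abstract have standard definitions that are easy to show: the focal loss with an alpha class weight, and the F2 score, which weights recall more heavily than precision. The sketch below uses the standard binary focal loss as a stand-in; the paper's exact "$\alpha$-reweighted focal-style" variant may differ, and the default `alpha` and `gamma` values are assumptions.

```python
import math

def alpha_focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Standard binary focal loss for predicted positive-class probability p.

    alpha weights the positive class; gamma down-weights easy examples.
    Both defaults are illustrative, not taken from the paper.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion counts; beta=2 gives the F2 score."""
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    return (1.0 + b2) * tp / denom if denom else 0.0
```

With beta = 2, a missed minor (false negative) costs four times as much as a false alarm in the denominator, which matches the screening setting where recall matters most.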
[57] Continual Hyperbolic Learning of Instances and Classes cs.CV | cs.AI | cs.LGPDF
Melika Ayoughi, Mina Ghadimi Atigh, Mohammad Mahdi Derakhshani, Cees G. M. Snoek, Pascal Mettes
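TL;DR: The paper introduces a new continual learning task that classifies instances and classes at the same time, models their hierarchy in hyperbolic space, and proposes HyperCLIC, an algorithm that combines hyperbolic classification and distillation objectives to continually embed hierarchical relations.
Details
Motivation: Real-world applications such as robotics and self-driving cars require models to handle instance and class classification simultaneously, whereas traditional continual learning focuses on only one of the two. The paper therefore proposes learning instances and classes together and modeling their hierarchical structure.
Result: Experiments show that HyperCLIC operates effectively at multiple granularities and improves hierarchical generalization.
Insight: Hyperbolic space is well suited to modeling hierarchies and shows promise for continual learning; jointly learning instances and classes is closer to real application needs.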
Abstract: Continual learning has traditionally focused on classifying either instances or classes, but real-world applications, such as robotics and self-driving cars, require models to handle both simultaneously. To mirror real-life scenarios, we introduce the task of continual learning of instances and classes, at the same time. This task challenges models to adapt to multiple levels of granularity over time, which requires balancing fine-grained instance recognition with coarse-grained class generalization. In this paper, we identify that classes and instances naturally form a hierarchical structure. To model these hierarchical relationships, we propose HyperCLIC, a continual learning algorithm that leverages hyperbolic space, which is uniquely suited for hierarchical data due to its ability to represent tree-like structures with low distortion and compact embeddings. Our framework incorporates hyperbolic classification and distillation objectives, enabling the continual embedding of hierarchical relations. To evaluate performance across multiple granularities, we introduce continual hierarchical metrics. We validate our approach on EgoObjects, the only dataset that captures the complexity of hierarchical object recognition in dynamic real-world environments. Empirical results show that HyperCLIC operates effectively at multiple granularities with improved hierarchical generalization.
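To see why hyperbolic space suits tree-like data, it helps to compute distances in one concrete model. The abstract does not say which model HyperCLIC uses, so the Poincaré ball below is an assumption chosen for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    def sq_norm(x):
        return sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

The same Euclidean gap costs far more distance near the boundary than near the origin, so general concepts can sit near the center while many fine-grained instances spread out toward the edge without crowding each other.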
[58] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement cs.CVPDF
Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu
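TL;DR: The paper proposes the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, which selectively refines regions with poor segmentation quality and significantly improves camouflaged object detection.
Details
Motivation: In camouflaged object detection (COD), targets differ only subtly from their backgrounds. Segmentation results from existing methods leave considerable room for improvement, yet generative post-processing has not been fully explored.
Result: Across multiple COD benchmarks, MAE improves by 5.5% and weighted F-measure by 3.2% on average, with only modest computational overhead.
Insight: Combining generative methods with discriminative models can significantly improve COD through targeted refinement, and uncertainty estimation plays a key role in the process.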
Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.
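The idea of applying Bernoulli noise only where the segmentation is uncertain can be sketched as a single masked forward step. This is a simplified illustration rather than the UMBD algorithm: the threshold `tau`, the noise rate `beta`, and the function name are assumptions, and the uncertainty map is taken as given instead of coming from HUQNet.

```python
import random

def masked_bernoulli_noise(mask, uncertainty, beta=0.3, tau=0.5, rng=None):
    """One forward Bernoulli-noise step restricted to uncertain pixels.

    mask: H x W nested list of 0/1 labels from a coarse segmenter.
    uncertainty: H x W nested list of scores in [0, 1] (assumed given).
    A pixel with uncertainty > tau is resampled from Bernoulli(0.5) with
    probability beta; all other pixels are copied unchanged.
    """
    rng = rng or random.Random(0)
    noisy = []
    for m_row, u_row in zip(mask, uncertainty):
        row = []
        for m, u in zip(m_row, u_row):
            if u > tau and rng.random() < beta:
                row.append(rng.randint(0, 1))  # resample the uncertain pixel
            else:
                row.append(m)  # keep the confidently segmented pixel
        noisy.append(row)
    return noisy
```

Because confident pixels never enter the noising step, a reverse (denoising) process in this setup would only have to repair the uncertain regions.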
[59] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain cs.CVPDF
Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu
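TL;DR: IQE-CLIP proposes query embeddings that combine textual and visual information for zero-/few-shot anomaly detection in the medical domain. Class-based and learnable prompt tokens together with an instance-aware query module significantly improve performance.
Details
Motivation: Existing CLIP-based methods for zero-/few-shot anomaly detection depend on scenario-specific prompt design and mainly target industrial domains, with little exploration of medical tasks. IQE-CLIP aims to address these limitations.
Result: Experiments on six medical datasets show that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings.
Insight: Query embeddings that combine textual and visual information capture anomaly features more effectively, especially in the medical domain. The instance-aware module offers a new approach to cross-modal information fusion.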
Abstract: Recent advances in vision-language models, such as CLIP, have significantly improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based methods assume prior knowledge of categories and rely on carefully designed prompts tailored to specific scenarios. While these text prompts capture semantic information in the textual space, they often fail to distinguish normal and anomalous instances in the joint embedding space. Moreover, most ZFSAD approaches focus on industrial domains, with limited exploration in medical tasks. To address these limitations, we propose IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query embeddings integrating both textual and instance-aware visual information serve as more effective indicators of anomalies. Specifically, we introduce class-based and learnable prompting tokens to better adapt CLIP to the medical setting. Furthermore, we design an instance-aware query module that extracts region-level contextual information from both modalities, enabling the generation of anomaly-sensitive embeddings. Extensive experiments on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Code and data are available at https://github.com/hongh0/IQE-CLIP/.
[60] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework cs.CVPDF
SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen
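TL;DR: PosterCraft is a unified framework for generating high-quality aesthetic posters. Its multi-stage optimization workflow significantly improves text rendering and layout quality.
Details
Motivation: Generating aesthetic posters is harder than generating simple design images: it must balance text rendering, integration of artistic content, and layout harmony. Existing methods are usually modular or rely on predefined layouts, which limits creativity.
Result: Across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and visual appeal, approaching the level of commercial systems.
Insight: Multi-stage optimization and automated data construction enable high-quality poster generation without complex architectural changes, demonstrating the potential of a unified framework.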
Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated in multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: https://ephemeral182.github.io/PosterCraft
[61] SlotPi: Physics-informed Object-centric Reasoning Models cs.CV | cs.AI | cs.LGPDF
Jian Li, Wan Han, Ning Lin, Yu-Liang Zhan, Ruizhi Chengze
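TL;DR: SlotPi is a physics-informed object-centric reasoning model. By combining Hamiltonian principles with a spatio-temporal prediction module, it addresses the integration of physical knowledge and model adaptability in dynamic scene simulation.
Details
Motivation: Current object-centric dynamics simulation methods do not integrate physical knowledge, and their adaptability across diverse scenarios, particularly dynamic scenes involving fluid-object interaction, has not been sufficiently validated.
Result: Strong performance on prediction and VQA tasks on benchmark and fluid datasets, validating the model's adaptability and performance.
Insight: Integrating physical knowledge can significantly improve reasoning in complex dynamic scenes, laying a foundation for more advanced world models.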
Abstract: Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical insights by observing the world and apply this knowledge to accurately reason about various dynamic scenarios; 2) the validation of model adaptability across diverse scenarios. Real-world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot-based physics-informed object-centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Our experiments highlight the model’s strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model’s capabilities. The model’s robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.
[62] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning cs.CVPDF
Ignacio Bugueno-Cordova, Javier Ruiz-del-Solar, Rodrigo Verschae
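TL;DR: This paper proposes a robot navigation controller that combines event cameras with reinforcement learning for real-time human-centered navigation and obstacle avoidance, overcoming the fixed frame rate and motion blur limitations of conventional image-based controllers.
Details
Motivation: Conventional image-based navigation controllers suffer from fixed frame rates, motion blur, and high latency, whereas the asynchronous nature of event cameras allows visual information to be processed flexibly, opening new possibilities for robot navigation.
Result: Robust navigation, including pedestrian following and obstacle avoidance, is demonstrated in simulated environments.
Insight: Combining event cameras with reinforcement learning provides an efficient solution for real-time robot navigation; asynchronous processing markedly improves the system's adaptability.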
Abstract: This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.
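Processing events "over flexible time intervals" usually means accumulating the asynchronous event stream into a frame over whatever window the controller chooses. The sketch below shows that generic accumulation step; the event tuple layout `(t, x, y, polarity)` and the function name are assumptions, and the paper's actual event representation is not described in the abstract.

```python
def accumulate_events(events, t_start, t_end, width, height):
    """Sum event polarities per pixel over the window [t_start, t_end).

    events: iterable of (t, x, y, polarity) tuples, polarity > 0 for a
    brightness increase and <= 0 for a decrease (assumed layout).
    """
    frame = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        if t_start <= t < t_end:
            frame[y][x] += 1 if polarity > 0 else -1
    return frame
```

Because the window is an argument rather than a fixed frame period, a controller can shorten it during fast motion and lengthen it when the scene is static.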
[63] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization cs.CVPDF
Mario Barbara, Alaa Maalouf
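TL;DR: This paper proposes Prompts-to-Summaries, a zero-shot video summarization method driven by natural-language queries. It uses off-the-shelf video-language models (VidLMs) and large language models (LLMs) to produce user-guided summaries without any training data, outperforming unsupervised methods and matching supervised ones.
Details
Motivation: The explosive growth of video data has created demand for summarization tools that need no domain-specific training data and respond flexibly to user intent expressed in natural language. Existing methods either rely on datasets, which limits generalization, or cannot incorporate such intent.
Result: On SumMe and TVSum, the method surpasses all unsupervised methods and performs on par with supervised ones. It is competitive on the QFVS benchmark despite using no training data.
Insight: Pretrained multimodal models, orchestrated with carefully designed prompts and a score propagation mechanism, already provide strong general-purpose video summarization without additional training data.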
Abstract: The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without using any training data, beating all unsupervised methods and matching supervised ones. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally, (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherence) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, even though it uses no training data while the competing methods require supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
[64] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing cs.CV | eess.IV | eess.SPPDF
Hang Zhang, Xiang Chen, Renjiu Hu, Rongguang Wang, Jinwei Zhang
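TL;DR: The paper proposes SmoothProper, an unsupervised deformable image registration method. Structural nonparametric smoothing addresses sparse features and large displacements without label supervision and significantly reduces registration error.
Details
Motivation: To address the weaknesses of existing unsupervised deformable image registration methods on sparse features and large displacements, the paper proposes the SmoothProper module, which tackles smoothness and structural consistency in network predictions.
Result: On a retinal vessel dataset, SmoothProper reduces registration error to 1.88 pixels on 2912x2912 images, making it the first approach to handle sparse features and large displacements effectively.
Insight: Structural nonparametric smoothing shows the potential of unsupervised registration to handle complex image features and offers a new approach for similar tasks.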
Abstract: Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network’s forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.
[65] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders cs.CVPDF
Hui Yang, Wei Sun, Jian Liu, Jin Zheng, Jian Xiao
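TL;DR: The paper proposes HOMAE, an occlusion-aware hand-object pose estimation method based on masked autoencoders. A target-focused masking strategy and multi-scale feature fusion, combined with an implicit SDF and an explicit point cloud, significantly improve pose estimation under occlusion.
Details
Motivation: Existing methods lack global structural perception and reasoning when handling occlusion in hand-object interaction, which hurts pose estimation accuracy. This paper aims to strengthen the model's context awareness with masked autoencoders.
Result: State-of-the-art performance on the DexYCB and HO3Dv2 benchmarks.
Insight: Simulating occlusion with structured masking strengthens contextual reasoning, while combining the SDF with a point cloud offers complementary global and local geometric information.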
Abstract: Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.
[66] VideoDeepResearch: Long Video Understanding With Agentic Tool Using cs.CV | cs.AI | cs.CLPDF
Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou
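TL;DR: VideoDeepResearch proposes an agentic framework built on a text-only reasoning model and a multimodal toolkit. It solves long video understanding tasks by selectively accessing video content and significantly improves performance.
Details
Motivation: Existing multimodal large language models (MLLMs) struggle with long video understanding (LVU) because of context window limits and task complexity. This paper challenges the conventional reliance on extended context windows and strong visual capabilities and instead adopts an agentic tool-using architecture.
Result: Surpasses the previous best results by 9.6%, 6.6%, and 3.9% on MLVU, LVBench, and LongVideoBench, respectively, validating the effectiveness of agentic systems.
Insight: An agentic tool-using architecture, through modular design and task-driven strategies, can handle the complexity of long video understanding without relying on extended context windows or strong vision models.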
Abstract: Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
[67] Post-Training Quantization for Video Matting cs.CV | cs.AIPDF
Tianrui Zhu, Houyuan Chen, Ruihao Gong, Michele Magno, Haotong Qin
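TL;DR: The paper proposes PTQ4VM, a post-training quantization framework designed for video matting. A two-stage quantization strategy, global affine calibration, and an optical-flow assistance component significantly reduce quantization error while preserving temporal coherence.
Details
Motivation: Deploying computationally intensive video matting models on resource-constrained devices is challenging, and post-training quantization (PTQ) has not been systematically studied in this domain.
Result: At 4-bit quantization the model approaches full-precision performance with 8x FLOP savings and outperforms existing quantization methods.
Insight: Combining statistical correction with temporal information significantly improves the quantization of video matting models, providing an efficient solution for practical applications.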
Abstract: Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, even reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model’s ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8x FLOP savings.
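Two ingredients named in this abstract can be shown in miniature: low-bit uniform quantization and an affine correction that compensates for the resulting statistical drift. The sketch below is an assumption-laden illustration, not PTQ4VM itself: the quantizer is a plain symmetric one, and the least-squares fit in `fit_affine` is a stand-in because the abstract does not say how GAC computes its parameters.

```python
def quantize(values, bits=4):
    """Symmetric uniform fake quantization to signed `bits`-bit levels."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level, clamp to the integer range, map back to floats.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def fit_affine(quant_out, fp_out):
    """Least-squares gamma, beta so that gamma * q + beta approximates fp.

    A hypothetical stand-in for a global affine calibration step.
    """
    n = len(quant_out)
    mean_q = sum(quant_out) / n
    mean_f = sum(fp_out) / n
    var_q = sum((q - mean_q) ** 2 for q in quant_out)
    cov = sum((q - mean_q) * (f - mean_f) for q, f in zip(quant_out, fp_out))
    gamma = cov / var_q if var_q > 0 else 1.0
    return gamma, mean_f - gamma * mean_q
```

A typical use would run a small calibration set through both the full-precision and quantized models, fit `gamma` and `beta` on their outputs, and apply the correction at inference time.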
[68] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos cs.CV | cs.AI | cs.MMPDF
Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang
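TL;DR: VRBench is the first benchmark for multi-step reasoning over long narrative videos. It contains 1,010 long videos and 9,468 human-labeled multi-step question-answering pairs and addresses the neglect of temporal reasoning and procedural validity in existing evaluations.
Details
Motivation: Existing evaluation methods are limited for multi-step reasoning over long videos; in particular, temporal reasoning and procedural validity are not adequately assessed. VRBench aims to fill this gap.
Result: Extensive evaluation of 12 LLMs and 16 VLMs shows that VRBench enables comprehensive analysis of models' multi-step reasoning and yields valuable insights for the field.
Insight: In multi-step reasoning tasks, temporal context and procedural validity are critical to model performance, and progress-level evaluation reflects reasoning quality more comprehensively.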
Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models’ multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
[69] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation cs.CVPDF
Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi
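TL;DR: CreatiPoster is a framework for generating editable, multi-layer graphic designs from natural-language instructions or user-provided assets, producing professional-grade designs that remain editable. By combining a protocol model with a conditional background model, it surpasses existing open-source and commercial systems, and the authors release a corpus of 100,000 copyright-free multi-layer designs.
Details
Motivation: Current AI tools for graphic design struggle to combine accurate integration of user assets, editability, and professional visual appeal, and commercial systems that depend on template libraries are inflexible. Solving these problems can advance the democratization of AI-assisted graphic design.
Result: Experiments show that CreatiPoster surpasses existing open-source and commercial systems on graphic design generation and supports diverse applications such as editing and multilingual adaptation.
Insight: Structured generation combined with multimodal models lets AI generate professional, editable graphic designs more flexibly, giving users an efficient tool.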
Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on these rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: https://github.com/graphic-design-ai/creatiposter
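A JSON layer specification of the kind the abstract describes might look like the sketch below. The schema is entirely hypothetical: every field name (`canvas`, `background_prompt`, `layers`, `box`, `z`, and so on) is invented for illustration, and the real CreatiPoster protocol format may differ.

```python
import json

# Hypothetical layer specification; all field names are illustrative only.
spec = {
    "canvas": {"width": 1024, "height": 1536},
    "background_prompt": "soft pastel gradient with abstract paper shapes",
    "layers": [
        {"id": "title", "type": "text", "z": 2, "box": [96, 120, 832, 200],
         "content": "Summer Jazz Night",
         "style": {"font_size": 96, "color": "#1A1A1A"}},
        {"id": "logo", "type": "asset", "z": 1, "box": [412, 1280, 200, 200],
         "src": "user_logo.png"},
    ],
}

def check_spec(spec):
    """Return layer ids back to front, raising if a box leaves the canvas.

    Each box is [x, y, width, height] in pixels (assumed convention).
    """
    width, height = spec["canvas"]["width"], spec["canvas"]["height"]
    for layer in spec["layers"]:
        x, y, box_w, box_h = layer["box"]
        if x < 0 or y < 0 or x + box_w > width or y + box_h > height:
            raise ValueError(f"layer {layer['id']} exceeds the canvas")
    ordered = sorted(spec["layers"], key=lambda layer: layer["z"])
    return [layer["id"] for layer in ordered]

# Round-trip through JSON text to mimic consuming a model's output.
render_order = check_spec(json.loads(json.dumps(spec)))
```

Keeping each layer as structured data rather than flattened pixels is what makes the result editable: text can be retyped and assets moved without regenerating the background.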
[70] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement cs.CV | cs.AIPDF
Guimeng Liu, Milad Abdollahzadeh, Ngai-Man Cheung
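TL;DR: The paper proposes AIR, a zero-shot generative model adaptation method that uses iterative refinement to address the misalignment between text offsets and image offsets in CLIP embedding space, improving image quality in the target domain.
Details
Motivation: Existing zero-shot generative model adaptation methods assume that text offsets and image offsets are perfectly aligned, which degrades generated image quality. Inspired by offset-misalignment studies in NLP, the paper analyzes this misalignment in CLIP embedding space.
Result: Experiments show that AIR outperforms existing methods across all 26 experimental setups, with a marked improvement in generated image quality.
Insight: Offset misalignment correlates with concept distance: closer concepts show less misalignment, which offers a new direction for model optimization.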
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches is the directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets of the two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in quality degradation in generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user studies in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. Additional experiments are in the supplementary material.
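The directional loss that the abstract describes can be illustrated with a minimal sketch. It assumes the common form 1 - cos(image offset, text offset); the toy 2-D vectors stand in for CLIP embeddings, and the function names are hypothetical rather than taken from the AIR code.

```python
import math

def subtract(a, b):
    """Element-wise a - b for two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """1 - cos(image offset, text offset); zero when the offsets are parallel."""
    image_offset = subtract(img_tgt, img_src)
    text_offset = subtract(txt_tgt, txt_src)
    return 1.0 - cosine(image_offset, text_offset)

# Toy embeddings (assumed values, not real CLIP outputs).
aligned = directional_loss([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [3.0, 2.0])
misaligned = directional_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0])
```

Minimizing this loss forces the image offset to follow the text offset exactly, which is the complete-alignment assumption the abstract identifies as a limitation.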
[71] M4V: Multi-Modal Mamba for Text-to-Video Generation cs.CV | cs.AI | cs.LGPDF
Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian
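TL;DR: The paper proposes M4V, a framework that combines a multi-modal Mamba architecture with diffusion models to address the computational cost of Transformers in text-to-video generation, substantially reducing compute while improving video quality through a reward learning strategy.
Details
Motivation: Text-to-video generation is expensive because Transformers incur quadratic complexity when processing spatiotemporal sequences, which limits practical use. A more efficient sequence modeling approach that also supports multi-modal fusion is needed.
Result: On text-to-video benchmarks, M4V produces high-quality videos while significantly lowering computational cost (its Mamba blocks use 45% fewer FLOPs than the attention-based alternative at 768×1280 resolution).
Insight: The Mamba architecture shows promise for video generation; combining multi-modal token re-composition with reward learning improves both generation quality and efficiency.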
Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768×1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V’s ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.
[72] VINCIE: Unlocking In-context Image Editing from Video cs.CV | cs.AI | cs.CL | cs.LG | cs.MMPDF
Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin
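TL;DR: VINCIE proposes a video-based approach to in-context image editing: a block-causal diffusion transformer trained with multi-task learning learns directly from video data, without task-specific pipelines or expert models.
Details
Motivation: Current in-context image editing methods rely on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data, which limits generality and scalability. The work explores whether a more general image editing model can be learned directly from video data.
Result: The model performs strongly on in-context image editing and achieves SOTA on multi-turn image editing benchmarks. It also shows promise in multi-concept composition, story generation, and chain-of-editing tasks.
Insight: Learning directly from video data avoids dependence on task-specific pipelines, and the model generalizes to unseen tasks.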
Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
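To make "block-causal" concrete, here is a minimal sketch of a block-causal attention mask. It assumes the usual definition (tokens attend freely within their own block and to all earlier blocks); the abstract does not give VINCIE's exact masking scheme, and `block_causal_mask` is a hypothetical helper.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j.

    Each block is one element of the interleaved sequence (for example a
    text segment or an image). Attention is bidirectional inside a block
    and causal across blocks.
    """
    block_of = []
    for block_index, size in enumerate(block_sizes):
        block_of.extend([block_index] * size)
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]

# Two blocks: tokens 0-1 form the first, tokens 2-4 the second.
mask = block_causal_mask([2, 3])
```

With this mask, the first block cannot see the second, while every token of the second block sees the whole sequence so far.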
[73] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning cs.CV | cs.CLPDF
Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue
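TL;DR: The paper introduces a new task, knowledge image generation, and releases the MMMG benchmark for evaluating the multimodal reasoning ability of image generation models. Using an expert-validated dataset and a unified graph representation, it exposes reasoning deficits in current models and offers an open-source baseline model.
Details
Motivation: Knowledge images play an important role in human civilization and learning, yet existing image generation models lack the multimodal reasoning needed to generate them. The authors therefore propose the MMMG benchmark to drive progress in knowledge image generation.
Result: An evaluation of 16 SOTA text-to-image generation models reveals serious deficits in entity fidelity, relation strength, and image clarity. GPT-4o reaches an MMMG-Score of only 50.20, and the baseline model FLUX-Reason scores 34.45.
Insight: 1. Knowledge image generation requires stronger multimodal reasoning; 2. A unified graph representation simplifies evaluation; 3. Current models still have substantial room for improvement in explanatory image generation.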
Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning, a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits (low entity fidelity, weak relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark’s difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
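The graph-edit-distance component can be sketched in a simplified form. This is an illustration under stated assumptions, not the MMMG-Score itself: nodes are matched by exact label, only insertions and deletions are allowed at unit cost, and the visual clarity term is left out because the abstract does not specify how the two are combined.

```python
def graph_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Insert/delete edit distance between two labeled graphs.

    With nodes identified by label and unit costs, the distance is the
    number of nodes plus the number of edges present in exactly one graph.
    """
    node_edits = len(set(nodes_a) ^ set(nodes_b))
    edge_edits = len(set(edges_a) ^ set(edges_b))
    return node_edits + edge_edits

# Hypothetical reference KG and a KG extracted from a generated image.
reference_nodes = {"sun", "earth", "moon"}
reference_edges = {("earth", "orbits", "sun"), ("moon", "orbits", "earth")}
generated_nodes = {"sun", "earth"}
generated_edges = {("earth", "orbits", "sun")}

distance = graph_edit_distance(
    reference_nodes, reference_edges, generated_nodes, generated_edges
)
```

Here the generated graph misses one entity and one relation, so the distance is 2; a perfect reproduction would give 0.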
[74] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs cs.CV | cs.AIPDF
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang
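TL;DR: The paper proposes CDPruner, a new visual token pruning method that maximizes conditional diversity to improve the inference efficiency of multimodal large language models (MLLMs).
Details
Motivation: Visual tokens far outnumber text tokens, which makes MLLM inference costly, and existing pruning methods either retain redundant tokens (attention-based) or ignore instruction relevance (similarity-based).
Result: Experiments show that CDPruner performs strongly across a range of MLLMs; on LLaVA it cuts FLOPs by 95% and CUDA latency by 78% while retaining 94% of the original accuracy.
Insight: Maximizing conditional diversity balances image representation against instruction following, yielding pruning that is both efficient and accurate.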
Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
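The idea of selecting a diverse yet instruction-relevant subset with a DPP can be sketched as below. Everything here is an assumption made for illustration: the kernel `relevance[i] * cosine(f_i, f_j) * relevance[j]` is one standard way to condition a DPP on a relevance score, and the brute-force greedy search stands in for whatever solver CDPruner actually uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det

def greedy_dpp_select(features, relevance, keep):
    """Greedily pick `keep` tokens that maximize the kernel sub-determinant."""
    n = len(features)
    kernel = [
        [relevance[i] * cosine(features[i], features[j]) * relevance[j] for j in range(n)]
        for i in range(n)
    ]
    selected = []
    for _ in range(keep):
        best_index, best_det = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            subset = selected + [i]
            sub = [[kernel[a][b] for b in subset] for a in subset]
            det = determinant(sub)
            if det > best_det:
                best_index, best_det = i, det
        selected.append(best_index)
    return selected

# Tokens 0 and 1 are duplicates; token 2 is distinct but less relevant.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
relevance = [0.9, 0.8, 0.5]
chosen = greedy_dpp_select(features, relevance, keep=2)
```

The duplicate token is skipped even though it is more relevant than token 2, because adding it would collapse the determinant to zero. A purely relevance-ranked pruner would have kept both duplicates.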
[75] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos cs.CVPDF
Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou
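TL;DR: GenWorld presents a large-scale, high-quality real-world simulation dataset for detecting AI-generated videos, together with SpannDetector, a model that improves detection by exploiting multi-view consistency.
Details
Motivation: As video generation technology advances rapidly, the credibility of AI-generated videos is an increasingly pressing concern, and existing detection methods are held back by the lack of high-quality datasets.
Result: Experiments show that SpannDetector performs strongly at detecting high-quality generated videos, validating the approach.
Insight: Ignoring real-world clues is a weakness of existing methods; physical plausibility and multi-view consistency are key to improving AI-generated video detection.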
Abstract: The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld
[76] Fine-Grained Perturbation Guidance via Attention Head Selection cs.CV | cs.AI | cs.LGPDF
Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min
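TL;DR: The paper studies fine-grained attention perturbation in diffusion models. It proposes HeadHunter, a framework that selects attention heads to give fine-grained control over generation quality and visual attributes, and introduces SoftPAG to tune perturbation strength.
Details
Motivation: Existing attention perturbation methods lack principled guidance on where to apply perturbations, especially in Diffusion Transformer (DiT) architectures, where quality-relevant computation is distributed across layers.
Result: The method is validated on large-scale DiT text-to-image models such as Stable Diffusion 3 and FLUX.1, where it performs better in both quality enhancement and style control.
Insight: Specific attention heads govern distinct visual concepts (such as structure, style, and texture quality), and composing head selections enables targeted style control.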
Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
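The SoftPAG interpolation described in the abstract can be written out as a minimal sketch. It assumes a single head's attention map given as a row-stochastic square matrix; the function name and the `strength` parameter are hypothetical labels for the "continuous knob".

```python
def soften_attention(attention, strength):
    """Linearly interpolate an attention map toward the identity matrix.

    strength = 0.0 leaves the map unchanged; strength = 1.0 replaces it
    with the identity, so each token attends only to itself.
    """
    n = len(attention)
    return [
        [
            (1.0 - strength) * attention[i][j] + (strength if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]

attention = [[0.5, 0.5], [0.2, 0.8]]
softened = soften_attention(attention, 0.5)
```

Because both the original map and the identity have rows summing to one, every interpolated row still sums to one, so the result remains a valid attention distribution at any strength.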
[77] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model cs.CVPDF
Junqi You, Chieh Hubert Lin, Weijie Lyu, Zhengbo Zhang, Ming-Hsuan Yang
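TL;DR: InstaInpaint proposes a fast 3D scene inpainting framework that completes inpainting within 0.4 seconds, a 1000x speed-up over prior methods, while maintaining strong performance.
Details
Motivation: Existing 3D scene inpainting methods rely on time-consuming optimization and cannot meet the needs of real-time or online applications, so a fast and efficient solution is needed.
Result: On standard benchmarks, InstaInpaint achieves a 1000x speed-up with SOTA performance, and generalizes to downstream tasks such as object insertion and multi-region inpainting.
Insight: Key designs include self-supervised masked fine-tuning and a large reconstruction model (LRM); these improve generalization, textural consistency, and geometric correctness.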
Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on the large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000x speed-up from prior methods while maintaining a state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting. More video results are available at our project page: https://dhmbb2.github.io/InstaInpaint_page/.
cs.CL [Back]
[78] TaskCraft: Automated Generation of Agentic Tasks cs.CLPDF
Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li
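TL;DR: TaskCraft proposes an automated workflow for generating difficulty-scalable, multi-tool, verifiable agentic tasks. Through depth-based and width-based extensions, it addresses the lack of tool interaction in existing data and the reliance on human annotation.
Details
Motivation: Agentic task research faces two problems: existing instruction data lacks tool interaction, and agentic benchmarks depend on costly human annotation. TaskCraft aims to solve both by generating tasks automatically.
Result: Experiments show that the generated tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models.
Insight: By generating tasks automatically, TaskCraft offers a new route to scalable and diverse agentic tasks while reducing the cost of human annotation.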
Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.
[79] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information cs.CLPDF
Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Dhaval Patel
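TL;DR: The paper proposes Chat-of-Thought, a multi-agent system for efficiently generating FMEA documents for industrial assets, using multi-role collaboration and dynamic task routing to optimize generation and validation.
Details
Motivation: Generating FMEA documents in industrial asset management poses efficiency and quality challenges; traditional single-agent approaches struggle with the complexity, so a collaborative, dynamic system is needed.
Result: In the industrial equipment monitoring domain, the system demonstrates the ability to generate and validate FMEA documents efficiently, handling collaboration in complex scenarios.
Insight: Multi-agent collaboration and dynamic role assignment can markedly improve performance on complex tasks, especially in domains that need multi-perspective validation, such as FMEA.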
Abstract: This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.
[80] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering cs.CLPDF
Caijun Jia, Nan Xu, Jingxuan Wei, Qingli Wang, Lei Wang
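TL;DR: The paper proposes ChartReasoner, a code-driven two-stage framework that converts charts into structured ECharts code with high fidelity and automatically generates reasoning trajectories, improving long-chain reasoning in chart question answering.
Details
Motivation: Traditional multimodal reasoning methods lose critical visual detail when converting visual tasks into textual ones, especially in chart question answering. The core challenge is preserving a chart's structural and semantic information while reasoning efficiently.
Result: ChartReasoner performs strongly on four public benchmarks, preserving chart detail with fewer parameters and approaching GPT-4o performance.
Insight: A code-driven approach effectively preserves visual detail, and automated data synthesis is key to improving multimodal reasoning.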
Abstract: Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.
[81] Unsupervised Elicitation of Language Models cs.CL | cs.AIPDF
Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks
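TL;DR: The paper proposes ICM, an unsupervised algorithm that fine-tunes pretrained language models by maximizing internal coherence, without external supervision, and outperforms training on human-annotated data.
Details
Motivation: Where language models exceed human capabilities, high-quality human supervision is hard to obtain, so unsupervised methods are needed to steer models toward downstream tasks.
Result: On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, ICM matches training on golden supervision and outperforms training on crowdsourced human supervision, and it is better at eliciting the model's superhuman capabilities.
Insight: Unsupervised methods have the advantage when model capabilities exceed human ones and may become a viable path for training future frontier models.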
Abstract: To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs’ capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
[82] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective cs.CL | cs.AIPDF
Yi Wang, Max Kreminski
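TL;DR: The paper examines the story-generation ability of large language models (LLMs), analyzing their potential and challenges from a narrative planning perspective. Using a benchmark built on literature examples, it finds that LLMs such as GPT-4 can maintain causal soundness in small-scale stories but still struggle with character intentionality and dramatic conflict.
Details
Motivation: LLMs are widely used for story generation, but automatic evaluation methods for generated stories are limited, and manual evaluation is costly and subjective. Computational narratology provides theoretical grounding for what makes a good story; the paper uses narrative planning problems to better understand LLMs' generation abilities.
Result: GPT-4 tier LLMs can generate causally sound stories at small scales, but fall short when planning with character intentionality and dramatic conflict, which requires LLMs trained with reinforcement learning for complex reasoning.
Insight: LLMs show some promise in story generation, but their ability is limited by narrative complexity; further work is needed to handle higher-level narrative requirements such as character intentionality and dramatic conflict, particularly for applications in game environments.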
Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs’ ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs’ story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.
[83] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval cs.CLPDF
Shubhashis Roy Dipta, Francis Ferraro
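TL;DR: Q2E proposes a query-to-event decomposition method based on large language models and vision-language models for zero-shot multilingual text-to-video retrieval, outperforming existing baselines.
Details
Motivation: When retrieving videos of complex real-world events, existing methods often oversimplify user queries, which hurts retrieval. Q2E aims to improve retrieval by decomposing the query and drawing on the models' latent knowledge.
Result: Across two diverse datasets and multiple retrieval metrics, Q2E outperforms existing methods, and integrating audio information significantly improves retrieval.
Insight: Decomposing complex queries and combining multimodal information (such as audio) can significantly improve video retrieval, especially in zero-shot and multilingual settings.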
Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
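The abstract names entropy-based fusion scoring without defining it. The sketch below shows one plausible reading, stated loudly as an assumption: each source's scores over the candidate videos are turned into a distribution, and sources with lower entropy (more decisive rankings) receive more weight. The weighting formula and all names are hypothetical, not Q2E's published method.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weighted_fusion(score_lists):
    """Fuse per-source scores; assumes at least two candidate videos.

    Each source's weight is 1 - H / H_max (an assumed choice), so a
    near-uniform source contributes almost nothing to the fused score.
    """
    n = len(score_lists[0])
    max_entropy = math.log(n)
    dists = [softmax(scores) for scores in score_lists]
    weights = [max(0.0, 1.0 - entropy(p) / max_entropy) for p in dists]
    total = sum(weights)
    if total == 0.0:  # every source is uniform: fall back to a plain average
        weights = [1.0] * len(dists)
        total = float(len(dists))
    return [sum(w * p[i] for w, p in zip(weights, dists)) / total for i in range(n)]

visual_scores = [5.0, 0.0, 0.0]  # decisive: strongly prefers video 0
audio_scores = [0.0, 0.1, 0.0]   # indecisive: barely prefers video 1
fused = entropy_weighted_fusion([visual_scores, audio_scores])
```

In this toy case the decisive visual source dominates, so video 0 keeps the top fused score even though the audio source marginally prefers video 1.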
[84] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games cs.CL | cs.AIPDF
Prakamya Mishra, Jiang Liu, Jialian Wu, Xiaodong Yu, Zicheng Liu
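TL;DR: The paper introduces TTT-Bench, a benchmark that evaluates the basic strategic, spatial, and logical reasoning abilities of large reasoning models (LRMs) through simple Tic-Tac-Toe-style games. Although these games are easy for humans, the models perform poorly.
Details
Motivation: Current large reasoning models excel in STEM domains, but their reasoning in broader task domains, particularly strategic and spatial reasoning, remains underexplored.
Result: Most models perform poorly on these simple tasks, especially in long-term strategic reasoning, and there is a marked gap relative to their performance on math problems.
Insight: Models do well on complex math problems but are weaker on simple strategic reasoning tasks, which highlights the limitations of current models.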
Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board’s spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench compared to MATH 500 and AIME 2024 respectively, with larger models achieving higher performance using shorter reasoning traces. Most of the models struggle in long-term strategic reasoning situations on simple and new TTT-Bench tasks.
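A verifiable game problem of this kind can be checked programmatically. The sketch below is a minimal example for the classic 3x3 game only, assuming a 9-character board string with "." for empty cells; the four TTT-Bench variants and their generation procedure are not detailed in the abstract, so this is not the benchmark's own code.

```python
# All eight winning lines on a 3x3 board, indexed row by row from 0 to 8.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winner(board):
    """Return "X" or "O" if that player has three in a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def winning_moves(board, player):
    """List every empty cell where `player` wins immediately."""
    moves = []
    for cell in range(9):
        if board[cell] == ".":
            trial = board[:cell] + player + board[cell + 1:]
            if winner(trial) == player:
                moves.append(cell)
    return moves

# X holds the first two cells of the top row, O the first two of the middle row.
board = "XX." "OO." "..."
x_wins = winning_moves(board, "X")
o_wins = winning_moves(board, "O")
```

Because the set of correct answers is computed exhaustively, a model's reply to "where should X play to win?" can be graded automatically against `x_wins`.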
[85] Classifying Unreliable Narrators with Large Language Models cs.CLPDF
Anneliese Brei, Katharine Henry, Abhisheik Sharma, Shashank Srivastava, Snigdha Chaturvedi
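TL;DR: The paper proposes using large language models (LLMs) to identify unreliable narrators, releases the TUNa dataset, and evaluates model performance in few-shot, fine-tuning, and curriculum learning settings.
Details
Motivation: When reading first-person narratives, people often need to consider whether the narrator is reliable, yet existing methods lack a standardized way to identify unreliable narrators. The study aims to fill this gap with computational methods and LLMs.
Result: The task is very challenging, but LLMs show potential for identifying unreliable narrators.
Insight: Methods learned from literary analysis can transfer to real-world text, opening a new direction for applying LLMs to narrative analysis.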
Abstract: Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.
[86] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages cs.CL | cs.AIPDF
Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi
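TL;DR: The paper proposes Flick, a new method for few-label text classification in low-resource languages that significantly improves model performance by optimizing how pseudo-labels are generated and selected.
Details
Motivation: Existing few-label text classification methods face noisy pseudo-labels and domain adaptation challenges in low-resource language settings, especially where linguistic diversity is high.
Result: Flick's superior performance and adaptability are demonstrated on 14 diverse datasets, including low-resource languages such as Arabic and Urdu.
Insight: By focusing on high-confidence pseudo-labels and simplifying pseudo-label generation, Flick is more robust and generalizes better in low-resource language settings.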
Abstract: Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick’s efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.
[87] “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context cs.CL | cs.CYPDF
Chuck Arvin
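TL;DR: The paper studies sycophancy, the tendency of large language models (LLMs) to defer to student suggestions, in a simulated educational setting. It finds that the model's choice of answer is significantly influenced by information the student provides, and that smaller models are more prone to this behavior.
Details
Motivation: In education, LLM sycophancy may benefit students unevenly depending on their knowledge and may even reinforce misunderstandings, so its mechanisms and mitigations need study.
Result: When the student mentions an incorrect answer, model correctness drops by up to 15 percentage points; mentioning the correct answer raises it by the same margin. Smaller models are more sycophantic (for example, an effect of up to 30% for GPT-4.1-nano).
Insight: LLM sycophancy may widen educational inequality; its mechanisms and remedies need further study.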
Abstract: This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.
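[88] Code Execution as Grounded Supervision for LLM Reasoning cs.CL | cs.AI
Dongwon Jung, Wenxuan Zhou, Muhao Chen
TL;DR: This paper proposes a method that uses the determinism of code execution to generate high-quality chain-of-thought (CoT) supervision data, significantly improving the reasoning abilities of large language models (LLMs).
Details
Motivation: Existing methods for generating reasoning data either rely on costly human annotation or use error-prone LLM-generated CoT, so reliability and accuracy are hard to guarantee. The authors therefore extract verifiable reasoning traces from deterministic code execution.
Result: Experiments show that the generated reasoning data is highly accurate and reduces meaningless repetition and overthinking, lowering total token length at inference.
Insight: Using code execution as a supervision signal yields reliable reasoning steps, improving LLM generalization and reasoning efficiency.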
Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into a natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
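To make the idea of execution-derived reasoning traces concrete, here is a minimal sketch; it is not the authors' pipeline. It assumes Python's standard `sys.settrace` hook as the tracing mechanism, and `trace_execution`, `render_cot`, `sum_to`, and the sentence templates are hypothetical names chosen for illustration.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args), recording the local variables seen before each line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            # Line offset relative to the `def` line, plus a snapshot of locals.
            steps.append((frame.f_lineno - code.co_firstlineno, dict(frame.f_locals)))
        elif event == "return":
            steps.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, steps


def render_cot(steps):
    """Turn the recorded trace into templated natural-language steps."""
    sentences = []
    for where, state in steps:
        if where == "return":
            sentences.append(f"The final answer is {state}.")
        else:
            sentences.append(f"Before line {where}, the variables are {state}.")
    return sentences


def sum_to(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, steps = trace_execution(sum_to, 3)
cot = render_cot(steps)
```

Because every sentence is derived from a recorded program state, each step can be checked against the execution itself; a real system would presumably replace the fixed templates with a more fluent rewriting step.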
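[89] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning cs.CL | cs.IR
Xiaohan Yu, Pu Jian, Chong Chen
TL;DR: TableRAG is a retrieval-augmented generation framework for heterogeneous documents that unifies text understanding and table manipulation, significantly improving performance on multi-hop reasoning and global queries.
Details
Motivation: When existing RAG methods handle heterogeneous documents containing both text and tables, flattening and chunking strategies break the table structure, causing information loss and limiting multi-hop reasoning.
Result: Experiments show that TableRAG significantly outperforms baselines on heterogeneous document question answering, setting a new state of the art.
Insight: Preserving table structure and iteratively combining text and table operations is key to improving reasoning over heterogeneous documents.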
Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
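The SQL programming and execution step can be illustrated with a small sketch. This is not TableRAG's implementation: it uses Python's standard `sqlite3` module, and the `sales` table, its columns, and the helper `run_table_query` are hypothetical. The point is only that a global question (an aggregate over every row) is answered exactly when the table is kept intact, whereas a flattened and chunked table may split the rows across retrieval units.

```python
import sqlite3


def run_table_query(rows, sql):
    """Load rows into an in-memory table and execute one SQL query on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


rows = [("EU", 2023, 10.0), ("EU", 2024, 20.0), ("US", 2024, 5.0)]
# A "global" question: total revenue per region needs every row of the table.
totals = run_table_query(
    rows, "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
)
```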
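[90] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier cs.CL | cs.AI | cs.LG
Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu
TL;DR: The paper proposes PAG, a framework in which a large language model alternates between policy and verifier roles under multi-turn reinforcement learning to self-correct, avoiding the redundant generation of earlier methods.
Details
Motivation: LLMs perform well on complex reasoning tasks but still struggle to verify the reliability of their own outputs. Existing methods depend on external verifiers or multi-stage training and scale poorly, so a more efficient self-verification mechanism is needed.
Result: Experiments on multiple reasoning tasks show that PAG improves both direct-generation and self-correction accuracy, and that its self-verification outperforms self-consistency.
Insight: By jointly optimizing generation and verification in a single framework, PAG demonstrates the potential of self-verification for LLMs while mitigating model collapse.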
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG’s dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
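The selective revision mechanism described in the abstract can be sketched as inference-time control flow. This is a simplified illustration, not the PAG training procedure: the callables `generate`, `verify`, and `revise` are hypothetical stand-ins for one model prompted in its policy and verifier roles, and the `max_revisions` cap is an assumption.

```python
def verify_then_revise(question, generate, verify, revise, max_revisions=1):
    """Return (answer, revisions_used); revise only when verification fails."""
    answer = generate(question)
    revisions = 0
    while revisions < max_revisions and not verify(question, answer):
        answer = revise(question, answer)
        revisions += 1
    return answer, revisions


# Toy stand-ins for the single model playing both roles.
def generate(question):
    return "5"


def verify(question, answer):
    return answer == "4"


def revise(question, answer):
    return "4"


answer, used = verify_then_revise("2 + 2 = ?", generate, verify, revise)
```

The contrast with an always-revise loop is that a first attempt the verifier accepts is returned unchanged, with no second generation.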
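[91] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? cs.CL | cs.CV
Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
TL;DR: This paper introduces the TempVS benchmark for evaluating how well multimodal large language models (MLLMs) understand the temporal order of events in image sequences. Experiments show that current MLLMs perform poorly on this task, with a significant gap to human ability.
Details
Motivation: To study the shortcomings of MLLMs in temporal reasoning, especially in understanding event order in image sequences.
Result: Experiments show that MLLMs perform poorly at understanding event order, falling well short of human performance.
Insight: The study exposes weaknesses of MLLMs in temporal reasoning and points to directions for future improvement.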
Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied with a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
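[92] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty cs.CL
Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo
TL;DR: This paper proposes a new method that dynamically adjusts a penalty on output length to make LLM reasoning more efficient: the model produces more concise outputs on simple problems while retaining enough reasoning steps for complex ones, improving overall performance.
Details
Motivation: LLMs perform well on reasoning tasks, but techniques such as Chain-of-Thought prompting often generate lengthy outputs that increase computational latency. Existing methods (e.g., reinforcement learning) do not distinguish problem complexity, which leads to inefficiency. The authors therefore aim to make LLM reasoning more concise on simple problems and more accurate on complex ones.
Result: On GSM8K and MATH500 (the simpler datasets), outputs are markedly shorter with no loss in accuracy; on AIME2024 (the harder dataset), accuracy improves.
Insight: Dynamically adjusting the length penalty effectively balances reasoning efficiency and accuracy, suggesting that LLMs need differentiated optimization strategies for tasks of different complexity.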
Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem’s complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones for accuracy, thus improving the model’s overall performance. Specifically, we manage the model’s reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.
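[93] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers cs.CL
Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu
TL;DR: This paper reframes table-text alignment as an explanation task: beyond predicting a label, models must identify the key table cells, which improves interpretability. The authors extend the SciTab benchmark with cell-level rationale annotations and propose a taxonomy for ambiguous cases. Experiments show that alignment information improves verification performance, but most LLMs fail to recover the human-annotated rationales.
Details
Motivation: Traditional approaches to scientific claim verification only predict a label and offer little insight into the model's reasoning. This study therefore aims to improve interpretability by identifying the key table cells.
Result: (1) Alignment information improves claim verification performance; (2) most LLMs predict the correct label but fail to recover the human-annotated rationales.
Insight: A correct prediction does not necessarily reflect faithful reasoning, which underscores the importance of interpretability in scientific claim verification.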
Abstract: Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model’s reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
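[94] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs cs.CL | cs.AI
Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li
TL;DR: This paper proposes the RRP framework, which combines the semantic and structural information of knowledge graphs to generate high-quality reasoning paths for LLMs, addressing the weakness of existing methods on complex questions and achieving the best performance in experiments.
Details
Motivation: LLMs perform poorly on knowledge-intensive tasks because they lack background knowledge and tend to hallucinate. Existing KG-enhanced methods supply factual knowledge but still struggle with complex questions. The paper argues that the reliability and logical consistency of reasoning paths matter just as much.
Result: On two public datasets, RRP outperforms existing baselines, and it can be plugged into different LLMs to improve their reasoning.
Insight: High-quality reasoning paths not only supplement factual knowledge but also provide logically consistent guidance, which is critical for LLM performance on complex tasks.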
Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is as important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses two challenges: graph structures are complex, and multiple paths are generated, making it difficult to distinguish useful paths from redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.
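[95] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors cs.CL | cs.AI | I.2.7
Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
TL;DR: The paper presents four approaches to mistake identification for AI tutors in the BEA 2025 Shared Task; a retrieval-augmented few-shot prompting system built on a large language model performs best.
Details
Motivation: To evaluate whether an AI tutor correctly identifies student mistakes in mathematical reasoning, improving the accuracy and interpretability of educational feedback.
Result: The retrieval-augmented prompting system outperforms all baselines.
Insight: Combining example-driven prompting with LLM reasoning effectively improves the assessment of pedagogical feedback.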
Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor’s response correctly identifies a mistake in a student’s mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM) i.e. GPT 4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
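[96] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models cs.CL | cs.AI | cs.LG
Ye Yu, Yaoning Yu, Haohan Wang
TL;DR: PREMISE is a prompt-optimization framework that reduces redundant computation by large reasoning models on mathematical reasoning tasks, substantially lowering token usage and cost while preserving accuracy.
Details
Motivation: Long chain-of-thought (CoT) reasoning performs well but is verbose and token-heavy, which raises deployment cost. PREMISE addresses this through prompt optimization, without modifying model weights.
Result: On several math benchmarks, accuracy is matched or improved (e.g., Claude 96%→96%, Gemini 91%→92%) while reasoning tokens drop by up to 87.5% and cost by 69%–82%.
Insight: Prompt-level optimization is a scalable path to efficient reasoning without sacrificing reasoning quality.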
Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy ($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by $69$–$82\%$. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.
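[97] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims cs.CL | cs.IR
Priyanka Kargupta, Runchu Tian, Jiawei Han
TL;DR: ClaimSpect is a retrieval-augmented generation framework that automatically builds a hierarchical analysis of a nuanced claim and enriches it with perspectives retrieved from a relevant corpus, giving a more comprehensive response.
Details
Motivation: Real-world claims (e.g., scientific or political ones) are often nuanced and cannot simply be labeled "true" or "false". A method is needed to break them into sub-aspects that are easier to verify and to analyze them from multiple perspectives.
Result: ClaimSpect's robustness and accuracy are demonstrated on a dataset of real-world scientific and political claims; case studies and human evaluation show its effectiveness over multiple baselines.
Insight: ClaimSpect offers a new way to handle nuanced claims, using hierarchical, multi-perspective analysis to make information more interpretable and useful.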
Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false” – as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.
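[98] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs cs.CL
Alberto Testoni, Iacer Calixto
TL;DR: This paper presents a fine-grained evaluation of uncertainty estimation methods for LLMs on clinical question answering, comparing models and methods across medical specialties and question types, and explores lightweight single-generation estimators.
Details
Motivation: Accurate, well-calibrated uncertainty estimates are essential for deploying LLMs in high-stakes domains such as clinical decision support. The study evaluates how well different LLMs estimate uncertainty on clinical QA.
Result: Performance varies substantially across medical specialties and question types; lightweight single-generation methods approach the performance of Semantic Entropy.
Insight: Model selection should account for the match between question and model; lightweight methods are potentially advantageous where computation is costly.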
Abstract: Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.
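For context on the sampling-based baseline, the following sketch shows an entropy-over-answers estimate. Semantic Entropy groups sampled generations by meaning and computes the entropy over those groups; here we assume, for a multiple-choice setting, that the selected option letter can serve as the meaning group. That simplification and the name `answer_entropy` are ours, not taken from the paper.

```python
import math
from collections import Counter


def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical distribution over answers."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    # p * log(1 / p) summed over distinct answers; each term is non-negative.
    return sum((c / total) * math.log(total / c) for c in counts.values())


confident = answer_entropy(["B", "B", "B", "B", "B"])  # all samples agree
uncertain = answer_entropy(["A", "B", "C", "B", "D"])  # samples disagree
```

An estimate of this kind needs several generations per question, which is the cost the single-pass estimators in the abstract aim to avoid.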
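[99] Improving Named Entity Transcription with Contextual LLM-based Revision cs.CL | cs.AI
Viet Anh Trinh, Xinlu He, Jacob Whitehill
TL;DR: This paper proposes an LLM-based revision mechanism that corrects wrong named entities in ASR predictions by using the LLM's reasoning ability together with local context (e.g., lecture notes) containing the correct named entities. On the new NER-MIT-OpenCourseWare dataset, the method achieves up to 30% relative WER reduction for named entities.
Details
Motivation: ASR systems perform well on general speech, but the error rate on named entities remains high. Named entities are often the key terms, and misrecognizing them seriously affects downstream applications, so an effective way to correct them is needed.
Result: On NER-MIT-OpenCourseWare, named-entity WER is reduced by up to 30% (relative).
Insight: Combining an LLM with local context can effectively correct named-entity errors in ASR output, particularly in specific domains such as education.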
Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.
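The metric behind the headline number can be sketched as follows. The code computes the standard word error rate (word-level edit distance divided by reference length) and a relative reduction; how the paper restricts the computation to named entities is not stated in the abstract, so that part is left out. The function names and the example numbers in the usage lines are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(wer_before, wer_after):
    """Fraction of the baseline WER removed by the revision step."""
    return (wer_before - wer_after) / wer_before


# Illustrative numbers only: a drop from 20% to 14% WER is a 30% relative reduction.
example = relative_reduction(0.20, 0.14)
```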
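[100] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints cs.CL
Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens
TL;DR: LangEdit is a novel null-space constrained framework for mitigating negative interference in multilingual sequential knowledge editing. It projects parameter updates onto the orthogonal complement of previously updated subspaces, precisely isolating language-specific knowledge updates.
Details
Motivation: Keeping knowledge updates consistent across languages in multilingual LLMs is a long-standing open challenge: managing a separate model per language is costly, while editing all languages in one model causes parameter interference.
Result: Evaluation across three model architectures, six languages, and four downstream tasks shows that LangEdit effectively reduces parameter interference and outperforms existing editing methods.
Insight: LangEdit offers an efficient, precise way to update knowledge in multilingual LLMs, addressing interference in cross-lingual knowledge editing.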
Abstract: Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previous updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.
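The projection onto an orthogonal complement can be illustrated on toy vectors. This sketch is not LangEdit's algorithm: the abstract does not say how the previously updated subspaces are constructed for real weight matrices, so we simply assume earlier updates are given as flat vectors and use Gram-Schmidt to build a basis. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, previous_updates):
    """Remove from `update` every component lying in span(previous_updates)."""
    result = list(update)
    for b in orthonormal_basis(previous_updates):
        coeff = dot(result, b)
        result = [ri - coeff * bi for ri, bi in zip(result, b)]
    return result


previous = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # updates from earlier languages
new_update = [0.5, -2.0, 3.0]                   # raw update for the next language
projected = project_to_complement(new_update, previous)
```

After projection, the new update has zero inner product with every earlier update direction, which is the sense in which the updates do not interfere in this toy setting.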
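[101] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization cs.CL
Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu
TL;DR: The paper proposes ReCUT, which uses stepwise exploration and a long-short switched sampling strategy to balance reasoning length and accuracy in LLMs, cutting reasoning length by roughly 30-50% while maintaining or improving accuracy.
Details
Motivation: CoT prompting methods suffer from overthinking, which yields lengthy or redundant reasoning traces. Existing remedies are limited by the quality of the generated data and are prone to overfitting.
Result: On math reasoning datasets, reasoning length falls by roughly 30-50% while accuracy is maintained or improved.
Insight: Preference optimization over long and short reasoning paths, followed by merging the resulting models, markedly improves LLM reasoning efficiency and offers a new approach for complex tasks.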
Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue by curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data, and they are prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of the reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs): one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via https://github.com/NEUIR/ReCUT.
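The final merging step can be sketched as a per-parameter linear interpolation. The abstract only says that the two models' parameters are interpolated; the single scalar `weight`, the dictionary-of-lists representation, and the function name are assumptions made for illustration.

```python
def interpolate_parameters(params_accuracy, params_short, weight=0.5):
    """Return weight * params_accuracy + (1 - weight) * params_short, per tensor."""
    if params_accuracy.keys() != params_short.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name, values_a in params_accuracy.items():
        values_s = params_short[name]
        merged[name] = [weight * a + (1.0 - weight) * s
                        for a, s in zip(values_a, values_s)]
    return merged


# Toy "models" with one two-element parameter each.
accuracy_model = {"w": [1.0, 3.0]}
short_model = {"w": [3.0, 1.0]}
merged = interpolate_parameters(accuracy_model, short_model, weight=0.25)
```

A `weight` closer to 1 keeps more of the accuracy-optimized model, while a value closer to 0 favors the model trained for shorter reasoning.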
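[102] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training cs.CL | cs.IR
Alireza Salemi, Mukta Maddipatla, Hamed Zamani
TL;DR: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework that uses self-training with reward-guided trajectory sampling to optimize inter-agent collaboration, outperforming conventional RAG baselines.
Details
Motivation: To address the limited collaboration efficiency of conventional RAG on complex tasks by dividing work among multiple agents and improving them through self-training.
Result: mRAG outperforms conventional RAG baselines on DataMorgana-derived datasets, and case studies illustrate its practical effectiveness.
Insight: Dividing work among agents and self-training can effectively improve RAG on complex tasks, offering a new approach for real-world applications.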
Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework’s strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.
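[103] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles cs.CL | cs.AI | cs.LG
Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang
TL;DR: This paper proposes SlowFast Sampling, a new dynamic sampling strategy for accelerating diffusion-based language models (dLLMs). Three golden principles (certainty, convergence, and positional) guide the sampling process, and combining the strategy with caching yields substantial speedups, with throughput exceeding that of a strong autoregressive baseline.
Details
Motivation: Existing sampling strategies for diffusion language models (e.g., confidence-based or semi-autoregressive decoding) behave statically, limiting efficiency and flexibility. A dynamic strategy is needed to optimize inference and exploit the parallel generation potential of dLLMs.
Result: SlowFast Sampling achieves up to 15.63× speedup on LLaDA (up to 34.22× with caching) with minimal accuracy drop, and its throughput exceeds that of autoregressive baselines such as LLaMA3 8B.
Insight: (1) Dynamic sampling can unlock the parallel generation potential of dLLMs. (2) Combining caching with the sampling strategy further improves efficiency. (3) The three golden principles offer a general guiding framework for future research.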
Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.
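[104] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models cs.CL | eess.AS
Michele Gubian, Ioana Krehan, Oli Liu, James Kirby, Sharon Goldwater
TL;DR: This paper studies how wav2vec2 models pretrained on different languages encode phonetic, tonal, and speaker information. Using probing classifiers and geometric analyses, it finds that the subspaces for these kinds of information are largely orthogonal and that the structure of the representations is largely independent of the pretraining language.
Details
Motivation: Existing analyses of self-supervised speech models focus mostly on English. This paper examines whether wav2vec2 models pretrained on different languages encode phonetic, tonal, and speaker information in similar ways.
Result: For all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and layerwise probing accuracy follows similar patterns, with only a small matched-language advantage for phones and tones.
Insight: The structure of the representations learned by wav2vec2 is largely language-independent, suggesting that its self-supervised learning captures phonetic, tonal, and speaker information in a general way.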
Abstract: Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
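[105] Slimming Down LLMs Without Losing Their Minds cs.CL | cs.AI
Qingda, Mai
TL;DR: This paper studies how parameter-efficient methods (LoRA and QLoRA) affect LLM performance. It finds that LoRA-based methods improve task performance while remaining computationally efficient, and that performance depends strongly on how well the fine-tuning dataset matches the task.
Details
Motivation: To explore how to fine-tune LLMs efficiently under limited resources while preserving performance.
Result: LoRA-based methods improve task-specific performance while remaining computationally efficient, and performance depends strongly on the alignment between dataset and task.
Insight: The study offers theoretical insight and practical guidance for fine-tuning LLMs in resource-constrained settings.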
Abstract: This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.
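As a rough illustration of why LoRA is parameter-efficient, the sketch below counts trainable parameters for one weight matrix. LoRA freezes the dense weight and learns a low-rank update from two factors, B of shape d_out x rank and A of shape rank x d_in. The layer size and rank used here are hypothetical and are not taken from the paper.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the dense weight matrix that full fine-tuning would update."""
    return d_out * d_in


# Hypothetical layer: a 4096 x 4096 projection adapted with rank 8.
d_out, d_in, rank = 4096, 4096, 8
ratio = lora_trainable_params(d_out, d_in, rank) / full_params(d_out, d_in)
```

For this hypothetical layer the adapter trains under half a percent of the parameters that full fine-tuning would touch.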
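[106] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers cs.CL | cs.LG
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi
TL;DR: This paper examines the duality that LLMs exhibit during fine-tuning (generalization versus hallucination) and explains it through a mechanism called out-of-context reasoning (OCR). Through experiments and theoretical analysis, it relates OCR to matrix factorization and the implicit bias of gradient descent.
Details
Motivation: When fine-tuned, LLMs can generalize from new knowledge yet are also prone to hallucination, and the reasons remain unclear. The paper studies the OCR mechanism to understand the nature of this behavior.
Result: Experiments show that OCR drives both generalization and hallucination. Theoretical analysis highlights the role of matrix factorization and shows that gradient descent favors solutions that minimize the nuclear norm, which explains the model's sample-efficient learning.
Insight: The paper offers a new perspective on model reasoning behavior, emphasizes how matrix structure and the optimization objective shape model capabilities, and provides a theoretical basis for mitigating undesirable behaviors from knowledge injection.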
Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
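For readers unfamiliar with the nuclear norm, the block below gives its definition and a standard variational identity that links factorized weights to it. The symbols $W_O$, $W_V$, and $\sigma_i$ are notation assumed here for the output matrix, the value matrix, and the singular values; the identity holds when the inner dimension of the factorization is at least the rank of $W$. The paper's own theorem may be stated differently.

```latex
\[
\|W\|_* \;=\; \sum_{i} \sigma_i(W),
\qquad
\|W\|_* \;=\; \min_{W = W_O W_V} \tfrac{1}{2}\left( \|W_O\|_F^2 + \|W_V\|_F^2 \right).
\]
```

The second identity suggests why the factorized parameterization matters: keeping the Frobenius norms of the two factors small corresponds to a small nuclear norm of their product, whereas a single combined matrix has no factors to carry this correspondence.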
[107] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP cs.CL | cs.AI
Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard
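TL;DR: BioClinical ModernBERT is a high-performance, long-context encoder model optimized for biomedical and clinical NLP; it achieves notable performance gains through large-scale domain-adaptive pretraining on multi-source datasets.
Details
Motivation: Existing encoder models have progressed slowly in biomedical and clinical NLP and typically rely on a single data source, which limits their adaptability and performance.
Result: It outperforms existing biomedical and clinical encoders on four downstream tasks; base and large model versions and training checkpoints are released.
Insight: Introducing multi-source data and long-context support is key to improving biomedical and clinical NLP models.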
Abstract: Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.
[108] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning cs.CL
Lan Zhang, Marco Valentino, Andre Freitas
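TL;DR: This paper proposes an ensemble of LLM judges (EFG) for automatically evaluating mathematical autoformalization. Multi-dimensional criteria (e.g., logical preservation and mathematical consistency) give a more transparent assessment, and experiments show that it correlates with human evaluation better than a coarse-grained model.
Details
Motivation: Evaluating mathematical autoformalization relies on humans, which is time-consuming and requires domain expertise. Existing LLM-based evaluation methods use overly coarse-grained criteria that fall short for advanced formal mathematical reasoning.
Result: Experiments show that the EFG ensemble correlates more strongly with human evaluation than a coarse-grained model, especially when assessing formal quality.
Insight: LLM judges can offer scalable, interpretable, and reliable automatic evaluation support when their criteria are fine-grained and well defined.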
Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.
[109] Magistral cs.CL
Mistral-AI: Abhinav Rastogi, Albert Q. Jiang, Andy Lo
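TL;DR: Magistral is Mistral's first reasoning model, built on a reinforcement learning (RL) pipeline developed from the ground up that relies entirely on Mistral's own models and infrastructure; it explores the limits of pure RL training of LLMs.
Details
Motivation: Existing approaches usually rely on existing implementations or RL traces distilled from prior models. Magistral instead builds its RL pipeline from scratch to explore the potential of pure RL training of LLMs.
Result: Magistral Medium performs strongly on reasoning tasks, and RL training on text maintains or improves multimodal understanding, instruction following, and function calling; Magistral Small is open-sourced.
Insight: Pure RL training on text data largely preserves the original model's capabilities and can further improve specific tasks, showing the potential of RL in LLM training.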
Abstract: We introduce Magistral, Mistral’s first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint’s capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.
[110] Dynamic Epistemic Friction in Dialogue cs.CL
Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky
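TL;DR: This paper examines “dynamic epistemic friction” when large language models update beliefs in dialogue and proposes a model based on Dynamic Epistemic Logic for predicting and improving belief alignment in dialogue.
Details
Motivation: Existing large language models, when collaborating with humans, do not account for resistance during belief updating, i.e., “epistemic friction,” which limits their effectiveness in complex dialogue scenarios.
Result: Experiments on a situated collaborative task show that the model can effectively predict belief updates in dialogue and provides a theoretical basis for further improving dialogue alignment.
Insight: Dynamic epistemic friction is a key factor affecting belief alignment in dialogue; quantifying it offers a new way to improve language models' adaptability in complex scenarios.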
Abstract: Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of “epistemic friction,” or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent’s current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.
[111] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training cs.CL | cs.AI | cs.LG
Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
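TL;DR: Domain2Vec proposes a training-free method that vectorizes datasets to find the optimal data mixture and improve downstream task performance.
Details
Motivation: Existing data mixture optimization methods usually require substantial training compute. Domain2Vec aims to cut this overhead by analyzing the relationship between data distribution and model performance to find the optimal mixture efficiently.
Result: Domain2Vec reaches the same validation loss on Pile-CC with only 51.5% of the compute needed when training on the original mixture of The Pile, and under an equal compute budget it improves downstream performance by 2.83% on average.
Insight: The alignment between dataset distribution and model performance can be modeled efficiently through vectorization, giving a low-overhead, scalable solution for data mixture optimization.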
Abstract: We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
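The distribution-alignment idea in this abstract can be illustrated with a small sketch. It assumes each dataset is already summarized as a domain vector (a distribution over meta-domains) and picks the mixture whose combined vector is closest to the validation set's vector. The L1 distance, the grid search over the simplex, and all names and numbers below are illustrative assumptions, not the paper's procedure.

```python
from itertools import product

def mix(domain_vectors, weights):
    """Domain vector of a mixture: the weighted average of the per-dataset vectors."""
    n_meta = len(domain_vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, domain_vectors)) for j in range(n_meta)]

def l1_distance(p, q):
    """L1 distance between two distributions (an assumed alignment measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def best_mixture(domain_vectors, target, steps=10):
    """Grid-search mixture weights on the simplex with resolution 1/steps."""
    k = len(domain_vectors)
    best_w, best_d = None, float("inf")
    for counts in product(range(steps + 1), repeat=k):
        if sum(counts) != steps:
            continue  # keep only weight vectors that sum to 1
        w = [c / steps for c in counts]
        d = l1_distance(mix(domain_vectors, w), target)
        if d < best_d:
            best_w, best_d = w, d
    return best_w, best_d

# Hypothetical domain vectors over three meta-domains.
train_vectors = [[0.8, 0.2, 0.0], [0.1, 0.1, 0.8]]
validation_vector = [0.31, 0.13, 0.56]
weights, distance = best_mixture(train_vectors, validation_vector)
print(weights, distance)
```

No model is trained at any point in this sketch; the search only compares distributions, which is the sense in which the selection is training-free.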
[112] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? cs.CL
Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel
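TL;DR: This paper studies how well reasoning models can identify and recover from unhelpful thoughts (such as irrelevant or misleading ones). It finds that models can identify such thoughts but struggle to recover from them, and that larger models fare worse. This calls for improving models' self-reevaluation ability.
Details
Motivation: Recent work shows that reasoning models can reflect and self-validate, but it is unclear how they behave when they actually encounter unhelpful thoughts. This paper aims to fill that gap.
Result: Models can identify unhelpful thoughts but recover poorly once such thoughts are injected, and larger models do worse. The smallest models are the most resistant to distraction by harmful thoughts.
Insight: Models' self-reevaluation still needs improvement, especially under distraction. Scaling up does not necessarily improve performance and may even backfire.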
Abstract: Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general “meta-cognitive” awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.
cs.MA [Back]
[113] AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation cs.MA | cs.CV
Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu
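TL;DR: AniMaker is a multi-agent framework for automatically generating coherent multi-scene storytelling animation; it improves animation quality and consistency through MCTS-driven clip generation and storytelling-aware clip selection.
Details
Motivation: Existing video generation methods face disjointed narratives, pacing issues, and model instability when generating coherent story animation with multiple scenes and characters.
Result: Experiments show that AniMaker performs strongly on VBench and AniEval and significantly improves the efficiency of multi-candidate generation, moving closer to production standards.
Insight: Dividing work among multiple agents and using an MCTS-driven generation strategy can effectively address coherence and quality in multi-scene animation; AniEval provides a new standard for evaluating multi-shot animation.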
Abstract: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation’s logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker’s approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
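The abstract does not spell out how MCTS-Gen navigates the candidate space. As a rough illustration of the general idea behind MCTS-style selection, the sketch below uses the standard UCB1 rule to decide which candidate clip branch to expand next. The branch names, the reviewer scores, and the use of UCB1 itself are assumptions for illustration, not AniMaker's implementation.

```python
import math

def ucb1(mean_score, visits, total_visits, c=1.4):
    """Standard UCB1 value: average reward plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited branches are always tried first
    return mean_score + c * math.sqrt(math.log(total_visits) / visits)

def select_branch(stats, c=1.4):
    """Pick the branch with the highest UCB1 value.

    stats maps a branch name to (sum of reviewer scores, number of visits).
    """
    total = sum(visits for _, visits in stats.values())

    def value(name):
        score_sum, visits = stats[name]
        mean = score_sum / visits if visits else 0.0
        return ucb1(mean, visits, total, c)

    return max(stats, key=value)

# Hypothetical reviewer scores for two candidate clip branches.
stats = {"clip_a": (1.8, 2), "clip_b": (0.5, 1)}
print(select_branch(stats, c=1.4))  # larger c favors the less-explored branch
print(select_branch(stats, c=0.5))  # smaller c favors the higher-scoring branch
```

The exploration constant `c` controls the trade-off between generating more candidates for a promising branch and trying a branch that has been sampled less often.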
physics.med-ph [Back]
[114] Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation physics.med-ph | cs.CV
Nicholas Summerfield, Qisheng He, Alex Kuo, Ahmed I. Ghanem, Simeng Zhu
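TL;DR: MAGIC is a multi-modality cardiac substructure segmentation method that performs cross-modality segmentation with a single model, outperforms the comparison models in most cases, and is computationally lightweight.
Details
Motivation: Cardiac substructure segmentation is essential in radiation therapy planning, but existing deep learning methods generalize poorly across modalities and overlapping structures.
Result: Average Dice scores on Sim-CT, MR-Linac, and CCTA are 0.75, 0.68, and 0.80, respectively; MAGIC outperforms the comparison models in most cases (57%).
Insight: MAGIC shows the potential of a single model to handle multi-modality tasks and offers a lightweight, flexible solution for clinical use.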
Abstract: Cardiac substructures are essential in thoracic radiation therapy planning to minimize risk of radiation-induced heart disease. Deep learning (DL) offers efficient methods to reduce contouring burden but lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and multi-modal cardiac substructure segmentation. MAGIC is implemented through replicated encoding and decoding branches of an nnU-Net-based, U-shaped backbone conserving the function of a single model. Twenty cardiac substructures (heart, chambers, great vessels (GVs), valves, coronary arteries (CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac, and cardiac CT angiography (CCTA) modalities were manually delineated and used to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison models (four segmentation subgroups across three modalities) were equivalently trained. All methods were compared for training efficiency and against reference contours using the Dice Similarity Coefficient (DSC) and two-tailed Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were 0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC outperforms the comparison in 57% of cases, with limited statistical differences. MAGIC offers an effective and accurate segmentation solution that is lightweight and capable of segmenting multiple modalities and overlapping structures in a single model. MAGIC further enables clinical implementation by simplifying the computational requirements and offering unparalleled flexibility for clinical settings.
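For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient between a predicted mask P and a reference mask R is 2|P ∩ R| / (|P| + |R|). The following is a minimal sketch on flattened binary masks; the toy masks and the convention for two empty masks are assumptions, and this is not the paper's evaluation code.

```python
def dice_coefficient(pred, ref):
    """Dice Similarity Coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as perfect agreement
    return 2.0 * intersection / total

# Toy masks: 2 overlapping voxels, 3 foreground voxels in each mask.
pred = [1, 1, 0, 0, 1, 0]
ref = [1, 0, 0, 0, 1, 1]
print(dice_coefficient(pred, ref))
```

A score of 1 means the masks coincide and 0 means they do not overlap, so the reported averages of 0.68 to 0.80 indicate substantial but imperfect overlap with the manual contours.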
cs.IR [Back]
[115] Conversational Search: From Fundamentals to Frontiers in the LLM Era cs.IR | cs.CL
Fengran Mo, Chuan Meng, Mohammad Aliannejadi, Jian-Yun Nie
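TL;DR: This tutorial introduces the fundamentals of conversational search and the frontier research driven by large language models (LLMs), aiming to give researchers and practitioners in academia and industry comprehensive knowledge.
Details
Motivation: Conversational search meets complex information needs through multi-turn interaction, but the arrival of LLMs brings new opportunities and challenges that call for revisiting how the field should develop.
Result: Participants will learn the core principles and emerging techniques needed to build next-generation conversational search systems.
Insight: Applying LLMs to conversational search raises the level of intelligence but also brings new research challenges, such as context understanding and dynamic interaction optimization.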
Abstract: Conversational search enables multi-turn interactions between users and systems to fulfill users’ complex information needs. During this interaction, the system should understand the users’ search intent within the conversational context and then return the relevant information through a flexible, dialogue-based interface. The recent powerful large language models (LLMs) with capacities of instruction following, content generation, and reasoning, attract significant attention and advancements, providing new opportunities and challenges for building up intelligent conversational search systems. This tutorial aims to introduce the connection between fundamentals and the emerging topics revolutionized by LLMs in the context of conversational search. It is designed for students, researchers, and practitioners from both academia and industry. Participants will gain a comprehensive understanding of both the core principles and cutting-edge developments driven by LLMs in conversational search, equipping them with the knowledge needed to contribute to the development of next-generation conversational search systems.
eess.IV [Back]
[116] Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective eess.IV | cs.CV
Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai
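TL;DR: HFF-Net rethinks brain tumor segmentation from a frequency-domain perspective, proposing a network that combines Frequency Domain Decomposition (FDD), Adaptive Laplacian Convolution (ALC), and Frequency Domain Cross-Attention (FDCA), and significantly improves segmentation of contrast-enhancing tumor regions.
Details
Motivation: Existing methods degrade when segmenting contrast-enhancing brain tumor regions in MRI, mainly because they do not sufficiently account for tumor features such as complex textures and directional variations.
Result: On four public datasets, mean Dice scores improve by an average of 4.48% (relative) and contrast-enhancing tumor regions improve by 7.33%, with favorable computational efficiency and clinical applicability.
Insight: A frequency-domain perspective can effectively capture the complex features of tumor regions; adaptively emphasizing high-frequency details is the key source of improvement.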
Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39% to 7.72%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: https://github.com/VinyehShaw/HFF.
[117] Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation eess.IV | cs.CV
Emerson P. Grabke, Masoom A. Haider, Babak Taati
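TL;DR: This paper proposes CCELLA, a novel dual-head conditioning approach that combines text features from a non-medical large language model with pathology classification to train a latent diffusion model (LDM) that generates high-quality medical images, particularly when data are limited. The method significantly improves synthetic image quality and classifier performance.
Details
Motivation: In medical image generation, LDM training typically relies on limited text encoders, the reuse of non-medical LDMs, or fine-tuning with large data volumes, which limits performance and scientific accessibility. The authors aim to address these issues with a data-efficient LDM training framework.
Result: On a 3D prostate MRI dataset, the method reaches an FID of 0.025, significantly better than a recent foundation model (FID 0.071). Adding the synthetic images to the training set improves classifier accuracy from 69% to 74%, and a classifier trained only on synthetic images performs comparably to one trained on real images.
Insight: Combining text features from the non-medical domain with pathology classification from the medical domain can effectively improve LDM performance in medical image generation, especially when data are scarce. The approach opens new possibilities for medical image synthesis.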
Abstract: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM training typically relies on performance- or scientific accessibility-limiting strategies including a reliance on short-prompt text encoders, the reuse of non-medical LDMs, or a requirement for fine-tuning with large data volumes. We propose a Class-Conditioned Efficient Large Language model Adapter (CCELLA) to address these limitations. CCELLA is a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with non-medical large language model-encoded text features through cross-attention and with pathology classification through the timestep embedding. We also propose a joint loss function and a data-efficient LDM training framework. In combination, these strategies enable pathology-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility. Our method achieves a 3D FID score of 0.025 on a size-limited prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.071. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method to the training dataset improves classifier accuracy from 69% to 74%. Training a classifier solely on our method’s synthetic images achieved comparable performance to training on real images alone.
[118] DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction eess.IV | cs.AI | cs.CV
Yuliang Zhu, Jing Cheng, Qi Xie, Zhuo-Xu Cui, Qingyong Zhu
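TL;DR: This paper proposes DUN-SRE, a novel deep unrolling network with spatiotemporal rotation equivariance for dynamic MRI reconstruction, which significantly improves image quality.
Details
Motivation: Dynamic MRI exhibits spatiotemporal symmetries (spatial rotation symmetry and temporal symmetry). Existing methods fail to fully exploit these priors, especially temporal symmetry, which hurts reconstruction quality.
Result: It achieves state-of-the-art performance on cardiac CINE MRI datasets, particularly in preserving rotation-symmetric structures.
Insight: Explicitly modeling symmetry priors is crucial for dynamic MRI reconstruction; a spatiotemporally equivariant architecture can effectively capture the physical dynamics of cardiac motion.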
Abstract: Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries, including spatial rotation symmetry within individual frames and temporal symmetry along the time dimension. Explicit incorporation of these symmetry priors in the reconstruction model can significantly improve image quality, especially under aggressive undersampling scenarios. Recently, equivariant convolutional neural networks (ECNNs) have shown great promise in exploiting spatial symmetry priors. However, existing ECNNs critically fail to model temporal symmetry, arguably the most universal and informative structural prior in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance through a (2+1)D equivariant convolutional architecture. In particular, it integrates both the data consistency and proximal mapping modules into a unified deep unrolling framework. This architecture ensures rigorous propagation of spatiotemporal rotation symmetry constraints throughout the reconstruction process, enabling more physically accurate modeling of cardiac motion dynamics in cine MRI. In addition, a high-fidelity group filter parameterization mechanism is developed to maintain representation precision while enforcing symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in preserving rotation-symmetric structures, offering strong generalization capability to a broad range of dynamic MRI reconstruction tasks.
[119] ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation eess.IV | cs.CV
Xi Chen, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane
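TL;DR: This paper proposes ConStyX, which augments both the content and the style of images to address domain generalization in medical image segmentation, overcoming the limitations of prior methods that rely only on style perturbation and neglect the adverse effects of over-augmentation.
Details
Motivation: Medical images usually come from multiple domains, and the resulting domain shift impairs segmentation models. Existing domain randomization methods rely only on style perturbation and neglect the adverse effects of over-augmentation.
Result: Experiments across multiple domains show that ConStyX significantly improves generalization in medical image segmentation.
Insight: Augmenting both content and style simulates domain variation more fully, and mitigating the adverse effects of over-augmentation is key to better generalization.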
Abstract: Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations: 1) constrained efficiency of domain randomization due to their exclusive dependence on image style perturbation, and 2) neglect of the adverse effects of over-augmented images on model training. To address these issues, we propose a novel domain randomization-based DG method, called content style augmentation (ConStyX), for generalizable medical image segmentation. Specifically, ConStyX 1) augments the content and style of training data, allowing the augmented training data to better cover a wider range of data domains, and 2) leverages well-augmented features while mitigating the negative effects of over-augmented features during model training. Extensive experiments across multiple domains demonstrate that our ConStyX achieves superior generalization performance. The code is available at https://github.com/jwxsp1/ConStyX.
[120] Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches eess.IV | cs.AI | cs.CV | A.1; I.2.0; I.4.6
Andrea Moglia, Matteo Leccardi, Matteo Cavicchioli, Alice Maccarini, Marco Marcon
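TL;DR: This paper comprehensively surveys generalist models for medical image segmentation, focuses on comparing their performance with task-specific models, and discusses future directions and challenges.
Details
Motivation: Inspired by the success of large language models and the Segment Anything Model (SAM), researchers want to explore generalist models in medical image segmentation to improve generalization and reduce the need for task-specific models.
Result: Generalist models show potential in medical image segmentation, but challenges such as regulatory compliance, privacy, and security remain, and on some tasks their performance may fall short of task-specific models.
Insight: Future research directions include synthetic data, early fusion, lessons from generalist models in natural language processing, and the feasibility of clinical translation.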
Abstract: Following the successful paradigm shift of large language models, leveraging pre-training on a massive corpus of data and fine-tuning on different downstream tasks, generalist models have made their foray into computer vision. The introduction of Segment Anything Model (SAM) set a milestone on segmentation of natural images, inspiring the design of a multitude of architectures for medical image segmentation. In this survey we offer a comprehensive and in-depth investigation of generalist models for medical image segmentation. We start with an introduction to the fundamental concepts underpinning their development. Then, we provide a taxonomy of the different variants of SAM in terms of zero-shot, few-shot, fine-tuning, adapters, on the recent SAM 2, on other innovative models trained on images alone, and others trained on both text and images. We thoroughly analyze their performances at the level of both primary research and best-in-literature, followed by a rigorous comparison with the state-of-the-art task-specific models. We emphasize the need to address challenges in terms of compliance with regulatory frameworks, privacy and security laws, budget, and trustworthy artificial intelligence (AI). Finally, we share our perspective on future directions concerning synthetic data, early fusion, lessons learnt from generalist models in natural language processing, agentic AI and physical AI, and clinical translation.
[121] Med-URWKV: Pure RWKV With ImageNet Pre-training For Medical Image Segmentation eess.IV | cs.CV
Zhenhuan Zhou
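TL;DR: Med-URWKV is a medical image segmentation model based on a pure RWKV architecture; it is the first to use an ImageNet-pretrained VRWKV encoder and performs well on multiple datasets.
Details
Motivation: Existing medical image segmentation methods (CNN-based, Transformer-based, or hybrid) suffer from restricted receptive fields or high computational complexity. RWKV is a promising alternative thanks to its linear complexity and long-range modeling, but the benefits of pretraining have not been fully exploited.
Result: On seven datasets, it performs comparably to or better than carefully optimized RWKV models trained from scratch, demonstrating the value of pretraining.
Insight: A pretrained VRWKV encoder can provide stronger feature representations for medical image segmentation, and a pure RWKV architecture stays competitive while remaining efficient.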
Abstract: Medical image segmentation is a fundamental and key technology in computer-aided diagnosis and treatment. Previous methods can be broadly classified into three categories: convolutional neural network (CNN) based, Transformer based, and hybrid architectures that combine both. However, each of them has its own limitations, such as restricted receptive fields in CNNs or the computational overhead caused by the quadratic complexity of Transformers. Recently, the Receptance Weighted Key Value (RWKV) model has emerged as a promising alternative for various vision tasks, offering strong long-range modeling capabilities with linear computational complexity. Some studies have also adapted RWKV to medical image segmentation tasks, achieving competitive performance. However, most of these studies focus on modifications to the Vision-RWKV (VRWKV) mechanism and train models from scratch, without exploring the potential advantages of leveraging pre-trained VRWKV models for medical image segmentation tasks. In this paper, we propose Med-URWKV, a pure RWKV-based architecture built upon the U-Net framework, which incorporates ImageNet-based pretraining to further explore the potential of RWKV in medical image segmentation tasks. To the best of our knowledge, Med-URWKV is the first pure RWKV segmentation model in the medical field that can directly reuse a large-scale pre-trained VRWKV encoder. Experimental results on seven datasets demonstrate that Med-URWKV achieves comparable or even superior segmentation performance compared to other carefully optimized RWKV models trained from scratch. This validates the effectiveness of using a pretrained VRWKV encoder in enhancing model performance. The codes will be released.
cs.CR [Back]
[122] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models cs.CR | cs.CL
Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma
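TL;DR: GenBreak is a framework that fine-tunes a large language model (LLM) to systematically probe safety vulnerabilities in text-to-image (T2I) generators. Combining supervised fine-tuning with reinforcement learning, it produces adversarial prompts that both bypass safety mechanisms and elicit harmful content.
Details
Motivation: Existing T2I models can be misused to generate harmful content, while conventional safety testing methods are limited: their prompts are either easily detected or fail to produce genuinely harmful outputs. A reliable tool for evaluating T2I model safety is therefore needed.
Result: The generated adversarial prompts are highly effective in black-box attacks against commercial T2I generators, exposing practical safety weaknesses.
Insight: Generating adversarial prompts requires balancing stealth and harmfulness, and the design of multiple reward signals is key; T2I safety defenses still need further improvement.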
Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical and concerning safety weaknesses.
[123] Secure Data Access in Cloud Environments Using Quantum Cryptography cs.CR | cs.CV
S. Vasavi Venkata Lakshmi, Ziaul Haque Choudhury
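TL;DR: This paper proposes combining quantum key distribution (QKD) with the quantum one-time pad (QOTP), using the BB84 protocol to secure data transmission in cloud environments as a defense against future quantum computer threats.
Details
Motivation: As quantum computers advance, traditional encryption methods may be unable to withstand future security threats. Data security in cloud environments has become a pressing problem that needs new techniques.
Result: The study reports that quantum cryptography is effective in cloud environments and can provide strong security for data storage and sharing, even against attacks that use quantum computers.
Insight: Quantum cryptography offers a forward-looking solution to future data security problems, especially in distributed environments such as the cloud, and shows potential for practical application.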
Abstract: Cloud computing has made storing and accessing data easier, but keeping it secure is a big challenge nowadays. Traditional methods of securing data may not be strong enough in the future when powerful quantum computers become available. To solve this problem, this study uses quantum cryptography to protect data in the cloud environment. Quantum Key Distribution (QKD) creates secure keys by sending information using quantum particles like photons. Specifically, we use the BB84 protocol, a simple and reliable way to make secure keys that cannot be stolen without detection. To protect the data, we use the Quantum One-Time Pad (QOTP) for encryption and decryption, ensuring the data stays completely private. This study shows how these quantum methods can be applied in cloud systems to provide a strong defense against hackers, even if they have access to quantum computers. The combination of QKD, BB84, and QOTP creates a safe and reliable way to keep data secure when it is stored or shared in the cloud. Using quantum cryptography, this paper provides a way to ensure data security now and in the future, making cloud computing safer for everyone.
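The key-agreement step described above can be illustrated with a purely classical simulation of BB84 basis sifting, followed by a one-time-pad XOR. This is a sketch under strong assumptions: a noiseless channel, no eavesdropper, and no error-estimation step. The classical XOR pad is a stand-in for the paper's Quantum One-Time Pad, and the function names are hypothetical.

```python
import random

def bb84_sift(n_qubits, rng):
    """Simulate BB84 sifting over a noiseless channel with no eavesdropper."""
    alice_bits = [rng.randint(0, 1) for _ in range(n_qubits)]
    alice_bases = [rng.choice("+x") for _ in range(n_qubits)]
    bob_bases = [rng.choice("+x") for _ in range(n_qubits)]
    # Matching basis: Bob reads Alice's bit. Mismatched basis: his result is random.
    bob_bits = [
        bit if a_basis == b_basis else rng.randint(0, 1)
        for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases)
    ]
    # The bases (not the bits) are compared publicly; only matching positions are kept.
    kept = [i for i in range(n_qubits) if alice_bases[i] == bob_bases[i]]
    return [alice_bits[i] for i in kept], [bob_bits[i] for i in kept]

def xor_pad(bits, key):
    """Classical one-time pad: XOR each message bit with a fresh key bit."""
    if len(key) < len(bits):
        raise ValueError("one-time pad key must be at least as long as the message")
    return [b ^ k for b, k in zip(bits, key)]

rng = random.Random(7)
alice_key, bob_key = bb84_sift(256, rng)
message = [1, 0, 1, 1, 0, 0, 1, 0]
ciphertext = xor_pad(message, alice_key)   # Alice encrypts with her sifted key
recovered = xor_pad(ciphertext, bob_key)   # Bob decrypts with his sifted key
print(recovered == message)
```

In a full protocol, the two parties would also sacrifice part of the sifted key to estimate the error rate, since an eavesdropper measuring in the wrong basis introduces detectable errors.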
cs.LG [Back]
[124] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs cs.LG | cs.AI | cs.CL | cs.CVPDF
Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu
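TL;DR: Omni-DPO is a dual-perspective optimization framework that adaptively weights samples by combining data quality with the model's learning dynamics, significantly improving reinforcement learning from human feedback (RLHF) performance.
Details
Motivation: Existing DPO methods treat all preference pairs identically, ignoring differences in their inherent quality and learning utility, which leads to suboptimal data utilization and performance.
Result: On textual understanding and mathematical reasoning tasks, Omni-DPO significantly outperforms baseline methods; Gemma-2-9b-it finetuned with Omni-DPO beats Claude 3 Opus by 6.7 points on the Arena-Hard benchmark.
Insight: Jointly accounting for the inherent quality of the data and the model's learning dynamics is key to improving RLHF.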
Abstract: Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model’s evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model’s learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
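The abstract does not state Omni-DPO's weighting formula, so the following is only a hedged sketch of the general idea: the standard DPO pair loss combined with a per-pair weight. The function `weighted_dpo_loss` and the example weights are hypothetical placeholders, not the authors' method.

```python
import math

def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def weighted_dpo_loss(pairs, weights, beta=0.1):
    """Weighted mean of pair losses; `weights` are hypothetical per-pair weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    losses = [dpo_pair_loss(*pair, beta=beta) for pair in pairs]
    return sum(w * l for w, l in zip(weights, losses)) / total

# Each pair: (policy log p(chosen), policy log p(rejected), ref log p(chosen), ref log p(rejected)).
pairs = [(-12.0, -15.0, -13.0, -14.0), (-20.0, -19.0, -20.0, -20.0)]
uniform = weighted_dpo_loss(pairs, [1.0, 1.0])
# Down-weighting the first (already well-separated) pair shifts the loss toward the second.
reweighted = weighted_dpo_loss(pairs, [0.2, 1.0])
```

With uniform weights this reduces to ordinary DPO averaged over the batch; any quality-based or dynamics-based weighting scheme would plug in through `weights`.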
[125] Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning cs.LG | cs.AI | cs.CL | stat.MLPDF
Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang
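TL;DR: The paper proposes a causal representation learning framework that models benchmark performance as a linear function of latent capability factors, revealing a hierarchical causal structure among language model capabilities.
Details
Motivation: Evaluating language model capabilities faces methodological challenges such as confounding effects and high computational cost, and calls for more rigorous causal analysis.
Result: On data from over 1500 models evaluated across six benchmarks, the authors identify a three-node linear causal structure that reveals the hierarchical direction of capability development.
Insight: Capabilities develop along a causal direction from general problem solving, through instruction following, to mathematical reasoning; base-model variation is a key factor affecting evaluation.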
Abstract: Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.
[126] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series cs.LG | cs.AI | cs.CLPDF
Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng
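TL;DR: The paper introduces the Time-IMM dataset and the IMM-TSF benchmark library, which target irregular, multimodal real-world time series, and shows experimentally that multimodal modeling matters for forecasting performance.
Details
Motivation: Real-world time series are often irregular, multimodal, and messy, but existing benchmarks typically assume clean, regularly sampled, unimodal data, leaving a gap between research and deployment.
Result: Experiments show that explicitly modeling multimodality on irregular time series substantially improves forecasting performance.
Insight: Explicitly modeling irregularity and multimodality is critical to time series performance; the benchmark provides evaluation that is closer to real-world conditions.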
Abstract: Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
[127] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering cs.LG | cs.CLPDF
Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai
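TL;DR: The paper proposes agentic prompt optimization for evidence-grounded clinical question answering, splitting the task into two stages (evidence identification and answer generation) and tuning each with a prompt optimizer; the system placed second in the ArchEHR-QA shared task.
Details
Motivation: Automated question answering over electronic health records (EHRs) can give clinicians and patients critical information, but it requires precise evidence retrieval and reliable answer generation under limited supervision.
Result: The method scores 51.5 on the hidden test set, placing second and outperforming zero-shot and few-shot prompting by over 20 and 10 points, respectively.
Insight: Data-driven prompt optimization is a cost-effective alternative to model fine-tuning and can improve the reliability of high-stakes clinical question answering.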
Abstract: Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
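The abstract mentions a self-consistency voting scheme for evidence identification but does not specify it. One plausible reading, sketched below under that assumption, is to sample the evidence-selection stage several times and keep the sentences chosen in at least a given fraction of runs. The function `vote_evidence` and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter

def vote_evidence(samples, min_fraction=0.5):
    """Keep sentence ids selected in at least `min_fraction` of the sampled runs."""
    if not samples:
        return []
    # Count each sentence id at most once per run, even if a run repeats it.
    counts = Counter(sid for run in samples for sid in set(run))
    needed = min_fraction * len(samples)
    return sorted(sid for sid, count in counts.items() if count >= needed)

# Five hypothetical sampled runs, each a list of selected sentence ids.
runs = [[1, 3, 4], [1, 4], [1, 2, 4], [4, 5], [1, 4, 5]]
selected = vote_evidence(runs, min_fraction=0.6)
print(selected)  # [1, 4]
```

Lowering `min_fraction` favors recall and raising it favors precision, which is the trade-off such a vote is meant to control.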
[128] Robustly Improving LLM Fairness in Realistic Settings via Interpretability cs.LG | cs.AI | cs.CLPDF
Adam Karvonen, Samuel Marks
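TL;DR: The paper proposes an internal bias mitigation method that reduces LLM bias in realistic settings by identifying and neutralizing sensitive-attribute directions in model activations, achieving robust bias reduction.
Details
Motivation: LLMs are increasingly deployed in high-stakes hiring applications, but simple anti-bias prompts fail in realistic settings, so more robust mitigation is needed.
Result: Across commercial and open-source models, the method reduces bias to typically under 1% while largely preserving model performance.
Insight: LLM bias in realistic settings is more complex than controlled evaluations suggest; it calls for internal intervention rather than simple prompting, along with more realistic evaluation methods.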
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people’s careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model’s chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
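The abstract names affine concept editing along attribute-correlated directions without giving details. A common form of such an edit, assumed here rather than taken from the paper, sets the component of an activation along a unit direction to a fixed target value and leaves the orthogonal part untouched. The function `affine_edit` and its `target` parameter are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def affine_edit(activation, direction, target=0.0):
    """Set the component of `activation` along `direction` to `target`."""
    norm = math.sqrt(dot(direction, direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Shift along the unit direction only; the orthogonal component is unchanged.
    shift = target - dot(activation, unit)
    return [x + shift * u for x, u in zip(activation, unit)]

# Toy 3-dimensional activation and a hypothetical attribute direction.
h = [2.0, 1.0, -1.0]
d = [3.0, 0.0, 4.0]
edited = affine_edit(h, d, target=0.0)
```

With `target=0.0` this is plain projection removal; a non-zero target would correspond to pinning the attribute component at a reference value, for example a mean over a neutral dataset.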
[129] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models cs.LG | cs.AI | cs.CLPDF
Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang
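TL;DR: GUARD is a guided unlearning and retention framework for large language models (LLMs) that uses data attribution to reduce unintended forgetting and improve retention of valuable information.
Details
Motivation: LLM unlearning is increasingly important for regulatory compliance, copyright protection, and privacy, but existing methods often damage model utility when forgetting high-impact data.
Result: On the TOFU benchmark, GUARD substantially improves utility on the Retain Set (reducing utility sacrifice by up to 194.92% in terms of Truth Ratio) while maintaining effective unlearning.
Insight: Data-level factors strongly affect LLM unlearning performance; GUARD offers an efficient way to balance forgetting and retention.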
Abstract: Unlearning in large language models (LLMs) is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the “alignment” between the forget and retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended losses in retention. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU benchmark across multiple LLM architectures demonstrate that GUARD substantially improves utility preservation while ensuring effective unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.
[130] Build the web for agents, not agents for the web cs.LG | cs.CLPDF
Xing Han Lù, Gaurav Kamath, Marius Mosbach, Siva Reddy
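TL;DR: This position paper argues for a paradigm shift: build web interfaces designed for agents (Agentic Web Interfaces, AWIs) rather than making agents adapt to interfaces designed for humans.
Details
Motivation: Current web-agent approaches struggle with the mismatch between human-designed interfaces and LLM capabilities, which makes processing complex web inputs inefficient.
Result: AWIs aim to make web agents more efficient, reliable, and transparent, laying the groundwork for future collaborative development.
Insight: Progress in web agents requires redesigning the interface, not simply adapting agents to existing human interfaces.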
Abstract: Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents – AI systems capable of autonomously navigating and completing tasks within web environments. While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions. This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities. To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website. We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders. This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community.
[131] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems cs.LG | cs.AI | cs.CVPDF
Aayush Karan, Kulin Shah, Sitan Chen
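TL;DR: The paper proposes ReGuidance, a simple but effective diffusion-model wrapper that improves sample quality and reward on hard inverse problems.
Details
Motivation: Existing methods such as DPS drift off the data manifold when the reward is not informative enough (for example, in hard inverse problems with low signal-to-noise ratio), producing unrealistic outputs. ReGuidance addresses this by running the probability flow ODE in reverse to obtain a better initialization for DPS.
Result: On hard tasks such as large box in-painting and super-resolution with high upscaling, ReGuidance significantly improves sample quality and measurement consistency over existing baselines.
Insight: Initializing from the reverse-ODE latent can, on certain multimodal data distributions, simultaneously boost the reward and bring the solution closer to the data manifold, which gives DPS theoretical support.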
Abstract: There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models. Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs. In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods. Given a candidate solution $\hat{x}$ produced by an algorithm of the user’s choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS. We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling. Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold. To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.
cs.GR [Back]
[132] Edit360: 2D Image Edits to 3D Assets from Any Angle cs.GR | cs.CVPDF
Junchao Huang, Xinting Hu, Zhuotao Tian, Shaoshuai Shi, Li Jiang
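TL;DR: Edit360 is a tuning-free framework that extends 2D image edits to multi-view consistent 3D editing; its Anchor-View Editing Propagation mechanism enables high-quality 3D asset reconstruction from edits made at any viewpoint.
Details
Motivation: Existing methods typically restrict editing to predetermined viewing angles, which limits flexibility and makes it hard to meet the multi-view consistency that practical applications require.
Result: The method enables reconstruction of high-quality 3D assets and supports customizable 3D content creation.
Insight: The challenge in extending 2D editing to 3D lies in multi-view consistency; Edit360 addresses it by aligning and merging multi-view information within diffusion models.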
Abstract: Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.
cs.MM [Back]
[133] Multimodal Large Language Models: A Survey cs.MM | cs.AI | cs.CLPDF
Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker
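TL;DR: This survey systematically reviews the development of multimodal large language models (MLLMs), covering models that extend beyond text generation to multiple sensory modalities.
Details
Motivation: With multimodal technology advancing rapidly, researchers need to integrate language with other sensory modalities to build more general-purpose and adaptive multimodal systems.
Result: The survey identifies open challenges in evaluation, modularity, and structured reasoning, and offers a unified perspective on future MLLM development.
Insight: Cross-modal synergy and technique transfer are central directions for MLLMs; future research should focus on improving generality, adaptability, and interpretability.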
Abstract: Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Architectural innovations like transformers and diffusion models underpin this convergence, enabling cross-modal transfer and modular specialization. We highlight emerging patterns of synergy, and identify open challenges in evaluation, modularity, and structured reasoning. This survey offers a unified perspective on MLLM development and identifies critical paths toward more general-purpose, adaptive, and interpretable multimodal systems.
[134] EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis cs.MM | cs.AI | cs.CV | cs.ROPDF
Jianwu Fang, Lei-Lei Li, Zhedong Zheng, Hongkai Yu, Jianru Xue
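TL;DR: The paper proposes an Attentive Video Diffusion (AVD) model that synthesizes accident video clips to address data bias in traffic accident anticipation (TAA), and pairs it with an equivariant triple loss (EQ-TAA) to improve performance.
Details
Motivation: Current TAA methods depend on large amounts of annotated data, yet the causal parts of accidents are hard to identify and easily dominated by data bias. The paper aims to address this by generating causal video clips.
Result: Experiments show that AVD and EQ-TAA achieve competitive performance compared with state-of-the-art methods.
Insight: Generating causal video clips helps mitigate data bias, and the equivariant loss design further improves robustness.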
Abstract: Traffic Accident Anticipation (TAA) in traffic scenes is a challenging problem for achieving zero fatalities in the future. Current approaches typically treat TAA as a supervised learning task needing the laborious annotation of accident occurrence duration. However, the inherent long-tailed, uncertain, and fast-evolving nature of traffic scenes has the problem that real causal parts of accidents are difficult to identify and are easily dominated by data bias, resulting in a background confounding issue. Thus, we propose an Attentive Video Diffusion (AVD) model that synthesizes additional accident video clips by generating the causal part in dashcam videos, i.e., from normal clips to accident clips. AVD aims to generate causal video frames based on accident or accident-free text prompts while preserving the style and content of frames for TAA after video generation. This approach can be trained using datasets collected from various driving scenes without any extra annotations. Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant triple loss for an anchor accident-free video clip, along with the generated pair of contrastive pseudo-normal and pseudo-accident clips. Extensive experiments have been conducted to evaluate the performance of AVD and EQ-TAA, and competitive performance compared to state-of-the-art methods has been obtained.
[135] HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction cs.MM | cs.AI | cs.CV | cs.LGPDF
Jie Qin, Wei Yang, Yan Su, Yiran Zhu, Weizhen Li
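TL;DR: An adaptive bimodal framework uses a dynamic branch selector, a bidirectional cross-modal GAN, and a hybrid training protocol to enable flexible single- or dual-modality HER2 prediction, substantially improving accuracy in both settings.
Details
Motivation: Current HER2 assessment models usually analyze H&E or IHC images in isolation, whereas clinical practice relies on interpreting them together; acquiring both modalities at once is often hindered by workflow complexity and cost.
Result: Single-modality H&E accuracy rises from 71.44% to 94.25%, dual-modality accuracy reaches 95.09%, and IHC-only input still achieves 90.28%.
Insight: By dynamically routing inputs and reconstructing missing modalities, the framework substantially mitigates the performance loss from missing data while preserving computational efficiency, making it suitable for resource-limited settings.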
Abstract: Current HER2 assessment models for breast cancer predominantly analyze H&E or IHC images in isolation, despite clinical reliance on their synergistic interpretation. However, concurrent acquisition of both modalities is often hindered by workflow complexity and cost constraints. We propose an adaptive bimodal framework enabling flexible single-/dual-modality HER2 prediction through three innovations: 1) A dynamic branch selector that activates either single-modality reconstruction or dual-modality joint inference based on input completeness; 2) A bidirectional cross-modal GAN performing context-aware feature-space reconstruction of missing modalities; 3) A hybrid training protocol integrating adversarial learning and multi-task optimization. This architecture elevates single-modality H&E prediction accuracy from 71.44% to 94.25% while achieving 95.09% dual-modality accuracy, maintaining 90.28% reliability with sole IHC inputs. The framework’s “dual-preferred, single-compatible” design delivers near-bimodal performance without requiring synchronized acquisition, particularly benefiting resource-limited settings through IHC infrastructure cost reduction. Experimental validation confirms 22.81%/12.90% accuracy improvements over H&E/IHC baselines respectively, with cross-modal reconstruction enhancing F1-scores to 0.9609 (HE to IHC) and 0.9251 (IHC to HE). By dynamically routing inputs through reconstruction-enhanced or native fusion pathways, the system mitigates performance degradation from missing data while preserving computational efficiency (78.55% parameter reduction in lightweight variant). This elastic architecture demonstrates significant potential for democratizing precise HER2 assessment across diverse healthcare settings.
[136] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space cs.MM | cs.AI | cs.CVPDF
Kangwei Liu, Junwu Liu, Xiaowei Yi, Jinlin Guo, Yun Cao
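TL;DR: The paper proposes a diffusion-based method for 3D facial animation that uses multimodal emotion binding and attention mechanisms to provide flexible emotion control and rich motion diversity.
Details
Motivation: Existing audio-driven emotional 3D facial animation methods have two main problems: they rely on single-modal control signals and do not exploit the complementary strengths of multiple modalities, and their deterministic regression-based mapping constrains the stochastic nature of emotional expressions and non-verbal behaviors.
Result: Experiments show a 21.6% improvement in emotion similarity over existing methods while preserving physiologically plausible facial dynamics.
Insight: Jointly using multimodal information and introducing a diffusion model markedly improves the emotional expressiveness and diversity of 3D facial animation.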
Abstract: Audio-driven emotional 3D facial animation encounters two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mapping that constrains the stochastic nature of emotional expressions and non-verbal behaviors, limiting the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics. Project Page: https://kangweiiliu.github.io/Control_3D_Animation.
[137] Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics cs.MM | cs.AI | cs.CVPDF
Yi-Chun Chen
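TL;DR: The paper proposes a hierarchical knowledge graph framework for understanding visual narratives in multimodal media such as comics. It decomposes narrative content into multiple levels and uses integrated knowledge graphs to capture semantic, spatial, and temporal relationships, supporting reasoning across diverse narrative tasks.
Details
Motivation: The goal is to understand the complex multimodal relationships in visual narratives such as comics, using structured representations to support reasoning tasks.
Result: Evaluated on the Manga109 dataset, the framework shows high precision and recall across diverse narrative tasks (action retrieval, dialogue tracing, character appearance mapping, and others).
Insight: Structured graph representations are effective for multimodal narrative analysis and provide a scalable foundation for interactive storytelling and multimodal reasoning.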
Abstract: This paper presents a hierarchical knowledge graph framework for the structured understanding of visual narratives, focusing on multimodal media such as comics. The proposed method decomposes narrative content into multiple levels, from macro-level story arcs to fine-grained event segments. It represents them through integrated knowledge graphs that capture semantic, spatial, and temporal relationships. At the panel level, we construct multimodal graphs that link visual elements such as characters, objects, and actions with corresponding textual components, including dialogue and captions. These graphs are integrated across narrative levels to support reasoning over story structure, character continuity, and event progression. We apply our approach to a manually annotated subset of the Manga109 dataset and demonstrate its ability to support symbolic reasoning across diverse narrative tasks, including action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction. Evaluation results show high precision and recall across tasks, validating the coherence and interpretability of the framework. This work contributes a scalable foundation for narrative-based content analysis, interactive storytelling, and multimodal reasoning in visual media.
[138] WDMIR: Wavelet-Driven Multimodal Intent Recognition cs.MM | cs.AI | cs.CV | eess.SPPDF
Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun
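TL;DR: WDMIR is a wavelet-driven multimodal intent recognition framework that uses frequency-domain analysis to better extract semantics from non-verbal information, improving performance.
Details
Motivation: Existing methods prioritize text analysis and overlook the rich semantic content of non-verbal cues; WDMIR aims to close this gap through frequency-domain analysis.
Result: WDMIR achieves state-of-the-art performance on MIntRec, improving accuracy by 1.13%; ablations attribute a 0.41% accuracy gain to the wavelet-driven fusion module.
Insight: Frequency-domain analysis such as the wavelet transform can capture the dynamic semantics of non-verbal information, and progressive cross-modal fusion is important for intent recognition.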
Abstract: Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. To be more specific, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% on accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
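The abstract does not say which wavelet the fusion module uses. As a minimal illustration of the kind of decomposition involved, the sketch below applies one level of the orthonormal Haar transform to a feature sequence along the time axis, splitting it into a low-frequency approximation and a high-frequency detail band. The Haar choice and the function names are assumptions for illustration only.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / s for a, b in pairs]   # slow-varying trend
    detail = [(a - b) / s for a, b in pairs]   # fast local changes
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_step, recovering the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

# A toy one-dimensional feature track over eight time steps.
track = [0.2, 0.4, 0.1, 0.9, 0.5, 0.5, 0.3, 0.7]
low, high = haar_step(track)
restored = haar_inverse(low, high)
```

Because the transform is invertible, features from two modalities can in principle be decomposed, combined band by band, and mapped back to the time domain without losing information.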
cs.SD [Back]
[139] PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs cs.SD | cs.AI | cs.CL | eess.ASPDF
Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson
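TL;DR: The paper systematically studies how information transfers from audio encoders to LLMs, and proposes and validates several architectural modifications that significantly improve audio-LLM performance.
Details
Motivation: Although application-focused audio-LLM work has progressed rapidly, the underlying mechanisms, especially how audio encoders efficiently pass rich semantic representations to the LLM, remain under-explored. The paper investigates and optimizes this interaction.
Result: The final architecture achieves relative improvements of 10% to 60% over the baseline, validating the approach to optimizing cross-modal information transfer.
Insight: The study identifies key factors in audio-LLM information transfer, including delayed audio integration and the complementary role of multiple encoders, which can inform future cross-modal model design.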
Abstract: The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities, e.g., audio reasoning, have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM’s ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation of how architectural design choices can affect that ability. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM’s initial layers have established textual context enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through the LLM layer’s attention submodule, without requiring propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM’s capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10% to 60% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/
cs.AI [Back]
[140] One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence cs.AI | cs.CLPDF
Michelle M. Li, Ben Y. Reis, Adam Rodman, Tianxi Cai, Noa Dagan
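TL;DR: This Perspective introduces context-switching for medical AI: models that dynamically adjust their behavior to different clinical settings, avoiding errors caused by fixed training.
Details
Motivation: Current medical AI models need fine-tuning or prompting to adapt to new settings, populations, or specialties, and struggle to respond dynamically to complex, changing clinical situations, which leads to contextual errors.
Result: The stated goal is AI that can diagnose, manage, and treat diseases across specialties and regions, expanding access to medical care.
Insight: Medical AI needs stronger contextual adaptability to overcome the limits of fixed training and serve diverse clinical needs.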
Abstract: Medical foundation models, including language models trained on clinical notes, vision-language models on medical images, and multimodal models on electronic health records, can summarize clinical notes, answer medical questions, and assist in decision-making. Adapting these models to new populations, specialties, or settings typically requires fine-tuning, careful prompting, or retrieval from knowledge bases. This can be impractical, and limits their ability to interpret unfamiliar inputs and adjust to clinical situations not represented during training. As a result, models are prone to contextual errors, where predictions appear reasonable but fail to account for critical patient-specific or contextual information. These errors stem from a fundamental limitation that current models struggle with: dynamically adjusting their behavior across evolving contexts of medical care. In this Perspective, we outline a vision for context-switching in medical AI: models that dynamically adapt their reasoning without retraining to new specialties, populations, workflows, and clinical roles. We envision context-switching AI to diagnose, manage, and treat a wide range of diseases across specialties and regions, and expand access to medical care.
[141] Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning cs.AI | cs.CLPDF
Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li
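TL;DR: This paper presents the Scientists’ First Exam (SFE) benchmark, which evaluates the perception, understanding, and reasoning abilities of multimodal large language models (MLLMs) in scientific domains, filling a gap in existing evaluations.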
Details
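Motivation: Scientific discovery relies on complex multimodal reasoning, but current benchmarks for MLLMs focus mainly on knowledge understanding and do not adequately assess perception and reasoning.
Result: GPT-o3 and InternVL-3 score only 34.08% and 26.52% on SFE, respectively, showing that MLLMs still have substantial room for improvement in scientific domains.
Insight: MLLMs for science need stronger perception and reasoning abilities; SFE offers a reference point for future research on AI-supported scientific discovery.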
Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
[142] TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving cs.AI | cs.CLPDF
Vincenzo Colle, Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed
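TL;DR: This paper introduces TeleMath, a benchmark dataset for evaluating how well large language models (LLMs) solve mathematical problems in the telecommunications domain, covering topics such as signal processing and network optimization. The evaluation finds that models explicitly designed for mathematical or logical reasoning perform best, while general-purpose models often struggle even when they have many parameters.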
Details
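Motivation: Demand for mathematically intensive tasks in telecommunications is growing, but the mathematical reasoning ability of existing LLMs in specialized domains remains largely unexplored. The authors aim to fill this gap with TeleMath.
Result: Models explicitly designed for mathematical or logical reasoning achieve the best performance on TeleMath, while general-purpose models perform poorly.
Insight: Mathematical problem solving in specialized domains such as telecommunications calls for purpose-built LLMs rather than general-purpose models that simply have more parameters.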
Abstract: The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnAs generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease result reproducibility and support future research.
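Because TeleMath targets problems with numerical solutions, scoring a model comes down to comparing a number in its output against a reference value. The sketch below illustrates one way this could be done in Python. It is not the released evaluation code: the function names, the "take the last number in the output" rule, and the 1% relative tolerance are all assumptions made for illustration.

```python
import math
import re

# Matches integers, decimals, and scientific notation such as 3.2e-4.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def last_number(text):
    """Return the last number found in a model's free-text output, or None."""
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else None


def is_correct(model_output, reference, rel_tol=1e-2):
    """Hypothetical scoring rule: correct if the last number in the output
    is within a relative tolerance of the reference answer."""
    value = last_number(model_output)
    return value is not None and math.isclose(value, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    print(is_correct("The SNR is 12.04 dB, so the capacity is 3.99", 4.0))
    print(is_correct("I cannot solve this.", 4.0))
```

A relative tolerance is used here so that answers of very different magnitudes (for example bit error rates versus bandwidths in Hz) are judged on the same footing; a real benchmark may also need to handle units, which this sketch ignores.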
[143] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? cs.AI | cs.CLPDF
Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang
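TL;DR: This paper introduces ToxiMol, the first benchmark task focused on molecular toxicity repair, together with a standardized dataset and an automated evaluation framework, ToxiEval. Experiments show that current multimodal large language models (MLLMs) still face challenges on this task but show promise in toxicity understanding, semantic constraint adherence, and structure-aware editing.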
Details
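Motivation: Toxicity is a leading cause of early-stage drug development failure, yet molecular toxicity repair has lacked a systematic definition and a benchmark task.
Result: An evaluation of nearly 30 mainstream MLLMs shows that toxicity repair remains challenging for them, although they begin to show promise in some respects.
Insight: Applying MLLMs to molecular toxicity repair needs further research, but their abilities in toxicity understanding and structure editing point to a direction for future work.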
Abstract: Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.
cs.RO [Back]
[144] A Navigation Framework Utilizing Vision-Language Models cs.RO | cs.AI | cs.CVPDF
Yicheng Duan, Kaiyu Tang
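TL;DR: The paper proposes a modular navigation framework built on a vision-language model (VLM). By decoupling vision-language understanding from action planning, it aims for fast, adaptable navigation without extensive fine-tuning.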
Details
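Motivation: Large vision-language models such as CLIP and Flamingo excel at multimodal understanding but pose challenges in real-time deployment and computational cost. The paper aims to address these issues while improving the flexibility and efficiency of navigation.
Result: The system is evaluated on the Room-to-Room benchmark in the VLN-CE setting using the Matterport3D dataset. Initial results reveal difficulty generalizing to unseen environments, but the modular design provides a foundation for future improvement.
Insight: Modular design is a practical way to address computational cost and real-time deployment; performance could be improved further through stronger environmental priors and expanded multimodal inputs.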
Abstract: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.
[145] EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence cs.RO | cs.CVPDF
Wang Xinjie, Liu Liu, Cao Yu, Wu Ruiqi, Qin Wenkang
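TL;DR: EmbodiedGen is a generative 3D world engine platform for embodied intelligence. It aims to produce high-quality, controllable, and realistic 3D assets at low cost, addressing the high cost and limited realism of manually created 3D data assets.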
Details
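Motivation: Embodied intelligence tasks currently depend on manually created 3D graphics assets, which are costly and of limited realism, restricting the scalability of data-driven methods. EmbodiedGen aims to solve this with generative AI.
Result: The generated 3D assets are high-quality, controllable, and realistic, and can be used directly in physics simulation engines to support training and evaluation of downstream tasks.
Insight: Generative AI can substantially reduce the cost of 3D data assets and increase their diversity, improving the scalability and generality of embodied intelligence research.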
Abstract: Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low-cost accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets that are manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable, and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robot Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation, and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the challenges of generalization and evaluation in embodied intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
[146] Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop cs.RO | cs.CVPDF
Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi
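TL;DR: This paper presents EyeRobot, a system that uses a BC-RL loop combining behavior cloning (BC) and reinforcement learning (RL) to train a robot's eyeball gaze behavior for real-world tasks, achieving hand-eye coordination.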
Details
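Motivation: Inspired by the way humans actively look in order to act, the paper aims to design a system whose gaze behavior helps the robot complete tasks.
Result: On five panoramic workspace tasks, EyeRobot exhibits effective hand-eye coordination and completes manipulation over a large workspace.
Insight: Active gaze behavior can improve robot performance on complex tasks, especially in settings that require manipulation over a large workspace.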
Abstract: Humans do not passively observe the visual world – we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by first collecting teleoperated demonstrations paired with a 360 camera. This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of robot demonstrations. We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. EyeRobot implements a foveal-inspired policy architecture allowing high resolution with a small compute budget, which we find also leads to the emergence of more stable fixation as well as improved ability to track objects and ignore distractors. We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring manipulation in an arc surrounding the robot arm. Our experiments suggest EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate manipulation over large workspaces with a single camera. See project site for videos: https://www.eyerobot.net/
eess.SY [Back]
[147] Energy Aware Camera Location Search Algorithm for Increasing Precision of Observation in Automated Manufacturing eess.SY | cs.CV | cs.SY | 93C85 (Primary), 93B52 (Secondary)PDF
Rongfei Li, Francis Assadian
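TL;DR: This paper proposes a camera location search algorithm for visual servoing in automated manufacturing. By optimizing the camera's movement policy and learning features of the environment, it improves observation precision while respecting an energy constraint.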
Details
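Motivation: In automated manufacturing, camera location is critical to visual servoing precision. The paper addresses how the choice of location affects image noise level and observation precision, and optimizes the camera's movement policy to reduce energy consumption.
Result: Simulation results in an automated manufacturing environment show that the algorithm improves observation precision and, when energy is limited, settles on a suboptimal location.
Insight: The paper shows how camera location affects visual servoing precision and, through an adaptive search policy and a learning mechanism, offers an efficient approach to camera placement in automated manufacturing.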
Abstract: Visual servoing technology has been well developed and applied in many automated manufacturing tasks, especially in tool pose alignment. To obtain a full global view of the tools, most applications adopt an eye-to-hand configuration or a cooperative eye-to-hand/eye-in-hand configuration in an automated manufacturing environment. Most research has focused on developing control and observation architectures for various scenarios, but few papers have discussed the importance of the camera’s location in the eye-to-hand configuration. In a manufacturing environment, the quality of camera estimations may vary significantly from one observation location to another, because the combined effects of environmental conditions produce different noise levels in images shot at different locations. In this paper, we propose an algorithm for the camera’s moving policy so that it explores the camera workspace and searches for the optimal location, where the image noise level is minimized. The algorithm also ensures that, with limited energy available for moving the camera, the camera ends up at a suboptimal location among those already searched if the optimal one is unreachable. Unlike a simple brute-force approach, the algorithm enables the camera to explore the space more efficiently by adapting the search policy as it learns the environment. With the aid of an image averaging technique, this algorithm, using a single camera, achieves the desired observation accuracy in eye-to-hand configurations without filtering out high-frequency information in the original image. An automated manufacturing application has been simulated, and the results show that the algorithm improves observation precision with limited energy.
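The image averaging idea mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm: it assumes a static scene and independent zero-mean Gaussian noise on every pixel, and all names (`noisy_shot`, `average_shots`, `rms_error`) are hypothetical. Under those assumptions, averaging N shots taken from the same location lowers the noise level by roughly a factor of sqrt(N), and because the averaging is across shots rather than across neighboring pixels, it does not blur high-frequency detail within the image.

```python
import random


def noisy_shot(true_pixels, sigma, rng):
    """Simulate one image: the true pixel values plus Gaussian sensor noise."""
    return [p + rng.gauss(0.0, sigma) for p in true_pixels]


def average_shots(shots):
    """Average several shots pixel by pixel (temporal, not spatial, averaging)."""
    n = len(shots)
    return [sum(pixel_values) / n for pixel_values in zip(*shots)]


def rms_error(estimate, truth):
    """Root-mean-square error between an estimated image and the true image."""
    total = sum((e - t) ** 2 for e, t in zip(estimate, truth))
    return (total / len(truth)) ** 0.5


rng = random.Random(0)                            # fixed seed for repeatability
truth = [float(i % 256) for i in range(2000)]     # synthetic "image" with sharp jumps
sigma = 10.0                                      # assumed per-pixel noise level

single = noisy_shot(truth, sigma, rng)
averaged = average_shots([noisy_shot(truth, sigma, rng) for _ in range(16)])

err_single = rms_error(single, truth)
err_avg = rms_error(averaged, truth)

if __name__ == "__main__":
    print(f"RMS error, single shot:      {err_single:.2f}")
    print(f"RMS error, 16-shot average:  {err_avg:.2f}")
```

In a search such as the one the abstract describes, an error estimate of this kind could serve as the noise level measured at each candidate location, although how the paper actually measures noise is not stated in the abstract.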
[148] Semi-Tensor-Product Based Convolutional Neural Networks eess.SY | cs.AI | cs.CV | cs.SYPDF
Daizhan Cheng
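TL;DR: This paper proposes a new convolutional product (CP) based on the semi-tensor product (STP). By combining a domain-based CP with the STP of vectors, it avoids the junk information introduced by padding in conventional convolution, and it builds an STP-based CNN that is applied to image and third-order signal identification.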
Details
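Motivation: Padding in conventional convolution (such as zero padding) can introduce junk information and hurt model performance. This work uses the generalizing property of the semi-tensor product to design a convolution that needs no padding.
Result: The STP-based CNN is applied to image and third-order signal identification, and it avoids the junk information caused by conventional padding; the abstract reports no quantitative results.
Insight: The STP's ability to handle vectors of different dimensions offers a new approach to designing convolution; a padding-free operation simplifies the model and avoids contaminating the input with junk information.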
Abstract: The semi-tensor product (STP) of vectors is a generalization of the conventional inner product of vectors, which allows the factor vectors to be of different dimensions. This paper proposes a domain-based convolutional product (CP). Combining the domain-based CP with the STP of vectors, a new CP is proposed. Since there is no zero or any other padding, it can avoid the junk information caused by padding. Using it, the STP-based convolutional neural network (CNN) is developed. Its application to image and third-order signal identification is considered.
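To make the idea of an inner product between vectors of different dimensions concrete, the sketch below implements one cross-dimensional product used in the STP literature: each vector is expanded to the least common multiple of the two dimensions by repeating its entries (a Kronecker product with an all-ones vector), and the expanded vectors are multiplied in the usual way. The abstract does not give the paper's exact definition, so this form, the function name `stp_inner`, and the absence of a normalization factor are assumptions; some formulations divide the result by the common dimension.

```python
from math import gcd


def stp_inner(x, y):
    """Cross-dimensional inner product of x (length m) and y (length n).

    Assumed form: <x (kron) 1_{t/m}, y (kron) 1_{t/n}> with t = lcm(m, n).
    When m == n this is the ordinary inner product.
    """
    m, n = len(x), len(y)
    t = m * n // gcd(m, n)  # least common multiple of the two dimensions
    # The Kronecker product with an all-ones vector repeats each entry in place.
    x_expanded = [v for v in x for _ in range(t // m)]
    y_expanded = [v for v in y for _ in range(t // n)]
    return sum(a * b for a, b in zip(x_expanded, y_expanded))


if __name__ == "__main__":
    print(stp_inner([1, 2, 3], [4, 5, 6]))  # equal dimensions: ordinary inner product
    print(stp_inner([1, 2], [1, 2, 3, 4]))  # dimensions 2 and 4, expanded to length 4
```

Because neither argument has to be extended with zeros to match the other, a product of this kind suggests how a kernel could be applied to a signal of a different size without padding, which is the property the abstract emphasizes.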