Table of Contents
- cs.CV [Total: 80]
- cs.CL [Total: 17]
- cs.GR [Total: 1]
- cs.RO [Total: 8]
- eess.IV [Total: 3]
- stat.ML [Total: 1]
- cs.LG [Total: 3]
- quant-ph [Total: 1]
- cs.HC [Total: 1]
- cs.AI [Total: 2]
- cs.DC [Total: 2]
- cs.SD [Total: 1]
cs.CV
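[1] What Happens When: Learning Temporal Orders of Events in Videos cs.CV | cs.AI
Daechul Ahn, Yura Choi, Hyeonbeom Choi, Seongwon Cho, San Kim
TL;DR: The paper shows that current video large multimodal models (VLMMs) fall short in understanding the temporal order of events, and proposes a new benchmark, VECTOR, and a new method, MECOT, that substantially improves temporal understanding.
Details
Motivation: Existing VLMMs perform well on video understanding tasks, but their ability to capture the temporal order of events is underexplored. Experiments show that models still perform well even when video frames are shuffled, suggesting they may rely on prior knowledge rather than temporal processing.
Result: MECOT clearly outperforms existing methods on the VECTOR benchmark and also improves performance on other video tasks, supporting its effectiveness.
Insight: VLMMs may neglect temporal information, so dedicated methods are needed to strengthen temporal understanding; MECOT offers a new direction for future multimodal models.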
Abstract: Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not necessarily rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model’s ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, indicating the effectiveness of temporal understanding. We release our code, model and datasets.
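One simple way to score temporal-order understanding, in the spirit of what VECTOR is described as assessing, is pairwise order agreement between a predicted event sequence and the reference sequence. The sketch below is illustrative only: the function name, the event strings, and the metric itself are assumptions and are not taken from the benchmark's actual protocol.

```python
from itertools import combinations


def pairwise_order_accuracy(predicted, reference):
    """Fraction of event pairs whose relative order in `predicted` matches `reference`.

    Hypothetical metric for illustration; both lists must contain the same events.
    """
    position = {event: index for index, event in enumerate(predicted)}
    pairs = list(combinations(reference, 2))  # (a, b) with a before b in the reference
    if not pairs:
        return 1.0
    agreeing = sum(1 for a, b in pairs if position[a] < position[b])
    return agreeing / len(pairs)


# Made-up example: the model swaps the second and third events.
reference = ["open door", "enter room", "sit down", "pick up book"]
predicted = ["open door", "sit down", "enter room", "pick up book"]
score = pairwise_order_accuracy(predicted, reference)
print(f"pairwise order accuracy: {score:.3f}")
```

A model that answers from prior knowledge alone would be expected to score about the same on original and frame-shuffled videos under a metric like this, which is the kind of gap the abstract describes.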
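[2] Training Multi-Image Vision Agents via End2End Reinforcement Learning cs.CV | cs.AI
Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang
TL;DR: The paper proposes IMAgent, an open-source vision agent trained with end-to-end reinforcement learning for complex multi-image tasks. It uses a multi-agent system to generate high-quality multi-image QA pairs and develops dedicated tools that improve the allocation of visual attention, achieving stable tool-use behavior.
Details
Motivation: Existing open-source methods usually limit input to a single image and cannot meet the needs of real-world multi-image QA tasks, so a more efficient multi-image vision agent is needed.
Result: IMAgent maintains strong performance on single-image benchmarks while improving substantially on the multi-image dataset, demonstrating the effectiveness of the method.
Insight: Dedicated tools and a reinforcement learning strategy can effectively address the tendency of vision models to ignore visual inputs in multi-image tasks.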
Abstract: Recent VLM-based agents aim to replicate OpenAI O3’s “thinking with images” via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning and dedicated to complex multi-image tasks. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. With deeper reasoning steps, VLMs may increasingly ignore visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our well-designed action-trajectory two-level mask strategy, IMAgent achieves stable tool-use behavior via pure RL training without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, with our analysis providing actionable insights for the research community. Code and data will be released soon.
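[3] Mitigating Bias with Words: Inducing Demographic Ambiguity in Face Recognition Templates by Text Encoding cs.CV | cs.AI
Tahar Chettaoui, Naser Damer, Fadi Boutros
TL;DR: The paper proposes a new method, UTIE, which uses text encoding to induce gender and ethnicity ambiguity in face recognition templates, mitigating bias in face recognition systems.
Details
Motivation: Face recognition systems are often biased because identity features are entangled with demographic features in the embedding space, which can lead to differences in verification performance across demographic groups.
Result: Experiments show that UTIE effectively reduces bias metrics while maintaining or improving face verification accuracy.
Insight: Introducing demographic ambiguity through text encoding can effectively mitigate bias in the embedding space, offering a new approach to fair face recognition.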
Abstract: Face recognition (FR) systems are often prone to demographic biases, partially due to the entanglement of demographic-specific information with identity-relevant features in facial embeddings. This bias is extremely critical in large multicultural cities, especially where biometrics play a major role in smart city infrastructure. The entanglement can cause demographic attributes to overshadow identity cues in the embedding space, resulting in disparities in verification performance across different demographic groups. To address this issue, we propose a novel strategy, Unified Text-Image Embedding (UTIE), which aims to induce demographic ambiguity in face embeddings by enriching them with information related to other demographic groups. This encourages face embeddings to emphasize identity-relevant features and thus promotes fairer verification performance across groups. UTIE leverages the zero-shot capabilities and cross-modal semantic alignment of Vision-Language Models (VLMs). Given that VLMs are naturally trained to align visual and textual representations, we enrich the facial embeddings of each demographic group with text-derived demographic features extracted from other demographic groups. This encourages a more neutral representation in terms of demographic attributes. We evaluate UTIE using three VLMs, CLIP, OpenCLIP, and SigLIP, on two widely used benchmarks, RFW and BFW, designed to assess bias in FR. Experimental results show that UTIE consistently reduces bias metrics while maintaining, or even improving in several cases, the face verification accuracy.
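[4] Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement cs.CV | cs.AI
Jian Xu, Wei Chen, Shigui Li, Delu Zeng, John Paisley
TL;DR: The paper proposes Consist-Retinex, a consistency-model-based low-light image enhancement method that achieves high-performance Retinex enhancement with single-step sampling through a dual-objective consistency loss and an adaptive noise-emphasized sampling strategy.
Details
Motivation: Existing diffusion models perform well in Retinex-decomposition-based low-light image enhancement but require hundreds of iterative sampling steps, which limits practical use. Consistency models have shown potential for one-step generation in unconditional synthesis, but their application to conditional enhancement has not been explored.
Result: On the VE-LOL-L dataset, Consist-Retinex achieves SOTA performance with single-step sampling (PSNR 25.51, FID 44.73), with a training budget only 1/8 that of the baseline.
Insight: Training for conditional enhancement needs to focus on large-noise regions rather than the low-noise regions emphasized in unconditional generation; this is the key to efficient one-step conditional generation.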
Abstract: Diffusion models have achieved remarkable success in low-light image enhancement through Retinex-based decomposition, yet their requirement for hundreds of iterative sampling steps severely limits practical deployment. While recent consistency models offer promising one-step generation for unconditional synthesis, their application to conditional enhancement remains unexplored. We present Consist-Retinex, the first framework adapting consistency modeling to Retinex-based low-light enhancement. Our key insight is that conditional enhancement requires fundamentally different training dynamics than unconditional generation: standard consistency training focuses on low-noise regions near the data manifold, while conditional mapping critically depends on large-noise regimes that bridge degraded inputs to enhanced outputs. We introduce two core innovations: (1) a dual-objective consistency loss combining temporal consistency with ground-truth alignment under randomized time sampling, providing full-spectrum supervision for stable convergence; and (2) an adaptive noise-emphasized sampling strategy that prioritizes training on large-noise regions essential for one-step conditional generation. On VE-LOL-L, Consist-Retinex achieves state-of-the-art performance with single-step sampling (PSNR: 25.51 vs. 23.41, FID: 44.73 vs. 49.59 compared to Diff-Retinex++), while requiring only 1/8 of the training budget relative to the 1000-step Diff-Retinex baseline.
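[5] An Efficient Test-Time Scaling Approach for Image Generation cs.CV
Vignesh Sundaresha, Akash Haridas, Vikram Appia, Lav Varshney
TL;DR: This paper proposes Verifier-Threshold, an efficient method for allocating test-time compute that substantially improves the efficiency of image generation models, reducing computational time by 2-4x at the same performance.
Details
Motivation: Current image generation models allocate test-time compute inefficiently; in particular, some greedy-algorithm-based methods fail to use compute resources effectively. This paper aims to address that problem.
Result: On the GenEval benchmark, computational time is reduced by 2-4x at the same performance.
Insight: Efficient test-time compute allocation is critical to improving image generation models, and dynamic adjustment strategies outperform static greedy algorithms.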
Abstract: Image generation has emerged as a mainstream application of large generative AI models. Just as test-time compute and reasoning have helped language models improve their capabilities, similar benefits have also been observed with image generation models. In particular, searching over noise samples for diffusion and flow models has been shown to scale well with test-time compute. While recent works have explored allocating non-uniform inference-compute budgets across different denoising steps, they rely on greedy algorithms and allocate the compute budget ineffectively. In this work, we study this problem and propose solutions to fix it. We propose the Verifier-Threshold method, which automatically reallocates test-time compute and delivers substantial efficiency improvements. For the same performance on the GenEval benchmark, we achieve a 2-4x reduction in computational time over the state-of-the-art method.
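[6] Explainable Fundus Image Curation and Lesion Detection in Diabetic Retinopathy cs.CV | cs.AI
Anca Mihai, Adrian Groza
TL;DR: The paper proposes an explainable framework for fundus image curation and lesion detection in diabetic retinopathy (DR), which improves data quality and the reliability of AI models through feature-based classification, image enhancement, and verification of annotation agreement.
Details
Motivation: Early diagnosis of diabetic retinopathy is critical, but fundus image quality and annotation consistency affect AI model performance. To address this, the authors propose a quality-control framework.
Result: The method improves the quality and consistency of data annotations, providing a more reliable fundus image dataset for AI training.
Insight: Combining explainability with deep learning can effectively address data-quality and annotation-consistency problems in medical imaging, thereby improving AI model performance.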
Abstract: Diabetic Retinopathy (DR) affects individuals with long-term diabetes. Without early diagnosis, DR can lead to vision loss. Fundus photography captures the structure of the retina along with abnormalities indicative of the stage of the disease. Artificial Intelligence (AI) can support clinicians in identifying these lesions, reducing manual workload, but models require high-quality annotated datasets. Due to the complexity of retinal structures, errors can occur in image acquisition and in lesion interpretation by manual annotators. We propose a quality-control framework that ensures only high-standard data is used for evaluation and AI training. First, an explainable feature-based classifier is used to filter inadequate images. The features are extracted using both image processing and contrastive learning. Then, the images are enhanced and submitted for annotation with deep-learning-based assistance. Lastly, the agreement between annotators, calculated using derived formulas, determines the usability of the annotations.
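[7] Deterministic World Models for Verification of Closed-loop Vision-based Systems cs.CV | cs.LG
Yuang Geng, Zhuoyang Zhou, Zhongzheng Zhang, Siyuan Pan, Hoang-Dung Tran
TL;DR: The paper proposes a Deterministic World Model (DWM) that maps system states directly to generated images, avoiding the overapproximation error introduced by stochastic latent variables, and trains the model with a dual-objective loss. Combined with StarV reachability analysis and conformal prediction, the method substantially improves verification of closed-loop vision-based systems.
Details
Motivation: The challenge of verifying closed-loop vision-based control systems stems mainly from the high dimensionality of images and the difficulty of modeling visual environments. Existing generative models rely on stochastic latent variables, which introduce unnecessary and hard-to-interpret overapproximation error.
Result: On standard benchmarks, DWM yields tighter reachable sets and verification performance clearly better than a latent-variable baseline.
Insight: Deterministic models outperform stochastic generative models in verification tasks, and combining a dual-objective loss with statistical verification offers a new direction for verifying vision systems.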
Abstract: Verifying closed-loop vision-based control systems remains a fundamental challenge due to the high dimensionality of images and the difficulty of modeling visual environments. While generative models are increasingly used as camera surrogates in verification, their reliance on stochastic latent variables introduces unnecessary overapproximation error. To address this bottleneck, we propose a Deterministic World Model (DWM) that maps system states directly to generative images, effectively eliminating uninterpretable latent variables to ensure precise input bounds. The DWM is trained with a dual-objective loss function that combines pixel-level reconstruction accuracy with a control difference loss to maintain behavioral consistency with the real system. We integrate DWM into a verification pipeline utilizing Star-based reachability analysis (StarV) and employ conformal prediction to derive rigorous statistical bounds on the trajectory deviation between the world model and the actual vision-based system. Experiments on standard benchmarks show that our approach yields significantly tighter reachable sets and better verification performance than a latent-variable baseline.
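The statistical bound mentioned in the abstract can be illustrated with standard split conformal prediction: given calibration deviations between the surrogate and the real system, take a finite-sample-corrected quantile as the bound. This is a minimal sketch of the general technique, not the authors' pipeline; the function name and the deviation values are hypothetical.

```python
import math


def conformal_bound(calibration_scores, alpha):
    """Split conformal bound: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Under exchangeability, a new score falls at or below this bound with
    probability at least 1 - alpha. Returns infinity when n is too small
    for the requested coverage level.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


# Made-up trajectory deviations between a world model and the real system.
deviations = [0.02 * i for i in range(1, 20)]
bound = conformal_bound(deviations, alpha=0.1)
print(f"deviation bound at 90% coverage: {bound:.2f}")
```

In a verification setting, a bound of this kind would be used to inflate the reachable set computed on the surrogate so that it covers the real system with the stated probability.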
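[8] A Survey of Body and Face Motion: Datasets, Performance Evaluation Metrics and Generative Techniques cs.CV | cs.HC
Lownish Rai Sookha, Nikhil Pakhale, Mudasir Ganaie, Abhinav Dhall
TL;DR: This survey of body and face motion generation covers core concepts, representation techniques, generative approaches, datasets, and evaluation metrics, and proposes future directions for improving the realism and expressiveness of virtual characters.
Details
Motivation: Body and face motion convey important information in communication, but generating expressive and coherent motion remains challenging. The paper aims to summarize existing techniques and point out directions for future research.
Result: It reviews the strengths and weaknesses of current techniques, notes that the realism and coherence of generated motion still need improvement, and proposes future research directions.
Insight: Motion generation must account for both verbal and non-verbal cues; future work should emphasize personalization and multimodal fusion to improve the expressiveness of virtual characters.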
Abstract: Body and face motion play an integral role in communication. They convey crucial information about the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal / non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representation techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on https://lownish23csz0010.github.io/mogen/.
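[9] Towards Lossless Ultimate Vision Token Compression for VLMs cs.CV | cs.AI
Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen
TL;DR: The paper proposes the LUVC framework, which uses iterative merging and a spectrum pruning unit to address token redundancy for high-resolution images and videos in vision-language models, substantially improving computational efficiency.
Details
Motivation: Token representations of high-resolution images and videos are highly redundant, causing low computational efficiency and latency. Existing compression methods suffer from position bias and class imbalance, and fail to generalize to shallow LLM layers.
Result: LUVC achieves a 2x inference speedup in the language model with negligible accuracy loss. Being training-free, it can be deployed quickly across multiple VLMs.
Insight: Progressively compressing tokens in both the visual encoder and the LLM efficiently fuses high-dimensional visual features into the multimodal queries, substantially improving computational performance.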
Abstract: Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based compression algorithms suffer from either position bias or class imbalance, leading to significant accuracy degradation. They also fail to generalize to shallow LLM layers, which exhibit weaker cross-modal interactions. To address this, we extend token compression to the visual encoder through an effective iterative merging scheme that is orthogonal in spatial axes to accelerate the computation across the entire VLM. Furthermore, we integrate a spectrum pruning unit into the LLM through an attention/similarity-free low-pass filter, which gradually prunes redundant visual tokens and is fully compatible with modern FlashAttention. On this basis, we propose the Lossless Ultimate Vision tokens Compression (LUVC) framework. LUVC systematically compresses visual tokens until complete elimination at the final layer of the LLM, so that the high-dimensional visual features are gradually fused into the multimodal queries. The experiments show that LUVC achieves a 2x inference speedup in the language model with negligible accuracy degradation, and the training-free characteristic enables immediate deployment across multiple VLMs.
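[10] An Approach for Detection of Entities in Dynamic Media Contents cs.CV
Nzakiese Mbongo, Ngombo Armando
TL;DR: This paper proposes a deep-learning-based method for detecting specific target entities in video sequences, in particular identifying specific people in dynamic media content. Using supervised learning algorithms, the method achieves efficient localization from simple characteristics of the target.
Details
Motivation: Research on detecting specific entities in video faces the challenge of complex objects, especially when identifying specific people in dynamic media content. This paper aims to address the problem with deep learning, with a focus on detection efficiency and accuracy.
Result: Experimental results show that the method can efficiently locate target individuals from public or private image bases and has practical value for national security systems.
Insight: The study suggests that even simple target characteristics can support efficient entity detection with supervised deep learning, offering a new approach to recognition in complex dynamic media content.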
Abstract: The notion of learning underlies almost every evolution of Intelligent Agents. In this paper, we present an approach for searching and detecting a given entity in a video sequence. Specifically, we study how the deep learning technique by artificial neural networks allows us to detect a character in a video sequence. The technique of detecting a character in a video is a complex field of study, considering the multitude of objects present in the data under analysis. From the results obtained, we highlight the following, compared to the state of the art: In our approach, within the field of Computer Vision, the structuring of supervised learning algorithms allowed us to achieve several successes from simple characteristics of the target character. Our results demonstrate that this new approach allows us to locate, in an efficient way, wanted individuals from a private or public image base. For the case of Angola, the classifier we propose opens the possibility of reinforcing the national security system based on the database of target individuals (disappeared, criminals, etc.) and the video sequences of the Integrated Public Security Centre (CISP).
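[11] Learning to Remove Lens Flare in Event Camera cs.CV
Haiqian Han, Lingdong Kong, Jianing Li, Ao Liang, Chengtao Zhu
TL;DR: The paper proposes the E-Deflare framework, the first systematic treatment of lens flare in event camera data, and validates the strong performance of E-DeflareNet using a physics-grounded forward model and a new benchmark.
Details
Motivation: Event cameras offer high temporal resolution and dynamic range, but their data are susceptible to lens flare, which forms complex spatio-temporal distortions; prior work has largely overlooked this problem.
Result: Experiments show that E-DeflareNet performs well in both restoration and downstream tasks; the benchmark and code are open-sourced.
Insight: Physical modeling of lens flare is key to handling complex distortions, and combining simulated and real data effectively improves model generalization.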
Abstract: Event cameras have the potential to revolutionize vision systems with their high temporal resolution and dynamic range, yet they remain susceptible to lens flare, a fundamental optical artifact that causes severe degradation. In event streams, this optical artifact forms a complex, spatio-temporal distortion that has been largely overlooked. We present E-Deflare, the first systematic framework for removing lens flare from event camera data. We first establish the theoretical foundation by deriving a physics-grounded forward model of the non-linear suppression mechanism. This insight enables the creation of the E-Deflare Benchmark, a comprehensive resource featuring a large-scale simulated training set, E-Flare-2.7K, and the first-ever paired real-world test set, E-Flare-R, captured by our novel optical system. Empowered by this benchmark, we design E-DeflareNet, which achieves state-of-the-art restoration performance. Extensive experiments validate our approach and demonstrate clear benefits for downstream tasks. Code and datasets are publicly available.
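[12] ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors cs.CV
Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel
TL;DR: ConceptPose is a training-free framework that uses a vision-language model to build open-vocabulary 3D concept maps for zero-shot object pose estimation, outperforming existing methods.
Details
Motivation: Existing object pose estimation methods usually require extensive dataset-specific training, while large-scale vision-language models show strong zero-shot capabilities. This work aims to combine the two and develop a training-free zero-shot pose estimation method.
Result: On zero-shot relative pose estimation benchmarks, ConceptPose substantially outperforms existing methods, improving the ADD(-S) score by over 62%.
Insight: Vision-language models have great potential for zero-shot tasks, and concept vectors with 3D maps can efficiently solve problems that traditionally require training.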
Abstract: Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.
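[13] SIP: Site in Pieces- A Dataset of Disaggregated Construction-Phase 3D Scans for Semantic Segmentation and Scene Understanding cs.CV | cs.LG
Seongyong Kim, Yong Kwon Cho
TL;DR: SIP is a LiDAR dataset of construction sites for semantic segmentation and scene understanding. It captures the fragmented geometry and sparse scanning conditions characteristic of construction sites, filling a gap in existing datasets.
Details
Motivation: Existing 3D perception datasets mostly consist of densely scanned, complete scenes and do not reflect the practical constraints of LiDAR acquisition on construction sites (such as sparse scans, view dependence, and occlusion), so more representative data are needed.
Result: The SIP dataset supports 3D vision tasks specific to construction sites (such as segmentation under occlusion and fragmented geometry) and provides a basis for robust benchmarking in real-world scenes.
Insight: The characteristics of construction-site LiDAR data (such as sparsity and occlusion) pose new challenges for segmentation; SIP fills this gap and advances construction-oriented 3D vision research.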
Abstract: Accurate 3D scene interpretation in active construction sites is essential for progress monitoring, safety assessment, and digital twin development. LiDAR is widely used in construction because it offers advantages over camera-based systems, performing reliably in cluttered and dynamically changing conditions. Yet most public datasets for 3D perception are derived from densely fused scans with uniform sampling and complete visibility, conditions that do not reflect real construction sites. Field data are often collected as isolated single-station LiDAR views, constrained by safety requirements, limited access, and ongoing operations. These factors lead to radial density decay, fragmented geometry, and view-dependent visibility, characteristics that remain underrepresented in existing datasets. This paper presents SIP, Site in Pieces, a dataset created to reflect the practical constraints of LiDAR acquisition during construction. SIP provides indoor and outdoor scenes captured with a terrestrial LiDAR scanner and annotated at the point level using a taxonomy tailored to construction environments: A. Built Environment, B. Construction Operations, and C. Site Surroundings. The dataset includes both structural components and slender temporary objects such as scaffolding, MEP piping, and scissor lifts, where sparsity caused by occlusion and fragmented geometry make segmentation particularly challenging. The scanning protocol, annotation workflow, and quality control procedures establish a consistent foundation for the dataset. SIP is openly available with a supporting Git repository, offering adaptable class configurations that streamline adoption within modern 3D deep learning frameworks. By providing field data that retain real-world sensing characteristics, SIP enables robust benchmarking and contributes to advancing construction-oriented 3D vision tasks.
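[14] KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification cs.CV | cs.AI | cs.LG
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh
TL;DR: The paper proposes KD-OCT, a knowledge distillation framework that compresses a high-performance ConvNeXtV2-Large teacher model into a lightweight EfficientNet-B2 student model for efficient and accurate edge deployment in clinical OCT classification.
Details
Motivation: Deep learning models such as ConvNeXtV2-Large have high computational demands that limit deployment in clinical settings, so efficient models that retain high diagnostic performance are needed.
Result: On the Noor Eye Hospital dataset, KD-OCT outperforms comparable multi-scale or feature-fusion OCT classifiers, approaching teacher performance while substantially reducing model size and inference time.
Insight: Knowledge distillation can effectively compress models for medical image classification without significantly sacrificing performance, enabling efficient deployment on edge devices.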
Abstract: Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management. However, deploying state-of-the-art deep learning models like ConvNeXtV2-Large in clinical settings is hindered by their computational demands. Therefore, it is desirable to develop efficient models that maintain high diagnostic performance while enabling real-time deployment. In this study, a novel knowledge distillation framework, termed KD-OCT, is proposed to compress a high-performance ConvNeXtV2-Large teacher model, enhanced with advanced augmentations, stochastic weight averaging, and focal loss, into a lightweight EfficientNet-B2 student for classifying normal, drusen, and CNV cases. KD-OCT employs real-time distillation with a combined loss balancing soft teacher knowledge transfer and hard ground-truth supervision. The effectiveness of the proposed method is evaluated on the Noor Eye Hospital (NEH) dataset using patient-level cross-validation. Experimental results demonstrate that KD-OCT outperforms comparable multi-scale or feature-fusion OCT classifiers in efficiency-accuracy balance, achieving near-teacher performance with substantial reductions in model size and inference time. Despite the compression, the student model exceeds most existing frameworks, facilitating edge deployment for AMD screening. Code is available at https://github.com/erfan-nourbakhsh/KD-OCT.
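A combined loss that balances soft teacher knowledge with hard ground-truth supervision is commonly written in the Hinton-style form sketched below. The abstract does not give KD-OCT's exact formulation, so treat this as the generic technique under assumptions: the weight `alpha`, the `temperature`, the function names, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy(student, label).

    Generic knowledge-distillation loss; alpha and temperature are assumed values.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: KL divergence between temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard term: cross-entropy against the ground-truth class index.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Made-up logits for three classes (normal, drusen, CNV); true class is index 2.
teacher = [0.5, 1.0, 4.0]
student = [1.0, 1.5, 2.0]
loss = distillation_loss(student, teacher, label=2)
print(f"combined loss: {loss:.4f}")
```

The `temperature ** 2` factor keeps the gradient scale of the soft term comparable to the hard term when the temperature changes, which is the usual reason it appears in this formulation.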
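[15] AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models cs.CV
Arman Zarei, Jiacheng Pan, Matthew Gwilliam, Soheil Feizi, Zhenheng Yang
TL;DR: AgentComp is a new framework that improves text-to-image models on compositional tasks by strengthening their reasoning ability. It uses the reasoning and tool-use capabilities of large language models to autonomously construct compositional datasets, then fine-tunes the models with preference optimization.
Details
Motivation: Existing text-to-image models perform poorly on compositional tasks and struggle to accurately capture object relationships, attribute bindings, and fine-grained details.
Result: It achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench while preserving image quality, and generalizes to other capabilities such as text rendering.
Insight: Agentic reasoning and compositional dataset construction can substantially improve text-to-image models on fine-grained compositional tasks without sacrificing other capabilities.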
Abstract: Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality, that is, accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench without compromising image quality (a common drawback in prior approaches), and even generalizes to other capabilities not explicitly trained for, such as text rendering.
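[16] Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters cs.CV
Mizanur Rahman Jewel, Mohamed Elmahallawy, Sanjay Madria, Samuel Frimpong
TL;DR: The paper proposes MDSE, a multimodal vision-language framework for explaining scenes after underground mining disasters. It combines context-aware cross-attention and segmentation-aware dual-pathway visual encoding with an efficient Transformer-based language model, substantially improving situational awareness in visually degraded environments.
Details
Motivation: Underground mining disasters severely degrade visual information, making it hard for conventional systems to provide effective situational awareness. MDSE aims to compensate for the missing visual information by generating detailed textual explanations.
Result: On the UMD dataset and related benchmarks, MDSE substantially outperforms existing methods, producing more accurate and contextually relevant descriptions that improve situational awareness.
Insight: In multimodal tasks, combining context awareness with region segmentation effectively improves performance in visually degraded environments, while an efficient model design reduces computational overhead.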
Abstract: Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE, Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE has three-fold innovations: (i) Context-Aware Cross-Attention for robust alignment of visual and textual features even under severe degradation; (ii) Segmentation-aware dual pathway visual encoding that fuses global and region-specific embeddings; and (iii) Resource-Efficient Transformer-Based Language Model for expressive caption generation with minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset–the first image-caption corpus of real underground disaster scenes–enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments, improving situational awareness for underground emergency response. The code is at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.
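[17] Food Image Generation on Multi-Noun Categories cs.CV
Xinyue Pan, Yuhao Chen, Jiangpeng He, Fengqing Zhu
TL;DR: The paper studies the challenges of generating images for multi-noun food categories and proposes FoCULR, a method that incorporates food domain knowledge and introduces core concepts early in the generation process, clearly improving generation quality.
Details
Motivation: Multi-noun food categories in real-world data (such as “egg noodle”) are easily misinterpreted during generation, producing incorrect image content, and existing methods fall short in understanding multi-noun relationships.
Result: Experimental results show that FoCULR outperforms existing methods on image generation for multi-noun food categories.
Insight: Understanding multi-noun categories is a key challenge in generation tasks; introducing domain knowledge and refining the generation process can effectively improve results.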
Abstract: Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt “egg noodle” may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This is due to insufficient multi-noun category related knowledge in the text encoder and misinterpretation of multi-noun relationships, leading to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement) which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.
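[18] GimbalDiffusion: Gravity-Aware Camera Control for Video Generation cs.CV
Frédéric Fortier-Chouinard, Yannick Hold-Geoffroy, Valentin Deschaintre, Matheus Gadelha, Jean-François Lalonde
TL;DR: GimbalDiffusion proposes a camera control framework grounded in physical-world coordinates that uses gravity as a global reference, enabling precise control over camera motion and overcoming existing methods' reliance on relative descriptions.
Details
Motivation: Existing text-to-video generation methods achieve high realism, but fine-grained control over camera motion and orientation is still lacking, and they usually rely on relative or ambiguous descriptions.
Result: GimbalDiffusion enables precise and interpretable control over camera parameters and improves the robustness of text-to-video models.
Insight: Introducing physical-world coordinates and a gravity reference can substantially improve the precision and interpretability of camera control while reducing dependence on an initial reference frame.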
Abstract: Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive. Existing approaches typically encode camera trajectories through relative or ambiguous representations, limiting explicit geometric control. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing precise and interpretable control over camera parameters without requiring an initial reference frame. We leverage panoramic 360-degree videos to construct a wide variety of camera trajectories, well beyond the predominantly straight, forward-facing trajectories seen in conventional video data. To further enhance camera guidance, we introduce null-pitch conditioning, an annotation strategy that reduces the model’s reliance on text content when conflicting with camera specifications (e.g., generating grass while the camera points towards the sky). Finally, we establish a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ for comprehensive evaluation under wide camera pitch variation. Together, these contributions advance the controllability and robustness of text-to-video models, enabling precise, gravity-aligned camera manipulation within generative frameworks.
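[19] SuperF: Neural Implicit Fields for Multi-Image Super-Resolution cs.CV
Sander Riisøen Jyhne, Christian Igel, Morten Goodwin, Per-Arne Andersen, Serge Belongie
TL;DR: SuperF proposes a multi-image super-resolution method based on implicit neural representations (INRs). By sharing one implicit representation across multiple low-resolution images and jointly optimizing frame alignment with the INR, it achieves super-resolution without high-resolution training data.
Details
Motivation: High-resolution imagery is often limited by sensor technology, atmospheric conditions, and cost. Single-image super-resolution methods can produce “hallucinated” structures, whereas multi-image super-resolution (MISR) can improve resolution more accurately by exploiting multi-view constraints.
Result: The method achieves strong results on simulated bursts of satellite imagery and handheld camera images, supporting upsampling factors of up to 8.
Insight: The continuous nature of implicit neural representations (INRs) makes them well suited to MISR; the core of SuperF is the joint optimization of alignment and signal representation, which removes the dependence on training data.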
Abstract: High-resolution imagery is often hindered by limitations in sensor technology, atmospheric conditions, and costs. Such challenges occur in satellite remote sensing, but also with handheld cameras, such as our smartphones. Hence, super-resolution aims to enhance the image resolution algorithmically. Since single-image super-resolution requires solving an inverse problem, such methods must exploit strong priors, e.g. learned from high-resolution training data, or be constrained by auxiliary data, e.g. by a high-resolution guide from another modality. While qualitatively pleasing, such approaches often lead to “hallucinated” structures that do not match reality. In contrast, multi-image super-resolution (MISR) aims to improve the (optical) resolution by constraining the super-resolution process with multiple views taken with sub-pixel shifts. Here, we propose SuperF, a test-time optimization approach for MISR that leverages coordinate-based neural networks, also called neural fields. Their ability to represent continuous signals with an implicit neural representation (INR) makes them an ideal fit for the MISR task. The key characteristic of our approach is to share an INR for multiple shifted low-resolution frames and to jointly optimize the frame alignment with the INR. Our approach advances related INR baselines, adopted from burst fusion for layer separation, by directly parameterizing the sub-pixel alignment as optimizable affine transformation parameters and by optimizing via a super-sampled coordinate grid that corresponds to the output resolution. Our experiments yield compelling results on simulated bursts of satellite imagery and ground-level images from handheld cameras, with upsampling factors of up to 8. A key advantage of SuperF is that this approach does not rely on any high-resolution training data.
[20] Integrated Pipeline for Coronary Angiography With Automated Lesion Profiling, Virtual Stenting, and 100-Vessel FFR Validation cs.CV | cs.AIPDF
Georgy Kopanitsa, Oleg Metsker, Alexey Yakovlev
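TL;DR: The paper presents AngioAI-QFR, an end-to-end coronary angiography analysis pipeline that combines deep-learning stenosis detection, virtual stenting, and automatic QFR computation, validated against FFR in 100 vessels.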
Details
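Motivation: Visual assessment of coronary stenosis is variable and only moderately related to ischaemia, and wire-based FFR, though accurate, is not used systematically. The paper aims to build an automated, fast system that integrates multiple functions to improve clinical practicality.
Result: AngioAI-QFR correlates strongly with FFR (r = 0.89), with an AUC of 0.93, sensitivity 0.88, and specificity 0.86. The analysis ran fully automatically in 93% of vessels, with a median time to result of 41 s.
Insight: Beyond making assessment more efficient, the system distinguishes focal from diffuse disease. The predicted gain from virtual stenting is larger in focal disease, which gives clinicians a new decision-support tool.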
Abstract: Coronary angiography is the main tool for assessing coronary artery disease, but visual grading of stenosis is variable and only moderately related to ischaemia. Wire based fractional flow reserve (FFR) improves lesion selection but is not used systematically. Angiography derived indices such as quantitative flow ratio (QFR) offer wire free physiology, yet many tools are workflow intensive and separate from automated anatomy analysis and virtual PCI planning. We developed AngioAI-QFR, an end to end angiography only pipeline combining deep learning stenosis detection, lumen segmentation, centreline and diameter extraction, per millimetre Relative Flow Capacity profiling, and virtual stenting with automatic recomputation of angiography derived QFR. The system was evaluated in 100 consecutive vessels with invasive FFR as reference. Primary endpoints were agreement with FFR (correlation, mean absolute error) and diagnostic performance for FFR <= 0.80. On held out frames, stenosis detection achieved precision 0.97 and lumen segmentation Dice 0.78. Across 100 vessels, AngioAI-QFR correlated strongly with FFR (r = 0.89, MAE 0.045). The AUC for detecting FFR <= 0.80 was 0.93, with sensitivity 0.88 and specificity 0.86. The pipeline completed fully automatically in 93 percent of vessels, with median time to result 41 s. RFC profiling distinguished focal from diffuse capacity loss, and virtual stenting predicted larger QFR gain in focal than in diffuse disease. AngioAI-QFR provides a practical, near real time pipeline that unifies computer vision, functional profiling, and virtual PCI with automated angiography derived physiology.
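The diagnostic endpoints quoted above (sensitivity and specificity for detecting FFR <= 0.80) can be illustrated with a minimal sketch. It assumes the same 0.80 cutoff is applied to the angiography-derived index, which the abstract does not state explicitly; the function name `diagnostic_metrics` and all sample values are hypothetical and are not data from the study.

```python
def diagnostic_metrics(qfr_values, ffr_values, threshold=0.80):
    """Sensitivity and specificity of an index for detecting FFR <= threshold.

    Assumption: the index is called positive when it is <= the same threshold.
    """
    tp = fp = tn = fn = 0
    for qfr, ffr in zip(qfr_values, ffr_values):
        predicted_positive = qfr <= threshold  # index flags a significant lesion
        actual_positive = ffr <= threshold     # invasive reference standard
        if predicted_positive and actual_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual_positive:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up vessels for illustration only.
qfr = [0.72, 0.85, 0.79, 0.91, 0.82, 0.76]
ffr = [0.70, 0.88, 0.83, 0.90, 0.78, 0.75]
sens, spec = diagnostic_metrics(qfr, ffr)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy sample one vessel is a false positive and one is a false negative, so both metrics come out at two thirds.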
[21] GTAvatar: Bridging Gaussian Splatting and Texture Mapping for Relightable and Editable Gaussian Avatars cs.CV | cs.GRPDF
Kelian Baert, Mae Younes, Francois Bourel, Marc Christie, Adnane Boukhayma
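TL;DR: GTAvatar combines Gaussian Splatting with texture mapping to produce relightable and editable Gaussian head avatars, offering both high accuracy and intuitive editing.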
Details
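Motivation: Gaussian Splatting reconstructs high-fidelity head avatars but lacks the intuitive editability of traditional mesh-based methods. The paper aims to combine the strengths of both.
Result: Experiments show the method outperforms existing techniques in reconstruction accuracy, relighting quality, and editing capability.
Insight: Combining Gaussian Splatting with texture mapping opens a new direction for head avatar reconstruction, pairing fidelity with a user-friendly editing experience.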
Abstract: Recent advancements in Gaussian Splatting have enabled increasingly accurate reconstruction of photorealistic head avatars, opening the door to numerous applications in visual effects, videoconferencing, and virtual reality. This, however, comes with the lack of intuitive editability offered by traditional triangle mesh-based methods. In contrast, we propose a method that combines the accuracy and fidelity of 2D Gaussian Splatting with the intuitiveness of UV texture mapping. By embedding each canonical Gaussian primitive’s local frame into a patch in the UV space of a template mesh in a computationally efficient manner, we reconstruct continuous editable material head textures from a single monocular video on a conventional UV domain. Furthermore, we leverage an efficient physically based reflectance model to enable relighting and editing of these intrinsic material maps. Through extensive comparisons with state-of-the-art methods, we demonstrate the accuracy of our reconstructions, the quality of our relighting results, and the ability to provide intuitive controls for modifying an avatar’s appearance and geometry via texture mapping without additional optimization.
[22] WonderZoom: Multi-Scale 3D World Generation cs.CV | cs.AI | cs.GRPDF
Jin Cao, Hong-Xing Yu, Jiajun Wu
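TL;DR: WonderZoom is a new method for generating multi-scale 3D scenes from a single image. Using scale-adaptive Gaussian surfels and a progressive detail synthesizer, it addresses the inability of existing methods to generate multi-scale content.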
Details
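Motivation: Existing 3D scene generation methods support only single-scale synthesis and cannot produce coherent content at varying granularities. A 3D representation that can handle multiple spatial scales is missing.
Result: Experiments show WonderZoom significantly outperforms existing 3D and video generation methods in quality and multi-scale consistency.
Insight: Progressively refining detail enables multi-scale 3D content generation from landscapes down to microscopic features, offering a new approach to interactive 3D world creation.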
Abstract: We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to “zoom into” a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds in https://wonderzoom.github.io/
[23] Prompt-Based Continual Compositional Zero-Shot Learning cs.CV | cs.AIPDF
Sauda Maryam, Sara Nadeem, Faisal Qureshi, Mohsen Ali
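TL;DR: The paper proposes the PromptCCZSL framework, which uses multi-teacher distillation and compositional prompts to address continual adaptation in compositional zero-shot learning while preventing forgetting of prior knowledge.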
Details
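Motivation: Continual learning in Compositional Zero-Shot Learning (CZSL) is more complex than classical continual learning because attributes and objects may reoccur across sessions while compositions remain unique. The paper targets this setting.
Result: On the UT-Zappos and C-GQA benchmarks, PromptCCZSL substantially outperforms prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
Insight: 1) Combining multi-teacher distillation with compositional prompts effectively prevents forgetting; 2) the loss design (e.g., the Orthogonal Projection Loss) markedly improves adaptation in new sessions.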
Abstract: We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning where classes are disjoint, CCZSL is more complex as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization. Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
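The abstract describes the Orthogonal Projection Loss only by its goal: new attribute and object embeddings should stay distinct from previous ones. One plausible way to express that goal, offered here as an assumption rather than the paper's formula, is to penalize the squared cosine similarity between every new/old embedding pair. The names `cosine` and `orthogonality_penalty` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(new_embeddings, old_embeddings):
    """Mean squared cosine similarity over all new/old pairs.

    Assumed form, not taken from the paper: the value is 0 when every new
    embedding is orthogonal to every old one and grows as they overlap.
    """
    total = 0.0
    count = 0
    for new in new_embeddings:
        for old in old_embeddings:
            total += cosine(new, old) ** 2
            count += 1
    return total / count


# Toy 3-D embeddings from a hypothetical earlier session.
old = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
distinct_new = [[0.0, 0.0, 2.0]]     # orthogonal to both old embeddings
overlapping_new = [[1.0, 1.0, 0.0]]  # lies in the span of the old embeddings

print(orthogonality_penalty(distinct_new, old))
print(orthogonality_penalty(overlapping_new, old))
```

Squaring the cosine treats aligned and anti-aligned embeddings alike; a real implementation might instead penalize only positive similarity.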
[24] Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation cs.CV | cs.AIPDF
Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li
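TL;DR: The paper proposes Δ-LFM, a framework that uses Flow Matching to model patient-specific disease dynamics, addressing the shortcomings of conventional generative methods in continuity and semantic structure, and validates it on longitudinal MRI data.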
Details
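Motivation: Modeling disease progression is critical for early diagnosis and personalized treatment, but existing generative methods fall short on continuity and semantic structure. Δ-LFM aims to capture disease dynamics more accurately through flow matching and patient-specific latent alignment.
Result: Δ-LFM performs strongly on longitudinal MRI data and yields a more interpretable latent space together with visualizations of disease dynamics.
Insight: Flow matching offers a new way to model disease dynamics, and patient-specific latent alignment strengthens semantic consistency, which aids clinical interpretation.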
Abstract: Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
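To make the "velocity field" idea concrete, here is a minimal sketch of the generic flow matching objective with a straight-line path between two points. It is not the Δ-LFM implementation: treating the endpoints as latent codes of one patient at two visits is an assumption made for illustration, and the names `interpolate`, `target_velocity`, and `flow_matching_loss` are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


# Hypothetical 2-D latent codes of one patient at baseline and follow-up.
baseline = [0.0, 1.0]
follow_up = [2.0, 1.0]


def perfect_model(xt, t):
    # Stand-in for a trained network that already outputs the true velocity.
    return [2.0, 0.0]


def untrained_model(xt, t):
    # Stand-in for a network that predicts no change at all.
    return [0.0, 0.0]


print(flow_matching_loss(perfect_model, baseline, follow_up, t=0.5))
print(flow_matching_loss(untrained_model, baseline, follow_up, t=0.5))
```

In training, `t` would be sampled at random and the loss averaged over many patient pairs; integrating the learned velocity forward from a baseline code would then give a predicted trajectory.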
[25] View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs cs.CVPDF
Yuanyuan Liu, Haiyang Mei, Dongyang Zhan, Jiayue Zhao, Dongsheng Zhou
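TL;DR: The paper proposes View-on-Graph (VoG), a zero-shot 3D visual grounding method that organizes the 3D scene into a multi-modal, multi-layer scene graph so that a vision-language model (VLM) can act as an active agent and selectively access the information it needs, improving both performance and interpretability.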
Details
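Motivation: Existing zero-shot 3D visual grounding methods convert 3D spatial information into composite inputs suited to VLM processing (such as specified-view renderings or video sequences), but this produces entangled visual representations that make spatial-semantic relationships hard to exploit. The paper therefore externalizes the 3D information as a scene graph to simplify the VLM’s reasoning.
Result: Experiments show VoG achieves state-of-the-art performance on zero-shot 3D visual grounding, demonstrating the effectiveness of structured scene exploration.
Insight: Externalizing 3D scene information as a scene graph significantly improves the VLM’s reasoning efficiency and performance while producing interpretable step-by-step reasoning traces, pointing to a new approach for zero-shot 3D visual grounding.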
Abstract: 3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM’s reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
[26] GLACIA: Instance-Aware Positional Reasoning for Glacial Lake Segmentation via Multimodal Large Language Model cs.CV | cs.AIPDF
Lalit Maurya, Saurabh Kaushik, Beth Tellman
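TL;DR: GLACIA is a new framework that combines large language models with segmentation capabilities to provide accurate segmentation masks and spatial reasoning outputs for glacial lake segmentation.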
Details
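Motivation: Existing CNN- and ViT-based segmentation methods focus only on pixel-level prediction and lack high-level global scene semantics and human-interpretable reasoning. GLACIA addresses this by introducing a multimodal large language model.
Result: GLACIA reaches an mIoU of 87.30, surpassing CNNs (78.55-79.01), ViTs (69.27-81.75), Geo-foundation models (76.37-87.10), and reasoning-based segmentation methods (60.12-75.66).
Insight: Through natural language interaction, GLACIA makes glacial lake monitoring more intuitive and interpretable, supporting disaster preparedness and policy-making.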
Abstract: Glacial lake monitoring bears great significance in mitigating the anticipated risk of Glacial Lake Outburst Floods. However, existing segmentation methods based on convolutional neural networks (CNNs) and Vision Transformers (ViTs) remain constrained to pixel-level predictions, lacking high-level global scene semantics and human-interpretable reasoning. To address this, we introduce GLACIA (Glacial LAke segmentation with Contextual Instance Awareness), the first framework that integrates large language models with segmentation capabilities to produce both accurate segmentation masks and corresponding spatial reasoning outputs. We construct the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which provides diverse, spatially grounded question-answer pairs designed to overcome the lack of instance-aware positional reasoning data in remote sensing. Comparative evaluations demonstrate that GLACIA (mIoU: 87.30) surpasses state-of-the-art methods based on CNNs (mIoU: 78.55 - 79.01), ViTs (mIoU: 69.27 - 81.75), Geo-foundation models (mIoU: 76.37 - 87.10), and reasoning-based segmentation methods (mIoU: 60.12 - 75.66). Our approach enables intuitive disaster preparedness and informed policy-making in the context of rapidly changing glacial environments by facilitating natural language interaction, thereby supporting more efficient and interpretable decision-making. The code is released at https://github.com/lalitmaurya47/GLACIA
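All comparisons above are reported as mIoU. As a reminder of what that number measures, the sketch below computes per-class intersection over union on small label grids and averages over classes. The averaging convention (two classes, lake and background, with classes absent from both masks skipped) is an assumption, since the abstract does not spell out the evaluation protocol. The function names are hypothetical.

```python
def iou(pred, target, cls):
    """Intersection over union for one class on 2-D label grids."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred = p == cls
            in_target = t == cls
            if in_pred and in_target:
                intersection += 1
            if in_pred or in_target:
                union += 1
    return intersection / union if union else None


def mean_iou(pred, target, classes=(0, 1)):
    """Average IoU over classes, skipping classes absent from both masks."""
    scores = [iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


# Toy 2x3 masks: 1 = lake pixel, 0 = background.
pred = [[0, 1, 1],
        [0, 0, 1]]
target = [[0, 1, 0],
          [0, 1, 1]]
print(mean_iou(pred, target))
```

Multiplying the result by 100 gives the percentage scale used in the abstract.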
[27] Rethinking Chain-of-Thought Reasoning for Videos cs.CV | cs.AI | cs.CL | cs.LGPDF
Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
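TL;DR: The paper proposes an efficient video reasoning framework that challenges the need for long chain-of-thought (CoT) reasoning, achieving efficient and competitive performance with compressed visual tokens and brief reasoning traces.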
Details
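Motivation: The authors observe that video reasoning typically relies on lengthy reasoning chains and large numbers of visual tokens, whereas more concise reasoning and fewer tokens may suffice. They validate this hypothesis experimentally.
Result: Competitive performance across multiple benchmarks with substantially improved inference efficiency.
Insight: The results suggest that long, human-like CoT reasoning may not be necessary for video reasoning; concise reasoning can be both effective and efficient.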
Abstract: Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM’s reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
[28] ROI-Packing: Efficient Region-Based Compression for Machine Vision cs.CVPDF
Md Eimran Hossain Eimon, Alena Krause, Ashan Perera, Juan Merlos, Hari Kalva
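TL;DR: ROI-Packing is an efficient image compression method designed for machine vision. By prioritizing task-critical regions and discarding less relevant data, it substantially improves compression efficiency without compromising task accuracy.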
Details
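Motivation: Existing general-purpose video coding standards such as VVC are not optimized for machine vision tasks, which wastes resources or degrades accuracy. ROI-Packing aims to address this.
Result: Across multiple datasets and two tasks (object detection and instance segmentation), it achieves up to a 44.10% bitrate reduction without loss of task accuracy and an 8.88% accuracy improvement at the same bitrate, compared with the VVC codec.
Insight: Task-aware compression for machine vision can substantially improve efficiency without modifying the end-task models, which is promising for scenarios such as edge computing.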
Abstract: This paper introduces ROI-Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end-task accuracy and packing them efficiently while discarding less relevant data, ROI-Packing achieves significant compression efficiency without requiring retraining or fine-tuning of end-task models. Comprehensive evaluations across five datasets and two popular tasks (object detection and instance segmentation) demonstrate up to a 44.10% reduction in bitrate without compromising end-task accuracy, along with an 8.88% improvement in accuracy at the same bitrate compared to the state-of-the-art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).
[29] MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI cs.CV | cs.AI | cs.CLPDF
Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
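TL;DR: MedForget is a multimodal unlearning testbed for medical AI that evaluates how well models forget sensitive data in medical settings while preserving diagnostic performance.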
Details
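Motivation: When medical AI systems use pretrained multimodal large language models (MLLMs), privacy and compliance become key concerns, especially under regulations such as HIPAA and GDPR that require data to be forgettable. The effectiveness of existing unlearning methods in complex medical settings has not been sufficiently studied.
Result: Existing methods struggle to achieve complete, hierarchy-aware forgetting. Models unlearned at coarse granularity resist reconstruction attacks strongly, while fine-grained unlearning leaves models vulnerable.
Insight: Unlearning in medical AI needs careful design that respects the data hierarchy and balances forgetting effectiveness against model performance.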
Abstract: Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the “right to be forgotten”. Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
[30] MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification cs.CVPDF
Sangwoon Kwak, Weeyoung Kwon, Jun Young Jeong, Geonho Kim, Won-Sik Cheong
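TL;DR: MoRel proposes an Anchor Relay-based Bidirectional Blending (ARBB) mechanism and a Feature-variance-guided Hierarchical Densification (FHD) scheme to address memory explosion, temporal flickering, and occlusion handling in long-range motion modeling with 4D Gaussian Splatting (4DGS), and introduces a new dataset, SelfCap$_{\text{LR}}$, to validate it.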
Details
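Motivation: Existing 4DGS methods suffer from memory explosion, temporal flickering, and failure to handle occlusions when modeling long-range dynamic scenes. A more efficient and temporally consistent approach is needed.
Result: MoRel achieves temporally coherent, flicker-free rendering for long-range 4D motion modeling while keeping memory usage scalable.
Insight: Anchor relay and bidirectional blending markedly improve temporal consistency in long-range dynamic scene modeling, while hierarchical densification balances rendering quality against memory efficiency.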
Abstract: Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naive extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time index and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between KfA and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts. We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfA’s while keeping rendering quality, based on an assigned level of feature-variance. To effectively evaluate our model’s capability to handle real-world long-range 4D motion, we newly compose long-range 4D motion-contained dataset, called SelfCap$_{\text{LR}}$. It has larger average dynamic motion magnitude, captured at spatially wider spaces, compared to previous dynamic video datasets. Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations.
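The bidirectional blending idea can be sketched in a few lines: between two key-frame anchors, Gaussians deformed forward from the earlier anchor and backward from the later one are both rendered, and their opacities are scaled so that each side fades out as time approaches the other anchor. The abstract says MoRel uses learnable opacity control; the fixed linear weight below is a simplifying assumption, and `relay_weight` and `blended_opacities` are hypothetical names.

```python
def relay_weight(t, t_prev, t_next):
    """Linear weight in [0, 1]: 0 at the earlier anchor, 1 at the later one."""
    if t_prev == t_next or not (t_prev <= t <= t_next):
        raise ValueError("t must lie between two distinct anchor times")
    return (t - t_prev) / (t_next - t_prev)


def blended_opacities(opacity_fwd, opacity_bwd, t, t_prev, t_next):
    """Scale opacities of forward- and backward-deformed Gaussians at time t.

    Assumption: a fixed linear schedule stands in for learned opacity control.
    """
    w = relay_weight(t, t_prev, t_next)
    fwd = [(1.0 - w) * o for o in opacity_fwd]
    bwd = [w * o for o in opacity_bwd]
    return fwd, bwd


# Toy example: anchors at frames 0 and 10, query at frame 2.5.
fwd, bwd = blended_opacities([0.8, 0.4], [1.0, 0.4], t=2.5, t_prev=0.0, t_next=10.0)
print(fwd, bwd)
```

Because the backward contribution vanishes exactly at the earlier anchor and the forward one at the later anchor, the rendered result is continuous across anchor boundaries, which is the property that suppresses flicker in this simplified picture.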
[31] LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations cs.CVPDF
Zhichao Yang, Tianjiao Gu, Jianjie Wang, Feiyu Lin, Xiangfei Sheng
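TL;DR: The paper presents LongT2IBench, a benchmark for evaluating image-text alignment in long text-to-image generation, comprising 14K long text-image pairs with graph-structured annotations. Building on a Generate-Refine-Qualify protocol and Hierarchical Alignment Chain-of-Thought (CoT), it further proposes LongT2IExpert, an evaluator built on multimodal large language models that provides quantitative scores and structured interpretations.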
Details
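Motivation: Existing text-to-image (T2I) alignment benchmarks focus mainly on short prompts and provide only MOS or Likert-scale annotations, which holds back long-text T2I evaluation. Long texts contain more detail and call for finer-grained alignment evaluation.
Result: Experiments show LongT2IExpert performs better in alignment evaluation and interpretation. The data and code have been released.
Insight: Long-text T2I alignment evaluation needs finer-grained structured methods; combining graph-structured annotations with multimodal large language models improves both the accuracy and the interpretability of evaluation.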
Abstract: The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate the image-text alignment in long prompt scenarios. However, the existing T2I alignment benchmarks predominantly focus on short prompt scenarios and only provide MOS or Likert scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released at https://welldky.github.io/LongT2IBench-Homepage/.
[32] Dynamic Facial Expressions Analysis Based Parkinson’s Disease Auxiliary Diagnosis cs.CVPDF
Xiaochen Huang, Xiaochen Bi, Cuihua Lv, Xin Wang, Haoyan Zhang
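TL;DR: The paper proposes an auxiliary diagnosis method for Parkinson’s disease (PD) based on dynamic facial expression analysis. By analyzing reduced facial expressivity and facial rigidity with a multimodal network and an LSTM classifier, it reaches an accuracy of 93.1%.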
Details
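Motivation: Parkinson’s disease (PD) is a common neurodegenerative disorder that severely affects patients’ daily functioning and social interactions. Conventional diagnosis is complex and inconvenient, so an efficient, accessible auxiliary diagnostic tool is needed.
Result: The method achieves 93.1% accuracy in PD diagnosis, outperforming other in-vitro diagnostic approaches.
Insight: Facial expression analysis can serve as a convenient, non-invasive PD diagnostic tool and could be extended to early screening for other neurological disorders.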
Abstract: Parkinson’s disease (PD), a prevalent neurodegenerative disorder, significantly affects patients’ daily functioning and social interactions. To facilitate a more efficient and accessible diagnostic approach for PD, we propose a dynamic facial expression analysis-based PD auxiliary diagnosis method. This method targets hypomimia, a characteristic clinical symptom of PD, by analyzing two manifestations: reduced facial expressivity and facial rigidity, thereby facilitating the diagnosis process. We develop a multimodal facial expression analysis network to extract expression intensity features during patients’ performance of various facial expressions. This network leverages the CLIP architecture to integrate visual and textual features while preserving the temporal dynamics of facial expressions. Subsequently, the expression intensity features are processed and input into an LSTM-based classification network for PD diagnosis. Our method achieves an accuracy of 93.1%, outperforming other in-vitro PD diagnostic approaches. This technique offers a more convenient detection method for potential PD patients, improving their diagnostic experience.
[33] FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model cs.CVPDF
Xiang Chen, Jinshan Pan, Jiangxin Dong, Jian Yang, Jinhui Tang
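TL;DR: The paper proposes FoundIR-v2, an image restoration foundation model that dynamically optimizes the data mixture proportions across tasks and introduces an MoE-driven scheduler, improving generalization and performance across tasks.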
Details
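Motivation: The performance of image restoration foundation models depends not only on the scale and quality of the data but also on the mixture proportions of data from different tasks. The authors therefore propose a new method to optimize data mixture proportions and task adaptivity.
Result: The model handles more than 50 sub-tasks and performs favorably against existing methods in real-world scenarios.
Insight: Data mixture proportions are critical in multi-task learning; dynamic scheduling and task-adaptive design markedly improve generalization and overall performance.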
Abstract: Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.
[34] Traffic Scene Small Target Detection Method Based on YOLOv8n-SPTS Model for Autonomous Driving cs.CVPDF
Songhan Wu
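TL;DR: The paper proposes a small-target detection method for traffic scenes based on an improved YOLOv8n-SPTS model. By optimizing the feature extraction module, strengthening feature fusion, and designing a dedicated small-target detection structure, it substantially improves small-target detection for autonomous driving.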
Details
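Motivation: Dynamic perception in autonomous driving suffers from poor small-target recognition, mainly because of lost small-target information, scale imbalance, and occlusion. Existing algorithms perform poorly in these scenarios.
Result: On the VisDrone2019-DET dataset, YOLOv8n-SPTS ranks first in precision (61.9%), recall (48.3%), mAP@0.5 (52.6%), and mAP@0.5:0.95 (32.6%). Visualizations confirm a significantly lower miss rate for small targets such as pedestrians and bicycles in occluded and dense scenes.
Insight: Preserving fine-grained information through space-to-depth conversion, improving multi-scale feature fusion, and adding a dedicated small-target detection structure are effective ways to improve small-target detection in autonomous driving.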
Abstract: This paper focuses on the key issue in autonomous driving: small target recognition in dynamic perception. Existing algorithms suffer from poor detection performance due to missing small target information, scale imbalance, and occlusion. We propose an improved YOLOv8n-SPTS model, which enhances the detection accuracy of small traffic targets through three key innovations: First, optimizing the feature extraction module. In the Backbone Bottleneck structure of YOLOv8n, 4 traditional convolution modules are replaced with Space-to-Depth Convolution (SPD-Conv) modules. This module retains fine-grained information through space-to-depth conversion, reduces information loss, and enhances the ability to capture features of low-resolution small targets. Second, enhancing feature fusion capability. The Spatial Pyramid Pooling - Fast Cross Stage Partial Connection (SPPFCSPC) module is introduced to replace the original SPPF module, integrating the multi-scale feature extraction from Spatial Pyramid Pooling (SPP) and the feature fusion mechanism of Cross Stage Partial Connection (CSP), thereby improving the model’s contextual understanding of complex scenes and multi-scale feature expression ability. Third, designing a dedicated detection structure for small targets. A Triple-Stage Feature Pyramid (TSFP) structure is proposed, which adds a 160*160 small target detection head to the original detection heads to fully utilize high-resolution features in shallow layers; meanwhile, redundant large target detection heads are removed to balance computational efficiency. Comparative experiments on the VisDrone2019-DET dataset show that YOLOv8n-SPTS model ranks first in precision (61.9%), recall (48.3%), mAP@0.5 (52.6%), and mAP@0.5:0.95 (32.6%). Visualization results verify that the miss rate of small targets such as pedestrians and bicycles in occluded and dense scenes is significantly reduced.
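The space-to-depth step behind SPD-Conv is easy to show in isolation: each 2x2 spatial block is moved into the channel dimension, so the map is halved in height and width without discarding any value, unlike a strided convolution or pooling. The sketch below uses plain nested lists in height x width x channels layout. The channel ordering inside each block is an arbitrary choice here, and the convolution that normally follows the rearrangement is omitted; `space_to_depth` is a hypothetical name.

```python
def space_to_depth(feature_map, block=2):
    """Rearrange an H x W x C map into (H/block) x (W/block) x (C*block*block).

    feature_map is a list of rows; each row is a list of pixels; each pixel is
    a list of channel values. No value is dropped, only moved into channels.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    output = []
    for i in range(0, height, block):
        out_row = []
        for j in range(0, width, block):
            channels = []
            for di in range(block):
                for dj in range(block):
                    channels.extend(feature_map[i + di][j + dj])
            out_row.append(channels)
        output.append(out_row)
    return output


# Toy 4x4 single-channel map whose pixel value is its flat index.
fm = [[[row * 4 + col] for col in range(4)] for row in range(4)]
packed = space_to_depth(fm)
print(len(packed), len(packed[0]), len(packed[0][0]))
print(packed[0][0], packed[1][1])
```

The 4x4x1 input becomes a 2x2x4 output that still contains all 16 values, which is why the operation is attractive for low-resolution small targets.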
[35] VABench: A Comprehensive Benchmark for Audio-Video Generation cs.CV | cs.SDPDF
Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang
TL;DR: VABench is a new benchmark framework for systematically evaluating synchronous audio-video generation, filling a gap in existing evaluations.
Details
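Motivation: Existing video generation benchmarks lack comprehensive metrics for audio-video generation, especially for models that produce synchronized audio-video outputs.
Result: VABench covers seven major content categories and provides systematic analysis and visualization of evaluation results, aiming to establish a new evaluation standard for the field.
Insight: Synchronous audio-video generation is a complex but important research direction. VABench provides a comprehensive evaluation tool that helps advance the field.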
Abstract: Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
[36] From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation cs.CVPDF
Shivanshu Agnihotri, Snehashis Majhi, Deepak Ranjan Nayak, Debesh Jha
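TL;DR: The paper proposes Polyp-DiFoM, a novel distillation framework that transfers the rich representations of large-scale vision foundation models (such as SAM and DINOv2) into lightweight segmentation baselines (such as U-Net and U-Net++) to improve colonoscopic polyp segmentation.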
Details
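Motivation: Existing lightweight segmentation models cope poorly with the size, shape, and color variations of polyps, while large-scale foundation models generalize well but are hard to apply directly to medical imaging because they lack domain-specific knowledge. Distillation is needed to bridge this gap.
Result: Experiments on five benchmark datasets show Polyp-DiFoM significantly outperforms the baseline models and the state-of-the-art method, with nearly 9 times lower computation overhead.
Insight: Combining the generalization of foundation models with the efficiency of lightweight models through distillation gives an efficient, accurate solution for medical image segmentation.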
Abstract: Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer and still remains challenging due to significant size, shape, and color variations, and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle to deal with the above issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets, such as Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently outperforms respective baseline models significantly, as well as the state-of-the-art model, with nearly 9 times reduced computation overhead. The code is available at https://github.com/lostinrepo/PolypDiFoM.
[37] Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance cs.CV | cs.CR
Kuldeep Singh Yadav, Lalan Kumar
TL;DR: This paper proposes a Transformer-based multimodal fusion method for real-time suspiciousness analysis and releases the large-scale dataset USE50k.
Details
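Motivation: The work aims to improve public safety through intelligent visual surveillance and to address the challenge of real-time suspiciousness analysis in complex environments.
Result: Experiments show that the framework outperforms existing methods in accuracy, robustness, and interpretability.
Insight: Multimodal fusion and Transformer architectures show promise in complex surveillance scenarios and provide a scalable foundation for real-time risk assessment.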
Abstract: Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.
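[38] UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking cs.CV | cs.SD
Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng
TL;DR: UniLS is an end-to-end framework that generates unified speak-listen expressions driven only by dual-track audio. Its two-stage training paradigm addresses the challenge of listener modeling and significantly improves the diversity and naturalness of listening expressions.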
Details
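Motivation: Existing methods model the listener poorly, struggle to generate natural listening motion, and depend on additional speaker motion information. UniLS aims to model listening and speaking jointly from audio alone.
Result: UniLS achieves state-of-the-art speaking accuracy and significantly improves the diversity and naturalness of generated listening expressions.
Insight: Listener motion depends mainly on an internal prior rather than on the external audio signal; two-stage training balances internal and external factors better and improves overall performance.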
Abstract: Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker’s motion is strongly driven by speech audio, while the listener’s motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on extra speaker’s motion to produce the listener. This design is not end-to-end, thereby hindering the real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
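[39] Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video cs.CV | cs.MM
Seonghwa Choi, Moonkyeong Choi, Mingyu Jang, Jaekyung Kim, Jianfei Cai
TL;DR: This paper proposes RnD-Avatar, a relightable and dynamic human avatar modeling framework based on 3D Gaussian Splatting (3DGS) that reconstructs avatars with fine geometric detail and dynamic deformation from monocular video and supports rendering under arbitrary lighting conditions.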
Details
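Motivation: Existing methods (such as NeRF and 3DGS) fail to capture geometric details related to body motion, such as clothing wrinkles, when reconstructing human avatars, which degrades rendering quality.
Result: The method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
Insight: Dynamic deformation and high-quality geometric detail are key to avatar realism, and data with diverse lighting conditions is important for generalization.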
Abstract: Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed as Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar’s articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relight. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
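[40] Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding cs.CV
Xinkui Zhao, Zuxin Wang, Yifan Zhang, Guanjie Cheng, Yueshen Xu
TL;DR: Video-QTR is a lightweight video understanding framework that uses query-driven temporal reasoning to allocate perceptual resources dynamically, reducing input frames by up to 73% while achieving state-of-the-art performance on multiple benchmarks.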
Details
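Motivation: Conventional video understanding methods densely encode every frame, which incurs high computation and memory costs. The authors propose a query-driven temporal reasoning framework to address this inefficiency.
Result: On five benchmarks, including MSVD-QA and ActivityNet-QA, Video-QTR achieves state-of-the-art performance while reducing input frames by up to 73%.
Insight: Query-driven temporal reasoning can substantially improve the efficiency and scalability of video understanding without sacrificing performance.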
Abstract: The rapid development of multimodal large-language models (MLLMs) has significantly expanded the scope of visual language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.
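[41] StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation cs.CV
Ke Xing, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo
TL;DR: StereoWorld repurposes a pretrained video generator and, through geometry-aware regularization and a spatio-temporal tiling scheme, generates high-quality stereo video from monocular input, substantially outperforming existing methods.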
Details
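Motivation: The growing adoption of XR devices has sharply increased demand for high-quality stereo video, but producing it is costly and prone to artifacts. StereoWorld aims to address this problem.
Result: Experiments show that StereoWorld substantially outperforms existing methods in visual fidelity and geometric consistency.
Insight: Geometric regularization and spatio-temporal tiling are key to stereo video generation quality, and a large-scale dataset is essential for training and evaluation.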
Abstract: The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
[42] ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation cs.CV
Shengchao Zhou, Jiehong Lin, Jiahui Liu, Shizhen Zhao, Chirui Chang
TL;DR: ASSIST-3D is an adapted scene synthesis framework for class-agnostic 3D instance segmentation that improves model generalization by synthesizing diverse, plausible 3D scene data.
Details
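Motivation: Existing 3D scene synthesis methods struggle to satisfy geometry diversity, context complexity, and layout reasonability at the same time, while class-agnostic 3D instance segmentation needs high-quality synthetic data to compensate for scarce real data.
Result: The method significantly outperforms existing methods on ScanNetV2 and other datasets, demonstrating the effectiveness of the synthetic data.
Insight: High-quality synthetic data is crucial for class-agnostic tasks, especially when it combines geometric diversity with layout reasonability.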
Abstract: Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to the scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed as ASSIST-3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST-3D features three key innovations, including 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
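The layout step described in the abstract pairs LLM-guided spatial reasoning with depth-first search. As a minimal sketch of the search half only, the following places axis-aligned object footprints on an integer floor grid with backtracking. The function name `place_objects`, the 2D grid, and the simple no-overlap rule are illustrative assumptions, not the authors' implementation, and the LLM-derived spatial constraints are omitted entirely.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) on an integer floor grid


def overlaps(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_objects(room_w: int, room_h: int,
                  sizes: List[Tuple[int, int]]) -> Optional[List[Box]]:
    """Depth-first search with backtracking over candidate positions.

    Objects are placed in the given order; if a later object cannot be
    placed, the search undoes the most recent placement and tries the
    next candidate position. Returns None when no layout exists.
    """
    placed: List[Box] = []

    def dfs(i: int) -> bool:
        if i == len(sizes):
            return True  # every object has a position
        w, h = sizes[i]
        for y in range(room_h - h + 1):
            for x in range(room_w - w + 1):
                cand = (x, y, w, h)
                if not any(overlaps(cand, p) for p in placed):
                    placed.append(cand)
                    if dfs(i + 1):
                        return True
                    placed.pop()  # backtrack
        return False

    return list(placed) if dfs(0) else None
```

In a fuller pipeline, the candidate positions would presumably be filtered or ordered by the spatial relations the LLM proposes (for example, "chair next to desk"), with the depth-first search resolving conflicts among them.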
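[43] FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement cs.CV
Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
TL;DR: FUSER is the first feed-forward multiview point cloud registration transformer; it processes all scans jointly to predict global poses directly, avoiding the computationally expensive pairwise matching of conventional methods. FUSER-DF adds an SE(3)$^N$ diffusion refinement framework that improves registration accuracy through denoising. Experiments show strong accuracy and computational efficiency.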
Details
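Motivation: Conventional multiview point cloud registration builds a pose graph from pairwise matching, which is computationally expensive and lacks global geometric constraints. FUSER aims to solve this by predicting global poses directly with a feed-forward transformer.
Result: Experiments on 3DMatch, ScanNet, and ArkitScenes show that FUSER and its refined version FUSER-DF outperform existing methods in registration accuracy and computational efficiency.
Insight: Processing multiview data jointly with a feed-forward transformer removes the pairwise matching bottleneck of conventional methods, and diffusion refinement further improves registration accuracy.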
Abstract: Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER’s estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.
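[44] Log NeRF: Comparing Spaces for Learning Radiance Fields cs.CV | cs.AI
Sihe Chen, Luv Verma, Bruce A. Maxwell
TL;DR: This paper studies how NeRF performs in different color spaces (linear, sRGB, GPLog, and log RGB) and finds that log RGB space significantly improves rendering quality, robustness, and performance in low-light conditions.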
Details
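Motivation: Inspired by the BiIlluminant Dichromatic Reflection model, the authors hypothesize that log RGB space helps NeRF learn a more compact and effective representation of scene appearance.
Result: Log RGB space performs best across all evaluated scenes, with especially strong results in low-light conditions.
Insight: Log space simplifies the separation of illumination and reflectance, which helps NeRF learn a more effective scene representation.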
Abstract: Neural Radiance Fields (NeRF) have achieved remarkable results in novel view synthesis, typically using sRGB images for supervision. However, little attention has been paid to the color space in which the network is learning the radiance field representation. Inspired by the BiIlluminant Dichromatic Reflection (BIDR) model, which suggests that a logarithmic transformation simplifies the separation of illumination and reflectance, we hypothesize that log RGB space enables NeRF to learn a more compact and effective representation of scene appearance. To test this, we captured approximately 30 videos using a GoPro camera, ensuring linear data recovery through inverse encoding. We trained NeRF models under various color space interpretations (linear, sRGB, GPLog, and log RGB) by converting each network output to a common color space before rendering and loss computation, enforcing representation learning in different color spaces. Quantitative and qualitative evaluations demonstrate that using a log RGB color space consistently improves rendering quality, exhibits greater robustness across scenes, and performs particularly well in low light conditions while using the same bit-depth input images. Further analysis across different network sizes and NeRF variants confirms the generalization and stability of the log space advantage.
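To make the color-space argument concrete, here is a minimal sketch of moving a single channel value between an encoded space, linear light, and log space. It assumes the standard sRGB transfer function as the encoding to invert; the paper's GoPro footage may use a different curve, and the floor `eps` is an assumed guard against taking the log of zero. The helper names are hypothetical.

```python
import math


def srgb_to_linear(c: float) -> float:
    """Invert the standard sRGB transfer function for a value in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_log(c: float, eps: float = 1e-4) -> float:
    """Map a linear channel value to log space, clamping at eps to avoid log(0)."""
    return math.log(max(c, eps))


def log_to_linear(v: float) -> float:
    """Map a log-space value back to linear light."""
    return math.exp(v)
```

The property that motivates log space is that a multiplicative illumination change becomes an additive offset: for reflectance `r` and illumination scale `k`, `linear_to_log(k * r)` equals `linear_to_log(k) + linear_to_log(r)` whenever neither value is clamped.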
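[45] Detection and Localization of Subdural Hematoma Using Deep Learning on Computed Tomography cs.CV | cs.LG
Vasiliki Stoumpou, Rohan Kumar, Bernard Burman, Diego Ojeda, Tapan Mehta
TL;DR: This paper proposes a multimodal deep-learning framework for fast, accurate detection and localization of subdural hematoma (SDH) that combines clinical data with CT imaging and achieves high accuracy.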
Details
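Motivation: Subdural hematoma (SDH) is a common and urgent neurosurgical condition; existing automated tools focus only on detection and lack interpretability and spatial localization.
Result: The multimodal ensemble achieves an AUC of 0.9407, significantly outperforming either modality alone (AUC 0.75 for clinical data; AUC 0.922 to 0.926 for imaging).
Insight: Multimodal methods offer clear advantages in medical image analysis; integrating clinical and imaging data improves performance and interpretability and could streamline clinical workflows.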
Abstract: Background. Subdural hematoma (SDH) is a common neurosurgical emergency, with increasing incidence in aging populations. Rapid and accurate identification is essential to guide timely intervention, yet existing automated tools focus primarily on detection and provide limited interpretability or spatial localization. There remains a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making. Methods. We developed a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. Using 25,315 head CT studies from Hartford HealthCare (2015–2024), of which 3,774 (14.9%) contained clinician-confirmed SDH, tabular models were trained on demographics, comorbidities, medications, and laboratory results. Imaging models were trained to detect SDH and generate voxel-level probability maps. A greedy ensemble strategy combined complementary predictors. Findings. Clinical variables alone provided modest discriminatory power (AUC 0.75). Convolutional models trained on CT volumes and segmentation-derived maps achieved substantially higher accuracy (AUCs 0.922 and 0.926). The multimodal ensemble integrating all components achieved the best overall performance (AUC 0.9407; 95% CI, 0.930–0.951) and produced anatomically meaningful localization maps consistent with known SDH patterns. Interpretation. This multimodal, interpretable framework provides rapid and accurate SDH detection and localization, achieving high detection performance and offering transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management.
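The abstract mentions a greedy ensemble strategy without giving details. One common form is greedy forward selection with replacement: repeatedly add whichever model most improves the validation AUC of the averaged scores, and stop when nothing helps. The sketch below assumes that form; the function names, the simple score averaging, and the pairwise AUC computation are illustrative choices, not the authors' procedure.

```python
from typing import Dict, List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in the number of samples, which is fine
    for a sketch but not for a large study.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def greedy_ensemble(labels: List[int],
                    model_scores: Dict[str, List[float]],
                    max_rounds: int = 10) -> Tuple[List[str], float]:
    """Greedy forward selection (with replacement) of models to average."""
    chosen: List[str] = []
    best_auc = 0.0
    for _ in range(max_rounds):
        best_name = None
        for name in model_scores:
            trial = chosen + [name]
            # Average the scores of every model currently in the trial ensemble.
            avg = [sum(model_scores[m][i] for m in trial) / len(trial)
                   for i in range(len(labels))]
            trial_auc = auc(labels, avg)
            if trial_auc > best_auc + 1e-12:
                best_auc, best_name = trial_auc, name
        if best_name is None:
            break  # no candidate improves the ensemble
        chosen.append(best_name)
    return chosen, best_auc
```

Selection of this kind should be run on held-out validation predictions, since choosing ensemble members on the same data used to report the final AUC would bias the estimate upward.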
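[46] Generative Point Cloud Registration cs.CV
Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
TL;DR: This paper proposes Generative Point Cloud Registration, a new 3D point cloud registration approach that combines 2D generative models with 3D matching tasks to improve registration performance.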
Details
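Motivation: Conventional 3D registration methods usually rely on geometric information and overlook the potential of additional modalities such as color and texture. This work aims to improve registration robustness by generating geometry- and color-consistent multiview image pairs.
Result: Experiments on the 3DMatch and ScanNet datasets validate the method and show significantly improved registration performance.
Insight: By coupling generative models with 3D registration, the paper shows that multimodal information (geometry and color) plays an important role in registration robustness.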
Abstract: In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometry-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.
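[47] DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping cs.CV
Yanan Wang, Shengcai Liao, Panwen Hu, Xin Li, Fan Yang
TL;DR: The paper proposes the DirectSwap framework and the HeadSwapBench dataset to address identity leakage and mask-boundary artifacts in video head swapping; direct cross-identity training and a motion- and expression-aware reconstruction loss improve visual quality and consistency.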
Details
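Motivation: Existing video head swapping methods rely on same-person cross-frame training and mask-based inpainting, which cause boundary artifacts and identity leakage and cannot recover key cues occluded by the mask, such as facial pose and expression.
Result: Experiments show that DirectSwap outperforms existing methods in visual quality, identity fidelity, and motion and expression consistency.
Insight: Cross-identity paired data and mask-free direct training remove a key bottleneck, and the MEAR loss offers a new approach to cross-frame coherence.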
Abstract: Video head swapping aims to replace the entire head of a video subject, including facial identity, head shape, and hairstyle, with that of a reference image, while preserving the target body, background, and motion dynamics. Due to the lack of ground-truth paired swapping data, prior methods typically train on cross-frame pairs of the same person within a video and rely on mask-based inpainting to mitigate identity leakage. Beyond potential boundary artifacts, this paradigm struggles to recover essential cues occluded by the mask, such as facial pose, expressions, and motion dynamics. To address these issues, we prompt a video editing model to synthesize new heads for existing videos as fake swapping inputs, while maintaining frame-synchronized facial poses and expressions. This yields HeadSwapBench, the first cross-identity paired dataset for video head swapping, which supports both training (\TrainNum{} videos) and benchmarking (\TestNum{} videos) with genuine outputs. Leveraging this paired supervision, we propose DirectSwap, a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model with a motion module and conditioning inputs. Furthermore, we introduce the Motion- and Expression-Aware Reconstruction (MEAR) loss, which reweights the diffusion loss per pixel using frame-difference magnitudes and facial-landmark proximity, thereby enhancing cross-frame coherence in motion and expressions. Extensive experiments demonstrate that DirectSwap achieves state-of-the-art visual quality, identity fidelity, and motion and expression consistency across diverse in-the-wild video scenes. We will release the source code and the HeadSwapBench dataset to facilitate future research.
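The abstract describes the MEAR loss as reweighting the diffusion loss per pixel using frame-difference magnitudes and facial-landmark proximity, but gives no formula. The sketch below is one plausible reading on a single-channel frame stored as nested lists: each pixel's weight is 1 plus a motion term plus a Gaussian falloff around the nearest landmark. The additive form, the Gaussian falloff, and the parameters `alpha`, `beta`, and `sigma` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Frame = List[List[float]]


def mear_weights(prev_frame: Frame, frame: Frame,
                 landmarks: Sequence[Tuple[int, int]],
                 sigma: float = 2.0, alpha: float = 1.0,
                 beta: float = 1.0) -> Frame:
    """Per-pixel weights: 1 + alpha * |frame difference| + beta * landmark proximity."""
    weights: Frame = []
    for y, row in enumerate(frame):
        out_row = []
        for x, value in enumerate(row):
            motion = abs(value - prev_frame[y][x])
            # Squared distance to the nearest (row, col) landmark, if any exist.
            d2 = min(((y - ly) ** 2 + (x - lx) ** 2 for ly, lx in landmarks),
                     default=None)
            proximity = math.exp(-d2 / (2 * sigma ** 2)) if d2 is not None else 0.0
            out_row.append(1.0 + alpha * motion + beta * proximity)
        weights.append(out_row)
    return weights


def weighted_mse(pred: Frame, target: Frame, weights: Frame) -> float:
    """Weighted mean squared error, normalized by the sum of the weights."""
    num = den = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            num += w * (p - t) ** 2
            den += w
    return num / den
```

With no motion and no landmarks every weight is 1 and `weighted_mse` reduces to the ordinary mean squared error, so the reweighting only shifts emphasis toward moving and expression-relevant pixels.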
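[48] Label-free Motion-Conditioned Diffusion Model for Cardiac Ultrasound Synthesis cs.CV
Zhe Li, Hadrien Reynaud, Johanna P Müller, Bernhard Kainz
TL;DR: The paper proposes the Motion Conditioned Diffusion Model (MCDM), which synthesizes cardiac ultrasound videos without manual labels by conditioning on self-supervised motion features.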
Details
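Motivation: Labelled cardiac ultrasound data is scarce, and privacy restrictions and the complexity of expert annotation hinder deep learning, so a label-free method for generating high-quality videos is needed.
Result: On the EchoNet-Dynamic dataset, MCDM generates temporally coherent and clinically realistic videos, with performance competitive with conventional methods.
Insight: Self-supervised conditioning can scale cardiac ultrasound synthesis and reduce reliance on manual labels.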
Abstract: Ultrasound echocardiography is essential for the non-invasive, real-time assessment of cardiac function, but the scarcity of labelled data, driven by privacy restrictions and the complexity of expert annotation, remains a major obstacle for deep learning methods. We propose the Motion Conditioned Diffusion Model (MCDM), a label-free latent diffusion framework that synthesises realistic echocardiography videos conditioned on self-supervised motion features. To extract these features, we design the Motion and Appearance Feature Extractor (MAFE), which disentangles motion and appearance representations from videos. Feature learning is further enhanced by two auxiliary objectives: a re-identification loss guided by pseudo appearance features and an optical flow loss guided by pseudo flow fields. Evaluated on the EchoNet-Dynamic dataset, MCDM achieves competitive video generation performance, producing temporally coherent and clinically realistic sequences without reliance on manual labels. These results demonstrate the potential of self-supervised conditioning for scalable echocardiography synthesis. Our code is available at https://github.com/ZheLi2020/LabelfreeMCDM.
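[49] InfoMotion: A Graph-Based Approach to Video Dataset Distillation for Echocardiography cs.CV
Zhe Li, Hadrien Reynaud, Alberto Gomez, Bernhard Kainz
TL;DR: This paper proposes InfoMotion, a graph-based method for distilling echocardiography video datasets that selects representative samples through motion feature extraction and graph construction, reaching 69.38% test accuracy with only 25 synthetic videos.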
Details
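Motivation: Echocardiography is essential for non-invasive, real-time assessment of cardiovascular disease, but large-scale data creates storage and computation challenges, so dataset distillation is needed to synthesize a compact subset that retains key features.
Result: With only 25 synthetic videos, the method reaches 69.38% test accuracy on the EchoNet-Dynamic dataset, demonstrating its effectiveness and scalability.
Insight: Capturing temporal dynamics and using graph representations can compress medical video datasets effectively while preserving key clinical features, offering a route to efficient storage and model training.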
Abstract: Echocardiography plays a critical role in the diagnosis and monitoring of cardiovascular diseases as a non-invasive real-time assessment of cardiac structure and function. However, the growing scale of echocardiographic video data presents significant challenges in terms of storage, computation, and model training efficiency. Dataset distillation offers a promising solution by synthesizing a compact, informative subset of data that retains the key clinical features of the original dataset. In this work, we propose a novel approach for distilling a compact synthetic echocardiographic video dataset. Our method leverages motion feature extraction to capture temporal dynamics, followed by class-wise graph construction and representative sample selection using the Infomap algorithm. This enables us to select a diverse and informative subset of synthetic videos that preserves the essential characteristics of the original dataset. We evaluate our approach on the EchoNet-Dynamic dataset and achieve a test accuracy of 69.38% using only 25 synthetic videos. These results demonstrate the effectiveness and scalability of our method for medical video dataset distillation.
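[50] UniPart: Part-Level 3D Generation with Unified 3D Geom-Seg Latents cs.CV
Xufan He, Yushuang Wu, Xiaoyang Guo, Chongjie Ye, Jiaqing Zhou
TL;DR: UniPart is a two-stage latent diffusion framework for image-guided part-level 3D generation that jointly encodes geometry and part structure through the unified Geom-Seg VecSet representation.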
Details
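Motivation: Existing part-level 3D generation methods either rely on implicit part segmentation with limited granularity control or require strong external segmenters trained on large annotated datasets. The authors observe that part awareness emerges naturally during whole-object geometry learning.
Result: Experiments show that UniPart outperforms existing methods in segmentation controllability and part-level geometric quality.
Insight: Part awareness can be learned from whole-object geometry learning without an external segmenter, and dual-space latent optimization significantly improves generation quality.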
Abstract: Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.
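[51] Representation Calibration and Uncertainty Guidance for Class-Incremental Learning based on Vision Language Model cs.CV | cs.AI
Jiantao Tan, Peixian Ma, Tong Yu, Wentao Zhang, Ruixuan Wang
TL;DR: This paper proposes a class-incremental learning framework based on vision-language models (VLMs) that addresses class confusion with task-specific adapters and a cross-task representation calibration strategy, and adds an inference strategy guided by prediction uncertainty, significantly improving performance.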
Details
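Motivation: Class-incremental learning requires a system to keep learning new classes while retaining knowledge of old ones. Existing VLM-based methods still struggle to differentiate classes across tasks.
Result: Experiments on multiple datasets under various settings show that the method significantly outperforms existing methods.
Insight: Cross-task representation calibration and uncertainty guidance effectively alleviate class confusion and improve class-incremental learning performance.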
Abstract: Class-incremental learning requires a learning system to continually learn knowledge of new classes while preserving previously learned knowledge of old classes. Current state-of-the-art methods based on Vision-Language Models (VLMs) still have difficulty differentiating classes across learning tasks. To address this, a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
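[52] Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation cs.CV
Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab
TL;DR: This paper proposes DAPO, a novel defect-aware prompt optimization method that uses progressive tuning for zero-shot multi-type anomaly detection and segmentation, significantly improving performance under distribution shift.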
Details
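Motivation: Existing vision language models such as CLIP overlook fine-grained anomaly types (such as “hole”, “cut”, and “scratch”) in anomaly detection, although this information is essential for understanding root causes and taking targeted corrective measures. Handcrafting such prompts is time-consuming and susceptible to human bias.
Result: On several public benchmarks (MPDD, VisA, and others) and an internal dataset, DAPO improves image-level AUROC and average precision by 3.7% on average and improves localization of novel anomaly types under zero-shot settings by 6.5% on average.
Insight: Enriching the anomaly representation with fine-grained semantic information significantly improves detection, especially in zero-shot and distribution-shift settings; defect-aware prompt optimization is an effective direction.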
Abstract: Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as the specific kind of anomaly (“hole”, “cut”, or “scratch”), that could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of “abnormal” with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
[53] Cytoplasmic Strings Analysis in Human Embryo Time-Lapse Videos using Deep Learning Framework cs.CV | cs.AI
Anabia Sohail, Mohamad Alansari, Ahmed Abughali, Asmaa Chehab, Abdelfatah Ahmed
TL;DR: The paper proposes a two-stage deep learning method for analyzing Cytoplasmic Strings (CS) in human embryo time-lapse videos; with the new NUCE loss and an RF-DETR localizer, it achieves strong classification and detection performance.
Details
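Motivation: Embryo selection is a key bottleneck in infertility treatment. Existing methods rely on handcrafted morphokinetic features and overlook emerging biomarkers such as cytoplasmic strings, and CS detection currently depends on subjective manual inspection, so an automated solution is needed.
Result: NUCE significantly improves F1-score across multiple transformer backbones, and RF-DETR achieves state-of-the-art CS detection performance.
Insight: By coupling confidence-aware reweighting with embedding contraction, the NUCE loss may generalize to other tasks with severe imbalance and feature uncertainty; automated CS analysis may become a new standard for embryo assessment.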
Abstract: Infertility is a major global health issue, and while in-vitro fertilization has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic Strings, thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and severely affected by detection and subtle visual appearance. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters. NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: https://github.com/HamadYA/CS_Detection.
[54] Privacy-Preserving Computer Vision for Industry: Three Case Studies in Human-Centric Manufacturing cs.CV | cs.AIPDF
Sander De Coninck, Emilio Gamba, Bart Van Doninck, Abdellatif Bey-Temsamani, Sam Leroux
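TL;DR: This paper presents the first validation of a privacy-preserving computer vision framework in real industrial environments, using three representative use cases to show how it balances privacy and utility. The evaluation indicates that the approach protects worker privacy while maintaining task performance, making it suitable for practical industrial use.
Details
Motivation: Adopting AI-powered computer vision in industry usually requires balancing operational utility against worker privacy. To address this, the paper builds on a previously proposed privacy-preserving framework and validates it on real industrial data.
Result: Experiments show that task-specific visual obfuscation reduces privacy risks while maintaining task performance. Feedback from industrial partners also supports the deployment feasibility and trustworthiness of the framework.
Insight: The paper shows that privacy protection and AI utility can be achieved together in industrial settings, and it offers concrete cross-domain recommendations for human-centric AI deployment.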
Abstract: The adoption of AI-powered computer vision in industry is often constrained by the need to balance operational utility with worker privacy. Building on our previously proposed privacy-preserving framework, this paper presents its first comprehensive validation on real-world data collected directly by industrial partners in active production environments. We evaluate the framework across three representative use cases: woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment. The approach employs learned visual transformations that obscure sensitive or task-irrelevant information while retaining features essential for task performance. Through both quantitative evaluation of the privacy-utility trade-off and qualitative feedback from industrial partners, we assess the framework’s effectiveness, deployment feasibility, and trust implications. Results demonstrate that task-specific obfuscation enables effective monitoring with reduced privacy risks, establishing the framework’s readiness for real-world adoption and providing cross-domain recommendations for responsible, human-centric AI deployment in industry.
[55] Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach cs.CV | cs.AIPDF
Yiqun Wang, Lujun Li, Meiru Yue, Radu State
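TL;DR: The paper proposes a Video Vision Transformer (ViViT)-based framework that uses temporal-spatial fusion embedding to improve multispectral imagery (MSI) reconstruction in cloud-covered regions, raising reconstruction accuracy and the robustness of agricultural monitoring.
Details
Motivation: Cloud cover corrupts the spectral information in MSI and limits the accuracy of early-season crop monitoring. Existing ViT-based methods use coarse temporal embeddings, which cause information loss and lower reconstruction accuracy.
Result: On 2020 Traill County data, MTS-ViViT reduces MSE by 2.23% relative to the MTS-ViT baseline; with SAR fusion, SMTS-ViViT improves on the SMTS-ViT baseline by 10.33%.
Insight: Finer-grained temporal-spatial embedding combined with SAR-MSI sensor fusion can improve MSI reconstruction accuracy under cloudy conditions, providing a more robust basis for agricultural monitoring.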
Abstract: Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span $(t=2)$, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
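Note: the tubelet extraction mentioned in the abstract can be illustrated with a minimal sketch. A 3D convolution whose kernel size equals its stride is equivalent to cutting the sequence into non-overlapping blocks and applying one shared linear projection to each flattened block. The code below shows only the cutting step in plain Python; the function name `extract_tubelets`, the spatial patch size `p`, and the toy 4 x 4 x 4 input are assumptions for illustration and are not taken from the paper.

```python
def extract_tubelets(seq, t=2, p=2):
    """Split a T x H x W sequence (nested lists) into non-overlapping
    tubelets of t time steps and p x p pixels, each flattened to a list."""
    T, H, W = len(seq), len(seq[0]), len(seq[0][0])
    if T % t or H % p or W % p:
        raise ValueError("sequence dimensions must be divisible by the tubelet size")
    tubelets = []
    for t0 in range(0, T, t):          # step by t: no temporal overlap
        for y0 in range(0, H, p):      # step by p: no spatial overlap
            for x0 in range(0, W, p):
                tubelets.append([
                    seq[t0 + dt][y0 + dy][x0 + dx]
                    for dt in range(t)
                    for dy in range(p)
                    for dx in range(p)
                ])
    return tubelets


# Toy input (hypothetical): 4 acquisition dates of 4 x 4 single-band imagery.
seq = [[[float(ti * 100 + y * 4 + x) for x in range(4)]
        for y in range(4)]
       for ti in range(4)]

tubelets = extract_tubelets(seq, t=2, p=2)
print(len(tubelets), len(tubelets[0]))
```

With t=2, each token only mixes two adjacent dates, which is one way to read the abstract's point about local temporal coherence.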
[56] MODA: The First Challenging Benchmark for Multispectral Object Detection in Aerial Images cs.CVPDF
Shuaihao Han, Tingfa Xu, Peifu Liu, Jianan Li
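TL;DR: The paper introduces MODA, the first large-scale dataset for multispectral object detection in aerial images, and proposes a new detection framework, OSSDet, that improves multispectral aerial object detection performance.
Details
Motivation: Aerial object detection suffers from small objects and background interference. RGB images carry limited information, while multispectral data has great potential but lacks training data.
Result: Experiments show that OSSDet outperforms existing methods with comparable parameter counts and efficiency.
Insight: Multispectral data can improve aerial object detection, and object-aware mechanisms effectively suppress background interference.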
Abstract: Aerial object detection faces significant challenges in real-world scenarios, such as small objects and extensive background interference, which limit the performance of RGB-based detectors with insufficient discriminative information. Multispectral images (MSIs) capture additional spectral cues across multiple bands, offering a promising alternative. However, the lack of training data has been the primary bottleneck to exploiting the potential of MSIs. To address this gap, we introduce the first large-scale dataset for Multispectral Object Detection in Aerial images (MODA), which comprises 14,041 MSIs and 330,191 annotations across diverse, challenging scenarios, providing a comprehensive data foundation for this field. Furthermore, to overcome challenges inherent to aerial object detection using MSIs, we propose OSSDet, a framework that integrates spectral and spatial information with object-aware cues. OSSDet employs a cascaded spectral-spatial modulation structure to optimize target perception, aggregates spectrally related features by exploiting spectral similarities to reinforce intra-object correlations, and suppresses irrelevant background via object-aware masking. Moreover, cross-spectral attention further refines object-related representations under explicit object-aware guidance. Extensive experiments demonstrate that OSSDet outperforms existing methods with comparable parameters and efficiency.
[57] StateSpace-SSL: Linear-Time Self-supervised Learning for Plant Disease Detection cs.CVPDF
Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb
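TL;DR: StateSpace-SSL is a linear-time self-supervised learning framework for plant disease detection. Using a state-space encoder and a teacher-student objective, it addresses the shortcomings of CNNs and Transformers on agricultural imagery and improves detection performance.
Details
Motivation: Existing self-supervised methods built on CNNs and Transformers work poorly on agricultural imagery: CNN-based methods struggle to capture disease patterns that evolve continuously, and Transformer-based methods are computationally expensive.
Result: On three public datasets, it outperforms CNN- and Transformer-based baselines, and the learned features are more compact and focused on lesion regions.
Insight: Linear state-space modelling is well suited to capturing the continuity of lesions in agricultural imagery and offers an efficient solution for self-supervised learning.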
Abstract: Self-supervised learning (SSL) is attractive for plant disease detection as it can exploit large collections of unlabeled leaf images, yet most existing SSL methods are built on CNNs or vision transformers that are poorly matched to agricultural imagery. CNN-based SSL struggles to capture disease patterns that evolve continuously along leaf structures, while transformer-based SSL introduces quadratic attention cost from high-resolution patches. To address these limitations, we propose StateSpace-SSL, a linear-time SSL framework that employs a Vision Mamba state-space encoder to model long-range lesion continuity through directional scanning across the leaf surface. A prototype-driven teacher-student objective aligns representations across multiple views, encouraging stable and lesion-aware features from unlabelled data. Experiments on three publicly available plant disease datasets show that StateSpace-SSL consistently outperforms the CNN- and transformer-based SSL baselines in various evaluation metrics. Qualitative analyses further confirm that it learns compact, lesion-focused feature maps, highlighting the advantage of linear state-space modelling for self-supervised plant disease representation learning.
[58] Masked Registration and Autoencoding of CT Images for Predictive Tibia Reconstruction cs.CVPDF
Hongyou Zhou, Cederic Aßmann, Alaa Bejaoui, Heiko Tzschätzsch, Mark Heyland
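TL;DR: The paper proposes a method combining neural registration and autoencoders to predict a patient-specific reconstruction target from a CT image of a fractured tibia, assisting surgical planning for complex tibial fractures.
Details
Motivation: When planning surgery for complex tibial fractures, surgeons find it hard to imagine the desired 3D bone alignment, so a tool is needed to predict a patient-specific healthy-bone reconstruction target.
Result: The method can predict the patient-specific healthy bone in standard coordinates from a fractured CT, supporting surgical planning.
Insight: Designing the models to be robust to masked input lets the method handle incomplete fracture data, extending the applicability of registration and generative models.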
Abstract: Surgical planning for complex tibial fractures can be challenging for surgeons, as the 3D structure of the later desirable bone alignment may be difficult to imagine. To assist in such planning, we address the challenge of predicting a patient-specific reconstruction target from a CT of the fractured tibia. Our approach combines neural registration and autoencoder models. Specifically, we first train a modified spatial transformer network (STN) to register a raw CT to a standardized coordinate system of a jointly trained tibia prototype. Subsequently, various autoencoder (AE) architectures are trained to model healthy tibial variations. Both the STN and AE models are further designed to be robust to masked input, allowing us to apply them to fractured CTs and decode to a prediction of the patient-specific healthy bone in standard coordinates. Our contributions include: i) a 3D-adapted STN for global spatial registration, ii) a comparative analysis of AEs for bone CT modeling, and iii) the extension of both to handle masked inputs for predictive generation of healthy bone structures. Project page: https://github.com/HongyouZhou/repair
[59] Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment cs.CVPDF
Yuan Li, Zitang Sun, Yen-ju Chen, Shin’ya Nishida
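TL;DR: This paper examines the reasoning of vision-language models (VLMs) in blind image quality assessment (BIQA), exposes their prediction instability and logical inconsistency, and proposes a two-stage tuning method to make the inference more reasonable and stable.
Details
Motivation: Existing VLMs on BIQA tasks show unstable predictions and generated text that contradicts the quality predictions, which is inconsistent with human reasoning. The paper analyzes and addresses these issues to achieve more reasonable inference.
Result: On SPAQ and KONIQ, prediction instability drops from 22.00% to 12.39%; SRCC/PLCC improve by 0.3124/0.3507 on average across LIVE, CSIQ, SPAQ, and KONIQ.
Insight: Separating visual feature learning from quality inference can improve model stability and inference reliability, bringing the model closer to human reasoning.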
Abstract: Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
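Note: SRCC and PLCC, the two metrics reported above, are the Spearman rank-order and Pearson linear correlation coefficients between predicted scores and human opinion scores. The following is a minimal pure-Python sketch of both; the score lists `mos` and `pred` are made-up toy values, not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy scores (hypothetical): predictions are monotonic in the
# human scores but not linear in them.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 4.0, 9.0, 16.0, 25.0]

srcc = spearman(mos, pred)
plcc = pearson(mos, pred)
print(round(srcc, 4), round(plcc, 4))
```

In this toy case the ranking is perfect while the linear fit is not, which is why the two metrics are usually reported together.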
[60] Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment cs.CVPDF
Yuan Li, Zitang Sun, Yen-Ju Chen, Shin’ya Nishida
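TL;DR: This paper examines the limitations of multimodal large language model (MLLM)-based image quality assessment (IQA) in low-level visual perception and proposes improving the alignment of the vision encoder to raise the accuracy of low-level distortion recognition.
Details
Motivation: Current MLLM-based IQA systems have strong visual perception modules, yet they detect low-level distortions (such as blur, noise, and compression) unreliably and give inconsistent results across repeated inferences. The paper investigates whether these systems truly attend to the visual features that matter.
Result: After improving the alignment of the vision encoder, distortion recognition accuracy rises from 14.92% to 84.43%, indicating that suitable constraints can strengthen the expressiveness of visual features.
Insight: Better vision-language alignment is key to improving MLLM performance on low-level vision tasks. Future work should consider how to better preserve low-level visual information during pre-training and fine-tuning.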
Abstract: Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.
[61] Hands-on Evaluation of Visual Transformers for Object Recognition and Detection cs.CV | cs.AIPDF
Dimitrios N. Vlachogiannis, Dimitrios A. Koutsomitropoulos
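TL;DR: The paper compares Vision Transformers (ViTs) with traditional CNNs on object recognition, detection, and medical image classification, and finds that hybrid and hierarchical Transformers offer a better balance between accuracy and computational resources.
Details
Motivation: Traditional CNNs are limited in global image understanding, whereas ViTs, inspired by language models, can better capture global relationships through self-attention; this advantage needs empirical support.
Result: Hybrid and hierarchical ViTs (such as Swin and CvT) perform well across tasks, and data augmentation yields significant performance gains on medical images.
Insight: ViTs have a clear advantage in scenarios requiring global understanding (such as medical imaging); future work could further optimize computational efficiency and data adaptability.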
Abstract: Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
[62] Content-Adaptive Image Retouching Guided by Attribute-Based Text Representation cs.CVPDF
Hancheng Zhu, Xinyu Liu, Rui Yao, Kunyang Sun, Leida Li
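TL;DR: This paper proposes a content-adaptive image retouching method guided by attribute-based text representation (CA-ATP), which uses a content-adaptive curve mapping module and an attribute text prediction module to achieve diverse color adjustments guided by user style preferences.
Details
Motivation: Existing image retouching methods usually ignore the inherent color variations of image content, so they cannot achieve content-adaptive color adjustment or flexibly accommodate user style preferences. The authors propose a new method to overcome this limitation.
Result: Experiments on several public datasets show that the method achieves state-of-the-art performance.
Insight: By combining content-adaptive color mapping with guidance from user style preferences, the method improves retouching quality and gives users more flexible control.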
Abstract: Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
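Note: the idea of combining basis curves with per-pixel weight maps can be sketched as follows. This is only an illustration of the general principle described in the abstract, not the paper's module: the gamma-shaped basis curves, the weight normalization, and the names `gamma_curve` and `retouch_pixel` are all assumptions.

```python
def gamma_curve(g):
    """A hypothetical basis curve mapping intensities in [0, 1] to [0, 1]."""
    return lambda v: v ** g


def retouch_pixel(value, weights, curves):
    """Blend the outputs of all basis curves for one pixel.

    `weights` stands in for the learned weight-map values at this pixel
    location; normalizing them to sum to one is an assumption."""
    total = sum(weights)
    return sum(w / total * c(value) for w, c in zip(weights, curves))


curves = [gamma_curve(0.5), gamma_curve(1.0), gamma_curve(2.0)]

# Two pixels with the SAME input value but different spatial context,
# represented here by different (made-up) weight vectors.
same_value = 0.25
brightened = retouch_pixel(same_value, [1.0, 0.0, 0.0], curves)
darkened = retouch_pixel(same_value, [0.0, 0.0, 1.0], curves)
blended = retouch_pixel(same_value, [1.0, 1.0, 0.0], curves)
print(brightened, darkened, blended)
```

The point of the sketch is that a single global curve would map both pixels to the same output, whereas location-dependent weights let identical color values receive different transformations.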
[63] UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision cs.CVPDF
Alberto Rota, Mert Kiray, Mert Asim Karaoglu, Patrick Ruhkamp, Elena De Momi
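TL;DR: UnReflectAnything is an RGB-only framework that removes highlights from a single image by predicting a highlight map and reconstructing a reflection-free diffuse image. It uses a frozen vision transformer encoder to extract multi-scale features and generates training data through a Virtual Highlight Synthesis pipeline.
Details
Motivation: Specular highlights distort image appearance, obscure texture, and hinder geometric reasoning. The problem is especially severe in natural and surgical imagery, where non-Lambertian surfaces and non-uniform lighting create strong highlights.
Result: It achieves performance competitive with state-of-the-art results on several benchmarks across natural and surgical domains.
Insight: Synthetic data overcomes the scarcity of paired data and shows the potential of physically based rendering for generating training data. The model generalizes across domains and handles complex lighting and non-Lambertian surfaces.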
Abstract: Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art results on several benchmarks. Project Page: https://alberto-rota.github.io/UnReflectAnything/
[64] CS3D: An Efficient Facial Expression Recognition via Event Vision cs.CV | cs.ROPDF
Zhe Wang, Qijin Song, Yucen Peng, Weibang Bai
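TL;DR: The paper proposes CS3D, an efficient facial expression recognition framework that decomposes 3D convolution to reduce computational complexity and energy consumption, and combines soft spiking neurons with a spatial-temporal attention mechanism to improve information retention. Experiments show that it outperforms RNN, Transformer, and C3D on multiple datasets while consuming only 21.97% of the energy of C3D.
Details
Motivation: To improve the efficiency of event-camera-based facial expression recognition and address the high energy consumption of traditional deep learning models, making them better suited to edge computing devices.
Result: It attains higher accuracy than RNN, Transformer, and C3D on multiple datasets, with energy consumption only 21.97% of that of C3D.
Insight: Decomposing 3D convolution reduces computation and energy consumption, while soft spiking neurons and the spatial-temporal attention mechanism improve information retention and accuracy.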
Abstract: Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Nowadays, event cameras are becoming more widely adopted as they surpass RGB cameras in capturing facial expression changes due to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, making them difficult to deploy on edge computing devices and thereby increasing costs, especially for high-frequency, dynamic, event vision-based approaches. To address this challenging issue, we propose the CS3D framework, which decomposes the Convolutional 3D (C3D) method to reduce computational complexity and energy consumption. Additionally, by utilizing soft spiking neurons and a spatial-temporal attention mechanism, the ability to retain information is enhanced, thus improving the accuracy of facial expression detection. Experimental results indicate that our proposed CS3D method attains higher accuracy on multiple datasets compared to architectures such as the RNN, Transformer, and C3D, while the energy consumption of CS3D is just 21.97% of that required by the original C3D on the same device.
[65] Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder cs.CVPDF
Yousef Azizi Movahed, Fatemeh Ziaeetabar
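TL;DR: This paper studies fine-grained classification in hand-object interaction and proposes a static RNN encoder. Through structured data engineering and a change to the model configuration, it raises classification accuracy to 97.60% and handles the most challenging class, ‘grabbing’.
Details
Motivation: Predicting intent in hand-object interaction is an open problem in computer vision. The paper focuses on fine-grained classification of atomic interaction states (‘approaching’, ‘grabbing’, ‘holding’) and aims to solve it with structured data engineering and a lightweight architecture.
Result: The final model reaches 97.60% classification accuracy and achieves a balanced F1-score of 0.90 on the most challenging class, ‘grabbing’.
Insight: The study finds that temporal modeling is not necessary and that high performance can be achieved with static feature encoding; structured features and lightweight architectures perform well in fine-grained interaction classification.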
Abstract: Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely ‘approaching’, ‘grabbing’, and ‘holding’. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis posited that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counter-intuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification converted the network’s function, compelling it to act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final score of 97.60%. Of particular note, our optimized model successfully overcame the most challenging transitional class, ‘grabbing’, by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
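Note: the observation that a recurrent network with seq_length=1 acts as a static encoder can be checked with a small sketch. It assumes a plain tanh RNN cell with a zero initial hidden state; the paper's Bidirectional RNN may use a different cell type, and the weights below are arbitrary toy values. With a single time step the recurrent term multiplies a zero state, so the cell reduces to an ordinary dense layer. In a bidirectional model both directions see the same single step, so the same reduction applies to each.

```python
import math


def rnn_forward(seq, W, U, b):
    """Run a plain tanh RNN over `seq` and return the final hidden state."""
    h = [0.0] * len(b)  # zero initial hidden state (assumption)
    for x in seq:
        h = [
            math.tanh(
                sum(W[i][j] * x[j] for j in range(len(x)))
                + sum(U[i][k] * h[k] for k in range(len(h)))
                + b[i]
            )
            for i in range(len(b))
        ]
    return h


def static_layer(x, W, b):
    """A single dense tanh layer with no recurrence."""
    return [
        math.tanh(sum(W[i][j] * x[j] for j in range(len(x))) + b[i])
        for i in range(len(b))
    ]


# Arbitrary toy weights and one feature vector (hypothetical values).
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
U = [[0.9, -0.4], [0.2, 0.7]]
b = [0.1, -0.1]
x = [1.0, 2.0, 3.0]

recurrent_out = rnn_forward([x], W, U, b)      # sequence length 1
static_out = static_layer(x, W, b)
two_step_out = rnn_forward([x, x], W, U, b)    # recurrence now matters
print(recurrent_out, static_out, two_step_out)
```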
[66] Kaapana: A Comprehensive Open-Source Platform for Integrating AI in Medical Imaging Research Environments cs.CVPDF
Ünal Akünal, Markus Bujotzek, Stefan Denner, Benjamin Hamm, Klaus Kades
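TL;DR: Kaapana is an open-source platform that addresses the challenges of data access and tool standardization in medical imaging research and supports multi-center collaboration and large-scale studies.
Details
Motivation: AI research in medical imaging often faces scattered data, regulatory constraints, and fragmented tooling, which make studies hard to reproduce and scale.
Result: Kaapana reduces technical overhead, improves reproducibility, and supports use cases from local prototyping to nation-wide multi-center studies.
Insight: By bringing the algorithm to the data, Kaapana balances data privacy with the need for collaboration and provides a standardized solution for medical imaging research.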
Abstract: Developing generalizable AI for medical imaging requires both access to large, multi-center datasets and standardized, reproducible tooling within research environments. However, leveraging real-world imaging data in clinical research environments is still hampered by strict regulatory constraints, fragmented software infrastructure, and the challenges inherent in conducting large-cohort multicentre studies. This leads to projects that rely on ad-hoc toolchains that are hard to reproduce, difficult to scale beyond single institutions and poorly suited for collaboration between clinicians and data scientists. We present Kaapana, a comprehensive open-source platform for medical imaging research that is designed to bridge this gap. Rather than building single-use, site-specific tooling, Kaapana provides a modular, extensible framework that unifies data ingestion, cohort curation, processing workflows and result inspection under a common user interface. By bringing the algorithm to the data, it enables institutions to keep control over their sensitive data while still participating in distributed experimentation and model development. By integrating flexible workflow orchestration with user-facing applications for researchers, Kaapana reduces technical overhead, improves reproducibility and enables conducting large-scale, collaborative, multi-centre imaging studies. We describe the core concepts of the platform and illustrate how they can support diverse use cases, from local prototyping to nation-wide research networks. The open-source codebase is available at https://github.com/kaapana/kaapana
[67] VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification cs.CVPDF
Wanyue Zhang, Lin Geng Foo, Thabo Beeler, Rishabh Dabral, Christian Theobalt
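TL;DR: VHOI proposes a two-stage framework that densifies sparse trajectories into human-object interaction (HOI) mask sequences and then fine-tunes a video diffusion model conditioned on these masks, enabling controllable HOI video generation.
Details
Motivation: Existing controllable video generation methods face a trade-off between sparse controls (such as keypoint trajectories) and dense signals (such as optical flow or 3D meshes): the former are easy to specify but lack instance-awareness, while the latter are informative but costly to obtain. VHOI aims to resolve this tension.
Result: Experiments show that VHOI achieves state-of-the-art results in controllable HOI video generation and can generate complete scenes that include navigation leading up to interaction.
Insight: An HOI-aware motion representation provides richer control signals for video generation, and the two-stage design balances ease of control with generation quality.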
Abstract: Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model’s ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
[68] IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting cs.CVPDF
Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang
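TL;DR: The paper introduces IF-Bench, the first benchmark for evaluating the infrared image understanding of multimodal large language models (MLLMs), with 499 infrared images and 680 question-answer pairs, and evaluates more than 40 MLLMs. The authors also propose a training-free generative visual prompting (GenViP) method that significantly improves model performance.
Details
Motivation: Current MLLMs perform well on multimodal tasks, but their ability to understand infrared images has not been explored, so a dedicated benchmark is needed to fill this gap.
Result: Experiments show that GenViP significantly improves the performance of a wide range of MLLMs, and the analysis reveals how model scale, architecture, and inference paradigm affect infrared image understanding.
Insight: The domain distribution gap between infrared and RGB images is the main challenge; GenViP mitigates it through domain alignment, offering a new approach to infrared image understanding.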
Abstract: Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.
[69] An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence cs.CV | eess.SYPDF
Gil Weissman, Amir Ivry, Israel Cohen
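TL;DR: This paper proposes a fully automated Tip-and-Cue framework for generating and scheduling satellite imaging tasks. It processes imagery with AI models, produces structured visual reports, and demonstrates its effectiveness on a maritime vessel tracking case study.
Details
Motivation: As satellite constellations proliferate and sensor capabilities diversify, demand for automated Earth observation grows. Traditional task scheduling methods struggle to meet timeliness and diversity requirements, so an automated framework is needed to improve efficiency and practicality.
Result: The framework is validated in a maritime vessel tracking scenario, demonstrating the full workflow from predicting trajectories with AIS data and generating observation tasks to producing actionable outputs.
Insight: The framework can be extended to scenarios such as smart-city monitoring and disaster response, highlighting the potential of automated tasking and timely analysis.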
Abstract: The proliferation of satellite constellations, coupled with reduced tasking latency and diverse sensor capabilities, has expanded the opportunities for automated Earth observation. This paper introduces a fully automated Tip-and-Cue framework designed for satellite imaging tasking and scheduling. In this context, tips are generated from external data sources or analyses of prior satellite imagery, identifying spatiotemporal targets and prioritizing them for downstream planning. Corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system autonomously generates candidate tasks, optimizes their scheduling across multiple satellites using continuous utility functions that reflect the expected value of each observation, and processes the resulting imagery using artificial-intelligence-based models, including object detectors and vision-language models. Structured visual reports are generated to support both interpretability and the identification of new insights for downstream tasking. The efficacy of the framework is demonstrated through a maritime vessel tracking scenario, utilizing Automatic Identification System (AIS) data for trajectory prediction, targeted observations, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application, often used to benchmark novel approaches to satellite tasking, forecasting, and analysis. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical.
[70] LiM-YOLO: Less is More with Pyramid Level Shift and Normalized Auxiliary Branch for Ship Detection in Optical Remote Sensing Imagery cs.CV | eess.IVPDF
Seon-Hoon Kim, Hyeji Sim, Youeyun Jung, Ok-Chul Jung, Yerin Kim
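TL;DR: LiM-YOLO is a detector designed for ship detection in satellite imagery. It uses a Pyramid Level Shift Strategy and a normalized auxiliary branch to address the scale disparity and morphological anisotropy of targets, improving detection accuracy and efficiency.
Details
Motivation: General-purpose object detectors face extreme scale disparity and morphological anisotropy when detecting ships in satellite imagery; in particular, deep pyramid levels (such as P5) dilute the spatial features of small targets.
Result: On SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, it outperforms existing models in detection accuracy and efficiency.
Insight: Tailoring the pyramid-level design and normalization modules of a detector to a specific domain (such as ship detection) can improve small-object detection.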
Abstract: Applying general-purpose object detectors to ship detection in satellite imagery presents significant challenges due to the extreme scale disparity and morphological anisotropy of maritime targets. Standard architectures utilizing stride-32 (P5) layers often fail to resolve narrow vessels, resulting in spatial feature dilution. In this work, we propose LiM-YOLO, a specialized detector designed to resolve these domain-specific conflicts. Based on a statistical analysis of ship scales, we introduce a Pyramid Level Shift Strategy that reconfigures the detection head to P2-P4. This shift ensures compliance with Nyquist sampling criteria for small objects while eliminating the computational redundancy of deep layers. To further enhance training stability on high-resolution inputs, we incorporate a Group Normalized Convolutional Block for Linear Projection (GN-CBLinear), which mitigates gradient volatility in micro-batch settings. Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO demonstrates superior detection accuracy and efficiency compared to state-of-the-art models. The code is available at https://github.com/egshkim/LiM-YOLO.
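Note: the motivation for shifting the detection head from deep levels to P2-P4 can be illustrated with simple arithmetic. A pyramid level Pk downsamples the input by a stride of 2^k, so a narrow object covers fewer feature cells at deeper levels. The sketch below uses a Nyquist-style rule of thumb that an object should span at least two feature cells to be resolved; that threshold, the 10-pixel vessel width, and the function names are assumptions for illustration, not values from the paper.

```python
# Stride of each feature pyramid level: P_k downsamples by 2**k.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def cells_across(width_px, stride):
    """Number of feature-map cells an object of the given width spans."""
    return width_px / stride


def resolvable(width_px, stride, min_cells=2.0):
    """Rule-of-thumb check (assumption): at least `min_cells` cells wide."""
    return cells_across(width_px, stride) >= min_cells


vessel_width_px = 10  # hypothetical narrow ship, measured across its beam
report = {level: resolvable(vessel_width_px, s) for level, s in STRIDES.items()}
print(report)
```

Under these assumed numbers only the stride-4 level samples the vessel densely enough, while the stride-32 level sees it as less than a third of one cell.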
[71] FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation cs.CVPDF
Pierre Ancey, Andrew Price, Saqib Javed, Mathieu Salzmann
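TL;DR: FastPose-ViT is a Vision Transformer (ViT)-based architecture for real-time estimation of the 6-degrees-of-freedom pose of a spacecraft. By regressing the pose directly, it avoids the computational burden of traditional PnP methods and achieves low latency and high throughput on edge devices.
Details
Motivation: Traditional PnP-based pose estimation algorithms are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. FastPose-ViT addresses this by regressing the pose directly to meet real-time requirements.
Result: On the SPEED dataset, FastPose-ViT outperforms other non-PnP methods and is competitive with state-of-the-art PnP-based techniques. On the NVIDIA Jetson Orin Nano, latency is about 75 ms per frame under sequential execution, and throughput reaches up to 33 FPS when stages are scheduled concurrently.
Insight: Direct pose regression bypasses the computational bottleneck of PnP, and ViTs perform well on this task. The proposed mathematical formalism keeps predictions made on cropped images consistent with the full-image scale, which makes real-time deployment possible.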
Abstract: Estimating the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image is critical for autonomous operations like in-orbit servicing and space debris removal. Existing state-of-the-art methods often rely on iterative Perspective-n-Point (PnP)-based algorithms, which are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. To overcome these limitations, we propose FastPose-ViT, a Vision Transformer (ViT)-based architecture that directly regresses the 6DoF pose. Our approach processes cropped images from object bounding boxes and introduces a novel mathematical formalism to map these localized predictions back to the full-image scale. This formalism is derived from the principles of projective geometry and the concept of “apparent rotation”, where the model predicts an apparent rotation matrix that is then corrected to find the true orientation. We demonstrate that our method outperforms other non-PnP strategies and achieves performance competitive with state-of-the-art PnP-based techniques on the SPEED dataset. Furthermore, we validate our model’s suitability for real-world space missions by quantizing it and deploying it on power-constrained edge hardware. On the NVIDIA Jetson Orin Nano, our end-to-end pipeline achieves a latency of ~75 ms per frame under sequential execution, and a non-blocking throughput of up to 33 FPS when stages are scheduled concurrently.
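Note: the two figures quoted at the end of the abstract (about 75 ms latency and up to 33 FPS) are compatible because they measure different things. Run sequentially, a frame takes the sum of all stage times; run as a pipeline with stages overlapping, steady-state throughput is bounded by the slowest stage alone. The sketch below works through that arithmetic; the three-stage split of 25, 30, and 20 ms is purely hypothetical and chosen only so that the totals match the reported numbers.

```python
def sequential_fps(stage_ms):
    """Frames per second when every stage of a frame runs back to back."""
    return 1000.0 / sum(stage_ms)


def pipelined_fps(stage_ms):
    """Steady-state frames per second when stages overlap across frames;
    the slowest stage sets the pace."""
    return 1000.0 / max(stage_ms)


# Hypothetical per-stage times in milliseconds (e.g. detection, pose
# regression, post-processing); only their sum of 75 ms is from the source.
stages = [25.0, 30.0, 20.0]

latency_ms = sum(stages)
seq_fps = sequential_fps(stages)
pipe_fps = pipelined_fps(stages)
print(latency_ms, round(seq_fps, 2), round(pipe_fps, 2))
```

Pipelining raises throughput but does not shorten the time any single frame spends in the system.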
[72] Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor Segmentation cs.CV
Tien-Dat Chung, Ba-Thinh Lam, Thanh-Huy Nguyen, Thien Nguyen, Nguyen Lan Vi Vu
TL;DR: This paper proposes a semi-supervised multi-modal brain tumor segmentation framework whose Modality-specific Enhancing Module (MEM) and Complementary Information Fusion (CIF) module address existing methods' under-use of complementary information in multi-modal medical images.
Details
Motivation: Existing semi-supervised learning methods fail to fully exploit complementary information across modalities in multi-modal medical images because of semantic discrepancies and misalignment.
Result: On the BraTS 2019 dataset (HGG subset) with 1%, 5%, and 10% labeled data, the method significantly outperforms baselines in Dice and Sensitivity.
Insight: MEM and CIF work together to bridge cross-modality discrepancies and improve segmentation robustness under scarce supervision.
Abstract: Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 dataset (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1%, 5%, and 10% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
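For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Dice and Sensitivity are commonly computed for binary segmentation masks. The flattened toy masks are made up for illustration, and the paper's evaluation code may differ in details such as per-class averaging.

```python
def dice_and_sensitivity(pred, target):
    """Compute Dice and Sensitivity for two flat binary masks (lists of 0/1)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0              # both masks empty: perfect overlap
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0   # no positives to find
    return dice, sensitivity


# Toy masks (hypothetical): 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
dice, sens = dice_and_sensitivity(pred, target)
print(f"Dice={dice:.3f}, Sensitivity={sens:.3f}")
```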
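[73] CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing cs.CV | cs.AI
Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck
TL;DR: This paper proposes CHEM, a new method for quantifying and understanding hallucinations (unrealistic artifacts) in deep learning-based image processing, providing a tool for building trustworthy computer vision models.
Details
Motivation: U-shaped networks such as U-Net perform well on image deconvolution but can produce unrealistic hallucinations, which pose risks in safety-critical scenarios. A method for quantifying these hallucinations is needed to ensure model reliability.
Result: CHEM quantifies hallucinations across different models and offers new perspectives on hallucination in deep learning-based image processing.
Insight: The structure of U-shaped networks makes them prone to hallucinations; CHEM provides a standardized evaluation tool that helps improve model trustworthiness and safety.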
Abstract: U-Net and other U-shaped architectures have achieved significant success in image deconvolution tasks. However, challenges have emerged, as these methods might generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a novel approach for quantifying and comprehending hallucination artifacts to ensure trustworthy computer vision models. Our method, termed the Conformal Hallucination Estimation Metric (CHEM), is applicable to any image reconstruction model, enabling efficient identification and quantification of hallucination artifacts. It offers two key advantages: it leverages wavelet and shearlet representations to efficiently extract hallucinations of image features and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. Furthermore, from an approximation theoretical perspective, we explore the reasons why U-shaped networks are prone to hallucinations. We test the proposed approach on the CANDELS astronomical image dataset with models such as U-Net, SwinUNet, and Learnlets, and provide new perspectives on hallucination from different aspects in deep learning-based image processing.
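The abstract mentions conformalized quantile regression (CQR) as the distribution-free ingredient. As a rough illustration of the generic split-conformal CQR calibration step (not the paper's CHEM procedure, whose wavelet and shearlet features are not reproduced here), the sketch below computes the correction that widens predicted quantile intervals. The function name and the toy calibration values are assumptions.

```python
import math


def cqr_correction(q_lo, q_hi, y, alpha):
    """Split-conformal CQR: return the margin added to both interval ends.

    q_lo, q_hi: predicted lower/upper quantiles on a held-out calibration set.
    y: the true values on that set. alpha: target miscoverage level.
    """
    # Conformity score: how far each true value falls outside its interval
    # (negative when it lies inside).
    scores = sorted(max(lo - yi, yi - hi) for lo, hi, yi in zip(q_lo, q_hi, y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the finite-sample quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


# Toy calibration set (hypothetical values).
margin = cqr_correction([0, 0, 0, 0], [1, 1, 1, 1], [0.5, 1.2, -0.3, 2.0], alpha=0.5)
print(margin)  # a new interval would be [lo - margin, hi + margin]
```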
[74] DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation cs.CV
Zhizhong Wang, Tianyi Chu, Zeyi Huang, Nanyang Wang, Kehan Li
TL;DR: DynaIP is a dynamic image prompt adapter that improves the scalability of zero-shot personalized text-to-image (PT2I) generation, addressing the balance between concept preservation and prompt following, the retention of fine-grained details, and multi-subject scalability.
Details
Motivation: Existing PT2I methods face three core challenges in the zero-shot setting: balancing concept preservation against prompt following, loss of fine-grained details, and limited multi-subject scalability. DynaIP aims to address all three.
Result: Experiments show that DynaIP outperforms existing methods on single- and multi-subject PT2I tasks, marking a notable advance for PT2I generation.
Insight: The decoupling learning behavior of MM-DiT and the effective use of CLIP's hierarchical features are key to improving PT2I performance.
Abstract: Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: 1. the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), 2. the difficulty in retaining fine-grained concept details in reference images, and 3. the restricted scalability to extend to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when injecting reference image features into its dual branches via cross attentions. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.
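[75] Composing Concepts from Images and Videos via Concept-prompt Binding cs.CV | cs.AI | cs.MM
Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang
TL;DR: This paper proposes Bind & Compose, which binds visual concepts to prompt tokens and combines concepts across media (images and videos) to enable flexible concept composition. It uses a hierarchical binder structure and a temporal disentanglement strategy, substantially improving the consistency and quality of composed concepts.
Details
Motivation: Existing visual concept composition methods struggle to extract complex concepts from images and videos and to combine them flexibly, which limits visual creative expression.
Result: Experiments show that the method outperforms existing approaches in concept consistency, prompt fidelity, and motion quality.
Insight: Combining binding with disentanglement substantially improves the quality and flexibility of visual concept composition and offers a new direction for creative visual tasks.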
Abstract: Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
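[76] UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving cs.CV
Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen
TL;DR: This paper proposes UniUGP, a unified framework for end-to-end autonomous driving that combines understanding, generation, and planning modules to tackle long-tail scenarios. It integrates pre-trained vision-language models and video generation models, significantly improving planning performance.
Details
Motivation: Autonomous driving systems perform poorly in long-tail scenarios, mainly because existing methods cannot fully leverage unlabeled videos for visual causal learning and lack the reasoning capabilities of large language models. This paper aims to address both problems.
Result: Experiments show that UniUGP achieves state-of-the-art performance in perception, reasoning, and decision-making, and performs well in challenging long-tail scenarios.
Insight: Combining vision-language models with video generation models can significantly improve the performance and generalization of autonomous driving systems, especially in complex long-tail scenarios.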
Abstract: Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
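[77] Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs cs.CV
Pius Horn, Janis Keuper
TL;DR: This paper proposes a new benchmarking framework for evaluating PDF parsers on mathematical formula extraction. Using synthetic PDFs and LLM-as-a-judge evaluation, it reveals significant performance differences among existing tools and offers practical guidance for choosing a parser.
Details
Motivation: Existing PDF parser benchmarks either ignore mathematical formulas entirely or lack semantically aware evaluation metrics, yet correctly parsing formulas is critical for training large language models and building scientific knowledge bases.
Result: LLM-based evaluation correlates with human judgment far more strongly than conventional methods (Pearson r=0.78 vs. r=0.34 for CDM). The evaluation reveals performance differences across parsers and provides practical guidance for downstream applications.
Insight: LLMs can serve as effective semantic evaluators, and synthetic data provides a controlled benchmarking environment that helps expose parser performance bottlenecks.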
Abstract: Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
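The reported agreement figures are Pearson correlation coefficients between an automatic metric and human ratings. A minimal sketch of that computation follows; the rating lists are invented for illustration and do not come from the benchmark.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical per-formula scores: mean human rating vs. an automatic judge score.
human = [5.0, 4.0, 1.0, 3.0, 2.0]
judge = [4.5, 4.0, 1.5, 2.0, 2.5]
print(f"r = {pearson_r(human, judge):.2f}")
```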
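[78] VisualActBench: Can VLMs See and Act like a Human? cs.CV
Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan
TL;DR: The paper introduces a new task, Visual Action Reasoning, and a benchmark, VisualActBench, for evaluating VLMs' ability to reason and act from visual input alone, and finds a significant gap between current models and human-level performance.
Details
Motivation: Existing Vision-Language Models (VLMs) excel at visual perception and description, but their ability to proactively reason and act from visual input alone remains underexplored.
Result: Frontier models such as GPT4o perform relatively well, but a significant gap to human-level performance remains, particularly in generating high-priority, proactive actions.
Insight: Current VLMs are limited in interpreting complex context, anticipating outcomes, and aligning with human decision-making frameworks; VisualActBench provides a foundation for improving the real-world readiness of proactive, vision-centric AI.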
Abstract: Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models’ human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs’ ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
[79] Splatent: Splatting Diffusion Latents for Novel View Synthesis cs.CV
Or Hirschorn, Omer Sela, Inbar Huberman-Spiegelglas, Netalee Efrat, Eli Alshan
TL;DR: Splatent is a diffusion-based enhancement framework that operates on top of 3D Gaussian Splatting (3DGS) in the VAE latent space. It recovers details in 2D through multi-view attention, avoiding the blur and distortion of 3D reconstruction and significantly improving sparse-view 3D reconstruction quality.
Details
Motivation: Existing methods that perform 3D reconstruction in the VAE latent space suffer from multi-view inconsistency, leading to blurred textures and missing details. Prior approaches either fine-tune the VAE, sacrificing reconstruction quality, or rely on pre-trained diffusion models to recover details, which risks hallucinations. A new approach is needed.
Result: Splatent achieves state-of-the-art VAE latent radiance field reconstruction across multiple benchmarks and significantly improves detail preservation in sparse-view 3D reconstruction.
Insight: Recovering details in 2D from multiple views avoids the inconsistency problems of 3D reconstruction more efficiently while retaining the strengths of pre-trained models, offering a new direction for high-quality sparse-view 3D reconstruction.
Abstract: Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.
[80] ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning cs.CV
Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han
TL;DR: The paper proposes ReViSE, a framework that unifies a model's reasoning and video editing abilities through self-reflective learning, addressing the shortcomings of existing video unified models in reason-informed editing, and validates it on the newly constructed RVE-Bench benchmark.
Details
Motivation: Although existing video unified models are strong at understanding and generation, they fall short on reason-informed video editing, mainly because of inadequate datasets and a disconnect between the models' reasoning and editing capabilities.
Result: On the reasoning-informed video editing subset of RVE-Bench, ReViSE improves the Overall score by 32% over existing methods.
Insight: A framework that unifies reasoning and editing can significantly improve model performance on complex editing tasks.
Abstract: Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models’ reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To address this gap, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model’s internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator’s reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.
cs.CL
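[81] Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning cs.CL
Yudong Wang, Zhe Yang, Wenhan Ma, Zhifang Sui, Liang Zhao
TL;DR: This paper proposes a reinforcement learning framework that reduces hallucinations of large language models in short- and long-form question answering, improving reliability through a novel training set and reward scheme.
Details
Motivation: Reinforcement learning has strengthened the complex reasoning of large language models (LLMs) but has also amplified their hallucinations, creating a trade-off between capability and reliability. This work aims to resolve that trade-off.
Result: Experiments show that the method yields significant gains across multiple benchmarks and effectively reduces both intrinsic and extrinsic hallucinations.
Insight: Introducing fact-grounding and cautiousness rewards into reinforcement learning can mitigate LLM hallucinations and balance capability with reliability.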
Abstract: While reinforcement learning has unlocked unprecedented complex reasoning in large language models, it has also amplified their propensity for hallucination, creating a critical trade-off between capability and reliability. This work confronts this challenge by introducing a targeted RL framework designed to mitigate both intrinsic and extrinsic hallucinations across short and long-form question answering. We address extrinsic hallucinations (flawed internal knowledge) by creating a novel training set from open-ended conversions of TriviaQA. Concurrently, we tackle intrinsic hallucinations (unfaithfulness to context) by leveraging long-form texts from FineWeb in a fact-grounding reward scheme. To further bolster reliability, our framework explicitly rewards the model for refusing to answer unanswerable questions, thereby cultivating crucial cautiousness. Extensive experiments demonstrate that our methodology yields significant performance gains across a diverse suite of benchmarks, substantially reducing both hallucination types. Ultimately, this research contributes a practical framework for resolving the critical tension between advanced reasoning and factual trustworthiness, paving the way for more capable and reliable large language models.
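[82] Luxical: High-Speed Lexical-Dense Text Embeddings cs.CL | cs.LG
DatologyAI: Luke Merrick, Alex Fang, Aldo Carranza
TL;DR: Luxical is an efficient text embedding library that combines sparse TF-IDF features with a lightweight ReLU network, substantially speeding up text processing while approaching the quality of large models.
Details
Motivation: Existing text embedding tools trade off speed against flexibility; Luxical aims to combine the strengths of both to support large-scale text organization.
Result: On retrieval and classification tasks, Luxical achieves speedups of 3x to 100x over neural baselines of varying sizes while matching their quality.
Insight: Combining traditional feature extraction with a lightweight neural network can greatly improve text processing efficiency while staying close to the quality of complex models, providing a practical solution for large-scale text organization.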
Abstract: Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today’s dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed “lexical-dense” text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF–IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
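[83] Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment cs.CL
Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang, Xinru Liu
TL;DR: This paper proposes a framework targeting conflicts between the reward model and the policy. It uses local and global conflict metrics to identify regions of shared ignorance and introduces SHF-CAS, a selective human feedback algorithm that improves the alignment of large language models.
Details
Motivation: Reward-model-based alignment often fails because of annotation noise, bias, or limited coverage, causing models to optimize flawed signals rather than true human preferences. A method is needed to identify and mitigate this misalignment.
Result: Experiments show that SHF-CAS improves alignment performance even when the reward model is biased, supporting the importance of refining high-conflict regions.
Insight: Regions of shared ignorance, where neither the proxy nor the policy has sufficient knowledge, are key points of alignment failure; targeted refinement of these regions can significantly improve model performance.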
Abstract: Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts, cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially susceptible to misalignment. To this end, we propose two complementary metrics for identifying these conflicts: a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-Tau Distance measure. Building on this insight, we design an algorithm named Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) that targets high-conflict QA pairs for additional feedback, refining both the reward model and policy efficiently. Experiments on two alignment tasks demonstrate that our approach enhances general alignment performance, even when trained with a biased proxy reward. Our work provides a new lens for interpreting alignment failures and offers a principled pathway for targeted refinement in LLM training.
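As a concrete reference for the global metric named above, here is a minimal sketch of a normalized Kendall-Tau distance between two score lists, read as the fraction of response pairs that the two scorers order differently. The score values are hypothetical, and the paper's exact definition (tie handling, normalization) may differ.

```python
from itertools import combinations


def kendall_tau_distance(a, b):
    """Fraction of item pairs ranked in opposite order by score lists a and b."""
    discordant = 0
    total = 0
    for i, j in combinations(range(len(a)), 2):
        total += 1
        # A pair is discordant when the two scorers disagree on its order;
        # ties (product equal to zero) are not counted as disagreements here.
        if (a[i] - a[j]) * (b[i] - b[j]) < 0:
            discordant += 1
    return discordant / total


# Hypothetical scores for three candidate responses to one prompt.
proxy_reward = [0.9, 0.5, 0.1]   # proxy reward model scores
policy_score = [0.2, 0.8, 0.4]   # e.g. policy log-likelihood-based scores
print(kendall_tau_distance(proxy_reward, policy_score))
```

A distance near 0 means the proxy and the policy rank responses alike, while values near 1 indicate systematic disagreement of the kind the framework treats as a conflict signal.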
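[84] CORE: A Conceptual Reasoning Layer for Large Language Models cs.CL | cs.AI
Vishwas Hegde, Vindhya Shigehalli
TL;DR: The paper proposes CORE, a concept-first interaction layer that uses a persistent Local Concept and a small library of universal cognitive operators to reduce token consumption and improve stability in multi-turn conversations.
Details
Motivation: Large language models handle single-turn generation well, but in multi-turn interactions the model must reconstruct task state and user intent from an ever-growing token history, leading to inconsistent reasoning modes and prompt bloat.
Result: A prototype experiment shows about a 42% reduction in cumulative tokens, though this is a preliminary figure obtained under simulated conditions.
Insight: CORE offers a model-agnostic mechanism that points to a scalable direction for more stable multi-turn systems.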
Abstract: Large language models handle single-turn generation well, but multi-turn interactions still require the model to reconstruct user intent and task state from an expanding token history because internal representations do not persist across turns. This token-first paradigm leads to drift, inconsistent reasoning modes, and growing prompts as conversations deepen. We propose CORE, a concept-first interaction layer that improves multi-turn stability without modifying model weights. CORE combines a small library of universal cognitive operators with a persistent Local Concept - a compact semantic state capturing the task, constraints, preferences, and intermediate results. Each model call receives only this concept state, the user’s latest instruction, and the selected operator, eliminating the need to replay full history. A preliminary prototype simulating CORE’s behavior shows about 42% reduction in cumulative prompt tokens, though this number reflects prototype conditions and should not be interpreted as a real-world performance estimate. CORE offers a model-agnostic mechanism that separates conceptual reasoning from language generation, suggesting a scalable direction for more stable multi-turn systems.
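[85] Training-free Context-adaptive Attention for Efficient Long Context Modeling cs.CL
Zeng You, Yaofo Chen, Shuhai Zhang, Zhijie Qiu, Tingyu Wu
TL;DR: This paper proposes Training-free Context-adaptive Attention (TCA-Attention), a training-free, context-adaptive sparse attention mechanism for efficient long-context modeling that addresses the compute and memory costs of self-attention on long sequences.
Details
Motivation: The quadratic complexity of self-attention on long sequences poses compute and memory challenges, and existing sparse attention or KV cache compression methods typically rely on fixed patterns or require additional training. The paper aims to provide a training-free, adaptive, and efficient alternative.
Result: Experiments show that at 128K context length TCA-Attention achieves a 2.8x speedup and reduces the KV cache by 61% while performing comparably to full attention.
Insight: TCA-Attention is a plug-and-play solution that needs no parameter updates or architectural changes and applies to both the prefilling and decoding stages, offering a practical route to efficient long-context modeling.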
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. These capabilities stem primarily from the self-attention mechanism, which enables modeling of long-range dependencies. However, the quadratic complexity of self-attention with respect to sequence length poses significant computational and memory challenges, especially as sequence length extends to extremes. While various sparse attention and KV cache compression methods have been proposed to improve efficiency, they often suffer from limitations such as reliance on fixed patterns, inability to handle both prefilling and decoding stages, or the requirement for additional training. In this paper, we propose Training-free Context-adaptive Attention (TCA-Attention), a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. Our method consists of two lightweight phases: i) an offline calibration phase that determines head-specific sparsity budgets via a single forward pass, and ii) an online token selection phase that adaptively retains core context tokens using a lightweight redundancy metric. TCA-Attention provides a unified solution that accelerates both prefilling and decoding while reducing KV cache memory footprint, without requiring parameter updates or architectural changes. Theoretical analysis shows that our approach maintains bounded approximation error. Extensive experiments demonstrate that TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention across various benchmarks, offering a practical plug-and-play solution for efficient long-context inference.
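[86] CONCUR: A Framework for Continual Constrained and Unconstrained Routing cs.CL | cs.AI | cs.LG
Peter Baile Chen, Weiyue Li, Dan Roth, Michael Cafarella, Samuel Madden
TL;DR: CONCUR is a continual routing framework that supports both constrained and unconstrained routing. Its modular design and multiple input representations improve routing performance while lowering training and inference costs.
Details
Motivation: Existing routing systems typically train a single model, which adapts poorly when new strategies appear, and rely on a single input representation, resulting in poor generalization and sub-optimal routing decisions.
Result: On in-distribution and out-of-distribution tasks, CONCUR outperforms the best single strategy and existing routing methods, improving accuracy while lowering inference and training costs.
Insight: Modularity and multiple representations are key to continual routing, yielding a scalable and efficient design.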
Abstract: AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
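To make the constrained versus unconstrained distinction concrete, the sketch below routes a task given per-strategy predictions of accuracy and cost, with an optional budget. It is an illustrative toy, not CONCUR's implementation: the strategy names, predicted values, and the simple "best predicted accuracy within budget" rule are all assumptions.

```python
def route(predictions, budget=None):
    """Pick a strategy from {name: (predicted_accuracy, predicted_cost)}.

    budget=None means unconstrained routing; otherwise only strategies whose
    predicted cost fits the budget are considered.
    """
    feasible = {
        name: (acc, cost)
        for name, (acc, cost) in predictions.items()
        if budget is None or cost <= budget
    }
    if not feasible:
        return None  # nothing fits the budget
    # Prefer higher predicted accuracy; break ties by lower cost.
    return max(feasible, key=lambda name: (feasible[name][0], -feasible[name][1]))


# Hypothetical outputs of one predictor per strategy for a single task.
predictions = {
    "small_model_greedy": (0.62, 1.0),
    "large_model_greedy": (0.81, 8.0),
    "large_model_cot": (0.88, 20.0),
}
print(route(predictions))             # unconstrained
print(route(predictions, budget=10))  # constrained
```

Because each strategy has its own predictor entry, adding a new strategy in this toy amounts to adding one more key, which mirrors the modular motivation described in the abstract.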
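[87] Language models as tools for investigating the distinction between possible and impossible natural languages cs.CL
Julie Kallini, Christopher Potts
TL;DR: The paper discusses language models (LMs) as research tools for distinguishing possible from impossible natural languages and for uncovering the inductive biases that support human language learning. The authors propose a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages.
Details
Motivation: The work uses language models to explore a core question in human language learning, namely the ability to distinguish possible from impossible languages, in order to uncover the inductive biases of human cognition.
Result: The paper reports no specific experimental results; it proposes a theoretical framework and research direction.
Insight: The work offers a new perspective on using language models to study the mechanisms of human language learning and highlights the potential value of LMs in cognitive science research.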
Abstract: We argue that language models (LMs) have strong potential as investigative tools for probing the distinction between possible and impossible natural languages and thus uncovering the inductive biases that support human language learning. We outline a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages, supporting linking hypotheses to human cognition.
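[88] CourtPressGER: A German Court Decision to Press Release Summarization Dataset cs.CL | cs.AI
Sebastian Nagl, Mohamed Elganayni, Melanie Pospisil, Matthias Grabmair
TL;DR: The paper introduces CourtPressGER, a dataset for training and evaluating LLMs on generating accessible press-release summaries from German court decisions.
Details
Motivation: Existing NLP research focuses on technical summaries and overlooks public-facing communication needs. Official press releases from German courts explain complex judicial rulings to the public, so a method for generating accurate and readable summaries is needed.
Result: Large LLMs produce high-quality drafts with little performance loss; small LLMs need hierarchical processing. Human-written press releases rank highest in the evaluation.
Insight: Public-facing summarization must balance accuracy and readability; large LLMs have an advantage on long texts, but small models can also improve with a hierarchical design.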
Abstract: Official court press releases from Germany’s highest courts present and explain judicial rulings to the public, as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a 6.4k dataset of triples: rulings, human-drafted press releases, and synthetic prompts for LLMs to generate comparable releases. This benchmark trains and evaluates LLMs in generating accurate, readable summaries from long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking. Large LLMs produce high-quality drafts with minimal hierarchical performance loss; smaller models require hierarchical setups for long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.
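[89] Knowledge-Augmented Large Language Model Agents for Explainable Financial Decision-Making cs.CL
Qingyuan Zhang, Yuxi Wang, Cancan Hua, Yulin Huang, Ning Lyu
TL;DR: This study proposes an explainable financial decision-making method based on knowledge-augmented large language model agents, combining external knowledge retrieval, semantic representation, and reasoning generation to improve factual accuracy and reasoning transparency.
Details
Motivation: Traditional financial decision methods rely on parameterized knowledge and lack factual consistency and reasoning chains, making them inadequate for complex financial scenarios.
Result: On financial text processing and decision tasks, the method outperforms baselines in accuracy, text generation quality, and factual support.
Insight: Combining knowledge augmentation with explainable reasoning significantly improves the performance and transparency of financial decision-making and shows practical value in complex financial scenarios.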
Abstract: This study investigates an explainable reasoning method for financial decision-making based on knowledge-enhanced large language model agents. To address the limitations of traditional financial decision methods that rely on parameterized knowledge, lack factual consistency, and miss reasoning chains, an integrated framework is proposed that combines external knowledge retrieval, semantic representation, and reasoning generation. The method first encodes financial texts and structured data to obtain semantic representations, and then retrieves task-related information from external knowledge bases using similarity computation. Internal representations and external knowledge are combined through weighted fusion, which ensures fluency while improving factual accuracy and completeness of generated content. In the reasoning stage, a multi-head attention mechanism is introduced to construct logical chains, allowing the model to present transparent causal relationships and traceability during generation. Finally, the model jointly optimizes task objectives and explanation consistency objectives, which enhances predictive performance and reasoning interpretability. Experiments on financial text processing and decision tasks show that the method outperforms baseline approaches in accuracy, text generation quality, and factual support, verifying the effectiveness of knowledge enhancement and explainable reasoning. Overall, the proposed approach overcomes the limitations of traditional models in semantic coverage and reasoning transparency, and demonstrates strong practical value in complex financial scenarios.
[90] Advancing Text Classification with Large Language Models and Neural Attention Mechanisms cs.CLPDF
Ning Lyu, Yuxi Wang, Feng Chen, Qingyuan Zhang
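TL;DR: The paper proposes a text classification algorithm based on large language models and neural attention mechanisms. It uses deep semantic embeddings and attention-enhanced feature selection, combines global and weighted strategies to produce robust text vectors, and significantly outperforms baseline models.
Details
Motivation: Traditional text classification methods are limited in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance; the paper aims to address these problems with large language models and attention mechanisms.
Result: The method outperforms baseline models on Precision, Recall, F1-Score, and AUC, with especially strong gains in Recall and AUC. Sensitivity experiments on hyperparameters and data conditions confirm the model's adaptability and stability.
Insight: 1. Combining large language models with attention mechanisms significantly improves text classification performance; 2. Model configuration has a significant effect on performance; 3. The model shows robustness and applicability in complex data environments.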
Abstract: This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance. The framework includes text encoding, contextual representation modeling, attention-based enhancement, feature aggregation, and classification prediction. In the representation stage, deep semantic embeddings are obtained through large-scale pretrained language models, and attention mechanisms are applied to enhance the selective representation of key features. In the aggregation stage, global and weighted strategies are combined to generate robust text-level vectors. In the classification stage, a fully connected layer and Softmax output are used to predict class distributions, and cross-entropy loss is employed to optimize model parameters. Comparative experiments introduce multiple baseline models, including recurrent neural networks, graph neural networks, and Transformers, and evaluate them on Precision, Recall, F1-Score, and AUC. Results show that the proposed method outperforms existing models on all metrics, with especially strong improvements in Recall and AUC. In addition, sensitivity experiments are conducted on hyperparameters and data conditions, covering the impact of hidden dimensions on AUC and the impact of class imbalance ratios on Recall. The findings demonstrate that proper model configuration has a significant effect on performance and reveal the adaptability and stability of the model under different conditions. Overall, the proposed text classification method not only achieves effective performance improvement but also verifies its robustness and applicability in complex data environments through systematic analysis.
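The aggregation and classification stages can be sketched in a few lines. This is an illustrative reading of the abstract, assuming that "global and weighted strategies" means concatenating mean pooling with attention-weighted pooling; the attention query vector, the weight matrix, and the bias are hypothetical parameters.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(token_vecs, attn_query):
    """Concatenate global mean pooling with attention-weighted pooling."""
    dim = len(token_vecs[0])
    scores = [sum(q * t for q, t in zip(attn_query, vec)) for vec in token_vecs]
    weights = softmax(scores)
    mean_pool = [sum(vec[i] for vec in token_vecs) / len(token_vecs) for i in range(dim)]
    attn_pool = [sum(w * vec[i] for w, vec in zip(weights, token_vecs)) for i in range(dim)]
    return mean_pool + attn_pool

def classify(text_vec, weight_rows, bias):
    """Fully connected layer followed by softmax over classes."""
    logits = [sum(w * x for w, x in zip(row, text_vec)) + b
              for row, b in zip(weight_rows, bias)]
    return softmax(logits)

def cross_entropy(probs, label):
    """Cross-entropy loss for a single example with an integer class label."""
    return -math.log(probs[label])

# Toy token embeddings standing in for pretrained language model outputs.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_vec = aggregate(tokens, attn_query=[1.0, 0.0])
probs = classify(text_vec,
                 weight_rows=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
                 bias=[0.0, 0.0])
loss = cross_entropy(probs, label=0)
```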
[91] Source Coverage and Citation Bias in LLM-based vs. Traditional Search Engines cs.CL | cs.CYPDF
Peixian Zhang, Qiming Ye, Zifan Peng, Kiran Garimella, Gareth Tyson
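TL;DR: Through a large-scale empirical study, the paper compares LLM-based search engines (LLM-SEs) and traditional search engines (TSEs) on source coverage and citation bias, revealing both the diversity and the limitations of LLM-SEs.
Details
Motivation: LLM-SEs are a new paradigm for information retrieval whose citation transparency and credibility have not been studied thoroughly; the paper aims to fill this gap.
Result: LLM-SEs cite more diverse sources than TSEs (37% of domains are cited only by LLM-SEs), but they do not outperform TSEs in credibility, political neutrality, or safety.
Insight: Although LLM-SEs offer a richer set of sources, their credibility and transparency still need improvement; the findings give practical guidance to users, website owners, and developers.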
Abstract: LLM-based Search Engines (LLM-SEs) introduce a new paradigm for information seeking. Unlike Traditional Search Engines (TSEs) (e.g., Google), these systems summarize results, often providing limited citation transparency. The implications of this shift remain largely unexplored, yet it raises key questions regarding trust and transparency. In this paper, we present a large-scale empirical study of LLM-SEs, analyzing 55,936 queries and the corresponding search results across six LLM-SEs and two TSEs. We confirm that LLM-SEs cite domain resources with greater diversity than TSEs. Indeed, 37% of domains are unique to LLM-SEs. However, certain risks still persist: LLM-SEs do not outperform TSEs in credibility, political neutrality and safety metrics. Finally, to understand the selection criteria of LLM-SEs, we perform a feature-based analysis to identify key factors influencing source choice. Our findings provide actionable insights for end users, website owners, and developers.
[92] RouteRAG: Efficient Retrieval-Augmented Generation from Text and Graph via Reinforcement Learning cs.CL | cs.AI | cs.IRPDF
Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin
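TL;DR: RouteRAG is a reinforcement-learning-based framework that jointly optimizes the generation process to support multi-turn, adaptive graph-text hybrid retrieval-augmented generation, and it significantly outperforms existing RAG baselines.
Details
Motivation: Existing graph-based or hybrid retrieval systems depend on fixed or handcrafted retrieval pipelines and cannot integrate evidence dynamically; graph evidence is also expensive to retrieve.
Result: Significantly outperforms existing RAG baselines on five question answering benchmarks.
Insight: End-to-end reinforcement learning can effectively support adaptive, efficient retrieval for complex reasoning.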
Abstract: Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce RouteRAG, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. RouteRAG jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that RouteRAG significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.
[93] MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment cs.CLPDF
Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang
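TL;DR: The paper proposes the MentraSuite framework, which improves the reasoning ability and reliability of large language models in mental health through the MentraBench benchmark and the post-trained Mindora model.
Details
Motivation: Mental health disorders affect people worldwide, yet existing large language models applied to mental health suffer from incomplete and inconsistent reasoning and need improvement.
Result: Mindora achieves the best performance on MentraBench and shows strong reasoning reliability.
Insight: Structured reasoning trajectory generation and consistency-oriented optimization are key to improving model performance in the mental health domain.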
Abstract: Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performance in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.
[94] d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models cs.CLPDF
Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang
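TL;DR: The paper proposes d-TreeRPO, a reliable reinforcement learning framework for diffusion language models that improves policy optimization through tree-structured rollouts and verifiable reward signals.
Details
Motivation: Existing reinforcement learning methods for diffusion language models fall short in both reward signals and prediction probability estimation, which hurts optimization.
Result: Achieves significant gains on multiple reasoning benchmarks, such as Sudoku (+86.2) and Countdown (+51.6); ablation studies confirm the effectiveness of the design.
Insight: Higher prediction confidence lowers probability estimation error, and a time-scheduled self-distillation loss is an effective way to improve training.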
Abstract: Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that d-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
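One way to picture bottom-up advantage computation on a rollout tree is the sketch below. The abstract does not give the exact formula, so this example assumes a simple scheme: a leaf's value is its verifiable outcome reward, an internal node's value is the mean of its children's values, and a child's advantage is its value minus its parent's value. The `Node` class and `backup` function are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    reward: float = 0.0                 # only meaningful at leaves (outcome reward)
    children: list = field(default_factory=list)
    value: float = 0.0                  # filled in by backup()
    advantage: float = 0.0              # filled in by the parent's backup()

def backup(node):
    """Propagate leaf rewards upward and assign step-wise advantages.

    Assumed rule: value(parent) = mean(value(children));
    advantage(child) = value(child) - value(parent).
    """
    if not node.children:
        node.value = node.reward
        return node.value
    child_values = [backup(child) for child in node.children]
    node.value = sum(child_values) / len(child_values)
    for child in node.children:
        child.advantage = child.value - node.value
    return node.value

# Two partial decodings (a, b), each expanded into two completed rollouts
# scored by a verifier (1.0 = correct, 0.0 = incorrect).
a = Node(children=[Node(reward=1.0), Node(reward=0.0)])
b = Node(children=[Node(reward=1.0), Node(reward=1.0)])
root = Node(children=[a, b])
backup(root)
```

Under this assumed rule, the branch whose completions are all correct receives a positive advantage at the intermediate step, which is the kind of fine-grained signal a single outcome reward per rollout would not provide.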
[95] FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text cs.CLPDF
Binbin XU
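TL;DR: FineFreq is a multilingual character frequency dataset extracted from the FineWeb and FineWeb2 corpora, covering more than 1,900 languages with frequency statistics over 96 trillion characters.
Details
Motivation: To provide a resource for studying character frequencies and how they change over time, supporting multilingual and cross-script natural language processing tasks.
Result: The dataset is released in CSV and Parquet formats with year-level and aggregate frequencies, and can be used for a variety of downstream tasks.
Insight: The dataset reveals the diversity and temporal dynamics of character usage in multilingual text, providing a new resource for NLP research.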
Abstract: We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq
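To make the per-character statistics concrete, the sketch below counts characters per year and in aggregate and attaches the Unicode general category to each entry. It is a toy illustration of the kind of table described, not the FineFreq pipeline; the function names and row layout are assumptions, and Python's standard `unicodedata` module exposes the category but not script or block, so those fields are omitted.

```python
import unicodedata
from collections import Counter, defaultdict

def count_by_year(docs):
    """docs: iterable of (year, text) pairs -> {year: Counter of characters}."""
    per_year = defaultdict(Counter)
    for year, text in docs:
        per_year[year].update(text)
    return per_year

def aggregate_rows(per_year):
    """Build one row per character with aggregate and year-level counts."""
    total = Counter()
    for counts in per_year.values():
        total.update(counts)
    rows = []
    for ch, n in total.most_common():
        rows.append({
            "char": ch,
            "count": n,
            "category": unicodedata.category(ch),  # e.g. "Ll", "So"
            "by_year": {y: c[ch] for y, c in sorted(per_year.items()) if c[ch]},
        })
    return rows

# No filtering is applied, so spaces and emoji are counted like any other character.
docs = [(2013, "abba"), (2025, "ab 😀")]
rows = aggregate_rows(count_by_year(docs))
```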
[96] MOA: Multi-Objective Alignment for Role-Playing Agents cs.CLPDF
Chonghua Liao, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li
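TL;DR: MOA is a multi-objective reinforcement learning framework for optimizing role-playing agents (RPAs) across multiple dimensions, addressing the shortcomings of existing methods in diversity and comprehensive optimization.
Details
Motivation: Role-playing agents must master several conflicting skills at once (such as following multi-turn instructions, domain knowledge, and a consistent style), and existing methods (such as supervised fine-tuning or reinforcement learning) struggle to optimize all of these dimensions.
Result: On benchmarks such as PersonaGym and RoleMRC, an 8B model trained with MOA matches or even outperforms GPT-4o and Claude across numerous dimensions.
Insight: Combining multi-objective optimization with the rollout strategy is key to improving the overall performance of role-playing agents.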
Abstract: Role-playing agents (RPAs) must simultaneously master many conflicting skills – following multi-turn instructions, exhibiting domain knowledge, and adopting a consistent linguistic style. Existing work either relies on supervised fine-tuning (SFT) that over-fits surface cues and yields low diversity, or applies reinforcement learning (RL) that fails to learn multiple dimensions for comprehensive RPA optimization. We present MOA (Multi-Objective Alignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. MOA introduces a novel multi-objective optimization strategy that trains simultaneously on multiple fine-grained rubrics to boost optimization performance. Besides, to address the issues of model output diversity and quality, we have also employed thought-augmented rollout with off-policy guidance. Extensive experiments on challenging benchmarks such as PersonaGym and RoleMRC show that MOA enables an 8B model to match or even outperform strong baselines such as GPT-4o and Claude across numerous dimensions. This demonstrates the great potential of MOA in building RPAs that can simultaneously meet the demands of role knowledge, persona style, diverse scenarios, and complex multi-turn conversations.
[97] ChronusOmni: Improving Time Awareness of Omni Large Language Models cs.CL | cs.CV | cs.MMPDF
Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang
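TL;DR: ChronusOmni is an omni large language model with time awareness, focused on improving explicit and implicit audiovisual temporal grounding. By combining timestamp tokens with reinforcement learning, it significantly improves temporal reasoning and achieves the best performance on the newly constructed ChronusAV dataset.
Details
Motivation: Existing methods mainly address explicit temporal grounding in vision-language scenarios, making insufficient use of the audio modality and overlooking implicit temporal relations across modalities. ChronusOmni aims to fill this gap.
Result: Achieves more than 30% improvement on ChronusAV and top results on most metrics of other temporal grounding benchmarks.
Insight: Combining timestamps with reinforcement learning significantly improves cross-modal time awareness without harming the model's general audiovisual understanding.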
Abstract: Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality and overlook implicit temporal grounding across modalities (for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs), despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
cs.GR [Back]
[98] Residual Primitive Fitting of 3D Shapes with SuperFrusta cs.GR | cs.CVPDF
Aditya Ganeshan, Matheus Gadelha, Thibault Groueix, Zhiqin Chen, Siddhartha Chaudhuri
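TL;DR: The paper proposes a framework that converts 3D shapes into compact, editable assemblies of analytic primitives, introducing the SuperFrustum primitive and the ResFit algorithm to balance fidelity and parsimony.
Details
Motivation: Current 3D shape reconstruction methods trade fidelity against parsimony and struggle to achieve high accuracy and editability together. The paper aims to resolve this.
Result: On multiple 3D benchmarks the method significantly outperforms prior work, improving IoU by over 9 points while using nearly half as many primitives.
Insight: The generality of SuperFrustum and the iterative optimization of ResFit resolve the tension between accuracy and parsimony in shape reconstruction, offering a new tool for 3D design.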
Abstract: We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative fitting algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to model various common solids such as cylinders, spheres, cones, and their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a signed distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decomposition for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.
cs.RO [Back]
[99] Development and Testing for Perception Based Autonomous Landing of a Long-Range QuadPlane cs.RO | cs.CVPDF
Ashik E Rasul, Humaira Tasnim, Ji Yu Kim, Young Hyun Lim, Scott Schmitz
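TL;DR: The paper presents a lightweight QuadPlane system for efficient vision-based autonomous landing and visual-inertial odometry, addressing the challenges of long-range missions in GPS-denied environments.
Details
Motivation: In GPS-denied or cluttered urban environments, perception-based autonomous landing is vital for reliable QuadPlane operation. Real-world landing sites, unlike structured landing zones, are unstructured and demand strong generalization.
Result: The system targets efficient autonomous landing in dynamic, unstructured, GPS-denied environments and lays a foundation for long-range QuadPlane missions such as aerial monitoring.
Insight: Optimized deployment on edge AI devices and stable landing of high-inertia aircraft are the key challenges in practice.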
Abstract: QuadPlanes combine the range efficiency of fixed-wing aircraft with the maneuverability of multi-rotor platforms for long-range autonomous missions. In GPS-denied or cluttered urban environments, perception-based landing is vital for reliable operation. Unlike structured landing zones, real-world sites are unstructured and highly variable, requiring strong generalization capabilities from the perception system. Deep neural networks (DNNs) provide a scalable solution for learning landing site features across diverse visual and environmental conditions. While perception-driven landing has been shown in simulation, real-world deployment introduces significant challenges. Payload and volume constraints limit high-performance edge AI devices like the NVIDIA Jetson Orin Nano, which are crucial for real-time detection and control. Accurate pose estimation during descent is necessary, especially in the absence of GPS, and relies on dependable visual-inertial odometry. Achieving this with limited edge AI resources requires careful optimization of the entire deployment framework. The flight characteristics of large QuadPlanes further complicate the problem: these aircraft exhibit high inertia, reduced thrust vectoring, and slow response times, all of which make stable landing maneuvers harder. This work presents a lightweight QuadPlane system for efficient vision-based autonomous landing and visual-inertial odometry, specifically developed for long-range QuadPlane operations such as aerial monitoring. It describes the hardware platform, sensor configuration, and embedded computing architecture designed to meet demanding real-time and physical constraints. This establishes a foundation for deploying autonomous landing in dynamic, unstructured, GPS-denied environments.
[100] H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos cs.RO | cs.AI | cs.CVPDF
Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou
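TL;DR: The paper generates physically grounded robot videos from human interaction videos, proposing a video translation framework that needs no paired data.
Details
Motivation: Learning robot manipulation skills from human videos would reduce tedious robot data collection.
Result: Experiments show that the generated robot videos are more realistic and more physically grounded than those of baselines.
Insight: Human-to-robot video translation is possible without paired data, offering a new scalable direction for robot learning.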
Abstract: Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper’s position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human’s actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and to leverage its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
[101] ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics cs.RO | cs.CVPDF
Donato Caramia, Florian T. Pokorny, Giuseppe Triggiani, Denis Ruffino, David Naso
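TL;DR: ViTA-Seg is a Vision Transformer framework for amodal segmentation in robotics. It proposes two architectures (Single-Head and Dual-Head), introduces the synthetic dataset ViTA-SimData, and performs well in both real-time speed and accuracy.
Details
Motivation: Occlusion in robotic grasping compromises accuracy and reliability, so complete object masks need to be recovered through amodal segmentation.
Result: On the COOCA and KINS benchmarks, ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency.
Insight: Vision Transformers perform well on amodal segmentation, and synthetic data can effectively improve model performance.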
Abstract: Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenarios. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
[102] UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories cs.RO | cs.CVPDF
Yanghong Mei, Yirong Yang, Longteng Guo, Qunbo Wang, Ming-Ming Yu
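TL;DR: UrbanNav is a language-guided urban navigation framework built on large-scale human trajectories. It tackles navigation in complex urban environments by annotating web video data and significantly outperforms existing methods.
Details
Motivation: Existing visual navigation methods are limited to simulated environments or specific goal formats and cannot cope with the complexity and dynamics of real urban environments. UrbanNav aims to solve this using web-scale video data.
Result: UrbanNav outperforms existing methods in spatial reasoning, robustness to noisy instructions, and generalization.
Insight: Web-scale video data is a viable training resource for language-guided navigation in real cities.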
Abstract: Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.
[103] Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation cs.RO | cs.CVPDF
Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu
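TL;DR: The paper introduces the TacThru sensor and the TacThru-UMI learning framework, which enable simultaneous tactile-visual perception for robot manipulation and perform well on five tasks.
Details
Motivation: Robot manipulation requires multimodal perception and effective learning frameworks. Existing designs lack simultaneous multimodal perception, suffer from unreliable tactile tracking, and offer no established way to integrate these signals into a learning framework.
Result: Across five tasks, TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the alternating tactile-visual (66.3%) and vision-only (55.4%) baselines.
Insight: Combining simultaneous multimodal perception with modern learning frameworks enables more precise and adaptable robot manipulation.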
Abstract: Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
[104] Visual Heading Prediction for Autonomous Aerial Vehicles cs.RO | cs.AI | cs.CV | cs.MA | eess.SYPDF
Reza Ahmari, Ahmad Mohammadi, Vahid Hemmati, Mohammed Mynuddin, Parham Kebria
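TL;DR: The paper proposes a vision-based, data-driven framework for real-time integration of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), focusing on robust UGV detection and heading angle prediction for navigation.
Details
Motivation: External localization infrastructure such as GPS/GNSS may be unavailable or unreliable in real-time coordination scenarios, so an infrastructure-independent solution is needed for precise UAV-UGV coordination.
Result: The ANN's heading angle predictions have a mean absolute error of 0.1506° and a root mean squared error of 0.1957°; UGV detection accuracy reaches 95%.
Insight: Accurate heading angle prediction is achievable from monocular camera input alone, making the approach suitable for real-time UAV-UGV coordination in dynamic environments.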
Abstract: The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real-time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision-based, data-driven framework for real-time UAV-UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine-tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV’s required heading angle. A VICON motion capture system was used to generate ground-truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506° and a root mean squared error of 0.1957°, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision-based, infrastructure-independent solution that demonstrates strong potential for deployment in GPS/GNSS-denied environments, supporting reliable multi-agent coordination under realistic dynamic conditions. A demonstration video showcasing the system’s real-time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: https://github.com/Kooroshraf/UAV-UGV-Integration
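The two error metrics reported for heading angle prediction can be computed as in the sketch below. It is a generic illustration, not the authors' evaluation code: the wrap-around handling in `angle_error` is an assumption added here so that headings near 0° and 360° compare sensibly, and the sample values are made up.

```python
import math

def angle_error(pred_deg, true_deg):
    """Signed smallest difference between two headings, in degrees, in [-180, 180)."""
    return (pred_deg - true_deg + 180.0) % 360.0 - 180.0

def mae(preds, truths):
    """Mean absolute error over paired heading predictions."""
    errs = [abs(angle_error(p, t)) for p, t in zip(preds, truths)]
    return sum(errs) / len(errs)

def rmse(preds, truths):
    """Root mean squared error over paired heading predictions."""
    sq = [angle_error(p, t) ** 2 for p, t in zip(preds, truths)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical predictions vs. motion-capture ground truth, in degrees.
preds = [10.0, 359.0]
truths = [10.0, 1.0]
sample_mae = mae(preds, truths)    # errors are 0 and 2 degrees
sample_rmse = rmse(preds, truths)
```

RMSE weights large errors more heavily than MAE and is never smaller than it, which is consistent with the reported 0.1957° RMSE exceeding the 0.1506° MAE.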
[105] YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos cs.RO | cs.CVPDF
Ryan Meegan, Adam D’Souza, Bryan Bo Cao, Shubham Jain, Kristin Dana
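TL;DR: YOPO-Nav is a visual navigation method based on a single-pass exploration video. It builds a compact spatial representation from a graph of 3D Gaussian Splatting (3DGS) models and pairs it with a visual place recognition (VPR) module for efficient navigation.
Details
Motivation: Traditional navigation methods rely on detailed maps, which are costly to compute and store. YOPO-Nav builds a lightweight spatial representation from a single-pass exploration video, avoiding complex map maintenance.
Result: Experiments show that YOPO-Nav performs very well in real-world scenes and outperforms existing visual navigation methods.
Insight: Combining 3DGS with a VPR module gives a lightweight, efficient navigation solution suited to large environments.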
Abstract: Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot’s current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.
[106] LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating cs.RO | cs.AI | cs.CVPDF
Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen
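TL;DR: The paper presents LISN-Bench, the first simulation-based benchmark for language-instructed social navigation, and achieves a high success rate (91.3%) on social navigation tasks with the Social-Nav-Modulator method.
Details
Motivation: Existing social navigation research focuses mainly on path efficiency and pedestrian collision avoidance, overlooking compliance with user instructions and adaptation to social norms. The paper aims to fill this gap.
Result: The average success rate is 91.3%, 63% higher than the strongest baseline, with the largest gains on challenging tasks such as following a person in a crowd and avoiding instruction-forbidden regions.
Insight: Decoupling action generation from VLM reasoning reduces reliance on high-frequency VLM inference while improving dynamic adaptability and task completion.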
Abstract: Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, exceeding the most competitive baseline by 63%, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
eess.IV [Back]
[107] Agreement Disagreement Guided Knowledge Transfer for Cross-Scene Hyperspectral Imaging eess.IV | cs.CVPDF
Lu Huo, Haimin Zhang, Min Xu
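TL;DR: The paper proposes ADGKT, a framework that integrates "agreement" and "disagreement" mechanisms to improve knowledge transfer in cross-scene hyperspectral imaging, addressing gradient conflicts and dominant gradients.
Details
Motivation: Existing cross-scene hyperspectral imaging methods suffer from gradient conflicts and insufficient diversity of target features, which limits knowledge transfer.
Result: Experiments demonstrate the robustness and superiority of ADGKT in cross-scene hyperspectral imaging.
Insight: Exploiting agreement and disagreement information together can substantially improve cross-scene knowledge transfer.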
Abstract: Knowledge transfer plays a crucial role in cross-scene hyperspectral imaging (HSI). However, existing studies often overlook the challenges of gradient conflicts and dominant gradients that arise during the optimization of shared parameters. Moreover, many current approaches fail to simultaneously capture both agreement and disagreement information, relying only on a limited shared subset of target features and consequently missing the rich, diverse patterns present in the target scene. To address these issues, we propose an Agreement Disagreement Guided Knowledge Transfer (ADGKT) framework that integrates both mechanisms to enhance cross-scene transfer. The agreement component includes GradVac, which aligns gradient directions to mitigate conflicts between source and target domains, and LogitNorm, which regulates logit magnitudes to prevent domination by a single gradient source. The disagreement component consists of a Disagreement Restriction (DiR) and an ensemble strategy, which capture diverse predictive target features and mitigate the loss of critical target information. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in achieving robust and balanced knowledge transfer across heterogeneous HSI scenes.
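The LogitNorm component named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used formulation in which the logit vector is divided by its L2 norm and a temperature before cross-entropy; the abstract does not state the exact form used inside ADGKT, and the function names and the default `tau=0.04` are illustrative assumptions.

```python
import math

def logit_norm(logits, tau=0.04, eps=1e-7):
    """Rescale logits to (approximately) unit L2 norm, then divide by a temperature.

    tau and eps are illustrative defaults, not values taken from the paper.
    """
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    return [z / (norm * tau) for z in logits]

def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def logit_norm_loss(logits, target, tau=0.04):
    """Cross-entropy computed on normalized logits.

    The loss depends on the direction of the logit vector, not its magnitude,
    so a source that produces very large logits cannot dominate through scale alone.
    """
    return cross_entropy(logit_norm(logits, tau), target)
```

Scaling the raw logits by a constant leaves this loss essentially unchanged, which is the property that limits magnitude-driven domination.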
[108] DermETAS-SNA LLM: A Dermatology Focused Evolutionary Transformer Architecture Search with StackNet Augmented LLM Assistant eess.IV | cs.AI | cs.CVPDF
Nitya Phani Santosh Oruganty, Keerthi Vemula Murali, Chun-Kit Ngan, Paulo Bandeira Pinho
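TL;DR: The paper proposes an assistant that combines a dermatology-focused Evolutionary Transformer Architecture Search (ETAS) with a StackNet-augmented LLM. It dynamically learns skin-disease classifiers and provides medical descriptions, improving classification performance and clinical utility.
Details
Motivation: Dermatological diagnosis requires highly specialized classifiers and explanation tools, and existing methods are limited in both classification performance and clinical explanation.
Result: An F1-score of 56.30% across 23 skin-disease categories, a 16.06% relative improvement over SkinGPT-4 (48.51%), with a 92% agreement rate in expert evaluation.
Insight: Combining evolutionary search with a multi-classifier ensemble can substantially improve skin-disease classification, and the RAG pipeline makes the clinical explanations more useful.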
Abstract: Our work introduces the DermETAS-SNA LLM Assistant that integrates Dermatology-focused Evolutionary Transformer Architecture Search with StackNet Augmented LLM. The assistant dynamically learns skin-disease classifiers and provides medically informed descriptions to facilitate clinician-patient interpretation. Contributions include: (1) Developed an ETAS framework on the SKINCON dataset to optimize a Vision Transformer (ViT) tailored for dermatological feature representation and then fine-tuned binary classifiers for each of the 23 skin disease categories in the DermNet dataset to enhance classification performance; (2) Designed a StackNet architecture that integrates multiple fine-tuned binary ViT classifiers to enhance predictive robustness and mitigate class imbalance issues; (3) Implemented a RAG pipeline, termed Diagnostic Explanation and Retrieval Model for Dermatology, which harnesses the capabilities of the Google Gemini 2.5 Pro LLM architecture to generate personalized, contextually informed diagnostic descriptions and explanations for patients, leveraging a repository of verified dermatological materials; (4) Performed extensive experimental evaluations on 23 skin disease categories to demonstrate the performance increase, achieving an overall F1-score of 56.30% that surpasses SkinGPT-4 (48.51%) by a considerable margin, representing a relative performance increase of 16.06%; (5) Conducted a domain-expert evaluation, with eight licensed medical doctors, of the clinical responses generated by our AI assistant for seven dermatological conditions, which showed a 92% agreement rate with the assessments provided by our AI assistant; (6) Created a proof-of-concept prototype that fully integrates our DermETAS-SNA LLM into our AI assistant to demonstrate its practical feasibility for real-world clinical and educational applications.
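The 16.06% figure is consistent with a relative (not absolute) improvement computed from the two F1-scores quoted in the abstract. As a quick check:

```latex
\[
\frac{56.30 - 48.51}{48.51} \;=\; \frac{7.79}{48.51} \;\approx\; 0.1606 \;=\; 16.06\%
\]
```

The absolute gap is 7.79 percentage points of F1.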
[109] PathCo-LatticE: Pathology-Constrained Lattice-Of Experts Framework for Fully-supervised Few-Shot Cardiac MRI Segmentation eess.IV | cs.AI | cs.CV | cs.LGPDF
Mohamed Elbayumi, Mohammed S. M. Elbaz
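TL;DR: PathCo-LatticE is a fully supervised few-shot learning framework that replaces unlabeled data with pathology-guided synthetic supervision, improving zero-shot generalization in cardiac MRI segmentation.
Details
Motivation: Conventional few-shot learning methods rely on semi-supervised techniques that are sensitive to domain shift and validation bias, which limits zero-shot generalization.
Result: In a strict OOD setting, PathCo-LatticE outperforms four SOTA methods by 4.2-11% Dice with 7 labeled samples, and approaches fully supervised performance (within 1% Dice) with 19 labeled samples.
Insight: Synthetic data combined with a dynamic expert architecture can substantially improve few-shot generalization, especially on multi-center, multi-vendor data.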
Abstract: Few-shot learning (FSL) mitigates data scarcity in cardiac MRI segmentation but typically relies on semi-supervised techniques sensitive to domain shifts and validation bias, restricting zero-shot generalizability. We propose PathCo-LatticE, a fully supervised FSL framework that replaces unlabeled data with pathology-guided synthetic supervision. First, our Virtual Patient Engine models continuous latent disease trajectories from sparse clinical anchors, using generative modeling to synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) provides a leakage-free protocol that evaluates models online with progressively challenging synthetic samples, eliminating the need for real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks within a pathology-aware topology and activates the most relevant experts per input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning. We evaluated PathCo-LatticE in a strict out-of-distribution (OOD) setting, deriving all anchors and severity statistics from a single-source domain (ACDC) and performing zero-shot testing on the multi-center, multi-vendor M&Ms dataset. PathCo-LatticE outperforms four state-of-the-art FSL methods by 4.2-11% Dice starting from only 7 labeled anchors, and approaches fully supervised performance (within 1% Dice) with only 19 labeled anchors. The method shows superior harmonization across four vendors and generalization to unseen pathologies. [Code will be made publicly available].
stat.ML [Back]
[110] Don’t Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search stat.ML | cs.CL | cs.LGPDF
Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov
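TL;DR: The paper proposes generating candidate answers with beam search instead of multinomial sampling for consistency-based uncertainty quantification (UQ), improving performance and stability.
Details
Motivation: Current consistency-based UQ methods rely on multinomial sampling, which tends to produce duplicate answers and yields high-variance estimates.
Result: The method's stability and performance gains are validated on six QA datasets, reaching state-of-the-art UQ performance.
Insight: Beam search is more effective than multinomial sampling at generating diverse candidate answers, especially for tasks with peaked distributions.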
Abstract: Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
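The contrast between the two candidate sources can be sketched as follows. This is a hedged illustration, not the paper's estimator: it assumes a simple probability-weighted pairwise agreement score, exact-match similarity, and made-up answers and sequence probabilities.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def exact_match(a, b):
    """Toy similarity: 1.0 if two answers match after trimming and lowercasing."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def consistency_uncertainty(candidates, weights, similarity=exact_match):
    """Return 1 minus the weighted pairwise agreement among candidates."""
    w = normalize(weights)
    n = len(candidates)
    agreement = sum(
        w[i] * w[j] * similarity(candidates[i], candidates[j])
        for i in range(n)
        for j in range(n)
    )
    return 1.0 - agreement

# Multinomial sampling: uniform weights, duplicates are common for peaked distributions.
samples = ["Paris", "Paris", "Paris", "Lyon"]
u_sampling = consistency_uncertainty(samples, [1.0] * len(samples))

# Beam search: distinct candidates, weighted by (hypothetical) sequence probabilities.
beams = ["Paris", "Lyon", "Marseille"]
u_beam = consistency_uncertainty(beams, [0.7, 0.2, 0.1])
```

Because beam candidates are deterministic and distinct, repeated runs give the same estimate, whereas the sampled set changes from run to run.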
cs.LG [Back]
[111] Resolving Conflicts in Lifelong Learning via Aligning Updates in Subspaces cs.LG | cs.AI | cs.CLPDF
Yueer Zhou, Yichen Wu, Ying Wei
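TL;DR: PS-LoRA (Parameter Stability LoRA) aligns gradient updates within a subspace to address the catastrophic forgetting that LoRA suffers in continual learning because of gradient conflicts between tasks, and it outperforms existing methods.
Details
Motivation: In continual learning, LoRA often suffers catastrophic forgetting caused by conflicting gradients across tasks, which harms learning efficiency and stability.
Result: Outperforms existing methods on NLP and vision benchmarks while preserving the stability of learned representations.
Insight: Directional conflict between task gradients is the main cause of catastrophic forgetting, and subspace alignment combined with regularization can substantially mitigate it.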
Abstract: Low-Rank Adaptation (LoRA) enables efficient Continual Learning but often suffers from catastrophic forgetting due to destructive interference between tasks. Our analysis reveals that this degradation is primarily driven by antagonistic directional updates where new task gradients directly oppose the historical weight trajectory. To address this, we propose PS-LoRA (Parameter Stability LoRA), a framework designed to resolve conflicts by aligning updates within the optimization subspace. Our approach employs a dual-regularization objective that penalizes conflicting directions and constrains magnitude deviations to ensure consistency with prior knowledge. Additionally, we implement a magnitude-based merging strategy to consolidate sequential adapters into a robust representation without retraining. Experiments on NLP and Vision benchmarks show that PS-LoRA outperforms state-of-the-art methods by preserving the stability of learned representations while efficiently adapting to new domains.
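A dual-regularization term of the kind described above might look like the sketch below. The abstract does not give the actual objective, so the specific form (a hinge on negative cosine similarity plus a squared norm difference), the function names, and the coefficients `lam_dir` and `lam_mag` are all assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def dual_regularizer(new_update, prior_update, lam_dir=1.0, lam_mag=0.1, eps=1e-12):
    """Hypothetical penalty on a flattened adapter update.

    direction term: positive only when the new update opposes the prior trajectory
    magnitude term: penalizes deviation of the update norm from the prior norm
    """
    cos = dot(new_update, prior_update) / (norm(new_update) * norm(prior_update) + eps)
    direction_penalty = max(0.0, -cos)
    magnitude_penalty = (norm(new_update) - norm(prior_update)) ** 2
    return lam_dir * direction_penalty + lam_mag * magnitude_penalty
```

An update aligned with the prior direction and of equal norm incurs no penalty, while a directly opposing update incurs the full directional penalty.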
[112] Financial Instruction Following Evaluation (FIFE) cs.LG | cs.AI | cs.CLPDF
Glenn Matlin, Siddharth, Anirudh JM, Aditya Shukla, Yahya Hassan
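TL;DR: FIFE is a new high-difficulty benchmark for evaluating how well language models follow instructions on financial analysis tasks. Results show a clear performance hierarchy across open-weight, proprietary, and open-source models.
Details
Motivation: The high stakes and complexity of finance require language models to follow instructions precisely, yet existing models struggle with complex tasks, so a dedicated evaluation benchmark is needed.
Result: The top open-weight model (76.1 strict / 79.5 loose) outperforms the leading proprietary system (65.9 strict / 70.5 loose), while the best open-source models lag well behind (45.5 strict / 48.9 loose). No model fully satisfies FIFE's complex requirements.
Insight: Highly complex financial tasks remain a challenge for language models, although open-weight models can surpass proprietary systems on specific tasks.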
Abstract: Language Models (LMs) struggle with complex, interdependent instructions, particularly in high-stakes domains like finance where precision is critical. We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We evaluate 53 models (proprietary, open-weight, open-source) in a zero-shot setting. Our key findings reveal a clear performance hierarchy: the top open-weight model (76.1 strict / 79.5 loose) surpasses the leading proprietary system (65.9 strict / 70.5 loose), while the best open-source models lag significantly (45.5 strict / 48.9 loose). However, even top-performing models struggle with FIFE’s complex requirements, failing to achieve perfect compliance. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.
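The idea of chainable, verifiable constraints with strict and loose scores can be sketched as below. The abstract does not define FIFE's scoring, so this assumes an IFEval-style convention (strict checks the raw response; loose accepts the response if any lightly cleaned variant passes). The constraint helpers and the sample response are hypothetical.

```python
def max_words(n):
    """Constraint: response has at most n whitespace-separated words."""
    return lambda text: len(text.split()) <= n

def contains(keyword):
    """Constraint: response mentions the keyword (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def all_of(*checks):
    """Chain several constraints into one that requires all of them."""
    return lambda text: all(check(text) for check in checks)

def loose_variants(text):
    """Lightly cleaned versions of a response (assumed relaxation rules)."""
    lines = text.strip().split("\n")
    return [
        text,
        text.replace("*", ""),      # drop markdown emphasis markers
        "\n".join(lines[1:]),       # drop a leading preamble line
        "\n".join(lines[:-1]),      # drop a trailing sign-off line
    ]

def strict_pass(check, response):
    return check(response)

def loose_pass(check, response):
    return any(check(variant) for variant in loose_variants(response))

check = all_of(contains("EBITDA"), max_words(8))
response = "Sure, here is the answer:\nEBITDA rose 12% year over year."
strict_result = strict_pass(check, response)  # preamble pushes it over the word limit
loose_result = loose_pass(check, response)    # passes once the preamble is dropped
```

Per-constraint pass/fail values like these can also be averaged to give a fine-grained score per prompt.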
[113] Are Hypervectors Enough? Single-Call LLM Reasoning over Knowledge Graphs cs.LG | cs.CLPDF
Yezi Liu, William Youngwoo Chung, Hanning Chen, Calvin Yeung, Mohsen Imani
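TL;DR: PathHD is a lightweight, encoder-free knowledge graph reasoning framework that uses hyperdimensional computing and a single LLM call for efficient, interpretable reasoning, substantially reducing latency and GPU overhead.
Details
Motivation: Existing knowledge-graph reasoning methods depend on heavy neural encoders or multiple rounds of LLM calls, leading to high latency, high cost, and opaque decisions. PathHD aims to address these problems with an efficient, interpretable solution.
Result: On WebQSP, CWQ, and GrailQA, PathHD attains Hits@1 comparable to neural baselines while reducing latency by 40-60% and GPU memory by 3-5x, and it provides interpretable, path-grounded rationales.
Insight: Hyperdimensional computing is a viable foundation for efficient KG-LLM reasoning; well-designed representations can balance accuracy, efficiency, and interpretability.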
Abstract: Recent advances in large language models (LLMs) have enabled strong reasoning over both structured and unstructured knowledge. When grounded on knowledge graphs (KGs), however, prevailing pipelines rely on heavy neural encoders to embed and score symbolic paths or on repeated LLM calls to rank candidates, leading to high latency, GPU cost, and opaque decisions that hinder faithful, scalable deployment. We propose PathHD, a lightweight and encoder-free KG reasoning framework that replaces neural path scoring with hyperdimensional computing (HDC) and uses only a single LLM call per query. PathHD encodes relation paths into block-diagonal GHRR hypervectors, ranks candidates with blockwise cosine similarity and Top-K pruning, and then performs a one-shot LLM adjudication to produce the final answer together with cited supporting paths. Technically, PathHD is built on three ingredients: (i) an order-aware, non-commutative binding operator for path composition, (ii) a calibrated similarity for robust hypervector-based retrieval, and (iii) a one-shot adjudication step that preserves interpretability while eliminating per-path LLM scoring. On WebQSP, CWQ, and the GrailQA split, PathHD (i) attains comparable or better Hits@1 than strong neural baselines while using one LLM call per query; (ii) reduces end-to-end latency by $40$-$60\%$ and GPU memory by $3$-$5\times$ thanks to encoder-free retrieval; and (iii) delivers faithful, path-grounded rationales that improve error diagnosis and controllability. These results indicate that carefully designed HDC representations provide a practical substrate for efficient KG-LLM reasoning, offering a favorable accuracy-efficiency-interpretability trade-off.
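Order-aware path encoding followed by cosine ranking can be sketched with a simpler stand-in for the GHRR representation. The code below is not PathHD's operator: it uses bipolar hypervectors and position-dependent cyclic shifts, a classic HDC way to make composition order-sensitive. The dimension, relation names, and helper names are assumptions.

```python
import math
import random

DIM = 1024
_rng = random.Random(0)   # fixed seed so the codebook is reproducible
_codebook = {}

def relation_vector(name):
    """Random bipolar (+1/-1) hypervector for a relation, created on first use."""
    if name not in _codebook:
        _codebook[name] = [_rng.choice((-1, 1)) for _ in range(DIM)]
    return _codebook[name]

def shift(vec, k):
    """Cyclically rotate a vector by k positions."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)

def encode_path(relations):
    """Bind relations elementwise, shifting each by its position in the path.

    The position-dependent shift makes (r1, r2) and (r2, r1) map to different codes.
    """
    out = [1] * DIM
    for position, rel in enumerate(relations):
        shifted = shift(relation_vector(rel), position)
        out = [a * b for a, b in zip(out, shifted)]
    return out

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def top_k(query_path, candidate_paths, k=2):
    """Rank candidate relation paths by cosine similarity to the query path code."""
    q = encode_path(query_path)
    scored = [(cosine(q, encode_path(p)), p) for p in candidate_paths]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

In a pipeline like the one described, only the top-ranked paths would then be passed to the single LLM call for adjudication.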
quant-ph [Back]
[114] LiePrune: Lie Group and Quantum Geometric Dual Representation for One-Shot Structured Pruning of Quantum Neural Networks quant-ph | cs.CVPDF
Haijian Shao, Bowen Yang, Wei Liu, Xing Deng, Yingtao Jiang
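TL;DR: LiePrune is a structured pruning framework for quantum neural networks (QNNs) based on Lie groups and quantum geometry. It is presented as the first mathematically grounded one-shot structured pruning framework for QNNs, sharply reducing parameters while maintaining or improving task performance.
Details
Motivation: QNNs and parameterized quantum circuits (PQCs) are central to near-term quantum machine learning, but their scalability is limited by excessive parameters, training difficulties (such as barren plateaus), and hardware constraints. LiePrune addresses these problems with a mathematically grounded approach.
Result: Across quantum classification, generative modeling, and quantum chemistry tasks, it achieves over 10x compression with negligible loss or even improved performance.
Insight: LiePrune demonstrates the potential of structured pruning in quantum machine learning and offers a new direction for developing efficient quantum models.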
Abstract: Quantum neural networks (QNNs) and parameterized quantum circuits (PQCs) are key building blocks for near-term quantum machine learning. However, their scalability is constrained by excessive parameters, barren plateaus, and hardware limitations. We propose LiePrune, the first mathematically grounded one-shot structured pruning framework for QNNs that leverages Lie group structure and quantum geometric information. Each gate is jointly represented in a Lie group–Lie algebra dual space and a quantum geometric feature space, enabling principled redundancy detection and aggressive compression. Experiments on quantum classification (MNIST, FashionMNIST), quantum generative modeling (Bars-and-Stripes), and quantum chemistry (LiH VQE) show that LiePrune achieves over $10\times$ compression with negligible or even improved task performance, while providing provable guarantees on redundancy detection, functional approximation, and computational complexity.
cs.HC [Back]
[115] ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation cs.HC | cs.AI | cs.CVPDF
Boyin Yang, Puming Jiang, Per Ola Kristensson
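TL;DR: The paper designs ImageTalk, a multimodal AAC text generation system that uses image recognition and natural language generation to substantially improve communication efficiency for people living with motor neuron disease.
Details
Motivation: Traditional symbol-based AAC systems have limited vocabularies, and text entry solutions have low communication rates, so an efficient system is needed to meet users' communication needs.
Result: The system achieves keystroke savings of 95.6%, with consistent performance and high user satisfaction.
Insight: A multimodal approach that combines image recognition and natural language generation can yield efficient, user-friendly AAC system designs.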
Abstract: People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.
cs.AI [Back]
[116] SCOPE: Language Models as One-Time Teacher for Hierarchical Planning in Text Environments cs.AI | cs.CLPDF
Haoye Lu, Pavan Seshadri, Kaheer Suleman
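TL;DR: SCOPE proposes an LLM-based one-shot hierarchical planning method that uses LLM-generated subgoals only at initialization to pretrain a lightweight student model, substantially reducing computational cost.
Details
Motivation: Existing methods query the LLM frequently during training and inference, which is inefficient and leaves no room for adaptation. SCOPE addresses this by generating subgoals once to pretrain a student model.
Result: On TextCraft, SCOPE reaches a success rate of 0.56, above ADaPT's 0.52, and reduces inference time from 164.4 seconds to 3.0 seconds.
Insight: LLM-generated subgoals, even when suboptimal, can serve as an effective starting point for hierarchical goal decomposition in text-based planning tasks.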
Abstract: Long-term planning in complex, text-based environments presents significant challenges due to open-ended action spaces, ambiguous observations, and sparse feedback. Recent research suggests that large language models (LLMs) encode rich semantic knowledge about the world, which can be valuable for guiding agents in high-level reasoning and planning across both embodied and purely textual settings. However, existing approaches often depend heavily on querying LLMs during training and inference, making them computationally expensive and difficult to deploy efficiently. In addition, these methods typically employ a pretrained, unaltered LLM whose parameters remain fixed throughout training, providing no opportunity for adaptation to the target task. To address these limitations, we introduce SCOPE (Subgoal-COnditioned Pretraining for Efficient planning), a one-shot hierarchical planner that leverages LLM-generated subgoals only at initialization to pretrain a lightweight student model. Unlike prior approaches that distill LLM knowledge by repeatedly prompting the model to adaptively generate subgoals during training, our method derives subgoals directly from example trajectories. This design removes the need for repeated LLM queries, significantly improving efficiency, though at the cost of reduced explainability and potentially suboptimal subgoals. Despite their suboptimality, our results on the TextCraft environment show that LLM-generated subgoals can still serve as a strong starting point for hierarchical goal decomposition in text-based planning tasks. Compared to the LLM-based hierarchical agent ADaPT (Prasad et al., 2024), which achieves a 0.52 success rate, our method reaches 0.56 and reduces inference time from 164.4 seconds to just 3.0 seconds.
[117] Visual Categorization Across Minds and Models: Cognitive Analysis of Human Labeling and Neuro-Symbolic Integration cs.AI | cs.CV | cs.LGPDF
Chethana Prasad Kabgere
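TL;DR: This paper compares humans and deep learning models on labeling low-resolution images, examines the similarities and differences between human heuristics and AI feature processing, and motivates a neuro-symbolic architecture that combines symbolic reasoning with connectionist representations.
Details
Motivation: To study how humans and AI differ in handling ambiguous visual stimuli, providing a theoretical basis for building more interpretable and cognitively aligned AI systems.
Result: Humans show layered, heuristic decision strategies, whereas AI relies more on feature-based processing. The analysis motivates a unified neuro-symbolic approach.
Insight: Future AI systems need to combine symbolic reasoning with connectionist representations to achieve better interpretability and cognitive alignment.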
Abstract: Understanding how humans and AI systems interpret ambiguous visual stimuli offers critical insight into the nature of perception, reasoning, and decision-making. This paper examines image labeling performance across human participants and deep neural networks, focusing on low-resolution, perceptually degraded stimuli. Drawing from computational cognitive science, cognitive architectures, and connectionist-symbolic hybrid models, we contrast human strategies such as analogical reasoning, shape-based recognition, and confidence modulation with AI’s feature-based processing. Grounded in Marr’s tri-level hypothesis, Simon’s bounded rationality, and Thagard’s frameworks of representation and emotion, we analyze participant responses in relation to Grad-CAM visualizations of model attention. Human behavior is further interpreted through cognitive principles modeled in ACT-R and Soar, revealing layered and heuristic decision strategies under uncertainty. Our findings highlight key parallels and divergences between biological and artificial systems in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures that unify structured symbolic reasoning with connectionist representations. Such architectures, informed by principles of embodiment, explainability, and cognitive alignment, offer a path toward AI systems that are not only performant but also interpretable and cognitively grounded.
cs.DC [Back]
[118] A Distributed Framework for Privacy-Enhanced Vision Transformers on the Edge cs.DC | cs.CR | cs.CVPDF
Zihao Ding, Mufeng Zhu, Zhongze Tang, Sheng Wei, Yao Liu
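TL;DR: The paper proposes a distributed, hierarchical offloading framework for Vision Transformers that protects data privacy by partitioning visual data on an edge device and distributing it across multiple independent cloud servers.
Details
Motivation: To address data privacy problems in conventional cloud-based visual processing while accommodating resource-constrained edge devices.
Result: Experiments on SAM show that the method maintains near-baseline segmentation performance while substantially reducing the risk of data reconstruction and exposure.
Insight: The method offers a scalable, privacy-preserving solution for vision tasks in the edge-cloud continuum.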
Abstract: Nowadays, visual intelligence tools have become ubiquitous, offering all kinds of convenience and possibilities. However, these tools have high computational requirements that exceed the capabilities of resource-constrained mobile and wearable devices. While offloading visual data to the cloud is a common solution, it introduces significant privacy vulnerabilities during transmission and server-side computation. To address this, we propose a novel distributed, hierarchical offloading framework for Vision Transformers (ViTs) that addresses these privacy challenges by design. Our approach uses a local trusted edge device, such as a mobile phone or an Nvidia Jetson, as the edge orchestrator. This orchestrator partitions the user’s visual data into smaller portions and distributes them across multiple independent cloud servers. By design, no single external server possesses the complete image, preventing comprehensive data reconstruction. The final data merging and aggregation computation occurs exclusively on the user’s trusted edge device. We apply our framework to the Segment Anything Model (SAM) as a practical case study, which demonstrates that our method substantially enhances content privacy over traditional cloud-based approaches. Evaluations show our framework maintains near-baseline segmentation performance while substantially reducing the risk of content reconstruction and user data exposure. Our framework provides a scalable, privacy-preserving solution for vision tasks in the edge-cloud continuum.
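The orchestration flow described above (partition on the edge, distribute, merge on the edge) can be sketched as follows. This is an illustrative toy, not the paper's partitioning scheme for SAM: the square tiles, round-robin assignment, and the identity `remote_process` stand-in are assumptions, and the sketch makes no privacy claims of its own.

```python
def split_into_tiles(image, tile):
    """Cut a 2D list into (row, col, block) tiles of at most tile x tile pixels."""
    tiles = []
    for r in range(0, len(image), tile):
        for c in range(0, len(image[0]), tile):
            block = [row[c:c + tile] for row in image[r:r + tile]]
            tiles.append((r, c, block))
    return tiles

def assign_round_robin(tiles, n_servers):
    """Give each server a disjoint subset of tiles."""
    shards = [[] for _ in range(n_servers)]
    for i, t in enumerate(tiles):
        shards[i % n_servers].append(t)
    return shards

def remote_process(shard):
    """Stand-in for server-side computation; here it just copies each block."""
    return [(r, c, [row[:] for row in block]) for r, c, block in shard]

def merge_on_edge(results, height, width):
    """Reassemble per-tile outputs on the trusted edge device."""
    out = [[None] * width for _ in range(height)]
    for shard in results:
        for r, c, block in shard:
            for dr, row in enumerate(block):
                for dc, value in enumerate(row):
                    out[r + dr][c + dc] = value
    return out

# 4x4 toy "image", 2x2 tiles, two independent servers.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = split_into_tiles(image, tile=2)
shards = assign_round_robin(tiles, n_servers=2)
merged = merge_on_edge([remote_process(s) for s in shards], height=4, width=4)
```

With two or more servers, each shard holds only part of the tiles, and the full result exists only after the merge step on the edge device.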
[119] SynthPix: A lightspeed PIV images generator cs.DC | cs.CV | cs.LG | eess.IVPDF
Antonio Terpin, Alan Bonomi, Francesco Banelli, Raffaello D’Andrea
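TL;DR: SynthPix is a high-performance, parallel synthetic PIV image generator built on JAX. Its throughput is several orders of magnitude higher than that of existing tools, and it is intended to support training data-hungry reinforcement learning methods and developing fast flow estimation methods.
Details
Motivation: Existing PIV image generators are slow and cannot meet the demands of training data-hungry reinforcement learning methods or of real-time flow estimation. SynthPix aims to raise generation speed through high-performance parallel computing.
Result: SynthPix generates image pairs several orders of magnitude faster than existing tools, supporting large-scale data generation.
Insight: Through high-performance parallel computing, SynthPix shows how synthetic data generators can improve efficiency in fluid dynamics research.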
Abstract: We describe SynthPix, a synthetic image generator for Particle Image Velocimetry (PIV) with a focus on performance and parallelism on accelerators, implemented in JAX. SynthPix supports the same configuration parameters as existing tools but achieves a throughput several orders of magnitude higher in image-pair generation per second. SynthPix was developed to enable the training of data-hungry reinforcement learning methods for flow estimation and for reducing the iteration times during the development of fast flow estimation methods used in recent active fluids control studies with real-time PIV feedback. We believe SynthPix to be useful for the fluid dynamics community, and in this paper we describe the main ideas behind this software package.
cs.SD [Back]
[120] ORCA: Open-ended Response Correctness Assessment for Audio Question Answering cs.SD | cs.AI | cs.CLPDF
Šimon Sedláček, Sara Barahona, Bolaji Yusuf, Laura Herrera-Alarcón, Santosh Kesiraju
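TL;DR: ORCA is a framework for assessing the correctness of open-ended answers in audio question answering. It models the variability of human judgments with Beta distributions and predicts both expected correctness and uncertainty.
Details
Motivation: Traditional evaluation methods cannot capture the uncertainty in human judgments of open-ended answers, so a method is needed that both predicts correctness and quantifies uncertainty.
Result: On two audio QA benchmarks, ORCA reaches a Spearman correlation of 0.91 with mean human judgments, matching or outperforming LLM-judge baselines at lower compute cost.
Insight: ORCA not only improves evaluation accuracy but also provides uncertainty estimates, offering a new approach to assessing open-ended answers in audio QA.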
Abstract: Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff’s alpha). ORCA achieves 0.91 Spearman correlation with mean human judgments, matching or outperforming LLM-judge baselines while providing uncertainty estimates and requiring significantly less compute. We release our models, code, and curated dataset.
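How a Beta distribution yields both an expected correctness and an uncertainty can be shown with the standard moment formulas. This is a minimal sketch: how ORCA predicts the two parameters is not specified in the abstract, and the function name and example values are assumptions.

```python
def beta_summary(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution.

    mean     = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, variance

# Same expected correctness, different confidence (illustrative parameter values).
confident_mean, confident_var = beta_summary(8.0, 2.0)
contested_mean, contested_var = beta_summary(0.8, 0.2)
```

Both examples have an expected correctness of 0.8, but the second has much higher variance, which is the kind of annotator disagreement a single mean score would hide.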