cs.CV [Total: 82]
cs.CL [Total: 17]
cs.GR [Total: 1]
eess.IV [Total: 3]
cs.SD [Total: 1]
cs.AI [Total: 2]
cs.LG [Total: 6]
cs.RO [Total: 3]

cs.CV

[1] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation cs.CV

Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu

Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. By contrast, consistency in dynamic regions must still be learned from large-scale video depth data to ensure smooth transitions, because these regions violate triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching, mainly for static areas, with video depth diffusion, which maintains consistent depth transitions in dynamic areas. We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's state-of-the-art performance, showcasing its superior consistency and accuracy in video depth estimation.

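[2] ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations cs.CV | cs.RO

Zhiyuan Wu, Yongqiang Zhao, Shan Luo

TL;DR: ConViTac is a visual-tactile fusion representation learning network that strengthens feature alignment with contrastive representations. It proposes a Contrastive Embedding Conditioning (CEC) mechanism which, backed by self-supervised contrastive pretraining, improves modality fusion and significantly boosts downstream task performance.

Details

Motivation: Existing visual-tactile fusion methods mostly rely on simple concatenation or addition, which integrates features poorly and fails to exploit complementary multimodal information. ConViTac aims to improve fusion by strengthening cross-modal feature alignment through contrastive representations.

Result: On material classification and grasp prediction tasks, ConViTac improves accuracy by up to 12.0% over existing methods, validating the CEC mechanism and cross-modal attention.

Insight: Contrastive representation learning can markedly improve multimodal feature alignment, and cross-modal attention is key to efficient visual-tactile fusion. The work offers a new direction for multimodal robot perception.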

Abstract: Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tends to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive real-world experiments to demonstrate the superiority of ConViTac over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.

[3] How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? cs.CV | cs.HC | cs.RO | I.2.10; I.2.9; I.5.4; I.4.8; I.4.9; H.1.2

Stephanie Käs, Anton Burenko, Louis Markert, Onur Alp Culha, Dennis Mack

Abstract: Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, suggesting a possible path toward reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.

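[4] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models cs.CV | cs.AI

Cansu Korkmaz, Ahmet Murat Tekalp, Zafer Dogan

TL;DR: This paper proposes a framework that uses vision-language models (VLMs) to select the most trustworthy super-resolution (SR) sample generated by a diffusion model, quantifying SR reliability through semantic reasoning and a hybrid Trustworthiness Score (TWS).

Details

Motivation: Super-resolution (SR) is an ill-posed problem. Conventional methods struggle to balance fidelity and perceptual quality, and choosing the most trustworthy sample among the diverse SR outputs of a diffusion model remains challenging.

Result: TWS correlates strongly with human preference, and VLM-selected samples achieve higher TWS values, outperforming conventional metrics such as PSNR and LPIPS.

Insight: VLM semantic reasoning combined with quantitative TWS evaluation makes the selection of diffusion SR samples more reliable and sets a new bar for trustworthiness in generative SR.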

Abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.

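[5] FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization cs.CV | cs.AI

Ha Min Son, Shahbaz Rezaei, Xin Liu

TL;DR: FixCLR is a new method for semi-supervised domain generalization (SSDG) that adapts contrastive learning to explicitly learn domain-invariant representations, addressing a weakness of existing methods under label scarcity.

Details

Motivation: Because labels are scarce, existing SSDG methods learn domain-invariant representations poorly; FixCLR addresses this with contrastive learning.

Result: Experiments show that FixCLR performs strongly on multiple benchmarks, especially when combined with other semi-supervised methods.

Insight: By learning domain invariance explicitly, FixCLR tackles the core problem of label scarcity in SSDG and shows the potential of contrastive learning for domain adaptation.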

Abstract: Semi-supervised domain generalization (SSDG) aims to solve the problem of generalizing to out-of-distribution data when only a few labels are available. Due to label scarcity, domain generalization methods often underperform when applied directly. Consequently, existing SSDG methods combine semi-supervised learning methods with various regularization terms. However, these methods do not explicitly regularize to learn domain-invariant representations across all domains, which is a key goal for domain generalization. To address this, we introduce FixCLR. Inspired by success in self-supervised learning, we change two crucial components to adapt contrastive learning for explicit domain invariance regularization: utilization of class information from pseudo-labels and using only a repelling term. FixCLR can also be added on top of most existing SSDG and semi-supervised methods for complementary performance improvements. Our research includes extensive experiments that have not been previously explored in SSDG studies. These experiments include benchmarking different improvements to semi-supervised methods, evaluating the performance of pretrained versus non-pretrained models, and testing on datasets with many domains. Overall, FixCLR proves to be an effective SSDG method, especially when combined with other semi-supervised methods.
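
The two changes named in the abstract, class information taken from pseudo-labels and a repelling-only term, can be illustrated with a small sketch. The exact loss is not given here, so the log-sum-exp form, the cosine similarity, the `temperature` value, and the function name `repel_only_loss` are all assumptions made for illustration; this is not FixCLR's actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def repel_only_loss(embeddings, pseudo_labels, temperature=0.5):
    """Assumed repel-only contrastive term (hypothetical form).

    For each anchor, only samples whose pseudo-label differs from the
    anchor's are used, as negatives; there is no attracting term.
    The loss is the mean log-sum-exp of the scaled similarities.
    """
    per_anchor = []
    for i, z_i in enumerate(embeddings):
        scores = [
            cosine(z_i, z_j) / temperature
            for j, z_j in enumerate(embeddings)
            if pseudo_labels[j] != pseudo_labels[i]
        ]
        if not scores:
            continue  # this anchor has no negatives in the batch
        peak = max(scores)  # subtract the max for numerical stability
        per_anchor.append(peak + math.log(sum(math.exp(s - peak) for s in scores)))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


# Two samples with different pseudo-labels: identical vs. opposite embeddings.
aligned = repel_only_loss([[1.0, 0.0], [1.0, 0.0]], [0, 1])
separated = repel_only_loss([[1.0, 0.0], [-1.0, 0.0]], [0, 1])
```

In this toy batch `separated` is lower than `aligned`, so minimizing the term pushes samples with different pseudo-labels apart regardless of which domain they come from.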

[6] Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision cs.CV

Yuting He, Shuo Li

Abstract: Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models. However, extending CL to pixel-wise representation, crucial for medical vision, remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an over-dispersion problem, breaking pixel-wise feature correlation and thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining by modeling feature distances when regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models.

[7] Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation cs.CV

Ryosuke Kawamura, Hideaki Hayashi, Shunsuke Otake, Noriko Takemura, Hajime Nagahara

Abstract: Dynamic facial expression recognition (DFER) is a task that estimates emotions from facial expression video sequences. For practical applications, accurately recognizing ambiguous facial expressions – frequently encountered in in-the-wild data – is essential. In this study, we propose MIDAS, a data augmentation method designed to enhance DFER performance for ambiguous facial expression data using soft labels representing probabilities of multiple emotion classes. MIDAS augments training data by convexly combining pairs of video frames and their corresponding emotion class labels. This approach extends mixup to soft-labeled video data, offering a simple yet highly effective method for handling ambiguity in DFER. To evaluate MIDAS, we conducted experiments on both the DFEW dataset and FERV39k-Plus, a newly constructed dataset that assigns soft labels to an existing DFER dataset. The results demonstrate that models trained with MIDAS-augmented data achieve superior performance compared to the state-of-the-art method trained on the original dataset.
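
The convex combination of frame pairs and soft labels described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: frames are assumed to be flat lists of pixel values, the function name `mix_soft_labeled_clips` is hypothetical, and the mixing coefficient `lam` is passed in directly (standard mixup draws it from a Beta distribution, which is an assumption here rather than something the abstract states).

```python
def mix_soft_labeled_clips(clip_a, label_a, clip_b, label_b, lam):
    """Convexly combine two clips frame by frame, and their soft labels.

    clip_*:  list of frames, each frame a flat list of pixel values
    label_*: soft label, one probability per emotion class
    lam:     mixing coefficient in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(clip_a) != len(clip_b) or len(label_a) != len(label_b):
        raise ValueError("clips and labels must have matching lengths")
    mixed_clip = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(clip_a, clip_b)
    ]
    mixed_label = [lam * ya + (1.0 - lam) * yb for ya, yb in zip(label_a, label_b)]
    return mixed_clip, mixed_label


# Toy data: two 2-frame clips with 2 pixels per frame, 3 emotion classes.
clip_a = [[0.0, 1.0], [1.0, 1.0]]
clip_b = [[1.0, 0.0], [0.0, 0.0]]
label_a = [0.7, 0.3, 0.0]
label_b = [0.0, 0.5, 0.5]
mixed_clip, mixed_label = mix_soft_labeled_clips(clip_a, label_a, clip_b, label_b, 0.25)
```

Because both inputs are probability vectors and the weights sum to one, the mixed label remains a valid probability distribution. In a training loop, `lam` could be drawn per pair with `random.betavariate(alpha, alpha)` for some chosen `alpha`.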

[8] THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion cs.CV | cs.AI | I.4.8; I.2.10

Calin Teodor Ioan

Abstract: Monocular depth estimation methods traditionally train deep models to infer depth directly from RGB pixels. This implicit learning often overlooks explicit monocular cues that the human visual system relies on, such as occlusion boundaries, shading, and perspective. Rather than expecting a network to discover these cues unaided, we present ThirdEye, a cue-aware pipeline that deliberately supplies each cue through specialised, pre-trained, and frozen networks. These cues are fused in a three-stage cortical hierarchy (V1->V2->V3) equipped with a key-value working-memory module that weights them by reliability. An adaptive-bins transformer head then produces a high-resolution disparity map. Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning. This extended version provides additional architectural detail, neuroscientific motivation, and an expanded experimental protocol; quantitative results will appear in a future revision.

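[9] The Role of Cyclopean-Eye in Stereo Vision cs.CV

Sherlon Almeida da Silva, Davi Geiger, Luiz Velho, Moacir Antonelli Ponti

TL;DR: This study examines the geometric foundations of modern stereo vision systems, focusing on the Cyclopean Eye model and its contribution to depth reconstruction. It proposes new geometric constraints and examines the roles of deep-learning feature matching and attention mechanisms.

Details

Motivation: The accuracy of stereo vision systems in 3D reconstruction depends on geometric foundations and human-inspired perception. The paper aims to combine geometric priors with learned features to improve depth reconstruction quality.

Result: Experiments confirm that combining geometric priors with learned features markedly improves 3D reconstruction accuracy, particularly under occlusions and depth discontinuities.

Insight: Combining geometric constraints with learned features is key to understanding stereo vision systems, and attention mechanisms play an important role in recovering meaningful 3D surfaces.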

Abstract: This work investigates the geometric foundations of modern stereo vision systems, with a focus on how 3D structure and human-inspired perception contribute to accurate depth reconstruction. We revisit the Cyclopean Eye model and propose novel geometric constraints that account for occlusions and depth discontinuities. Our analysis includes the evaluation of stereo feature matching quality derived from deep learning models, as well as the role of attention mechanisms in recovering meaningful 3D surfaces. Through both theoretical insights and empirical studies on real datasets, we demonstrate that combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems.

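[10] FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing cs.CV

Advait Gupta, Rishie Raj, Dang Nguyen, Tianyi Zhou

TL;DR: The paper proposes FaSTA$^*$, a neurosymbolic agent that combines fast, high-level subtask planning by an LLM with precise, low-level tool invocation via A$^*$ search to handle multi-turn image editing tasks efficiently and at low cost. Mining reusable subroutines further improves computational efficiency.

Details

Motivation: Multi-turn image editing tasks (e.g., detection, recoloring, object removal) usually require complex sequences of tool calls; existing methods are computationally expensive and lack generality.

Result: FaSTA$^*$ is significantly more computationally efficient than existing methods while matching the current state of the art in success rate.

Insight: A hybrid agent combining neural (LLM) and symbolic (A$^*$ search) methods can handle complex tasks efficiently; knowledge reuse, such as subroutine mining, is key to the efficiency gain.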

Abstract: We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as “Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow.” It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A$^*$ search per subtask to find a cost-efficient toolpath, i.e., a sequence of calls to AI tools. To save the cost of A$^*$ on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A$^*$ search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent “FaSTA$^*$”: fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A$^*$ search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA$^*$ is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.
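
The fast-slow control flow described above (try a stored subroutine first, fall back to A* search only when it fails) can be sketched on a toy tool graph. Everything concrete below is an assumption for illustration: the state and tool names, the costs, the zero heuristic, and the use of a plain dictionary cache as a stand-in for the LLM-mined subroutines in the paper.

```python
import heapq


def a_star(start, goal, edges, heuristic):
    """A* over a tool graph; edges maps state -> [(tool, next_state, cost)]."""
    frontier = [(heuristic(start), 0.0, start, [])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if cost > best_cost.get(state, float("inf")):
            continue  # stale queue entry, a cheaper route was found already
        if state == goal:
            return path, cost
        for tool, nxt, step_cost in edges.get(state, []):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier, (new_cost + heuristic(nxt), new_cost, nxt, path + [tool])
                )
    return None, float("inf")


def replay(start, toolpath, edges):
    """Follow a stored toolpath; return the final state, or None if a step fails."""
    state = start
    for tool in toolpath:
        state = next((nxt for name, nxt, _ in edges.get(state, []) if name == tool), None)
        if state is None:
            return None
    return state


def fast_slow_plan(start, goal, subroutines, edges, heuristic):
    """Try a cached subroutine first (fast); otherwise run A* (slow) and cache it."""
    cached = subroutines.get((start, goal))
    if cached is not None and replay(start, cached, edges) == goal:
        return cached, "fast"
    path, _ = a_star(start, goal, edges, heuristic)
    if path is not None:
        subroutines[(start, goal)] = path  # reuse on later, similar subtasks
    return path, "slow"


# Hypothetical tool graph for a single "recolor an object" subtask.
EDGES = {
    "image": [("detect", "masked", 1.0), ("inpaint_recolor", "recolored", 5.0)],
    "masked": [("recolor", "recolored", 2.0)],
}
subroutines = {}
first = fast_slow_plan("image", "recolored", subroutines, EDGES, lambda s: 0.0)
second = fast_slow_plan("image", "recolored", subroutines, EDGES, lambda s: 0.0)
```

The first call has no stored subroutine and runs the search, which prefers the two-step path (total cost 3) over the single expensive tool (cost 5). The second call replays the stored path without searching.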

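[11] M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization cs.CV

Ju-Hyeon Nam, Dong-Hyun Moon, Sang-Chul Lee

TL;DR: M2SFormer is a Transformer encoder-based framework that combines multi-spectral and multi-scale attention with edge-aware difficulty guidance, significantly improving image forgery localization.

Details

Motivation: Existing deep learning methods achieve high accuracy in pixel-level forgery localization but suffer from heavy computational overhead and limited representation power, especially for subtle or complex tampering. A more efficient framework is therefore needed.

Result: Experiments on multiple benchmark datasets show that M2SFormer outperforms existing state-of-the-art models and generalizes better in cross-domain detection.

Insight: Fusing frequency-domain and spatial information with dynamic difficulty guidance captures complex forgery traces more effectively, improving robustness and accuracy.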

Abstract: Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.

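[12] PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling cs.CV

Hao Zhang, Haolan Xu, Chun Feng, Varun Jampani, Narendra Ahuja

TL;DR: The paper proposes PhysRig, a differentiable physics-based skinning and rigging framework that addresses the shortcomings of traditional Linear Blend Skinning (LBS) in simulating elastic materials and complex deformations.

Details

Motivation: Traditional LBS performs poorly on elastic materials (e.g., soft tissue, fur) and causes volume loss and unnatural deformation. PhysRig overcomes these problems through physical simulation.

Result: PhysRig outperforms traditional LBS in producing realistic, physically plausible deformations and demonstrates its applicability on the pose transfer task.

Insight: Using physical simulation to make skinning and rigging more realistic is an effective way to handle complex deformations.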

Abstract: Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task, highlighting its versatility for articulated object modeling.
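
For context on the baseline being replaced, standard Linear Blend Skinning moves each vertex to a weighted average of its positions under the bone transforms, v' = sum_j w_j (R_j v + t_j). The sketch below implements that textbook formula and shows the volume-loss artifact in its simplest form; it illustrates LBS only, not PhysRig, and the two-bone setup and all names are made up for the example.

```python
def transform(rotation, translation, point):
    """Apply a rigid transform R p + t to a 3D point."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


def linear_blend_skinning(vertex, weights, bones):
    """Blend the per-bone transformed positions: v' = sum_j w_j (R_j v + t_j)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skinning weights should sum to 1")
    blended = [0.0, 0.0, 0.0]
    for w, (rotation, translation) in zip(weights, bones):
        moved = transform(rotation, translation, vertex)
        blended = [b + w * m for b, m in zip(blended, moved)]
    return blended


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
HALF_TURN_Z = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]  # 180 degrees about z
NO_SHIFT = [0.0, 0.0, 0.0]

# A vertex influenced equally by a resting bone and a bone twisted by 180 degrees.
bones = [(IDENTITY, NO_SHIFT), (HALF_TURN_Z, NO_SHIFT)]
collapsed = linear_blend_skinning([1.0, 0.0, 0.0], [0.5, 0.5], bones)
```

The vertex started one unit away from the twist axis but lands on the axis itself: averaging the two rigid poses linearly shrinks the surface toward the joint, which is the volume loss the abstract refers to.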

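[13] AIR-VIEW: The Aviation Image Repository for Visibility Estimation of Weather, A Dataset and Benchmark cs.CV

Chad Mourning, Zhewei Wang, Justin Murray

TL;DR: The paper presents AIR-VIEW, a new dataset for aviation weather visibility estimation that fills a gap in publicly available datasets, and validates it through benchmarking.

Details

Motivation: The lack of public datasets suited to aviation visibility estimation has hindered the use of machine learning in this area. This paper aims to fill that gap.

Result: Benchmarks show that the proposed dataset and models can be used effectively for aviation visibility estimation, with results compared against an ASTM standard.

Insight: The dataset is an important resource for research on aviation weather visibility estimation and supports work on low-cost alternatives to traditional sensors.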

Abstract: Machine learning for aviation weather is a growing area of research that aims to provide low-cost alternatives to traditional, expensive weather sensors. However, for atmospheric visibility estimation there are no publicly available datasets that are tagged with visibility estimates, cover distances relevant to aviation, span diverse locations, and are large enough for supervised learning. This paper introduces a new dataset suitable for this purpose, the culmination of a year-long campaign collecting images from the FAA weather camera network. We also present a benchmark of three commonly used approaches and a general-purpose baseline, trained and tested on three publicly available datasets in addition to our own, and compared against a recently ratified ASTM standard.

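[14] Hierarchical Sub-action Tree for Continuous Sign Language Recognition cs.CV | cs.MM

Dejie Yang, Zhu Xu, Xinjie Gao, Yang Liu

TL;DR: The paper proposes HST-CSLR, which builds a Hierarchical Sub-action Tree (HST) by combining textual information with visual representation learning to improve continuous sign language recognition.

Details

Motivation: In continuous sign language recognition (CSLR), the lack of large datasets and precise annotations has become a bottleneck. Existing methods do not fully exploit knowledge from the text modality, which limits performance.

Result: The method's effectiveness is validated on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture).

Insight: The HST structure not only makes modality alignment more efficient but also reduces computational complexity through its tree hierarchy.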

Abstract: Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.

Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang

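TL;DR: OmniEval is a new benchmark for evaluating omni-modality models that accept visual, auditory, and textual inputs, emphasizing modality collaboration, task diversity, and video diversity.

Details

Motivation: Existing benchmarks do not adequately evaluate the collaborative abilities of omni-modality models, especially in scenarios where audio and video are strongly coupled. OmniEval fills this gap.

Result: Experiments validate OmniEval as an evaluation for omni-modality models and report the performance of models such as MiniCPM-O 2.6.

Insight: Omni-modal collaboration and task diversity are key to improving a model's ability to understand and construct contextual coherence.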

Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos (285 Chinese and 525 English); (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. We then conduct experiments on OmniEval with several omni-modality models. We hope that our OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. Code and data can be found at https://omnieval.github.io/.

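[16] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology cs.CV | cs.AI

Chengkuan Chen, Luca L. Weishaupt, Drew F. K. Williamson, Richard J. Chen, Tong Ding

TL;DR: The paper presents PathChat+ and SlideSeek, a multimodal large language model (MLLM) and a multi-agent AI system designed for pathology. They address the shortcomings of traditional methods in reasoning jointly over natural language and images and deliver substantial gains.

Details

Motivation: Traditional pathology AI models lack support for natural language instructions and textual context, and fall short on multi-image understanding and autonomous diagnostic reasoning.

Result: PathChat+ substantially outperforms existing models on several pathology benchmarks; SlideSeek reaches high accuracy on DDxBench and can generate interpretable reports.

Insight: Multimodal reasoning that combines natural language and images can markedly improve the diagnostic accuracy and practical usefulness of pathology AI.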

Abstract: Pathology is experiencing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous, diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question-answer turns. Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state-of-the-art (SOTA) general-purpose and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, while also being capable of generating visually grounded, human-interpretable summary reports.

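[17] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing cs.CV | cs.AI

Lingling Cai, Kang Zhao, Hangjie Yuan, Xiang Wang, Yingya Zhang

TL;DR: DFVEdit is an efficient zero-shot video editing method designed for Video DiTs. It operates directly in latent space via flow transformation, avoiding attention modification and fine-tuning, and significantly improves computational efficiency and editing quality.

Details

Motivation: Existing video editing methods incur heavy computational overhead when applied to Video DiTs. DFVEdit aims to solve this with a more efficient editing approach.

Result: On Video DiTs, DFVEdit achieves at least a 20x inference speed-up and an 85% memory reduction while reaching SOTA on structural fidelity, spatial-temporal consistency, and editing quality.

Insight: The flow-transformation perspective offers an efficient, unified approach to video editing that needs neither attention modification nor fine-tuning, suggesting a new direction for editing with future video generation models.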

Abstract: The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) – a theoretically unbiased estimation of DFV – and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.

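[18] From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging cs.CV | cs.AI

Tao Liu, Dafeng Zhang, Gengchen Li, Shizhuo Liu, Yongqi Song

TL;DR: The paper proposes Cradle2Cane, a two-pass framework for high-fidelity face aging that resolves the trade-off between age accuracy (Age) and identity consistency (ID) through adaptive noise injection and identity-aware embeddings.

Details

Motivation: Existing face aging methods struggle to produce realistic, seamless transformations across large age gaps or extreme head poses, and they face a trade-off between age accuracy and identity consistency.

Result: Experiments on the CelebA-HQ test dataset show that Cradle2Cane outperforms existing methods in age accuracy and identity consistency, as evaluated with the Face++ and Qwen-VL protocols.

Insight: By handling age accuracy and identity consistency in separate stages, the two-pass framework resolves the Age-ID trade-off in face aging and offers a new direction for related work.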

Abstract: Face aging has become a crucial task in computer vision, with applications ranging from entertainment to healthcare. However, existing methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. The core challenge lies in balancing age accuracy and identity preservation, which we refer to as the Age-ID trade-off. Most prior methods either prioritize age transformation at the expense of identity consistency or vice versa. In this work, we address this issue by proposing a two-pass face aging framework, named Cradle2Cane, based on few-step text-to-image (T2I) diffusion models. The first pass focuses on solving age accuracy by introducing an adaptive noise injection (AdaNI) mechanism. This mechanism is guided by including prompt descriptions of age and gender for the given person as the textual condition. Also, by adjusting the noise level, we can control the strength of aging while allowing more flexibility in transforming the face. However, identity preservation is weakly ensured here to facilitate stronger age transformations. In the second pass, we enhance identity preservation while maintaining age-specific features by conditioning the model on two identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP. This pass allows for denoising the transformed image from the first pass, ensuring stronger identity preservation without compromising the aging accuracy. Both passes are jointly trained in an end-to-end way. Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that our Cradle2Cane outperforms existing face aging methods in age accuracy and identity consistency.

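[19] 3D Scene-Camera Representation with Joint Camera Photometric Optimization cs.CV

Weichen Dai, Kangcheng Ma, Jiaxin Wang, Kecen Pan, Yuhang Ming

TL;DR: The paper proposes a 3D scene-camera representation with joint camera photometric optimization. By optimizing photometric model parameters together with a depth regularization, it markedly improves the quality of 3D scene representations, especially under imaging degradation.

Details

Motivation: Photometric distortions in multi-view images (e.g., vignetting, dirt) lower image quality and in turn reduce the accuracy of 3D scene modeling. Conventional methods do not fully account for these distortions, so scene-unrelated information leaks into the scene representation.

Result: Experiments show that the method produces high-quality 3D scene representations even under imaging degradation such as vignetting and dirt.

Insight: The camera photometric model is a part of 3D scene modeling that should not be ignored; jointly optimizing scene and camera parameters can markedly improve representation quality.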

Abstract: Representing scenes from multi-view images is a crucial task in computer vision with extensive applications. However, inherent photometric distortions in camera imaging can significantly degrade image quality. Without accounting for these distortions, the 3D scene representation may inadvertently incorporate erroneous information unrelated to the scene, diminishing the quality of the representation. In this paper, we propose a novel 3D scene-camera representation with joint camera photometric optimization. By introducing internal and external photometric models, we propose a full photometric model and a corresponding camera representation. By simultaneously optimizing the parameters of the camera representation, the proposed method effectively separates scene-unrelated information from the 3D scene representation. Additionally, during the optimization of the photometric parameters, we introduce a depth regularization to prevent the 3D scene representation from fitting scene-unrelated information. By incorporating the camera model as part of the mapping process, the proposed method constructs a complete map that includes both the scene radiance field and the camera photometric model. Experimental results demonstrate that the proposed method can achieve high-quality 3D scene representations, even under conditions of imaging degradation, such as vignetting and dirt.

[20] TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation cs.CV

Chade Li, Pengju Zhang, Yihong Wu

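TL;DR: TSDASeg is a two-stage model that combines a direct cross-modal alignment module with a memory module for interactive point cloud segmentation, addressing the insufficient alignment between 3D point clouds and text/2D image data in existing methods.

Details

Motivation: Existing point cloud segmentation methods underperform on point-level tasks, mainly because they lack direct 3D-text alignment and therefore cannot effectively link local 3D features with textual context.

Result: TSDASeg achieves state-of-the-art performance on multiple 3D instruction, reference, and semantic segmentation datasets.

Insight: Explicit cross-modal alignment and a dynamic memory mechanism can significantly improve the accuracy and consistency of interactive point cloud segmentation.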

Abstract: The rapid advancement of 3D vision-language models (VLMs) has spurred significant interest in interactive point cloud processing tasks, particularly for real-world applications. However, existing methods often underperform in point-level tasks, such as segmentation, due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. To solve this problem, we propose TSDASeg, a Two-Stage model coupled with a Direct cross-modal Alignment module and memory module for interactive point cloud Segmentation. We introduce the direct cross-modal alignment module to establish explicit alignment between 3D point clouds and textual/2D image data. Within the memory module, we employ multiple dedicated memory banks to separately store text features, visual features, and their cross-modal correspondence mappings. These memory banks are dynamically leveraged through self-attention and cross-attention mechanisms to update scene-specific features based on prior stored data, effectively addressing inconsistencies in interactive segmentation results across diverse scenarios. Experiments conducted on multiple 3D instruction, reference, and semantic segmentation datasets demonstrate that the proposed method achieves state-of-the-art performance.

[21] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance cs.CV | cs.LG | cs.SD | eess.AS

Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

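TL;DR: The paper proposes a step-by-step method for synthesizing audio from video. Using negative audio guidance, it sequentially generates separate audio tracks for specific sound events, much like a traditional Foley workflow, and combines them into high-quality composite audio.

Details

Motivation: Conventional video-to-audio synthesis methods usually generate a single audio track and cannot capture all of the sound events in a video. Inspired by Foley workflows, this work generates and combines multiple tracks step by step to reproduce the sounds in a video more completely.

Result: Experiments show that the method generates multiple semantically distinct audio tracks for a single input video, and that the resulting composite audio is of higher quality than that of existing methods.

Insight: A Foley-style approach that generates independent tracks step by step and then combines them can improve the realism and richness of video-to-audio synthesis.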

Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.

[22] DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting cs.CV

Yeon-Ji Song, Jaein Kim, Byung-Ju Kim, Byoung-Tak Zhang

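TL;DR: The paper proposes DBMovi-GS, which tackles dynamic scene synthesis from blurry monocular video, using sparse-controlled Gaussian splatting to restore sharpness and reconstruct the 3D geometry of the dynamic scene.

Details

Motivation: Existing novel view synthesis methods rely on high-resolution images or assumptions of static geometry and struggle with dynamic, blurry scenes. This work aims to remove that limitation and improve synthesis in dynamic blurry settings.

Result: The model achieves robust novel view synthesis in dynamic blurry scenes and sets a new benchmark for blurry monocular video input.

Insight: Blur and geometric change in dynamic scenes can be handled effectively with sparse-controlled Gaussian splatting, which offers a new approach to synthesizing complex real-world scenes.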

Abstract: Novel view synthesis is the task of generating scenes from unseen perspectives; however, synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge. Existing novel view synthesis methods are often constrained by their reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. Consequently, their approaches lack robustness in real-world environments with dynamic object and camera motion, leading to instability and degraded visual fidelity. To address this, we propose Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry videos and reconstructing detailed 3D geometry of the scene affected by dynamic motion variations. Our model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs.

[23] Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology cs.CV

Qiuyi Qi, Xin Li, Ming Kong, Zikang Xu, Bingdi Chen

Abstract: Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at https://github.com/Joey-Qi/SAIC.

[24] VisionGuard: Synergistic Framework for Helmet Violation Detection cs.CV

Lam-Huy Nguyen, Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Minh-Triet Tran

Abstract: Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.
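
The idea behind the Adaptive Labeling module, one persistent label per tracked object, can be illustrated with a vote over the per-frame predictions of each track. The following is a minimal sketch rather than the authors' implementation: the function name `smooth_track_labels`, the tuple layout of a detection, and the confidence-weighted vote are all assumptions made for illustration.

```python
from collections import defaultdict


def smooth_track_labels(detections):
    """Assign one persistent class label to every detection in a track.

    detections: list of (frame_idx, track_id, label, confidence) tuples.
    This layout is hypothetical; it stands in for the output of a
    frame-wise detector followed by any tracking algorithm.
    """
    # Sum the confidence of each label within each track.
    votes = defaultdict(lambda: defaultdict(float))
    for _, track_id, label, conf in detections:
        votes[track_id][label] += conf

    # The label with the highest summed confidence wins; sorting the
    # candidates first makes ties resolve alphabetically.
    winner = {
        track_id: max(sorted(scores), key=lambda lab: scores[lab])
        for track_id, scores in votes.items()
    }

    # Rewrite each detection with its track's winning label.
    return [(f, t, winner[t], c) for f, t, _, c in detections]
```

A single low-confidence misclassification inside a track is overwritten by the label that dominates the rest of the track, which is the kind of correction the abstract describes.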

[25] Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning cs.CV

Tyler Ward, Xiaoqin Wang, Braxton McFarland, Md Atik Ahamed, Sahar Nozad

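TL;DR: The paper proposes a deep learning framework that combines the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL) to improve the accuracy and speed of intraoperative specimen margin assessment in breast cancer lumpectomy.

Details

Motivation: 2D specimen radiography (SR), the current method for assessing intraoperative specimen margin status, has limited accuracy, so nearly a quarter of patients require additional surgery. The authors propose a new method to address this.

Result: The method achieves an AUC of 0.8455 for margin classification, improves Dice similarity by 27.4% over baseline models, and reduces inference time to 47 milliseconds per image.

Insight: The method has the potential to reduce re-excision rates in breast cancer treatment and improve surgical outcomes.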

Abstract: Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known malignancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at https://github.com/tbwa233/FFCL-SAM/.
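
For readers unfamiliar with the Dice similarity reported above, the metric is twice the overlap of two binary masks divided by the sum of their sizes. The sketch below is a generic definition, not code from the FFCL-SAM repository; the function name and the convention for two empty masks are assumptions.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two binary masks given as nested lists of 0/1."""
    pred_flat = [v for row in pred for v in row]
    target_flat = [v for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)

    # Convention chosen here (an assumption): two empty masks match perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.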

[26] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion cs.CV

Bang Gong, Luchao Qi, Jiaye Wu, Zhicheng Fu, Chunbo Song

Abstract: We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.

[27] User-in-the-Loop View Sampling with Error Peaking Visualization cs.CV

Ayaka Yasunaga, Hideo Saito, Shohei Mori

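TL;DR: The paper proposes a user-in-the-loop view sampling method that guides users to capture new views by visualizing error peaks, avoiding the restrictions of conventional 3D annotations and improving capture efficiency and user satisfaction for AR scenes.

Details

Motivation: In existing AR systems, users capture new views by following 3D annotations, which is mentally demanding and restricts how much of the scene can be explored. This work aims to provide a freer way to capture views.

Result: Experiments show that the method reduces users' mental load, improves the quality of the final results, and requires fewer view samples.

Insight: Visualizing errors lets users see more intuitively which view regions still need to be filled in, so they can complete data capture efficiently.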

Abstract: Augmented reality (AR) provides ways to visualize missing view samples for novel view synthesis. Existing approaches present 3D annotations for new view samples and task users with taking images by aligning the AR display. This data collection task is known to be mentally demanding and limits capture areas to pre-defined small areas due to the ideal but restrictive underlying sampling theory. To free users from 3D annotations and limited scene exploration, we propose using locally reconstructed light fields and visualizing errors to be removed by inserting new views. Our results show that the error-peaking visualization is less invasive, reduces disappointment in final results, and is satisfactory with fewer view samples in our mobile view synthesis system. We also show that our approach can contribute to recent radiance field reconstruction for larger scenes, such as 3D Gaussian splatting.

[28] Bridging Video Quality Scoring and Justification via Large Multimodal Models cs.CV

Qizhi Xie, Kun Yuan, Yunpeng Qu, Jiachao Gong, Mingda Wu

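TL;DR: The paper proposes Score-based Instruction Generation (SIG), an automated pipeline that generates instruction data for video quality assessment (VQA), and builds the Score2Instruct (S2I) dataset of more than 320K instruction-response pairs. A progressive tuning strategy then improves the quality scoring and justification abilities of video large multimodal models (LMMs).

Details

Motivation: Conventional video quality assessment methods output only a numerical score and cannot describe a video's complex quality dimensions. Instruction tuning that exploits the generative ability of large multimodal models can address this, but existing approaches rely on human annotation and proprietary systems, which limits data scalability.

Result: Experiments show that the method consistently improves the quality scoring and justification abilities of video LMMs on S2I-Bench and existing benchmarks.

Insight: Automated data generation combined with a hierarchical CoT design can efficiently replace human annotation while improving the model's reasoning ability.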

Abstract: Classical video quality assessment (VQA) methods generate a numerical score to judge a video’s perceived visual fidelity and clarity. Yet, a score fails to describe the video’s complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system’s reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs’ quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.
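
The first SIG step, mapping per-dimension scores to text-defined levels, can be sketched as a simple binning function. Everything specific below is an assumption for illustration: the five level names, the [0, 1] score range, the equal-width bins, and the dimension names in the usage note are not taken from the paper.

```python
# Assumed level names; the abstract only says the levels are text-defined.
LEVELS = ("bad", "poor", "fair", "good", "excellent")


def score_to_level(score, low=0.0, high=1.0, levels=LEVELS):
    """Map a numeric quality score to a text level using equal-width bins."""
    if not low <= score <= high:
        raise ValueError("score outside the expected range")
    # Scale to [0, len(levels)] and clamp so that score == high
    # falls into the top bin instead of one past the end.
    idx = int((score - low) * len(levels) / (high - low))
    return levels[min(idx, len(levels) - 1)]


def describe(dimension_scores):
    """Convert a {dimension: score} mapping into {dimension: level}."""
    return {dim: score_to_level(s) for dim, s in dimension_scores.items()}
```

Calling `describe({"sharpness": 0.9, "noise": 0.3})` yields a per-dimension level dictionary that could then be slotted into an instruction template, which is the role the text-defined levels play in the pipeline described above.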

[29] FedSC: Federated Learning with Semantic-Aware Collaboration cs.CV

Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Jiahua Shi

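TL;DR: FedSC is a federated learning method based on semantic-aware collaboration. It uses semantic information within each client to address data heterogeneity, constructing relational prototypes and consistent prototypes so that knowledge is shared effectively across clients.

Details

Motivation: Data heterogeneity, such as clients' biased labeling preferences, is a major challenge in federated learning, and existing methods often neglect the semantic information within each client. FedSC aims to exploit this information to improve model performance.

Result: Experiments show that FedSC performs well in a variety of challenging scenarios and that its key components effectively improve model performance.

Insight: Exploiting intra-client semantic information can substantially alleviate data heterogeneity in federated learning.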

Abstract: Federated learning (FL) aims to train models collaboratively across clients without sharing data, in order to preserve privacy. However, one major challenge is data heterogeneity, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning the global model), often neglecting the inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at the semantic level, aiming to provide rich underlying class knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via discrepancy aggregation, used as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis of FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components.

[30] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation cs.CV | cs.LG | quant-ph

Qingyue Jiao, Kangyu Zheng, Yiyu Shi, Zhiding Liang

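TL;DR: The paper proposes a hybrid classical-quantum generative adversarial network (HybridQ-GAN) for generating color skin disease images, addressing the high computational cost of classical methods and the restriction of quantum methods to low-quality grayscale images.

Details

Motivation: Skin disease datasets suffer from class imbalance, privacy concerns, and object bias. Classical generative models are computationally expensive, while quantum methods produce low-quality images, so an efficient, high-quality image generation method is needed.

Result: The model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in generation quality and in the classification gain it provides as data augmentation. Its gain is comparable to that of state-of-the-art classical generative models with over 25 times fewer parameters and 10 times fewer training epochs, and its robustness is demonstrated on real quantum hardware.

Insight: Quantum image generation has considerable potential in medicine and may become an efficient alternative as quantum hardware advances.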

Abstract: Machine learning-assisted diagnosis is gaining traction in skin disease detection, but training effective models requires large amounts of high-quality data. Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, making data augmentation essential. While classical generative models are widely used, they demand extensive computational resources and lengthy training time. Quantum computing offers a promising alternative, but existing quantum-based image generation methods can only yield low-quality grayscale images. Through a novel classical-quantum latent space fusion technique, our work overcomes this limitation and introduces the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Our model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and classification performance boost when used as data augmentation. Moreover, the performance boost is comparable with that achieved using state-of-the-art classical generative models, yet with over 25 times fewer parameters and 10 times fewer training epochs. Such results suggest a promising future for quantum image generation as quantum hardware advances. Finally, we demonstrate the robust performance of our model on a real IBM quantum machine with hardware noise.

[31] Multimodal Prompt Alignment for Facial Expression Recognition cs.CV | cs.AI

Fuyan Ma, Yiran He, Bin Sun, Shutao Li

Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.

[32] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection cs.CV

Lei Hao, Lina Xu, Chang Liu, Yanni Dong

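TL;DR: LASFNet is a lightweight attention-guided self-modulation feature fusion network that performs multimodal object detection with a single feature fusion unit, substantially reducing computational cost while improving detection accuracy.

Details

Motivation: Existing multimodal object detection methods usually fuse features across modalities by stacking multiple feature fusion units, which makes training complex and computation expensive. This work aims to simplify the process and improve performance.

Result: Experiments on three datasets show that, compared with state-of-the-art methods, LASFNet reduces the number of parameters by up to 90% and computational cost by up to 85% while improving detection accuracy (mAP) by 1%-3%.

Insight: Attention mechanisms combined with a lightweight design can improve multimodal object detection while greatly reducing compute, offering a more efficient solution for practical applications.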

Abstract: Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at https://github.com/leileilei2000/LASFNet.

[33] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation cs.CV

Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun

Abstract: Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
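
The 32-fold figure can be checked with simple arithmetic. The sketch below assumes the comparison baseline is a 2D-grid tokenizer with 16x spatial downsampling; that factor is an assumption made here, since the abstract only says "standard VQ-VAEs".

```python
def grid_token_count(height, width, downsample):
    """Number of tokens a 2D-grid tokenizer emits for one image."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


# Assumed baseline: 16x downsampling on a 1024x1024 image gives a 64x64 grid.
baseline_tokens = grid_token_count(1024, 1024, downsample=16)

# Token budget reported in the abstract for the 1D binary latents.
compact_tokens = 128

reduction = baseline_tokens / compact_tokens
print(baseline_tokens, reduction)
```

Under that assumption the baseline uses 4096 tokens, and 4096 / 128 gives the stated reduction factor of 32.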

[34] DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation cs.CV

Wenzhou Lyu, Jialing Lin, Wenqi Ren, Ruihao Xia, Feng Qian

Abstract: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping. Project page: https://wenzhoulyu.github.io/DidSee/
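
A zero terminal signal-to-noise ratio means the cumulative signal coefficient reaches exactly zero at the last timestep, so the final noisy sample carries no trace of the clean input. One commonly used way to enforce this is to shift and rescale the square root of the cumulative alphas. The sketch below shows that generic rescaling in plain Python; it is not taken from the DidSee code, and the function name and the linear beta schedule in the usage note are assumptions.

```python
import math


def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the last cumulative alpha is exactly zero.

    Expects at least two betas, each strictly between 0 and 1.
    """
    # sqrt of the cumulative product of (1 - beta).
    sqrt_alpha_bar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        sqrt_alpha_bar.append(math.sqrt(running))

    first, last = sqrt_alpha_bar[0], sqrt_alpha_bar[-1]
    # Shift so the final value becomes zero, then scale so the first
    # value stays (numerically almost) unchanged.
    scale = first / (first - last)
    sqrt_alpha_bar = [(s - last) * scale for s in sqrt_alpha_bar]

    # Convert back: cumulative alpha -> per-step alpha -> beta.
    alpha_bar = [s * s for s in sqrt_alpha_bar]
    alphas = [alpha_bar[0]] + [
        alpha_bar[i] / alpha_bar[i - 1] for i in range(1, len(alpha_bar))
    ]
    return [1.0 - a for a in alphas]
```

For example, applying the function to a hypothetical 1000-step linear schedule from 1e-4 to 0.02 leaves the first beta essentially unchanged and sets the last beta to 1.0, so the cumulative product of `1 - beta` ends at zero.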

[35] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features cs.CV | cs.CR

Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu

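TL;DR: The paper proposes dSVA, a generative adversarial attack that exploits self-supervised Vision Transformer (ViT) features. By combining features from contrastive learning (CL) and masked image modeling (MIM), it significantly improves black-box adversarial transferability.

Details

Motivation: Prior work mostly relies on intermediate features obtained through supervised learning. This paper explores whether self-supervised ViT features can improve adversarial transferability, in particular by combining the characteristics of two self-supervised paradigms, CL and MIM.

Result: Experiments show that dSVA achieves better black-box transferability than existing methods across a range of model architectures.

Insight: The CL and MIM features of self-supervised ViTs are complementary, and exploiting them jointly can significantly improve the generalization of adversarial attacks.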

Abstract: The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. In previous work, these features almost always come from supervised learning. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, yield strong adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we achieve remarkable black-box transferability to models of various architectures, outperforming the state of the art. Code available at https://github.com/spencerwooo/dSVA.

Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai

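TL;DR: The paper, HumanOmniV2, proposes a reinforcement learning method for strengthening the reasoning ability of multimodal large language models, focusing on insufficient global context understanding and the shortcut problem, and introduces a new benchmark, IntentBench.

Details

Motivation: Existing reasoning models suffer from insufficient global context understanding and shortcut problems when handling multimodal data, so a model may misinterpret the context or overlook key clues.

Result: The method outperforms other open-source omni-modal models on multiple omni-modal benchmarks.

Insight: Understanding the global context is key to multimodal reasoning, and LLM-judged rewards can effectively integrate multimodal information with logical reasoning.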

Abstract: With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.

[37] SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification cs.CV

Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le

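TL;DR: SAMURAI is a multimodal retrieval framework that combines CLIP-based semantic matching with shape-guided re-ranking to identify 3D objects in complex indoor environments, achieving competitive performance.

Details

Motivation: In the ROOMELSA challenge, 3D objects must be retrieved from only a masked 2D image and a natural language description, which is made difficult by distorted viewpoints, textureless regions, ambiguous language prompts, and noisy segmentation masks.

Result: The method performs competitively on the ROOMELSA private test set, showing the importance of combining shape priors with language understanding for open-world 3D object retrieval.

Insight: Combining shape priors with language understanding is key to open-world 3D object retrieval.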

Abstract: Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.
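
The mask preprocessing step, keeping only the largest connected component, can be sketched with a breadth-first search over a binary grid. This is a minimal illustration, not the SAMURAI pipeline: the function name, the nested-list mask format, and the choice of 4-connectivity are assumptions.

```python
from collections import deque


def largest_connected_component(mask):
    """Return a copy of a binary mask that keeps only its largest
    4-connected foreground region (ties keep the first one found)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    best = []

    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component

    # Paint the winning component onto an empty mask.
    cleaned = [[0] * cols for _ in range(rows)]
    for y, x in best:
        cleaned[y][x] = 1
    return cleaned
```

Small detached blobs, which are typical of noisy segmentation masks, are dropped, leaving a single silhouette for the shape-guided re-ranking stage.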

[38] PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image cs.CV

Hongyu Yan, Kunming Luo, Weiyu Li, Yixun Liang, Shengming Li

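TL;DR: PoseMaster is an end-to-end controllable 3D character generation framework that unifies pose transformation and 3D reconstruction in a flow-based 3D-native generation approach and uses the skeleton as the pose condition, improving both pose-control accuracy and generation quality.

Details

Motivation: Existing methods tend to produce distorted images during pose standardization because of self-occlusion and viewpoint issues, which degrades the subsequent 3D reconstruction. PoseMaster aims to solve this problem and enable efficient, high-quality 3D character generation.

Result: PoseMaster outperforms existing methods on A-pose character generation and performs well on arbitrary-pose control.

Insight: By combining 3D skeleton conditioning with flow-based generation, PoseMaster addresses self-occlusion and viewpoint problems and offers a more efficient approach to 3D character generation.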

Abstract: 3D characters play a crucial role in our daily entertainment. To improve the efficiency of 3D character modeling, recent image-based methods use two separate models to achieve pose standardization and 3D reconstruction of the A-pose character. However, these methods are prone to generating distorted and degraded images in the pose standardization stage due to self-occlusion and viewpoint issues, which further affects the geometric quality of the subsequent reconstruction process. To tackle these problems, we propose PoseMaster, an end-to-end controllable 3D character generation framework. Specifically, we unify pose transformation and 3D character generation into a flow-based 3D native generation framework. To achieve accurate arbitrary-pose control, we propose to leverage the 3D body bones existing in the skeleton of an animatable character as the pose condition. Furthermore, considering the specificity of multi-condition control, we randomly empty the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control. Finally, we create a high-quality pose-control dataset derived from realistic character animation data so that the model learns the implicit relationships between skeleton and skinning weights. Extensive experiments show that PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation while demonstrating its powerful ability to achieve precise control for arbitrary poses.

[39] EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception cs.CV | cs.AI | cs.LG

Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock

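TL;DR: This paper proposes EgoAdapt, a framework that uses adaptive cross-modal distillation and policy learning to substantially improve the efficiency of multisensory egocentric perception tasks while maintaining high performance.

Details

Motivation: Current multisensory egocentric perception models perform very well but are computationally expensive and hard to deploy in resource-constrained environments. EgoAdapt targets this efficiency problem.

Result: On EPIC-Kitchens, EasyCom, and Aria Everyday Activities, EgoAdapt reduces GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while matching or outperforming SOTA models.

Insight: Adaptive cross-modal distillation and dynamic policy learning are an effective route to efficient perception, particularly for resource-constrained applications.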

Abstract: Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities) demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, the corresponding state-of-the-art models.

[40] ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching cs.CV

Mahmoud Tahmasebi, Saif Huq, Kevin Meehan, Marion McAfee

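TL;DR: ESMStereo uses an Enhanced ShuffleMixer to achieve accurate, real-time stereo matching with small-scale cost volumes.

Details

Motivation: Stereo matching is critical for autonomous systems, but existing methods struggle to deliver high accuracy and real-time speed together. Large-scale cost volumes are computationally expensive, while small-scale ones carry too little information.

Result: ESMStereo reaches 116 FPS on high-end GPUs and 91 FPS on the AGX Orin, balancing accuracy and speed.

Insight: A lightweight design combined with feature fusion can markedly improve the accuracy of small-scale cost volumes, and optimizing contextual connectivity is essential for real-time performance.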

Abstract: Stereo matching has become an increasingly important component of modern autonomous systems. Developing deep learning-based stereo matching models that deliver high accuracy while operating in real time continues to be a major challenge in computer vision. In the domain of cost-volume-based stereo matching, accurate disparity estimation depends heavily on large-scale cost volumes. However, such large volumes store substantial redundant information and also require computationally intensive aggregation units for processing and regression, making real-time performance unattainable. Conversely, small-scale cost volumes followed by lightweight aggregation units provide a promising route for real-time performance, but lack sufficient information to ensure highly accurate disparity estimation. To address this challenge, we propose the Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with small-scale cost volumes. ESM restores critical details by integrating primary features into the disparity upsampling unit. It quickly extracts features from the initial disparity estimation and fuses them with image features. These features are mixed by shuffling and layer splitting, then refined through a compact feature-guided hourglass network to recover more detailed scene geometry. The ESM focuses on local contextual connectivity with a large receptive field and low computational cost, leading to the reconstruction of a highly accurate disparity map in real time. The compact version of ESMStereo achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX Orin.
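The abstract says features are "mixed by shuffling and layer splitting" but gives no further detail. As a rough illustration only, the sketch below shows a generic ShuffleNet-style channel shuffle followed by an even split; whether ESM uses exactly this operation is an assumption, and the names `channel_shuffle` and `split_channels` are hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (generic ShuffleNet-style shuffle).

    `channels` is a list ordered group by group, e.g. disparity features
    followed by image features. The list is viewed as [groups][per_group],
    transposed to [per_group][groups], and flattened again.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


def split_channels(channels):
    """Split a channel list into two halves that can be processed separately."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


# Three disparity-derived channels followed by three image-feature channels.
fused = ["d0", "d1", "d2", "f0", "f1", "f2"]
mixed = channel_shuffle(fused, groups=2)
first_half, second_half = split_channels(mixed)
```

After the shuffle, each half contains both disparity and image channels, which is the usual motivation for shuffling before a split: every branch sees information from every source.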

[41] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography cs.CV

Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo

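TL;DR: OracleFusion is a novel two-stage semantic typography framework for assisting the decipherment of Oracle Bone Script. It combines a multimodal large language model with enhanced spatial awareness and structurally constrained vector font generation, significantly improving the readability and aesthetic quality of oracle characters.

Details

Motivation: Oracle Bone Script is one of the earliest ancient languages, yet only some of its characters have been deciphered; the rest remain hard to interpret because of their complex structure and abstract imagery. OracleFusion uses modern techniques to help experts decipher these characters more efficiently.

Result: OracleFusion outperforms baseline models in semantics, visual appeal, and glyph maintenance, significantly improving readability and aesthetic quality, and provides expert-like insights on unseen oracle characters.

Insight: Combining modern AI techniques with structural constraints can effectively assist in deciphering complex ancient characters and offers a new approach to cultural heritage preservation.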

Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.

[42] Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection cs.CV | cs.LG

Luosheng Xu, Dalin Zhang, Zhaohui Song

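TL;DR: This paper proposes FlickCD, a lightweight and efficient remote sensing change detection method that uses an Enhanced Difference Module (EDM) and multi-scale feature fusion to sharply reduce computational cost while maintaining high accuracy.

Details

Motivation: In remote sensing change detection, the growing complexity and computational demands of deep learning models have not brought significant accuracy gains. This work focuses on lightweight models that meet the low-resource requirements of on-satellite processing.

Result: On four benchmark datasets, FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving SOTA performance or losing less than 1% F1.

Insight: Lightweight model design can greatly reduce resource consumption without a significant loss in accuracy, opening up the possibility of real-time on-satellite processing.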

Abstract: Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, which means quick flick then get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA) to efficiently capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.

Yujia Liang, Jile Jiao, Zhicheng Wang, Xuetao Feng, Zixuan Ye

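TL;DR: IPFormer-VideoLLM significantly improves video understanding in multi-shot scenarios through a new dataset, MultiClip-Bench, and an improved attention mechanism.

Details

Motivation: Existing VideoLLMs perform poorly in multi-shot scenarios (e.g., varying camera angles or scene changes) and are prone to instance identity forgetting and key frame negligence. This is attributed to a lack of annotated data for multi-shot scenarios.

Result: Experiments show that the model and dataset not only significantly improve multi-scene video understanding but also offer advantages across multiple video benchmarks.

Insight: Encoding instance features in a discrete or lossy manner can lose identity information; directly injecting instance-level features effectively mitigates this problem.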

Abstract: Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but they struggle with multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can lead to failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets, and we therefore introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios. Further analysis shows that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, so we contribute a new model, IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. Experiments demonstrate that our proposed dataset and model not only enhance multi-scene video understanding significantly, but also offer distinct advantages across various video benchmarks.

[44] HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation cs.CV | cs.AI | cs.CL | cs.LG

Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen

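TL;DR: HalluSegBench is a benchmark that evaluates hallucinations in vision-language segmentation models through counterfactual visual reasoning, revealing that vision-driven hallucinations are more prevalent than label-driven ones.

Details

Motivation: Existing segmentation models often produce hallucinated masks not grounded in the image content or mislabel regions, but current evaluation methods focus mainly on label or textual hallucinations and do not manipulate the visual context.

Result: Experiments show that in state-of-the-art segmentation models, vision-driven hallucinations are far more prevalent than label-driven ones, and models often persist in false segmentation.

Insight: Counterfactual reasoning is key to diagnosing a model's visual grounding; further research is needed to improve fidelity to visual content.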

Abstract: Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.

[45] GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction cs.CV | cs.RO

Muleilan Pei, Shaoshuai Shi, Lu Zhang, Peiliang Li, Shaojie Shen

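TL;DR: GoIRL is an inverse-reinforcement-learning framework for multimodal trajectory prediction. It fuses lane-graph features into grid space, infers the reward distribution with maximum entropy IRL, and uses a hierarchical generator with a probability fusion strategy to improve prediction accuracy and confidence.

Details

Motivation: Existing trajectory prediction methods rely mainly on supervised learning and struggle with uncertainty and multimodality. GoIRL addresses these problems through an IRL framework.

Result: GoIRL achieves SOTA performance on the Argoverse and nuScenes datasets and generalizes better than supervised models.

Insight: GoIRL shows that inverse reinforcement learning has potential for multimodal trajectory prediction; graph structure and feature fusion are key to prediction performance.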

Abstract: Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase our approach not only achieves state-of-the-art performance on the large-scale Argoverse & nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.

[46] YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection cs.CV

Jiawei Hu

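TL;DR: YOLO-FDA is a YOLO-based detection framework that uses fine-grained detail enhancement and attention-guided feature fusion to address multiscale and redundant-feature challenges in industrial surface defect detection.

Details

Motivation: Surface defect detection in industrial settings is challenging because of diverse defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures; existing methods fall short in feature redundancy, detail sensitivity, and multiscale robustness.

Result: On multiple benchmark datasets, YOLO-FDA outperforms existing methods in both accuracy and robustness.

Insight: Combining attention mechanisms with detail enhancement can markedly improve the accuracy and robustness of defect detection.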

Abstract: Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. To address these challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that integrates fine-grained detail enhancement and attention-guided feature fusion. Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation within the YOLOv5 backbone. To better capture fine structural changes, we introduce a Detail-directional Fusion Module (DDFM) that introduces a directional asymmetric convolution in the second-lowest layer to enrich spatial details and fuses the second-lowest layer with low-level features to enhance semantic consistency. Furthermore, we propose two novel attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF) to improve contextual representation and reduce feature noise. Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms existing state-of-the-art methods in terms of both accuracy and robustness across diverse types of defects and scales.

[47] Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation cs.CV

Junwen Wang, Oscar Maccormac, William Rochford, Aaron Kujawa, Jonathan Shapey

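TL;DR: This paper proposes two tree-based semantic loss functions that exploit the hierarchical structure of labels to improve sparsely annotated multi-class hyperspectral image segmentation, achieving state-of-the-art performance.

Details

Motivation: Hyperspectral imaging (HSI) has great potential in surgical applications, but existing methods make little use of semantic relationships in the label space, which limits segmentation performance.

Result: The method achieves SOTA performance on a sparsely annotated HSI dataset and enables effective detection of out-of-distribution pixels without compromising segmentation of in-distribution pixels.

Insight: A hierarchical label structure can effectively improve the semantic consistency of segmentation while supporting more complex detection tasks.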

Abstract: Hyperspectral imaging (HSI) shows great promise for surgical applications, offering detailed insights into biological tissue differences beyond what the naked eye can perceive. Refined labelling efforts are underway to train vision systems to distinguish large numbers of subtly varying classes. However, commonly used learning methods for biomedical segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. In this work, we introduce two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations. Extensive experiments demonstrate that our proposed method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising $107$ classes organised in a clinically-defined semantic tree structure. Furthermore, our method enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels.
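The abstract does not define the two losses, so the following is only a sketch of the general idea of a tree-aware loss, not the authors' formulation. It assumes a generic hierarchical cross-entropy: the probability of an internal node is the sum of its leaf probabilities, and the loss averages the negative log-probability along the path from the true leaf towards the root. The toy label tree, the function names, and the averaging are all illustrative assumptions.

```python
import math


def path_to_root(node, parent):
    """Nodes from `node` upwards, excluding the root (which has no parent entry)."""
    path = []
    while node in parent:
        path.append(node)
        node = parent[node]
    return path


def tree_cross_entropy(leaf_probs, true_leaf, parent, eps=1e-12):
    """Average negative log-probability of the true leaf and its ancestors."""
    node_prob = {}
    for leaf, prob in leaf_probs.items():
        # Each leaf contributes its probability to every node above it.
        for node in path_to_root(leaf, parent):
            node_prob[node] = node_prob.get(node, 0.0) + prob
    path = path_to_root(true_leaf, parent)
    return -sum(math.log(max(node_prob[n], eps)) for n in path) / len(path)


# Hypothetical label tree: tissue -> {vessel -> {artery, vein}, bone}.
parent = {"artery": "vessel", "vein": "vessel", "vessel": "tissue", "bone": "tissue"}

near_miss = {"artery": 0.1, "vein": 0.8, "bone": 0.1}  # wrong leaf, right branch
far_miss = {"artery": 0.1, "vein": 0.1, "bone": 0.8}   # wrong leaf, wrong branch

loss_near = tree_cross_entropy(near_miss, "artery", parent)
loss_far = tree_cross_entropy(far_miss, "artery", parent)
```

Both predictions assign the same probability to the true class, so plain cross-entropy would penalise them equally. The tree-aware version gives the near miss a lower loss because most of its mass stays inside the correct branch, which is the kind of inter-class semantics the abstract says flat losses ignore.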

[48] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image cs.CV | 68 | I.4.0

Pufan Li, Bi’an Du, Wei Hu

TL;DR: Error

Details

Motivation: Error

Result: Error

Insight: Error

Abstract: Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we train three different Gaussian branches initialized from the geometry prior, perception prior and Gaussian noise, respectively. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we refine 3D Gaussian branches through mutual interaction between geometry and perception priors, further enhanced by a reprojection-based strategy that enforces depth consistency. Experiments demonstrate the higher-fidelity reconstruction results of our method, which outperforms existing methods on novel view synthesis and 3D reconstruction and delivers robust and consistent 3D object generation.

[49] Task-Aware KV Compression For Cost-Effective Long Video Understanding cs.CV | cs.AI

Minghao Qin, Yan Shu, Peitian Zhang, Kun Lun, Huaying Yuan

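TL;DR: Video-X^2L is a task-aware KV compression method that uses bi-level KV compression and selective KV re-loading to significantly improve the efficiency of long-video understanding while reducing computational cost.

Details

Motivation: Existing multimodal large language models (MLLMs) face prohibitive computational costs on long-video understanding (LVU) tasks, and existing KV compression methods lose much critical information at high compression ratios.

Result: On multiple LVU benchmarks (such as VideoMME and MLVU), Video-X^2L significantly outperforms existing KV compression methods while substantially reducing computational cost.

Insight: Task-aware information preservation is key to efficient long-video understanding, and the method is compatible with existing MLLMs without additional training.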

Abstract: Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM’s pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM’s decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X^2L outperforms existing KV-compression methods by a large margin while substantially reducing the computation cost.
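The selective re-loading step can be pictured with a small sketch. The abstract does not say how chunk importance is scored or how many chunks are re-loaded, so the per-chunk `relevance` scores, the `top_k` budget, and the function name `select_kv_cache` are assumptions for illustration; the strings stand in for real key-value tensors.

```python
def select_kv_cache(l_kvs, h_kvs, relevance, top_k):
    """Pick low-compression KVs for the top_k most relevant chunks, H-KVs elsewhere.

    l_kvs / h_kvs: per-chunk caches at low and high compression (placeholders here).
    relevance: one task-specific score per chunk (how it is computed is assumed).
    """
    if not (len(l_kvs) == len(h_kvs) == len(relevance)):
        raise ValueError("all inputs must describe the same number of chunks")
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    critical = set(ranked[:top_k])
    # Chunk order is preserved so temporal structure is not disturbed.
    return [l_kvs[i] if i in critical else h_kvs[i] for i in range(len(relevance))]


l_kvs = ["L0", "L1", "L2", "L3"]
h_kvs = ["H0", "H1", "H2", "H3"]
relevance = [0.1, 0.9, 0.3, 0.7]
mixed_cache = select_kv_cache(l_kvs, h_kvs, relevance, top_k=2)
```

With these scores, chunks 1 and 3 keep their detailed L-KVs while chunks 0 and 2 fall back to compact H-KVs, so the decoding context stays small but retains detail where the task needs it.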

[50] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding cs.CV

Zijun Lin, Shuting He, Cheston Tan, Bihan Wen

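TL;DR: The paper proposes GroundFlow, a module that extracts short-term and long-term historical information to improve accuracy on 3D point cloud sequential grounding, significantly outperforming existing methods.

Details

Motivation: Existing 3D visual grounding (3DVG) methods treat multi-step text instructions as a whole and ignore temporal information, so they cannot correctly handle pronoun references and context dependencies.

Result: On the SG3D benchmark, GroundFlow significantly improves the task accuracy of baseline methods (+7.5% and +10.2%), outperforming a pre-trained 3D large language model.

Insight: Extracting temporal information at multiple granularities (short-term and long-term) is essential for complex sequential tasks, and the plug-in design gives existing 3DVG models a flexible extension path.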

Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as “it”, “here” and “the same” to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow – a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.

[51] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation cs.CV | cs.RO | eess.IV

Yihong Cao, Jiaming Zhang, Xu Zheng, Hao Shi, Kunyu Peng

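TL;DR: The paper proposes UNLOCK, a method for source-free panoramic image segmentation. Using Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning modules, it achieves effective segmentation without source data or target labels.

Details

Motivation: Panoramic image processing has attracted attention for its omni-context perception, but traditional methods depend on source data (such as labeled pinhole images), which limits practical use. The paper introduces a more practical task, SFOASS: occlusion-aware seamless segmentation without access to source data.

Result: Experiments show that UNLOCK, without source data, performs comparably to source-dependent methods, reaching 10.9 mAAP and 11.6 mAP, with an absolute improvement of +4.3 mAPQ over the source-only method.

Insight: The paper shows that pseudo-labeling and occlusion-aware reasoning enable effective panoramic segmentation without source data, offering a more flexible option for practical applications.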

Abstract: Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these constraints, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.

[52] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models cs.CV | cs.AI

Louis Kerner, Michel Meintz, Bihe Zhao, Franziska Boenisch, Adam Dziedzic

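TL;DR: BitMark is a robust watermarking framework for bitwise autoregressive image generative models such as Infinity. It embeds a bit-level watermark during generation so that generated content can be identified, helping prevent the performance degradation (model collapse) caused by repeated training on generated content.

Details

Motivation: As outputs of generative models such as Infinity spread across the Internet, they may be reused as training data, leading to model collapse (gradual performance degradation). Watermarking can identify generated content and thereby mitigate this problem.

Result: BitMark's watermark is radioactive: it propagates to models subsequently trained on watermarked images, and it remains detectable even when diffusion or autoregressive models are only fine-tuned on them.

Insight: Bit-level watermarking offers a reliable way to help prevent model collapse in generative models, and it highlights the risks of recycling generated content.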

Abstract: State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models’ own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity. Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity’s image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model’s outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.

[53] Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping cs.CV | cs.RO

Qifei Cui, Yuang Zhou, Ruichen Deng

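TL;DR: This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. It comprises four modules: estimating, smoothing, filtering, and pose-mapping.

Details

Motivation: Converting monocular RGB video directly into executable arm trajectories requires solving 2D-to-3D pose estimation, temporal smoothness, noise filtering, and pose mapping.

Result: ESFP efficiently converts video into executable arm trajectories, with good robustness and practicality.

Insight: Combining a temporal sequence model (HPSTM) with geometric mapping better addresses the practical challenges of going from video to robot arm control.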

Abstract: This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence Transformer with self-attention, combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM’s uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm’s polar workspace, preserving wrist orientation.
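The filtering stage is described only as "variance-weighted", so the exact weighting scheme is not known from the abstract. One common way to realise such a step is an inverse-variance weighted moving average, sketched below for a single scalar coordinate; the window size, the scalar variances (the abstract mentions full covariances), and the name `variance_weighted_filter` are assumptions made for illustration.

```python
def variance_weighted_filter(values, variances, window=3):
    """Smooth a 1-D trajectory with inverse-variance weights in a sliding window.

    Samples with high predicted variance (low confidence) contribute little,
    so isolated noisy estimates are pulled towards their confident neighbours.
    """
    if len(values) != len(variances):
        raise ValueError("values and variances must have the same length")
    half = window // 2
    smoothed = []
    for t in range(len(values)):
        lo, hi = max(0, t - half), min(len(values), t + half + 1)
        weights = [1.0 / variances[i] for i in range(lo, hi)]
        weighted_sum = sum(w * values[i] for w, i in zip(weights, range(lo, hi)))
        smoothed.append(weighted_sum / sum(weights))
    return smoothed


# A spike at index 2 that the (hypothetical) model flagged as very uncertain.
positions = [0.0, 0.0, 10.0, 0.0, 0.0]
variances = [1.0, 1.0, 100.0, 1.0, 1.0]
filtered = variance_weighted_filter(positions, variances)
```

Because the spike carries a weight of 0.01 against 1.0 for each neighbour, it is almost entirely suppressed, while a constant trajectory passes through unchanged regardless of the variances.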

Umaima Rahman, Mohammad Yaqub, Dwarikanath Mahapatra

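TL;DR: DiMPLe is a new multi-modal prompt learning method that disentangles invariant and spurious features within and across modalities, improving alignment and generalization on out-of-distribution (OOD) data.

Details

Motivation: Spurious correlations in visual data hurt performance on OOD data. Existing methods mostly focus on a single modality (such as image features) and overlook cross-modal feature disentanglement.

Result: Averaged across 11 datasets, DiMPLe outperforms CoOp-OOD, with absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy.

Insight: Cross-modal feature disentanglement is key to OOD generalization in multi-modal models; separating invariant from spurious features markedly improves robustness to distribution shift.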

Abstract: We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments show that DiMPLe outperforms CoOp-OOD when averaged across 11 diverse datasets, achieving absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy.

[55] Temporal Rate Reduction Clustering for Human Motion Segmentation cs.CV

Xianghan Meng, Zhengyu Tong, Zhiyuan Huang, Chun-Guang Li

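TL;DR: This paper proposes Temporal Rate Reduction Clustering (TR²C), a new method for Human Motion Segmentation (HMS) in videos with complex motions and cluttered backgrounds.

Details

Motivation: Existing HMS methods are mainly based on a subspace clustering assumption, but complex human motions and cluttered backgrounds in video may violate it, so a more effective approach is needed.

Result: Extensive experiments on five HMS benchmark datasets achieve state-of-the-art performance with different feature extractors.

Insight: By learning temporally consistent structured representations, TR²C overcomes the limitations of traditional subspace clustering in complex motion scenes.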

Abstract: Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ remain temporally consistent and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors.

[56] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter cs.CVPDF

Cheng Zou, Senlin Cheng, Bolei Xu, Dandan Zheng, Xiaobo Li

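TL;DR: ViTI is a Diffusion Transformer-based video virtual try-on method that casts the task as conditional video inpainting to achieve spatial-temporal consistency, and it outperforms existing methods.

Details

Motivation: Video virtual try-on must maintain consistency across frames while preserving garment details. Applying image-based methods frame by frame gives poor inter-frame consistency, and recent diffusion-based methods improve on this but still show inconsistencies.

Result: Experiments show that ViTI outperforms existing methods both quantitatively and qualitatively, with better spatial-temporal consistency and detail preservation.

Insight: Starting from a video generation problem rather than image-based try-on makes consistency easier to ensure; combining a Diffusion Transformer with conditional inpainting is an effective route for complex video tasks.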

Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. The task is challenging: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all frames. Naively applying image-based try-on methods frame by frame gives poor results due to severe inconsistency. Recent diffusion-based video try-on methods, though very few, converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it to the video try-on task. This has shown improvements, but inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start with a video generation problem instead of an image-based try-on problem, which from the beginning has better spatial-temporal consistency. Specifically, we first build a video inpainting framework based on a Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to ensure that the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

[57] WordCon: Word-level Typography Control in Scene Text Rendering cs.CVPDF

Wenda Shi, Yiren Song, Zihan Rao, Dengming Zhang, Jiaming Liu

Abstract: Achieving precise word-level typography control within generated images remains a persistent challenge. To address it, we newly construct a word-level controlled scene text dataset and introduce the Text-Image Alignment (TIA) framework. This framework leverages cross-modal correspondence between text and local image regions provided by grounding models to enhance the Text-to-Image (T2I) model training. Furthermore, we propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes selective key parameters, improving both efficiency and portability. This allows seamless integration into diverse pipelines, including artistic text rendering, text editing, and image-conditioned text rendering. To further enhance controllability, the masked loss at the latent level is applied to guide the model to concentrate on learning the text region in the image, and the joint-attention loss provides feature-level supervision to promote disentanglement between different words. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art. The datasets and source code will be available for academic use.

[58] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation cs.CVPDF

Diego Biagini, Nassir Navab, Azade Farshad

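TL;DR: HieraSurg is a hierarchy-aware diffusion model that generates surgical videos by combining surgical phase and segmentation information, outperforming existing methods.

Details

Motivation: Existing surgical video generation methods lack semantic consistency with surgical actions and phases; HieraSurg aims to address this.

Result: On cholecystectomy video generation, HieraSurg outperforms existing methods both quantitatively and qualitatively and can generate higher frame-rate videos.

Insight: By modeling surgical information hierarchically, HieraSurg produces more realistic videos and shows practical potential for surgical simulation.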

Abstract: Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.

[59] DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images cs.CVPDF

Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan

Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present DrishtiKon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multi-lingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at https://github.com/kasuba-badri-vishal/DhrishtiKon.

[60] LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning cs.CVPDF

Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno

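TL;DR: The paper proposes a method that integrates human keypoint data to generate vision-language instruction data specialized for pose and action understanding, builds a dataset of about 200k samples, and fine-tunes LLaVA-1.5-7B, yielding an overall improvement of 33.2%.

Details

Motivation: Current vision-language models (VLMs) perform well on general visual understanding but poorly on complex tasks involving human poses and actions, mainly because specialized vision-language instruction data is lacking.

Result: The fine-tuned LLaVA-Pose model improves by 33.2% overall on the E-HPAUB benchmark compared with the original LLaVA-1.5-7B.

Insight: Keypoint-integrated data effectively strengthens multimodal models on human pose and action understanding tasks.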

Abstract: Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.

[61] Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models cs.CV | cs.AIPDF

Haoyang Wu, Tsun-Hsuan Wang, Mathias Lechner, Ramin Hasani, Jennifer A. Eckhoff

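TL;DR: The paper proposes a novel hierarchical input-dependent state space model for surgical phase recognition that models both local and global dynamics and uses the linear scaling of state space models to process long surgical videos efficiently.

Details

Motivation: Surgical workflow analysis is essential in robot-assisted surgery, but long surgical videos are hard to process, and existing transformer models handle them inefficiently because of quadratic attention.

Result: The method outperforms prior state-of-the-art methods on Cholec80 (+2.8%), MICCAI2016 (+4.3%), and Heichole (+12.9%).

Insight: The linear scaling of state space models offers an efficient solution for long-video tasks, and the hybrid discrete-continuous supervision strategy further improves performance.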

Abstract: Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both signals of discrete phase labels and continuous phase progresses are propagated through the network. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.
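The linear-scaling property mentioned in the abstract can be illustrated with a minimal sketch of a plain discrete linear state space recurrence. This is not the authors' hierarchical input-dependent model: the function name `ssm_scan`, the fixed matrices `A`, `B`, `C`, and the toy dimensions are all assumptions made for illustration. The point is only that each frame feature is visited once, so the cost grows linearly with video length, whereas self-attention compares every pair of frames.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t and y_t = C h_t over a sequence.

    x: (T, d_in) array of per-frame features (hypothetical input).
    Returns y of shape (T, d_out). A single pass over T steps: O(T) time.
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])          # hidden state starts at zero
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]          # propagate temporal information
        ys[t] = C @ h                 # read out a per-frame output
    return ys

# Toy example with made-up sizes: 1000 frames, 4-d features, 8-d state.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 4))
A = 0.9 * np.eye(8)                   # stable, input-independent transition
B = rng.normal(size=(8, 4))
C = rng.normal(size=(2, 8))
y = ssm_scan(x, A, B, C)
```

In an input-dependent variant, `A` and `B` would be computed from `x[t]` at each step; the single-pass structure, and hence the linear cost, stays the same.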

[62] Generalizable Neural Electromagnetic Inverse Scattering cs.CV | eess.IVPDF

Yizhe Cheng, Chunxun Tian, Haoru Wang, Wentao Zhu, Xiaoxuan Ma

Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.

[63] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models cs.CVPDF

Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong

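TL;DR: The paper introduces ShotBench, a benchmark for evaluating how well vision-language models (VLMs) understand cinematic language, and builds ShotQA, a large-scale multimodal dataset. ShotVL, trained with supervised fine-tuning and Group Relative Policy Optimization, significantly outperforms existing models.

Details

Motivation: Although existing VLMs perform well on general visual understanding, their understanding of the subtle visual grammar of cinematic language (which conveys narrative, emotion, and aesthetics) has not been fully explored, and effective evaluation tools are lacking.

Result: Existing VLMs perform poorly on ShotBench, with the best model below 60% average accuracy. ShotVL significantly outperforms other open-source and proprietary models, setting a new state of the art.

Insight: Cinematic language understanding is an important direction for extending VLMs, and ShotBench and ShotQA provide strong tools for research in this area.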

Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.

[64] CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations cs.CVPDF

Julian Lorenz, Mrunmai Phatak, Robin Schön, Katja Ludwig, Nico Hörmann

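TL;DR: CoPa-SG is a synthetic scene graph dataset with precise ground truth and exhaustive relation annotations between objects. It introduces two new concepts, parametric relations and proto-relations, that enable finer-grained scene graph representation and reasoning.

Details

Motivation: Current scene graph research is limited by a lack of accurate data, particularly precisely annotated benchmark datasets, which constrains model performance and applications.

Result: Using CoPa-SG, the authors compare several scene graph generation models and show how the new relation types can be integrated into downstream applications to enhance planning and reasoning.

Insight: Parametric and proto-relations give scene graphs richer semantic information and advance scene understanding.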

Abstract: 2D scene graphs provide a structural and explainable framework for scene understanding. However, current work still struggles with the lack of accurate scene graph data. To overcome this data bottleneck, we present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. Moreover, we introduce parametric and proto-relations, two new fundamental concepts for scene graphs. The former provides a much more fine-grained representation than its traditional counterpart by enriching relations with additional parameters such as angles or distances. The latter encodes hypothetical relations in a scene graph and describes how relations would form if new objects are placed in the scene. Using CoPa-SG, we compare the performance of various scene graph generation models. We demonstrate how our new relation types can be integrated in downstream applications to enhance planning and reasoning capabilities.

[65] CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection cs.CV | cs.AIPDF

Zhixin Cheng, Jiacheng Deng, Xinjun Li, Xiaotian Yin, Bohao Liao

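TL;DR: The paper proposes an image-to-point-cloud registration method based on channel-adaptive adjustment and global optimal selection, addressing inconsistent feature attention and redundant correspondences in cross-modal matching.

Details

Motivation: In detection-free image-to-point-cloud registration, differences in feature channel attention between the two modalities and redundant correspondences caused by similar structures in the scene reduce registration accuracy.

Result: State-of-the-art registration performance on the RGB-D Scenes V2 and 7-Scenes datasets.

Insight: Channel adaptation and global optimization are key to improving cross-modal registration accuracy.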

Abstract: Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.

Long Tian, Yufei Li, Yuyang Dai, Wenchao Chen, Xiyang Liu

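TL;DR: FastRef is an efficient prototype refinement framework for few-shot industrial anomaly detection (FS-IAD) that iteratively refines prototypes in two stages, characteristic transfer and anomaly suppression, and significantly improves detection performance.

Details

Motivation: Existing FS-IAD methods mainly derive prototypes from limited normal samples but neglect query image statistics that could make the prototypes more representative, which limits performance.

Result: On four datasets (MVTec, ViSA, MPDD, and RealIAD), FastRef significantly improves detection performance in the 1/2/4-shot settings while remaining computationally efficient.

Insight: With few samples, anomalies are more likely to be reconstructed, so prototypes should be refined dynamically using query features; optimal transport (OT) suits non-Gaussian feature distributions and suppresses anomalies effectively.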

Abstract: Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on deriving prototypes from limited normal samples, they typically neglect to systematically incorporate query image statistics to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process: (1) characteristic transfer from query features to prototypes via an optimizable transformation matrix, and (2) anomaly suppression through prototype alignment. The characteristic transfer is achieved through linear reconstruction of query features from prototypes, while the anomaly suppression addresses a key observation in FS-IAD: unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable. Therefore, we employ optimal transport (OT) for non-Gaussian sampled features to measure and minimize the gap between prototypes and their refined counterparts for anomaly suppression. For comprehensive evaluation, we integrate FastRef with competitive prototype-based FS-IAD methods: PatchCore, FastRecon, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets (MVTec, ViSA, MPDD, and RealIAD) demonstrate both the effectiveness and computational efficiency of our approach in the 1-, 2-, and 4-shot settings.
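The "linear reconstruction of query features from prototypes" step can be sketched as a ridge-regularized least-squares problem. This is only an illustration under stated assumptions, not FastRef itself: the function `reconstruct_queries`, the ridge term `lam`, and the random toy features are hypothetical, and the optimal transport stage described in the abstract is omitted entirely.

```python
import numpy as np

def reconstruct_queries(Q, P, lam=1e-6):
    """Find W minimizing ||W @ P - Q||^2 + lam * ||W||^2 (ridge least squares).

    Q: (N, d) query features; P: (K, d) prototypes (both hypothetical here).
    Returns W of shape (N, K) and the per-query reconstruction residual.
    """
    K = P.shape[0]
    G = P @ P.T + lam * np.eye(K)          # regularized Gram matrix, (K, K)
    W = np.linalg.solve(G, P @ Q.T).T      # closed-form solution, (N, K)
    residual = np.linalg.norm(Q - W @ P, axis=1)
    return W, residual

# Toy data: 5 prototypes in a 16-d feature space.
rng = np.random.default_rng(0)
P = rng.normal(size=(5, 16))
q_normal = rng.normal(size=5) @ P          # lies in the span of the prototypes
q_odd = rng.normal(size=16)                # generic vector, mostly outside the span
Q = np.stack([q_normal, q_odd])
W, residual = reconstruct_queries(Q, P)
```

A query that the prototypes explain well reconstructs with a small residual, while one outside their span does not. The abstract's concern is the opposite failure: with few prototypes, a flexible transformation may also reconstruct anomalies, which is why the method adds a suppression stage.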

[67] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation cs.CVPDF

Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang

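TL;DR: XVerse proposes a method that uses DiT (Diffusion Transformer) modulation to control the identity and semantic attributes of multiple subjects independently, addressing attribute entanglement and poor editability in multi-subject generation and improving the accuracy and flexibility of complex scene generation.

Details

Motivation: Existing text-to-image methods often suffer from attribute entanglement and poor editability under multi-subject control, and struggle to keep images coherent while offering fine-grained control. XVerse aims to enable high-quality multi-subject generation with independent attribute editing.

Result: XVerse controls each subject's identity and attributes (such as pose, style, and lighting) with high fidelity in multi-subject generation while avoiding common attribute entanglement and artifacts.

Insight: By modulating the text stream per token rather than directly modifying image latents, XVerse achieves more precise control and greater editing freedom in multi-subject generation, offering a new approach to complex scene synthesis.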

Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control for specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.

[68] HyperSORT: Self-Organising Robust Training with hyper-networks cs.CVPDF

Samuel Joutard, Marijn Stollenga, Marc Balle Sanchez, Mohammad Farid Azampour, Raphael Prevost

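TL;DR: HyperSORT is a self-organising robust training framework that uses a hyper-network to predict UNet parameters from latent vectors, designed to handle heterogeneous biases in medical imaging data such as erroneous labels or inconsistent annotation styles. It is validated on a synthetically perturbed dataset and a real one.

Details

Motivation: Medical imaging datasets contain heterogeneous biases (such as erroneous labels or inconsistent annotation styles) that can seriously degrade segmentation networks. Identifying and characterizing these biases is difficult, so a method that organizes the data automatically and learns robust segmentation parameters is needed.

Result: Experiments on two 3D abdominal CT datasets (a synthetically perturbed version of AMOS and the real TotalSegmentator dataset) show that HyperSORT produces a structured latent mapping of the data and identifies systematic biases.

Insight: Beyond making segmentation more robust, HyperSORT offers a way to address data bias in medical image analysis: clusters in the latent space reveal systematic biases in the dataset.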

Abstract: Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact deep segmentation networks performance. Yet, the identification and characterization of such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets’ parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT

[69] Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation cs.CVPDF

Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Rutger A. Fick, Thomas Conrad

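TL;DR: This paper presents a deep learning benchmark comparing approaches for classifying atypical versus normal mitoses, and cross-dataset evaluation shows the advantage of low-rank adaptation (LoRA).

Details

Motivation: Atypical mitotic figures (AMFs) are an important marker of tumor malignancy, but identifying them is challenging because of scarce data, class imbalance, and low agreement among pathologists. The paper addresses this with deep learning.

Result: Best balanced accuracies of 0.8135, 0.7696, and 0.7705 on AMi-Br, AtNorM-Br, and AtNorM-MD, respectively, with LoRA-fine-tuned foundation models performing particularly well.

Insight: Atypical mitosis classification is challenging but can be addressed effectively with transfer learning and fine-tuning; LoRA in particular performs well on out-of-domain data.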

Abstract: Atypical mitoses mark a deviation in the cell division process that can be an independent prognostically relevant marker for tumor malignancy. However, their identification remains challenging due to low prevalence, at times subtle morphological differences from normal mitoses, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new hold-out AMF datasets: AtNorM-Br, a dataset of mitoses from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively, with the results being particularly good for LoRA-based adaptation of the Virchow line of foundation models. Our work shows that atypical mitosis classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. We make available all code and data used in this paper in this GitHub repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
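Because the abstract reports balanced accuracy on class-imbalanced data, a short sketch of the metric may help. Balanced accuracy is the mean of per-class recall, so a rare class counts as much as a common one. The helper `balanced_accuracy` and the toy labels below are made up for illustration and are unrelated to the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]   # samples of class c
        correct = sum(1 for i in idx if y_pred[i] == c)     # correctly recalled
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy imbalanced example: 8 normal (0) and 2 atypical (1) samples.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]          # one atypical sample is missed

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.9, while balanced accuracy is (1.0 + 0.5) / 2 = 0.75, which reflects the missed minority-class sample more clearly.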

[70] A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario cs.CV | cs.LGPDF

Cyrus Addy, Ajay Kumar Gurumadaiah, Yixiang Gao, Kwame Awuah-Offei

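TL;DR: The paper presents a thermal imaging dataset built specifically for underground miner detection, to support the development of reliable miner detection systems.

Details

Motivation: The safety challenges of underground mining call for strong emergency response capabilities. Using robots in search and rescue depends on reliable miner detection, but datasets for underground environments are lacking.

Result: The work demonstrates that thermal imaging is feasible for miner detection and lays a foundation for future research in this safety-critical application.

Insight: Thermal imaging is feasible for underground miner detection, but its effectiveness depends on high-quality datasets and algorithm optimization.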

Abstract: Underground mining operations face significant safety challenges that make emergency response capabilities crucial. While robots have shown promise in assisting with search and rescue operations, their effectiveness depends on reliable miner detection capabilities. Deep learning algorithms offer potential solutions for automated miner detection, but require comprehensive training datasets, which are currently lacking for underground mining environments. This paper presents a novel thermal imaging dataset specifically designed to enable the development and validation of miner detection systems for potential emergency applications. We systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. To establish baseline performance metrics, we evaluated several state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11, and RT-DETR on our dataset. While not exhaustive of all possible emergency situations, this dataset serves as a crucial first step toward developing reliable thermal-based miner detection systems that could eventually be deployed in real emergency scenarios. This work demonstrates the feasibility of using thermal imaging for miner detection and establishes a foundation for future research in this critical safety application.

[71] Evaluation of Traffic Signals for Daily Traffic Pattern cs.CV | cs.LGPDF

Mohammad Shokrolah Shirazi, Hung-Fu Chang

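TL;DR: The paper proposes three TMC-based traffic signal configuration methods (dynamic, static, and hybrid) and evaluates them in simulation on real traffic data, finding that the hybrid method performs well across peak and off-peak periods.

Details

Motivation: Turning movement count (TMC) data is essential for traffic signal design, but conventional methods adapt poorly to the bimodal daily traffic pattern, so more flexible signal configuration is needed.

Result: Cycle times of 90 and 120 seconds work best. Dynamic configuration performs better at four of the six intersections, and the hybrid method performs well when traffic is heavily weighted toward particular zone pairs.

Insight: Zone-based traffic distribution affects the choice of signal design; the hybrid method suits intersections with unevenly distributed traffic.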

Abstract: The turning movement count data is crucial for traffic signal design, intersection geometry planning, traffic flow, and congestion analysis. This work proposes three methods called dynamic, static, and hybrid configuration for TMC-based traffic signals. A vision-based tracking system is developed to estimate the TMC of six intersections in Las Vegas using traffic cameras. The intersection design, route (e.g. vehicle movement directions), and signal configuration files with compatible formats are synthesized and imported into Simulation of Urban MObility for signal evaluation with realistic data. The initial experimental results based on estimated waiting times indicate that the cycle time of 90 and 120 seconds works best for all intersections. In addition, four intersections show better performance for dynamic signal timing configuration, and the other two with lower performance have a lower ratio of total vehicle count to total lanes of the intersection leg. Since daily traffic flow often exhibits a bimodal pattern, we propose a hybrid signal method that switches between dynamic and static methods, adapting to peak and off-peak traffic conditions for improved flow management. So, a built-in traffic generator module creates vehicle routes for 4 hours, including peak hours, and a signal design module produces signal schedule cycles according to static, dynamic, and hybrid methods. Vehicle count distributions are weighted differently for each zone (i.e., West, North, East, South) to generate diverse traffic patterns. The extended experimental results for 6 intersections with 4 hours of simulation time imply that zone-based traffic pattern distributions affect signal design selection. Although the static method works great for evenly zone-based traffic distribution, the hybrid method works well for highly weighted traffic at intersection pairs of the West-East and North-South zones.

[72] Global and Local Entailment Learning for Natural World Imagery cs.CVPDF

Srikumar Sastry, Aayush Dhakal, Eric Xing, Subash Khanal, Nathan Jacobs

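TL;DR: The paper proposes Radial Cross-Modal Embeddings (RCME), a framework that explicitly models transitivity-enforced entailment to optimize the partial order of concepts in vision-language models, and demonstrates gains on hierarchical species classification and retrieval.

Details

Motivation: Existing methods do not explicitly model the transitivity of entailment, which is key to relating order and semantics in a representation space, so a new framework is needed to capture it.

Result: RCME models outperform existing state-of-the-art models on hierarchical species classification and retrieval tasks.

Insight: Explicitly modeling transitive entailment can improve vision-language models on hierarchical tasks and offers a new way to represent complex semantic relations.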

Abstract: Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.

[73] Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection cs.CV | cs.LG | math.PRPDF

Tobias J. Riedlinger, Kira Maag, Hanno Gottschalk

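TL;DR: The paper proposes an object detection model grounded in spatial statistics that uses conditional marked point processes to quantify confidence in regions without detections, addressing the inability of existing methods to assess uncertainty in empty regions; it targets safety-critical applications such as automated driving.

Details

Motivation: Confidence estimates from existing object detectors are often miscalibrated, and detectors cannot quantify uncertainty in regions without detections (empty regions), which is a hazard in safety-critical settings such as automated driving.

Result: Calibration assessments and performance evaluation demonstrate the method's effectiveness.

Insight: Combining spatial statistics with deep learning gives object detection a more reliable form of uncertainty quantification, notably for empty regions.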

Abstract: Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model’s uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.

[74] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration cs.CVPDF

Jiahe Chen, Jiaying He, Qian Shao, Qiyuan Chen, Jiahe Ying

Abstract: Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination, the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. During decoding, DLC employs CLIP at each step to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs in practical use. Code will be released on GitHub.

[75] GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation cs.CVPDF

Wentao Hu, Shunkai Li, Ziqiao Peng, Haoxian Zhang, Fan Shi

Abstract: Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.

[76] G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation cs.CVPDF

Mohammed Rakib, Arunkumar Bagavathi

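TL;DR: G$^{2}$D is a framework that uses Gradient-Guided Distillation to address modality imbalance in multimodal learning, raising the importance of weak modalities through a dynamic sequential modality prioritization (SMP) technique.

Details

Motivation: Conventional multimodal learning suffers from modality imbalance: strong modalities dominate optimization, leaving weak modalities with underdeveloped feature representations.

Result: Validated on multiple real-world datasets, G$^{2}$D markedly amplifies the contribution of weak modalities and outperforms existing methods.

Insight: Dynamically balancing modality priority is key to improving multimodal model performance.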

Abstract: Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.

[77] MADrive: Memory-Augmented Driving Scene Modeling cs.CVPDF

Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna

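TL;DR: MADrive proposes a memory-augmented framework for driving scene modeling that extends existing scene reconstruction methods by replacing observed vehicles with 3D assets retrieved from an external memory bank.

Details

Motivation: Current reconstruction methods for autonomous driving environments achieve high realism but have limited ability to synthesize significantly altered or novel scenarios, so a method that makes reconstruction more flexible is needed.

Result: Experiments show that MADrive supports photorealistic synthesis of substantially altered configurations and provides multi-view vehicle representations.

Insight: Introducing a memory bank gives scene reconstruction greater flexibility and realism, advancing the modeling of autonomous driving environments.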

Abstract: Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ~70K 360° car videos captured in the wild, and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/

Hani Alomari, Anushka Sivakumar, Andrew Zhang, Chris Thomas

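TL;DR: The paper proposes a new cross-modal retrieval method that prevents representation collapse by maximizing one-to-one matching between embedding sets, and combines it with two loss functions to reach SOTA results on MS-COCO and Flickr30k.

Details

Motivation: Traditional single-vector embedding methods struggle to capture the diverse semantic associations across modalities; set-based methods capture richer relationships but still suffer from sparse supervision and set collapse.

Result: The method achieves SOTA performance on MS-COCO and Flickr30k without relying on external data.

Insight: Preserving semantic diversity within each embedding set is key; optimizing one-to-one matching together with the added loss functions improves the robustness of cross-modal retrieval.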

Abstract: Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent semantics of each sample, but struggle to capture nuanced and diverse relationships that can exist across modalities. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative, as they can capture richer and more diverse relationships. In this paper, we show that, despite their promise, these set-based representations continue to face issues including sparse supervision and set collapse, which limit their effectiveness. To address these challenges, we propose Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets, which preserves semantic diversity within the set. We also introduce two loss functions to further enhance the representations: Global Discriminative Loss to enhance distinction among embeddings, and Intra-Set Divergence Loss to prevent collapse within each set. Our method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data.
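
To make the idea of one-to-one matching between embedding sets concrete, here is a minimal sketch. It is an illustration, not the paper's formulation: the function name, the use of cosine similarity, the brute-force search over permutations, and the toy vectors are all assumptions chosen for readability.

```python
from itertools import permutations
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_pair_assignment_similarity(set_a, set_b):
    """Mean similarity under the best one-to-one matching (hypothetical helper).

    Brute force over all permutations, so only suitable for small sets.
    """
    assert len(set_a) == len(set_b), "sketch assumes equally sized sets"
    k = len(set_a)
    sims = [[cosine(a, b) for b in set_b] for a in set_a]
    best_total = max(
        sum(sims[i][perm[i]] for i in range(k)) for perm in permutations(range(k))
    )
    return best_total / k


# Toy data: a diverse set matches its counterpart fully, a collapsed set does not.
text_set = [[0.0, 1.0], [1.0, 0.0]]
diverse_image_set = [[1.0, 0.0], [0.0, 1.0]]
collapsed_image_set = [[1.0, 0.0], [1.0, 0.0]]

diverse_score = max_pair_assignment_similarity(diverse_image_set, text_set)
collapsed_score = max_pair_assignment_similarity(collapsed_image_set, text_set)
```

Because each embedding may be matched only once, a set whose embeddings have collapsed onto one direction cannot cover the other set and scores lower, which is the intuition behind using one-to-one matching to preserve diversity.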

[79] DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion cs.CVPDF

Yansong Qu, Shaohui Dai, Xinyang Li, Yuze Wang, You Shen

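TL;DR: DeOcc-1-to-3 proposes a self-supervised multi-view diffusion method that generates six structurally consistent novel views from a single partially occluded image, directly supporting 3D reconstruction without prior inpainting or manual annotation.

Details

Motivation: Existing methods assume fully visible inputs and cannot handle occlusion, which causes 3D reconstruction to fail. The paper targets 3D reconstruction under the partial occlusions found in real scenes.

Result: The method generates consistent multi-view images under occlusion, markedly improves 3D reconstruction quality, and provides a standardized evaluation protocol.

Insight: Combining self-supervised training with pseudo-ground-truth views is key to handling occlusion; fine-tuning an existing model is an efficient way to learn completion and multi-view generation jointly.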

Abstract: Reconstructing 3D objects from a single image is a long-standing challenge, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and fail when parts of the object are occluded. This leads to inconsistent views and degraded 3D reconstruction quality. To overcome this limitation, we propose an end-to-end framework for occlusion-aware multi-view generation. Our method directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without requiring prior inpainting or manual annotations. We construct a self-supervised training pipeline using the Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, encompassing diverse occlusion levels, object categories, and mask patterns. This benchmark provides a standardized protocol for evaluating future methods under partial occlusions. Our code is available at https://github.com/Quyans/DeOcc123.

[80] SAM4D: Segment Anything in Camera and LiDAR Streams cs.CV | cs.ROPDF

Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang

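TL;DR: SAM4D is a multi-modal temporal foundation model for promptable segmentation across camera and LiDAR streams. Through unified cross-modal alignment and motion-aware attention, it addresses segmentation in dynamic autonomous driving scenes, and it also introduces an efficient data annotation approach.

Details

Motivation: In autonomous driving, cross-modal segmentation and temporal consistency across camera and LiDAR data are key challenges for reliable perception. Existing methods fall short on cross-modal alignment and on handling dynamic scenes.

Result: Experiments on Waymo-4DSeg demonstrate the model's strong performance in cross-modal segmentation and data annotation.

Insight: (1) Multi-modal alignment and temporal modeling are key; (2) an automated data engine can markedly improve annotation efficiency.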

Abstract: We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.

[81] SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark cs.CVPDF

Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva

Abstract: We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.

[82] Whole-Body Conditioned Egocentric Video Prediction cs.CV | cs.AI | cs.LG | cs.MM | cs.ROPDF

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell

Abstract: We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model’s embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.

cs.CL [Back]

[83] Towards Probabilistic Question Answering Over Tabular Data cs.CL | 68T50, 68T37 | I.2.7PDF

Chen Shen, Sajjadur Rahman, Estevam Hruschka

Abstract: Current approaches for question answering (QA) over tabular data, such as NL2SQL systems, perform well for factual questions where answers are directly retrieved from tables. However, they fall short on probabilistic questions requiring reasoning under uncertainty. In this paper, we introduce a new benchmark LUCARIO and a framework for probabilistic QA over large tabular data. Our method induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers. Empirical results demonstrate significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.

[84] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas cs.CL | cs.AI | cs.CY | cs.HC | cs.LGPDF

Chenglei Si, Tatsunori Hashimoto, Diyi Yang

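TL;DR: The study shows that although LLM-generated research ideas are initially judged more novel, their scores on multiple evaluation metrics drop significantly once the ideas are executed, revealing a gap relative to ideas from human experts. This exposes the limitations of current LLMs in generating truly effective research ideas.

Details

Motivation: LLMs currently do well at generating novel research ideas, but it is unclear whether those ideas keep their advantage after execution. The study tests whether LLM-generated ideas lead to better research outcomes.

Result: After execution, scores of LLM-generated ideas dropped significantly on all evaluation metrics (novelty, excitement, effectiveness, and overall), and human ideas even overtook them on some metrics.

Insight: Initial evaluation alone cannot fully validate a research idea; actual execution outcomes are the more reliable measure.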

Abstract: Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

[85] MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering cs.CL | cs.AI | cs.CE | 68T50, 68T07 (Primary) 68P20, 91G15, 91G70, 68U10 (Secondary) | I.2.7; I.2.10; H.3.3; H.2.8; I.5.4; J.1PDF

Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh

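TL;DR: MultiFinRAG is a multimodal retrieval-augmented generation (RAG) framework optimized for financial question answering; through multimodal extraction and modality-aware retrieval, it markedly improves QA accuracy on complex financial documents.

Details

Motivation: Financial documents (such as 10-Ks and 10-Qs) mix several modalities (text, tables, images) and are long and dense; conventional RAG frameworks struggle with them because of token limits and fragmented cross-modal context.

Result: On complex financial QA tasks, MultiFinRAG is 19 percentage points more accurate than free-tier ChatGPT-4o.

Insight: Joint multimodal reasoning and dynamic context adjustment are key to handling complex financial documents; a lightweight model combined with an optimized retrieval strategy can deliver strong performance on limited hardware.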

Abstract: Financial documents–such as 10-Ks, 10-Qs, and investor presentations–span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.
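
The tiered fallback described in the abstract can be sketched roughly as follows. This is a hedged illustration only: the `Chunk` type, the threshold values, the `min_hits` rule, and the function name are hypothetical and do not come from the paper, and similarity scores are assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    modality: str  # "text", "table", or "image"
    content: str
    score: float   # similarity to the query, assumed precomputed


# Hypothetical per-modality similarity thresholds (not values from the paper).
THRESHOLDS = {"text": 0.70, "table": 0.65, "image": 0.60}

# Tiers are tried in order: text-only first, then text + table + image.
TIERS = [("text",), ("text", "table", "image")]


def select_context(chunks, min_hits=2):
    """Return (tier used, chunks kept), escalating when too few chunks pass."""
    kept = []
    for tier in TIERS:
        kept = [
            c for c in chunks
            if c.modality in tier and c.score >= THRESHOLDS[c.modality]
        ]
        if len(kept) >= min_hits:
            return tier, kept
    # Nothing met min_hits: fall back to whatever the widest tier found.
    return TIERS[-1], kept


# Enough strong text hits: the text-only tier suffices.
tier_a, kept_a = select_context([
    Chunk("text", "revenue discussion", 0.90),
    Chunk("text", "risk factors", 0.80),
    Chunk("table", "income statement", 0.90),
])

# Only one strong text hit: escalate and admit the table, but not the weak image.
tier_b, kept_b = select_context([
    Chunk("text", "revenue discussion", 0.90),
    Chunk("table", "income statement", 0.90),
    Chunk("image", "segment chart", 0.50),
])
```

Starting with the narrowest tier keeps irrelevant context out of the prompt when text alone is sufficient, which matches the abstract's stated goal of reducing irrelevant context.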

[86] Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes cs.CL | cs.AIPDF

Quintin Myers, Yanjun Gao

Abstract: Large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, yet their ability to reason about morally ambiguous, real-world scenarios remains underexamined. We present the first study to evaluate LLMs using a validated social science instrument designed to measure human response to everyday conflict, namely the Violent Behavior Vignette Questionnaire (VBVQ). To assess potential bias, we introduce persona-based prompting that varies race, age, and geographic identity within the United States. Six LLMs developed across different geopolitical and organizational contexts are evaluated under a unified zero-shot setting. Our study reveals two key findings: (1) LLMs surface-level text generation often diverges from their internal preference for violent responses; (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology.

[87] Optimising Language Models for Downstream Tasks: A Post-Training Perspective cs.CL | cs.AIPDF

Zhengyan Shi

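TL;DR: The thesis proposes a series of methods for improving how language models (LMs) adapt to downstream tasks, including exploiting unlabelled data, parameter-efficient fine-tuning, and improved supervised fine-tuning, which substantially improve robustness, efficiency, and generalization.

Details

Motivation: As LMs grow in scale and complexity, conventional fine-tuning underuses unlabelled data, overfits small task-specific sets, and is computationally expensive, which limits practical use. More efficient adaptation methods are therefore needed.

Result: Experiments across diverse NLP tasks show that these methods substantially improve LM robustness, efficiency, and generalization, making the models adaptable to a broad range of applications.

Insight: Better adaptation strategies can both raise task-specific performance and lower compute cost, laying groundwork for the broader goal of artificial general intelligence.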

Abstract: Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications. These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.

[88] FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language cs.CLPDF

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan

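TL;DR: FineWeb2 is a pipeline for building multilingual pre-training datasets that adapts automatically to any language; with improved data processing and deduplication strategies, the resulting models outperform those trained on prior datasets.

Details

Motivation: Multilingual LLM pre-training currently faces difficulties in data processing and deduplication, in particular how to tailor efficient processing pipelines to different languages.

Result: Experiments show that models trained on the pipeline's corpora outperform those trained on prior datasets, and the pipeline is scaled to more than 1,000 languages.

Insight: Optimizing data quality and deduplication strategy is critical to multilingual model performance, and a scalable automated pipeline can markedly improve efficiency.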

Abstract: Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.

[89] Large Language Models Acing Chartered Accountancy cs.CL | cs.AIPDF

Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo

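TL;DR: The paper introduces the CA-Ben benchmark for evaluating the accounting, legal, and quantitative reasoning abilities of LLMs. It tests six prominent models and exposes their difficulties with numerical computation and legal interpretation. Claude 3.5 Sonnet and GPT-4o perform best.

Details

Motivation: The aim is to assess the potential of LLMs in professional finance, particularly within India's complex accounting and legal environment, where current model performance is unclear.

Result: Claude 3.5 Sonnet and GPT-4o perform best, but the models show notable weaknesses in numerical computation and legal interpretation.

Insight: Hybrid reasoning and retrieval-augmented generation are suggested as ways to improve LLMs on quantitative analysis and legal interpretation.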

Abstract: Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs (GPT-4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4) were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.

[90] MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection cs.CLPDF

Fuqiang Niu, Genan Dai, Yisha Lu, Jiayu Liao, Xiang Li

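TL;DR: The paper presents the MT2-CSD dataset and the LLM-CRAN method for stance detection in multi-target, multi-turn conversations. MT2-CSD is currently the largest stance detection dataset of its kind and has the greatest conversational depth, while LLM-CRAN draws on the reasoning abilities of large language models to significantly improve performance.

Details

Motivation: Traditional stance detection research usually targets individual instances, whereas real social media discussions are mostly multi-turn and involve multiple targets; existing datasets fail to capture this dynamic interaction. The paper proposes a new dataset and method to address the gap.

Result: Experiments show that LLM-CRAN significantly outperforms baseline models on MT2-CSD.

Insight: Multi-turn, multi-target conversational stance detection calls for richer datasets and methods, and drawing on LLM reasoning effectively improves performance.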

Abstract: In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.

[91] DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning cs.CLPDF

Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji

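TL;DR: The paper proposes Dual-level Alignment Learning (DALR), which uses fine-grained cross-modal alignment and global intra-modal alignment learning to address cross-modal misalignment bias and intra-modal semantic divergence in multimodal sentence representation learning, markedly improving representation quality.

Details

Motivation: Most existing multimodal sentence representation methods align images and text only at a coarse level, which leads to cross-modal misalignment bias and intra-modal semantic divergence and limits representation quality.

Result: Experiments on semantic textual similarity (STS) and transfer (TR) tasks show that the method outperforms existing baselines.

Insight: Sentence relationships are not just binary positive/negative labels but have a more intricate ranking structure; dual-level alignment markedly improves multimodal representation quality.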

Abstract: Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.

[92] Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents cs.CL | cs.AIPDF

Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu

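TL;DR: The paper proposes Agent-RewardBench, a benchmark for evaluating the reward modeling ability of multimodal large language models (MLLMs) across perception, planning, and safety; experiments show that even current state-of-the-art multimodal models perform only modestly.

Details

Motivation: Multimodal agents struggle with self-correction and generalization in real-world tasks because they lack external feedback. Reward models are a promising remedy, but there is no established guidance for choosing reward models for agents, so a dedicated benchmark is needed.

Result: Experiments show that even state-of-the-art multimodal models have limited reward modeling performance, indicating a need for specialized training.

Insight: Reward modeling is key to improving multimodal agents, but existing models still need work; Agent-RewardBench gives a concrete way to evaluate progress.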

Abstract: As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenges, and use manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available at github.

[93] Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning cs.CLPDF

Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li

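TL;DR: Double-Checker uses a self-critical fine-tuning framework to improve the reasoning of slow-thinking LLMs, enabling them to refine their own outputs iteratively.

Details

Motivation: Slow-thinking LLMs exhibit reflection-like reasoning (the “aha moment”), but their ability to generate critical feedback and refine solutions remains limited.

Result: On AIME benchmarks, pass@1 rises from 4.4% to 18.2%.

Insight: Structured self-critique is an effective direction for improving the reasoning ability and trustworthiness of LLMs.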

Abstract: While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the “aha moment”, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
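
The inference-time behavior described in this abstract (critique, then refine, until the model judges its own solution correct) can be pictured as a simple loop. The following is a minimal sketch, not the authors' implementation: `solve`, `critique`, and `refine` are hypothetical callables standing in for model calls, and `max_rounds` is an assumed safety cap that the abstract does not mention.

```python
from typing import Callable, Tuple


def double_check(
    problem: str,
    solve: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,  # assumed cap, not taken from the paper
) -> str:
    """Critique and refine a solution until the critique accepts it."""
    solution = solve(problem)
    for _ in range(max_rounds):
        is_correct, feedback = critique(problem, solution)
        if is_correct:
            break  # the self-generated critique judges the solution correct
        solution = refine(problem, solution, feedback)
    return solution


# Toy stand-ins for the three model calls (purely illustrative).
def toy_solve(problem: str) -> str:
    return "5"


def toy_critique(problem: str, solution: str) -> Tuple[bool, str]:
    ok = solution == "4"
    return ok, "" if ok else "2 + 2 is 4, not 5"


def toy_refine(problem: str, solution: str, feedback: str) -> str:
    return "4"
```

With the toy stand-ins, `double_check("2 + 2", toy_solve, toy_critique, toy_refine)` runs one critique round, refines the answer once, and stops when the second critique accepts it.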

[94] Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models cs.CL | cs.AIPDF

Bram Willemsen, Gabriel Skantze

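TL;DR: This paper examines how well a text-only, autoregressive language modeling approach extracts referring expressions from visually grounded dialogue. It finds that linguistic context alone can be effective, but argues that multimodal approaches are still needed.

Details

Motivation: The study investigates whether linguistic context alone suffices to identify referring expressions in visually grounded dialogue, which would reduce reliance on the visual modality.

Result: Even with a moderately sized LLM and relatively small datasets, the text-only approach performs well, highlighting the importance of linguistic context.

Insight: The task is inherently multimodal and unimodal approaches have fundamental limitations, yet linguistic context still provides useful signal.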

Abstract: In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.

[95] Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models cs.CLPDF

Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen

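TL;DR: This paper proposes GLASS, a framework based on the Greimas Semiotic Square (GSS) that helps large language models (LLMs) conduct in-depth literary analysis. With the first GSS-based literary criticism dataset and quantitative evaluation metrics, GLASS shows high performance across multiple works and addresses gaps in literary research.

Details

Motivation: LLMs excel at understanding and generating text but lack professional literary criticism skills when handling works with profound ideas and complex narratives. This motivates GLASS, a framework meant to improve model performance in this area.

Result: GLASS performs well across multiple works and LLMs, producing original, high-quality literary analyses that address existing research gaps.

Insight: 1. GLASS provides an AI tool for literary research and education; 2. it offers insight into the cognitive mechanisms underlying literary engagement; 3. it demonstrates the potential of structured methods for improving LLMs in specialized domains.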

Abstract: Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs’ ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework’s results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.

[96] Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation cs.CL | cs.AI | cs.IRPDF

Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng

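TL;DR: This paper proposes Omni-RAG, a framework that uses LLM-assisted query understanding to improve the robustness and effectiveness of live retrieval-augmented generation (RAG) systems on complex, ambiguous queries.

Details

Motivation: Real-world live RAG systems face user queries that are noisy, ambiguous, and multi-intent, while existing systems are usually trained or evaluated on clean data and struggle with such inputs.

Result: Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as the complex query handling highlighted by the SIGIR 2025 LiveRAG Challenge.

Insight: LLM-assisted query preprocessing and structured sub-query generation enable robust handling of complex queries, offering a new solution for open-domain live RAG systems.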

Abstract: Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.
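
The three modules listed in this abstract compose into a straightforward pipeline. Below is a rough sketch of that control flow only, under stated assumptions: `decompose`, `retrieve`, `rerank`, and `generate` are hypothetical callables (not the paper's API), `top_k` is an assumed parameter, and deduplicating documents across sub-queries is one guess at what "aggregates the results" might mean.

```python
from typing import Callable, List


def omni_rag_sketch(
    query: str,
    decompose: Callable[[str], List[str]],      # module 1: denoise + split intents
    retrieve: Callable[[str], List[str]],       # module 2: per-sub-query retrieval
    rerank: Callable[[str, List[str]], List[str]],  # module 3a: reorder documents
    generate: Callable[[str, List[str]], str],  # module 3b: final answer
    top_k: int = 3,  # assumed cutoff, not taken from the paper
) -> str:
    sub_queries = decompose(query)
    candidates: List[str] = []
    seen = set()
    for sub_query in sub_queries:
        for doc in retrieve(sub_query):
            if doc not in seen:  # aggregate without duplicates (assumption)
                seen.add(doc)
                candidates.append(doc)
    ranked = rerank(query, candidates)[:top_k]
    return generate(query, ranked)


# Toy components so the sketch runs end to end (purely illustrative).
TOY_CORPUS = {
    "capital of france": ["Paris is the capital of France."],
    "population of paris": [
        "Paris has about two million residents.",
        "Paris is the capital of France.",
    ],
}


def toy_decompose(query: str) -> List[str]:
    return [part.strip().lower() for part in query.split(" and ")]


def toy_retrieve(sub_query: str) -> List[str]:
    return TOY_CORPUS.get(sub_query, [])


def toy_rerank(query: str, docs: List[str]) -> List[str]:
    return sorted(docs, key=len)


def toy_generate(query: str, docs: List[str]) -> str:
    return " ".join(docs)
```

Passing the stages in as callables keeps the sketch independent of any particular retriever, reranker, or LLM.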

[97] Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection cs.CL | cs.AIPDF

Ali Şenol, Garima Agrawal, Huan Liu

TL;DR: Error

Details

Motivation: Error

Result: Error

Insight: Error

Abstract: Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

[98] Bridging Offline and Online Reinforcement Learning for LLMs cs.CLPDF

Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha

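TL;DR: This paper studies reinforcement learning methods for large language models across offline, semi-online, and fully online regimes and compares different optimization objectives. Online and semi-online methods outperform offline methods, and multi-task learning improves performance further.

Details

Motivation: To examine how reinforcement learning methods perform when fine-tuning large language models, especially in the transition from offline to online regimes, and to provide optimization strategies for both verifiable and non-verifiable tasks.

Result: Online and semi-online methods perform similarly across all tasks and outperform offline methods; multi-task learning further improves performance.

Insight: Online and semi-online reinforcement learning methods work well for fine-tuning large language models, and multi-task learning is an effective optimization strategy.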

Abstract: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
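
The abstract compares Direct Preference Optimization objectives without writing them out. For orientation, the standard per-pair DPO loss is -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. The sketch below computes it for one pair; the argument names and the default `beta=0.1` are illustrative assumptions, and nothing here reflects the paper's specific online or semi-online training setup.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not from the paper
) -> float:
    """Standard DPO loss for a single (chosen, rejected) response pair."""
    # How much more the policy prefers each response than the reference does.
    chosen_advantage = policy_chosen_logp - ref_chosen_logp
    rejected_advantage = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_advantage - rejected_advantage)
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls below log 2 once the policy favors the chosen response more strongly than the reference does.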

[99] Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments cs.CLPDF

Jiashuo Wang, Kaitao Song, Chunpu Xu, Changhe Song, Yang Xiao

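TL;DR: This paper proposes a method for enhancing user engagement in socially-driven dialogue by aligning interactive LLMs, using user reactions as the reward signal and improving the LLM through i×MCTS and direct preference optimization (DPO).

Details

Motivation: Prior work optimizes models to reason over relevant knowledge or to plan dialogue acts, but this does not directly guarantee user engagement in socially-driven dialogue. The authors therefore align interactive LLMs using a more direct indicator, namely the user's reaction.

Result: Experiments in two scenarios, emotional support conversations and persuasion dialogues, show that the method significantly improves user engagement.

Insight: User reactions can serve as a direct and effective reward signal for optimizing LLMs in socially-driven dialogue.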

Abstract: Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user’s reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a dataset containing pairs of higher and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.

cs.GR [Back]

[100] Consistent Zero-shot 3D Texture Synthesis Using Geometry-aware Diffusion and Temporal Video Models cs.GR | cs.AI | cs.CV | 68T45, 68U05 | I.3.7; I.4.10; I.2.10PDF

Donggoo Kang, Jangyeong Kim, Dasol Jeong, Junyoung Choi, Jeonga Wi

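TL;DR: VideoTex proposes a new 3D texture synthesis framework that combines video generation models with geometry-aware conditions to address spatial and temporal inconsistencies in texture synthesis.

Details

Motivation: Current texture synthesis methods are restricted to fixed viewpoints and lack global context and geometric understanding, which leads to inconsistent results. This paper leverages video generation models for texture synthesis with stronger spatio-temporal consistency.

Result: Experiments show that VideoTex outperforms existing methods in texture fidelity, seam blending, and stability, making it suitable for dynamic real-time applications.

Insight: Combining the temporal consistency of video generation with geometry-aware information effectively improves the quality and consistency of 3D texture synthesis.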

Abstract: Current texture synthesis methods, which generate textures from fixed viewpoints, suffer from inconsistencies due to the lack of global context and geometric understanding. Meanwhile, recent advancements in video generation models have demonstrated remarkable success in achieving temporally consistent videos. In this paper, we introduce VideoTex, a novel framework for seamless texture synthesis that leverages video generation models to address both spatial and temporal inconsistencies in 3D textures. Our approach incorporates geometry-aware conditions, enabling precise utilization of 3D mesh structures. Additionally, we propose a structure-wise UV diffusion strategy, which enhances the generation of occluded areas by preserving semantic information, resulting in smoother and more coherent textures. VideoTex not only achieves smoother transitions across UV boundaries but also ensures high-quality, temporally stable textures across video frames. Extensive experiments demonstrate that VideoTex outperforms existing methods in texture fidelity, seam blending, and stability, paving the way for dynamic real-time applications that demand both visual quality and temporal coherence.

eess.IV [Back]

[101] Global and Local Contrastive Learning for Joint Representations from Cardiac MRI and ECG eess.IV | cs.AI | cs.CV | eess.SPPDF

Alexander Selivanov, Philip Müller, Özgün Turgut, Nil Stolt-Ansó, Daniel Rückert

TL;DR: Error

Details

Motivation: Error

Result: Error

Insight: Error

Abstract: An electrocardiogram (ECG) is a widely used, cost-effective tool for detecting electrical abnormalities in the heart. However, it cannot directly measure functional parameters, such as ventricular volumes and ejection fraction, which are crucial for assessing cardiac function. Cardiac magnetic resonance (CMR) is the gold standard for these measurements, providing detailed structural and functional insights, but is expensive and less accessible. To bridge this gap, we propose PTACL (Patient and Temporal Alignment Contrastive Learning), a multimodal contrastive learning framework that enhances ECG representations by integrating spatio-temporal information from CMR. PTACL uses global patient-level contrastive loss and local temporal-level contrastive loss. The global loss aligns patient-level representations by pulling ECG and CMR embeddings from the same patient closer together, while pushing apart embeddings from different patients. Local loss enforces fine-grained temporal alignment within each patient by contrasting encoded ECG segments with corresponding encoded CMR frames. This approach enriches ECG representations with diagnostic information beyond electrical activity and transfers more insights between modalities than global alignment alone, all without introducing new learnable weights. We evaluate PTACL on paired ECG-CMR data from 27,951 subjects in the UK Biobank. Compared to baseline approaches, PTACL achieves better performance in two clinically relevant tasks: (1) retrieving patients with similar cardiac phenotypes and (2) predicting CMR-derived cardiac function parameters, such as ventricular volumes and ejection fraction. Our results highlight the potential of PTACL to enhance non-invasive cardiac diagnostics using ECG. The code is available at: https://github.com/alsalivan/ecgcmr
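
The global loss described here pulls same-patient ECG and CMR embeddings together and pushes different patients apart. The abstract does not give the formula, so the sketch below uses a common choice for this kind of objective, a symmetric InfoNCE (CLIP-style) loss; that choice, the function names, and the `temperature` value are all assumptions rather than PTACL's actual definition. Row `i` of each input is taken to belong to patient `i`.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def global_contrastive_loss(
    ecg: List[Sequence[float]],
    cmr: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value
) -> float:
    """Symmetric InfoNCE over patient-level embeddings (assumed form)."""
    ecg_n = [_normalize(v) for v in ecg]
    cmr_n = [_normalize(v) for v in cmr]
    # Cosine similarity between every ECG and every CMR embedding.
    logits = [
        [sum(a * b for a, b in zip(e, c)) / temperature for c in cmr_n]
        for e in ecg_n
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the ECG-to-CMR and CMR-to-ECG directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))
```

The local temporal loss could plausibly reuse the same form with ECG segments and CMR frames of a single patient in place of patients, though the abstract does not confirm this.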

[102] U-R-VEDA: Integrating UNET, Residual Links, Edge and Dual Attention, and Vision Transformer for Accurate Semantic Segmentation of CMRs eess.IV | cs.AI | cs.CV | cs.LG | I.4.6; I.2; I.5.2; I.5.1PDF

Racheal Mukisa, Arvind K. Bansal

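TL;DR: The paper proposes U-R-Veda, a model that integrates UNet, residual links, edge and dual attention mechanisms, and a Vision Transformer for accurate semantic segmentation of cardiac magnetic resonance (CMR) images, reaching an average accuracy of 95.2% based on DSC metrics.

Details

Motivation: Automated, accurate delineation of cardiac images is a necessary step for the quantitative diagnosis of cardiac disorders, and existing methods fall short when segmenting the right ventricle and the left-ventricle myocardium.

Result: The model outperforms other models on DSC and HD metrics, especially for the right ventricle and left-ventricle myocardium, with an average accuracy of 95.2%.

Insight: Embedding dual attention and edge-detection-based skip connections effectively reduces information loss during convolution transformations and improves segmentation accuracy.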

Abstract: Artificial intelligence, including deep learning models, will play a transformative role in automated medical image analysis for the diagnosis of cardiac disorders and their management. Automated accurate delineation of cardiac images is the necessary first step for the quantification and automated diagnosis of cardiac disorders. In this paper, we propose a deep learning based enhanced UNet model, U-R-Veda, which integrates convolution transformations, vision transformer, residual links, channel-attention, and spatial attention, together with edge-detection based skip-connections for an accurate fully-automated semantic segmentation of cardiac magnetic resonance (CMR) images. The model extracts local-features and their interrelationships using a stack of combination convolution blocks, with embedded channel and spatial attention in the convolution block, and vision transformers. Deep embedding of channel and spatial attention in the convolution block identifies important features and their spatial localization. The combined edge information with channel and spatial attention as skip connection reduces information-loss during convolution transformations. The overall model significantly improves the semantic segmentation of CMR images necessary for improved medical image analysis. An algorithm for the dual attention module (channel and spatial attention) has been presented. Performance results show that U-R-Veda achieves an average accuracy of 95.2%, based on DSC metrics. The model outperforms the accuracy attained by other models, based on DSC and HD metrics, especially for the delineation of right-ventricle and left-ventricle-myocardium.
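
The 95.2% figure above is reported in terms of DSC, the Dice similarity coefficient, defined as 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch for flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * intersection / total
```

For multi-class segmentation, one would typically compute this per structure (for example right ventricle, left-ventricle myocardium) and average the results.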

[103] Development of MR spectral analysis method robust against static magnetic field inhomogeneity eess.IV | cs.CVPDF

Shuki Maruyama, Hidenori Takeshima

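TL;DR: The paper proposes a new MR spectral analysis method that uses a deep learning model trained with modeled spectra to improve analysis accuracy under static magnetic field inhomogeneity.

Details

Motivation: Static magnetic field (B0) inhomogeneity degrades the accuracy of MR spectral analysis, so a method that is robust to it is needed.

Result: The model trained on measured plus modeled spectra reduced MSE by 49.89% compared with training on measured spectra alone, and by 26.66% compared with training on measured plus simulated spectra. It also outperformed the conventional LCModel method.

Insight: Adding modeled spectra to the training samples can effectively improve the robustness and accuracy of MR spectral analysis, particularly under static magnetic field inhomogeneity.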

Abstract: Purpose: To develop a method that enhances the accuracy of spectral analysis in the presence of static magnetic field B0 inhomogeneity. Methods: The authors proposed a new spectral analysis method utilizing a deep learning model trained on modeled spectra that consistently represent the spectral variations induced by B0 inhomogeneity. These modeled spectra were generated from the B0 map and metabolite ratios of the healthy human brain. The B0 map was divided into a patch size of subregions, and the separately estimated metabolites and baseline components were averaged and then integrated. The quality of the modeled spectra was visually and quantitatively evaluated against the measured spectra. The analysis models were trained using measured, simulated, and modeled spectra. The performance of the proposed method was assessed using mean squared errors (MSEs) of metabolite ratios. The mean absolute percentage errors (MAPEs) of the metabolite ratios were also compared to LCModel when analyzing the phantom spectra acquired under two types of B0 inhomogeneity. Results: The modeled spectra exhibited broadened and narrowed spectral peaks depending on the B0 inhomogeneity and were quantitatively close to the measured spectra. The analysis model trained using measured spectra with modeled spectra improved MSEs by 49.89% compared to that trained using measured spectra alone, and by 26.66% compared to that trained using measured spectra with simulated spectra. The performance improved as the number of modeled spectra increased from 0 to 1,000. This model showed significantly lower MAPEs than LCModel under both types of B0 inhomogeneity. Conclusion: A new spectral analysis-trained deep learning model using the modeled spectra was developed. The results suggest that the proposed method has the potential to improve the accuracy of spectral analysis by increasing the training samples of spectra.

cs.SD [Back]

[104] Exploring Adapter Design Tradeoffs for Low Resource Music Generation cs.SD | cs.AI | cs.CL | cs.LG | cs.MM | eess.ASPDF

Atharva Mehta, Shivam Chauhan, Monojit Choudhury

TL;DR: Error

Details

Motivation: Error

Result: Error

Insight: Error

Abstract: Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given case of low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt while lacking in providing stability in notes, rhythm alignment, and aesthetics. Also, it is computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen offer faster training and are more efficient, and can produce better quality output in comparison, but have slightly higher redundancy in their generations.

cs.AI [Back]

[105] Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? cs.AI | cs.CL | cs.LGPDF

Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan

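TL;DR: This study finds that current large language models (LLMs) can only perform shallow (level-1) causal reasoning and cannot achieve human-like deep (level-2) reasoning. The authors propose G^2-Reasoner, a method that combines general knowledge with goal-oriented prompts and significantly improves LLMs on causal reasoning tasks.

Details

Motivation: Although current LLMs appear to do well on causal reasoning tasks, it remains unclear whether they truly possess human-like causal reasoning ability. The study tests whether their causal reasoning is merely shallow and proposes a way to improve it.

Result: LLMs show a significant performance drop on CausalProbe-2024, while G^2-Reasoner significantly improves their causal reasoning in fresh and counterfactual contexts.

Insight: The study exposes the limits of LLM causal reasoning and proposes improving it with external knowledge and goal-oriented prompts, suggesting a new path toward deeper reasoning.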

Abstract: Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the contrary. Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning. To support this hypothesis, methodologically, we delve into the autoregression mechanism of transformer-based LLMs, revealing that it is not inherently causal. Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs’ causal reasoning processes. Experiments demonstrate that G^2-Reasoner significantly enhances LLMs’ causal reasoning capability, particularly in fresh and counterfactual contexts. This work sheds light on a new path for LLMs to advance towards genuine causal reasoning, going beyond level-1 and making strides towards level-2.

[106] Spatial Mental Modeling from Limited Views cs.AI | cs.CL | cs.CVPDF

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang

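TL;DR: The paper introduces the MindCube benchmark, which exposes the shortcomings of existing vision language models (VLMs) in spatial mental modeling, and explores three approaches for helping VLMs approximate spatial mental models. Of these, the “map-then-reason” approach improves performance significantly.

Details

Motivation: The study asks whether VLMs can, like humans, build a complete spatial mental model from a few views in order to reason about layout, perspective, and motion. Existing VLMs perform near random on this task, so new solutions are needed.

Result: The “map-then-reason” approach raises accuracy from 37.8% to 60.8% (+23.0%), and adding reinforcement learning pushes it further to 70.7% (+32.9%).

Insight: Actively constructing and using structured spatial representations (such as cognitive maps), combined with flexible reasoning, significantly improves VLMs' understanding of unobservable space.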

Abstract: Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for “what-if” movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, “map-then-reason”, that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.

cs.LG [Back]

[107] SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes cs.LG | cs.CL | cs.CVPDF

Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj

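TL;DR: SharpZO proposes a hybrid zeroth-order optimization method that improves vision language model fine-tuning through sharpness-aware warm-up training, using forward passes only.

Details

Motivation: Fine-tuning vision language models currently relies on backpropagation, which makes deployment on memory-constrained edge devices difficult, and existing backpropagation-free methods underperform.

Result: On CLIP models, SharpZO achieves up to a 7% average accuracy gain over state-of-the-art forward-only methods, with faster convergence.

Insight: Sharpness-aware warm-up smooths the loss landscape and provides a strong initialization for zeroth-order optimization, mitigating the high-variance problem.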

Abstract: Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making them unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO) optimization, and often fail to achieve satisfactory performance. In this paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning via a sharpness-aware warm-up training. SharpZO features a two-stage optimization process: a sharpness-aware ES stage that globally explores and smooths the loss landscape to construct a strong initialization, followed by a fine-grained local search via sparse ZO optimization. The entire optimization relies solely on forward passes. Detailed theoretical analysis and extensive experiments on CLIP models demonstrate that SharpZO significantly improves accuracy and convergence speed, achieving up to 7% average gain over state-of-the-art forward-only methods.
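
To make "relies solely on forward passes" concrete: zeroth-order methods estimate a gradient from loss evaluations alone. The sketch below shows the generic two-point random-direction estimator, g ≈ [(f(x + μu) - f(x - μu)) / (2μ)] · u averaged over random directions u. It is a textbook ZO estimator, not SharpZO's sharpness-aware ES stage or its sparse ZO variant, and `mu`, `num_samples`, `lr`, and `steps` are illustrative values.

```python
import random
from typing import Callable, List, Optional, Sequence


def zo_gradient(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    mu: float = 1e-3,
    num_samples: int = 20,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Two-point zeroth-order gradient estimate using only evaluations of f."""
    if rng is None:
        rng = random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]  # random probe direction
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Finite-difference estimate of the directional derivative along u.
        scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += scale * ui / num_samples
    return grad


def zo_minimize(
    f: Callable[[Sequence[float]], float],
    x0: Sequence[float],
    lr: float = 0.05,
    steps: int = 200,
    seed: int = 0,
) -> List[float]:
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

Each step costs `2 * num_samples` forward evaluations and no backward pass; the variance of the estimate grows with dimension, which is the weakness the abstract attributes to prior ZO approaches.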

[108] Complexity-aware fine-tuning cs.LG | cs.CLPDF

Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev

TL;DR: Error

Details

Motivation: Error

Result: Error

Insight: Error

Abstract: General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($\approx 3B$) we split the training data into complexity categories by a single token answer entropy (ROC AUC $0.73$), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.55$ vs $0.43$ average accuracy) and achieves performance comparable to distillation while using $62\%$ less data ($0.55$ average accuracy for both). We publish our code and data to facilitate further research in this direction.
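
The splitting step in this abstract rests on the entropy of the model's distribution over a single answer token, H = -Σ p_i log p_i: a peaked distribution suggests an easy example and a flat one a hard example. The sketch below illustrates that idea with a single threshold; the two-way easy/hard split, the `probs` field name, and the threshold are assumptions for illustration, since the abstract does not say how many categories the authors use or where the boundaries lie.

```python
import math
from typing import Dict, List, Sequence, Tuple


def answer_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a distribution over answer tokens."""
    total = sum(probs)  # renormalize in case only the top-k tokens are given
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def split_by_complexity(
    examples: List[Dict], threshold: float
) -> Tuple[List[Dict], List[Dict]]:
    """Route low-entropy examples to 'easy' and high-entropy ones to 'hard'."""
    easy: List[Dict] = []
    hard: List[Dict] = []
    for example in examples:
        if answer_entropy(example["probs"]) > threshold:
            hard.append(example)  # candidate for reasoning-based distillation
        else:
            easy.append(example)  # candidate for plain SFT
    return easy, hard
```

In a pipeline like the one described, only the hard bucket would go through the expensive chain-of-thought distillation, which is where the reported data savings would come from.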

[109] DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster cs.LG | cs.AI | cs.CLPDF

Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu

TL;DR: Error

Details

Motivation: Error

Result: Error

Insight: Error

Abstract: The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.

[110] Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers cs.LG | cs.CR | cs.CVPDF

Furkan Mumcu, Yasin Yilmaz

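TL;DR: The paper proposes a universal and efficient method for detecting adversarial examples by analyzing the nonuniform impact of adversarial attacks on different DNN layers.

Details

Motivation: Deep neural networks (DNNs) are well known to be vulnerable to adversarial attacks. Existing defenses either focus on improving robustness or rely on complex detection models, and they are either of limited effectiveness or computationally expensive. A more practical detection method suited to real-time processing is therefore needed.

Result: Experiments show that the method is highly effective at detecting adversarial examples and computationally efficient enough for real-time processing. It also performs consistently across multiple DNN architectures and domains.

Insight: Adversarial attacks affect different DNN layers nonuniformly, and this nonuniformity can serve as a key feature for detecting adversarial examples. Quantifying it with a lightweight model is an efficient and general solution.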

Abstract: Deep Neural Networks (DNNs) are notoriously vulnerable to adversarial input designs with limited noise budgets. While numerous successful attacks with subtle modifications to original input have been proposed, defense techniques against these attacks are relatively understudied. Existing defense approaches either focus on improving DNN robustness by negating the effects of perturbations or use a secondary model to detect adversarial data. Although equally important, the attack detection approach, which is studied in this work, provides a more practical defense compared to the robustness approach. We show that the existing detection methods are either ineffective against the state-of-the-art attack techniques or computationally inefficient for real-time processing. We propose a novel universal and efficient method to detect adversarial examples by analyzing the varying degrees of impact of attacks on different DNN layers. Our method trains a lightweight regression model that predicts deeper-layer features from early-layer features, and uses the prediction error to detect adversarial samples. Through theoretical arguments and extensive experiments, we demonstrate that our detection method is highly effective, computationally efficient for real-time processing, compatible with any DNN architecture, and applicable across different domains, such as image, video, and audio.
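
The detection idea in this abstract (fit a regressor from early-layer to deeper-layer features on clean data, then flag inputs whose deeper features deviate from the prediction) can be illustrated in miniature. The sketch below is a deliberate simplification with scalar features and ordinary least squares; the real method operates on high-dimensional layer activations, and the function names and the threshold are hypothetical.

```python
from typing import Sequence, Tuple


def fit_linear(
    early: Sequence[float], deep: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of deep ≈ slope * early + intercept on clean data."""
    n = len(early)
    mean_early = sum(early) / n
    mean_deep = sum(deep) / n
    variance = sum((x - mean_early) ** 2 for x in early)
    covariance = sum(
        (x - mean_early) * (y - mean_deep) for x, y in zip(early, deep)
    )
    slope = covariance / variance
    return slope, mean_deep - slope * mean_early


def prediction_error(
    model: Tuple[float, float], early: float, deep: float
) -> float:
    """Gap between the observed deep feature and the one predicted from early."""
    slope, intercept = model
    return abs(deep - (slope * early + intercept))


def is_adversarial(
    model: Tuple[float, float], early: float, deep: float, threshold: float
) -> bool:
    """Flag a sample whose deep feature deviates too far from the prediction."""
    return prediction_error(model, early, deep) > threshold
```

The premise is that a perturbation barely shifts early-layer features but shifts deeper ones substantially, so the clean-data regressor mispredicts on attacked inputs; in practice the threshold would be calibrated on held-out clean data.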

[111] RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment cs.LG | cs.CV

Suorong Yang, Peijia Li, Furao Shen, Jian Zhao

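TL;DR: The paper proposes RL-Selector, a reinforcement-learning-based data selection method that dynamically assesses data redundancy to improve training efficiency and significantly improves model generalization.

Details

Motivation: Modern deep learning relies on large-scale datasets, but redundant data increases computational and storage overhead. Existing data selection methods mostly rely on static scoring or pretrained models, overlooking the dynamics of data selection and the combined effect of the selected samples.

Result: Experiments on multiple benchmark datasets and architectures show that RL-Selector outperforms existing methods in both training efficiency and model performance.

Insight: Dynamically assessing data redundancy within a reinforcement learning framework can improve deep learning training more efficiently.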

Abstract: Modern deep architectures often rely on large-scale datasets, but training on these datasets incurs high computational and storage overhead. Real-world datasets often contain substantial redundancies, prompting the need for more data-efficient training paradigms. Data selection has shown promise to mitigate redundancy by identifying the most representative samples, thereby reducing training costs without compromising performance. Existing methods typically rely on static scoring metrics or pretrained models, overlooking the combined effect of selected samples and their evolving dynamics during training. We introduce the concept of epsilon-sample cover, which quantifies sample redundancy based on inter-sample relationships, capturing the intrinsic structure of the dataset. Based on this, we reformulate data selection as a reinforcement learning (RL) process and propose RL-Selector, where a lightweight RL agent optimizes the selection policy by leveraging epsilon-sample cover derived from evolving dataset distribution as a reward signal. Extensive experiments across benchmark datasets and diverse architectures demonstrate that our method consistently outperforms existing state-of-the-art baselines. Models trained with our selected datasets show enhanced generalization performance with improved training efficiency.
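
The abstract does not give the formal definition of epsilon-sample cover, so the following is only a rough illustration of the general idea of covering a dataset at radius epsilon and treating covered samples as redundant. The greedy rule, the Euclidean distance, and the function name `greedy_epsilon_cover` are assumptions, and the RL agent that the paper builds on top of this signal is not modeled here.

```python
from math import dist


def greedy_epsilon_cover(points, eps):
    """Return indices of a subset such that every point lies within eps of some kept point.

    Greedy rule (an assumption for illustration): keep a point only if no
    previously kept point is already within eps of it.
    """
    kept = []
    for i, p in enumerate(points):
        if all(dist(p, points[j]) > eps for j in kept):
            kept.append(i)
    return kept


# Hypothetical 2-D feature vectors: two tight pairs plus one isolated sample.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.05), (10.0, 0.0)]
kept = greedy_epsilon_cover(features, eps=0.5)

# Fraction of samples that are covered by a kept neighbor, i.e. redundant at this eps.
redundancy = 1 - len(kept) / len(features)
```

Here one sample from each tight pair is dropped, so three of the five samples remain. A larger `eps` treats more samples as redundant and yields a smaller selected subset.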

[112] Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion cs.LG | cs.CV

Yuguang Zhang, Kuangpu Guo, Zhihe Lu, Yunbo Wang, Jian Liang

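TL;DR: The paper proposes pFedDC, a personalized federated learning framework based on dual-prompt learning and cross fusion, which addresses heterogeneity in data, computation, and communication through cross-modal optimization of global and local prompts.

Details

Motivation: Federated learning is challenged by heterogeneity in data, computation, and communication. Pretrained vision-language models (VLMs) offer a potential solution through lightweight prompt tuning, but existing methods rely only on text prompts and overlook joint label-domain distribution shifts.

Result: Experiments on nine heterogeneous datasets show that pFedDC consistently outperforms existing methods.

Insight: Cross-modal prompts and adaptive fusion are an effective route to personalized representations in federated learning, handling data heterogeneity and label-domain distribution shift at the same time.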

Abstract: Federated learning (FL) enables collaborative model training across decentralized clients without sharing local data, but is challenged by heterogeneity in data, computation, and communication. Pretrained vision-language models (VLMs), with their strong generalization and lightweight tuning via prompts, offer a promising solution. However, existing federated prompt-learning methods rely only on text prompts and overlook joint label-domain distribution shifts. In this paper, we propose a personalized FL framework based on dual-prompt learning and cross fusion, termed pFedDC. Specifically, each client maintains both global and local prompts across vision and language modalities: global prompts capture common knowledge shared across the federation, while local prompts encode client-specific semantics and domain characteristics. Meanwhile, a cross-fusion module is designed to adaptively integrate prompts from different levels, enabling the model to generate personalized representations aligned with each client’s unique data distribution. Extensive experiments across nine datasets with various types of heterogeneity show that pFedDC consistently outperforms state-of-the-art methods.

cs.RO

[113] Model-Based Real-Time Pose and Sag Estimation of Overhead Power Lines Using LiDAR for Drone Inspection cs.RO | cs.CV

Alexandre Girard, Steven A. Parkison, Philippe Hamelin

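TL;DR: The paper proposes a real-time estimation method based on a geometric model that uses LiDAR data to accurately track the pose and sag of power lines, addressing sparse conductor points and noise in drone inspection.

Details

Motivation: When a drone inspects energized power lines, the LiDAR sensor struggles to localize the conductors accurately because conductors present little surface area, are detected inconsistently, and are hard to separate from environmental clutter such as trees and pylons.

Result: Experiments show that the method tracks accurately even under partial observations, noise, and outliers, and tolerates up to twice as many outlier points as valid conductor points.

Insight: A single-model approach outperforms tracking conductors one by one; it handles sparse, noisy data more robustly and suits real-time drone inspection.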

Abstract: Drones can inspect overhead power lines while they remain energized, significantly simplifying the inspection process. However, localizing a drone relative to all conductors using an onboard LiDAR sensor presents several challenges: (1) conductors provide minimal surface for LiDAR beams, limiting the number of conductor points in a scan, (2) not all conductors are consistently detected, and (3) distinguishing LiDAR points corresponding to conductors from other objects, such as trees and pylons, is difficult. This paper proposes an estimation approach that minimizes the error between LiDAR measurements and a single geometric model representing the entire conductor array, rather than tracking individual conductors separately. Experimental results, using data from a power line drone inspection, demonstrate that this method achieves accurate tracking, with a solver converging under 50 ms per frame, even in the presence of partial observations, noise, and outliers. A sensitivity analysis shows that the estimation approach can tolerate up to twice as many outlier points as valid conductor measurements.
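
To illustrate why fitting one shared model to the whole conductor array can work when individual conductors have too few points, here is a deliberately simplified sketch. It is not the paper's model: the parabolic sag approximation `z = c + h_k + a * x**2`, the known per-conductor vertical offsets `h_k`, the plain least-squares fit without outlier handling, and the name `fit_shared_sag` are all assumptions for illustration.

```python
from statistics import mean


def fit_shared_sag(points, offsets):
    """Fit shared parameters (a, c) of z = c + offsets[k] + a * x**2 to points from all conductors.

    points  : list of (x, z, k) with k the conductor index (hypothetical input format)
    offsets : known vertical offset of each conductor within the array (assumed known)
    """
    # Removing each conductor's known offset lets all points constrain the same two unknowns.
    us = [x * x for x, _, _ in points]
    vs = [z - offsets[k] for _, z, k in points]
    mu, mv = mean(us), mean(vs)
    suu = sum((u - mu) ** 2 for u in us)
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    a = suv / suu
    return a, mv - a * mu


# Hypothetical sparse scan: at most two points per conductor, so no conductor
# could be fitted on its own, but the shared model is still well determined.
offsets = [0.0, 2.0, 4.0]
scan = [(-10.0, 21.0, 0), (0.0, 20.0, 0), (5.0, 22.25, 1), (10.0, 25.0, 2), (-5.0, 24.25, 2)]
a, c = fit_shared_sag(scan, offsets)
```

The scan above was generated from `a = 0.01` and `c = 20`, and the joint fit recovers both values. Handling outliers such as trees and pylons, which the abstract emphasizes, would require a robust estimator on top of this.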

Shruti Bansal, Wenshan Wang, Yifei Liu, Parv Maheshwari

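TL;DR: The paper proposes using conditional diffusion models to convert RGB images into thermal images, addressing the shortage of thermal imaging data and supporting rapid adoption of thermal cameras in autonomous systems.

Details

Motivation: Thermal cameras have advantages at night and in degraded environments, but the scarcity of thermal imaging data limits their widespread use in autonomous systems. The paper therefore proposes supplementing existing datasets with synthetic thermal images.

Result: The method can effectively generate synthetic thermal images, filling the gap in thermal data in existing datasets.

Insight: Synthetic data offers a feasible way to address the shortage of thermal camera data and to improve the performance of autonomous systems in degraded environments.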

Abstract: Autonomous systems rely on sensors to estimate the environment around them. However, cameras, LiDARs, and RADARs have their own limitations. In nighttime or degraded environments such as fog, mist, or dust, thermal cameras can provide valuable information regarding the presence of objects of interest due to their heat signature. They make it easy to identify humans and vehicles that are usually at higher temperatures compared to their surroundings. In this paper, we focus on the adaptation of thermal cameras for robotics and automation, where the biggest hurdle is the lack of data. Several multi-modal datasets are available for driving robotics research in tasks such as scene segmentation, object detection, and depth estimation, which are the cornerstone of autonomous systems. However, they are found to be lacking in thermal imagery. Our paper proposes a solution to augment these datasets with synthetic thermal data to enable widespread and rapid adaptation of thermal cameras. We explore the use of conditional diffusion models to convert existing RGB images to thermal images using self-attention to learn the thermal properties of real-world objects.

[115] V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling cs.RO | cs.AI | cs.CV

Junwei You, Pei Li, Zhuoyu Jiang, Zilin Huang, Rui Gan

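TL;DR: V2X-REALM is a vision-language model (VLM)-based framework for improving the robustness of cooperative autonomous driving in long-tail scenarios; through adaptive multimodal learning and new module designs, it significantly improves semantic reasoning and safety in complex environments.

Details

Motivation: In cooperative autonomous driving, achieving robust planning and decision-making in rare, diverse, and visually degraded long-tail scenarios is a key challenge. Current methods underperform in these scenarios, so more effective solutions are needed.

Result: Experiments show that V2X-REALM significantly outperforms existing baselines in robustness, semantic reasoning, safety, and planning accuracy.

Insight: Using foundation models to generate diverse training data, combined with an adaptive attention mechanism, is an effective way to improve autonomous driving performance in long-tail scenarios.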

Abstract: Ensuring robust planning and decision-making under rare, diverse, and visually degraded long-tail scenarios remains a fundamental challenge for autonomous driving in urban environments. This issue becomes more critical in cooperative settings, where vehicles and infrastructure jointly perceive and reason across complex environments. To address this challenge, we propose V2X-REALM, a vision-language model (VLM)-based framework with adaptive multimodal learning for robust cooperative autonomous driving under long-tail scenarios. V2X-REALM introduces three core innovations: (i) a prompt-driven long-tail scenario generation and evaluation pipeline that leverages foundation models to synthesize realistic long-tail conditions such as snow and fog across vehicle- and infrastructure-side views, enriching training diversity efficiently; (ii) a gated multi-scenario adaptive attention module that modulates the visual stream using scenario priors to recalibrate ambiguous or corrupted features; and (iii) a multi-task scenario-aware contrastive learning objective that improves multimodal alignment and promotes cross-scenario feature separability. Extensive experiments demonstrate that V2X-REALM significantly outperforms existing baselines in robustness, semantic reasoning, safety, and planning accuracy under complex, challenging driving conditions, advancing the scalability of end-to-end cooperative autonomous driving.

Table of Contents

cs.CV

[1] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation cs.CV

[2] ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations cs.CV | cs.RO

[3] How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? cs.CV | cs.HC | cs.RO | I.2.10; I.2.9; I.5.4; I.4.8; I.4.9; H.1.2

[4] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models cs.CV | cs.AI

[5] FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization cs.CV | cs.AI

[6] Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision cs.CV

[7] Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation cs.CV

[8] THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion cs.CV | cs.AI | I.4.8; I.2.10

[9] The Role of Cyclopean-Eye in Stereo Vision cs.CV

[10] FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing cs.CV

[11] M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization cs.CV

[12] PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling cs.CV

[13] AIR-VIEW: The Aviation Image Repository for Visibility Estimation of Weather, A Dataset and Benchmark cs.CV

[14] Hierarchical Sub-action Tree for Continuous Sign Language Recognition cs.CV | cs.MM

[15] OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs cs.CV | cs.AI

[16] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology cs.CV | cs.AI

[17] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing cs.CV | cs.AI

[18] From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging cs.CV | cs.AI

[19] 3D Scene-Camera Representation with Joint Camera Photometric Optimization cs.CV

[20] TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation cs.CV

[21] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance cs.CV | cs.LG | cs.SD | eess.AS

[22] DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting cs.CV

[23] Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology cs.CV

[24] VisionGuard: Synergistic Framework for Helmet Violation Detection cs.CV

[25] Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning cs.CV

[26] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion cs.CV

[27] User-in-the-Loop View Sampling with Error Peaking Visualization cs.CV

[28] Bridging Video Quality Scoring and Justification via Large Multimodal Models cs.CV

[29] FedSC: Federated Learning with Semantic-Aware Collaboration cs.CV

[30] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation cs.CV | cs.LG | quant-ph

[31] Multimodal Prompt Alignment for Facial Expression Recognition cs.CV | cs.AI

[32] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection cs.CV

[33] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation cs.CV

[34] DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation cs.CV

[35] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features cs.CV | cs.CR

[36] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context cs.CV | cs.CL

[37] SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification cs.CV

[38] PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image cs.CV

[39] EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception cs.CV | cs.AI | cs.LG

[40] ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching cs.CV

[41] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography cs.CV

[42] Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection cs.CV | cs.LG

[43] IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes cs.CV | cs.AI

[44] HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation cs.CV | cs.AI | cs.CL | cs.LG

[45] GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction cs.CV | cs.RO

[46] YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection cs.CV

[47] Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation cs.CV

[48] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image cs.CV | 68 | I.4.0

[49] Task-Aware KV Compression For Cost-Effective Long Video Understanding cs.CV | cs.AI

[50] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding cs.CV

[51] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation cs.CV | cs.RO | eess.IV

[52] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models cs.CV | cs.AI

[53] Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping cs.CV | cs.RO

[54] DiMPLe – Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation cs.CV

[55] Temporal Rate Reduction Clustering for Human Motion Segmentation cs.CV

[56] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter cs.CV

[57] WordCon: Word-level Typography Control in Scene Text Rendering cs.CV

[58] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation cs.CV

[59] DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images cs.CV

[60] LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning cs.CV

[61] Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models cs.CV | cs.AI

[62] Generalizable Neural Electromagnetic Inverse Scattering cs.CV | eess.IV

[63] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models cs.CV

[64] CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations cs.CV

[65] CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection cs.CV | cs.AI

[66] FastRef: Fast Prototype Refinement for Few-Shot Industrial Anomaly Detection cs.CV

[67] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation cs.CV

[68] HyperSORT: Self-Organising Robust Training with hyper-networks cs.CV

[69] Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation cs.CV

[70] A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario cs.CV | cs.LG

[71] Evaluation of Traffic Signals for Daily Traffic Pattern cs.CV | cs.LG

[72] Global and Local Entailment Learning for Natural World Imagery cs.CV

[73] Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection cs.CV | cs.LG | math.PR

[74] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration cs.CV

[75] GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation cs.CV

[76] G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation cs.CV

[77] MADrive: Memory-Augmented Driving Scene Modeling cs.CV

[78] Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval cs.CV | cs.IR | cs.LG