cs.CV [Total: 57]
cs.CL [Total: 25]
cs.AI [Total: 7]
cs.CR [Total: 3]
eess.IV [Total: 4]
cs.RO [Total: 3]
physics.optics [Total: 1]
q-bio.NC [Total: 1]
cs.GR [Total: 2]
cs.LG [Total: 1]
eess.SP [Total: 1]
quant-ph [Total: 1]
cs.HC [Total: 1]

cs.CV

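[1] Towards Training-Free Underwater 3D Object Detection from Sonar Point Clouds: A Comparison of Traditional and Deep Learning Approaches cs.CV | cs.AI | cs.LG | cs.RO

M. Salman Shaukat, Yannik Käckenmeister, Sebastian Bader, Thomas Kirste

TL;DR: The paper studies underwater 3D object detection without real training data, comparing classical template matching against deep networks trained on synthetic data, and finds that the classical method performs better on real data.

Details

Motivation: Underwater 3D object detection suffers from scarce training data and a harsh acoustic environment, and conventional deep learning depends on large, costly annotated datasets. The paper investigates detection methods that need no real training data.

Result: Neural networks reach 98% mAP on synthetic data but only 40% on real data. Template matching maintains 83% mAP on real data with no training, showing greater robustness.

Insight: In underwater scenes, classical geometric methods can be more robust than deep learning trained on synthetic data, which challenges the conventional view of deep learning in data-scarce domains.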

Abstract: Underwater 3D object detection remains one of the most challenging frontiers in computer vision, where traditional approaches struggle with the harsh acoustic environment and scarcity of training data. While deep learning has revolutionized terrestrial 3D detection, its application underwater faces a critical bottleneck: obtaining sufficient annotated sonar data is prohibitively expensive and logistically complex, often requiring specialized vessels, expert surveyors, and favorable weather conditions. This work addresses a fundamental question: Can we achieve reliable underwater 3D object detection without real-world training data? We tackle this challenge by developing and comparing two paradigms for training-free detection of artificial structures in multibeam echo-sounder point clouds. Our dual approach combines a physics-based sonar simulation pipeline that generates synthetic training data for state-of-the-art neural networks, with a robust model-based template matching system that leverages geometric priors of target objects. Evaluation on real bathymetry surveys from the Baltic Sea reveals surprising insights: while neural networks trained on synthetic data achieve 98% mean Average Precision (mAP) on simulated scenes, they drop to 40% mAP on real sonar data due to domain shift. Conversely, our template matching approach maintains 83% mAP on real data without requiring any training, demonstrating remarkable robustness to acoustic noise and environmental variations. Our findings challenge conventional wisdom about data-hungry deep learning in underwater domains and establish the first large-scale benchmark for training-free underwater 3D detection. This work opens new possibilities for autonomous underwater vehicle navigation, marine archaeology, and offshore infrastructure monitoring in data-scarce environments where traditional machine learning approaches fail.
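
The mAP figures quoted above average a per-class Average Precision (AP). The sketch below shows one common, non-interpolated way to compute AP from ranked detections. The abstract does not say which AP variant or matching rule the paper uses, so the function name, the input format, and the example scores are assumptions made for illustration.

```python
def average_precision(scored_hits, num_ground_truth):
    """Non-interpolated AP for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per detection.
    num_ground_truth: number of annotated objects of this class.
    """
    # Rank detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            true_positives += 1
            # Precision measured at the rank of each true positive.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class_ap):
    """mAP is the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Hypothetical detections for a single class with two annotated objects.
detections = [(0.9, True), (0.8, False), (0.7, True)]
ap = average_precision(detections, num_ground_truth=2)
```

A detection counts as a true positive only after it has been matched to a ground-truth object (for 3D boxes, typically by an overlap threshold); that matching step is outside this sketch.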

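[2] MobileDenseAttn: A Dual-Stream Architecture for Accurate and Interpretable Brain Tumor Detection cs.CV | cs.AI

Shudipta Banik, Muna Das, Trapa Banik, Md. Ehsanul Haque

TL;DR: The paper proposes MobileDenseAttn, a dual-stream architecture combining MobileNetV2 and DenseNet201 for accurate and interpretable brain tumor detection, improving feature representation, computational efficiency, and visual explanation.

Details

Motivation: Existing brain tumor detection methods generalize poorly, are computationally inefficient, and lack interpretability, which undermines clinical trust. An efficient, accurate, and transparent model is needed.

Result: Training accuracy is 99.75%, test accuracy is 98.35%, and the F1 score is 0.9835. Relative to the VGG19 baseline, accuracy improves by 3.67% and training time falls by 39.3%. GradCAM heatmaps clearly localize tumor regions.

Insight: The dual-stream architecture combines the strengths of lightweight and high-accuracy feature extraction, and GradCAM improves interpretability, making the model better suited to clinical practice.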

Abstract: The detection of brain tumor in MRI is an important aspect of ensuring timely diagnostics and treatment; however, manual analysis is commonly long and error-prone. Current approaches are not universal because they have limited generalization to heterogeneous tumors, are computationally inefficient, are not interpretable, and lack transparency, thus limiting trustworthiness. To overcome these issues, we introduce MobileDenseAttn, a fusion model of dual streams of MobileNetV2 and DenseNet201 that can help gradually improve the feature representation scale, computing efficiency, and visual explanations via GradCAM. Our model uses feature level fusion and is trained on an augmented dataset of 6,020 MRI scans representing glioma, meningioma, pituitary tumors, and normal samples. Measured under strict 5-fold cross-validation protocols, MobileDenseAttn provides a training accuracy of 99.75%, a testing accuracy of 98.35%, and a stable F1 score of 0.9835 (95% CI: 0.9743 to 0.9920). The extensive validation shows the stability of the model, and the comparative analysis proves that it is a great advancement over the baseline models (VGG19, DenseNet201, MobileNetV2) with a +3.67% accuracy increase and a 39.3% decrease in training time compared to VGG19. The GradCAM heatmaps clearly show tumor-affected areas, offering clinically significant localization and improving interpretability. These findings position MobileDenseAttn as an efficient, high performance, interpretable model with a high probability of becoming a clinically practical tool in identifying brain tumors in the real world.

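[3] Can VLMs Recall Factual Associations From Visual References? cs.CV | cs.AI | cs.CL

Dhananjay Ashok, Ashutosh Chaubey, Hirona J. Arai, Jonathan May, Jesse Thomason

TL;DR: This empirical study reveals a systematic problem in the multimodal grounding of vision-language models (VLMs): they recall factual associations far less reliably from visual references than from textual ones, and patterns in their internal states predict when their answers are unreliable.

Details

Motivation: The study examines the performance gap between visual and textual references in VLMs and the mechanism behind this grounding deficiency.

Result: Recall of factual knowledge is halved when the reference is visual. Probes on internal states flag unreliable responses with over 92% accuracy. In selective prediction, the probes raise coverage by 7.87% and reduce the risk of error by 0.9% (both absolute).

Insight: VLMs struggle to link their internal knowledge of an entity to its visual representation, but internal-state patterns can be used to detect unreliable answers, which gives future multimodal grounding research an interpretability-based direction.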

Abstract: Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing this systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
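
The coverage and risk figures above come from selective prediction, where the model answers only when a reliability score clears a threshold. The sketch below computes both quantities. The function name, the threshold, and the example scores are hypothetical, and the probe that would produce the scores is not modeled here.

```python
def selective_metrics(reliability_scores, is_correct, threshold):
    """Return (coverage, risk) when answering only if score >= threshold.

    coverage: fraction of questions the model chooses to answer.
    risk: error rate among the answered questions.
    """
    answered = [ok for score, ok in zip(reliability_scores, is_correct)
                if score >= threshold]
    coverage = len(answered) / len(reliability_scores)
    if not answered:
        return coverage, 0.0  # nothing answered, so no errors were made
    risk = sum(1 for ok in answered if not ok) / len(answered)
    return coverage, risk


# Hypothetical probe outputs for five questions.
scores = [0.95, 0.80, 0.40, 0.65, 0.30]
correct = [True, True, False, False, True]
coverage, risk = selective_metrics(scores, correct, threshold=0.6)
```

Raising the threshold usually lowers both coverage and risk, so a better reliability signal is one that raises coverage at the same or lower risk, which is the kind of improvement the abstract reports.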

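[4] SERES: Semantic-aware neural reconstruction from sparse views cs.CV

Bo Xu, Yuhu Guo, Yuchao Wang, Wenting Wang, Yeung Yam

TL;DR: The paper proposes SERES, a semantic-aware neural reconstruction method that produces high-fidelity 3D models from sparse images. By adding patch-based semantic logits and a regularization based on geometric primitive masks, it addresses the radiance ambiguity caused by sparse input and markedly improves reconstruction accuracy.

Details

Motivation: Feature mismatch and radiance ambiguity caused by sparse input severely degrade 3D reconstruction accuracy, and conventional methods struggle to recover high-quality 3D models from sparse views.

Result: On the DTU dataset, the average chamfer distance for sparse reconstruction is reduced by 44% relative to SparseNeuS and by 20% relative to VolRecon. As a plugin for dense reconstruction baselines such as NeuS and Neuralangelo, the method reduces average error by 69% and 68%, respectively.

Insight: Combining semantic information with geometric regularization improves 3D reconstruction from sparse views, which is especially useful in practical settings where input is limited.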

Abstract: We propose a semantic-aware neural reconstruction method to generate 3D high-fidelity models from sparse images. To tackle the challenge of severe radiance ambiguity caused by mismatched features in sparse input, we enrich neural implicit representations by adding patch-based semantic logits that are optimized together with the signed distance field and the radiance field. A novel regularization based on the geometric primitive masks is introduced to mitigate shape ambiguity. The performance of our approach has been verified in experimental evaluation. The average chamfer distances of our reconstruction on the DTU dataset can be reduced by 44% for SparseNeuS and 20% for VolRecon. When working as a plugin for those dense reconstruction baselines such as NeuS and Neuralangelo, the average error on the DTU dataset can be reduced by 69% and 68% respectively.
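
The chamfer distance reported above measures how far two point sets are from each other. The sketch below uses one common symmetric definition: the mean nearest-neighbor distance in each direction, averaged. The DTU evaluation protocol has its own sampling and masking details that are not reproduced here, and the function name and example points are assumptions.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two 3D point sets (brute force)."""
    def mean_nearest(src, dst):
        # For each source point, the distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    # Average the two directed distances (reconstruction to reference and back).
    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


# Hypothetical reconstructed and reference points.
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(reconstructed, reference)
```

The brute-force search is quadratic in the number of points; practical evaluations on dense point clouds normally use a spatial index such as a k-d tree.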

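[5] Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning cs.CV | cs.AI | 68T10 | I.2.4

Jiangfeng Sun, Sihao He, Zhonghong Ou, Meina Song

TL;DR: The paper proposes SSU, a new multimodal sentiment analysis framework that uses graph contrastive learning to fuse modality-specific structural dependencies with semantic alignment, improving both performance and interpretability.

Details

Motivation: Existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, which limits performance and interpretability. The paper addresses both by unifying structure and semantics.

Result: SSU achieves state-of-the-art performance on two benchmark datasets (CMU-MOSI and CMU-MOSEI) while significantly reducing computational overhead.

Insight: Unifying structure and semantics is central to both performance and interpretability in multimodal fusion, and the semantic anchor and contrastive learning help align modalities.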

Abstract: Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU’s interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.

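[6] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses cs.CV

Hao Liang, Zhixuan Ge, Ashish Tiwari, Soumendu Majee, G. M. Dilshan Godaliyadda

TL;DR: FastAvatar is a fast, feed-forward framework that generates a 3D Gaussian Splatting model from a single face image in an arbitrary pose almost instantly, offering both high-quality reconstruction and fast inference.

Details

Motivation: Existing 3D Gaussian Splatting (3DGS) methods for faces need multi-view data or time-consuming optimization, which makes real-time interaction impractical. FastAvatar targets fast, high-quality single-view reconstruction.

Result: FastAvatar significantly outperforms existing feed-forward methods (such as GAGAvatar) in reconstruction quality, runs 1000x faster than per-face optimization methods, and supports real-time editing.

Insight: By predicting residuals on top of a template model, FastAvatar balances efficiency and quality, which makes 3DGS practical for interactive applications.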

Abstract: We present FastAvatar, a pose-invariant, feed-forward framework that can generate a 3D Gaussian Splatting (3DGS) model from a single face image from an arbitrary pose in near-instant time (<10ms). FastAvatar uses a novel encoder-decoder neural network design to achieve both fast fitting and identity preservation regardless of input pose. First, FastAvatar constructs a 3DGS face "template" model from a training dataset of faces with multi-view captures. Second, FastAvatar encodes the input face image into an identity-specific and pose-invariant latent embedding, and decodes this embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template 3DGS model. By only inferring residuals in a feed-forward fashion, model inference is fast and robust. FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, and runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars and GASP). In addition, FastAvatar’s novel latent space design supports real-time identity interpolation and attribute editing, which is not possible with any existing feed-forward 3DGS face generation framework. FastAvatar’s combination of excellent reconstruction quality and speed expands the scope of 3DGS for photorealistic avatar applications in consumer and interactive systems.

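[7] Securing Face and Fingerprint Templates in Humanitarian Biometric Systems cs.CV | cs.CR

Giuseppe Stragapede, Sam Merrick, Vedrana Krivokuća Hahn, Justin Sukaitis, Vincent Graf Narbel

TL;DR: The paper presents a biometric template protection approach suited to humanitarian settings, built on PolyProtect, and validates it for face and fingerprint recognition on real-world data.

Details

Motivation: In humanitarian and emergency settings, biometrics improve operational efficiency but also create data security risks, especially for vulnerable people. A lightweight and secure way to protect biometric templates is therefore needed.

Result: Experiments show that PolyProtect performs well for both face and fingerprint recognition, supporting its effectiveness and low computational cost.

Insight: Because PolyProtect is modality-independent, it has potential for protecting several kinds of biometric data and is well suited to resource-constrained humanitarian settings.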

Abstract: In humanitarian and emergency scenarios, the use of biometrics can dramatically improve the efficiency of operations, but it poses risks for the data subjects, which are exacerbated in contexts of vulnerability. To address this, we present a mobile biometric system implementing a biometric template protection (BTP) scheme suitable for these scenarios. After rigorously formulating the functional, operational, and security and privacy requirements of these contexts, we perform a broad comparative analysis of the BTP landscape. PolyProtect, a method designed to operate on neural network face embeddings, is identified as the most suitable method due to its effectiveness, modularity, and lightweight computational burden. We evaluate PolyProtect in terms of verification and identification accuracy, irreversibility, and unlinkability, when this BTP method is applied to face embeddings extracted using EdgeFace, a novel state-of-the-art efficient feature extractor, on a real-world face dataset from a humanitarian field project in Ethiopia. Moreover, as PolyProtect promises to be modality-independent, we extend its evaluation to fingerprints. To the best of our knowledge, this is the first time that PolyProtect has been evaluated for the identification scenario and for fingerprint biometrics. Our experimental results are promising, and we plan to release our code

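[8] Why Relational Graphs Will Save the Next Generation of Vision Foundation Models? cs.CV

Fatemeh Ziaeetabar

TL;DR: This position paper argues that next-generation vision foundation models (FMs) should gain explicit relational reasoning through dynamic relational graphs, improving performance, robustness, and efficiency on fine-grained tasks.

Details

Motivation: Current vision foundation models fall short on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations, a capability that is essential for fine-grained tasks such as human activity recognition and multimodal medical image analysis.

Result: Evidence from recent systems shows that the approach outperforms FM-only baselines in fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency, while also being memory- and hardware-efficient.

Insight: Combining relational graphs with FMs offers a new direction for next-generation vision models, with particular promise in dynamic graph construction, multi-level relational reasoning, and multimodal fusion.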

Abstract: Vision foundation models (FMs) have become the predominant architecture in computer vision, providing highly transferable representations learned from large-scale, multimodal corpora. Nonetheless, they exhibit persistent limitations on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations. Such relational competence is indispensable for fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. We advance the position that next-generation FMs should incorporate explicit relational interfaces, instantiated as dynamic relational graphs (graphs whose topology and edge semantics are inferred from the input and task context). We illustrate this position with cross-domain evidence from recent systems in human manipulation action recognition and brain tumor segmentation, showing that augmenting FMs with lightweight, context-adaptive graph-reasoning modules improves fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency relative to FM-only baselines. Importantly, by reasoning sparsely over semantic nodes, such hybrids also achieve favorable memory and hardware efficiency, enabling deployment under practical resource constraints. We conclude with a targeted research agenda for FM-graph hybrids, prioritizing learned dynamic graph construction, multi-level relational reasoning (e.g., part-object-scene in activity understanding, or region-organ in medical imaging), cross-modal fusion, and evaluation protocols that directly probe relational competence in structured vision tasks.

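[9] LPLC: A Dataset for License Plate Legibility Classification cs.CV

Lucas Wojcik, Gabriel E. Lima, Valfride Nascimento, Eduil Nascimento Jr., Rayson Laroca

TL;DR: The paper introduces the LPLC dataset for license plate legibility classification, intended to improve how ALPR systems handle low-quality plates, and shows through experiments that the task is challenging.

Details

Motivation: Automatic License Plate Recognition (ALPR) struggles with low-quality plates, and existing remedies such as super-resolution have not solved the problem. A selective pre-processing mechanism is needed to improve model efficiency and performance.

Result: All baseline models score below 80% F1, indicating that the task is difficult and needs further research.

Insight: Legibility classification is an important stage in ALPR, existing methods still need improvement, and the LPLC dataset opens a new direction for research.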

Abstract: Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.

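[10] CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering cs.CV | cs.AI

Aranya Saha, Tanvir Ahmed Khan, Ismam Nur Swapnil, Mohammad Ariful Haque

TL;DR: CLARIFY is a Specialist-Generalist framework for dermatological visual question answering. It pairs a lightweight image classifier with a compressed conversational vision-language model to improve diagnostic accuracy and computational efficiency.

Details

Motivation: General-purpose vision-language models show great potential for medical tasks, but their generality can limit specialized diagnostic accuracy, and their size makes inference expensive. CLARIFY addresses both issues for dermatological VQA.

Result: On a dermatology dataset, CLARIFY improves diagnostic accuracy by 18% over the strongest baseline while reducing VRAM requirements by at least 20% and latency by at least 5%.

Insight: The hierarchical Specialist-Generalist design balances specialized diagnostic accuracy against computational cost, offering a lightweight and trustworthy approach for medical AI systems.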

Abstract: Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist’s predictions directly guide the Generalist’s reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist’s responses in factual dermatological knowledge, ensuring both accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.

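[11] VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results cs.CV

Sizhuo Ma, Wei-Ting Chen, Qiang Gao, Jian Wang, Chris Wei Zhou

TL;DR: The VQualA 2025 challenge focuses on face image quality assessment (FIQA). Participants built lightweight, efficient models (capped at 0.5 GFLOPs and 5 million parameters) that predict Mean Opinion Scores (MOS) for face images under realistic degradations, with methods and results presented at the ICCV 2025 Workshops.

Details

Motivation: Face images are central to many applications, but in real-world conditions their quality is often degraded by noise, blur, and compression artifacts, which hurts downstream tasks. The challenge aims to advance practical FIQA methods.

Result: The challenge reports the performance of a range of efficient FIQA methods, providing a technical reference for practical use.

Insight: Lightweight, efficient models are key to making FIQA practical, and the challenge provides data and a benchmarking framework for future research.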

Abstract: Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
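
The abstract says submissions were evaluated with correlation metrics against Mean Opinion Scores. Quality-assessment benchmarks commonly use the Pearson and Spearman coefficients for this, but the abstract does not name the exact metrics, so treating these two as the challenge's metrics is an assumption. The sketch below computes both with the standard library only, using hypothetical scores.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical model predictions and human MOS labels for four images.
predicted = [1.0, 2.0, 3.0, 4.0]
mos = [10.0, 20.0, 30.0, 80.0]
plcc = pearson(predicted, mos)
srocc = spearman(predicted, mos)
```

In this example the predictions order the images perfectly, so the rank correlation is 1, while the linear correlation is lower because the MOS values are not a linear function of the predictions.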

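[12] Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling cs.CV | cs.LG

Md. Rashid Shahriar Khan, Md. Abrar Hasan, Mohammod Tareq Aziz Justice

TL;DR: The paper proposes a context-aware zero-shot anomaly detection framework that combines temporal modeling with semantic understanding to detect abnormal behavior in complex surveillance scenes without having seen any anomaly examples.

Details

Motivation: Anomalies in surveillance video are context-dependent and unpredictable, and labeled data is scarce. The study addresses this with a zero-shot approach.

Result: Without seeing any anomaly examples, the framework generalizes to new behaviors in complex environments and is reported to outperform traditional methods.

Insight: Combining temporal prediction with semantic context can substantially improve zero-shot anomaly detection and offers a new approach for complex scenes.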

Abstract: Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-ZeroShot-Anomaly-Detection-in-Surveillance.
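
The abstract names InfoNCE as one of the training losses. As a minimal sketch of the general InfoNCE form, not the paper's exact implementation, the loss for one query is the negative log-probability of the positive key under a softmax over similarity scores. The cosine similarity, the temperature value, and the toy vectors below are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, keys, positive_index, temperature=0.1):
    """InfoNCE loss for one query against one positive and several negatives."""
    logits = [cosine(query, key) / temperature for key in keys]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_denominator - logits[positive_index]


# Toy embeddings: the first key matches the query, the second does not.
query = (1.0, 0.0)
keys = [(1.0, 0.0), (0.0, 1.0)]
loss_aligned = info_nce(query, keys, positive_index=0)
loss_misaligned = info_nce(query, keys, positive_index=1)
```

The loss is small when the positive key is the most similar one and grows when a negative is more similar, which is what pushes matched visual, temporal, and text representations together during training.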

Ajinkya Khoche, Qingwen Zhang, Yixi Cai, Sina Sharif Mansouri, Patric Jensfelt

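TL;DR: DoGFlow is a self-supervised LiDAR scene flow estimation method guided by cross-modal Doppler measurements. It requires no manual annotation and substantially outperforms current self-supervised methods.

Details

Motivation: Accurate 3D scene flow estimation is critical for autonomous driving, but manual annotation is expensive, and existing self-supervised methods fall short at long range and in adverse weather.

Result: On the MAN TruckScenes dataset, LiDAR backbones reach over 90% of fully supervised performance with only 10% of the ground-truth data.

Insight: Transferring information across modalities (radar to LiDAR) can relieve the annotation bottleneck in self-supervised learning.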

Abstract: Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit https://ajinkyakhoche.github.io/DogFlow/

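[14] Wan-S2V: Audio-Driven Cinematic Video Generation cs.CV

Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji

TL;DR: The paper proposes Wan-S2V, an audio-driven video generation model aimed at the demands of complex film and television production (nuanced character interaction, realistic body motion, and dynamic camera work), with markedly improved expressiveness and fidelity.

Details

Motivation: Existing audio-driven character animation methods work well for speech and singing but fall short in complex film and television production. The paper targets this long-standing challenge.

Result: Experiments show that Wan-S2V significantly outperforms existing methods at film-level animation generation.

Insight: The model offers a new option for complex film and television production, with notable gains in expressiveness and versatility.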

Abstract: Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.

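[15] Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction cs.CV

Huayi Wang, Haochao Ying, Yuyang Xu, Qibo Qiu, Cheng Zhang

TL;DR: The paper proposes DeReF, a multimodal framework that addresses fixed fusion schemes and insufficient information interaction in MoE fusion for cancer survival prediction. Feature reorganization and a dynamic MoE fusion module increase the diversity of feature combinations and improve information interaction.

Details

Motivation: Existing multimodal cancer survival prediction methods rely on fixed fusion schemes or on MoE fusion with limited information interaction, which restricts dynamic feature fusion and the information the model can capture.

Result: Experiments on an in-house liver cancer dataset and three public TCGA datasets confirm the method's effectiveness.

Insight: Combining dynamic feature reorganization with MoE fusion improves generalization and information interaction on multimodal data, and regional cross-attention further improves feature representation.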

Abstract: Cancer survival analysis commonly integrates information across diverse medical modalities to make survival-time predictions. Existing methods primarily focus on extracting different decoupled features of modalities and performing fusion operations such as concatenation, attention, and MoE-based (Mixture-of-Experts) fusion. However, these methods still face two key challenges: i) Fixed fusion schemes (concatenation and attention) can lead to model over-reliance on predefined feature combinations, limiting the dynamic fusion of decoupled features; ii) in MoE-based fusion methods, each expert network handles separate decoupled features, which limits information interaction among the decoupled features. To address these challenges, we propose a novel Decoupling-Reorganization-Fusion framework (DeReF), which devises a random feature reorganization strategy between the modality decoupling and dynamic MoE fusion modules. Its advantages are: i) it increases the diversity of feature combinations and granularity, enhancing the generalization ability of the subsequent expert networks; ii) it overcomes the problem of information closure and helps expert networks better capture information among decoupled features. Additionally, we incorporate a regional cross-attention network within the modality decoupling module to improve the representation quality of decoupled features. Extensive experimental results on our in-house Liver Cancer (LC) and three widely used TCGA public datasets confirm the effectiveness of our proposed method. The code will be made publicly available.

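[16] ROSE: Remove Objects with Side Effects in Videos cs.CV | cs.AI | cs.LG

Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu

TL;DR: The paper proposes ROSE, a framework for removing objects from video together with their side effects (such as shadows and reflections), trained on synthetic data generated by 3D rendering. ROSE is built on a diffusion transformer, takes the entire video as reference to localize affected regions, and adds extra supervision to improve results. Experiments show strong performance on the ROSE-Bench benchmark.

Details

Motivation: Existing video object removal methods handle side effects such as shadows and reflections poorly, mainly because paired video data is scarce. The paper generates synthetic data through 3D rendering and designs a systematic framework around it.

Result: ROSE outperforms existing methods on ROSE-Bench and generalizes to real-world video.

Insight: Synthetic data from 3D rendering offers a way around data scarcity in the video domain, and explicit supervision of side-effect regions improves inpainting quality.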

Abstract: Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data as supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object’s effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model performance on various side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.

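[17] OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward cs.CV

Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou, Shuang Hao, Haonan Lu

TL;DR: OwlCap addresses the motion-detail imbalance in video captioning with the HMD-270K dataset and the CSER reward, yielding significant performance gains.

Details

Motivation: Existing video captioning methods suffer from motion-detail imbalance, which makes the generated captions incomplete and inconsistent.

Result: OwlCap improves over baseline models by +4.2 Acc on the detail-focused VDC benchmark and by +4.6 F1 on the motion-focused DREAM-1K benchmark.

Insight: Capturing motion and detail in balance is key to better video captions, and improvements to both the dataset and the reward design significantly improve model performance.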

Abstract: Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on the HMD-270K supervised fine-tuning and GRPO post-training with CSER, we developed OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements compared to baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate video captioning research community advancements.
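
The abstract builds its reward on Group Relative Policy Optimization (GRPO). The core idea of GRPO is to sample several outputs for the same input, score each one, and normalize the scores within that group to get advantages. The sketch below shows only this normalization step. The reward values are made up, and the CSER reward itself (unit-to-set matching and bidirectional validation) is not implemented, since the abstract gives no formula for it.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled captions.

    Each advantage is (reward - group mean) / (group std + eps), so captions
    better than the group average get a positive training signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for three captions sampled for the same video.
caption_rewards = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(caption_rewards)
```

Because advantages are relative to the group, no separate value network is needed to estimate a baseline, which is the usual motivation given for GRPO.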

[18] Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos cs.CVPDF

Dongfang Wang, Jian Yang, Yizhe Zhang, Tao Zhou

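TL;DR: This paper proposes a Hierarchical Spatio-temporal Segmentation Network (HSSN) for echocardiography videos that improves ejection fraction (EF) estimation accuracy by combining local detail modeling with global dynamic perception.

Details

Motivation: Existing work segments the left ventricular endocardium in echocardiography videos well but performs poorly on EF estimation. A method that captures both local detail and global dynamics is therefore needed to improve EF estimation accuracy.

Result: HSSN outperforms existing methods on EF estimation and reduces the bias caused by ultrasound image noise and other factors.

Insight: Combining the hierarchical design with the STCS module balances the modeling of local detail and global dynamics, yielding more accurate EF estimates.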

Abstract: Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their results do not perform well in EF estimation. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network (HSSN) for echocardiography video, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors.
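For context, EF is conventionally computed from the end-diastolic volume (EDV) and end-systolic volume (ESV) as EF = (EDV - ESV) / EDV × 100. The sketch below assumes per-frame left-ventricular volumes have already been derived from segmentation masks; the function name and the sample volumes are hypothetical and do not come from the paper.

```python
def ejection_fraction(volumes_ml):
    """Estimate EF (%) from per-frame left-ventricular volumes over one cycle.

    EDV is taken as the largest volume and ESV as the smallest.
    """
    edv = max(volumes_ml)
    esv = min(volumes_ml)
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv


# Hypothetical volumes (ml) for the frames of one cardiac cycle.
clean = [120.0, 110.0, 80.0, 50.0, 70.0, 115.0]
# Same cycle, but one frame is under-segmented because of image noise.
noisy = [120.0, 110.0, 80.0, 40.0, 70.0, 115.0]

ef_clean = ejection_fraction(clean)
ef_noisy = ejection_fraction(noisy)
print(round(ef_clean, 2), round(ef_noisy, 2))
```

Because EF depends only on the two extreme frames, a single badly segmented frame shifts the estimate noticeably, which is one way to read the abstract's point that good average segmentation does not guarantee good EF estimation.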

[19] Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency cs.CVPDF

Zhitong Cheng, Yiran Jiang, Yulong Ge, Yufeng Li, Zhongheng Qin

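TL;DR: This paper proposes FPS, a novel domain adaptation framework that exploits geometric patterns in the feature space of pretrained models to optimize decision boundaries while keeping the feature encoder frozen, achieving efficient and interpretable domain adaptation.

Details

Motivation: Current unsupervised domain adaptation methods usually fine-tune the feature extractor, which is inefficient, hard to interpret, and difficult to scale to modern architectures. The paper finds that pretrained models exhibit domain-invariant geometric patterns in feature space, which suggests a new way to optimize decision boundaries.

Result: On public benchmarks, FPS matches or outperforms state-of-the-art methods and shows versatility across domains including protein structure prediction, remote sensing classification, and earthquake detection.

Insight: The domain-invariant geometric patterns in feature space indicate that domain shift manifests mainly as boundary misalignment rather than feature degradation, which offers a new perspective for simplifying domain adaptation.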

Abstract: Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors - an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Unlike fine-tuning entire pre-trained models - which risks introducing unpredictable feature distortions - we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.

[20] A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition cs.CVPDF

Wasi Ullah, Yasir Noman Khalid, Saddam Hussain Khan

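TL;DR: This paper proposes a novel deep hybrid framework that achieves robust real-time human activity recognition (HAR) through ensemble-based feature optimization, combining a customized InceptionV3, an LSTM architecture, and a genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA).

Details

Motivation: HAR is widely used in smart surveillance, healthcare, and other fields, but existing systems suffer from high computational cost, redundant features, and poor real-time performance, so a lightweight and efficient solution is needed.

Result: The framework reaches 99.65% recognition accuracy on the UCF-YouTube dataset, reduces the feature set to as few as 7 features, and improves inference time.

Insight: The lightweight design and feature optimization make the framework suitable for edge devices, advancing the use of HAR in real-time scenarios.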

Abstract: Human Activity Recognition (HAR) plays a pivotal role in various applications, including smart surveillance, healthcare, assistive technologies, sports analytics, etc. However, HAR systems still face critical challenges, including high computational costs, redundant features, and limited scalability in real-time scenarios. An optimized hybrid deep learning framework is introduced that integrates a customized InceptionV3, an LSTM architecture, and a novel ensemble-based feature selection strategy. The proposed framework first extracts spatial descriptors using the customized InceptionV3 model, which captures multilevel contextual patterns, region homogeneity, and fine-grained localization cues. The temporal dependencies across frames are then modeled using LSTMs to effectively encode motion dynamics. Finally, an ensemble-based genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA) is employed to select a compact and optimized feature set by dynamically balancing objectives such as accuracy, redundancy, uniqueness, and complexity reduction. Consequently, the selected feature subsets, which are both diverse and discriminative, enable various lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. Experimental results on the robust UCF-YouTube dataset, which presents challenges such as occlusion, cluttered backgrounds, motion dynamics, and poor illumination, demonstrate good performance. The proposed approach achieves 99.65% recognition accuracy, reduces features to as few as 7, and enhances inference time. The lightweight and scalable nature of the HAR system supports real-time deployment on edge devices such as Raspberry Pi, enabling practical applications in intelligent, resource-aware environments, including public safety, assistive technology, and autonomous monitoring systems.

[21] ColorGS: High-fidelity Surgical Scene Reconstruction with Colored Gaussian Splatting cs.CVPDF

Qun Ji, Peng Li, Mingqiang Wei

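TL;DR: ColorGS is a novel framework that achieves high-fidelity surgical scene reconstruction with dynamic Colored Gaussian Primitives and an enhanced deformation model, markedly improving color expressiveness and deformation modeling.

Details

Motivation: Existing methods are limited in capturing subtle color variations in endoscopic videos and in modeling global deformations; 3D Gaussian Splatting reconstructs efficiently but falls short on color and deformation modeling.

Result: On DaVinci surgical videos and benchmark datasets, ColorGS reaches a PSNR of 39.85 and an SSIM of 97.25% while maintaining real-time rendering efficiency.

Insight: Balancing high fidelity with computational practicality is key for surgical scene reconstruction and supports intraoperative guidance and AR/VR applications.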

Abstract: High-fidelity reconstruction of deformable tissues from endoscopic videos remains challenging due to the limitations of existing methods in capturing subtle color variations and modeling global deformations. While 3D Gaussian Splatting (3DGS) enables efficient dynamic reconstruction, its fixed per-Gaussian color assignment struggles with intricate textures, and linear deformation modeling fails to model consistent global deformation. To address these issues, we propose ColorGS, a novel framework that integrates spatially adaptive color encoding and enhanced deformation modeling for surgical scene reconstruction. First, we introduce Colored Gaussian Primitives, which employ dynamic anchors with learnable color parameters to adaptively encode spatially varying textures, significantly improving color expressiveness under complex lighting and tissue similarity. Second, we design an Enhanced Deformation Model (EDM) that combines time-aware Gaussian basis functions with learnable time-independent deformations, enabling precise capture of both localized tissue deformations and global motion consistency caused by surgical interactions. Extensive experiments on DaVinci robotic surgery videos and benchmark datasets (EndoNeRF, StereoMIS) demonstrate that ColorGS achieves state-of-the-art performance, attaining a PSNR of 39.85 (1.5 higher than prior 3DGS-based methods) and superior SSIM (97.25%) while maintaining real-time rendering efficiency. Our work advances surgical scene reconstruction by balancing high fidelity with computational practicality, critical for intraoperative guidance and AR/VR applications.

DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang

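TL;DR: The paper proposes router-gated cross-modal feature fusion to improve the robustness of audio-visual speech recognition in noisy environments by dynamically adjusting modality weights as audio quality changes.

Details

Motivation: Existing audio-visual speech recognition systems struggle to assess audio reliability dynamically and adjust modality reliance in noisy environments, which degrades performance. The paper proposes a new framework to address this.

Result: Experiments on the LRS3 dataset show a 16.51-42.67% relative reduction in word error rate compared with AV-HuBERT.

Insight: Router gating and dynamic reweighting are key to making multimodal systems robust to noise; when audio quality drops, reinforcing the visual modality improves performance substantially.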

Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
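The reweighting idea can be illustrated with a much simpler mechanism than the one in the paper. The sketch below uses a plain convex gate per token, not the gated cross-attention the abstract describes; the function name, the assumption that corruption scores lie in [0, 1], and the toy feature vectors are all illustrative.

```python
def gated_fusion(audio_tokens, visual_tokens, corruption_scores):
    """Blend per-token audio and visual features with a convex gate.

    A score of 0 keeps the audio feature unchanged; a score of 1
    replaces it entirely with the visual feature.
    """
    fused = []
    for a_vec, v_vec, c in zip(audio_tokens, visual_tokens, corruption_scores):
        if not 0.0 <= c <= 1.0:
            raise ValueError("corruption score must be in [0, 1]")
        fused.append([(1.0 - c) * a + c * v for a, v in zip(a_vec, v_vec)])
    return fused


# Two tokens with 2-D features: the first is clean, the second fully corrupted.
audio = [[1.0, 0.0], [1.0, 0.0]]
visual = [[0.0, 1.0], [0.0, 1.0]]
scores = [0.0, 1.0]

fused = gated_fusion(audio, visual, scores)
print(fused)
```

The clean token passes its audio feature through unchanged, while the corrupted token falls back to the visual feature, which mirrors the pivot toward the visual modality described above.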

[23] Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods cs.CVPDF

Qinqian Lei, Bo Wang, Robby T. Tan

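TL;DR: The paper proposes a new benchmark for directly comparing general-purpose vision-language models (VLMs) with specialized human-object interaction (HOI) detection methods, addressing the mismatch of existing benchmarks when evaluating generative VLMs.

Details

Motivation: Existing HOI benchmarks such as HICO-DET were not designed with the generative nature of modern VLMs in mind; their strict class-matching evaluation can unfairly penalize both VLMs and HOI methods for plausible but unannotated predictions.

Result: The new benchmark evaluates VLMs and HOI methods more fairly and reveals the current state of progress in HOI understanding.

Insight: Generative VLMs may already be strong at HOI tasks but need more flexible evaluation that reflects their multi-answer nature; specialized HOI methods still need to improve to compete with VLMs.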

Abstract: Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either “throwing” or “catching”. When only “catching” is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when “catching” is annotated, “throwing” is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
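The multiple-answer multiple-choice formulation can be sketched as follows. The abstract does not state the scoring metric, so the per-question precision and recall used here are an assumption, as are the function name and the distractor labels.

```python
def score_question(selected, positives, negatives):
    """Score one multiple-answer question by precision and recall.

    positives: annotated ground-truth interactions offered as options
    negatives: curated distractors; ambiguous labels are simply not offered
    """
    selected, positives, negatives = set(selected), set(positives), set(negatives)
    unknown = selected - (positives | negatives)
    if unknown:
        raise ValueError(f"selections outside the option list: {sorted(unknown)}")
    tp = len(selected & positives)
    fp = len(selected & negatives)
    precision = tp / (tp + fp) if selected else 0.0
    recall = tp / len(positives) if positives else 0.0
    return precision, recall


# Frisbee example: "catching" is annotated, so "throwing" is left out of the
# negatives and a model cannot be penalized for considering it.
positives = {"catching"}
negatives = {"riding", "eating"}

perfect = score_question({"catching"}, positives, negatives)
mixed = score_question({"catching", "riding"}, positives, negatives)
print(perfect, mixed)
```

Keeping ambiguous interactions out of the option list is what removes the penalty for plausible but unannotated answers; the scoring function itself stays simple.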

[24] Beyond the Textual: Generating Coherent Visual Options for MCQs cs.CV | cs.CLPDF

Wanqiang Wang, Longzhu He, Wei Zheng

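TL;DR: The paper proposes CmOS, a multimodal framework for generating multiple-choice questions (MCQs) with visual options, addressing the neglect of visual options and the difficulty of producing high-quality distractors in prior work. By combining Multimodal Chain-of-Thought (MCoT) and Retrieval-Augmented Generation (RAG), it generates options that are both semantically and visually plausible. Experiments show that CmOS outperforms existing methods across multiple subjects and educational levels.

Details

Motivation: Existing MCQ generation research focuses mainly on textual options and overlooks the potential of visual options; manually authoring high-quality distractors is costly and hard to scale.

Result: Experiments show that CmOS outperforms existing methods on content discrimination, question generation, and visual option generation.

Insight: Visual options have potential in teaching, and cross-modal generation can effectively improve the quality and diversity of MCQs.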

Abstract: Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options and has largely overlooked visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates a Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answer and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation, and visual option generation over existing methods across various subjects and educational levels.

[25] Design, Implementation and Evaluation of a Real-Time Remote Photoplethysmography (rPPG) Acquisition System for Non-Invasive Vital Sign Monitoring cs.CVPDF

Constantino Álvarez Casado, Sasan Sharifipour, Manuel Lage Cañellas, Nhi Nguyen, Le Nguyen

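TL;DR: This paper presents a real-time remote photoplethysmography (rPPG) system for low-power devices that extracts physiological signals such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2) from facial video streams. The system uses a multithreaded architecture and a hybrid programming model (FRP + Actor) to achieve efficient real-time processing on resource-constrained platforms.

Details

Motivation: As smart environments and low-power computing devices spread, demand for remote, non-contact physiological monitoring is growing, but real-time deployment on resource-constrained platforms raises scalability and performance challenges.

Result: The system is robust under real-time constraints while minimizing computational overhead.

Insight: Through the hybrid programming model and adaptive feedback, the system achieves efficient real-time processing on low-power devices, offering a practical solution for modern healthcare and human-computer interaction applications.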

Abstract: The growing integration of smart environments and low-power computing devices, coupled with mass-market sensor technologies, is driving advancements in remote and non-contact physiological monitoring. However, deploying these systems in real-time on resource-constrained platforms introduces significant challenges related to scalability, interoperability, and performance. This paper presents a real-time remote photoplethysmography (rPPG) system optimized for low-power devices, designed to extract physiological signals, such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2), from facial video streams. The system is built on the Face2PPG pipeline, which processes video frames sequentially for rPPG signal extraction and analysis, while leveraging a multithreaded architecture to manage video capture, real-time processing, network communication, and graphical user interface (GUI) updates concurrently. This design ensures continuous, reliable operation at 30 frames per second (fps), with adaptive feedback through a collaborative user interface to guide optimal signal capture conditions. The network interface includes both an HTTP server for continuous video streaming and a RESTful API for on-demand vital sign retrieval. To ensure accurate performance despite the limitations of low-power devices, we use a hybrid programming model combining Functional Reactive Programming (FRP) and the Actor Model, allowing event-driven processing and efficient task parallelization. The system is evaluated under real-time constraints, demonstrating robustness while minimizing computational overhead. Our work addresses key challenges in real-time biosignal monitoring, offering practical solutions for optimizing performance in modern healthcare and human-computer interaction applications.

[26] PseudoMapTrainer: Learning Online Mapping without HD Maps cs.CV | cs.LG | cs.ROPDF

Christian Löwens, Thorben Funke, Jingchao Xie, Alexandru Paul Condurache

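TL;DR: PseudoMapTrainer is an online mapping approach that needs no high-definition (HD) maps: it generates pseudo-labels from unlabeled sensor data and handles partially masked pseudo-labels.

Details

Motivation: Existing online mapping methods rely on ground-truth HD map annotations that are expensive and geographically limited, which restricts generalization.

Result: For the first time, online mapping models are trained without any ground-truth HD maps, and unlabeled data can also be used for pre-training.

Insight: Pseudo-label generation and the handling of partial masking are key to learning online mapping from unlabeled data, opening a new direction for map learning.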

Abstract: Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.

[27] Robust and Label-Efficient Deep Waste Detection cs.CVPDF

Hassan Abid, Khan Muhammad, Muhammad Haris Khan

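TL;DR: The paper proposes an ensemble-based semi-supervised learning framework to improve the robustness and label efficiency of waste detection. By optimizing prompts and fine-tuning transformer detectors, it establishes a new baseline on the ZeroWaste dataset (51.6 mAP), and it proposes a soft pseudo-labeling strategy that, applied to unlabeled data, surpasses fully supervised training.

Details

Motivation: Waste sorting is critical for sustainable recycling, but AI research lags behind commercial systems because of limited datasets and reliance on legacy object detectors. The paper aims to advance AI-driven waste detection by establishing strong baselines and introducing a semi-supervised learning framework.

Result: Fine-tuned transformer detectors reach a baseline of 51.6 mAP on ZeroWaste; the soft pseudo-labeling strategy applied to unlabeled data outperforms fully supervised training.

Insight: 1. Optimized prompts are crucial for zero-shot detection performance. 2. Ensemble-based semi-supervised learning can substantially reduce dependence on labeled data. 3. Transformer architectures show strong potential for waste detection.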

Abstract: Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.

[28] Embedding Font Impression Word Tags Based on Co-occurrence cs.CVPDF

Yugo Kubota, Seiichi Uchida

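TL;DR: The paper proposes a co-occurrence-based embedding method for font impression tags: it builds a tag co-occurrence graph and applies spectral embedding to produce tag vectors. The method outperforms standard word embeddings such as BERT and CLIP and is particularly suited to impression-based font generation and retrieval.

Details

Motivation: Font shapes are closely related to the word tags that describe their impressions. Standard word embeddings such as BERT and CLIP cannot capture this relationship accurately, so a new embedding method is needed to represent font impressions.

Result: Experiments show that the method outperforms BERT and CLIP in impression-guided font generation.

Insight: Exploiting tag co-occurrence captures the association between font shapes and their impressions more effectively, giving better support for font design tasks.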

Abstract: Different font styles (i.e., font shapes) convey distinct impressions, indicating a close relationship between font shapes and word tags describing those impressions. This paper proposes a novel embedding method for impression tags that leverages these shape-impression relationships. For instance, our method assigns similar vectors to impression tags that frequently co-occur in order to represent impressions of fonts, whereas standard word embedding methods (e.g., BERT and CLIP) yield very different vectors. This property is particularly useful for impression-based font generation and font retrieval. Technically, we construct a graph whose nodes represent impression tags and whose edges encode co-occurrence relationships. Then, we apply spectral embedding to obtain the impression vectors for each tag. We compare our method with BERT and CLIP in qualitative and quantitative evaluations, demonstrating that our approach performs better in impression-guided font generation.
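The graph-construction step described above can be sketched in a few lines. This is an illustration of counting tag co-occurrences only; the font names, tags, and function name are hypothetical, the edge weighting by raw count is an assumption, and the subsequent spectral embedding step is not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(font_tags):
    """Count how often each pair of impression tags appears on the same font.

    Returns a Counter keyed by alphabetically ordered tag pairs; each key is
    an undirected edge and each value is its co-occurrence count.
    """
    edges = Counter()
    for tags in font_tags.values():
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical fonts and their impression tags.
fonts = {
    "font_a": ["elegant", "formal", "thin"],
    "font_b": ["elegant", "formal"],
    "font_c": ["playful", "bold"],
}

edges = cooccurrence_edges(fonts)
for pair, count in sorted(edges.items()):
    print(pair, count)
```

Tags that often appear together, such as "elegant" and "formal" here, end up joined by heavier edges, which is the structure a spectral embedding would then map to nearby vectors.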

[29] Deep Pre-trained Time Series Features for Tree Species Classification in the Dutch Forest Inventory cs.CVPDF

Takayuki Ishikawa, Carmelo Bonannella, Bas J. W. Lerink, Marc Rußwurm

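TL;DR: The paper investigates extracting time-series features with pre-trained deep learning models to improve tree species classification in the Dutch National Forest Inventory (NFI), improving accuracy by up to 10% over traditional methods.

Details

Motivation: NFIs rely on manual field surveys that are time-consuming and labor-intensive. Combining remote sensing with machine learning enables more efficient large-scale updates. However, existing methods rely mainly on Random Forest classifiers and hand-designed features, which cannot fully capture complex seasonal reflectance patterns. Pre-trained deep learning models may offer a better solution.

Result: Deep features from pre-trained models improve accuracy by up to 10% over the current state of the art on Dutch tree species classification, confirming the advantage of deep features in data-limited tasks.

Insight: Pre-trained deep learning models can substantially improve classification accuracy in data-limited tasks and offer an efficient complement for practical applications such as NFIs. Combining multi-source satellite data further strengthens generalization.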

Abstract: National Forest Inventories (NFIs) serve as the primary source of forest information, providing crucial tree species distribution data. However, maintaining these inventories requires labor-intensive on-site campaigns. Remote sensing approaches, particularly when combined with machine learning, offer opportunities to update NFIs more frequently and at larger scales. While the use of Satellite Image Time Series has proven effective for distinguishing tree species through seasonal canopy reflectance patterns, current approaches rely primarily on Random Forest classifiers with hand-designed features and phenology-based metrics. Using deep features from available pre-trained remote sensing foundation models offers a complementary strategy. These pre-trained models leverage unannotated global data, are meant to be used for general-purpose applications, and can then be efficiently fine-tuned with smaller labeled datasets for specific classification tasks. This work systematically investigates how deep features improve tree species classification accuracy in the Netherlands with few annotated data. Data-wise, we extracted time-series data from Sentinel-1, Sentinel-2, ERA5, and SRTM data using Google Earth Engine. Our results demonstrate that fine-tuning a publicly available remote sensing time series foundation model outperforms the current state-of-the-art in NFI classification in the Netherlands by a large margin of up to 10% across all datasets. This demonstrates that classic hand-defined harmonic features are too simple for this task and highlights the potential of using deep AI features for data-limited applications like NFI classification. By leveraging openly available satellite data and pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.

[30] Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression cs.CVPDF

Zizheng Guo, Bochao Zou, Yinuo Jia, Xiangyu Li, Huimin Ma

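TL;DR: The paper proposes a prior-guided video-level regression method for micro-expression analysis that combines an interval selection strategy with a synergistic optimization framework and significantly improves performance.

Details

Motivation: Existing micro-expression analysis methods rely on fixed-window classification and struggle to capture complex temporal dynamics; some video-level regression methods remain limited by manually predefined windows, leaving the problem only partially solved.

Result: The method achieves state-of-the-art performance on multiple benchmark datasets, with an STRS of 0.0562 on CAS(ME)^3 and 0.2000 on SAMMLV.

Insight: The temporal characteristics of micro-expressions are key; combining prior knowledge with multi-task synergistic optimization can significantly improve model performance.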

Abstract: Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual’s genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model’s capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)$^3$ and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.

[31] Quantitative Outcome-Oriented Assessment of Microsurgical Anastomosis cs.CVPDF

Luyin Hu, Soheil Gholami, George Dindelegan, Torstein R. Meling, Aude Billard

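TL;DR: The paper proposes a quantitative, image-processing-based framework for objective assessment of microsurgical anastomosis, reducing the bias of subjective judgment and improving the efficiency and reliability of assessment.

Details

Motivation: Assessment of microsurgical anastomosis currently relies on subjective methods that are prone to bias and unreliability, so an objective, quantitative method is needed.

Result: The geometric metrics effectively reproduce expert raters' scores, demonstrating the validity of the approach.

Insight: Quantitative methods can markedly improve the objectivity and reliability of microsurgical anastomosis assessment and apply to learners at different skill levels.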

Abstract: Microsurgical anastomosis demands exceptional dexterity and visuospatial skills, underscoring the importance of comprehensive training and precise outcome assessment. Currently, methods such as the outcome-oriented anastomosis lapse index are used to evaluate this procedure. However, they often rely on subjective judgment, which can introduce biases that affect the reliability and efficiency of the assessment of competence. Leveraging three datasets from hospitals with participants at various levels, we introduce a quantitative framework that uses image-processing techniques for objective assessment of microsurgical anastomoses. The approach uses geometric modeling of errors along with a detection and scoring mechanism, enhancing the efficiency and reliability of microsurgical proficiency assessment and advancing training protocols. The results show that the geometric metrics effectively replicate expert raters’ scoring for the errors considered in this work.

[32] Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization cs.CVPDF

Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta

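TL;DR: The paper proposes a meta-learning-based controllable full-frame video stabilization method that rapidly adapts to the low-level visual cues of each input video, significantly improving stability and visual quality, and introduces a jerk localization module and a targeted adaptation strategy.

Details

Motivation: Existing pixel-level synthesis stabilization methods struggle to adapt to the diversity of motion and visual content across videos, which limits generalization. The paper uses meta-learning for rapid adaptation to improve stability and visual quality.

Result: Experiments show that the method significantly improves various full-frame synthesis models in both stability and visual quality and performs well on downstream tasks.

Insight: Combining meta-learning with targeted adaptation effectively addresses generalization in video stabilization while keeping the benefit of full-frame output.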

Abstract: Video stabilization remains a fundamental problem in computer vision, and pixel-level synthesis solutions, which synthesize full-frame outputs, add to the complexity of the task. These methods aim to enhance stability while synthesizing full-frame videos, but the inherent diversity in motion profiles and visual content present in each video sequence makes robust generalization with fixed parameters difficult. To address this, we present a novel method that improves pixel-level synthesis video stabilization methods by rapidly adapting models to each input video at test time. The proposed approach takes advantage of low-level visual cues available during inference to improve both the stability and visual quality of the output. Notably, the proposed rapid adaptation achieves significant performance gains even with a single adaptation pass. We further propose a jerk localization module and a targeted adaptation strategy, which focuses the adaptation on high-jerk segments to maximize stability with fewer adaptation steps. The proposed methodology enables modern stabilizers to surpass the longstanding SOTA approaches while maintaining the full-frame nature of the modern methods and offering users control mechanisms akin to classical approaches. Extensive experiments on diverse real-world datasets demonstrate the versatility of the proposed method. Our approach consistently improves the performance of various full-frame synthesis models in both qualitative and quantitative terms, including results on downstream applications.
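Jerk is the third time derivative of position, so high-jerk segments can be located with a third-order finite difference of the camera trajectory. The sketch below is only an assumption about what such localization could look like on a 1-D per-frame trajectory; the abstract does not specify the paper's module, and the function names, threshold, and trajectory are hypothetical.

```python
def jerk(positions):
    """Third-order finite difference of a per-frame 1-D trajectory."""
    return [
        positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]
        for i in range(len(positions) - 3)
    ]


def high_jerk_indices(positions, threshold):
    """Indices of jerk samples whose magnitude exceeds the threshold."""
    return [i for i, j in enumerate(jerk(positions)) if abs(j) > threshold]


# Hypothetical horizontal camera position per frame; frame 4 is a sudden jolt.
trajectory = [0, 1, 2, 3, 10, 5, 6, 7]

print(jerk(trajectory))
print(high_jerk_indices(trajectory, threshold=10))
```

Smooth linear motion yields zero jerk, so only the frames around the jolt are flagged; restricting adaptation to such segments is the kind of targeting the abstract describes.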

Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou

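TL;DR: The paper proposes DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations through text-guided attribute disentanglement, improving the fairness and robustness of medical image diagnosis.

Details

Motivation: Ensuring fairness across demographic groups in medical image diagnosis is essential for equitable healthcare. However, existing methods handle the vision and text modalities independently, leaving cross-modal misalignment and fairness gaps.

Result: Experiments on eight medical imaging datasets show that DualFairVL outperforms existing methods in fairness and accuracy with only 3.6M trainable parameters.

Insight: Disentangling and aligning cross-modal representations can markedly improve fairness and robustness, especially under distribution shift.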

Abstract: Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.

[34] DQEN: Dual Query Enhancement Network for DETR-based HOI Detection cs.CVPDF

Zhehao Li, Chong Wang, Yi Chen, Yinghao Lu, Jiangbo Qian

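TL;DR: DQEN is a Dual Query Enhancement Network that improves the object and interaction queries in DETR-based HOI detection, using object awareness and semantic fusion to boost detection performance.

Details

Motivation: Existing DETR-based HOI detection models rely on randomly initialized queries, which yield vague representations and limit effectiveness. DQEN aims to improve detection by enhancing the object and interaction queries.

Result: The method achieves competitive performance on the HICO-Det and V-COCO datasets.

Insight: Initializing object and interaction queries with explicit meaning, combined with semantic information, can significantly improve DETR-based HOI detection.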

Abstract: Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model’s effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model’s ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.

[35] Interpretable Decision-Making for End-to-End Autonomous Driving cs.CV | cs.AI | cs.LG | cs.ROPDF

Mona Mirzaie, Bodo Rosenhahn

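TL;DR: This paper proposes a method for making end-to-end autonomous driving decisions more interpretable: its loss functions produce sparse, localized feature maps that explain the basis of the model's decisions, and the method performs strongly on the CARLA benchmarks.

Details

Motivation: Autonomous driving systems must be trusted before they can be widely deployed, yet the deep neural networks behind current end-to-end approaches lack interpretability, especially in complex urban scenarios. Improving the interpretability of their decisions is therefore essential.

Result: The method performs strongly on the CARLA benchmarks: the monocular, non-ensemble model surpasses the top-performing approaches on the leaderboard, reducing infractions and raising the route completion rate.

Insight: Interpretability and performance can reinforce each other: optimizing the localization and sparsity of the feature maps not only makes decisions more interpretable but also improves the safety of the driving model.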

Abstract: Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

[36] Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025 cs.CVPDF

Thien-Phuc Tran, Minh-Quang Nguyen, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do

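TL;DR: The EVENTA Grand Challenge is a multimodal understanding task at ACM Multimedia 2025 that integrates contextual, temporal, and semantic information to address the shortcomings of traditional image tasks in event-level understanding.

Details

Motivation: Traditional image tasks such as captioning and retrieval usually focus on surface-level recognition and overlook the contextual and semantic dimensions that define real-world events. EVENTA aims to fill this gap.

Result: A total of 45 teams took part in the challenge, and the top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA lays a foundation for multimedia AI applications such as journalism, media analysis, and cultural archiving.

Insight: EVENTA highlights the importance of event-level understanding, offers new directions and evaluation standards for future research, and pushes multimodal AI toward richer contextual and narrative capabilities.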

Abstract: The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.

[37] Preliminary Study on Space Utilization and Emergent Behaviors of Group vs. Single Pedestrians in Real-World Trajectories cs.CV | stat.APPDF

Amartaivan Sanjjamts, Morita Hiroshi

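TL;DR: This paper proposes a preliminary framework for distinguishing pedestrian groups from individuals using real-world trajectory data, and defines spatial and behavioral metrics for analyzing how they differ.

Details

Motivation: The study seeks to understand how pedestrian groups and individuals differ in space utilization and behavioral patterns, providing a basis for crowd dynamics research.

Result: The paper establishes a classification pipeline and dataset structure that support analysis at different sequence lengths, providing tools for future research.

Insight: The framework of spatial and behavioral metrics makes deeper analysis of group-versus-individual differences possible and can support pedestrian simulation and the validation of space designs.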

Abstract: This study presents an initial framework for distinguishing group and single pedestrians based on real-world trajectory data, with the aim of analyzing their differences in space utilization and emergent behavioral patterns. By segmenting pedestrian trajectories into fixed time bins and applying a Transformer-based pair classification model, we identify cohesive groups and isolate single pedestrians through a structured sequence-based filtering process. To prepare for deeper analysis, we establish a comprehensive metric framework incorporating both spatial and behavioral dimensions. Spatial utilization metrics include convex hull area, smallest enclosing circle radius, and heatmap-based spatial densities to characterize how different pedestrian types occupy and interact with space. Behavioral metrics such as velocity change, motion angle deviation, clearance radius, and trajectory straightness are designed to capture local adaptations and responses during interactions. Furthermore, we introduce a typology of encounter types (single-to-single, single-to-group, and group-to-group) to categorize and later quantify different interaction scenarios. Although this version focuses primarily on the classification pipeline and dataset structuring, it establishes the groundwork for scalable analysis across sequence lengths of 60, 100, and 200 frames. Future versions will incorporate complete quantitative analysis of the proposed metrics and their implications for pedestrian simulation and space design validation in crowd dynamics research.
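
Two of the metrics named in this abstract, convex hull area and trajectory straightness, can be illustrated with a short sketch. The function names and the exact definitions below (straightness as net displacement divided by path length, hull area via a monotone-chain hull and the shoelace formula) are assumptions chosen for illustration; the abstract does not state how the authors compute them.

```python
import math

def straightness(traj):
    """Net displacement divided by travelled path length (1.0 = straight line)."""
    path = sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))
    return math.dist(traj[0], traj[-1]) / path if path > 0 else 1.0

def convex_hull_area(points):
    """Area of the convex hull of 2D points (monotone chain + shoelace)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        hull = []
        for p in seq:
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]  # last point is the first point of the other half

    hull = half_hull(pts) + half_hull(reversed(pts))
    # Shoelace formula over the hull vertices in order
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0

# Hypothetical positions (in meters) of one pedestrian group over a time bin
group_positions = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(convex_hull_area(group_positions))
print(straightness([(0, 0), (3, 0), (3, 4)]))
```

A larger hull area for the same number of people would indicate a more spread-out group, while a straightness well below 1 would indicate detours, for example when avoiding an oncoming group.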

[38] The point is the mask: scaling coral reef segmentation with weak supervision cs.CV | cs.AIPDF

Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde

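TL;DR: The paper proposes a multi-scale weakly supervised semantic segmentation framework that addresses large-scale coral reef monitoring by transferring fine-scale ecological information from underwater imagery to aerial data. The method combines classification-based supervision, spatial interpolation, and self-distillation, enabling large-area reef segmentation with minimal annotation.

Details

Motivation: Large-scale coral reef monitoring is essential for assessing ecosystem health and guiding conservation, but existing aerial imagery has limited resolution, making fine-scale classes such as coral morphotypes hard to distinguish, while costly pixel-level annotation limits the scalability of deep learning methods.

Result: Experiments demonstrate the method's effectiveness: it segments coral morphotypes over large areas and flexibly accommodates new classes.

Insight: By combining low-cost data collection with weakly supervised learning, the paper offers a scalable, cost-effective approach to high-resolution reef monitoring and a new tool for conservation.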

Abstract: Monitoring coral reefs at large spatial scales remains an open challenge, essential for assessing ecosystem health and informing conservation efforts. While drone-based aerial imagery offers broad spatial coverage, its limited resolution makes it difficult to reliably distinguish fine-scale classes, such as coral morphotypes. At the same time, obtaining pixel-level annotations over large spatial extents is costly and labor-intensive, limiting the scalability of deep learning-based segmentation methods for aerial imagery. We present a multi-scale weakly supervised semantic segmentation framework that addresses this challenge by transferring fine-scale ecological information from underwater imagery to aerial data. Our method enables large-scale coral reef mapping from drone imagery with minimal manual annotation, combining classification-based supervision, spatial interpolation and self-distillation techniques. We demonstrate the efficacy of the approach, enabling large-area segmentation of coral morphotypes and demonstrating flexibility for integrating new classes. This study presents a scalable, cost-effective methodology for high-resolution reef monitoring, combining low-cost data collection, weakly supervised deep learning and multi-scale remote sensing.

[39] Enhancing compact convolutional transformers with super attention cs.CV | cs.LGPDF

Simpenzwe Honore Leandre, Natenaile Asmamaw Shiferaw, Dillip Rout

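TL;DR: This paper proposes a vision model that combines token mixing, sequence pooling, and a convolutional tokenizer, achieving efficient inference and SOTA performance on fixed-context-length tasks.

Details

Motivation: Existing attention mechanisms such as SDPA are inefficient on short-context tasks and depend on additional techniques such as data augmentation and positional embeddings. The paper aims to design a more efficient and more stable model.

Result: On CIFAR100, top-1 validation accuracy rises from 36.50% to 46.29% and top-5 from 66.33% to 76.31%, with a smaller and more efficient model.

Insight: For short-context tasks, super attention is more efficient than conventional attention, and the model design can simplify the training pipeline.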

Abstract: In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. On the CIFAR100 benchmark, our model significantly improves on the baseline, raising top-1 validation accuracy from 36.50% to 46.29% and top-5 validation accuracy from 66.33% to 76.31%, while being more efficient than Scaled Dot Product Attention (SDPA) transformers when the context length is less than the embedding dimension, and only 60% of their size. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation like mixup, positional embeddings, or learning rate scheduling. We make our code available on GitHub.

[40] Can we make NeRF-based visual localization privacy-preserving? cs.CVPDF

Maxime Pietrantoni, Martin Humenberger, Torsten Sattler, Gabriela Csurka

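TL;DR: The paper examines how NeRFs (neural radiance fields) can leak private information in visual localization, proposes a new protocol for assessing the privacy preservation of NeRF-based representations, and develops a privacy-preserving variant called ppNeSF. ppNeSF is trained on self-supervised segmentation labels instead of RGB images, protecting privacy while still enabling accurate visual localization.

Details

Motivation: The widespread use of NeRFs in visual localization creates a risk of privacy leakage because they implicitly store fine scene details. The paper aims to address this so that NeRFs deployed in cloud services retain their high performance while protecting privacy.

Result: ppNeSF achieves state-of-the-art visual localization performance while preserving privacy, validating its effectiveness.

Insight: NeRF privacy risks are not confined to the color prediction head; the geometry representation is also a potential source of sensitive information. Segmentation labels can serve as a privacy-preserving alternative.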

Abstract: Visual localization (VL) is the task of estimating the camera pose in a known scene. VL methods can be distinguished, among other criteria, by how they represent the scene, e.g., explicitly through a (sparse) point cloud or a collection of images, or implicitly through the weights of a neural network. Recently, NeRF-based methods have become popular for VL. While NeRFs offer high-quality novel view synthesis, they inadvertently encode fine scene details, raising privacy concerns when deployed in cloud-based localization services as sensitive information could be recovered. In this paper, we tackle this challenge on two ends. We first propose a new protocol to assess privacy-preservation of NeRF-based representations. We show that NeRFs trained with photometric losses store fine-grained details in their geometry representations, making them vulnerable to privacy attacks, even if the head that predicts colors is removed. Second, we propose ppNeSF (Privacy-Preserving Neural Segmentation Field), a NeRF variant trained with segmentation supervision instead of RGB images. These segmentation labels are learned in a self-supervised manner, ensuring they are coarse enough to obscure identifiable scene details while remaining discriminative in 3D. The segmentation space of ppNeSF can be used for accurate visual localization, yielding state-of-the-art results.

[41] Enhancing Document VQA Models via Retrieval-Augmented Generation cs.CVPDF

Eric López, Artemis Llabrés, Ernest Valveny

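TL;DR: The paper improves document VQA performance with retrieval-augmented generation (RAG), evaluating text-based and visual retrieval variants and substantially improving accuracy on several benchmarks.

Details

Motivation: Existing document VQA systems consume large amounts of memory on multi-page documents; RAG offers an efficient, lightweight alternative.

Result: On datasets such as MP-DocVQA, the text-based retrieval variant improves on the baseline by up to 22.5 ANLS and the visual retrieval variant by 5.0 ANLS.

Insight: Careful evidence selection, rather than layout-guided chunking, is the key to the performance gains, and it holds across model sizes and benchmarks.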

Abstract: Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the “concatenate-all-pages” baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
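
The gains in this abstract are reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration, the sketch below follows the definition commonly used for document VQA benchmarks: per question, take the best similarity over the accepted answers and zero it out when the normalized edit distance reaches a threshold of 0.5. The lowercasing and whitespace stripping are assumptions, and the paper's exact evaluation script may differ; scores here lie in [0, 1], whereas the abstract quotes them on a 0 to 100 scale.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any reference."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)

# Hypothetical predictions: one exact match, one completely wrong answer
print(anls(["Paris", "xyz"], [["paris"], ["paris"]]))
```

The threshold means that an answer with a small OCR-style typo still earns partial credit, while an unrelated answer earns none.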

[42] Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone cs.CVPDF

Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth

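TL;DR: GRAS is a benchmark for evaluating bias in vision language models (VLMs) with respect to gender, race, age, and skin tone, and it introduces an interpretable GRAS Bias Score. Evaluating five state-of-the-art VLMs reveals high bias levels, with even the least biased model scoring only 2 out of 100. The study also finds that evaluating bias with visual question answering (VQA) requires considering multiple formulations of a question.

Details

Motivation: As VLMs become widespread in real-world applications, understanding their demographic biases is critical. A comprehensive benchmark for measuring these biases has been lacking, which motivates GRAS.

Result: The five state-of-the-art VLMs show high bias levels, with even the least biased model scoring only 2 out of 100. The study also finds that different formulations of a question affect the measured bias.

Insight: A single question formulation may not give a complete picture of a VLM's bias; multiple formulations should be considered for a comprehensive evaluation.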

Abstract: As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

[43] RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation cs.CV | cs.AIPDF

Siyuan You, Guozheng Xu, Pengwei Zhou, Qiwen Jin, Jian Yao

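TL;DR: RoofSeg is an edge-aware, transformer-based network for end-to-end roof plane segmentation from LiDAR point clouds. It addresses the low feature discriminability near edges and the underuse of geometric characteristics in current deep learning methods.

Details

Motivation: Most existing roof plane segmentation methods rely on hand-crafted or learned features followed by geometric clustering strategies, but they are not end-to-end, have low feature discriminability near edges, and underuse geometric characteristics.

Result: RoofSeg outperforms existing methods in end-to-end segmentation and edge-region accuracy, improving roof plane segmentation performance.

Insight: Combining geometric priors with the global modeling capacity of transformers can significantly improve point cloud segmentation, especially in edge regions.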

Abstract: Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches for roof plane segmentation rely on manually designed or learned features followed by specifically designed geometric clustering strategies. Because learned features are more powerful than manually designed ones, deep learning-based approaches usually perform better than traditional approaches. However, current deep learning-based approaches have three unsolved problems. The first is that most of them are not truly end-to-end, so the plane segmentation results may not be optimal. The second is that the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. The third is that the planar geometric characteristics are not sufficiently considered to constrain the network training. To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict the plane instance masks using a set of learnable plane queries. To further improve the segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module (EAMM) that incorporates the planar geometric prior of edges to enhance its discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and a new plane geometric loss to constrain the network training.

[44] MicroDetect-Net (MDN): Leveraging Deep Learning to Detect Microplastics in Clam Blood, a Step Towards Human Blood Analysis cs.CVPDF

Riju Marwah, Riya Arora, Navneet Yadav, Himank Arora

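TL;DR: MicroDetect-Net (MDN) is a deep learning model that combines fluorescence microscopy with Nile Red staining to detect microplastics in clam blood, laying groundwork for future human blood analysis.

Details

Motivation: Microplastic pollution is increasingly severe and poses a potential threat to human health. Efficient, accurate detection methods are lacking, which calls for an automated deep-learning-based solution.

Result: On 276 Nile Red-stained images, MDN reaches 92% accuracy, an IoU of 87.4%, and an F1 score of 92.1%.

Insight: Combining deep learning with conventional staining techniques can significantly improve microplastic detection efficiency and could be extended to human blood samples.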

Abstract: With the prevalence of plastics exceeding 368 million tons yearly, microplastic pollution has grown to an extent where air, water, soil, and living organisms have all tested positive for microplastic presence. These particles, which are smaller than 5 millimeters in size, are no less harmful to humans than to the environment. Toxicity research on microplastics has shown that exposure may cause liver infection, intestinal injuries, and gut flora imbalance, leading to numerous potential health hazards. This paper presents a new model, MicroDetect-Net (MDN), which applies fluorescence microscopy with Nile Red dye staining and deep learning to scan blood samples for microplastics. Although clam blood has certain limitations in replicating real human blood, this study opens avenues for applying the approach to human samples, which are more consistent for preliminary data collection. The MDN model integrates dataset preparation, fluorescence imaging, and segmentation using a convolutional neural network to localize and count microplastic fragments. The combination of convolutional networks and Nile Red dye for segmentation produced strong image detection and accuracy. MDN was evaluated on a dataset of 276 Nile Red-stained fluorescent blood images and achieved an accuracy of 92 percent. Robust performance was observed with an Intersection over Union of 87.4 percent, an F1 score of 92.1 percent, a Precision of 90.6 percent, and a Recall of 93.7 percent. These metrics demonstrate the effectiveness of MDN in the detection of microplastics.
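
The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below is only a consistency check on the numbers quoted in the abstract; the function name is illustrative and nothing here reflects the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract, converted from percent to fractions
reported_precision = 0.906
reported_recall = 0.937

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.921"
```

The result agrees with the 92.1 percent F1 score stated in the abstract.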

[45] ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval cs.CVPDF

Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao

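TL;DR: ProPy builds on CLIP, constructing an interactive Prompt Pyramid and an Ancestor-Descendant Interaction Mechanism that significantly improve partially relevant video retrieval (PRVR), reaching SOTA on three public datasets.

Details

Motivation: PRVR is a practical yet challenging task. Existing methods mainly process unimodal features, while powerful pretrained vision-language models such as CLIP remain underexplored in this field.

Result: Achieves SOTA performance on three public datasets, outperforming previous models by significant margins.

Insight: Capturing semantics at multiple granularities and dynamic interaction mechanisms are key to improving PRVR.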

Abstract: Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaptation of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.

[46] GReAT: leveraging geometric artery data to improve wall shear stress assessment cs.CV | cs.LGPDF

Julian Suk, Jolanda J. Wentzel, Patryk Rygiel, Joost Daemen, Daniel Rueckert

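TL;DR: GReAT leverages large-scale geometric artery data through self-supervised pre-training to improve wall shear stress assessment, addressing the scarcity of medical image data.

Details

Motivation: In cardiovascular health, machine learning on patient-specific medical images could avoid time-intensive fluid simulations, but sufficiently large datasets are lacking.

Result: A model pre-trained on a large dataset of geometric artery models (8449 shapes) improves the segmentation of wall shear stress regions on data from a small clinical trial (49 patients).

Insight: Representations learned from geometric data can strengthen analysis of small clinical datasets, offering a new data-driven approach for cardiovascular health.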

Abstract: Leveraging big data for patient care is promising in many medical fields such as cardiovascular health. For example, hemodynamic biomarkers like wall shear stress could be assessed from patient-specific medical images via machine learning algorithms, bypassing the need for time-intensive computational fluid simulation. However, it is extremely challenging to amass large-enough datasets to effectively train such models. We could address this data scarcity by means of self-supervised pre-training and foundation models given large datasets of geometric artery models. In the context of coronary arteries, leveraging learned representations to improve hemodynamic biomarker assessment has not yet been well studied. In this work, we address this gap by investigating whether a large dataset (8449 shapes) consisting of geometric models of 3D blood vessels can benefit wall shear stress assessment in coronary artery models from a small-scale clinical trial (49 patients). We create a self-supervised target for the 3D blood vessels by computing the heat kernel signature, a quantity obtained via Laplacian eigenvectors, which captures the very essence of the shapes. We show how geometric representations learned from this dataset can boost segmentation of coronary arteries into regions of low, mid and high (time-averaged) wall shear stress even when trained on limited data.
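
For reference, the heat kernel signature mentioned in this abstract is usually defined from the eigenvalues and eigenfunctions of the Laplace-Beltrami operator of the surface. The formula below is the standard textbook definition; how the authors discretize the operator, truncate the sum, or choose the time scales is not stated in the abstract.

```latex
% Standard heat kernel signature at surface point x and diffusion time t.
% \lambda_i and \phi_i are the eigenvalues and eigenfunctions of the
% Laplace-Beltrami operator, i.e. \Delta \phi_i = \lambda_i \phi_i.
\mathrm{HKS}(x, t) = \sum_{i \ge 0} e^{-\lambda_i t} \, \phi_i(x)^2
```

Small values of t emphasize local geometry around x, while large values of t are dominated by the low-frequency eigenfunctions and so reflect the global shape, which is one reason the signature is a plausible pre-training target.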

[47] No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes cs.CV | cs.AIPDF

Blaž Rolih, Matic Fučka, Danijel Skočaj

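TL;DR: SuperSimpleNet is an efficient, adaptable discriminative model that trains effectively under four supervision regimes (unsupervised, weakly supervised, mixed, and fully supervised). It combines synthetic anomaly generation, an enhanced classification head, and an improved learning procedure to deliver strong performance and fast inference.

Details

Motivation: Industrial surface defect detection demands high-performing, efficient, and adaptable models, but existing methods are usually restricted to specific supervision scenarios and struggle with the diverse data annotations of real-world manufacturing.

Result: Performs strongly on four benchmark datasets, outperforming existing methods with an inference time below 10 ms.

Insight: By unifying supervision paradigms, SuperSimpleNet narrows the gap between academic research and industrial application and offers a practical solution for real manufacturing challenges.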

Abstract: Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: https://github.com/blaz-r/SuperSimpleNet

[48] VibES: Induced Vibration for Persistent Event-Based Sensing cs.CV | cs.ROPDF

Vincenzo Polizzi, Stephen Yang, Quentin Clark, Jonathan Kelly, Igor Gilitschenski

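TL;DR: This paper presents VibES, a lightweight approach that uses periodic vibration to induce persistent event generation in event cameras, solving the problem that event cameras produce no events in static or low-motion scenes. Combined with a motion-compensation pipeline, it provides clean, motion-corrected events for downstream tasks.

Details

Motivation: Event cameras cannot generate events in static or low-motion scenes, which limits their applicability. Prior approaches require complex hardware or additional optical components; this work proposes a lightweight solution.

Result: Experiments show that the method reliably recovers motion parameters and outperforms event-based sensing without motion induction on image reconstruction and edge detection.

Insight: Lightweight mechanical vibration is an effective way to broaden the applicability of event cameras without complex hardware.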

Abstract: Event cameras are a bio-inspired class of sensors that asynchronously measure per-pixel intensity changes. Under fixed illumination conditions in static or low-motion scenes, rigidly mounted event cameras are unable to generate any events, becoming unsuitable for most computer vision tasks. To address this limitation, recent work has investigated motion-induced event stimulation that often requires complex hardware or additional optical components. In contrast, we introduce a lightweight approach to sustain persistent event generation by employing a simple rotating unbalanced mass to induce periodic vibrational motion. This is combined with a motion-compensation pipeline that removes the injected motion and yields clean, motion-corrected events for downstream perception tasks. We demonstrate our approach with a hardware prototype and evaluate it on real-world captured datasets. Our method reliably recovers motion parameters and improves both image reconstruction and edge detection over event-based sensing without motion induction.

[49] Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents cs.CV | cs.AI | cs.LGPDF

Rafael Sterzinger, Tingyu Lin, Robert Sablatnig

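TL;DR: This paper proposes a text line segmentation method for historical documents based on a lightweight UNet++ and a topology-aware loss function, achieving effective segmentation from a small amount of annotated data.

Details

Motivation: Text line segmentation in historical documents usually requires large amounts of annotated data, but annotation is costly and demands expert knowledge. The paper therefore explores few-shot learning to reduce data requirements.

Result: On the U-DIADS-TL dataset, Recognition Accuracy increases by 200% and Line Intersection over Union by 75%; the F-Measure is on par with the DIVA-HisDB competition winner.

Insight: Lightweight architectures and topology-aware losses can effectively tackle few-shot historical document segmentation, offering a new approach for tasks with scarce annotations.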

Abstract: A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.

[50] Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding cs.CV | Information systems~Multimedia and multimodal retrievalPDF

Yuzhen Li, Min Liu, Yuan Bian, Xueping Wang, Zhaoyang Li

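TL;DR: To address text embeddings that react to numeric magnitudes but largely ignore measurement units in monocular 3D visual grounding, this paper proposes two methods for enhancing 3D perception, 3D-text Enhancement (3DTE) and Text-Guided Geometry Enhancement (TGE), which significantly improve model performance.

Details

Motivation: Pre-trained language models are insufficiently sensitive to measurement units (such as meters and centimeters) in 3D visual grounding, which degrades performance, so the model's 3D perception of text and geometry features needs to be enhanced.

Result: Sets new SOTA results on the Mono3DRefer dataset, with an accuracy gain of 11.94% in the "Far" scenario.

Insight: Handling of measurement units is a key factor in 3D visual grounding; strengthening the synergy between text and geometry features can significantly improve performance.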

Abstract: Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length expressed in “meters” to the equivalent value in “decimeters” or “centimeters” leads to severe performance degradation, even though the physical length remains equivalent. This observation signifies the weak 3D comprehension of pre-trained language models, which generate misleading text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. Firstly, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the “Far” scenario. Our code will be made publicly available.
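
The idea of diversifying distance descriptors can be pictured as rewriting each query so the same physical length appears in several units. The sketch below is a hypothetical illustration of that kind of augmentation, not the authors' 3DTE implementation: the function name, the regular expression, and the assumption that distances in the query are written in meters are all made up for the example.

```python
import re

# Conversion factors from meters to each target unit (illustrative choice)
_FACTORS = {"meters": 1, "decimeters": 10, "centimeters": 100}

# Matches a number followed by a meter unit, e.g. "12.5 meters" or "3 m"
_METER_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:meters|meter|m)\b")

def convert_units(query: str, unit: str) -> str:
    """Rewrite every distance given in meters into the requested unit."""
    factor = _FACTORS[unit]

    def replace(match: re.Match) -> str:
        value = float(match.group(1)) * factor
        return f"{value:g} {unit}"  # :g drops a trailing ".0"

    return _METER_PATTERN.sub(replace, query)

query = "the black car 12.5 meters ahead on the left"
augmented = [convert_units(query, unit) for unit in _FACTORS]
for text in augmented:
    print(text)
```

Training on all three variants of each query would expose the text encoder to the fact that "12.5 meters" and "1250 centimeters" describe the same target, which is the relationship the abstract says pre-trained embeddings tend to miss.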

[51] Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions cs.CVPDF

Zhihang Xin, Xitong Hu, Rui Wang

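TL;DR: The paper proposes a positional encoding based on Weierstrass elliptic functions (WEF-PE) to address the disruption of spatial structure caused by one-dimensional learnable positional embeddings in conventional Vision Transformers, using the double periodicity of elliptic functions to better capture the translational invariance of visual data.

Details

Motivation: Conventional Vision Transformers destroy the two-dimensional spatial structure of images by flattening them, and existing positional encodings lack geometric constraints and cannot effectively exploit spatial proximity priors. The paper seeks a mathematically principled positional encoding better suited to visual data.

Result: With ViT-Tiny trained from scratch, CIFAR-100 accuracy reaches 63.78%; fine-tuning ViT-Base reaches 93.28%; results on VTAB-1k benchmark tasks are consistently better than conventional methods. Theoretical analysis and attention visualization support its geometric inductive bias and semantic focus.

Insight: The double periodicity of elliptic functions aligns closely with the translational invariance of visual data, giving a mathematical basis for their use in positional encoding; their non-linear geometry models spatial distance relationships more naturally.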

Abstract: Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model’s capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data. Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78% accuracy on CIFAR-100 from-scratch training with ViT-Tiny architecture, 93.28% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.

[52] SoccerNet 2025 Challenges Results cs.CVPDF

Silvio Giancola, Anthony Cioppa, Marc Gutiérrez-Pérez, Jan Held, Carlos Hinojosa

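TL;DR: The SoccerNet 2025 Challenges are the fifth edition of an open computer vision benchmark for football video understanding, comprising four tasks: team ball action spotting, monocular depth estimation, multi-view foul recognition, and game state reconstruction, with large-scale annotated datasets and unified evaluation protocols.

Details

Motivation: To advance computer vision research in football video analysis and provide an open, reproducible benchmarking platform.

Result: Reports the top-performing solutions and community progress for each task, showing the current state of football video understanding.

Insight: The SoccerNet Challenges have fostered interdisciplinary research between computer vision and sports analytics, demonstrating the value of open benchmarks.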

Abstract: The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year’s challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.

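[53] All-in-One Slider for Attribute Manipulation in Diffusion Models cs.CV

Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang

TL;DR: This paper proposes All-in-One Slider, a lightweight module for unified manipulation of multiple attributes in diffusion models, addressing the parameter redundancy and limited flexibility of traditional One-for-One methods.

Details

Motivation: Text-to-image (T2I) diffusion models have made significant progress in generating high-quality images, but progressively manipulating attributes of generated images to meet user needs remains challenging. Traditional methods train a separate slider module for each attribute, which leads to parameter redundancy and limited flexibility.

Result: Experiments show that the method manipulates attributes accurately and scalably, with notable improvements over previous methods. It also supports attribute manipulation on real images, broadening its applicability.

Insight: Sparse direction decomposition and zero-shot support offer a new approach to attribute manipulation in diffusion models and show the potential of lightweight modules for complex tasks.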

Abstract: Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.

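[54] LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding cs.CV | cs.AI | cs.GR

Julian Ost, Andrea Ramazzina, Amogh Joshi, Maximilian Bömer, Mario Bijelic

TL;DR: LSD-3D proposes a method for generating large-scale 3D driving scenes that combines the generation of a proxy geometry and environment representation with score distillation from 2D image priors, achieving high geometric consistency and controllability.

Details

Motivation: Among existing methods, neural reconstruction is limited to static environments and offers limited scene control, while diffusion-based generation lacks geometry grounding and causality. LSD-3D aims to bridge this gap.

Result: The method generates complex driving scenes with consistent geometry and high-fidelity texture and structure, and can be conditioned on map layouts.

Insight: Combining geometry generation with distillation from 2D priors yields high controllability and geometric consistency in 3D scene generation, offering a new approach to large-scale scene data generation.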

Abstract: Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large physically-grounded outdoor scenes from captured sensor data. However, these methods have baked-in static environments and only allow for limited scene control – they are functionally constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, however, at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling the prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts – producing realistic and geometrically consistent 3D generations of complex driving scenes.

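[55] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation cs.CV

Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang

TL;DR: OmniHuman-1.5 proposes a new framework that instills an active mind in avatars via cognitive simulation, generating semantically coherent and expressive animations.

Details

Motivation: Existing video avatar models only generate animations driven by low-level cues such as audio rhythm and lack a high-level semantic understanding of emotion, intent, or context.

Result: The model achieves leading performance in lip-sync accuracy, video quality, motion naturalness, and semantic consistency, and extends to complex scenarios involving multi-person or non-human subjects.

Insight: Combining multimodal semantic guidance with motion generation can substantially improve the expressiveness and semantic coherence of avatar animation.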

Abstract: Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character’s authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/

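[56] Autoregressive Universal Video Segmentation Model cs.CV

Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu

TL;DR: AUSM is a unified video segmentation model that casts both prompted and unprompted video segmentation as autoregressive mask prediction, enabling efficient parallel training and strong performance.

Details

Motivation: Video segmentation is currently fragmented across task-specific models and pipelines, and unprompted video segmentation in particular lacks a unified solution.

Result: AUSM outperforms prior universal streaming video segmentation methods on several standard benchmarks and achieves up to 2.5x faster training on 16-frame sequences.

Insight: Modeling video segmentation as sequence prediction, analogous to language modeling, is an effective and scalable unified framework suited to complex streaming video scenarios.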

Abstract: Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today’s landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.

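[57] Articulate3D: Zero-Shot Text-Driven 3D Object Posing cs.CV

Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht

TL;DR: Articulate3D is a training-free, zero-shot method that poses 3D assets through language control, using an image generator and multi-view pose optimisation.

Details

Motivation: Despite advances in vision and language models, posing 3D objects through language control remains challenging. The method addresses this problem.

Result: Experiments show that the method is effective across diverse 3D objects and free-form text prompts; in a user study it was preferred over existing methods more than 85% of the time.

Insight: Differentiable rendering is an unreliable signal for pose optimisation, and keypoint matching works better; the self-attention mechanism helps maintain structural consistency.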

Abstract: We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/

cs.CL

Maojia Song, Tej Deep Pala, Weisheng Jin, Amir Zadeh, Chuan Li

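TL;DR: The paper examines the social interaction abilities of large language models (LLMs) in multi-agent systems. It introduces the KAIROS benchmark, studies trust formation, resistance to misinformation, and integration of peer input, and evaluates several mitigation strategies.

Details

Motivation: To analyze how LLMs behave in multi-agent social interaction, especially regarding trust, resistance to interference, and group decision-making, as a basis for improving collective intelligence.

Result: GRPO with multi-agent context achieves the best overall performance but reduces the model's robustness to social influence.

Insight: LLMs still have limitations in multi-agent interaction; their social adaptability and robustness need further improvement.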

Abstract: Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. As for mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning, Group Relative Policy Optimisation (GRPO), across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also decreases the robustness to social influence compared to Base models. The code and datasets are available at: https://github.com/declare-lab/KAIROS.

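[59] Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models cs.CL

Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li

TL;DR: The paper proposes PLAST, an efficient multilingual enhancement method that improves the multilingual capabilities of large vision-language models (LVLMs) by precisely fine-tuning language-specific layers, substantially reducing the number of tuned parameters.

Details

Motivation: LVLMs exhibit an imbalance in multilingual capabilities. The paper analyzes their multilingual working pattern to find an efficient way to improve multilingual understanding.

Result: Experiments on MM-Bench and MMMB show that PLAST improves the multilingual capabilities of LVLMs while tuning only 14% of the parameters.

Insight: Language-specific visual information engagement is concentrated in shallow layers, and PLAST generalizes to low-resource and complex visual reasoning tasks.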

Abstract: Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.

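[60] Integral Transformer: Denoising Attention, Not Too Much Not Too Little cs.CL

Ivan Kobyzev, Abbas Ghaddar, Dingtao Hu, Boxing Chen

TL;DR: This paper proposes the Integral Transformer, a new self-attention mechanism that integrates signals sampled from the logit distribution to reduce attention noise while preserving key information from special tokens, outperforming existing methods.

Details

Motivation: Standard softmax self-attention assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation (attention noise). Existing methods such as Cog Attention and the Differential Transformer reduce this noise by introducing negative attention scores but may discard useful information.

Result: On several knowledge and reasoning language benchmarks, the Integral Transformer outperforms vanilla, Cog, and Differential attention variants and reduces rank collapse of attention distributions in upper layers.

Insight: Using vanilla self-attention in the lower layers improves performance, while the Integral Transformer balances attention distributions in the upper layers, reducing noise without losing information.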

Abstract: Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.

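[61] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning cs.CL | cs.AI

Jeong-seok Oh, Jay-yoon Lee

TL;DR: LSC selects the most semantically consistent response using learnable token embeddings. It applies to both short-form and long-form reasoning tasks, adds very little computational overhead, and outperforms other methods across task types.

Details

Motivation: LLMs often produce inconsistent outputs on complex or long-form reasoning tasks. Existing methods mitigate this but struggle to handle short-form and long-form tasks at the same time.

Result: Across 6 short-form and 5 long-form benchmarks, LSC outperforms the other methods on average and provides well-calibrated confidence estimates.

Insight: LSC shows that semantic consistency selection is an effective remedy for inconsistent model outputs while remaining computationally efficient.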

Abstract: Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC and WUCS on all short-form and long-form ones on average, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
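
The Self-Consistency (SC) baseline described in the abstract, majority voting over exact answer strings, can be sketched as follows. This is a minimal illustration of the baseline only, not of LSC; the `normalize` helper and the sample answers are assumptions introduced here.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Canonicalize a short answer so trivially different strings match (hypothetical helper)."""
    return " ".join(answer.strip().lower().split())


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and its vote share, a rough confidence signal."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Five sampled answers to the same question (made-up data).
samples = ["42", " 42 ", "41", "42", "forty-two"]
best, share = self_consistency(samples)
print(best, share)
```

Note that "forty-two" does not count toward "42": exact-string voting cannot merge paraphrases, which is why it does not carry over to long-form answers and why the methods in the abstract compare responses semantically instead.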

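[62] How Reliable are LLMs for Reasoning on the Re-ranking task? cs.CL | cs.AI

Nafis Tanveer Islam, Zhiming Zhao

TL;DR: The paper examines the reliability of large language models (LLMs) on the re-ranking task, analyzes how different training methods affect the models' semantic understanding, and studies whether the models can generate more explanatory textual reasoning to improve transparency and cope with limited data.

Details

Motivation: As LLMs' semantic understanding improves, their alignment with human values increases but transparency decreases. In new systems with limited data, accurate re-ranking remains a major challenge. The authors analyze the effect of different training methods to explore whether LLMs can provide more reliable explanations and reasoning for re-ranking.

Result: Some training methods show better explainability, but not all of them yield accurate semantic understanding; some only acquire abstract knowledge that optimizes the evaluation, which raises doubts about the true reliability of LLMs.

Insight: The reliability of LLMs in re-ranking depends not only on performance but also on semantic understanding and explainability, which underlines the importance of choosing an appropriate training method when transparency matters and data is limited.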

Abstract: As the semantic understanding capability of Large Language Models (LLMs) improves, they exhibit greater awareness of and alignment with human values, but this comes at the cost of transparency. Although experimental analysis yields promising results, an in-depth understanding of an LLM’s internal workings is needed to comprehend the reasoning behind a re-ranking and to give end users an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While different training methods affect how LLMs are trained and how they perform inference, our analysis finds that some training methods exhibit better explainability than others, implying that not all training methods lead to an accurate semantic understanding; instead, some acquire abstract knowledge that optimizes the evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of LLM transparency and limited training data. To analyze LLMs on re-ranking tasks, we use a relatively small ranking dataset from the environment and Earth science domain to re-rank retrieved content. Furthermore, we analyze the explanatory information to see whether the re-ranking can be justified through explainability.

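[63] Integrating gender inclusivity into large language models via instruction tuning cs.CL

Alina Wróblewska, Bartosz Żuk

TL;DR: The study addresses gender bias in Polish large language models through instruction tuning, using the IPIS dataset and a gender-inclusive system prompt to improve model outputs.

Details

Motivation: For historical and linguistic reasons, masculine forms are widely used in Polish, so language models inherit a gender bias. The study aims to mitigate this by technical means.

Result: Experiments indicate that the tuned models markedly reduce gender bias and generate fairer language output.

Insight: Gender bias in language models can be addressed through instruction tuning and data-driven methods, providing a reference for technical interventions in other languages.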

Abstract: Imagine a language with masculine, feminine, and neuter grammatical genders, yet, due to historical and political conventions, masculine forms are predominantly used to refer to men, women and mixed-gender groups. This is the reality of contemporary Polish. A social consequence of this unfair linguistic system is that large language models (LLMs) trained on Polish texts inherit and reinforce this masculine bias, generating gender-imbalanced outputs. This study addresses this issue by tuning LLMs using the IPIS dataset, a collection of human-crafted gender-inclusive proofreading in Polish and Polish-to-English translation instructions. Grounded in a theoretical linguistic framework, we design a system prompt with explicit gender-inclusive guidelines for Polish. In our experiments, we IPIS-tune multilingual LLMs (Llama-8B, Mistral-7B and Mistral-Nemo) and Polish-specific LLMs (Bielik and PLLuM). Our approach aims to integrate gender inclusivity as an inherent feature of these models, offering a systematic solution to mitigate gender bias in Polish language generation.

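[64] Thinking Before You Speak: A Proactive Test-time Scaling Approach cs.CL

Cong Li, Wenchang Chai, Hejun Wu, Yan Pan, Pengxu Wei

TL;DR: The paper proposes TBYS, a reasoning framework that proactively generates 'insights' to compensate for LLM weaknesses in complex reasoning tasks, and validates it on mathematical datasets.

Details

Motivation: LLMs perform poorly on complex reasoning tasks such as mathematics because human reasoning patterns differ from the patterns in training data. Humans usually think through complex problems internally, but this thinking is not reflected in the training data.

Result: TBYS proves effective on challenging mathematical datasets.

Insight: Proactively generated insights better mimic human reasoning and compensate for LLM shortcomings on complex tasks.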

Abstract: Large Language Models (LLMs) often exhibit deficiencies with complex reasoning tasks, such as maths, which we attribute to the discrepancy between human reasoning patterns and those presented in the LLMs’ training data. When dealing with complex problems, humans tend to think carefully before expressing solutions. However, they often do not articulate their inner thoughts, including their intentions and chosen methodologies. Consequently, critical insights essential for bridging reasoning steps may be absent in training data collected from human sources. To bridge this gap, we propose inserting insights between consecutive reasoning steps, which review the status and initiate the next reasoning steps. Unlike prior prompting strategies that rely on a single static prompt or a workflow of static prompts to facilitate reasoning, insights are proactively generated to guide reasoning processes. We implement our idea as a reasoning framework, named Thinking Before You Speak (TBYS), and design a pipeline for automatically collecting and filtering in-context examples for the generation of insights, which alleviates human labeling efforts and fine-tuning overheads. Experiments on challenging mathematical datasets verify the effectiveness of TBYS. Project website: https://gitee.com/jswrt/TBYS

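[65] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum cs.CL | cs.AI | cs.MM

Xinglong Yang, Quan Feng, Zhongying Pan, Xiang Chen, Yu Tian

TL;DR: The paper proposes a difficulty-balanced prompt curriculum design method that combines model-perceived difficulty with intrinsic sample complexity to improve Multimodal Chain-of-Thought (MCoT) prompting.

Details

Motivation: Existing MCoT prompting methods usually rely on randomly or manually selected examples and ignore the model's knowledge distribution and the intrinsic complexity of the task, which leads to unstable performance. Inspired by the principle of 'tailored teaching', the authors propose a prompt curriculum design guided by model capability.

Result: On five challenging benchmarks and multiple MLLMs, the method substantially improves performance and reduces the performance fluctuations caused by random sampling.

Insight: The choice of prompt examples is critical to MCoT performance; combining model capability with task complexity improves multimodal reasoning more effectively.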

Abstract: The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of “tailored teaching with balanced difficulty”. We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model’s current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
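
One way to picture selecting examples that are "diverse across both dimensions" is a simple grid-stratified sampler, sketched below. The abstract does not specify the sampling procedure, so the binning scheme, the round-robin draw, and the assumption that both scores lie in [0, 1] are illustrative choices made here, not the paper's algorithm.

```python
import random
from collections import defaultdict


def balanced_sample(examples, k, bins=2, seed=0):
    """Pick k examples spread over a grid of (model difficulty, intrinsic complexity).

    examples: list of (example_id, model_difficulty, intrinsic_complexity),
    with both scores assumed to lie in [0, 1].
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for ex in examples:
        _, difficulty, complexity = ex
        # Map each score to a bin index; a score of exactly 1.0 goes in the last bin.
        cell = (min(int(difficulty * bins), bins - 1),
                min(int(complexity * bins), bins - 1))
        cells[cell].append(ex)
    for members in cells.values():
        rng.shuffle(members)

    picked = []
    keys = sorted(cells)
    # Round-robin over the occupied cells so that no region of the grid dominates.
    while len(picked) < k and any(cells[key] for key in keys):
        for key in keys:
            if cells[key] and len(picked) < k:
                picked.append(cells[key].pop())
    return picked


pool = [("a", 0.1, 0.1), ("b", 0.2, 0.2), ("c", 0.1, 0.9),
        ("d", 0.9, 0.1), ("e", 0.9, 0.9)]
chosen = balanced_sample(pool, k=4)
print([example_id for example_id, _, _ in chosen])
```

With four examples requested and four occupied cells, the sampler returns one example per cell, so easy and hard items on both axes are all represented.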

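[66] Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning cs.CL

Songtao Jiang, Yuxi Chen, Sibo Song, Yan Zhang, Yeying Jin

TL;DR: Medical visual question answering (VQA) models behave inconsistently across different phrasings of the same question. The paper proposes the RoMed dataset and Consistency and Contrastive Learning (CCL) to improve robustness.

Details

Motivation: Current medical vision-language models (Med-VLMs) give inconsistent answers to semantically equivalent rephrasings, which undermines reliable diagnosis; this robustness problem needs to be solved.

Result: CCL achieves SOTA performance on three VQA benchmarks and improves answer consistency on the RoMed test set by 50%.

Insight: 1. Robust medical VQA requires aligning models with medical knowledge and removing data bias; 2. Multi-level perturbation datasets help assess a model's true performance.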

Abstract: In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.

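[67] Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System cs.CL

Yanfan Du, Jun Zhang, Bin Wang, Jin Qiu, Lu Huang

TL;DR: Attention2Probability is a lightweight, flexible, and accurate attention-driven terminology probability estimation method for improving terminology recognition accuracy in speech-to-text systems.

Details

Motivation: Current speech large language models (SLMs) perform well in general domains but still struggle with domain-specific terms and neologisms.

Result: The method significantly outperforms the VectorDB method on the test set, with maximum recall of 92.57% for Chinese and 86.83% for English and a latency of only 8.71 ms per query. Terminology intervention improves SLM terminology accuracy by 6-17%.

Insight: Current SLMs are limited in how they use terminology; combining attention mechanisms with curriculum learning can improve terminology recognition.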

Abstract: Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text system, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71ms per query. Intervening in SLMs’ recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.

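[68] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs cs.CL

Duy Le, Kent Ziti, Evan Girard-Sun, Sean O’Brien, Vasu Sharma

TL;DR: The paper proposes Adaptive Originality Filtering (AOF), a prompting framework that improves multilingual riddle generation by filtering redundant content during prompting and enforcing lexical novelty and cross-lingual fidelity.

Details

Motivation: Existing LLMs tend to rely on memorized riddles or shallow paraphrasing when generating multilingual riddles, and lack cultural fluency and creativity.

Result: AOF-enhanced GPT-4o achieves a Self-BLEU of 0.177 and a Distinct-2 of 0.915 in Japanese, indicating lower redundancy and higher lexical diversity.

Insight: Semantic filtering can guide a model toward more culturally grounded and creative content without task-specific fine-tuning.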

Abstract: Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies (zero-shot, few-shot, chain-of-thought) tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
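
The two ideas named in the abstract, cosine-based similarity rejection and the Distinct-2 diversity metric, can be sketched as below. This is a simplified stand-in: it compares bag-of-words count vectors, whereas the paper's filter presumably works on learned embeddings, and the 0.8 threshold and example riddles are assumptions made here.

```python
import math
from collections import Counter


def distinct_n(texts, n=2):
    """Distinct-n: number of unique n-grams divided by the total number of n-grams."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_original(candidates, references, threshold=0.8):
    """Keep candidates that stay below the threshold against all references and kept items."""
    seen = [Counter(r.lower().split()) for r in references]
    kept = []
    for candidate in candidates:
        vector = Counter(candidate.lower().split())
        if all(cosine(vector, other) < threshold for other in seen):
            kept.append(candidate)
            seen.append(vector)  # later candidates must also differ from this one
    return kept


known = ["what has keys but cannot open locks"]
generated = [
    "what has keys but cannot open locks",            # verbatim copy
    "what has keys but cannot open doors",            # shallow paraphrase
    "i speak without a mouth and hear without ears",  # unrelated riddle
]
kept = filter_original(generated, known)
print(kept)
print(distinct_n(["a b c", "a b d"], n=2))
```

In this toy run the verbatim copy and the one-word paraphrase are both rejected, and only the unrelated riddle survives; a real system would regenerate after each rejection rather than simply dropping candidates.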

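[69] Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models cs.CL | cs.LG

Chang Wang, Siyu Yan, Depeng Yuan, Yuqi Chen, Yanhua Huang

TL;DR: The paper proposes DIVER, an LLM-based framework that uses multi-stage multi-objective optimization (SFT and RL) to address insufficient quality and diversity in ad headline generation. It performs well on real-world industrial datasets and improves advertiser value and click-through rate.

Details

Motivation: Existing ad headline generation methods focus on quality or click-through rate and neglect diversity, producing homogeneous outputs that fail to engage diverse audiences. The paper aims to address this.

Result: DIVER balances quality and diversity on real-world industrial datasets; in online deployment, ADVV improves by 4.0% and CTR by 1.4%.

Insight: Diversity is a key factor in ad headline effectiveness; combining multi-objective optimization with LLMs addresses it effectively.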

Abstract: The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%.

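[70] M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations cs.CL | cs.AI

Qiao Liang, Ying Shen, Tiantian Chen, Lin Zhang

TL;DR: The paper proposes M3HG, a model for emotion cause triplet extraction in multimodal conversations, and releases MECAD, the first multimodal, multi-scenario dataset for the task. Experiments show that M3HG outperforms existing methods.

Details

Motivation: Datasets for the MECTEC task are scarce and uniform, and existing methods neither explicitly model emotional and causal contexts nor effectively fuse semantic information at different levels, which limits performance.

Result: Experiments show that M3HG significantly outperforms existing methods on the MECAD dataset.

Insight: 1. Multimodal, multi-scenario datasets are critical to task performance; 2. Explicitly modeling emotional and causal contexts and fusing multi-level information significantly improves model performance.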

Abstract: Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The codes and dataset are available at https://github.com/redifinition/M3HG.

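[71] Chronological Passage Assembling in RAG framework for Temporal Question Answering cs.CL

Byeongjeong Kim, Jeonghyun Park, Joonho Yang, Hwanhee Lee

TL;DR: The paper proposes ChronoRAG, a RAG framework designed for narrative texts that consolidates dispersed document information into coherent, structured passages and explicitly captures and maintains temporal order to improve question answering.

Details

Motivation: Existing RAG methods are of limited effectiveness on narrative texts, because understanding narratives requires broader context and coherent temporal order rather than isolated passages.

Result: Experiments show that ChronoRAG performs well on narrative QA, especially on tasks involving complex temporal relationships.

Insight: Reasoning over temporal order is crucial for narrative QA, and explicitly modeling it significantly improves performance.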

Abstract: Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual flow in a limited context window. Retrieval-augmented generation (RAG) indexing methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages, and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA dataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.
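
The idea of keeping retrieved passages in temporal order can be illustrated with a minimal Python sketch. This is not ChronoRAG's implementation: the `Passage` type, its `position` and `score` fields, and the `assemble_chronologically` helper are hypothetical names, and using a passage's position in the source text as a proxy for event order is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    position: int  # hypothetical: index of the passage within the source narrative
    score: float   # hypothetical: retrieval relevance score
    text: str


def assemble_chronologically(passages, top_k=3):
    """Keep the top_k most relevant passages, then restore narrative order."""
    top = sorted(passages, key=lambda p: p.score, reverse=True)[:top_k]
    return sorted(top, key=lambda p: p.position)


# Illustrative data, not taken from NarrativeQA.
retrieved = [
    Passage(42, 0.91, "She finally opens the letter."),
    Passage(7, 0.88, "A letter arrives at the house."),
    Passage(19, 0.35, "The weather turns cold."),
    Passage(23, 0.80, "She hides the letter in a drawer."),
]
context = "\n".join(p.text for p in assemble_chronologically(retrieved))
```

Ranking by relevance alone would place the ending of the story first. Re-sorting the selected passages by position gives the reader model the events in the order they occur.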

[72] ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models cs.CL

Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, Jiangjie Chen

TL;DR: ThinkDial is an open-source framework that enables controllable reasoning effort in large language models (LLMs) through discrete operational modes (High, Medium, Low), balancing performance and computational cost.

Details

Motivation: LLMs with chain-of-thought reasoning lack practical control over computational effort, hindering deployment. Proprietary systems offer such control, but open-source solutions lag behind.

Result: ThinkDial reduces tokens by 50% (Medium) and 75% (Low) with minimal performance drops (<10% and <15%, respectively) and generalizes well to out-of-distribution tasks.

Insight: Open-source frameworks can achieve proprietary-level reasoning control through integrated training paradigms, enabling practical deployment of LLMs with adaptive computational effort.

Abstract: Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI’s gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.

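[73] Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction cs.CL | cs.AI

Yilin Li, Xunjian Yin, Yilin Chen, Xiaojun Wan

TL;DR: The paper proposes a new framework based on rule-based reinforcement learning to improve large language models on grammatical error correction, achieving state-of-the-art performance on Chinese datasets compared with traditional methods.

Details

Motivation: Traditional encoder-decoder methods have had some success, but they do not fully exploit the reasoning ability of large language models in grammatical error correction. Existing work mainly uses supervised fine-tuning to generate the corrected sentence directly, which limits what the model can do.

Result: Experiments show that the framework achieves state-of-the-art performance on Chinese grammatical error correction, with a notable gain in recall.

Insight: Using reinforcement learning to steer large language models offers a more controllable and reliable approach and a new research paradigm for future work on grammatical error correction.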

Abstract: Grammatical error correction is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the model’s powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on the Chinese datasets, our Rule-Based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.

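[74] Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness cs.CL

Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu

TL;DR: The paper proposes a program-assisted synthesis framework that systematically generates high-quality mathematical reasoning data to improve the mathematical reasoning of large language models.

Details

Motivation: Conventional methods for producing high-quality mathematical reasoning data face challenges in scalability, cost, and data reliability, so a more efficient and reliable approach is needed.

Result: The authors generated 12.3 million problem-solving triples. Experiments show that models fine-tuned on this data reach state-of-the-art performance on several benchmark datasets.

Insight: Programmatic generation combined with strict validation can produce large-scale, high-quality mathematical reasoning data efficiently and reliably, and significantly improves model performance.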

Abstract: Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.

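[75] ConfTuner: Training Large Language Models to Express Their Confidence Verbally cs.CL | cs.AI

Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi

TL;DR: ConfTuner is a simple and efficient fine-tuning method that introduces a new loss function, the tokenized Brier score, to improve how large language models (LLMs) express confidence. It mitigates overconfidence and yields better calibration on reasoning tasks.

Details

Motivation: Deploying LLMs in high-stakes domains such as science, law, and healthcare requires accurate expressions of confidence for reliability and trust. Current LLMs are overconfident, and existing methods have limited effectiveness and generalizability, so a more effective method is needed.

Result: ConfTuner significantly improves LLM confidence calibration, improves downstream performance in self-correction and model cascades, and generalizes to black-box models.

Insight: Improving LLM confidence expression with a theoretically grounded loss function is an effective route to more reliable and trustworthy models and may advance trustworthy LLM systems in high-stakes domains.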

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as “overconfidence”. Recent efforts have focused on calibrating LLMs’ verbalized confidence: i.e., their expressions of confidence in text form, such as “I am 80% confident that…”. Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it “correctly incentivizes the model to report its true probability of being correct”. ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
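
The notion of a proper scoring rule can be made concrete with a small Python sketch. It uses the classical Brier score, not the paper's tokenized variant, whose exact definition is not given in the abstract. The function names and the candidate grid are assumptions made for illustration.

```python
def brier_score(confidence, correct):
    """Squared error between a stated confidence in [0, 1] and the 0/1 outcome."""
    return (confidence - float(correct)) ** 2


def expected_brier(reported, true_prob):
    """Expected Brier score when the answer is correct with probability true_prob."""
    return (true_prob * brier_score(reported, True)
            + (1 - true_prob) * brier_score(reported, False))


# Assume the model is actually right 70% of the time on this kind of question.
true_prob = 0.7
candidates = [i / 10 for i in range(11)]  # reported confidences 0.0, 0.1, ..., 1.0
best = min(candidates, key=lambda q: expected_brier(q, true_prob))
```

The expected score equals `true_prob * (1 - true_prob) + (reported - true_prob) ** 2`, so it is smallest when the reported confidence equals the true probability of being correct. That is the sense in which such a loss rewards honest confidence reports rather than inflated ones.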

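[76] ReflectivePrompt: Reflective evolution in autoprompting algorithms cs.CL | cs.AI | cs.LG

Viktor N. Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin

TL;DR: ReflectivePrompt is an autoprompting method based on evolutionary algorithms. It uses reflective evolution to search for prompts more precisely and comprehensively, and it significantly outperforms existing methods.

Details

Motivation: As prompt engineering advances rapidly, demand for automatically selecting optimized prompts is growing, and traditional methods are limited in the precision and comprehensiveness of their prompt search.

Result: Tested on 33 datasets, the method improves metrics on average over current state-of-the-art approaches (for example, by 28% on BBH compared with EvoPrompt).

Insight: Reflective evolution captures and reuses knowledge accumulated during the evolutionary process, offering a new direction for automatic prompt optimization.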

Abstract: Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt, a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.

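[77] Empowering Computing Education Researchers Through LLM-Assisted Content Analysis cs.CL

Laurie Gale, Sebastian Mateos Nicolajsen

TL;DR: The paper proposes LLM-assisted content analysis (LACA), a content analysis method that incorporates large language models (LLMs), to help education researchers analyze large volumes of textual data efficiently and to make computing education research (CER) larger in scale and more rigorous.

Details

Motivation: Computing education researchers often lack the resources or capacity to conduct generalizable or rigorous research. The paper addresses this with a method that increases the scale and quality of research without adding to the researcher's burden.

Result: LACA shows potential for CER: it can support more generalizable findings and higher research quality.

Insight: LLMs can give education researchers an efficient tool for working around resource constraints, helping to advance both the practice and the research quality of the discipline.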

Abstract: Computing education research (CER) is often instigated by practitioners wanting to improve both their own and the wider discipline’s teaching practice. However, the latter is often difficult as many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. As a result, research methods that enable sense-making with larger volumes of qualitative data, while not increasing the burden on the researcher, have significant potential within CER. In this discussion paper, we propose such a method for conducting rigorous analysis on large volumes of textual data, namely a variation of LLM-assisted content analysis (LACA). This method combines content analysis with the use of large language models, empowering researchers to conduct larger-scale research which they would otherwise not be able to perform. Using a computing education dataset, we illustrate how LACA could be applied in a reproducible and rigorous manner. We believe this method has potential in CER, enabling more generalisable findings from a wider range of research. This, together with the development of similar methods, can help to advance both the practice and research quality of the CER discipline.

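[78] Affective Polarization across European Parliaments cs.CL | cs.SI

Bojan Evkoski, Igor Mozetič, Nikola Ljubešić, Petra Kralj Novak

TL;DR: The study uses natural language processing to analyze speeches from the parliaments of six European countries. It finds that affective polarization is pervasive and that reciprocity is one of the mechanisms driving it.

Details

Motivation: Affective polarization, meaning negativity and hostility toward opposing groups, has become increasingly prominent in political discourse worldwide. The authors use automated methods to examine whether it is present in European parliaments.

Result: Members of all six parliaments exhibit affective polarization. Its level does not differ between less active and more active members, and reciprocity is a contributing mechanism.

Insight: The study highlights how widespread affective polarization is in European parliaments and points to the role of reciprocity in political antagonism, offering a new perspective on the negative effects of political discourse.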

Abstract: Affective polarization, characterized by increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. Our study examines the presence of this type of polarization in a selection of European parliaments in a fully automated manner. Utilizing a comprehensive corpus of parliamentary speeches from the parliaments of six European countries, we employ natural language processing techniques to estimate parliamentarian sentiment. By comparing the levels of negativity conveyed in references to individuals from opposing groups versus one’s own, we discover patterns of affectively polarized interactions. The findings demonstrate the existence of consistent affective polarization across all six European parliaments. Although activity correlates with negativity, there is no observed difference in affective polarization between less active and more active members of parliament. Finally, we show that reciprocity is a contributing mechanism in affective polarization between parliamentarians across all six parliaments.
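
The comparison of negativity toward opposing groups versus one's own can be sketched in a few lines of Python. This is a toy illustration, not the authors' pipeline: the `affective_polarization` helper, the tuple layout, and the negativity scores are all hypothetical, and taking a simple difference of means is an assumption about how such a gap might be summarized.

```python
from statistics import mean


def affective_polarization(mentions):
    """Out-group minus in-group mean negativity.

    mentions: list of (speaker_group, target_group, negativity in [0, 1]).
    Assumes at least one in-group and one out-group mention.
    """
    in_group = [neg for speaker, target, neg in mentions if speaker == target]
    out_group = [neg for speaker, target, neg in mentions if speaker != target]
    return mean(out_group) - mean(in_group)


# Illustrative scores only; real values would come from a sentiment model.
mentions = [
    ("A", "A", 0.1), ("A", "B", 0.6),
    ("B", "B", 0.2), ("B", "A", 0.7),
]
gap = affective_polarization(mentions)
```

A positive gap means that references to members of other groups are, on average, more negative than references to members of the speaker's own group.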

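[79] Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models cs.CL | cs.AI | cs.LG

Hung Ming Liu

TL;DR: The paper presents a framework in which neural models develop an “AI Mother Tongue”, a native symbolic language that supports intuitive reasoning, compositional symbol chains, and inherent interpretability at once. The approach embeds reasoning in the model's representations: symbols capture semantic patterns, chains trace decision paths, and gated induction mechanisms make the reasoning transparent yet flexible.

Details

Motivation: Post-hoc explanation methods cannot provide inherent interpretability or symbolic reasoning in neural models. The paper therefore embeds symbolic reasoning directly in the model's representations to improve both transparency and flexibility.

Result: Experiments show competitive accuracy on AI tasks together with verifiable reasoning traces, indicating that the approach can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

Insight: Neural models can achieve inherent interpretability through a native symbolic language without relying on post-hoc explanation. Symbolic and intuitive reasoning can work together in one framework, improving the model's transparency and versatility.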

Abstract: We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model’s representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

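[80] MovieCORE: COgnitive REasoning in Movies cs.CL

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su

TL;DR: The paper introduces MovieCORE, a dataset focused on deeper cognitive understanding of movie content. It generates high-quality question-answer pairs with multiple LLM agents and introduces the ACE module to improve model reasoning.

Details

Motivation: Current video question answering datasets focus on surface-level comprehension and do not assess deeper cognitive understanding of movies. MovieCORE fills this gap and aims to advance AI's deep understanding of film.

Result: The quality of MovieCORE is validated with a set of cognitive tests, and the ACE module improves model reasoning by up to 25%, showing its potential for deep movie understanding tasks.

Insight: 1. A multi-agent LLM approach can efficiently generate high-quality questions; 2. Deeper cognitive questions expose the limitations of VQA models more effectively; 3. The ACE module offers a new way to improve model reasoning.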

Abstract: This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.

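[81] Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs cs.CL

Zhikai Ding, Shiyu Ni, Keping Bi

TL;DR: The paper systematically studies how well large vision-language models (LVLMs) perceive their own knowledge boundaries. It evaluates three confidence signals and proposes improved methods, and it finds that jointly processing visual and textual inputs lowers question-answering performance but improves perception accuracy.

Details

Motivation: LVLMs perform well on visual question answering (VQA) but hallucinate. Understanding how they perceive their knowledge boundaries is key to making them more reliable.

Result: Experiments show that LVLMs have a reasonable perception of their knowledge boundaries, with substantial room for improvement. Probabilistic and consistency-based signals are more reliable, while verbalized confidence tends toward overconfidence. Joint processing lowers performance but improves perception.

Insight: Jointly processing visual and textual inputs may benefit a model's perception of its knowledge boundaries, but the balance between performance and perception needs further work.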

Abstract: Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries, knowing what it knows and what it does not. This paper investigates LVLMs’ perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs’ perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
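
Of the three confidence signals, answer consistency is the easiest to illustrate. The Python sketch below shows one common way to compute it: sample several answers to the same question and take the share that agree with the majority. It is a generic illustration, not the paper's exact procedure. The `consistency_confidence` helper and the normalization by lowercasing and stripping whitespace are assumptions.

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples that agree with it.

    Expects a non-empty list of answer strings sampled for one question.
    """
    normalized = [answer.strip().lower() for answer in sampled_answers]
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


# Hypothetical samples for a single visual question.
samples = ["Paris", "paris", "Lyon", "Paris ", "Paris"]
answer, confidence = consistency_confidence(samples)
```

In this example four of the five samples agree after normalization, so the confidence is 0.8. A model that gives a different answer on every sample would receive a confidence of 1 divided by the number of samples, which signals that the question may lie outside its knowledge boundary.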

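[82] Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning cs.CL

Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan

TL;DR: The paper introduces the scientific reasoning benchmarks SciReas and SciReas-Pro and the KRUX probing framework, and analyzes the roles of knowledge and reasoning in LLMs. It finds that retrieving relevant knowledge is a bottleneck, that external knowledge improves reasoning, and that explicit reasoning helps surface knowledge.

Details

Motivation: Scientific problem solving requires deep domain knowledge and complex reasoning, but there is no comprehensive benchmark and little work that systematically separates the roles of knowledge and reasoning.

Result: Knowledge retrieval is a bottleneck, external knowledge strengthens reasoning, and explicit reasoning helps surface knowledge.

Insight: Scientific reasoning tasks need both internal and external knowledge, and designing for explicit reasoning is a key lever for improvement.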

Abstract: Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs’ ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.

cs.AI

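[83] RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing cs.AI | cs.CL

Jianxing Liao, Tian Zhang, Xiao Feng, Yusong Zhang, Rui Yang

TL;DR: The paper proposes RLMR, a reinforcement learning method that uses a dynamically mixed reward system to balance subjective writing quality against objective constraint following in creative writing, enabling multi-dimensional optimization.

Details

Motivation: Creative writing must balance subjective writing quality (such as literariness and emotional expression) with objective constraint following (such as format requirements and word limits). Existing reinforcement learning methods struggle to optimize both at once.

Result: Both automated and manual evaluations show consistent gains in instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate on WriteEval).

Insight: The key innovation is dynamic adjustment of the reward weight: the constraint-following weight adapts to the writing quality within each sampled group so that constraint-violating samples are penalized, which balances subjective and objective requirements more effectively during training.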

Abstract: Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints receive a negative advantage in GRPO and are thus penalized during training, which is the key innovation of the proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.

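[84] Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap cs.AI | cs.CL

Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao

TL;DR: The paper proposes a new evaluation paradigm that views LLM evaluation through the lens of human intelligence, dividing it into three dimensions (IQ, EQ, and PQ), and designs a Value-oriented Evaluation (VQ) framework to bridge the gap between benchmarks and real-world use.

Details

Motivation: Current LLM evaluation frameworks are fragmented and prioritize technical metrics over holistic assessment for deployment, so benchmark performance is disconnected from real-world utility.

Result: The survey offers actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound, and maintains a repository of open-source evaluation resources.

Insight: LLM evaluation needs to look beyond technical metrics to multi-dimensional practical value and ethical impact. Dynamic assessment and interpretability are directions for future research.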

Abstract: For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.

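[85] CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks cs.AI | cs.CL

Sunguk Choi, Yonghoon Kwon, Heondeuk Lee

TL;DR: The paper proposes CAC-CoT, a method that restricts reasoning to a small, fixed set of connector phrases to synthesize compact chain-of-thought data efficiently for dual-system cognitive tasks.

Details

Motivation: Long chain-of-thought (CoT) prompting helps LLMs solve difficult problems, but overly long reasoning traces degrade performance on fast, intuitive “System-1” tasks. A more efficient, compact reasoning method is therefore needed.

Result: CAC-CoT reaches about 85% accuracy on GSM8K and about 40% on GPQA (System-2 tasks) while retaining about 90% on S1-Bench (System-1). Its reasoning traces average about 300 tokens, roughly one-third the length of baseline traces.

Insight: Compact reasoning traces can improve efficiency substantially without losing accuracy, which suits dual-system settings that must handle both “System-1” and “System-2” tasks.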

Abstract: Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs) solve difficult problems, but very long traces often slow or even degrade performance on fast, intuitive “System-1” tasks. We introduce Connector-Aware Compact CoT (CAC-CoT), a method that deliberately restricts reasoning to a small, fixed set of connector phrases, steering the model toward concise and well-structured explanations. Despite its simplicity, our synthetic method with Gemini-2.0-Flash yields high-quality training data. CAC-CoT achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while retaining approximately 90% on S1-Bench (System-1). Its reasoning traces average approximately 300 tokens (ART), about one-third the length of baseline traces, delivering higher efficiency without loss of accuracy.

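[86] Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models cs.AI | cs.CL

Yi Liu, Xiangyu Liu, Zequn Sun, Wei Hu

TL;DR: Large reasoning models (LRMs) perform well on complex reasoning tasks but often fail to abstain appropriately when a question is unanswerable. The paper analyzes this behavior and proposes a lightweight two-stage method to mitigate it.

Details

Motivation: When faced with unanswerable questions, such as math problems lacking sufficient conditions, LRMs often fail to abstain, which undermines their trustworthiness. The paper aims to analyze and resolve this problem.

Result: Experiments show that the method significantly improves the abstention rate while maintaining overall reasoning performance.

Insight: The response behavior of LRMs is misaligned with their internal cognition, and a lightweight intervention can correct the behavior. This suggests a new way to make models more trustworthy.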

Abstract: Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the overall reasoning performance.

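[87] Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark cs.AI | cs.CL

Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei

TL;DR: The paper proposes Experience-driven Lifelong Learning (ELL), a framework for building agents that evolve continuously through interaction with dynamic environments, and introduces StuLife, a benchmark dataset that simulates a student's college journey.

Details

Motivation: As AI advances toward general intelligence, the research focus is shifting from systems optimized for static tasks to open-ended agents that learn continuously. The paper aims to advance agents' capacity for self-evolution through an experience-driven lifelong learning framework.

Result: StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. The paper also explores the role of context engineering in advancing AGI.

Insight: The paper argues that, through experience-driven learning and interaction with dynamic environments, agents can gradually internalize knowledge and develop intuitive capabilities, moving closer to general intelligence.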

Abstract: As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as “second nature”. We also introduce StuLife, a benchmark dataset for ELL that simulates a student’s holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm shifts: From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables. StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. Beyond evaluating SOTA LLMs on the StuLife benchmark, we also explore the role of context engineering in advancing AGI.

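[88] StepWiser: Stepwise Generative Judges for Wiser Reasoning cs.AI | cs.CL

Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang

TL;DR: The paper proposes StepWiser, a generative judge that uses meta-reasoning to supervise the logical validity of intermediate steps in multi-step reasoning. It outperforms existing methods and improves model performance at both training and inference time.

Details

Motivation: Existing methods provide limited supervision of intermediate steps in multi-step reasoning: classifier-style reward models offer no explanations and rely on static datasets, which limits generalization.

Result: Experiments show that StepWiser outperforms existing methods in judgment accuracy on intermediate steps, in improving the policy model at training time, and in inference-time search.

Insight: Treating reward modeling as a reasoning task lets a generative approach provide more transparent supervision signals while improving model performance and generalization.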

Abstract: As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model’s reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.

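[89] Stabilizing Open-Set Test-Time Adaptation via Primary-Auxiliary Filtering and Knowledge-Integrated Prediction cs.AI | cs.CV

Byung-Joon Lee, Jin-Seop Lee, Jee-Hyong Lee

TL;DR: The paper proposes two new methods, Primary-Auxiliary Filtering (PAF) and Knowledge-Integrated Prediction (KIP), to address instability and error accumulation in open-set test-time adaptation (OSTTA).

Details

Motivation: Real-world test data often exhibit domain shift, and open-set data further degrade closed-set accuracy. Existing methods rely on the source model to filter open-set data, which works poorly, while the adapting model is unstable on noisy test data and accumulates errors.

Result: Experiments show that the method outperforms existing methods across diverse closed-set and open-set datasets, improving both closed-set accuracy and open-set discrimination.

Insight: The adapting model is unstable on noisy test data, but integrating knowledge from other models can improve the stability and accuracy of open-set test-time adaptation.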

Abstract: Deep neural networks demonstrate strong performance under aligned training-test distributions. However, real-world test data often exhibit domain shifts. Test-Time Adaptation (TTA) addresses this challenge by adapting the model to test data during inference. While most TTA studies assume that the training and test data share the same class set (closed-set TTA), real-world scenarios often involve open-set data (open-set TTA), which can degrade closed-set accuracy. A recent study showed that identifying open-set data during adaptation and maximizing its entropy is an effective solution. However, the previous method relies on the source model for filtering, resulting in suboptimal filtering accuracy on domain-shifted test data. In contrast, we found that the adapting model, which learns domain knowledge from noisy test streams, tends to be unstable and leads to error accumulation when used for filtering. To address this problem, we propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. Furthermore, we propose Knowledge-Integrated Prediction (KIP), which calibrates the outputs of the adapting model, EMA model, and source model to integrate their complementary knowledge for OSTTA. We validate our approach across diverse closed-set and open-set datasets. Our method enhances both closed-set accuracy and open-set discrimination over existing methods. The code is available at https://github.com/powerpowe/PAF-KIP-OSTTA.

cs.CR [Back]

[90] A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs cs.CR | cs.AI | cs.CL | cs.SEPDF

Anders Mølmen Høst, Pierre Lison, Leon Moonen

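TL;DR: The paper presents TRIAGE, which uses large language models (LLMs) to map CVEs to techniques in the ATT&CK knowledge base, combining rule-based reasoning with data-driven inference to make vulnerability impact prediction more efficient and accurate.

Details

Motivation: Vulnerability databases such as the NVD describe CVEs in detail but lack information on their real-world impact, such as attackers' TTPs. Manual mapping is time-consuming and inefficient, so automated support is needed.

Result: In-context learning outperforms the individual mapping methods, and the hybrid approach improves recall of exploitation techniques. GPT-4o-mini performs better than Llama3.3-70B.

Insight: LLMs can be used to predict vulnerability impact automatically, and combining rule-based and data-driven methods can significantly improve the efficiency and accuracy of the mapping task.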

Abstract: Vulnerability databases, such as the National Vulnerability Database (NVD), offer detailed descriptions of Common Vulnerabilities and Exposures (CVEs), but often lack information on their real-world impact, such as the tactics, techniques, and procedures (TTPs) that adversaries may use to exploit the vulnerability. However, manually linking CVEs to their corresponding TTPs is a challenging and time-consuming task, and the high volume of new vulnerabilities published annually makes automated support desirable. This paper introduces TRIAGE, a two-pronged automated approach that uses Large Language Models (LLMs) to map CVEs to relevant techniques from the ATT&CK knowledge base. We first prompt an LLM with instructions based on MITRE’s CVE Mapping Methodology to predict an initial list of techniques. This list is then combined with the results from a second LLM-based module that uses in-context learning to map a CVE to relevant techniques. This hybrid approach strategically combines rule-based reasoning with data-driven inference. Our evaluation reveals that in-context learning outperforms the individual mapping methods, and the hybrid approach improves recall of exploitation techniques. We also find that GPT-4o-mini performs better than Llama3.3-70B on this task. Overall, our results show that LLMs can be used to automatically predict the impact of cybersecurity vulnerabilities and TRIAGE makes the process of mapping CVEs to ATT&CK more efficient. Keywords: vulnerability impact, CVE, ATT&CK techniques, large language models, automated mapping.

[91] The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization cs.CR | cs.CLPDF

Stephen Meisenbacher, Alexandra Klymenko, Andreea-Elena Bodea, Florian Matthes

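TL;DR: The paper examines the double-edged role of LLM-based data reconstruction in differentially private text sanitization: it can exploit contextual vulnerability to attack privacy, yet, applied with an adversarial mindset, it can also strengthen privacy protection.

Details

Motivation: Differentially private (DP) text sanitization offers theoretical privacy guarantees, but in practice it leaves contextual vulnerabilities that LLMs can exploit. The paper studies how LLMs exploit this weakness and proposes possible mitigations.

Result: Experiments show that LLMs can exploit contextual vulnerability to infer the semantics of the original text, but can also be used to improve the quality and privacy of sanitized text.

Insight: LLMs are a double-edged sword for privacy protection, and their capabilities need to be used with care; adversarial thinking (for example, LLM-based post-processing) may be a new direction for privacy protection.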

Abstract: Differentially private text sanitization refers to the process of privatizing texts under the framework of Differential Privacy (DP), providing provable privacy guarantees while also empirically defending against adversaries seeking to harm privacy. Despite their simplicity, DP text sanitization methods operating at the word level exhibit a number of shortcomings, among them the tendency to leave contextual clues from the original texts due to randomization during sanitization; we refer to this as contextual vulnerability. Given the powerful contextual understanding and inference capabilities of Large Language Models (LLMs), we explore to what extent LLMs can be leveraged to exploit the contextual vulnerability of DP-sanitized texts. We expand on previous work not only in the use of advanced LLMs, but also in testing a broader range of sanitization mechanisms at various privacy levels. Our experiments uncover a double-edged sword effect of LLM-based data reconstruction attacks on privacy and utility: while LLMs can indeed infer original semantics and sometimes degrade empirical privacy protections, they can also be used for good, to improve the quality and privacy of DP-sanitized texts. Based on our findings, we propose recommendations for using LLM data reconstruction as a post-processing step, serving to increase privacy protection by thinking adversarially.

[92] Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models cs.CR | cs.CVPDF

Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang

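TL;DR: The paper proposes Hidden Tail, a stealthy resource consumption attack on vision-language models (VLMs) that crafts adversarial images which make the model emit special tokens (rather than irrelevant content), lengthening inference while remaining stealthy.

Details

Motivation: The high inference cost of VLMs makes them vulnerable to resource consumption attacks, but existing attacks lack stealth because they generate irrelevant content. The paper targets this trade-off between stealth and effectiveness.

Result: Experiments show that Hidden Tail increases output length by up to 19.2×, reaching the maximum token limit while preserving stealth, and outperforms existing attacks.

Insight: The study stresses the urgency of making VLMs robust to efficiency-oriented adversarial threats and demonstrates the potential threat posed by stealthy resource consumption attacks.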

Abstract: Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose Hidden Tail, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that Hidden Tail outperforms existing attacks, increasing output length by up to 19.2× and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.

eess.IV [Back]

[93] Analise de Desaprendizado de Maquina em Modelos de Classificacao de Imagens Medicas eess.IV | cs.AI | cs.CVPDF

Andreza M. C. Falcao, Filipe R. Cordeiro

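TL;DR: The paper evaluates SalUn for machine unlearning in medical image classification models; experiments show performance close to full retraining, and the impact of data augmentation is analysed.

Details

Motivation: Despite progress in machine unlearning, its application to medical image classification has not been explored, and the privacy and sensitivity of medical data make it an important research direction.

Result: SalUn performs close to full retraining, indicating an efficient solution for medical applications.

Insight: Data augmentation can further improve unlearning quality, which may offer additional technical routes to privacy protection.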

Abstract: Machine unlearning aims to remove private or sensitive data from a pre-trained model while preserving the model’s robustness. Despite recent advances, this technique has not been explored in medical image classification. This work evaluates the SalUn unlearning model by conducting experiments on the PathMNIST, OrganAMNIST, and BloodMNIST datasets. We also analyse the impact of data augmentation on the quality of unlearning. Results show that SalUn achieves performance close to full retraining, indicating an efficient solution for use in medical applications.

[94] A Closer Look at Edema Area Segmentation in SD-OCT Images Using Adversarial Framework eess.IV | cs.CVPDF

Yuhui Tao, Yizhe Zhang, Qiang Chen

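TL;DR: The paper proposes an adversarial framework that combines retinal-layer-structure-guided post-processing with a test-time adaptation strategy to improve weakly supervised segmentation of edema areas in SD-OCT images.

Details

Motivation: Current anomaly-detection-based weakly supervised methods lag behind fully supervised ones on edema area segmentation, while retinal layer structure is strongly correlated with edema areas. The paper exploits these properties to improve segmentation.

Result: Experiments on two public datasets show that the method significantly improves the accuracy and robustness of edema area segmentation.

Insight: Introducing domain knowledge (retinal layer structure) and a dynamic adaptation strategy (TTA) can effectively improve weakly supervised models on medical image segmentation.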

Abstract: The development of artificial intelligence models for macular edema (ME) analysis always relies on expert-annotated pixel-level image datasets which are expensive to collect prospectively. While anomaly-detection-based weakly-supervised methods have shown promise in edema area (EA) segmentation task, their performance still lags behind fully-supervised approaches. In this paper, we leverage the strong correlation between EA and retinal layers in spectral-domain optical coherence tomography (SD-OCT) images, along with the update characteristics of weakly-supervised learning, to enhance an off-the-shelf adversarial framework for EA segmentation with a novel layer-structure-guided post-processing step and a test-time-adaptation (TTA) strategy. By incorporating additional retinal layer information, our framework reframes the dense EA prediction task as one of confirming intersection points between the EA contour and retinal layers, resulting in predictions that better align with the shape prior of EA. Besides, the TTA framework further helps address discrepancies in the manifestations and presentations of EA between training and test sets. Extensive experiments on two publicly available datasets demonstrate that these two proposed ingredients can improve the accuracy and robustness of EA segmentation, bridging the gap between weakly-supervised and fully-supervised models.

[95] Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data eess.IV | cs.CVPDF

Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner

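TL;DR: The paper provides the first unified benchmark for segmentation of undersampled MRI data, comparing 7 approaches with a focus on one-stage methods (joint reconstruction and segmentation) versus two-stage methods (reconstruction followed by segmentation), and finds that simple two-stage methods perform best.

Details

Motivation: MRI acquisition is slow and costly, and undersampling is used to accelerate it, but most methods have not been compared directly and unified evaluation standards are lacking. The study aims to fill this gap and identify the best segmentation strategy.

Result: Simple two-stage methods that account for data consistency achieve the best segmentation scores, surpassing complex specialized methods developed for this task.

Insight: The study highlights the importance of data consistency when segmenting undersampled MRI data and offers practical guidance for future method design.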

Abstract: MR imaging is a valuable diagnostic tool allowing to non-invasively visualize patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly, when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing one-stage approaches, that combine reconstruction and segmentation into a unified model, with two-stage approaches, that utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.
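
The "data-consistency" idea mentioned in the abstract can be illustrated with a short Python sketch. It assumes the simplest possible setting: a single coil, Cartesian sampling, and a hard replacement of the estimated k-space values by the measured ones wherever a sample was acquired. The function name `data_consistency` is hypothetical, and the benchmarked methods operate on multi-coil data and may enforce consistency differently.

```python
import numpy as np

def data_consistency(image, measured_kspace, mask):
    """Force an image estimate to agree with the acquired k-space samples.

    image:           2D (complex or real) image estimate
    measured_kspace: 2D array of acquired k-space values (any value where mask is False)
    mask:            2D boolean array, True where a k-space sample was acquired
    """
    kspace = np.fft.fft2(image)
    # Keep the estimate only where nothing was measured.
    kspace = np.where(mask, measured_kspace, kspace)
    return np.fft.ifft2(kspace)

rng = np.random.default_rng(0)
truth = rng.standard_normal((8, 8))
measured = np.fft.fft2(truth)
mask = rng.random((8, 8)) < 0.5          # roughly 2x undersampling
estimate = np.zeros((8, 8))              # stand-in for a network output
corrected = data_consistency(estimate, measured * mask, mask)
```

After this step the corrected image reproduces every acquired sample exactly, whatever the initial estimate was; only the unmeasured k-space locations still depend on the reconstruction model.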

[96] RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration eess.IV | cs.AI | cs.CVPDF

Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo

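TL;DR: The paper proposes RDDM, which restores images directly in the RAW domain to overcome the limitations of the sRGB domain; with a RAW-domain VAE and a differentiable Post Tone Processing module, it achieves higher-fidelity results.

Details

Motivation: Existing sRGB-domain diffusion models face a trade-off between fidelity and realism and overlook the availability of RAW data. RDDM processes images directly in the RAW domain, avoiding the problems of the conventional two-stage pipeline.

Result: Experiments show that RDDM outperforms existing sRGB diffusion methods, producing higher-fidelity images with fewer artifacts.

Insight: Processing directly in the RAW domain makes fuller use of sensor data and avoids sRGB-domain losses, offering a new approach to image restoration.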

Abstract: We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation. These models process lossy sRGB inputs and neglect the accessibility of the sensor RAW images in many scenarios, e.g., image and video capture on edge devices, resulting in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + IR pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain runs into out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) learning optimal latent representations, (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the deficiency in the dataset, we develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB, BGGR, etc. Extensive experiments demonstrate RDDM’s superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts.

cs.RO [Back]

[97] Enhancing Video-Based Robot Failure Detection Using Task Knowledge cs.RO | cs.CVPDF

Santosh Thoduka, Sebastian Houben, Juergen Gall, Paul G. Plöger

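TL;DR: The paper proposes a video-based robot failure detection method that combines task knowledge with spatio-temporal information and improves failure detection performance.

Details

Motivation: Robust robotic task execution depends on reliable failure detection, but existing methods perform poorly in complex real-world scenarios.

Result: On the ARMBench dataset, the F1 score improves from 77.9 to 80.0 at no additional computational cost, and further to 81.4 with test-time augmentation.

Insight: Spatio-temporal information is crucial for failure detection; future work could explore further suitable heuristics.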

Abstract: Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real-world scenarios. In this paper, we propose a video-based failure detection approach that uses spatio-temporal knowledge in the form of the actions the robot performs and task-relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets that we amend, in part, with additional annotations of the aforementioned task-relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense and an additional increase to 81.4 with test-time augmentation. The results emphasize the importance of spatio-temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.
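
The augmentation described in the abstract, applying variable frame rates to different parts of a video, can be sketched as index selection with a different stride per segment. The sketch below assumes exactly two segments separated at a given frame and integer strides; the function name `variable_rate_indices` and its parameters are hypothetical, and the paper's actual segmentation and rate choices are not specified in the abstract.

```python
import numpy as np

def variable_rate_indices(num_frames, split, stride_first, stride_second):
    """Frame indices that sample [0, split) and [split, num_frames) at different rates."""
    first = np.arange(0, split, stride_first)
    second = np.arange(split, num_frames, stride_second)
    return np.concatenate([first, second])

# Keep every frame of the first 4 frames, then every second frame afterwards.
indices = variable_rate_indices(num_frames=10, split=4, stride_first=1, stride_second=2)
# A clip stored as an array of shape (frames, height, width, channels)
# could then be resampled with clip[indices].
```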

Shreya Gummadi, Mateus V. Gasparino, Gianluca Capezzuto, Marcelo Becker, Girish Chowdhary

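TL;DR: ZeST uses the visual reasoning capabilities of large language models (LLMs) to perform zero-shot traversability navigation in unknown environments, avoiding the risks of conventional data collection and offering a safe, efficient navigation solution.

Details

Motivation: Conventional methods for generating traversability prediction datasets require placing robots in potentially hazardous environments, which is risky. ZeST avoids exposing robots to danger by using LLMs for safe, fast, real-time navigation.

Result: Experiments in indoor and unstructured outdoor environments show that ZeST navigates more safely than other state-of-the-art methods and reliably reaches the goal.

Insight: The visual reasoning capabilities of LLMs can address traversability in robot navigation efficiently, suggesting a new direction for autonomous navigation systems.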

Abstract: The advancement of robotics and autonomous navigation systems hinges on the ability to accurately predict terrain traversability. Traditional methods for generating datasets to train these prediction models often involve putting robots into potentially hazardous environments, posing risks to equipment and safety. To solve this problem, we present ZeST, a novel approach leveraging visual reasoning capabilities of Large Language Models (LLMs) to create a traversability map in real-time without exposing robots to danger. Our approach not only performs zero-shot traversability and mitigates the risks associated with real-world data collection but also accelerates the development of advanced navigation systems, offering a cost-effective and scalable solution. To support our findings, we present navigation results, in both controlled indoor and unstructured outdoor environments. As shown in the experiments, our method provides safer navigation when compared to other state-of-the-art methods, constantly reaching the final goal.

[99] MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation cs.RO | cs.CVPDF

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu

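TL;DR: MemoryVLA proposes a vision-language-action framework with perceptual-cognitive memory to handle long-horizon, temporally dependent robotic manipulation tasks, significantly outperforming existing methods.

Details

Motivation: Existing VLA models overlook temporal context and cannot handle long-horizon dependencies, whereas humans achieve efficient short- and long-term memory through working memory and the hippocampal system, which inspired MemoryVLA's design.

Result: Strong performance in simulation and real-world tasks: a +14.6 gain on Bridge, an 84.0% success rate on real-world tasks, and a +26 improvement on long-horizon tasks.

Insight: Mimicking human memory mechanisms can effectively improve a robot's ability to handle complex tasks, especially in long-horizon settings.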

Abstract: Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA

physics.optics [Back]

[100] Designing across domains with declarative thinking: Insights from the 96-Eyes ptychographic imager project physics.optics | cs.CLPDF

Antony C Chan

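TL;DR: Using the 96-Eyes project (a 96-camera parallel multi-modal imaging system for high-throughput drug discovery) as a case study, the article discusses the application and benefits of a declarative, fifth-generation problem formulation language (5GL) in cross-domain imaging system design.

Details

Motivation: In interdisciplinary and cross-functional collaboration, conventional imperative programming languages (3GL) can lead to inconsistent designs and poor communication, whereas 5GL can improve transparency and traceability through machine-readable problem statements.

Result: 5GL enhances design transparency, ensures traceability, and reduces costly misalignment across teams.

Insight: Declarative problem formulation can facilitate innovation, especially in concurrent R&D workflows, whereas conventional imperative languages are better suited to sequential, phase-driven R&D environments. Programming paradigms implicitly shape research workflows through existing domain hierarchies.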

Abstract: This article presents a practitioner’s reflection on applying declarative, 5th generation, problem formulation language (5GL) to de novo imaging system design, informed by experiences across the interdisciplinary research in academia and cross-functional product development within the private sector. Using the 96-Eyes project: 96-camera parallel multi-modal imager for high-throughput drug discovery as a representative case, I illustrate how project requirements, ranging from hardware constraints to life sciences needs, can be formalized into machine-readable problem statements to preserve mission-critical input from diverse domain stakeholders. This declarative approach enhances transparency, ensures design traceability, and minimizes costly misalignment across optical, algorithmic, hardware-accelerated compute, and life sciences teams. Alongside the technical discussion of 5GL with real-world code examples, I reflect on the practical barriers to adopting 5GL in environments where imperative, 3rd-generation languages (3GL) remain the default medium for inter-team collaboration. Rather than offering a one-size-fits-all solution, these lessons learned highlight how programming paradigms implicitly shape research workflows through existing domain hierarchies. The discussion aims to invite further explorations into how declarative problem formulations can facilitate innovation in settings where concurrent R&D workflows are gaining traction, as opposed to environments where sequential, phase-driven workflows remain the norm.

q-bio.NC [Back]

[101] Time Series Analysis of Spiking Neural Systems via Transfer Entropy and Directed Persistent Homology q-bio.NC | cs.CVPDF

Dylan Peek, Siddharth Pritam, Matthew P. Skerritt, Stephan Chalup

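TL;DR: The paper proposes a topological framework that combines Transfer Entropy (TE) with directed Persistent Homology (PH) to analyse neural time series and characterize information flow in spiking neural systems.

Details

Motivation: The goal is a general method that captures directed information flow in neural systems and maps it onto global organizational patterns, applicable to both artificial and biological neural networks.

Result: Across synthetic spiking networks, image-classification networks, and mouse cortical recordings, the method distinguishes task complexity, stimulus structure, and behavioral regime, and shows that higher-dimensional features become more prominent under complex or noisy conditions.

Insight: Higher-dimensional topological features reflect interaction patterns beyond pairwise connectivity, offering a new perspective on the global organization of neural systems.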

Abstract: We present a topological framework for analysing neural time series that integrates Transfer Entropy (TE) with directed Persistent Homology (PH) to characterize information flow in spiking neural systems. TE quantifies directional influence between neurons, producing weighted, directed graphs that reflect dynamic interactions. These graphs are then analyzed using PH, enabling assessment of topological complexity across multiple structural scales and dimensions. We apply this TE+PH pipeline to synthetic spiking networks trained on logic gate tasks, image-classification networks exposed to structured and perturbed inputs, and mouse cortical recordings annotated with behavioral events. Across all settings, the resulting topological signatures reveal distinctions in task complexity, stimulus structure, and behavioral regime. Higher-dimensional features become more prominent in complex or noisy conditions, reflecting interaction patterns that extend beyond pairwise connectivity. Our findings offer a principled approach to mapping directed information flow onto global organizational patterns in both artificial and biological neural systems. The framework is generalizable and interpretable, making it well suited for neural systems with time-resolved and binary spiking data.
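
The first stage of the pipeline, estimating directional influence between two neurons from binary spike trains, can be sketched in Python with a plug-in estimate of transfer entropy, $TE_{X \to Y} = \sum p(y_{t+1}, y_t, x_t) \log_2 \frac{p(y_{t+1} \mid y_t, x_t)}{p(y_{t+1} \mid y_t)}$. The sketch assumes a history length of one time step and simple frequency counts; the function name `transfer_entropy` is hypothetical, and the paper's estimator, history length, and binning may differ.

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in transfer entropy X -> Y in bits, with history length 1."""
    x = [int(v) for v in x]
    y = [int(v) for v in y]
    n = len(y) - 1
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))   # (y_next, y_now, x_now)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))         # (y_now, x_now)
    pairs_yy = Counter(zip(y[1:], y[:-1]))          # (y_next, y_now)
    singles_y = Counter(y[:-1])                     # y_now
    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y_now, x_now)]
        p_given_self = pairs_yy[(y_next, y_now)] / singles_y[y_now]
        te += p_joint * np.log2(p_given_both / p_given_self)
    return te

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=5000)   # random binary spike train
y = np.roll(x, 1)                   # y copies x with a one-step delay
y[0] = 0
te_xy = transfer_entropy(x, y)      # x drives y: close to 1 bit
te_yx = transfer_entropy(y, x)      # y does not drive x: close to 0 bits
```

Evaluating this quantity for every ordered pair of neurons gives an asymmetric weight matrix, which is the kind of weighted, directed graph the abstract describes passing on to persistent homology.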

cs.GR [Back]

[102] SemLayoutDiff: Semantic Layout Generation with Diffusion Model for Indoor Scene Synthesis cs.GR | cs.CVPDF

Xiaohao Sun, Divyam Goel, Angle X. Chang

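TL;DR: SemLayoutDiff is a diffusion-based semantic layout generation method for diverse 3D indoor scene synthesis; it combines a semantic map with object attributes and uses explicit conditioning to respect architectural constraints.

Details

Motivation: Existing indoor scene generation methods struggle to model architectural constraints (such as doors and windows) explicitly, so generated scenes may be impractical or incoherent. SemLayoutDiff addresses this with a diffusion model and explicit conditioning of the semantic layout.

Result: Experiments on the 3D-FRONT dataset show that SemLayoutDiff outperforms existing methods in spatial coherence, realism, and diversity of the generated scenes.

Insight: Combining diffusion models with explicit conditioning handles architectural constraints better and yields more practical indoor scenes.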

Abstract: We present SemLayoutDiff, a unified model for synthesizing diverse 3D indoor scenes across multiple room types. The model introduces a scene layout representation combining a top-down semantic map and attributes for each object. Unlike prior approaches, which cannot condition on architectural constraints, SemLayoutDiff employs a categorical diffusion model capable of conditioning scene synthesis explicitly on room masks. It first generates a coherent semantic map, followed by a cross-attention-based network to predict furniture placements that respect the synthesized layout. Our method also accounts for architectural elements such as doors and windows, ensuring that generated furniture arrangements remain practical and unobstructed. Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces spatially coherent, realistic, and varied scenes, outperforming previous methods.

[103] PanoHair: Detailed Hair Strand Synthesis on Volumetric Heads cs.GR | cs.CVPDF

Shashikant Verma, Shanmuganathan Raman

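TL;DR: PanoHair synthesizes high-fidelity hair strand geometry via knowledge distillation from a pre-trained generative model, markedly improving generation speed and diversity.

Details

Motivation: Existing methods need complex multi-view data acquisition and long processing times, which limits efficiency and applicability. PanoHair aims to simplify this pipeline and generate high-fidelity hair strands quickly.

Result: Experiments show that PanoHair generates a clean manifold mesh in under 5 seconds, outperforming existing methods, with notable gains in visual quality and efficiency.

Insight: Combining knowledge distillation with generative models gives an efficient and flexible approach to hair synthesis, and latent-space manipulation enables diverse generation.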

Abstract: Achieving realistic hair strand synthesis is essential for creating lifelike digital humans, but producing high-fidelity hair strand geometry remains a significant challenge. Existing methods require a complex setup for data acquisition, involving multi-view images captured in constrained studio environments. Additionally, these methods have longer hair volume estimation and strand synthesis times, which hinder efficiency. We introduce PanoHair, a model that estimates head geometry as signed distance fields using knowledge distillation from a pre-trained generative teacher model for head synthesis. Our approach enables the prediction of semantic segmentation masks and 3D orientations specifically for the hair region of the estimated geometry. Our method is generative and can generate diverse hairstyles with latent space manipulations. For real images, our approach involves an inversion process to infer latent codes and produces visually appealing hair strands, offering a streamlined alternative to complex multi-view data acquisition setups. Given the latent code, PanoHair generates a clean manifold mesh for the hair region in under 5 seconds, along with semantic and orientation maps, marking a significant improvement over existing methods, as demonstrated in our experiments.

cs.LG [Back]

[104] Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks cs.LG | cs.AI | cs.CLPDF

Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara

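TL;DR: The paper studies how sparsity in MoE language models affects memorization and reasoning tasks, finding that reasoning performance saturates and can even regress as sparsity increases, while memorization keeps improving with total parameters.

Details

Motivation: Research on large language models (LLMs) has focused mainly on dense models, and the sparsity dimension of MoE models is underexplored, particularly how it affects different capabilities (memorization vs. reasoning).

Result: Memorization performance correlates positively with total parameter count, whereas reasoning performance saturates and can even regress as sparsity increases; changing top-$k$ routing alone has little effect.

Insight: Gains on reasoning tasks may be limited by model sparsity; classic hyperparameters (such as learning rate) affect the generalization gap in the same direction as sparsity.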

Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.

eess.SP [Back]

[105] EMind: A Foundation Model for Multi-task Electromagnetic Signals Understanding eess.SP | cs.AI | cs.CVPDF

Luqing Luo, Wenjin Gui, Yunfei Liu, Ziyue Zhang, Yunxi Zhang

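TL;DR: EMind is a multi-task foundation model for electromagnetic signals; it tackles their high heterogeneity, strong background noise, and complex time-frequency structure, and achieves cross-task generalization and efficient transfer through large-scale pretraining and a unified dataset.

Details

Motivation: Electromagnetic signals differ greatly from text and images, so existing general models cannot be applied directly; task diversity leaves current methods with poor cross-task generalization, and the lack of large, high-quality datasets hinders the development of multi-task learning frameworks.

Result: Experiments show that EMind performs strongly across multiple downstream tasks, moving from task-specific models to a unified framework.

Insight: Data preprocessing and training strategies driven by the physical properties of the signals can significantly improve the generalization and efficiency of electromagnetic signal models.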

Abstract: Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images, showing high heterogeneity, strong background noise and complex joint time-frequency structure, which prevents existing general models from direct use. Electromagnetic communication and sensing tasks are diverse, current methods lack cross-task generalization and transfer efficiency, and the scarcity of large high-quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signals foundation model that bridges large-scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task-specific models to a unified framework for electromagnetic intelligence. The code is available at: https://github.com/GabrielleTse/EMind.

quant-ph [Back]

[106] Quantum-Circuit-Based Visual Fractal Image Generation in Qiskit and Analytics quant-ph | cs.CVPDF

Hillol Biswas

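TL;DR: The paper explores generating Julia set fractal images with quantum circuits, drawing on superposition, randomness, and entanglement, and points to a new research direction for quantum generative art.

Details

Motivation: Fractal phenomena in nature resemble interference patterns in quantum systems; the paper uses quantum computing to explore fractal image generation and open a new direction for quantum generative art.

Result: The paper reports that quantum circuits can generate complex Julia set fractal images, offering a new technical route to quantum generative art.

Insight: Properties of quantum computing such as superposition and entanglement can add complexity and randomness to fractal image generation, suggesting new ideas at the intersection of art and science.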

Abstract: As nature is ascribed as quantum, fractals also have an intriguing appearance that is found in many micro and macro observable entities or phenomena. Fractals show self-similarity across sizes; structures that resemble the whole are revealed when zoomed in. In quantum systems, the probability density or wavefunction may exhibit recurring interference patterns at various energy or length scales. Fractals are produced by basic iterative rules (such as Mandelbrot or Julia sets), and they provide limitless complexity. Despite its simplicity, the Schrödinger equation in quantum mechanics produces incredibly intricate patterns of interference and entanglement, particularly in chaotic quantum systems. Quantum computing is rooted in the principles of quantum-mechanical phenomena; what outcomes can be expected when it is applied to fractal image generation? The paper outlines the generation of a Julia set dataset using an approach coupled with building a quantum circuit, highlighting the concepts of superposition, randomness, and entanglement as foundational elements to manipulate the generated dataset patterns. As quantum computing is finding many application areas, the possibility of using quantum circuits for fractal Julia image generation posits a unique direction of future research where it can be applied to quantum generative arts across various ecosystems with a customised approach, such as producing an exciting landscape based on a quantum art theme.
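
The "basic iterative rule" behind a Julia set can be shown with a short Python sketch of the classical escape-time iteration $z \leftarrow z^2 + c$. This is only the classical rule the abstract alludes to, not the paper's quantum-circuit pipeline, whose construction in Qiskit is not described in enough detail here to reproduce. The function name `julia_escape_counts`, the grid size, and the escape radius of 2 (valid for $|c| \le 2$) are assumptions made for illustration.

```python
import numpy as np

def julia_escape_counts(c, size=101, extent=1.5, max_iter=50):
    """Number of iterations each grid point survives under z <- z**2 + c."""
    axis = np.linspace(-extent, extent, size)
    z = axis[np.newaxis, :] + 1j * axis[:, np.newaxis]
    counts = np.zeros(z.shape, dtype=int)
    alive = np.ones(z.shape, dtype=bool)          # points that have not escaped yet
    for _ in range(max_iter):
        z[alive] = z[alive] ** 2 + c              # iterate only surviving points
        escaped = alive & (np.abs(z) > 2.0)
        alive &= ~escaped
        counts[alive] += 1
    return counts

# With c = 0 the filled Julia set is the unit disk: the centre never escapes,
# while the far corner escapes on the first iteration.
disk = julia_escape_counts(0.0)
image = julia_escape_counts(-0.8 + 0.156j)        # a commonly plotted parameter value
```

Rendering `image` as a heat map gives the familiar fractal picture; in the paper's setting the parameters controlling such an image would come from quantum circuit outcomes rather than being fixed by hand.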

cs.HC [Back]

[107] Impact of Target and Tool Visualization on Depth Perception and Usability in Optical See-Through AR cs.HC | cs.CV | cs.GRPDF

Yue Yang, Xue Xie, Xinkai Wang, Hui Zhang, Chiming Yu

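TL;DR: The paper studies how target and tool visualization affect depth perception and system usability in optical see-through augmented reality (OST-AR), finding that opaque target rendering and real-time tool occlusion significantly improve accuracy and user experience.

Details

Motivation: OST-AR devices such as HoloLens 2 hold promise for arm's-length tasks such as surgery, but depth perception and tool occlusion remain unresolved.

Result: Opaque target rendering and real-time tool occlusion significantly improve depth perception and task accuracy, whereas a highly transparent target degrades them.

Insight: Correct occlusion cues and target opacity are critical for depth perception in OST-AR; designs should prioritize tool tracking and occlusion handling.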

Abstract: Optical see-through augmented reality (OST-AR) systems like Microsoft HoloLens 2 hold promise for arm’s distance guidance (e.g., surgery), but depth perception of the hologram and occlusion of real instruments remain challenging. We present an evaluation of how visualizing the target object with different transparencies and visualizing a tracked tool (virtual proxy vs. real tool vs. no tool tracking) affects depth perception and system usability. Ten participants performed two experiments on HoloLens 2. In Experiment 1, we compared high-transparency vs. low-transparency target rendering in a depth matching task at arm’s length. In Experiment 2, participants performed a simulated surgical pinpoint task on a frontal bone target under six visualization conditions ($2 \times 3$: two target transparencies and three tool visualization modes: virtual tool hologram, real tool, or no tool tracking). We collected data on depth matching error, target localization error, system usability, task workload, and qualitative feedback. Results show that a more opaque target yields significantly lower depth estimation error than a highly transparent target at arm’s distance. Moreover, showing the real tool (occluding the virtual target) led to the highest accuracy and usability with the lowest workload, while not tracking the tool yielded the worst performance and user ratings. However, making the target highly transparent, while allowing the real tool to remain visible, slightly impaired depth cues and did not improve usability. Our findings underscore that correct occlusion cues, rendering virtual content opaque and occluding it with real tools in real time, are critical for depth perception and precision in OST-AR. Designers of arm-distance AR systems should prioritize robust tool tracking and occlusion handling; if unavailable, cautiously use transparency to balance depth perception and tool visibility.

Table of Contents

cs.CV

[1] Towards Training-Free Underwater 3D Object Detection from Sonar Point Clouds: A Comparison of Traditional and Deep Learning Approaches cs.CV | cs.AI | cs.LG | cs.RO

[2] MobileDenseAttn: A Dual-Stream Architecture for Accurate and Interpretable Brain Tumor Detection cs.CV | cs.AI

[3] Can VLMs Recall Factual Associations From Visual References? cs.CV | cs.AI | cs.CL

[4] SERES: Semantic-aware neural reconstruction from sparse views cs.CV

[5] Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning cs.CV | cs.AI | 68T10 | I.2.4

[6] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses cs.CV

[7] Securing Face and Fingerprint Templates in Humanitarian Biometric Systems cs.CV | cs.CR

[8] Why Relational Graphs Will Save the Next Generation of Vision Foundation Models? cs.CV

[9] LPLC: A Dataset for License Plate Legibility Classification cs.CV

[10] CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering cs.CV | cs.AI

[11] VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results cs.CV

[12] Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling cs.CV | cs.LG

[13] DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance cs.CV

[14] Wan-S2V: Audio-Driven Cinematic Video Generation cs.CV

[15] Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction cs.CV

[16] ROSE: Remove Objects with Side Effects in Videos cs.CV | cs.AI | cs.LG

[17] OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward cs.CV

[18] Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos cs.CV

[19] Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency cs.CV

[20] A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition cs.CV

[21] ColorGS: High-fidelity Surgical Scene Reconstruction with Colored Gaussian Splatting cs.CV

[22] Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion cs.CV | cs.AI | cs.MM | eess.AS | eess.SP

[23] Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods cs.CV

[24] Beyond the Textual: Generating Coherent Visual Options for MCQs cs.CV | cs.CL

[25] Design, Implementation and Evaluation of a Real-Time Remote Photoplethysmography (rPPG) Acquisition System for Non-Invasive Vital Sign Monitoring cs.CV

[26] PseudoMapTrainer: Learning Online Mapping without HD Maps cs.CV | cs.LG | cs.RO

[27] Robust and Label-Efficient Deep Waste Detection cs.CV

[28] Embedding Font Impression Word Tags Based on Co-occurrence cs.CV

[29] Deep Pre-trained Time Series Features for Tree Species Classification in the Dutch Forest Inventory cs.CV

[30] Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression cs.CV

[31] Quantitative Outcome-Oriented Assessment of Microsurgical Anastomosis cs.CV

[32] Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization cs.CV

[33] Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models cs.CV

[34] DQEN: Dual Query Enhancement Network for DETR-based HOI Detection cs.CV

[35] Interpretable Decision-Making for End-to-End Autonomous Driving cs.CV | cs.AI | cs.LG | cs.RO

[36] Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025 cs.CV

[37] Preliminary Study on Space Utilization and Emergent Behaviors of Group vs. Single Pedestrians in Real-World Trajectories cs.CV | stat.AP

[38] The point is the mask: scaling coral reef segmentation with weak supervision cs.CV | cs.AI

[39] Enhancing compact convolutional transformers with super attention cs.CV | cs.LG

[40] Can we make NeRF-based visual localization privacy-preserving? cs.CV

[41] Enhancing Document VQA Models via Retrieval-Augmented Generation cs.CV

[42] Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone cs.CV

[43] RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation cs.CV | cs.AI

[44] MicroDetect-Net (MDN): Leveraging Deep Learning to Detect Microplastics in Clam Blood, a Step Towards Human Blood Analysis cs.CV

[45] ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval cs.CV

[46] GReAT: leveraging geometric artery data to improve wall shear stress assessment cs.CV | cs.LG

[47] No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes cs.CV | cs.AI

[48] VibES: Induced Vibration for Persistent Event-Based Sensing cs.CV | cs.RO

[49] Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents cs.CV | cs.AI | cs.LG

[50] Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding cs.CV | Information systems~Multimedia and multimodal retrieval

[51] Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions cs.CV

[52] SoccerNet 2025 Challenges Results cs.CV

[53] All-in-One Slider for Attribute Manipulation in Diffusion Models cs.CV

[54] LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding cs.CV | cs.AI | cs.GR

[55] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation cs.CV

[56] Autoregressive Universal Video Segmentation Model cs.CV

[57] Articulate3D: Zero-Shot Text-Driven 3D Object Posing cs.CV

cs.CL

[58] LLMs Can’t Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions cs.CL | cs.AI

[59] Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models cs.CL

[60] Integral Transformer: Denoising Attention, Not Too Much Not Too Little cs.CL

[61] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning cs.CL | cs.AI

[62] How Reliable are LLMs for Reasoning on the Re-ranking task? cs.CL | cs.AI

[63] Integrating gender inclusivity into large language models via instruction tuning cs.CL

[64] Thinking Before You Speak: A Proactive Test-time Scaling Approach cs.CL

[65] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum cs.CL | cs.AI | cs.MM

[66] Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning cs.CL

[67] Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System cs.CL

[68] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs cs.CL

[69] Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models cs.CL | cs.LG

[70] M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations cs.CL | cs.AI

[71] Chronological Passage Assembling in RAG framework for Temporal Question Answering cs.CL

[72] ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models cs.CL

[73] Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction cs.CL | cs.AI

[74] Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness cs.CL

[75] ConfTuner: Training Large Language Models to Express Their Confidence Verbally cs.CL | cs.AI

[76] ReflectivePrompt: Reflective evolution in autoprompting algorithms cs.CL | cs.AI | cs.LG

[77] Empowering Computing Education Researchers Through LLM-Assisted Content Analysis cs.CL