cs.CV [Total: 214]
cs.CL [Total: 50]
cs.SE [Total: 1]
cs.LG [Total: 23]
q-bio.NC [Total: 1]
cs.HC [Total: 3]
eess.IV [Total: 5]
cs.IR [Total: 1]
cs.AR [Total: 3]
cs.CY [Total: 1]
astro-ph.IM [Total: 1]
cs.RO [Total: 3]
stat.AP [Total: 1]
stat.CO [Total: 1]
cs.CR [Total: 3]
cs.AI [Total: 14]
cs.DC [Total: 1]
eess.AS [Total: 2]

cs.CV

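[1] Psychological stress during Examination and its estimation by handwriting in answer script cs.CV

Abhijeet Kumar, Chetan Agarwal, Pronoy B. Neogi, Mayank Goswami

TL;DR: This study combines graphology and artificial intelligence to quantify students' psychological stress by analyzing the handwriting in their examination answer scripts. Using OCR and transformer-based sentiment analysis models, it proposes a data-driven approach that goes beyond traditional grading and offers insight into cognitive and emotional states during examinations.

Details

Motivation: Traditional grading systems look only at whether answers are correct and ignore the student's psychological state. The study aims to quantify exam-time stress through handwriting analysis, giving educational assessment a more complete view.

Result: The proposed method produces a numerical Stress Index for psychological stress during examinations and offers academia a new assessment tool.

Insight: Handwriting analysis combined with AI could open new research directions in educational psychology and academic assessment, and it highlights how emotional state affects student performance.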

Abstract: This research explores the fusion of graphology and artificial intelligence to quantify psychological stress levels in students by analyzing their handwritten examination scripts. By leveraging Optical Character Recognition and transformer based sentiment analysis models, we present a data driven approach that transcends traditional grading systems, offering deeper insights into cognitive and emotional states during examinations. The system integrates high resolution image processing, TrOCR, and sentiment entropy fusion using RoBERTa based models to generate a numerical Stress Index. Our method achieves robustness through a five model voting mechanism and unsupervised anomaly detection, making it an innovative framework in academic forensics.

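[2] Real-time pothole detection with onboard sensors and camera on vehicles cs.CV

Aswath Muthuselvam, Jeevak Raj S, Mohanaprasad K

TL;DR: The paper proposes a method for detecting potholes in real time using onboard vehicle sensors and a camera, reaching 98.1% accuracy with an SVM classifier.

Details

Motivation: As the number of vehicles grows, frequent monitoring of road conditions is essential to keep traffic flowing. Small cracks can develop into potholes under temperature changes and vehicle load, so real-time detection is needed.

Result: 98.1% detection accuracy on data collected from about 2 km of local road containing 26 potholes.

Insight: Onboard sensors combined with an SVM classifier can detect potholes in real time with high accuracy, which supports road maintenance at scale.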

Abstract: Road conditions play an important role in our everyday commute. With the proliferating number of vehicles on the road each year, it has become necessary to assess road conditions very frequently so that traffic flows smoothly. Even the smallest crack in the road can easily be chipped into a large pothole by changing surface temperatures and the force of vehicles riding over it. In this paper, we address how to better identify these potholes in real time with the help of onboard sensors in vehicles, so that the data can be used for analysis and better management of potholes on a large scale. For the implementation, we used an SVM classifier to detect potholes and achieved 98.1% accuracy on data collected from about 2 km of a local road with 26 potholes distributed along it. Code is available at: https://github.com/aswathselvam/Potholes

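[3] A Method for Identifying Farmland System Habitat Types Based on the Dynamic-Weighted Feature Fusion Network Model cs.CV

Kesong Zheng, Zhi Song, Peizhou Li, Shuyi Yao, Zhenxing Bian

TL;DR: The paper proposes a Dynamic-Weighted Feature Fusion Network (DWFF-Net) for identifying farmland system habitat types. It addresses the weak fusion of semantic and texture features in existing models and improves segmentation accuracy for multi-scale habitats.

Details

Motivation: Habitat classification for farmland ecosystems lacks a standardized system and covers habitat types incompletely. Existing models also fail to fuse semantic and texture features effectively, which leads to insufficient segmentation accuracy and blurred boundaries for multi-scale habitats.

Result: On the constructed dataset, DWFF-Net reaches an mIoU of 0.6979 and an F1-score of 0.8049, improving on the baseline network by 0.021 and 0.0161 respectively. Ablation studies confirm that multi-layer feature fusion is complementary.

Insight: Dynamic-weighted feature fusion improves IoU for micro-habitat categories such as field ridges. The method offers a low-cost technical route to fine-grained monitoring of farmland landscapes.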

Abstract: Addressing the current lack of a standardized habitat classification system for cultivated land ecosystems, incomplete coverage of habitat types, and the inability of existing models to effectively integrate semantic and texture features, which results in insufficient segmentation accuracy and blurred boundaries for multi-scale habitats (e.g., large-scale field plots and micro-habitats), this study developed a comprehensively annotated ultra-high-resolution remote sensing image dataset encompassing 15 categories of cultivated land system habitats. Furthermore, we propose a Dynamic-Weighted Feature Fusion Network (DWFF-Net). The encoder of this model utilizes a frozen-parameter DINOv3 to extract foundational features. By analyzing the relationships between different category images and feature maps, we introduce a data-level adaptive dynamic weighting strategy for feature fusion. The decoder incorporates a dynamic weight computation network to achieve thorough integration of multi-layer features, and a hybrid loss function is adopted to optimize model training. Experimental results on the constructed dataset demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 0.6979 and an F1-score of 0.8049, outperforming the baseline network by 0.021 and 0.0161, respectively. Ablation studies further confirm the complementary nature of multi-layer feature fusion, which effectively improves the IoU for micro-habitat categories such as field ridges. This study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at a low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes.

Lian He, Meng Liu, Qilang Ye, Yu Zhou, Xiang Deng

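TL;DR: This paper proposes TASA, a task-aware framework for 3D scene-level affordance segmentation that combines 2D semantic cues with 3D geometric reasoning to segment scene affordances efficiently and accurately.

Details

Motivation: Existing methods focus on object-level affordances or simply lift 2D predictions to 3D, neglecting the rich geometric structure in point clouds and incurring high computational cost. A method is needed that uses semantic reasoning together with spatial information.

Result: Experiments on SceneFun3D show that TASA significantly outperforms the baselines on scene-level affordance segmentation in both accuracy and efficiency.

Insight: Combining 2D semantic cues with 3D geometric information can markedly improve affordance segmentation while lowering computational cost.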

Abstract: Understanding 3D scene-level affordances from natural language instructions is essential for enabling embodied agents to interact meaningfully in complex environments. However, this task remains challenging due to the need for semantic reasoning and spatial grounding. Existing methods mainly focus on object-level affordances or merely lift 2D predictions to 3D, neglecting rich geometric structure information in point clouds and incurring high computational costs. To address these limitations, we introduce Task-Aware 3D Scene-level Affordance segmentation (TASA), a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To improve the affordance detection efficiency, TASA features a task-aware 2D affordance detection module to identify manipulable points from language and visual inputs, guiding the selection of task-relevant views. To fully exploit 3D geometric information, a 3D affordance refinement module is proposed to integrate 2D semantic priors with local 3D geometry, resulting in accurate and spatially coherent 3D affordance masks. Experiments on SceneFun3D demonstrate that TASA significantly outperforms the baselines in both accuracy and efficiency in scene-level affordance segmentation.

Zekai Shi, Zhixi Cai, Kalin Stefanov

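TL;DR: The paper studies how infants learn word-referent mappings and proposes a biologically plausible, self-supervised approach to visual representation learning. A masked autoencoder whose masking strategy is informed by the human blind spot performs at least as well as random masking on the word-referent mapping task.

Details

Motivation: The goal is to understand how infants learn word-referent mappings without prior knowledge, and to propose a learning method that is closer to the human visual system.

Result: Experiments suggest that the biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings.

Insight: The blind spot of the human visual system can be built into a learning model, which suggests a direction for future biologically inspired learning methods.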

Abstract: Typically, children start to learn their first words between 6 and 9 months, linking spoken utterances to their visual referents. Without prior knowledge, a word encountered for the first time can be interpreted in countless ways; it might refer to any of the objects in the environment, their components, or attributes. Using longitudinal, egocentric, and ecologically valid data from the experience of one child, in this work, we propose a self-supervised and biologically plausible strategy to learn strong visual representations. Our masked autoencoder-based visual backbone incorporates knowledge about the blind spot in human eyes to define a novel masking strategy. This mask and reconstruct approach attempts to mimic the way the human brain fills the gaps in the eyes’ field of view. This represents a significant shift from standard random masking strategies, which are difficult to justify from a biological perspective. The pretrained encoder is utilized in a contrastive learning-based video-text model capable of acquiring word-referent mappings. Extensive evaluation suggests that the proposed biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes.

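[6] GROVER: Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion cs.CV | cs.AI

Yongjun Xiao, Dian Meng, Xinlei Huang, Yanran Liu, Shiwei Ruan

TL;DR: GROVER is a framework for adaptive integration of spatial multi-omics data. It uses a graph convolutional network and a dynamic expert routing mechanism to handle multimodal heterogeneity and resolution mismatch.

Details

Motivation: Integrating spatial multi-omics data (transcriptomics, proteomics, and epigenomics) with histopathological images is essential for a comprehensive understanding of diseased tissue, but data heterogeneity and resolution mismatch make effective fusion difficult for conventional methods.

Result: On real-world spatial omics datasets, GROVER outperforms state-of-the-art baselines and provides a robust solution for multimodal integration.

Insight: By combining graph neural networks with dynamic expert routing, GROVER handles multimodal heterogeneity and also adaptively suppresses noisy or low-quality inputs, offering a new approach to spatial multi-omics analysis.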

Abstract: Effectively modeling multimodal spatial omics data is critical for understanding tissue complexity and underlying biological mechanisms. While spatial transcriptomics, proteomics, and epigenomics capture molecular features, they lack pathological morphological context. Integrating these omics with histopathological images is therefore essential for comprehensive disease tissue analysis. However, substantial heterogeneity across omics, imaging, and spatial modalities poses significant challenges. Naive fusion of semantically distinct sources often leads to ambiguous representations. Additionally, the resolution mismatch between high-resolution histology images and lower-resolution sequencing spots complicates spatial alignment. Biological perturbations during sample preparation further distort modality-specific signals, hindering accurate integration. To address these challenges, we propose Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion (GROVER), a novel framework for adaptive integration of spatial multi-omics data. GROVER leverages a Graph Convolutional Network encoder based on Kolmogorov-Arnold Networks to capture the nonlinear dependencies between each modality and its associated spatial structure, thereby producing expressive, modality-specific embeddings. To align these representations, we introduce a spot-feature-pair contrastive learning strategy that explicitly optimizes the correspondence across modalities at each spot. Furthermore, we design a dynamic expert routing mechanism that adaptively selects informative modalities for each spot while suppressing noisy or low-quality inputs. Experiments on real-world spatial omics datasets demonstrate that GROVER outperforms state-of-the-art baselines, providing a robust and reliable solution for multimodal integration.

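[7] Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models cs.CV | cs.AI | cs.MA

Sanchit Sinha, Guangzhi Xiong, Zhenghao He, Aidong Zhang

TL;DR: Concept-RuleNet is a multi-agent system that combines visual concept generation with symbolic reasoning to improve the interpretability and performance of vision-language models.

Details

Motivation: Modern vision-language models (VLMs) predict accurately but offer little transparency into their decisions and tend to hallucinate on out-of-distribution data. Neurosymbolic frameworks try to address this, but existing methods extract symbols from task labels alone, leaving them weakly grounded in the visual data. The paper proposes a method that restores visual grounding while keeping symbolic reasoning transparent.

Result: On five benchmarks (two medical-imaging tasks and three natural-image datasets), Concept-RuleNet improves on state-of-the-art neurosymbolic baselines by an average of 5% and reduces the occurrence of hallucinated symbols in rules by up to 50%.

Insight: Mining visual concepts directly from images and combining them with symbolic reasoning improves both performance and interpretability. The approach is especially relevant to domains that need high transparency and low hallucination risk, such as medical imaging.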

Abstract: Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into ‘why’ a decision is reached, frequently hallucinating facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce a multi-agent system - Concept-RuleNet that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are utilized to condition symbol discovery, anchoring the generations in real image statistics and mitigating label bias. Subsequently, symbols are composed into executable first-order rules by a large language model reasoner agent - yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with outputs of black-box neural models, predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system augments state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.

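[8] Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing cs.CV | cs.AI

Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade

TL;DR: Image-POSER is a reflective reinforcement learning framework that coordinates multiple pretrained expert models through dynamic task decomposition and feedback from a vision-language model critic, enabling multi-expert image generation and editing.

Details

Motivation: Current text-to-image models struggle with long, compositional prompts, and no single system executes them reliably. The authors propose Image-POSER to address this.

Result: Experiments show that Image-POSER outperforms baselines, including frontier models, in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations.

Insight: Reinforcement learning can let AI systems autonomously decompose tasks and compose models, a step toward general-purpose visual assistants.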

Abstract: Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.

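[9] Defending Unauthorized Model Merging via Dual-Stage Weight Protection cs.CV | cs.CR

Wei-Jia Chen, Min-Yen Tsai, Cheng-Yi Lee, Chia-Mu Yu

TL;DR: MergeGuard is a dual-stage weight protection framework that prevents unauthorized model merging by disrupting merging compatibility while preserving task performance.

Details

Motivation: Open repositories of pretrained models make it easy to merge fine-tuned models, but unauthorized merging violates intellectual property rights and undermines model ownership and accountability. MergeGuard targets this problem.

Result: Experiments show that MergeGuard reduces merged-model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

Insight: Reshaping a model's parameter geometry is an effective way to block merging, and it suggests a new approach to protecting model ownership.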

Abstract: The rapid proliferation of pretrained models and open repositories has made model merging a convenient yet risky practice, allowing free-riders to combine fine-tuned models into a new multi-capability model without authorization. Such unauthorized model merging not only violates intellectual property rights but also undermines model ownership and accountability. To address this issue, we present MergeGuard, a proactive dual-stage weight protection framework that disrupts merging compatibility while maintaining task fidelity. In the first stage, we redistribute task-relevant information across layers via L2-regularized optimization, ensuring that important gradients are evenly dispersed. In the second stage, we inject structured perturbations to misalign task subspaces, breaking curvature compatibility in the loss landscape. Together, these stages reshape the model’s parameter geometry such that merged models collapse into destructive interference while the protected model remains fully functional. Extensive experiments on both vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models demonstrate that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

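[10] FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision cs.CV

Muzammal Shafique, Nasir Rahim, Jamil Ahmad, Mohammad Siadat, Khalid Malik

TL;DR: FocusSDF is a loss function based on signed distance functions (SDFs) that adaptively assigns higher weights to pixels near the boundary, making segmentation models boundary aware and improving medical image segmentation.

Details

Motivation: Medical image segmentation is critical in clinical practice, but most models do not explicitly encode boundary information, so boundary preservation remains a persistent problem. FocusSDF uses SDFs to focus the network on boundary regions.

Result: Across datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation, FocusSDF consistently outperforms existing distance-transform-based loss functions.

Insight: Explicit boundary information such as an SDF improves a segmentation model's boundary awareness, and adaptive weighting is an effective way to handle boundaries in medical images.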

Abstract: Segmentation of medical images constitutes an essential component of medical image analysis, providing the foundation for precise diagnosis and efficient therapeutic interventions in clinical practice. Despite substantial progress, most segmentation models do not explicitly encode boundary information; as a result, boundary preservation remains a persistent challenge in medical image segmentation. To address this challenge, we introduce FocusSDF, a novel loss function based on signed distance functions (SDFs), which redirects the network to concentrate on boundary regions by adaptively assigning higher weights to pixels closer to the lesion or organ boundary, effectively making it boundary aware. To rigorously validate FocusSDF, we perform extensive evaluations against five state-of-the-art medical image segmentation models, including the foundation model MedSAM, using four distance-based loss functions across diverse datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation tasks spanning multiple imaging modalities. The experimental results consistently demonstrate the superior performance of FocusSDF over existing distance transform based loss functions.
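
The idea of weighting pixels by their distance to the boundary can be sketched in a few lines. This is an illustrative sketch only, not the FocusSDF loss itself: the abstract does not give the formula, so the weight `1 + alpha * exp(-|sdf| / sigma)`, the parameters `alpha` and `sigma`, and the brute-force distance computation are all assumptions chosen for readability.

```python
import math

def signed_distance(mask):
    """Brute-force signed distance on a small 2D binary mask.

    Negative inside the object, positive outside; the magnitude is the
    Euclidean distance to the nearest pixel of the opposite class.
    """
    h, w = len(mask), len(mask[0])
    coords = [(r, c) for r in range(h) for c in range(w)]
    sdf = [[0.0] * w for _ in range(h)]
    for r, c in coords:
        opposite = [(rr, cc) for rr, cc in coords if mask[rr][cc] != mask[r][c]]
        if not opposite:
            raise ValueError("mask must contain both foreground and background")
        d = min(math.hypot(r - rr, c - cc) for rr, cc in opposite)
        sdf[r][c] = -d if mask[r][c] == 1 else d
    return sdf

def boundary_weighted_bce(pred, mask, alpha=4.0, sigma=1.0, eps=1e-7):
    """Binary cross-entropy with per-pixel weights that decay with |SDF|.

    The weighting function is a hypothetical choice for illustration,
    not the one published for FocusSDF.
    """
    sdf = signed_distance(mask)
    total, weight_sum = 0.0, 0.0
    for r, row in enumerate(mask):
        for c, y in enumerate(row):
            p = min(max(pred[r][c], eps), 1.0 - eps)  # clamp to avoid log(0)
            w = 1.0 + alpha * math.exp(-abs(sdf[r][c]) / sigma)
            total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
            weight_sum += w
    return total / weight_sum
```

With this weighting, the same misclassification costs more at a pixel next to the boundary than at a pixel deep inside the object, which is the behavior the abstract describes.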

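[11] Lacking Data? No worries! How synthetic images can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus) cs.CV

Simon Durand, Samuel Foucher, Alexandre Delplanque, Joëlle Taillon, Jérôme Théau

TL;DR: The paper examines how synthetic imagery (SI) can supplement scarce real data to improve muskox detection in zero-shot (ZS) and few-shot (FS) settings. SI improved detection in the zero-shot setting, while the few-shot gains were small and not statistically significant.

Details

Motivation: Traditional wildlife survey methods, such as visual aerial counts and GNSS telemetry tracking, are resource intensive and constrained by logistics. Deep learning and remote sensing offer alternatives, but small datasets for sparsely distributed species such as muskoxen limit the performance of object detection models (ODMs). The authors use synthetic imagery to address this data scarcity.

Result: 1. For zero-shot models, SI improved detection performance, but the gains plateaued once SI exceeded 100% of the baseline training set size; 2. For few-shot models, combining SI with real images slightly improved recall and overall accuracy, but the differences were not statistically significant.

Insight: 1. Synthetic imagery is a useful way to address data scarcity, especially for sparsely distributed or hard-to-monitor species; 2. Performance gains depend on the balance between SI and real data, and adding ever more SI yields diminishing returns; 3. The approach gives wildlife monitoring a starting point that does not depend on large amounts of real data.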

Abstract: Accurate population estimates are essential for wildlife management, providing critical insights into species abundance and distribution. Traditional survey methods, including visual aerial counts and GNSS telemetry tracking, are widely used to monitor muskox populations in Arctic regions. These approaches are resource intensive and constrained by logistical challenges. Advances in remote sensing, artificial intelligence, and high resolution aerial imagery offer promising alternatives for wildlife detection. Yet, the effectiveness of deep learning object detection models (ODMs) is often limited by small datasets, making it challenging to train robust ODMs for sparsely distributed species like muskoxen. This study investigates the integration of synthetic imagery (SI) to supplement limited training data and improve muskox detection in zero shot (ZS) and few-shot (FS) settings. We compared a baseline model trained on real imagery with 5 ZS and 5 FS models that incorporated progressively more SI in the training set. For the ZS models, where no real images were included in the training set, adding SI improved detection performance. As more SI were added, performance in precision, recall and F1 score increased, but eventually plateaued, suggesting diminishing returns when SI exceeded 100% of the baseline model training dataset. For FS models, combining real and SI led to better recall and slightly higher overall accuracy compared to using real images alone, though these improvements were not statistically significant. Our findings demonstrate the potential of SI to train accurate ODMs when data is scarce, offering important perspectives for wildlife monitoring by enabling rare or inaccessible species to be monitored and to increase monitoring frequency. This approach could be used to initiate ODMs without real data and refine it as real images are acquired over time.

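[12] Advancing Annotat3D with Harpia: A CUDA-Accelerated Library For Large-Scale Volumetric Data Segmentation cs.CV | cs.DC

Camila Machado de Araujo, Egon P. B. S. Borges, Ricardo Marcelo Canteiro Grangeiro, Allan Pinto

TL;DR: The paper extends Annotat3D with Harpia, a CUDA-accelerated library for efficient segmentation of large 3D datasets, with significant gains in processing speed and memory efficiency.

Details

Motivation: High-resolution volumetric imaging produces large datasets that strain existing processing tools. A solution is needed that supports interactive segmentation and efficient GPU resource management.

Result: Experiments show significant improvements in processing speed, memory efficiency, and scalability over NVIDIA cuCIM and scikit-image.

Insight: Harpia's efficient GPU resource management and interactive interface make it well suited to collaborative scientific imaging workflows on shared high-performance computing infrastructure.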

Abstract: High-resolution volumetric imaging techniques, such as X-ray tomography and advanced microscopy, generate increasingly large datasets that challenge existing tools for efficient processing, segmentation, and interactive exploration. This work introduces new capabilities to Annotat3D through Harpia, a new CUDA-based processing library designed to support scalable, interactive segmentation workflows for large 3D datasets in high-performance computing (HPC) and remote-access environments. Harpia features strict memory control, native chunked execution, and a suite of GPU-accelerated filtering, annotation, and quantification tools, enabling reliable operation on datasets exceeding single-GPU memory capacity. Experimental results demonstrate significant improvements in processing speed, memory efficiency, and scalability compared to widely used frameworks such as NVIDIA cuCIM and scikit-image. The system’s interactive, human-in-the-loop interface, combined with efficient GPU resource management, makes it particularly suitable for collaborative scientific imaging workflows in shared HPC infrastructures.

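[13] Prompt Triage: Structured Optimization Enhances Vision-Language Model Performance on Medical Imaging Benchmarks cs.CV | cs.AI

Arnav Singhvi, Vasiliki Bikia, Asad Aali, Akshay Chaudhari, Roxana Daneshjou

TL;DR: The paper applies structured automated prompt optimization to open-source vision-language models on medical imaging tasks, achieving a median relative improvement of 53% over zero-shot baselines, with gains of up to 3,400% on some tasks.

Details

Motivation: Vision-language models often underperform on medical tasks. Fine-tuning requires large domain-specific datasets and significant compute, and manual prompt engineering is hard to generalize. The study uses automated prompt optimization to reduce dependence on hand-designed prompts and improve model performance.

Result: Optimized pipelines achieve a median relative improvement of 53% over zero-shot prompting baselines, with the largest gains ranging from 300% to 3,400% on tasks where zero-shot performance is low. The evaluation pipelines are publicly released to support reproducible research.

Insight: Automated prompt optimization can substantially improve medical AI systems and reduce dependence on manual prompt design, letting clinicians focus on patient care and clinical decision-making.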

Abstract: Vision-language foundation models (VLMs) show promise for diverse imaging tasks but often underperform on medical benchmarks. Prior efforts to improve performance include model finetuning, which requires large domain-specific datasets and significant compute, or manual prompt engineering, which is hard to generalize and often inaccessible to medical institutions seeking to deploy these tools. These challenges motivate interest in approaches that draw on a model’s embedded knowledge while abstracting away dependence on human-designed prompts to enable scalable, weight-agnostic performance improvements. To explore this, we adapt the Declarative Self-improving Python (DSPy) framework for structured automated prompt optimization in medical vision-language systems through a comprehensive, formal evaluation. We implement prompting pipelines for five medical imaging tasks across radiology, gastroenterology, and dermatology, evaluating 10 open-source VLMs with four prompt optimization techniques. Optimized pipelines achieved a median relative improvement of 53% over zero-shot prompting baselines, with the largest gains ranging from 300% to 3,400% on tasks where zero-shot performance is low. These results highlight the substantial potential of applying automated prompt optimization to medical AI systems, demonstrating significant gains for vision-based applications requiring accurate clinical image interpretation. By reducing dependence on prompt design to elicit intended outputs, these techniques allow clinicians to focus on patient care and clinical decision-making. Furthermore, our experiments offer scalability and preserve data privacy, demonstrating performance improvement on open-source VLMs. We publicly release our evaluation pipelines to support reproducible research on specialized medical tasks, available at https://github.com/DaneshjouLab/prompt-triage-lab.
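
A short worked example may help in reading the "median relative improvement" figure and in seeing why very large percentages arise when the zero-shot baseline is low. The scores below are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the paper, and the function names are illustrative.

```python
from statistics import median

def relative_improvement_percent(baseline, optimized):
    """Relative gain of `optimized` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (optimized - baseline) / baseline

def median_relative_improvement(pairs):
    """Median of per-task relative gains for (baseline, optimized) pairs."""
    return median(relative_improvement_percent(b, o) for b, o in pairs)

# Hypothetical (zero-shot, optimized) scores for five tasks; NOT from the paper.
example_pairs = [
    (0.40, 0.50),
    (0.50, 0.80),
    (0.02, 0.70),  # a near-zero baseline produces a very large relative gain
    (0.60, 0.66),
    (0.30, 0.45),
]
```

For these made-up numbers the per-task gains are roughly 25%, 60%, 3,400%, 10%, and 50%, so the median is about 50% even though one task improves by a factor of 35. The median is robust to such outliers, which is presumably why it is the headline statistic.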

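[14] PI-NAIM: Path-Integrated Neural Adaptive Imputation Model cs.CV | cs.AI

Afifa Khaled, Ebrahim Hamid Sumiea

TL;DR: PI-NAIM is a dual-path architecture that dynamically routes samples to handle missing-modality imputation, combining the strengths of statistical methods and neural networks to improve imputation accuracy and downstream task performance.

Details

Motivation: Medical imaging and multi-modal clinical data often have missing modalities. Existing imputation methods either lack representational capacity or are computationally expensive.

Result: On MIMIC-III and multimodal benchmarks, PI-NAIM achieves an RMSE of 0.108 (baselines: 0.119-0.152) and an AUROC of 0.812 for mortality prediction.

Insight: The modular design allows integration into vision pipelines that handle incomplete data, providing a unified solution for real-world scenarios.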

Abstract: Medical imaging and multi-modal clinical settings often face the challenge of missing modalities in their diagnostic pipelines. Existing imputation methods either lack representational capacity or are computationally expensive. We propose PI-NAIM, a novel dual-path architecture that dynamically routes samples to optimized imputation approaches based on missingness complexity. Our framework integrates: (1) intelligent path routing that directs low missingness samples to efficient statistical imputation (MICE) and complex patterns to powerful neural networks (GAIN with temporal analysis); (2) cross-path attention fusion that leverages missingness-aware embeddings to intelligently combine both branches; and (3) end-to-end joint optimization of imputation accuracy and downstream task performance. Extensive experiments on MIMIC-III and multimodal benchmarks demonstrate state-of-the-art performance, achieving RMSE of 0.108 (vs. baselines’ 0.119-0.152) and substantial gains in downstream tasks with an AUROC of 0.812 for mortality prediction. PI-NAIM’s modular design enables seamless integration into vision pipelines handling incomplete sensor measurements, missing modalities, or corrupted inputs, providing a unified solution for real-world scenarios. The code is publicly available at https://github.com/AfifaKhaled/PI-NAIM-Path-Integrated-Neural-Adaptive-Imputation-Model

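[15] Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models cs.CV

Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan

TL;DR: The paper presents Query-aware Token Selector (QTSplus), a lightweight visual token selection module that addresses token explosion in long-video understanding. By dynamically selecting the most important visual evidence and predicting a retention budget, QTSplus compresses the vision stream and improves efficiency.

Details

Motivation: In long-video understanding, the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. The paper aims to solve this problem.

Result: Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos, with near-parity accuracy across eight long-video understanding benchmarks.

Insight: QTSplus is a general mechanism for scaling MLLMs to long-video scenarios while preserving task-relevant evidence.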

Abstract: Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by +20.5 and +5.6 points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence. We will make all code, data, and trained models’ weights publicly available.
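
The inference-time behavior described in steps (i) to (iii) can be sketched as a hard top-n gate. This is a minimal sketch under stated assumptions, not the QTSplus implementation: the scaled dot-product score stands in for the cross-attention scoring, the retention budget is passed in as `budget_ratio` instead of being predicted from the query, and the training-time straight-through estimator is omitted. The function name `select_tokens` is hypothetical.

```python
import math

def select_tokens(query, tokens, budget_ratio):
    """Keep the top-n tokens by scaled dot-product relevance to the query.

    Returns the indices of the kept tokens in their original (temporal)
    order. `budget_ratio` is supplied by the caller here; in the paper it
    is predicted per instance.
    """
    if not 0.0 < budget_ratio <= 1.0:
        raise ValueError("budget_ratio must be in (0, 1]")
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    n = max(1, math.ceil(budget_ratio * len(tokens)))
    # Hard gate: rank by score, keep n, then restore temporal order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n]
    return sorted(top)
```

Returning indices in their original order keeps the temporal sequence intact, which matters because the abstract notes that a re-encoder relies on absolute time information after selection.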

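[16] From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing cs.CV

Ling Wang, Yunfan Lu, Wenzong Ma, Huizai Yao, Pengteng Li

TL;DR: The paper uses event cameras for dehazing for the first time and proposes an event-guided diffusion framework that draws on the generative priors of diffusion models to reconstruct clear images from hazy inputs, reporting state-of-the-art results on two benchmarks and a real-world dataset.

Details

Motivation: Existing RGB-based dehazing methods are limited by dynamic range and can lose structure and illumination details. Event cameras offer high dynamic range and low latency, which suits hazy scenes.

Result: State-of-the-art dehazing results on two benchmarks and on a self-collected real-world heavy-haze drone dataset.

Insight: The high dynamic range of event cameras opens new possibilities for dehazing, and the generative capacity of diffusion models helps preserve structural and semantic information.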

Abstract: Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill-posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the first time. Event cameras offer much higher HDR (120 dB vs. 60 dB) and microsecond latency, therefore they suit hazy scenes. In practice, transferring HDR cues from events to frames is hard because real paired data are scarce. To tackle this, we propose an event-guided diffusion model that utilizes the strong generative priors of diffusion models to reconstruct clear images from hazy inputs by effectively transferring HDR information from events. Specifically, we design an event-guided module that maps sparse HDR event features, e.g., edges, corners, into the diffusion latent space. This clear conditioning provides precise structural guidance during generation, improves visual realism, and reduces semantic drift. For real-world evaluation, we collect a drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors. Experiments on two benchmarks and our dataset achieve state-of-the-art results.

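[17] Evaluation of Attention Mechanisms in U-Net Architectures for Semantic Segmentation of Brazilian Rock Art Petroglyphs cs.CV

Leonardi Melo, Luís Gustavo, Dimmy Magalhães, Lucciani Vieira, Mauro Araújo

TL;DR: The study compares three U-Net-based architectures (BEGL-UNet, Attention-Residual BEGL-UNet, and Spatial Channel Attention BEGL-UNet) for semantic segmentation of Brazilian rock art petroglyphs and finds that the attention-based variants outperform the baseline.

Details

Motivation: The study examines how much attention mechanisms in U-Net architectures improve semantic segmentation of Brazilian rock art petroglyphs, in support of the digital preservation of archaeological heritage.

Result: Attention-Residual BEGL-UNet performs best (Dice Score 0.710), followed by Spatial Channel Attention BEGL-UNet (DSC 0.707); both beat the baseline BEGL-UNet (DSC 0.690). Attention mechanisms bring Dice Score improvements of 2.5-2.9% over the baseline.

Insight: Attention mechanisms, including spatial-channel attention, improve petroglyph segmentation and offer a useful approach for archaeological images with complex textures.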

Abstract: This study presents a comparative analysis of three U-Net-based architectures for semantic segmentation of rock art petroglyphs from Brazilian archaeological sites. The investigated architectures were: (1) BEGL-UNet with Border-Enhanced Gaussian Loss function; (2) Attention-Residual BEGL-UNet, incorporating residual blocks and gated attention mechanisms; and (3) Spatial Channel Attention BEGL-UNet, which employs spatial-channel attention modules based on Convolutional Block Attention Module. All implementations employed the BEGL loss function combining binary cross-entropy with Gaussian edge enhancement. Experiments were conducted on images from the Poço da Bebidinha Archaeological Complex, Piauí, Brazil, using 5-fold cross-validation. Among the architectures, Attention-Residual BEGL-UNet achieved the best overall performance with Dice Score of 0.710, validation loss of 0.067, and highest recall of 0.854. Spatial Channel Attention BEGL-UNet obtained comparable performance with DSC of 0.707 and recall of 0.857. The baseline BEGL-UNet registered DSC of 0.690. These results demonstrate the effectiveness of attention mechanisms for archaeological heritage digital preservation, with Dice Score improvements of 2.5-2.9% over the baseline.
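
The reported 2.5-2.9% figure is a relative improvement in Dice Score, which can be checked from the three DSC values in the abstract. The sketch below shows the Dice computation on flattened binary masks and the relative-gain arithmetic; the helper names are illustrative and do not come from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two flattened binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# DSC values as reported in the abstract.
baseline_dsc = 0.690
attention_residual_dsc = 0.710
spatial_channel_dsc = 0.707
```

Here `relative_gain_percent(0.710, 0.690)` is about 2.9 and `relative_gain_percent(0.707, 0.690)` is about 2.5, matching the range quoted in the abstract. The absolute differences are 0.020 and 0.017 Dice points.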

Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Junchao Zhu

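TL;DR: The paper investigates how vision-language models (VLMs) can be used for fine-grained glomerular subtyping under data constraints, and shows experimentally that pathology-specialized models perform best with few labeled samples.

Details

Motivation: Fine-grained glomerular subtyping is essential to kidney biopsy interpretation, but clinical labels are scarce and hard to obtain. Existing methods focus mainly on coarse disease classification, and it remains unclear how to use VLMs for clinical subtyping in the few-shot setting.

Result: Pathology-specialized vision-language backbones perform best with vanilla fine-tuning; only 4-8 labeled examples per subtype are needed for substantial gains in discrimination and calibration. The study shows that supervision level and adaptation strategy jointly shape diagnostic performance and multimodal structure.

Insight: Discriminating positive from negative examples is as important as image-text alignment; supervision level and adaptation strategy jointly determine few-shot performance, which provides guidance for clinical model selection and annotation investment.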

Abstract: Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse disease classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models under this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot count, model architecture and domain knowledge, and adaptation strategy, this study provides guidance for future model selection and training under real clinical data constraints. Our results indicate that pathology-specialized vision-language backbones, when paired with vanilla fine-tuning, are the most effective starting point. Even with only 4-8 labeled examples per glomerular subtype, these models begin to capture distinctions and show substantial gains in discrimination and calibration, though additional supervision continues to yield incremental improvements. We also find that the discrimination between positive and negative examples is as important as image-text alignment. Overall, our results show that supervision level and adaptation strategy jointly shape both diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment.

[19] LithoSeg: A Coarse-to-Fine Framework for High-Precision Lithography Segmentation cs.CV | cs.NIPDF

Xinyu He, Botong Zhao, Bingbing Li, Shujing Lyu, Jiwei Shen

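TL;DR: LithoSeg is a coarse-to-fine framework for high-precision lithography segmentation that combines a SAM model in the coarse stage with 1D regression in the fine stage, substantially improving segmentation accuracy and robustness.

Details

Motivation: Current segmentation methods for lithography scanning electron microscope (SEM) images lack precision and robustness, which limits their practical use in semiconductor manufacturing. The authors aim to address this with a new approach.

Result: LithoSeg outperforms existing methods in both segmentation accuracy and metrology precision while requiring less supervision.

Insight: Combining human feedback with a lightweight regression method can effectively improve the precision and practicality of segmentation.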

Abstract: Accurate segmentation and measurement of lithography scanning electron microscope (SEM) images are crucial for ensuring precise process control, optimizing device performance, and advancing semiconductor manufacturing yield. Lithography segmentation requires pixel-level delineation of groove contours and consistent performance across diverse pattern geometries and process windows. However, existing methods often lack the necessary precision and robustness, limiting their practical applicability. To overcome this challenge, we propose LithoSeg, a coarse-to-fine network tailored for lithography segmentation. In the coarse stage, we introduce a Human-in-the-Loop Bootstrapping scheme for the Segment Anything Model (SAM) to attain robustness with minimal supervision. In the subsequent fine stage, we recast 2D segmentation as a 1D regression problem by sampling groove-normal profiles using the coarse mask and performing point-wise refinement with a lightweight MLP. LithoSeg outperforms previous approaches in both segmentation accuracy and metrology precision while requiring less supervision, offering promising prospects for real-world applications.

[20] Enhancing Road Safety Through Multi-Camera Image Segmentation with Post-Encroachment Time Analysis cs.CV | cs.LG | cs.SIPDF

Shounak Ray Chaudhuri, Arash Jahangiri, Christopher Paolini

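TL;DR: The paper presents a multi-camera computer vision framework that computes Post-Encroachment Time (PET) in real time to improve traffic safety, particularly at signalized intersections. The method uses YOLOv11 segmentation and a bird's-eye-view transformation for high-precision vehicle detection and dynamic heatmap analysis.

Details

Motivation: Traditional crash-based traffic safety studies suffer from data sparsity and latency, so a real-time, high-resolution method for assessing traffic safety is needed.

Result: The framework produces data for an 800 x 800 pixel logarithmic heatmap at an average of 2.68 FPS on edge devices and precisely identifies high-risk regions.

Insight: The approach demonstrates the feasibility of vision-based PET analysis for intelligent transportation systems, providing a high-resolution, real-time, and scalable way to evaluate intersection safety.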

Abstract: Traffic safety analysis at signalized intersections is vital for reducing vehicle and pedestrian collisions, yet traditional crash-based studies are limited by data sparsity and latency. This paper presents a novel multi-camera computer vision framework for real-time safety assessment through Post-Encroachment Time (PET) computation, demonstrated at the intersection of H Street and Broadway in Chula Vista, California. Four synchronized cameras provide continuous visual coverage, with each frame processed on NVIDIA Jetson AGX Xavier devices using YOLOv11 segmentation for vehicle detection. Detected vehicle polygons are transformed into a unified bird’s-eye map using homography matrices, enabling alignment across overlapping camera views. A novel pixel-level PET algorithm measures vehicle position without reliance on fixed cells, allowing fine-grained hazard visualization via dynamic heatmaps, accurate to 3.3 sq-cm. Timestamped vehicle and PET data is stored in an SQL database for long-term monitoring. Results over various time intervals demonstrate the framework’s ability to identify high-risk regions with sub-second precision and real-time throughput on edge devices, producing data for an 800 x 800 pixel logarithmic heatmap at an average of 2.68 FPS. This study validates the feasibility of decentralized vision-based PET analysis for intelligent transportation systems, offering a replicable methodology for high-resolution, real-time, and scalable intersection safety evaluation.
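
Post-Encroachment Time is conventionally the gap between one road user leaving a location and another arriving at it. The abstract does not spell out its pixel-level algorithm, so the following is only a minimal sketch of the idea. It assumes vehicle footprints are already projected into the bird's-eye map as sets of pixels, and it approximates the exit time by the last frame in which the earlier vehicle covered the pixel. All names and the sample data are hypothetical.

```python
def pixel_pet(observations):
    """Return {pixel: smallest PET in seconds} from time-sorted observations.

    Each observation is (timestamp_s, vehicle_id, set_of_pixels). PET at a pixel
    is the time between the last sighting of one vehicle there and the arrival
    of a different vehicle.
    """
    last_seen = {}  # pixel -> (vehicle_id, most recent time that vehicle covered it)
    pet = {}
    for timestamp, vehicle_id, pixels in observations:
        for pixel in pixels:
            previous = last_seen.get(pixel)
            if previous is not None and previous[0] != vehicle_id:
                gap = timestamp - previous[1]
                pet[pixel] = min(gap, pet.get(pixel, gap))
            last_seen[pixel] = (vehicle_id, timestamp)
    return pet


# Hypothetical footprints: vehicle A crosses first, vehicle B follows.
frames = [
    (0.0, "A", {(5, 5), (5, 6)}),
    (0.5, "A", {(5, 6), (5, 7)}),
    (1.2, "B", {(5, 6)}),
    (3.0, "B", {(5, 5)}),
]
pet_map = pixel_pet(frames)
```

Smaller values mark pixels where two vehicles passed in quick succession; pixels visited by only one vehicle get no entry. A heatmap like the one described could then be rendered from `pet_map`, for example on a logarithmic color scale.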

[21] LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension cs.CVPDF

Xianglong Shi, Silin Cheng, Sirui Zhao, Yunhan Jiang, Enhong Chen

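TL;DR: The paper proposes LIHE, a new framework for weakly-supervised generalized referring expression comprehension (WGREC) that overcomes existing methods' inability to handle zero or multiple targets, and uses a hybrid hyperbolic-Euclidean geometry to avoid semantic representation collapse.

Details

Motivation: Existing weakly-supervised referring expression comprehension (WREC) methods assume a one-to-one mapping and cannot handle expressions that correspond to zero or multiple targets in realistic scenes. To address this, the authors introduce the WGREC task and design the LIHE framework.

Result: On gRefCOCO and Ref-ZOM, LIHE establishes the first effective weakly-supervised WGREC baseline, and the HEMix module improves IoU@0.5 by up to 2.5% on standard REC benchmarks.

Insight: Mixing hyperbolic and Euclidean geometry handles semantic hierarchies better and avoids representation collapse, while showing greater robustness in few-shot or weakly-supervised settings.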

Abstract: Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. However, extending WREC to WGREC presents two fundamental challenges: supervisory signal ambiguity, where weak image-level supervision is insufficient for training a model to infer the correct number and identity of referents, and semantic representation collapse, where standard Euclidean similarity forces hierarchically-related concepts into non-discriminative clusters, blurring categorical boundaries. To tackle these challenges, we propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages. The first stage, Referential Decoupling, predicts the number of target objects and decomposes the complex expression into simpler sub-expressions. The second stage, Referent Grounding, then localizes these sub-expressions using HEMix, our innovative hybrid similarity module that synergistically combines the precise alignment capabilities of Euclidean proximity with the hierarchical modeling strengths of hyperbolic geometry. This hybrid approach effectively prevents semantic collapse while preserving fine-grained distinctions between related concepts. Extensive experiments demonstrate LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM, while HEMix achieves consistent improvements on standard REC benchmarks, improving IoU@0.5 by up to 2.5%. The code is available at https://anonymous.4open.science/r/LIHE.
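
The abstract describes HEMix as a hybrid of Euclidean proximity and hyperbolic geometry but gives no formula. The sketch below is therefore only one plausible illustration: a convex mix of Euclidean distance and the standard Poincaré-ball distance. The mixing weight `alpha` and the function names are assumptions, not the paper's definitions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball; both points must have norm < 1."""
    squared_norm = lambda x: sum(a * a for a in x)
    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def hybrid_similarity(u, v, alpha=0.5):
    """Negative mixed distance: alpha weights Euclidean, (1 - alpha) hyperbolic.

    `alpha` is a hypothetical knob for this sketch.
    """
    mixed = alpha * math.dist(u, v) + (1.0 - alpha) * poincare_distance(u, v)
    return -mixed


origin_to_half = poincare_distance([0.0, 0.0], [0.5, 0.0])    # equals ln(3)
near_boundary = poincare_distance([0.9, 0.0], [0.95, 0.0])    # small Euclidean gap
score = hybrid_similarity([0.1, 0.2], [0.3, -0.1])
```

The hyperbolic term grows quickly near the boundary of the ball, which is why such spaces are often used to embed tree-like hierarchies, while the Euclidean term keeps fine-grained local alignment.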

[22] Null-Space Diffusion Distillation for Efficient Photorealistic Lensless Imaging cs.CVPDF

Jose Reinaldo Cunha Santos A V Silva Neto, Hodaka Kawachi, Yasushi Yagi, Tomoya Nakamura

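TL;DR: NSDD distills the null-space component of an iterative DDNM+ solver to achieve efficient, photorealistic lensless imaging reconstruction without paired supervision.

Details

Motivation: Existing lensless camera reconstruction methods usually rely on paired lensed-lensless supervision, which can bias models through domain mismatch. To avoid this, the paper explores an unsupervised approach based on diffusion priors.

Result: On Lensless-FFHQ and PhlatCam, NSDD achieves near-teacher perceptual quality (second-best LPIPS) and is the second fastest method, behind only Wiener.

Insight: The results suggest that separating range-space and null-space updates is promising for unsupervised photorealistic reconstruction and offers a practical route to efficient lensless imaging.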

Abstract: State-of-the-art photorealistic reconstructions for lensless cameras often rely on paired lensless-lensed supervision, which can bias models due to lens-lensless domain mismatch. To avoid this, ground-truth-free diffusion priors are attractive; however, generic formulations tuned for conventional inverse problems often break under the noisy, highly multiplexed, and ill-posed lensless deconvolution setting. We observe that methods which separate range-space enforcement from null-space diffusion-prior updates yield stable, realistic reconstructions. Building on this, we introduce Null-Space Diffusion Distillation (NSDD): a single-pass student that distills the null-space component of an iterative DDNM+ solver, conditioned on the lensless measurement and on a range-space anchor. NSDD preserves measurement consistency and achieves photorealistic results without paired supervision at a fraction of the runtime and memory. On Lensless-FFHQ and PhlatCam, NSDD is the second fastest, behind Wiener, and achieves near-teacher perceptual quality (second-best LPIPS, below DDNM+), outperforming DPS and classical convex baselines. These results suggest a practical path toward fast, ground-truth-free, photorealistic lensless imaging.
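
The range-space / null-space split that this abstract builds on can be shown on a toy linear problem. The sketch below is not the lensless forward model: it assumes a simple 2x average-pooling operator A, whose pseudo-inverse is sample replication, and composes an estimate as A⁺y + (I − A⁺A)x_prior. In NSDD the null-space term would come from a distilled single-pass student; here `x_prior` is just a hand-picked stand-in.

```python
def downsample(x, s=2):
    """Toy forward operator A: average each block of s samples."""
    return [sum(x[i:i + s]) / s for i in range(0, len(x), s)]


def pinv_upsample(y, s=2):
    """Pseudo-inverse of A for average pooling: replicate each value s times."""
    return [value for value in y for _ in range(s)]


def range_null_compose(y, x_prior, s=2):
    """Return A+ y + (I - A+ A) x_prior, which is consistent with y by construction."""
    range_part = pinv_upsample(y, s)
    prior_projection = pinv_upsample(downsample(x_prior, s), s)
    return [r + (p - q) for r, p, q in zip(range_part, x_prior, prior_projection)]


y = [2.0, 5.0]                   # measurement
x_prior = [1.0, 3.0, 4.0, 8.0]   # stand-in for a generative prior's proposal
x_hat = range_null_compose(y, x_prior)
```

Because A A⁺ is the identity for this operator, applying `downsample` to `x_hat` returns `y` exactly, whatever the prior proposes. Measurement consistency and realism are thus handled by separate terms, which is the property the abstract highlights.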

[23] Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark cs.CVPDF

Rulin Zhou, Wenlong He, An Wang, Jianhang Zhang, Xuanhui Zeng

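TL;DR: The paper introduces VL-SurgPT, the first multimodal surgical-scene dataset that combines visual tracking with textual descriptions, and proposes TG-SurgPT, which uses semantic information to make tracking more robust.

Details

Motivation: Smoke occlusion, specular reflections, and tissue deformation make point tracking difficult in surgical scenes, and existing datasets lack the semantic context needed to understand why tracking fails.

Result: Experiments show that incorporating point status information significantly improves tracking accuracy, outperforming vision-only methods especially in visually challenging scenarios.

Insight: Fusing visual and linguistic modalities helps build context-aware tracking systems and improves the performance of computer-assisted surgery.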

Abstract: Accurate point tracking in surgical environments remains challenging due to complex visual conditions, including smoke occlusion, specular reflections, and tissue deformation. While existing surgical tracking datasets provide coordinate information, they lack the semantic context necessary to understand tracking failure mechanisms. We introduce VL-SurgPT, the first large-scale multimodal dataset that bridges visual tracking with textual descriptions of point status in surgical scenes. The dataset comprises 908 in vivo video clips, including 754 for tissue tracking (17,171 annotated points across five challenging scenarios) and 154 for instrument tracking (covering seven instrument types with detailed keypoint annotations). We establish comprehensive benchmarks using eight state-of-the-art tracking methods and propose TG-SurgPT, a text-guided tracking approach that leverages semantic descriptions to improve robustness in visually challenging conditions. Experimental results demonstrate that incorporating point status information significantly improves tracking accuracy and reliability, particularly in adverse visual scenarios where conventional vision-only methods struggle. By bridging visual and linguistic modalities, VL-SurgPT enables the development of context-aware tracking systems crucial for advancing computer-assisted surgery applications that can maintain performance even under challenging intraoperative conditions.

[24] GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory cs.CV | cs.AIPDF

Jeong Hun Yeo, Sangyun Chung, Sungjune Park, Dae Hoe Kim, Jinyoung Moon

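TL;DR: GCAgent uses a Schematic and Narrative Episodic Memory to address global-context capture and long-term dependencies in long-video understanding, and substantially improves performance within a multi-stage Perception-Action-Reflection framework.

Details

Motivation: Long-video understanding is challenged by the token limits of Multimodal Large Language Models (MLLMs) and the complexity of long-term dependencies; existing methods struggle to capture global context and event relationships.

Result: Accuracy on the Video-MME Long split improves by up to 23.5%; GCAgent achieves the best performance among 7B-scale MLLMs, with 73.4% accuracy on the Long split and a 71.9% overall average.

Insight: Structured episodic memory and a multi-stage reasoning paradigm offer a cognitively inspired solution to the long-term dependency problem and validate the agent framework.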

Abstract: Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline. Furthermore, our framework establishes state-of-the-art performance among comparable 7B-scale MLLMs, achieving 73.4% accuracy on the Long split and the highest overall average (71.9%) on the Video-MME benchmark, validating our agent-based reasoning paradigm and structured memory for cognitively-inspired long-video understanding.

[25] VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation cs.CVPDF

Jun Zhou, Chi Xu, Kaifeng Tang, Yuting Ge, Tingrui Guo

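TL;DR: The paper proposes VPHO, a hand-object pose estimation framework that jointly learns visual and physical cues to improve both accuracy and physical plausibility.

Details

Motivation: Existing methods rely mainly on visual cues and often produce results that violate physical constraints (such as interpenetration or non-contact). To address this, the paper proposes jointly learning visual and physical cues.

Result: Experiments show that the method significantly outperforms existing methods in both pose accuracy and physical plausibility.

Insight: Combining visual and physical cues markedly improves physical plausibility while preserving visual consistency.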

Abstract: Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: A novel refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.

[26] Improved Masked Image Generation with Knowledge-Augmented Token Representations cs.CVPDF

Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Zihao Han, Yunming Ye

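TL;DR: The paper proposes KA-MIG, a knowledge-augmented masked image generation framework that introduces explicit token-level semantic dependency knowledge (a co-occurrence graph, a semantic similarity graph, and a position-token incompatibility graph) and combines it with a graph-aware encoder to improve generation quality.

Details

Motivation: Existing masked image generation methods usually rely on the model itself to learn semantic dependencies among visual token sequences, which is difficult because individual tokens lack clear semantic meaning and the sequences are long. The authors therefore introduce explicit token knowledge as priors to improve generation quality.

Result: Experiments show that KA-MIG outperforms existing MIG methods on class-conditional image generation on ImageNet.

Insight: Explicitly introducing token-level semantic dependency knowledge effectively improves masked image generation quality, with a particular advantage in learning complex semantic dependencies.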

Abstract: Masked image generation (MIG) has demonstrated remarkable efficiency and high-fidelity images by enabling parallel token prediction. Existing methods typically rely solely on the model itself to learn semantic dependencies among visual token sequences. However, directly learning such semantic dependencies from data is challenging because the individual tokens lack clear semantic meanings, and these sequences are usually long. To address this limitation, we propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies (i.e., extracted from the training data) as priors to learn richer representations for improving performance. In particular, we explore and identify three types of advantageous token knowledge graphs, including two positive and one negative graphs (i.e., the co-occurrence graph, the semantic similarity graph, and the position-token incompatibility graph). Based on three prior knowledge graphs, we design a graph-aware encoder to learn token and position-aware representations. After that, a lightweight fusion mechanism is introduced to integrate these enriched representations into the existing MIG methods. Resorting to such prior knowledge, our method effectively enhances the model’s ability to capture semantic dependencies, leading to improved generation quality. Experimental results demonstrate that our method improves upon existing MIG for class-conditional image generation on ImageNet.

[27] Calibrated Multimodal Representation Learning with Missing Modalities cs.CV | cs.LG | cs.MMPDF

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su

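TL;DR: CalMRL is a calibrated multimodal representation learning method that addresses the shift in representation alignment caused by missing modalities. It models representation-level imputation of missing modalities using priors and inter-modality relationships, significantly improving multimodal learning.

Details

Motivation: Multimodal representation learning usually assumes all modalities are present, but missing modalities are common in real data. Conventional alignment methods suffer from representation shift in this case, so a method that handles missing modalities is needed.

Result: Experiments show that CalMRL significantly outperforms baselines when modalities are missing and mitigates the representation shift.

Insight: 1. Representation-level imputation is more effective than conventional input-level imputation; 2. exploiting inter-modality relationships is essential for imputation; 3. the bi-step learning method provides theoretical guarantees and efficient optimization.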

Abstract: Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.

[28] SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images cs.CVPDF

Xinyuan Hu, Changyue Shi, Chuxiao Yang, Minghao Chen, Jiajun Ding

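TL;DR: SRSplat is a feed-forward framework that reconstructs high-resolution 3D scenes from sparse low-resolution images by combining external high-quality reference images with internal texture cues. Its core contributions are a reference-guided feature enhancement module and texture-aware density control.

Details

Motivation: Existing methods struggle to recover fine texture details from sparse low-resolution inputs, mainly because high-frequency information is missing. SRSplat compensates with external high-quality reference images and internal texture cues.

Result: SRSplat outperforms existing methods on RealEstate10K, ACID, and DTU, and shows strong cross-dataset and cross-resolution generalization.

Insight: Combining external reference information with internal texture cues effectively improves detail recovery in 3D reconstruction from sparse low-resolution images.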

Abstract: Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from RGFE. To further refine predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.

[29] DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging cs.CV | cs.AIPDF

Huimin Cheng, Xiaowei Yu, Shushan Wu, Luyang Fang, Chao Cao

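TL;DR: DCMM-Transformer is a new ViT architecture for medical image analysis that adds a Degree-Corrected Mixed-Membership model as a bias in self-attention, addressing standard ViTs' inability to exploit latent anatomical groupings.

Details

Motivation: Standard ViTs do not fully exploit the latent anatomical groupings in medical images (such as organs, tissues, and pathological regions). Existing methods such as SBM-Transformer try to introduce structure through stochastic binary masking, but suffer from non-differentiability, training instability, and an inability to model complex community structure.

Result: Experiments on diverse medical imaging datasets show that DCMM-Transformer offers superior performance and generalization, while the learned group structure and structured attention modulation substantially improve interpretability.

Insight: Modeling community structure and degree heterogeneity in medical images in a differentiable way both improves performance and makes attention more interpretable, providing a new tool for complex medical image analysis.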

Abstract: Medical images exhibit latent anatomical groupings, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit. While recent methods such as SBM-Transformer attempt to incorporate such structures through stochastic binary masking, they suffer from non-differentiability, training instability, and the inability to model complex community structure. We present DCMM-Transformer, a novel ViT architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.
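
To make "an additive bias in self-attention" concrete, the sketch below computes a degree-corrected mixed-membership affinity of the textbook form θ_i θ_j π_iᵀ P π_j and adds it to raw attention scores before the softmax. How the paper scales or parameterizes this bias is not stated in the abstract, so adding the affinity directly is an assumption, and all values are toy numbers.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dcmm_affinity(theta, pi, P):
    """Affinity[i][j] = theta_i * theta_j * pi_i^T P pi_j.

    theta: per-token degree parameters; pi: per-token community memberships;
    P: community-to-community connectivity matrix.
    """
    n, k = len(pi), len(P)
    return [
        [
            theta[i] * theta[j]
            * sum(pi[i][a] * P[a][b] * pi[j][b] for a in range(k) for b in range(k))
            for j in range(n)
        ]
        for i in range(n)
    ]


def biased_attention(scores, bias):
    """Row-wise softmax(scores + bias): an additive, fully differentiable bias."""
    return [softmax([s + b for s, b in zip(s_row, b_row)])
            for s_row, b_row in zip(scores, bias)]


theta = [1.0, 1.0, 1.0]
pi = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # tokens 0 and 1 share a community
P = [[1.0, 0.1], [0.1, 1.0]]
bias = dcmm_affinity(theta, pi, P)
attention = biased_attention([[0.0] * 3 for _ in range(3)], bias)
```

With uniform raw scores, token 0 attends more to token 1 (same community) than to token 2. Unlike a sampled binary mask, every step here is a smooth function of θ, π, and P, which is the differentiability argument made in the abstract.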

[30] PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling cs.CV | cs.AI | cs.DCPDF

Sijie Wang, Qiang Wang, Shaohuai Shi

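TL;DR: PipeDiT accelerates diffusion-transformer-based video generation through task pipelining and model decoupling, substantially increasing inference speed and reducing memory consumption.

Details

Motivation: Diffusion transformer (DiT) based video generation models suffer from slow inference and high memory consumption, which limits practical deployment.

Result: On 8-GPU systems, PipeDiT achieves 1.06x to 4.02x speedups over existing frameworks.

Insight: Task pipelining and module decoupling are effective strategies for making video generation more efficient, especially for resource-intensive workloads.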

Abstract: Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.

[31] MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity cs.CV | cs.AI | cs.DBPDF

Zhichen Lai, Hua Lu, Huan Li, Jialiang Li, Christian S. Jensen

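TL;DR: The paper proposes MovSemCL, a movement-semantics contrastive learning framework for trajectory similarity computation that addresses existing methods' shortcomings in semantic modeling, computational efficiency, and data augmentation.

Details

Motivation: Existing learning-based trajectory similarity methods have three main problems: (1) insufficient modeling of trajectory semantics and hierarchy, lacking movement dynamics extraction and multi-scale representation; (2) high computational cost from point-wise encoding; and (3) augmentations that break the physical plausibility of trajectory semantics.

Result: MovSemCL performs well on real-world datasets: mean ranks in similarity search are close to the ideal value of 1, heuristic approximation improves by up to 20.3%, and inference latency drops by up to 43.4%.

Insight: Movement semantics and multi-scale attention markedly improve the efficiency and effectiveness of trajectory modeling, and the curvature-guided augmentation strategy offers a new approach to trajectory data augmentation.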

Abstract: Trajectory similarity computation is a fundamental functionality that is used for, e.g., clustering, prediction, and anomaly detection. However, existing learning-based methods exhibit three key limitations: (1) insufficient modeling of trajectory semantics and hierarchy, lacking both movement dynamics extraction and multi-scale structural representation; (2) high computational costs due to point-wise encoding; and (3) use of physically implausible augmentations that distort trajectory semantics. To address these issues, we propose MovSemCL, a movement-semantics contrastive learning framework for trajectory similarity computation. MovSemCL first transforms raw GPS trajectories into movement-semantics features and then segments them into patches. Next, MovSemCL employs intra- and inter-patch attentions to encode local as well as global trajectory patterns, enabling efficient hierarchical representation and reducing computational costs. Moreover, MovSemCL includes a curvature-guided augmentation strategy that preserves informative segments (e.g., turns and intersections) and masks redundant ones, generating physically plausible augmented views. Experiments on real-world datasets show that MovSemCL is capable of outperforming state-of-the-art methods, achieving mean ranks close to the ideal value of 1 at similarity search tasks and improvements by up to 20.3% at heuristic approximation, while reducing inference latency by up to 43.4%.
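
The curvature-guided augmentation described here can be illustrated with a small sketch: score each point by how sharply the heading changes there, keep the sharpest turns, and drop the rest. This is only an assumed reading of the idea. It works on planar coordinates instead of raw GPS, uses the turning angle as a stand-in for curvature, and `keep_ratio` is a hypothetical parameter.

```python
import math


def turning_angles(points):
    """Absolute heading change (radians) at each point; endpoints get 0."""
    angles = [0.0]
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        heading_in = math.atan2(y1 - y0, x1 - x0)
        heading_out = math.atan2(y2 - y1, x2 - x1)
        # Wrap the difference into [-pi, pi) before taking the magnitude.
        delta = (heading_out - heading_in + math.pi) % (2 * math.pi) - math.pi
        angles.append(abs(delta))
    angles.append(0.0)
    return angles


def curvature_guided_mask(points, keep_ratio=0.5):
    """Keep the highest-turning points plus both endpoints; mask the others."""
    angles = turning_angles(points)
    n_keep = max(2, round(keep_ratio * len(points)))
    ranked = sorted(range(len(points)), key=lambda i: angles[i], reverse=True)
    keep = set(ranked[:n_keep]) | {0, len(points) - 1}
    return [p for i, p in enumerate(points) if i in keep]


# An L-shaped track: straight east, then a 90-degree turn north at (3, 0).
track = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
angles = turning_angles(track)
kept = curvature_guided_mask(track, keep_ratio=0.5)
```

The corner survives the masking while points on the straight runs are candidates for removal, so the augmented view keeps the overall shape of the route.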

[32] DCA-LUT: Deep Chromatic Alignment with 5D LUT for Purple Fringing Removal cs.CV | eess.IVPDF

Jialang Lu, Shuning Sun, Pu Wang, Chen Wu, Feng Gao

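TL;DR: The paper proposes DCA-LUT, a deep-learning method for purple fringing removal that extracts a purple fringe channel with a chromatic-aware coordinate transformation module and applies a 5D look-up table for efficient non-linear color mapping.

Details

Motivation: Purple fringing is an image artifact caused by longitudinal chromatic aberration in lenses; traditional methods rely on expensive hardware or handcrafted feature extraction. The authors aim to solve the problem in a data-driven way and fill this gap.

Result: Experiments on synthetic and real-world datasets show that DCA-LUT achieves state-of-the-art performance in purple fringing removal.

Insight: Data-driven methods with physically inspired design (such as chromatic separation) can effectively handle image artifacts that are hard to address with traditional hardware or handcrafted approaches.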

Abstract: Purple fringing, a persistent artifact caused by Longitudinal Chromatic Aberration (LCA) in camera lenses, has long degraded the clarity and realism of digital imaging. Traditional solutions rely on complex and expensive apochromatic (APO) lens hardware and the extraction of handcrafted features, ignoring the data-driven approach. To fill this gap, we introduce DCA-LUT, the first deep learning framework for purple fringing removal. Inspired by the physical root of the problem, the spatial misalignment of RGB color channels due to lens dispersion, we introduce a novel Chromatic-Aware Coordinate Transformation (CA-CT) module, learning an image-adaptive color space to decouple and isolate fringing into a dedicated dimension. This targeted separation allows the network to learn a precise "purple fringe channel", which then guides the accurate restoration of the luminance channel. The final color correction is performed by a learned 5D Look-Up Table (5D LUT), enabling efficient and powerful non-linear color mapping. To enable robust training and fair evaluation, we constructed a large-scale synthetic purple fringing dataset (PF-Synth). Extensive experiments in synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in purple fringing removal.

[33] Learning to Hear by Seeing: It’s Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound cs.CVPDF

Dengming Zhang, Weitao You, Jingxiong Li, Weishen Lin, Wenda Shi

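TL;DR: The paper proposes VAEmotionLLM, a two-stage framework that gives a vision-language model (VLM) cross-modal emotion understanding with only limited audio pretraining, and performs strongly on an art emotion benchmark.

Details

Motivation: Existing large language models (LLMs) lack cross-modal capability in emotion understanding, especially for the joint expression of sight and sound. In addition, current Audio-Visual Language Models (AVLMs) require large-scale audio pretraining, which limits scalability.

Result: On ArtEmoBenchmark, VAEmotionLLM outperforms audio-only, visual-only, and audio-visual baselines.

Insight: Vision-guided audio alignment and a lightweight emotion adapter enable efficient cross-modal emotion understanding without large-scale audio pretraining.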

Abstract: Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized audio-video clips, enabling hearing without a large audio dataset. In Stage 2, a lightweight Cross-Modal Emotion Adapter (EmoAdapter), composed of the Emotion Enhancer and the Emotion Supervisor, injects emotion-sensitive residuals and applies emotion supervision to enhance cross-modal emotion understanding. We also construct ArtEmoBenchmark, an art-centric emotion benchmark that evaluates content and emotion understanding under audio-only, visual-only, and audio-visual inputs. VAEmotionLLM achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show that the proposed components are complementary.

[34] Point Cloud Quantization through Multimodal Prompting for 3D Understanding cs.CVPDF

Hongxuan Li, Wencheng Zhu, Huiying Xu, Xinzhong Zhu, Pengfei Zhu

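TL;DR: The paper proposes a multimodal-prompting-based point cloud quantization framework that uses text embeddings as prototype priors and a quantization space constrained by compactness and separation to jointly encode geometric and semantic information, improving point cloud understanding.

Details

Motivation: Conventional vector quantization methods fall short in codebook design and representativeness, especially for point cloud analysis. The success of multimodal alignment in vision-language models suggests that the visual semantics of text embeddings can improve quantization.

Result: Experiments on ModelNet40 and ScanObjectNN show that the method performs strongly on point cloud analysis tasks.

Insight: Multimodal information (such as text embeddings) can significantly improve point cloud quantization, and compactness and separation constraints help optimize the representational capacity of the quantization space.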

Abstract: Vector quantization has emerged as a powerful tool in large-scale multimodal models, unifying heterogeneous representations through discrete token encoding. However, its effectiveness hinges on robust codebook design. Current prototype-based approaches relying on trainable vectors or clustered centroids fall short in representativeness and interpretability, even as multimodal alignment demonstrates its promise in vision-language models. To address these limitations, we propose a simple multimodal prompting-driven quantization framework for point cloud analysis. Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics through many-to-one contrastive alignment, naturally serving as robust prototype priors; and 2) Multimodal prompts enable adaptive refinement of these prototypes, effectively mitigating vision-language semantic gaps. The framework introduces a dual-constrained quantization space, enforced by compactness and separation regularization, which seamlessly integrates visual and prototype features, resulting in hybrid representations that jointly encode geometric and semantic information. Furthermore, we employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity. Extensive experiments on the ModelNet40 and ScanObjectNN datasets clearly demonstrate the superior effectiveness of the proposed method.
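
The Gumbel-Softmax relaxation mentioned at the end of this abstract replaces a hard codebook choice with a temperature-controlled soft assignment. The pure-Python sketch below shows the standard sampling rule: add Gumbel noise to the logits, divide by the temperature, and apply a softmax. The logits and temperatures are illustrative and are not values from the paper.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed one-hot sample; smaller tau gives sharper samples."""
    # Gumbel(0, 1) noise via inverse transform; clamp to avoid log(0).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    perturbed = [(logit + g) / tau for logit, g in zip(logits, noise)]
    peak = max(perturbed)
    exps = [math.exp(v - peak) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)                 # seeded for repeatability
codebook_logits = [2.0, 0.0, -2.0]     # hypothetical affinities to three prototypes
soft_assignment = gumbel_softmax(codebook_logits, tau=0.5, rng=rng)
```

Each sample is a probability vector over codebook entries, so gradients can flow through the assignment. Lowering `tau` pushes samples toward one-hot vectors, which is how the relaxation can stay close to a sparse, discrete quantization.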

[35] Supervised Multilabel Image Classification Using Residual Networks with Probabilistic Reasoning cs.CV

Lokender Singh, Saksham Kumar, Chandan Kumar

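TL;DR: This paper proposes a multilabel image classification method that combines probabilistic reasoning with a modified ResNet-101 architecture, improving classification performance on the COCO-2014 dataset.

Details

Motivation: Multilabel image classification is widely needed in computer vision applications, but existing methods handle label dependencies and uncertainty poorly. The work therefore aims to improve classification performance through probabilistic reasoning.

Result: The model outperforms the ResNet-SRN and Vision Transformer baselines, reaching an mAP of 0.794.

Insight: Integrating probabilistic reasoning into deep learning models can effectively address the challenges of multilabel scenarios and offers a direction for similar tasks.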

Abstract: Multilabel image categorization has drawn interest recently because of its numerous computer vision applications. The proposed work introduces a novel method for classifying multilabel images using the COCO-2014 dataset and a modified ResNet-101 architecture. By simulating label dependencies and uncertainties, the approach uses probabilistic reasoning to improve prediction accuracy. Extensive tests show that the model outperforms earlier techniques and approaches state-of-the-art results in multilabel categorization. The work also assesses the model’s performance using metrics such as precision-recall score and achieves 0.794 mAP on COCO-2014, outperforming ResNet-SRN (0.771) and Vision Transformer baselines (0.785). The novelty of the work lies in integrating probabilistic reasoning into deep learning models to address the challenges presented by multilabel scenarios.
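For readers unfamiliar with the mAP figure quoted above, the sketch below shows one common way to compute mean average precision for multilabel classification: average precision per class over a ranked list, then the mean across classes. The paper's exact evaluation protocol is not stated in the abstract, so this is an assumed, generic definition, and the scores and labels are made-up toy values.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of precision at each positive hit."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_class):
    """Mean of per-class average precision; per_class is a list of (scores, labels)."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)

# Toy example: four images, two classes (values are illustrative only).
class_a = ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
class_b = ([0.2, 0.7, 0.6, 0.4], [0, 1, 1, 0])
toy_map = mean_average_precision([class_a, class_b])
```

For `class_a` the positives land at ranks 1 and 3, giving precisions 1 and 2/3, so its average precision is 5/6; `class_b` ranks both positives first and scores 1.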

[36] SemanticStitch: Enhancing Image Coherence through Foreground-Aware Seam Carving cs.CV

Ji-Ping Jin, Chen-Bin Feng, Rui Fan, Chi-Man Vong

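TL;DR: SemanticStitch is a deep learning-based image stitching method that incorporates semantic priors of foreground objects to improve visual coherence.

Details

Motivation: Traditional stitching techniques ignore semantic information, which breaks the continuity of foreground objects and degrades visual quality.

Result: Experiments show that SemanticStitch substantially improves stitching quality and coherence over traditional methods.

Insight: Semantic information matters for image stitching, particularly for keeping foreground objects intact.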

Abstract: Image stitching often faces challenges due to varying capture angles, positional differences, and object movements, leading to misalignments and visual discrepancies. Traditional seam carving methods neglect semantic information, causing disruptions in foreground continuity. We introduce SemanticStitch, a deep learning-based framework that incorporates semantic priors of foreground objects to preserve their integrity and enhance visual coherence. Our approach includes a novel loss function that emphasizes the semantic integrity of salient objects, significantly improving stitching quality. We also present two specialized real-world datasets to evaluate our method’s effectiveness. Experimental results demonstrate substantial improvements over traditional techniques, providing robust support for practical applications.

[37] Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillation cs.CV

Shuhan Ye, Yi Yu, Qixin Zhang, Chenqi Kong, Qiangqiang Wu

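TL;DR: This paper proposes PACE, a framework that uses dataset distillation to sharply reduce the cost and time of training SNNs while maintaining high performance.

Details

Motivation: Combining event cameras with SNNs promises high energy efficiency, but the high cost of SNN training limits practical use. PACE aims to address this.

Result: PACE performs well on several datasets. On N-MNIST it reaches 84.4% accuracy while cutting training time by more than 50× and storage cost by 6000×.

Insight: Dataset distillation can markedly improve SNN training efficiency, with especially strong results on dynamic event streams and at low IPC.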

Abstract: Event cameras sense brightness changes and output binary asynchronous event streams, attracting increasing attention. Their bio-inspired dynamics align well with spiking neural networks (SNNs), offering a promising energy-efficient alternative to conventional vision systems. However, SNNs remain costly to train due to temporal coding, which limits their practical deployment. To alleviate the high training cost of SNNs, we introduce PACE (Phase-Aligned Condensation for Events), the first dataset distillation framework for SNNs and event-based vision. PACE distills a large training dataset into a compact synthetic one that enables fast SNN training, which is achieved by two core modules: ST-DSM and PEQ-N. ST-DSM uses residual membrane potentials to densify spike-based features (SDR) and to perform fine-grained spatiotemporal matching of amplitude and phase (ST-SM), while PEQ-N provides a plug-and-play straight-through probabilistic integer quantizer compatible with standard event-frame pipelines. Across the DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, PACE outperforms existing coreset selection and dataset distillation baselines, with particularly strong gains on dynamic event streams and at low or moderate IPC. Specifically, on N-MNIST, it achieves 84.4% accuracy, about 85% of the full training set performance, while reducing training time by more than 50× and storage cost by 6000×, yielding compact surrogates that enable minute-scale SNN training and efficient edge deployment.

[38] Sparse by Rule: Probability-Based N:M Pruning for Spiking Neural Networks cs.CV

Shuhan Ye, Yi Yu, Qixin Zhang, Chenqi Kong, Qiangqiang Wu

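TL;DR: This paper proposes SpikeNM, a probability-based N:M sparse pruning framework designed for spiking neural networks (SNNs), which uses semi-structured pruning to balance high sparsity with hardware friendliness.

Details

Motivation: The event-driven sparse computation of SNNs promises high energy efficiency, but high parameter counts and computational cost make them hard to deploy on edge devices. Existing pruning methods are either hard to accelerate (unstructured) or insufficiently flexible (structured).

Result: At 2:4 sparsity, SpikeNM maintains or even improves performance on mainstream datasets while producing hardware-friendly sparsity patterns.

Insight: Combining neuroscience-inspired EID with semi-structured pruning enables stable search under high sparsity while balancing performance and hardware compatibility.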

Abstract: Brain-inspired spiking neural networks (SNNs) promise energy-efficient intelligence via event-driven, sparse computation, but deeper architectures inflate parameters and computational cost, hindering their edge deployment. Recent progress in SNN pruning helps alleviate this burden, yet existing efforts fall into only two families: unstructured pruning, which attains high sparsity but is difficult to accelerate on general hardware, and structured pruning, which eases deployment but lacks flexibility and often degrades accuracy at matched sparsity. In this work, we introduce SpikeNM, the first SNN-oriented semi-structured ($N{:}M$) pruning framework that learns sparse SNNs from scratch, enforcing at most $N$ non-zeros per $M$-weight block. To avoid the combinatorial space complexity $\sum_{k=1}^{N}\binom{M}{k}$ growing exponentially with $M$, SpikeNM adopts an $M$-way basis-logit parameterization with a differentiable top-$k$ sampler, linearizing per-block complexity to $\mathcal{O}(M)$ and enabling more aggressive sparsification. Further inspired by neuroscience, we propose eligibility-inspired distillation (EID), which converts temporally accumulated credits into block-wise soft targets to align mask probabilities with spiking dynamics, reducing sampling variance and stabilizing search under high sparsity. Experiments show that at $2{:}4$ sparsity, SpikeNM maintains accuracy and even yields gains across mainstream datasets, while producing hardware-amenable patterns that complement intrinsic spike sparsity.
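To make the N:M constraint and the combinatorial count in the abstract concrete, here is a minimal sketch. It uses plain magnitude-based selection as a stand-in, which is an assumption for illustration only: SpikeNM learns its masks through logits and a differentiable top-k sampler, which is not reproduced here. The function names and the example weights are hypothetical.

```python
from math import comb

def nm_mask(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every block of m consecutive weights."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Indices of the n entries with the largest absolute value in this block.
        keep = sorted(range(m), key=lambda i: abs(block[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask

def patterns_per_block(n, m):
    """Number of masks with between 1 and n non-zeros in a block of m weights."""
    return sum(comb(m, k) for k in range(1, n + 1))

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
mask = nm_mask(weights, n=2, m=4)      # [1, 0, 0, 1, 0, 1, 1, 0]
count = patterns_per_block(2, 4)       # C(4,1) + C(4,2) = 10
```

Enumerating every admissible pattern costs `patterns_per_block(n, m)` entries per block, whereas a parameterization with one logit per position needs only `m`, which is the linearization the abstract refers to.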

[39] DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT cs.CV

Xianhao Zhou, Jianghao Wu, Ku Zhao, Jinlong He, Huangxuan Zhao

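TL;DR: The paper proposes a DINOv3-Guided Cross Fusion (DGCF) framework for generating semantic-aware CT images from MRI and CBCT. It combines the global semantic understanding of a Transformer with the local features of a CNN and uses a multi-level DINOv3 perceptual loss to improve semantic similarity.

Details

Motivation: Existing CNN models lack global semantic understanding, while Transformers tend to overfit small medical datasets. A method that combines the strengths of both is needed to improve synthetic CT quality.

Result: On the SynthRAD2023 pelvic dataset, DGCF achieves the best performance on both the MRI→CT and CBCT→CT tasks in terms of MS-SSIM, PSNR, and segmentation-based metrics.

Insight: Features from a self-supervised Transformer can guide medical image generation and improve semantic awareness; cross-modal fusion is an effective way to address the imbalance between global and local features.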

Abstract: Generating synthetic CT images from CBCT or MRI has potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses the global representations of the Transformer and the local features of the CNN via a learnable cross fusion module, achieving balanced local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between synthetic CT and the ground truth in DINOv3’s feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieves state-of-the-art performance in terms of MS-SSIM, PSNR and segmentation-based metrics on both MRI$\rightarrow$CT and CBCT$\rightarrow$CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at https://github.com/HiLab-git/DGCF.

[40] Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models cs.CV

Tianle Cheng, Zeyan Zhang, Kaifeng Gao, Jun Xiao

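TL;DR: The paper proposes Adaptive Begin-of-Video tokens (ada-BOV) and a refined denoising strategy for autoregressive video diffusion models, targeting consistency and motion quality in long video generation. Experiments show strong results across multiple metrics.

Details

Motivation: Existing autoregressive video diffusion models suffer from weak consistency and poor motion dynamics when generating long videos, and stream denoising methods in particular perform poorly. The paper aims for a method that preserves global consistency while adapting flexibly to dynamic scenes.

Result: Experiments show that the method performs well on multiple metrics and markedly improves the consistency and motion quality of long generated videos.

Insight: Adaptive BOV tokens and refined stream denoising are key to improving video diffusion models, and the design of the noise schedule matters for model training.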

Abstract: Recent advancements in diffusion-based video generation have produced impressive, high-fidelity short videos. To extend these successes to coherent long videos, most video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent frames conditioned on previous ones. There are two primary paradigms: chunk-based extension and stream denoising. The former directly concatenates previous clean frames as conditioning, suffering from denoising latency and error accumulation. The latter maintains a denoising sequence with monotonically increasing noise levels. In each denoising iteration, one clean frame is produced while a new pure-noise frame is simultaneously appended, enabling live-stream sampling. However, it struggles with fragile consistency and poor motion dynamics. In this paper, we propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs. The BOV tokens are special learnable embeddings in VDMs. They adaptively absorb denoised preceding frames via an adaptive-layer-norm-like modulation. This design preserves global consistency while allowing for flexible conditioning in dynamic scenarios. To ensure the quality of local dynamics essential in modulating BOV tokens, we further propose a refinement strategy for stream denoising. It decouples the sampling trajectory length from the attention window size constraint, leading to improved local guidance and overall imaging quality. We also propose a disturbance-augmented training noise schedule, which balances convergence speed with model robustness for stream denoising. Extensive experiments demonstrate that our method achieves compelling qualitative and quantitative results across multiple metrics.
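The stream denoising paradigm described in the abstract can be illustrated with a small bookkeeping simulation. It tracks only integer noise levels, with no actual diffusion model involved; the window size, the one-level-per-iteration step, and the function name are assumptions chosen to keep the sketch minimal.

```python
from collections import deque

def stream_denoise(num_outputs, window=4):
    """Simulate noise-level bookkeeping: level 0 is clean, level `window` is pure noise."""
    # The window starts with monotonically increasing noise levels 1..window.
    queue = deque((frame_id, frame_id + 1) for frame_id in range(window))
    next_id = window
    emitted = []
    while len(emitted) < num_outputs:
        # One denoising iteration lowers every frame's noise level by one step.
        queue = deque((fid, level - 1) for fid, level in queue)
        # The head of the window is now clean and is emitted.
        fid, level = queue.popleft()
        if level != 0:
            raise RuntimeError("head frame should be clean after the iteration")
        emitted.append(fid)
        # A new pure-noise frame is appended at the tail.
        queue.append((next_id, window))
        next_id += 1
    return emitted, [level for _, level in queue]

frames, levels = stream_denoise(num_outputs=3, window=4)
# frames == [0, 1, 2]; levels == [1, 2, 3, 4]
```

In this simplified picture the number of denoising steps each frame receives equals the window size, which is the coupling between sampling trajectory length and attention window that the abstract says the refinement strategy removes.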

[41] Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection cs.CV

Tianxiang Zhang, Peipeng Yu, Zhihua Xia, Longchen Dai, Xiaoyu Zhou

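TL;DR: This paper proposes DFF-Adapter, which adapts DINOv2 with lightweight multi-head LoRA modules and dual-task supervision, improving both the performance and the parameter efficiency of deepfake detection.

Details

Motivation: Current deepfake detection methods usually reduce the problem to binary classification and ignore the distinct characteristics of different forgery methods. This limits sensitivity to forgery artifacts and hurts detection performance.

Result: With only 3.5M trainable parameters, DFF-Adapter matches or surpasses current complex state-of-the-art methods.

Insight: Using fine-grained forgery classification to inform authenticity detection can improve deepfake detection accuracy while keeping parameter use efficient.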

Abstract: The proliferation of sophisticated deepfakes poses significant threats to information integrity. While DINOv2 shows promise for detection, existing fine-tuning approaches treat it as generic binary classification, overlooking distinct artifacts inherent to different deepfake methods. To address this, we propose a DeepFake Fine-Grained Adapter (DFF-Adapter) for DINOv2. Our method incorporates lightweight multi-head LoRA modules into every transformer block, enabling efficient backbone adaptation. DFF-Adapter simultaneously addresses authenticity detection and fine-grained manipulation type classification, where classifying forgery methods enhances artifact sensitivity. We introduce a shared branch propagating fine-grained manipulation cues to the authenticity head. This enables multi-task cooperative optimization, explicitly enhancing authenticity discrimination with manipulation-specific knowledge. Utilizing only 3.5M trainable parameters, our parameter-efficient approach achieves detection accuracy comparable to or even surpassing that of current complex state-of-the-art methods.
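As background for the LoRA modules mentioned in the abstract, the sketch below shows the basic low-rank update for a single linear layer: the frozen weight `W` is left untouched and a trainable product `A @ B` is added to its output. It illustrates one generic LoRA head only; the multi-head arrangement and the shared branch of DFF-Adapter are not reproduced, and all names and values are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute x @ W + scale * (x @ A) @ B, with W frozen and A, B trainable."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, update)]

x = [[1.0, 2.0]]                       # one input token with 2 features
w_frozen = [[1.0, 0.0], [0.0, 1.0]]    # frozen 2x2 weight (identity for clarity)
lora_a = [[1.0], [1.0]]                # 2x1 down-projection (rank 1)
lora_b = [[0.5, -0.5]]                 # 1x2 up-projection
adapted = lora_forward(x, w_frozen, lora_a, lora_b)   # [[2.5, 0.5]]
```

Because only `lora_a` and `lora_b` are trained, the number of trainable parameters grows with the rank rather than with the full weight matrix, which is the kind of saving behind a small trainable-parameter budget.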

[42] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images cs.CV | cs.AI

Qinyue Tong, Ziqian Lu, Jun Liu, Rui Zuo, Zheming Lu

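TL;DR: MEMR-Seg is a new medical image segmentation task that supports multi-round, entity-level reasoning segmentation. The paper introduces the MediRound model and the MR-MedSeg dataset, and uses a Judgment & Correction Mechanism to reduce error propagation, outperforming conventional methods.

Details

Motivation: Most existing medical image segmentation methods are task-specific and lack interactivity. Text-prompt-based segmentation improves user-driven and reasoning-based segmentation but remains limited to single-round dialogue and cannot perform multi-round reasoning.

Result: Experiments show that the method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

Insight: Multi-round reasoning and entity-level interaction are key to more flexible and accurate medical image segmentation, and a lightweight correction mechanism can substantially reduce error propagation.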

Abstract: Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

[43] RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving cs.CV

Ruiqi Cheng, Huijun Di, Jian Li, Feng Liu, Wei Liang

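TL;DR: RadarMP jointly models radar target detection and motion estimation, using 4D mmWave radar signals to achieve precise 3D scene motion perception for full-scenario autonomous driving systems.

Details

Motivation: Precise 3D scene motion perception is critical to the safety and reliability of autonomous driving systems. 4D mmWave radar has become an important component thanks to its all-weather capability and distinctive sensing properties, but its sparse and noisy data often lead to imprecise motion perception.

Result: Experiments on a public dataset show that RadarMP achieves reliable 3D scene motion perception across diverse weather and illumination conditions and outperforms existing decoupled radar-based motion perception methods.

Insight: Through joint modeling and self-supervised losses guided by Doppler shifts and echo intensity, RadarMP overcomes the sparsity and noise of radar data and offers a more robust motion perception solution for autonomous driving systems.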

Abstract: Accurate 3D scene motion perception significantly enhances the safety and reliability of an autonomous driving system. Benefiting from its all-weather operational capability and unique perceptual properties, 4D mmWave radar has emerged as an essential component in advanced autonomous driving. However, sparse and noisy radar points often lead to imprecise motion perception, leaving autonomous vehicles with limited sensing capabilities when optical sensors degrade under adverse weather conditions. In this paper, we propose RadarMP, a novel method for precise 3D scene motion perception using low-level radar echo signals from two consecutive frames. Unlike existing methods that separate radar target detection and motion estimation, RadarMP jointly models both tasks in a unified architecture, enabling consistent radar point cloud generation and pointwise 3D scene flow prediction. Tailored to radar characteristics, we design specialized self-supervised loss functions guided by Doppler shifts and echo intensity, effectively supervising spatial and motion consistency without explicit annotations. Extensive experiments on the public dataset demonstrate that RadarMP achieves reliable motion perception across diverse weather and illumination conditions, outperforming radar-based decoupled motion perception pipelines and enhancing perception capabilities for full-scenario autonomous driving systems.

[44] OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description cs.CV | cs.AI

Quanxing Xu, Ling Zhou, Feifei Zhang, Jinyu Tian, Rubing Huang

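TL;DR: This paper proposes OAD-Promoter, which improves large language models (LLMs) on zero-shot visual question answering (VQA) by reducing language bias and improving robustness to domain shift.

Details

Motivation: LLMs used for VQA rely on massive training data, which leads to language bias and weak out-of-distribution generalization.

Result: Experiments show that OAD-Promoter sets a new state of the art in zero-shot and few-shot VQA.

Insight: Combining global and regional visual information with a knowledge assistance mechanism can improve the robustness and performance of LLMs on VQA.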

Abstract: Large Language Models (LLMs) have become a crucial tool in Visual Question Answering (VQA) for handling knowledge-intensive questions in few-shot or zero-shot scenarios. However, their reliance on massive training datasets often causes them to inherit language biases during the acquisition of knowledge. This limitation imposes two key constraints on existing methods: (1) LLM predictions become less reliable due to bias exploitation, and (2) despite strong knowledge reasoning capabilities, LLMs still struggle with out-of-distribution (OOD) generalization. To address these issues, we propose Object Attribute Description Promoter (OAD-Promoter), a novel approach for enhancing LLM-based VQA by mitigating language bias and improving domain-shift robustness. OAD-Promoter comprises three components: the Object-concentrated Example Generation (OEG) module, the Memory Knowledge Assistance (MKA) module, and the OAD Prompt. The OEG module generates global captions and object-concentrated samples, jointly enhancing visual information input to the LLM and mitigating bias through complementary global and regional visual cues. The MKA module assists the LLM in handling OOD samples by retrieving relevant knowledge from stored examples to support questions from unseen domains. Finally, the OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference. Experiments demonstrate that OAD-Promoter significantly improves the performance of LLM-based VQA methods in few-shot or zero-shot settings, achieving new state-of-the-art results.

[45] Compression and Inference of Spiking Neural Networks on Resource-Constrained Hardware cs.CV

Karol C. Jurzec, Tomasz Szydlo, Maciej Wielgosz

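TL;DR: The paper presents a lightweight C runtime for deploying spiking neural networks (SNNs) on resource-constrained edge devices, with data layout optimizations and sparsity-based pruning that improve speed and memory efficiency.

Details

Motivation: The event-driven nature of SNNs gives them advantages in energy efficiency and temporal processing, but training and deploying them on resource-constrained hardware remains challenging. The paper aims to make SNNs practical on embedded platforms.

Result: On the N-MNIST and ST-MNIST datasets the runtime achieves functional parity with the Python baseline, with roughly 10× speedups on a desktop CPU, and feasibility is demonstrated on a microcontroller (Arduino Portenta H7).

Insight: With an optimized runtime and model compression, SNNs can run efficiently on conventional embedded platforms, showing their potential for edge computing.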

Abstract: Spiking neural networks (SNNs) communicate via discrete spikes in time rather than continuous activations. Their event-driven nature offers advantages for temporal processing and energy efficiency on resource-constrained hardware, but training and deployment remain challenging. We present a lightweight C-based runtime for SNN inference on edge devices and optimizations that reduce latency and memory without sacrificing accuracy. Trained models exported from SNNTorch are translated to a compact C representation; static, cache-friendly data layouts and preallocation avoid interpreter and allocation overheads. We further exploit sparse spiking activity to prune inactive neurons and synapses, shrinking computation in upstream convolutional layers. Experiments on N-MNIST and ST-MNIST show functional parity with the Python baseline while achieving ~10× speedups on a desktop CPU and additional gains with pruning, together with large memory reductions that enable microcontroller deployment (Arduino Portenta H7). Results indicate that SNNs can be executed efficiently on conventional embedded platforms when paired with an optimized runtime and spike-driven model compression. Code: https://github.com/karol-jurzec/snn-generator/

[46] MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering cs.CV

Seokwon Song, Minsu Park, Gunhee Kim

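TL;DR: MAVIS is a new benchmark for multimodal source attribution in long-form visual question answering (VQA), aimed at improving the reliability of AI-generated answers. Its answers are annotated with citations to multimodal evidence, and it evaluates how well models understand user intent, retrieve evidence, and generate long-form answers with citations.

Details

Motivation: Existing work focuses mainly on text-only settings and overlooks the role of multimodality in source attribution. MAVIS fills this gap by evaluating multimodal source attribution systems.

Result: The study reports three findings: (1) multimodal RAG yields more informative and fluent answers but weaker groundedness for image documents; (2) informativeness and groundedness trade off across prompting methods; (3) contextual bias in interpreting image documents is a key direction for future research.

Insight: Multimodal source attribution systems handle images and text differently, the choice of prompting method affects the performance trade-off, and future work should focus on reducing contextual bias for images.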

Abstract: Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenarios and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions (informativeness, groundedness, and fluency) and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but they exhibit weaker groundedness for image documents than for text documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research. The dataset and experimental code are available at https://github.com/seokwon99/MAVIS

[47] Breaking the Modality Wall: Time-step Mixup for Efficient Spiking Knowledge Transfer from Static to Event Domain cs.CV

Yuqi Xie, Shuhan Ye, Yi Yu, Chong Wang, Qixin Zhang

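TL;DR: TMKT uses time-step mixup and modality-aware objectives to transfer knowledge efficiently from static RGB data to event-based DVS data, improving SNN classification in the event domain.

Details

Motivation: Because event data are scarce and DVS outputs are sparse, existing methods for transferring knowledge from static RGB to event DVS perform poorly. The main cause is the large distribution gap between modalities, so a more effective way to bridge it is needed.

Result: TMKT performs well on spiking image classification and outperforms existing methods, with ablations confirming the contribution of each module.

Insight: 1. Time-step mixup effectively reduces the modality gap; 2. explicitly aligning temporal features is key to knowledge transfer; 3. lightweight supervision objectives improve training efficiency.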

Abstract: The integration of event cameras and spiking neural networks (SNNs) promises energy-efficient visual intelligence, yet scarce event data and the sparsity of DVS outputs hinder effective training. Prior knowledge transfers from RGB to DVS often underperform because the distribution gap between modalities is substantial. In this work, we present Time-step Mixup Knowledge Transfer (TMKT), a cross-modal training framework with a probabilistic Time-step Mixup (TSM) strategy. TSM exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time steps to produce a smooth curriculum within each sequence, which reduces gradient variance and stabilizes optimization with theoretical analysis. To employ auxiliary supervision from TSM, TMKT introduces two lightweight modality-aware objectives, Modality Aware Guidance (MAG) for per-frame source supervision and Mixup Ratio Perception (MRP) for sequence-level mix ratio estimation, which explicitly align temporal features with the mixing schedule. TMKT enables smoother knowledge transfer, helps mitigate modality mismatch during training, and achieves superior performance in spiking image classification tasks. Extensive experiments across diverse benchmarks and multiple SNN backbones, together with ablations, demonstrate the effectiveness of our method.
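One way to picture the probabilistic Time-step Mixup described in the abstract is the sketch below, which picks, at every time step, either the RGB frame or the DVS frame. The linear schedule, the per-step replacement (rather than a pixel-level blend), and all names are assumptions made for illustration; the abstract does not specify these details. The returned per-step source labels and sequence-level ratio correspond loosely to the kinds of targets that MAG and MRP are described as supervising.

```python
import random

def time_step_mixup(rgb_frames, dvs_frames, rng=random):
    """Mix two equal-length sequences step by step; return frames, sources, ratio."""
    steps = len(dvs_frames)
    if len(rgb_frames) != steps:
        raise ValueError("both sequences need the same number of time steps")
    mixed, sources = [], []
    for t in range(steps):
        # Assumed schedule: the chance of taking the DVS frame rises linearly,
        # reaching 1.0 at the final time step.
        p_dvs = (t + 1) / steps
        use_dvs = rng.random() < p_dvs
        mixed.append(dvs_frames[t] if use_dvs else rgb_frames[t])
        sources.append(1 if use_dvs else 0)   # per-step source label
    ratio = sum(sources) / steps              # sequence-level mix ratio
    return mixed, sources, ratio

rgb = ["rgb0", "rgb1", "rgb2", "rgb3"]
dvs = ["dvs0", "dvs1", "dvs2", "dvs3"]
mixed, sources, ratio = time_step_mixup(rgb, dvs, rng=random.Random(0))
```

With this schedule the sequence tends to begin with RGB frames and always ends on a DVS frame, giving a gradual shift between modalities within a single sequence.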

[48] FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing cs.CV

Kaixiang Yang, Boyang Shen, Xin Li, Yuchen Dai, Yuxuan Luo

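TL;DR: This paper proposes FIA-Edit, an efficient, high-fidelity text-guided image editing framework whose Frequency-Interactive Attention avoids latent inversion while improving background preservation and semantic consistency.

Details

Motivation: Existing flow-based inversion-free methods are efficient but fail to integrate source information effectively, which often leads to poor background preservation, spatial inconsistency, and over-editing. FIA-Edit aims to solve these problems.

Result: Experiments show that FIA-Edit supports high-quality editing at low computational cost (about 6 s per 512 × 512 image) and outperforms existing methods in visual quality, background fidelity, and controllability. It also yields significant gains on a downstream bleeding classification task in medical images.

Insight: Combining frequency interaction with cross-attention improves image editing quality and extends text-guided editing to new application areas.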

Abstract: Text-guided image editing has advanced rapidly with the rise of diffusion models. While flow-based inversion-free methods offer high efficiency by avoiding latent inversion, they often fail to effectively integrate source information, leading to poor background preservation, spatial inconsistencies, and over-editing. In this paper, we present FIA-Edit, a novel inversion-free framework that achieves high-fidelity and semantically precise edits through Frequency-Interactive Attention. Specifically, we design two key components: (1) a Frequency Representation Interaction (FRI) module that enhances cross-domain alignment by exchanging frequency components between source and target features within self-attention, and (2) a Feature Injection (FIJ) module that explicitly incorporates source-side queries, keys, values, and text embeddings into the target branch’s cross-attention to preserve structure and semantics. Extensive experiments demonstrate that FIA-Edit supports high-fidelity editing at low computational cost (~6 s per 512 × 512 image on an RTX 4090) and consistently outperforms existing methods across diverse tasks in visual quality, background fidelity, and controllability. Furthermore, we are the first to extend text-guided image editing to clinical applications. By synthesizing anatomically coherent hemorrhage variations in surgical images, FIA-Edit opens new opportunities for medical data augmentation and delivers significant gains in downstream bleeding classification. Our project is available at: https://github.com/kk42yy/FIA-Edit.

[49] Codebook-Centric Deep Hashing: End-to-End Joint Learning of Semantic Hash Centers and Neural Hash Function cs.CV | cs.LG

Shuo Yin, Zhiyuan Yin, Yuqing Hou, Rui Liu, Yong Chen

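TL;DR: The paper proposes CRH, an end-to-end deep hashing framework that dynamically reassigns hash centers from a preset codebook while jointly optimizing the hash function, avoiding the problems of two-stage methods and improving retrieval performance.

Details

Motivation: Hash center-based deep hashing methods initialize centers randomly, ignoring inter-class semantic relationships, while existing two-stage methods incur extra computational overhead and suboptimal performance because of discrepancies between stages.

Result: On three benchmark datasets, CRH outperforms existing deep hashing methods and learns more semantically meaningful hash centers.

Insight: Dynamically adjusting hash centers and coupling this with hash function training better captures the semantic relationships in the data distribution and improves retrieval.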

Abstract: Hash center-based deep hashing methods improve upon pairwise or triplet-based approaches by assigning fixed hash centers to each class as learning targets, thereby avoiding the inefficiency of local similarity optimization. However, random center initialization often disregards inter-class semantic relationships. While existing two-stage methods mitigate this by first refining hash centers with semantics and then training the hash function, they introduce additional complexity, computational overhead, and suboptimal performance due to stage-wise discrepancies. To address these limitations, we propose Center-Reassigned Hashing (CRH), an end-to-end framework that dynamically reassigns hash centers from a preset codebook while jointly optimizing the hash function. Unlike previous methods, CRH adapts hash centers to the data distribution without explicit center optimization phases, enabling seamless integration of semantic relationships into the learning process. Furthermore, a multi-head mechanism enhances the representational capacity of hash centers, capturing richer semantic structures. Extensive experiments on three benchmarks demonstrate that CRH learns semantically meaningful hash centers and outperforms state-of-the-art deep hashing methods in retrieval tasks.

[50] Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective cs.CV | cs.AI

Wang Luo, Di Wu, Hengyuan Na, Yinlin Zhu, Miao Hu

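TL;DR: This paper proposes a new point cloud completion paradigm, Completion-by-Correction, which starts from a topologically complete shape prior produced by a pretrained image-to-3D model and corrects it in feature space rather than synthesizing the missing structure directly. Its framework, PGNet, performs strongly on the ShapeNetViPC dataset.

Details

Motivation: Conventional methods follow the Completion-by-Inpainting paradigm, which often produces structural inconsistencies and topological artifacts because of limited geometric and semantic constraints. The paper rethinks the task paradigm.

Result: On the ShapeNetViPC dataset, PGNet reduces average Chamfer Distance by 23.5% and improves F-score by 7.1%.

Insight: Shifting from Completion-by-Inpainting to Completion-by-Correction, with a shape prior and feature-space correction, improves the structural consistency and accuracy of point cloud completion.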

Abstract: Point cloud completion aims to reconstruct complete 3D shapes from partial observations, which is a challenging problem due to severe occlusions and missing geometry. Despite recent advances in multimodal techniques that leverage complementary RGB images to compensate for missing geometry, most methods still follow a Completion-by-Inpainting paradigm, synthesizing missing structures from fused latent features. We empirically show that this paradigm often results in structural inconsistencies and topological artifacts due to limited geometric and semantic constraints. To address this, we rethink the task and propose a more robust paradigm, termed Completion-by-Correction, which begins with a topologically complete shape prior generated by a pretrained image-to-3D model and performs feature-space correction to align it with the partial observation. This paradigm shifts completion from unconstrained synthesis to guided refinement, enabling structurally consistent and observation-aligned reconstruction. Building upon this paradigm, we introduce PGNet, a multi-stage framework that conducts dual-feature encoding to ground the generative prior, synthesizes a coarse yet structurally aligned scaffold, and progressively refines geometric details via hierarchical correction. Experiments on the ShapeNetViPC dataset demonstrate the superiority of PGNet over state-of-the-art baselines in terms of average Chamfer Distance (-23.5%) and F-score (+7.1%).
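The Chamfer Distance reported above can be computed as in the following sketch. Conventions differ between papers (squared versus unsquared distances, sums versus means), and the abstract does not say which variant PGNet uses, so the squared, mean-based form below is an assumption, as are the toy point sets.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(pred, target):
    """Symmetric Chamfer Distance using mean squared nearest-neighbour distances."""
    pred_to_target = sum(
        min(squared_distance(p, q) for q in target) for p in pred
    ) / len(pred)
    target_to_pred = sum(
        min(squared_distance(q, p) for p in pred) for q in target
    ) / len(target)
    return pred_to_target + target_to_pred

# Toy example: the prediction misses one point of the target shape.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(pred, target)   # 0 + (0 + 0 + 1) / 3
```

The prediction-to-target term is zero here because every predicted point lies on the target, while the target-to-prediction term penalizes the missing corner; this asymmetry is why both directions are summed.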

[51] MixAR: Mixture Autoregressive Image Generation cs.CV | cs.LG

Jinyuan Hu, Jiayou Zhang, Shaobo Cui, Kun Zhang, Guangyi Chen

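TL;DR: MixAR is an image generation framework that mixes discrete and continuous representations in autoregressive modeling, addressing the inefficiency of modeling in continuous space while improving generation quality.

Details

Motivation: Discrete-token autoregressive methods lose fine-grained information through quantization and limited codebook size, which caps generation quality. Modeling in a continuous latent space improves quality, but the vast, unstructured space of continuous representations makes modeling harder.

Result: Experiments show that DC-Mix strikes a favorable balance between computational efficiency and generation quality, and that TI-Mix brings consistent improvement.

Insight: Mixing discrete and continuous representations in autoregressive modeling addresses the challenges of continuous-space modeling while improving quality; consistency between training and inference is critical to model performance.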

Abstract: Autoregressive (AR) approaches, which represent images as sequences of discrete tokens from a finite codebook, have achieved remarkable success in image generation. However, the quantization process and the limited codebook size inevitably discard fine-grained information, placing bottlenecks on fidelity. Motivated by this limitation, recent studies have explored autoregressive modeling in continuous latent spaces, which offers higher generation quality. Yet, unlike discrete tokens constrained by a fixed codebook, continuous representations lie in a vast and unstructured space, posing significant challenges for efficient autoregressive modeling. To address these challenges, we introduce MixAR, a novel framework with a factorized formulation that uses mixture training paradigms to inject discrete tokens as prior guidance for continuous AR modeling. We investigate several discrete-continuous mixture strategies, including self-attention (DC-SA), cross-attention (DC-CA), and a simple approach (DC-Mix) that replaces homogeneous mask tokens with informative discrete counterparts. Moreover, to bridge the gap between ground-truth training tokens and inference tokens produced by the pre-trained AR model, we propose Training-Inference Mixture (TI-Mix) to achieve consistent training and generation distributions. Our experiments show that DC-Mix strikes a favorable balance between computational efficiency and generation fidelity and that TI-Mix brings consistent improvement.

Aditi Bhalla, Christian Hellert, Enkelejda Kasneci

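TL;DR: The paper proposes a two-phase cross-view, cross-modal unsupervised domain adaptation framework for driver monitoring systems that addresses viewpoint changes and domain shift, substantially improving driver activity recognition accuracy.

Details

Motivation: Driver distraction is a leading cause of traffic accidents, but existing deep learning-based driver activity recognition methods face viewpoint changes (cross-view) and domain shift (cross-modal) in real deployments. A method that handles both problems together is needed.

Result: The framework improves top-1 accuracy on RGB video data by almost 50% over a supervised contrastive learning-based cross-view method and outperforms domain adaptation-only methods by up to 5%.

Insight: Handling cross-view and cross-modal problems together is key to making driver monitoring systems more robust and scalable.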

Abstract: Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using an information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.

[53] Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation cs.CVPDF

Sujun Sun, Haowen Gu, Cheng Xie, Yanxu Ren, Mingwu Ren

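TL;DR: The paper proposes Hierarchical Semantic Learning (HSL), a framework that addresses semantic granularity gaps in cross-domain few-shot segmentation (CD-FSS). Its Dual Style Randomization (DSR) and Hierarchical Semantic Mining (HSM) modules improve the model's ability to recognize semantics at multiple granularities, and experiments confirm its effectiveness.

Details

Motivation: Existing CD-FSS methods focus mainly on style gaps between source and target domains and ignore segmentation granularity gaps, which leaves novel classes in the target domain insufficiently discriminable. A solution that captures multi-granularity semantic information is needed.

Result: Experiments on four popular target-domain datasets show that the method achieves state-of-the-art performance.

Insight: Semantic granularity gaps cannot be ignored in CD-FSS; multi-granularity feature learning and style simulation significantly improve performance in new domains.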

Abstract: Cross-domain Few-shot Segmentation (CD-FSS) aims to segment novel classes from target domains that are not involved in training and have significantly different data distributions from the source domain, using only a few annotated samples, and recent years have witnessed significant progress on this task. However, existing CD-FSS methods primarily focus on style gaps between source and target domains while ignoring segmentation granularity gaps, resulting in insufficient semantic discriminability for novel classes in target domains. Therefore, we propose a Hierarchical Semantic Learning (HSL) framework to tackle this problem. Specifically, we introduce a Dual Style Randomization (DSR) module and a Hierarchical Semantic Mining (HSM) module to learn hierarchical semantic features, thereby enhancing the model’s ability to recognize semantics at varying granularities. DSR simulates target domain data with diverse foreground-background style differences and overall style variations through foreground and global style randomization respectively, while HSM leverages multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities. Additionally, we also propose a Prototype Confidence-modulated Thresholding (PCMT) module to mitigate segmentation ambiguity when foreground and background are excessively similar. Extensive experiments are conducted on four popular target domain datasets, and the results demonstrate that our method achieves state-of-the-art performance.

[54] OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs cs.CVPDF

Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu

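TL;DR: OmniSparse is a training-aware, fine-grained sparse attention framework for long-video multimodal large language models (MLLMs). It uses dynamic token budget allocation to accelerate both training and inference while optimizing token selection across multiple dimensions.

Details

Motivation: Existing sparse attention methods mainly target inference acceleration, lack training-time optimization, and handle fine-grained selection across queries, key-values (KV), and heads poorly, which leads to suboptimal performance and limited speedups. OmniSparse aims to close this gap.

Result: Experiments show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.

Insight: Dynamic token budget allocation and multi-dimensional fine-grained selection are key to sparse attention performance; future work could explore optimization strategies for different modalities.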

Abstract: Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
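
As a rough illustration of mechanism (2) in the abstract above, the sketch below picks one shared KV budget from the flattest head and applies it to every head. It is not the authors' implementation: measuring flatness by entropy, the `recall` target, and all function names are assumptions made for this example.

```python
import math


def entropy(weights):
    """Shannon entropy of one head's attention distribution (higher = flatter)."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def budget_for_recall(weights, recall):
    """Smallest k such that the top-k attention weights sum to at least `recall`."""
    total = 0.0
    for k, w in enumerate(sorted(weights, reverse=True), start=1):
        total += w
        if total >= recall - 1e-12:
            return k
    return len(weights)


def shared_kv_selection(head_attn, recall=0.9):
    """Pick the budget from the flattest head, then keep that many KV entries per head."""
    flattest = max(head_attn, key=entropy)  # assumption: flatness measured by entropy
    k = budget_for_recall(flattest, recall)
    selections = []
    for weights in head_attn:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        selections.append(sorted(ranked[:k]))
    return k, selections


heads = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
k, selected = shared_kv_selection(heads, recall=0.75)
print(k, selected)
```

Because the flattest head spreads its attention most evenly, it needs the largest budget to reach a given recall, so sharing that budget keeps every sharper head at or above the same recall.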

[55] LSS3D: Learnable Spatial Shifting for Consistent and High-Quality 3D Generation from Single-Image cs.CVPDF

Zhuojiang Cai, Yiheng Zhang, Meitong Guo, Mingdao Wang, Yuwang Wang

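TL;DR: LSS3D proposes learnable spatial shifting to address multi-view inconsistency and non-frontal input views in single-image 3D generation, adjusting per-view parameters to produce high-quality geometry and texture.

Details

Motivation: Current multi-view diffusion-based 3D generation methods suffer from shape and texture misalignment, which degrades geometric detail and texture quality. They are also not robust to non-frontal input views.

Result: Experiments show that LSS3D achieves leading results on both geometric and texture evaluation metrics, particularly with flexible input viewpoints.

Insight: Explicitly optimizing multi-view consistency through learnable spatial shifting is key; using the input view as a constraint markedly improves robustness to non-frontal inputs; a quantitative evaluation pipeline aids performance comparison.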

Abstract: Recently, multi-view diffusion-based 3D generation methods have gained significant attention. However, these methods often suffer from shape and texture misalignment across generated multi-view images, leading to low-quality 3D generation results, such as incomplete geometric details and textural ghosting. Some methods are mainly optimized for the frontal perspective and exhibit poor robustness to oblique perspective inputs. In this paper, to tackle the above challenges, we propose a high-quality image-to-3D approach, named LSS3D, with learnable spatial shifting to explicitly and effectively handle the multiview inconsistencies and non-frontal input view. Specifically, we assign learnable spatial shifting parameters to each view, and adjust each view towards a spatially consistent target, guided by the reconstructed mesh, resulting in high-quality 3D generation with more complete geometric details and clean textures. Besides, we include the input view as an extra constraint for the optimization, further enhancing robustness to non-frontal input angles, especially for elevated viewpoint inputs. We also provide a comprehensive quantitative evaluation pipeline that can contribute to the community in performance comparisons. Extensive experiments demonstrate that our method consistently achieves leading results in both geometric and texture evaluation metrics across more flexible input viewpoints.

[56] GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction cs.CVPDF

Jiaqi Wu, Yaosen Chen, Shuyuan Zhu

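TL;DR: GeoMVD is a multi-view generation model based on geometric information extraction. It builds a shared geometric structure from depth maps, normal maps, and foreground segmentation masks, and combines a decoupled geometry-enhanced attention mechanism with an adaptive learning strategy to improve multi-view consistency and high-resolution generation.

Details

Motivation: Multi-view image generation is valuable for 3D reconstruction, virtual reality, and related fields, but existing methods face computational challenges in maintaining cross-view consistency and generating high-resolution outputs. A method that makes effective use of geometric information is therefore needed.

Result: The model performs well on multi-view consistency and high-resolution generation, producing detailed and visually coherent multi-view images.

Insight: Effective use of geometric information is key to improving the consistency and detail quality of multi-view generation; decoupled attention and adaptive adjustment strategies improve model performance.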

Abstract: Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://github.com/SobeyMIL/GeoMVD.com.

[57] A Novel AI-Driven System for Real-Time Detection of Mirror Absence, Helmet Non-Compliance, and License Plates Using YOLOv8 and OCR cs.CV | cs.AIPDF

Nishant Vasantkumar Hegde, Aditi Agarwal, Minal Moharir

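TL;DR: The paper presents an AI-driven system based on YOLOv8 and OCR for real-time detection of motorcycle helmet non-compliance, missing rear-view mirrors, and license plates, improving the efficiency and accuracy of traffic violation enforcement.

Details

Motivation: Manual enforcement is resource-intensive and inconsistent, so an automated solution is needed to improve road safety.

Result: The model achieves an overall precision of 0.9147, a recall of 0.886, an mAP@50 of 0.843, and an mAP@50-95 of 0.503.

Insight: The system offers a practical solution for automated traffic enforcement, performs well at license plate recognition and object detection under challenging conditions, and has potential for real-world deployment.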

Abstract: Road safety is a critical global concern, with manual enforcement of helmet laws and vehicle safety standards (e.g., rear-view mirror presence) being resource-intensive and inconsistent. This paper presents an AI-powered system to automate traffic violation detection, significantly enhancing enforcement efficiency and road safety. The system leverages YOLOv8 for robust object detection and EasyOCR for license plate recognition. Trained on a custom dataset of annotated images (augmented for diversity), it identifies helmet non-compliance and the absence of rear-view mirrors on motorcycles (the latter an innovative contribution to automated checks), and it extracts vehicle registration numbers. A Streamlit-based interface facilitates real-time monitoring and violation logging. Advanced image preprocessing enhances license plate recognition, particularly under challenging conditions. Based on evaluation results, the model achieves an overall precision of 0.9147, a recall of 0.886, and a mean Average Precision (mAP@50) of 0.843. The mAP@50-95 of 0.503 further indicates strong detection capability under stricter IoU thresholds. This work demonstrates a practical and effective solution for automated traffic rule enforcement, with considerations for real-world deployment discussed.
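
The mAP@50 and mAP@50-95 figures above depend on the intersection-over-union (IoU) between predicted and ground-truth boxes: mAP@50 counts a detection as correct at IoU of at least 0.5, while mAP@50-95 averages over thresholds from 0.50 to 0.95 in steps of 0.05 (the COCO convention). A minimal sketch of that relationship follows; the box coordinates and helper names are hypothetical and not taken from the paper.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds behind mAP@50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(pred_box, true_box):
    """Thresholds at which this prediction would count as a true positive."""
    score = iou(pred_box, true_box)
    return [t for t in IOU_THRESHOLDS if score >= t]


# A slightly short prediction passes the looser thresholds only.
print(thresholds_passed((0, 0, 10, 8.2), (0, 0, 10, 10)))
```

This is why mAP@50-95 is typically lower than mAP@50: a loosely localized box still counts at 0.5 but fails the stricter thresholds.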

[58] Mixture of States: Routing Token-Level Dynamics for Multimodal Generation cs.CVPDF

Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie

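TL;DR: MoS (Mixture of States) is a novel fusion paradigm for multimodal diffusion models that uses a learnable token-wise router to dynamically align hidden states across modalities, improving the efficiency and performance of multimodal generation.

Details

Motivation: Existing multimodal diffusion models usually fuse modalities through fixed or simple interactions, which limits flexibility and performance. MoS aims for more efficient cross-modal interaction through a dynamic router.

Result: MoS achieves state-of-the-art results on text-to-image generation (MoS-Image) and editing (MoS-Editing); with only 3B to 5B parameters, the models match or surpass counterparts up to 4x larger.

Insight: Precise cross-modal alignment through dynamic token-level routing can improve performance while reducing model complexity, offering a flexible and efficient paradigm for scaling multimodal diffusion models.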

Abstract: We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities’ hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
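
The top-k selection with epsilon-greedy exploration described above can be sketched generically as follows. This is an illustration of the general technique only: the scores, the function name `route_top_k`, and the choice to explore by sampling k states uniformly are assumptions, not details confirmed by the abstract.

```python
import random


def route_top_k(scores, k, epsilon=0.0, rng=None):
    """Select k hidden-state indices for one token.

    With probability `epsilon` the router explores by sampling k indices
    uniformly at random; otherwise it exploits the k highest router scores.
    """
    rng = rng or random.Random()
    n = len(scores)
    if rng.random() < epsilon:
        return sorted(rng.sample(range(n), k))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical router scores over four candidate hidden states.
scores = [0.1, 0.9, 0.3, 0.7]
print(route_top_k(scores, k=2))                      # greedy selection
print(route_top_k(scores, k=2, epsilon=1.0,
                  rng=random.Random(0)))             # pure exploration
```

Exploration during training lets states with currently low scores still receive gradient signal; at inference one would presumably set `epsilon` to 0.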

Peng Zhang, Zhihui Lai, Wenting Chen, Xu Wu, Heng Kong

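TL;DR: FaNe is a semantic-enhanced medical vision-language pre-training framework that improves medical image understanding by reducing false negatives and achieving fine-grained cross-modal alignment.

Details

Motivation: Existing medical vision-language pre-training methods are limited by false negatives caused by semantically similar texts and by insufficiently fine-grained cross-modal alignment. FaNe aims to address both problems.

Result: FaNe achieves state-of-the-art performance on five downstream medical imaging benchmarks (classification, detection, and segmentation).

Insight: Reducing false negatives and achieving fine-grained cross-modal alignment are critical for medical VLP.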

Abstract: Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.

[60] Suppressing VLM Hallucinations with Spectral Representation Filtering cs.CV | cs.LGPDF

Ameen Ali, Tamim Zoabi, Lior Wolf

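TL;DR: The study proposes Spectral Representation Filtering (SRF), which suppresses hallucinations by analyzing the covariance structure of vision-language model (VLM) representations. SRF needs no additional training and adds no inference overhead, significantly reduces hallucination rates, and performs well on multi-task benchmarks.

Details

Motivation: VLMs tend to hallucinate when generating descriptions (describing objects or attributes that do not exist) because of over-reliance on language priors and imprecise cross-modal grounding. SRF is proposed to address this without harming generation quality.

Result: On benchmarks such as MSCOCO and POPE-VQA, SRF significantly reduces hallucination rates for LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2 without degrading caption quality.

Insight: 1. VLM hallucinations stem from biased structure in the feature space. 2. Covariance analysis can efficiently identify and correct these biases. 3. SRF is a lightweight post-hoc solution applicable to many VLMs.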

Abstract: Vision-language models (VLMs) frequently produce hallucinations in the form of descriptions of objects, attributes, or relations that do not exist in the image due to over-reliance on language priors and imprecise cross-modal grounding. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method to suppress such hallucinations by analyzing and correcting the covariance structure of the model’s representations. SRF identifies low-rank hallucination modes through eigendecomposition of the covariance of the differences between features collected for truthful and hallucinatory captions, revealing structured biases in the feature space. A soft spectral filter then attenuates these modes in the feed-forward projection weights of deeper vLLM layers, equalizing feature variance while preserving semantic fidelity. Unlike decoding or retraining-based approaches, SRF operates entirely post-hoc, incurs zero inference overhead, and requires no architectural modifications. Across three families of VLMs (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF consistently reduces hallucination rates on MSCOCO, POPE-VQA, and other visual tasks benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.
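
The two steps named in the abstract (eigendecomposition of the covariance of feature differences, then attenuation of the dominant modes in a projection weight) can be illustrated with a deliberately small sketch. Everything here is an assumption for illustration: it finds only the single top mode by power iteration, the strength rule `eigval / (eigval + tau)` is a hypothetical soft filter, and the toy data and function names are not from the paper.

```python
import math


def covariance(diffs):
    """Covariance matrix of feature differences (rows are samples)."""
    n, d = len(diffs), len(diffs[0])
    mean = [sum(row[j] for row in diffs) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in diffs]
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iters=200):
    """Dominant eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


def soft_strength(eigval, tau):
    """Hypothetical soft filter: stronger attenuation for larger eigenvalues."""
    return eigval / (eigval + tau)


def attenuate(weights, direction, strength):
    """Return (I - strength * u u^T) W, damping the output component along u."""
    rows, cols = len(weights), len(weights[0])
    proj = [sum(direction[i] * weights[i][c] for i in range(rows))
            for c in range(cols)]
    return [[weights[r][c] - strength * direction[r] * proj[c]
             for c in range(cols)] for r in range(rows)]


# Toy differences between "truthful" and "hallucinatory" features.
diffs = [[2.0, 0.0], [-2.0, 0.0], [0.0, 0.1], [0.0, -0.1]]
eigval, mode = top_eigenpair(covariance(diffs))
filtered = attenuate([[1.0, 0.0], [0.0, 1.0]], mode, soft_strength(eigval, tau=0.5))
print(eigval, mode, filtered)
```

Since the filter is folded into the weights once, the modified layer costs the same at inference as the original, which is consistent with the zero-overhead claim in the abstract.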

[61] Model Inversion Attack Against Deep Hashing cs.CV | cs.AIPDF

Dongdong Zhao, Qiben Xu, Ranxin Fang, Baogang Song

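TL;DR: The paper proposes DHMI, the first diffusion-based model inversion framework for deep hashing. It overcomes the inability of existing methods to adapt to deep hashing, reconstructs high-quality images in the black-box setting, and demonstrates the privacy risks of deep hashing systems.

Details

Motivation: Deep hashing improves retrieval efficiency, but its binary codes may leak the original training data and create privacy risks. Model inversion attacks on deep hashing had not been studied, and the paper fills this gap.

Result: Experiments on multiple datasets show that DHMI reconstructs high-resolution, high-quality images even under the most challenging black-box setting and outperforms existing state-of-the-art model inversion attacks.

Insight: The paper reveals high-risk privacy vulnerabilities in deep hashing systems and highlights the need for stronger privacy protection when designing them.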

Abstract: Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.

[62] Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets cs.CVPDF

Huy M. Le, Dat Tien Nguyen, Phuc Binh Nguyen, Gia-Bao Le-Tran, Phu Truong Thien

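TL;DR: Fusionista2.0 is an efficient video retrieval system optimized for large-scale datasets; redesigned core modules and a redesigned user interface substantially improve retrieval speed and user experience.

Details

Motivation: The Video Browser Showdown (VBS) demands accurate results under strict time limits, which calls for an efficient, easy-to-use video retrieval system.

Result: Retrieval time is reduced by up to 75%, and accuracy and user satisfaction both increase.

Insight: Modular optimization and lightweight design can greatly improve system efficiency while maintaining or improving accuracy.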

Abstract: The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.

[63] Prompt-Conditioned FiLM and Multi-Scale Fusion on MedSigLIP for Low-Dose CT Quality Assessment cs.CV | cs.AI | eess.IVPDF

Tolga Demiroglu, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

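TL;DR: The paper proposes a prompt-conditioned framework built on MedSigLIP that injects textual priors via FiLM and multi-scale pooling and combines global, local, and texture-aware regression heads, enabling data-efficient learning and rapid adaptation and achieving strong results on LDCT quality assessment.

Details

Motivation: Traditional low-dose CT (LDCT) quality assessment methods depend on large amounts of labeled data and lack flexibility. Text prompt conditioning and multi-scale features can improve clinical adaptability and data efficiency.

Result: On the LDCTIQA2023 dataset (1,000 training images), the method achieves PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301, surpassing the best published results.

Insight: Text prompt conditioning effectively steers the model toward clinical intent, and multi-scale feature fusion improves representation; the approach may generalize to other medical imaging tasks.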

Abstract: We propose a prompt-conditioned framework built on MedSigLIP that injects textual priors via Feature-wise Linear Modulation (FiLM) and multi-scale pooling. Text prompts condition patch-token features on clinical intent, enabling data-efficient learning and rapid adaptation. The architecture combines global, local, and texture-aware pooling through separate regression heads fused by a lightweight MLP, trained with a pairwise ranking loss. Evaluated on LDCTIQA2023 (a public LDCT quality assessment challenge) with 1,000 training images, we achieve PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301, surpassing the top-ranked published challenge submissions and demonstrating the effectiveness of our prompt-guided approach.
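
Feature-wise Linear Modulation itself is a simple operation: each feature channel is scaled by a factor gamma and shifted by an offset beta that are derived from the conditioning input. The sketch below shows only that operation; in the paper gamma and beta would presumably be predicted from the text prompt, whereas here they are hard-coded hypothetical values.

```python
def film(patch_tokens, gamma, beta):
    """Apply FiLM: out[t][c] = gamma[c] * patch_tokens[t][c] + beta[c]."""
    return [[g * x + b for x, g, b in zip(token, gamma, beta)]
            for token in patch_tokens]


# Two patch tokens with two channels each (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0]]
gamma = [2.0, 0.5]   # per-channel scale; assumed to come from a prompt encoder
beta = [0.0, 1.0]    # per-channel shift; assumed to come from a prompt encoder
print(film(tokens, gamma, beta))
```

The same gamma and beta are applied to every patch token, so the prompt changes which channels are emphasized without altering the spatial layout of the features.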

[64] A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation cs.CVPDF

Puzhen Wu, Hexin Dong, Yi Lin, Yihao Ding, Yifan Peng

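TL;DR: The paper proposes a novel dual-stage disease-aware framework for chest X-ray report generation; disease-aware semantic tokens and a disease-visual attention fusion module significantly improve the clinical accuracy and linguistic quality of generated reports.

Details

Motivation: Existing chest X-ray report generation methods often lack sufficient disease awareness and vision-language alignment, so they overlook key pathological features and generate clinically inaccurate reports.

Result: On the CheXpert Plus, IU X-ray, and MIMIC-CXR datasets, the method achieves state-of-the-art clinical accuracy and linguistic quality.

Insight: Disease awareness is crucial in medical image analysis; combining visual and semantic alignment improves the quality of generated reports.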

Abstract: Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists’ workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease-awareness in visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models usually overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage 1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage 2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (i.e., CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.

[65] CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models cs.CV | cs.AIPDF

Jingyao Li, Jingyun Wang, Molin Tan, Haochen Wang, Cilin Yan

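TL;DR: CrossVid is a comprehensive benchmark for evaluating cross-video reasoning (CVR) in multimodal large language models (MLLMs); it covers diverse tasks and a large amount of video data and exposes the limitations of current models.

Details

Motivation: Existing video understanding benchmarks focus on single-video analysis and do not evaluate cross-video reasoning; CrossVid aims to fill this gap.

Result: Gemini-2.5-Pro performs best, with an average accuracy of 50.4%, but most MLLMs perform poorly on cross-video reasoning tasks.

Insight: Current MLLMs struggle to integrate and compare evidence across multiple videos, which indicates a need for further technical improvement.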

Abstract: Cross-Video Reasoning (CVR) presents a significant challenge in video understanding, which requires simultaneous understanding of multiple videos to aggregate and compare information across groups of videos. Most existing video understanding benchmarks focus on single-video analysis, failing to assess the ability of multimodal large language models (MLLMs) to simultaneously reason over various videos. Recent benchmarks evaluate MLLMs’ capabilities on multi-view videos that capture different perspectives of the same scene. However, their limited tasks hinder a thorough assessment of MLLMs in diverse real-world CVR scenarios. To this end, we introduce CrossVid, the first benchmark designed to comprehensively evaluate MLLMs’ spatial-temporal reasoning ability in cross-video contexts. Firstly, CrossVid encompasses a wide spectrum of hierarchical tasks, comprising four high-level dimensions and ten specific tasks, thereby closely reflecting the complex and varied nature of real-world video understanding. Secondly, CrossVid provides 5,331 videos, along with 9,015 challenging question-answering pairs, spanning single-choice, multiple-choice, and open-ended question formats. Through extensive experiments on various open-source and closed-source MLLMs, we observe that Gemini-2.5-Pro performs best on CrossVid, achieving an average accuracy of 50.4%. Notably, our in-depth case study demonstrates that most current MLLMs struggle with CVR tasks, primarily due to their inability to integrate or compare evidence distributed across multiple videos for reasoning. These insights highlight the potential of CrossVid to guide future advancements in enhancing MLLMs’ CVR capabilities.

[66] ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks cs.CVPDF

Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li

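TL;DR: The paper proposes ZoomEarth, an active perception framework for ultra-high-resolution (UHR) geospatial vision-language tasks. An adaptive cropping-zooming mechanism reduces redundant information, and the framework achieves state-of-the-art performance on the new large-scale dataset LRS-GRO.

Details

Motivation: Existing dynamic-resolution and token-pruning methods suffer from redundancy on UHR remote sensing images and fail to capture information-rich regions effectively. The authors therefore explore an active perception paradigm that lets the model choose information-rich regions to process.

Result: ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. It can also be integrated with models for a range of downstream tasks.

Insight: Active perception can significantly reduce redundant computation and sharpen the model's focus on information-rich regions, which matters for UHR imagery.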

Abstract: Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.

[67] TM-UNet: Token-Memory Enhanced Sequential Modeling for Efficient Medical Image Segmentation cs.CVPDF

Yaxuan Jiao, Qing Xu, Yuxiang Luo, Xiangjian He, Zhen Chen

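TL;DR: TM-UNet is a lightweight medical image segmentation framework that combines multi-scale token-memory blocks with an efficient memory mechanism, substantially lowering computational cost while maintaining high-quality segmentation.

Details

Motivation: Existing transformer-based medical image segmentation methods are computationally expensive, which limits clinical deployment.

Result: TM-UNet outperforms existing methods on multiple medical segmentation tasks at substantially lower computational cost.

Insight: The token-memory mechanism enables efficient global reasoning with linear complexity, offering a new direction for lightweight medical image segmentation.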

Abstract: Medical image segmentation is essential for clinical diagnosis and treatment planning. Although transformer-based methods have achieved remarkable results, their high computational cost hinders clinical deployment. To address this issue, we propose TM-UNet, a novel lightweight framework that integrates token sequence modeling with an efficient memory mechanism for efficient medical segmentation. Specifically, we introduce a multi-scale token-memory (MSTM) block that transforms 2D spatial features into token sequences through strategic spatial scanning, leveraging matrix memory cells to selectively retain and propagate discriminative contextual information across tokens. This novel token-memory mechanism acts as a dynamic knowledge store that captures long-range dependencies with linear complexity, enabling efficient global reasoning without redundant computation. Our MSTM block further incorporates exponential gating to identify token effectiveness and multi-scale contextual extraction via parallel pooling operations, enabling hierarchical representation learning without computational overhead. Extensive experiments demonstrate that TM-UNet outperforms state-of-the-art methods across diverse medical segmentation tasks with substantially reduced computation cost. The code is available at https://github.com/xq141839/TM-UNet.

[68] D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs cs.CV | cs.CL | cs.LGPDF

Shuochen Chang, Xiaofeng Zhang, Qingyang Liu, Li Niu

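TL;DR: D$^{3}$ToM is a dynamic token merging method that accelerates inference in diffusion-based multimodal large language models (Diffusion MLLMs), addressing the cubic decoding complexity caused by redundant visual tokens.

Details

Motivation: Diffusion MLLMs perform well on vision-and-language tasks but are slow at inference, mainly because every denoising step runs bidirectional self-attention over the entire sequence, which is very costly. The paper aims to improve inference efficiency.

Result: Experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance.

Insight: Dynamic token merging is an efficient and flexible way to reduce computation without sacrificing performance, and it suggests a route to accelerating similar models.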

Abstract: Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. It then retains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D$^{3}$ToM employs a merge ratio that varies dynamically with each denoising step and aligns with the native decoding process of Diffusion MLLMs, achieving superior performance under equivalent computational budgets. Extensive experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance. The code is released at https://github.com/bcmi/D3ToM-Diffusion-MLLM.
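
The keep-then-merge step described above can be sketched as follows: retain the most important tokens and fold each remaining token into its most similar retained token. This is an assumption-laden illustration, not the released code: the importance scores are given directly rather than derived from decider tokens, cosine similarity and simple averaging stand in for the paper's similarity-based aggregation, and `merge_tokens` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens, importance, keep_ratio):
    """Keep the top `keep_ratio` tokens by importance and average the rest into them."""
    n = len(tokens)
    n_keep = max(1, math.ceil(n * keep_ratio))
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    kept = sorted(order[:n_keep])                 # preserve original token order
    groups = {i: [tokens[i]] for i in kept}
    for j in order[n_keep:]:
        # Assign each dropped token to its most similar kept token.
        target = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[target].append(tokens[j])
    return [[sum(col) / len(groups[i]) for col in zip(*groups[i])] for i in kept]


tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
importance = [0.9, 0.8, 0.1, 0.2]   # hypothetical importance map
print(merge_tokens(tokens, importance, keep_ratio=0.5))
```

Varying `keep_ratio` per denoising step would correspond to the dynamic merge ratio mentioned in the abstract; the schedule itself is not specified there.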

[69] One target to align them all: LiDAR, RGB and event cameras extrinsic calibration for Autonomous Driving cs.CVPDF

Andrea Bertogalli, Giacomo Boracchi, Luca Magri

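TL;DR: The paper proposes a novel multi-modal extrinsic calibration framework that simultaneously estimates the relative poses of event cameras, LiDARs, and RGB cameras, and designs a dedicated 3D calibration target to address the difficulty of event camera calibration.

Details

Motivation: Precise alignment of multi-sensor systems is critical in autonomous driving, but existing methods usually rely on separate pairwise calibrations, which limits efficiency and accuracy. The paper proposes a joint calibration method to address this.

Result: Experiments on a custom dataset show that the method is accurate and robust and suits complex autonomous driving vision systems.

Insight: A target design that fuses multiple feature types solves joint multi-sensor calibration and offers a new approach to sensor integration in autonomous driving systems.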

Abstract: We present a novel multi-modal extrinsic calibration framework designed to simultaneously estimate the relative poses between event cameras, LiDARs, and RGB cameras, with particular focus on the challenging event camera calibration. The core of our approach is a novel 3D calibration target, specifically designed and constructed to be concurrently perceived by all three sensing modalities. The target encodes features in planes, ChArUco patterns, and active LED patterns, each tailored to the unique characteristics of LiDARs, RGB cameras, and event cameras respectively. This unique design enables a one-shot, joint extrinsic calibration process, in contrast to existing approaches that typically rely on separate, pairwise calibrations. Our calibration pipeline is designed to accurately calibrate complex vision systems in the context of autonomous driving, where precise multi-sensor alignment is critical. We validate our approach through an extensive experimental evaluation on a custom-built dataset, recorded with an advanced autonomous driving sensor setup, confirming the accuracy and robustness of our method.

[70] Rethinking Bias in Generative Data Augmentation for Medical AI: a Frequency Recalibration Method cs.CV | cs.AIPDF

Chi Liu, Jincheng Liu, Congcong Zhu, Minghao Wang, Sheng Shen

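TL;DR: The paper proposes Frequency Recalibration (FreRec), which reduces the frequency distribution discrepancy between generated and real medical images and thereby improves the reliability of generative data augmentation (GDA).

Details

Motivation: Medical AI depends on large datasets, but data scarcity is common. GDA offers a way to synthesize realistic medical images, yet the bias it introduces (especially frequency bias) is often overlooked and can harm downstream tasks.

Result: Experiments on several medical datasets (brain MRI, chest X-ray, and fundus images) show that FreRec significantly improves downstream classification performance compared with uncalibrated synthesized samples.

Insight: Frequency recalibration is an effective way to make GDA more reliable; as a standalone post-processing step, FreRec is compatible with any generative model and fits medical GDA pipelines.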

Abstract: Developing Medical AI relies on large datasets and easily suffers from data scarcity. Generative data augmentation (GDA) using AI generative models offers a solution to synthesize realistic medical images. However, the bias in GDA is often underestimated in medical domains, with concerns about the risk of introducing detrimental features generated by AI and harming downstream tasks. This paper identifies the frequency misalignment between real and synthesized images as one of the key factors underlying unreliable GDA and proposes the Frequency Recalibration (FreRec) method to reduce the frequency distributional discrepancy and thus improve GDA. FreRec involves (1) Statistical High-frequency Replacement (SHR) to roughly align high-frequency components and (2) Reconstructive High-frequency Mapping (RHM) to enhance image quality and reconstruct high-frequency details. Extensive experiments were conducted in various medical datasets, including brain MRIs, chest X-rays, and fundus images. The results show that FreRec significantly improves downstream medical image classification performance compared to uncalibrated AI-synthesized samples. FreRec is a standalone post-processing step that is compatible with any generative model and can integrate seamlessly with common medical GDA pipelines.

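[71] Learning Time in Static Classifiers cs.CV | cs.AI | cs.LG

Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao

TL;DR: The paper proposes SEQ (Support-Exemplar-Query), a learning paradigm that organizes training data into temporally coherent trajectories and gives static classifiers temporal reasoning without modifying the model architecture or adding recurrent modules.

Details

Motivation: Real-world visual data usually evolves gradually over time, but conventional classifiers are trained under an assumption of temporal independence and cannot capture these dynamics.

Result: The method improves fine-grained and ultra-fine-grained image classification and delivers precise, temporally consistent predictions in video anomaly detection.

Insight: Loss design alone can introduce a temporal inductive bias into static models, which gives a data-efficient way to improve temporal reasoning without complex architectural changes.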

Abstract: Real-world visual data rarely presents as isolated, static instances. Instead, it often evolves gradually over time through variations in pose, lighting, object state, or scene context. However, conventional classifiers are typically trained under the assumption of temporal independence, limiting their ability to capture such dynamics. We propose a simple yet effective framework that equips standard feedforward classifiers with temporal reasoning, all without modifying model architectures or introducing recurrent modules. At the heart of our approach is a novel Support-Exemplar-Query (SEQ) learning paradigm, which structures training data into temporally coherent trajectories. These trajectories enable the model to learn class-specific temporal prototypes and align prediction sequences via a differentiable soft-DTW loss. A multi-term objective further promotes semantic consistency and temporal smoothness. By interpreting input sequences as evolving feature trajectories, our method introduces a strong temporal inductive bias through loss design alone. This proves highly effective in both static and temporal tasks: it enhances performance on fine-grained and ultra-fine-grained image classification, and delivers precise, temporally consistent predictions in video anomaly detection. Despite its simplicity, our approach bridges static and temporal learning in a modular and data-efficient manner, requiring only a simple classifier on top of pre-extracted features.
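
The soft-DTW alignment term mentioned in the abstract can be illustrated with a short sketch. The code below is a generic soft-DTW recursion (a smoothed dynamic-time-warping cost), not the authors' implementation; the function names, the squared-Euclidean frame cost, and the `gamma` smoothing value are assumptions made for illustration.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    smallest = min(values)
    if math.isinf(smallest):
        return smallest
    total = sum(math.exp(-(v - smallest) / gamma) for v in values)
    return smallest - gamma * math.log(total)


def soft_dtw(seq_a, seq_b, gamma=1.0):
    """Soft-DTW cost between two sequences of equal-dimension feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # table[i][j] holds the smoothed cost of aligning the first i and j frames.
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Squared Euclidean distance between frames (an assumed cost).
            cost = sum((x - y) ** 2 for x, y in zip(seq_a[i - 1], seq_b[j - 1]))
            neighbours = [table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]]
            table[i][j] = cost + soft_min(neighbours, gamma)
    return table[n][m]
```

Because the hard minimum is replaced by a smooth one, the cost is differentiable with respect to the input sequences, which is what lets an alignment term like this be used as a training loss.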

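[72] SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models cs.CV

Sepehr Kazemi Ranjbar, Kumail Alhamoud, Marzyeh Ghassemi

TL;DR: SpaceVLM is a training-free method that models negation as a subspace of the embedding space. It significantly improves how vision-language models (VLMs) understand negation while avoiding the damage to zero-shot performance caused by conventional fine-tuning.

Details

Motivation: Existing VLMs handle negated prompts (such as "no pedestrians") poorly. Fine-tuning improves negation understanding but harms zero-shot performance.

Result: Across retrieval, multiple-choice, and text-to-image tasks, negation understanding improves by about 30% on average over prior methods, without affecting zero-shot performance.

Insight: The embedding space of VLMs can be divided into semantically consistent subspaces, a property that offers a new route to better negation understanding.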

Abstract: Vision-Language Models (VLMs) struggle with negation. Given a prompt like “retrieve (or generate) a street scene without pedestrians,” they often fail to respect the “not.” Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model’s zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as “A but not N,” we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.
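
The "close to A, far from N" region described in the abstract can be pictured with a toy Monte Carlo sketch in three dimensions. This is not the authors' procedure: the sampling approach, the cap angles, and every function name here are assumptions, and real CLIP embeddings are far too high-dimensional for naive sampling like this.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def central_direction(a, n, cap_a_deg=60.0, cap_n_deg=60.0, samples=20000, seed=0):
    """Estimate the mean direction of points inside the cap around `a`
    and outside the cap around `n` (both unit vectors)."""
    rng = random.Random(seed)
    cos_a = math.cos(math.radians(cap_a_deg))
    cos_n = math.cos(math.radians(cap_n_deg))
    total = [0.0] * len(a)
    kept = 0
    for _ in range(samples):
        # A normalized Gaussian vector is uniformly distributed on the sphere.
        x = normalize([rng.gauss(0.0, 1.0) for _ in a])
        if dot(x, a) >= cos_a and dot(x, n) <= cos_n:
            total = [t + xi for t, xi in zip(total, x)]
            kept += 1
    if kept == 0:
        raise ValueError("empty region; widen the A cap or shrink the N cap")
    return normalize(total)


def score_image(image_embedding, direction):
    """Cosine score of a unit-norm image embedding against the region's direction."""
    return dot(image_embedding, direction)
```

In this toy setting, two images that are equally similar to A get different scores once the direction is pulled away from N, which is the behaviour a single-point embedding of the negated caption cannot express.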

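[73] Ground Plane Projection for Improved Traffic Analytics at Intersections cs.CV | cs.AI | cs.LG

Sajjad Pakdamansavoji, Kumar Vaibhav Jha, Baher Abdulhai, James H Elder

TL;DR: The paper explores the benefits of back-projecting vehicles detected by infrastructure cameras onto the ground plane for analysis in 3D coordinates. Back-projection improves trajectory classification and turning movement counts for single-camera systems, and weak fusion across multiple cameras improves accuracy further.

Details

Motivation: Signal control, traffic management, and urban planning need accurate turning movement counts, and conventional image-plane methods may not be accurate enough.

Result: Experiments show that back-projection gives more accurate turning movement counts and trajectory classification in single-camera systems, and that weak fusion in multi-camera systems improves performance further.

Insight: Traffic should be analyzed on the ground plane rather than the image plane, and weak fusion across multiple cameras is an effective route to higher accuracy.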

Abstract: Accurate turning movement counts at intersections are important for signal control, traffic management and urban planning. Computer vision systems for automatic turning movement counts typically rely on visual analysis in the image plane of an infrastructure camera. Here we explore potential advantages of back-projecting vehicles detected in one or more infrastructure cameras to the ground plane for analysis in real-world 3D coordinates. For single-camera systems we find that back-projection yields more accurate trajectory classification and turning movement counts. We further show that even higher accuracy can be achieved through weak fusion of back-projected detections from multiple cameras. These results suggest that traffic should be analyzed on the ground plane, not the image plane.
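
Back-projection to the ground plane can be sketched with a planar homography. The abstract does not say how the authors calibrate their cameras, so the image-to-ground matrix, the function names, and the heading-change heuristic below are illustrative assumptions.

```python
import math


def back_project(h_img_to_ground, u, v):
    """Map an image point (u, v) to ground-plane coordinates (X, Y)
    using a 3x3 image-to-ground homography (assumed to be pre-calibrated)."""
    h = h_img_to_ground
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity (it lies on the horizon line)")
    return x / w, y / w


def heading_change_deg(ground_track):
    """Signed angle in degrees between the first and last segment of a
    ground-plane track; negative means clockwise when the Y axis points up."""
    (x0, y0), (x1, y1) = ground_track[0], ground_track[1]
    (x2, y2), (x3, y3) = ground_track[-2], ground_track[-1]
    angle_in = math.atan2(y1 - y0, x1 - x0)
    angle_out = math.atan2(y3 - y2, x3 - x2)
    delta = math.degrees(angle_out - angle_in)
    return (delta + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
```

Once a track is expressed in metric ground coordinates, a quantity such as heading change has the same meaning regardless of camera viewpoint, which is one plausible reason ground-plane analysis helps turning movement classification.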

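[74] Explainable AI-Generated Image Detection RewardBench cs.CV

Michael Yang, Shijian Deng, William T. Doan, Kai Wang, Tianyu Yang

TL;DR: The paper introduces XAIGID-RewardBench, the first benchmark for evaluating how well multimodal large language models (MLLMs) judge the quality of explanations for AI-generated image detection. The best current model still falls clearly short of human inter-annotator agreement.

Details

Motivation: Conventional classification-based AI-generated image detectors cannot provide explanations a human expert would understand, which lowers their trustworthiness and persuasiveness. MLLMs are a growing answer to this problem, but their ability to judge explanations produced by themselves or by other MLLMs has not been well studied.

Result: The best current reward model scores 88.76% on the benchmark, against 98.30% human inter-annotator agreement, showing that a gap to human level remains. The paper also analyzes common errors.

Insight: MLLMs still trail humans at judging explanations, which exposes limits in their reasoning ability and points to directions for improvement.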

Abstract: Conventional, classification-based AI-generated image detection methods cannot explain why an image is considered real or AI-generated in a way a human expert would, which reduces the trustworthiness and persuasiveness of these detection tools for real-world applications. Leveraging Multimodal Large Language Models (MLLMs) has recently become a trending solution to this issue. Further, to evaluate the quality of generated explanations, a common approach is to adopt an “MLLM as a judge” methodology to evaluate explanations generated by other MLLMs. However, how well those MLLMs perform when judging explanations for AI-generated image detection generated by themselves or other MLLMs has not been well studied. We therefore propose XAIGID-RewardBench, the first benchmark designed to evaluate the ability of current MLLMs to judge the quality of explanations about whether an image is real or AI-generated. The benchmark consists of approximately 3,000 annotated triplets sourced from various image generation models and MLLMs as policy models (detectors) to assess the capabilities of current MLLMs as reward models (judges). Our results show that the current best reward model scored 88.76% on this benchmark (while human inter-annotator agreement reaches 98.30%), demonstrating that a visible gap remains between the reasoning abilities of today’s MLLMs and human-level performance. In addition, we provide an analysis of common pitfalls that these models frequently encounter. Code and benchmark are available at https://github.com/RewardBench/XAIGID-RewardBench.

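[75] Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning cs.CV

Yiqing Shen, Mathias Unberath

TL;DR: DT-R1 is a reinforcement learning framework that offers a unified approach to multimodal visual reasoning by constructing and interpreting digital twin representations.

Details

Motivation: Existing visual reasoning methods rely on task-specific supervised fine-tuning and model designs, so they lack unity and cross-task, cross-modality generalization.

Result: On six visual reasoning benchmarks covering two modalities and four task types, DT-R1 outperforms current task-specific models.

Insight: DT-R1 shows the potential of digital twin representations combined with reinforcement learning for unifying visual reasoning tasks.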

Abstract: Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations in six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently achieves improvements over state-of-the-art task-specific models. DT-R1 opens a new direction where visual reasoning emerges from reinforcement learning with digital twin representations.

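[76] Fast Reasoning Segmentation for Images and Videos cs.CV

Yiqing Shen, Mathias Unberath

TL;DR: FastReasonSeg is an efficient reasoning segmentation method that uses digital twin representations and distillation to substantially improve reasoning segmentation on edge devices.

Details

Motivation: Existing reasoning segmentation methods depend on large multimodal language models whose compute requirements make them hard to deploy on edge devices. Distillation can compress models, but conventional approaches fail to preserve multi-step reasoning.

Result: FastReasonSeg reaches state-of-the-art performance on image and video benchmarks. The 0.6B-parameter model outperforms models with 20 times more parameters while running efficiently (7.79 FPS, 2.1 GB memory).

Insight: Decoupling perception from reasoning and applying distillation is an effective way to make reasoning segmentation more efficient, especially in resource-constrained settings. Digital twin representations offer a new angle on model compression.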

Abstract: Reasoning segmentation enables open-set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine-tuning on teacher-generated reasoning chains. Then it is followed by reinforcement fine-tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that our FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency enables deployment in resource-constrained environments to enable real-time reasoning segmentation.

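[77] Changes in Real Time: Online Scene Change Detection with Multi-View Fusion cs.CV

Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim, Donald Dansereau, Niko Sünderhauf

TL;DR: The paper proposes a new online scene change detection (SCD) method that detects scene changes in real time from unconstrained viewpoints and achieves multi-view consistency through multi-view fusion and a self-supervised loss.

Details

Motivation: Existing online SCD methods are far less accurate than offline ones and perform poorly under unconstrained viewpoints. The work aims for an SCD method that is online, pose-agnostic, label-free, efficient, and accurate.

Result: Experiments on complex real-world datasets show that the method runs at over 10 FPS and outperforms existing online and offline baselines.

Insight: Multi-view fusion and self-supervised learning enable efficient online SCD from unconstrained viewpoints, and the work also shows the potential of 3D Gaussian Splatting for dynamic scene representation.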

Abstract: Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.

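[78] Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models cs.CV

Yiqing Shen, Chenxiao Fan, Chenjia Li, Mathias Unberath

TL;DR: The paper addresses reasoning text-to-video retrieval using digital twin video representations and large language models, with substantial performance gains.

Details

Motivation: Existing methods work well for explicit queries but poorly for implicit queries that require reasoning, so a new approach is needed to close this gap.

Result: R@1 reaches 81.2% on ReasonT2VBench-135, more than 50 percentage points above the strongest baseline, and the method sets new state-of-the-art results on conventional benchmarks such as MSR-VTT.

Insight: Combining digital twin representations with large language models opens a new approach to video understanding and reasoning, and underlines the importance of structured representations.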

Abstract: The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries where the visual content of interest is described explicitly; however, they fail with implicit queries where identifying videos relevant to the query requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over long-horizon video content without visual token compression. Specifically, our two-stage framework first performs compositional alignment between decomposed sub-queries and digital twin representations for candidate identification, then applies large language model-based reasoning with just-in-time refinement that invokes additional specialist models to address information gaps. We construct a benchmark of 447 manually created implicit queries with 135 videos (ReasonT2VBench-135) and another more challenging version of 1000 videos (ReasonT2VBench-1000). Our method achieves 81.2% R@1 on ReasonT2VBench-135, outperforming the strongest baseline by greater than 50 percentage points, and maintains 81.7% R@1 on the extended configuration while establishing state-of-the-art results in three conventional benchmarks (MSR-VTT, MSVD, and VATEX).
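
For readers unfamiliar with the R@1 metric quoted in the abstract, a minimal sketch of Recall@k for retrieval follows. The dictionary layout and identifiers are hypothetical, and the sketch assumes one ground-truth video per query, which may differ from the benchmark's actual protocol.

```python
def recall_at_k(rankings, ground_truth, k=1):
    """Fraction of queries whose ground-truth video appears in the top-k results.

    rankings:     dict mapping query id -> list of video ids, best match first
    ground_truth: dict mapping query id -> the single relevant video id
    """
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1 for query, ranked in rankings.items() if ground_truth[query] in ranked[:k]
    )
    return hits / len(rankings)
```

With k=1 the metric only credits a query when the correct video is ranked first, so it is the strictest member of the Recall@k family.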

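[79] Calibrated Decomposition of Aleatoric and Epistemic Uncertainty in Deep Features for Inference-Time Adaptation cs.CV | stat.ML

Divake Kumar, Patrick Poggi, Sina Tayebati, Devashri Naik, Nilesh Ahuja

TL;DR: The paper proposes a lightweight inference-time framework that decomposes aleatoric (data-driven) and epistemic (model-driven) uncertainty in deep feature space, enabling more reliable inference-time adaptation and compute allocation.

Details

Motivation: Current estimators usually collapse all uncertainty modes into a single confidence score, which makes it impossible to judge reliably when to allocate more compute or adjust inference. A framework that explicitly separates the types of uncertainty is needed.

Result: On MOT17 the method cuts compute by about 60% with negligible accuracy loss. Ablations show that the proposed orthogonal uncertainty decomposition yields higher computational savings on every MOT17 sequence, improving by 13.6 percentage points over the total-uncertainty baseline.

Insight: Explicitly separating aleatoric and epistemic uncertainty guides inference-time adaptation and compute allocation more effectively, lowering compute cost substantially while preserving accuracy.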

Abstract: Most estimators collapse all uncertainty modes into a single confidence score, preventing reliable reasoning about when to allocate more compute or adjust inference. We introduce Uncertainty-Guided Inference-Time Selection, a lightweight inference-time framework that disentangles aleatoric (data-driven) and epistemic (model-driven) uncertainty directly in deep feature space. Aleatoric uncertainty is estimated using a regularized global density model, while epistemic uncertainty is formed from three complementary components that capture local support deficiency, manifold spectral collapse, and cross-layer feature inconsistency. These components are empirically orthogonal and require no sampling, no ensembling, and no additional forward passes. We integrate the decomposed uncertainty into a distribution-free conformal calibration procedure that yields significantly tighter prediction intervals at matched coverage. Using these components for uncertainty-guided adaptive model selection reduces compute by approximately 60 percent on MOT17 with negligible accuracy loss, enabling practical self-regulating visual inference. Additionally, our ablation results show that the proposed orthogonal uncertainty decomposition consistently yields higher computational savings across all MOT17 sequences, improving margins by 13.6 percentage points over the total-uncertainty baseline.
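
The distribution-free conformal calibration step can be illustrated with a generic split-conformal sketch. It is not the paper's procedure: dividing residuals by a per-sample uncertainty estimate `sigma` is an assumed way of plugging an uncertainty score into the calibration, and all names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile at miscoverage level alpha."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def calibrate(y_true, y_pred, sigma, alpha):
    """Compute the quantile of uncertainty-normalized absolute residuals
    on a held-out calibration set."""
    scores = [abs(y - p) / s for y, p, s in zip(y_true, y_pred, sigma)]
    return conformal_quantile(scores, alpha)


def predict_interval(y_pred, sigma, q):
    """Prediction interval whose width scales with the uncertainty estimate."""
    return y_pred - q * sigma, y_pred + q * sigma
```

Under the usual exchangeability assumption, intervals built this way cover the true value with probability at least 1 - alpha, and a more informative `sigma` yields tighter intervals at the same coverage.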

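[80] DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions cs.CV | cs.CL

Xiaoyu Lin, Aniket Ghorpade, Hansheng Zhu, Justin Qiu, Dea Rrozhani

TL;DR: DenseAnnotate is an efficient speech-based annotation platform for producing dense annotations of images and 3D scenes. It addresses the limits of conventional text annotation and yields significant model gains on multilingual and cultural alignment tasks.

Details

Motivation: Multimodal large language models (MLLMs) need high-quality training data, but conventional text annotation is slow and limited in expressiveness for dense annotation. DenseAnnotate tackles this with spoken annotation.

Result: The authors build a dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects. Models trained on it improve by 5% on multilingual tasks, 47% on cultural alignment, and 54% on 3D spatial capabilities.

Insight: Spoken annotation is an efficient route to dense annotation across many data types and tasks, and a feasible approach for future vision-language research.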

Abstract: With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing that capture only a fraction of an image’s visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curate a human-annotated multi-modal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual, 47% in cultural alignment, and 54% in 3D spatial capabilities. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.

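[81] MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting cs.CV | cs.AI

Xu Yang, Gady Agam

TL;DR: MSLoRA is a lightweight adapter that unifies adaptation of CNNs and ViTs through multi-scale low-rank adaptation and attention reweighting, significantly improving transfer performance while remaining parameter-efficient.

Details

Motivation: Existing low-rank adaptation methods are mostly confined to vision transformers (ViTs) and generalize poorly across architectures. MSLoRA aims to be a general adapter for both CNNs and ViTs that adapts efficiently by reweighting attention rather than re-tuning the backbone.

Result: Experiments show that MSLoRA improves performance on classification, detection, and segmentation tasks, with stable optimization, fast convergence, and strong cross-architecture generalization.

Insight: MSLoRA's attention reweighting offers a simple, general solution for efficiently adapting frozen backbones, with a good balance of parameter efficiency and performance.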

Abstract: We introduce MSLoRA, a backbone-agnostic, parameter-efficient adapter that reweights feature responses rather than re-tuning the underlying backbone. Existing low-rank adaptation methods are mostly confined to vision transformers (ViTs) and struggle to generalize across architectures. MSLoRA unifies adaptation for both convolutional neural networks (CNNs) and ViTs by combining a low-rank linear projection with a multi-scale nonlinear transformation that jointly modulates spatial and channel attention. The two components are fused through pointwise multiplication and a residual connection, yielding a lightweight module that shifts feature attention while keeping pretrained weights frozen. Extensive experiments demonstrate that MSLoRA consistently improves transfer performance on classification, detection, and segmentation tasks using less than about 5% of backbone parameters. The design further enables stable optimization, fast convergence, and strong cross-architecture generalization. By reweighting rather than re-tuning, MSLoRA provides a simple and universal approach for efficient adaptation of frozen vision backbones.

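[82] VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving cs.CV

Hyunki Seong, Seongwoo Moon, Hojin Ahn, Jehun Kang, David Hyunchul Shim

TL;DR: VLA-R is an open-world end-to-end autonomous driving framework that combines open-world perception with a vision-action retrieval paradigm to achieve strong generalization in unstructured environments.

Details

Motivation: End-to-end autonomous driving in unstructured environments often meets conditions unseen during training and needs strong generalization, which motivates a new method combining open-world perception with vision-action retrieval.

Result: On a real-world robotic platform the method shows strong generalization and exploratory performance in unstructured environments, even with limited data.

Insight: Aligning vision-language and action embeddings effectively improves the generalization and adaptability of end-to-end autonomous driving in the open world.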

Abstract: Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.

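[83] Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection cs.CV

Xi Xiao, Zhuxuanzi Wang, Mingqiao Mo, Chen Liu, Chenrui Ma

TL;DR: The paper proposes a self-supervised visual prompting method for cross-domain road damage detection that addresses domain adaptation and performs strongly in experiments.

Details

Motivation: Automated road defect detection is often held back by poor cross-domain generalization. Existing methods either require costly re-annotation or are sensitive to domain shift; the authors aim to improve on this with a self-supervised approach.

Result: The method outperforms supervised, self-supervised, and domain adaptation baselines on four benchmarks, with robust zero-shot transfer and data-efficient few-shot adaptation.

Insight: Self-supervised prompting is a scalable, adaptable approach to building visual inspection systems.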

Abstract: The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose a self-supervised framework that visually probes target domains without labels. It introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that the framework consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems. Source code is publicly available: https://github.com/xixiaouab/PROBE/tree/main

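[84] Towards Rotation-only Imaging Geometry: Rotation Estimation cs.CV

Xinrui Li, Qi Cai, Yuanxin Wu

TL;DR: The paper proposes a rotation-only representation of imaging geometry. By expressing translation as a function of rotation, it obtains a rotation-only optimization framework that significantly improves the accuracy and robustness of rotation estimation in SfM.

Details

Motivation: Pose-only imaging geometry, which decouples 3D coordinates from camera poses, performs well in SfM. This paper goes further and explores the key relationship between scene structure, rotation, and translation in order to simplify the representation of imaging geometry.

Result: Experiments show higher accuracy and robustness in rotation estimation than current state-of-the-art methods, with results comparable to multiple bundle adjustment iterations.

Insight: Expressing translation as a function of rotation simplifies the imaging-geometry optimization problem while keeping accuracy and robustness high, pointing toward more efficient and reliable 3D visual computing.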

Abstract: Structure from Motion (SfM) is a critical task in computer vision, aiming to recover the 3D scene structure and camera motion from a sequence of 2D images. The recent pose-only imaging geometry decouples 3D coordinates from camera poses and demonstrates significantly better SfM performance through pose adjustment. Continuing the pose-only perspective, this paper explores the critical relationship between the scene structures, rotation and translation. Notably, the translation can be expressed in terms of rotation, allowing us to condense the imaging geometry representation onto the rotation manifold. A rotation-only optimization framework based on reprojection error is proposed for both two-view and multi-view scenarios. The experiment results demonstrate superior accuracy and robustness performance over the current state-of-the-art rotation estimation methods, even comparable to multiple bundle adjustment iteration results. Hopefully, this work contributes to even more accurate, efficient and reliable 3D visual computing.

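[85] Seeing Through the Rain: Resolving High-Frequency Conflicts in Deraining and Super-Resolution via Diffusion Guidance cs.CV

Wenjie Li, Jinglei Shi, Jin Han, Heng Guo, Zhanyu Ma

TL;DR: The paper proposes DHGM, a diffusion-based high-frequency guided model that resolves the high-frequency conflict between deraining and super-resolution, removing rain artifacts while enhancing structural details.

Details

Motivation: Real-world images are often degraded by adverse weather. Conventional weather removal can lose high-frequency detail, and simply cascading deraining and super-resolution works poorly because their objectives conflict.

Result: In experiments DHGM outperforms existing methods, producing higher-quality images at lower computational cost.

Insight: Diffusion priors can resolve the conflict in high-frequency restoration, and combining them with high-frequency guidance enables finer image restoration.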

Abstract: Clean images are crucial for visual tasks such as small object detection, especially at high resolutions. However, real-world images are often degraded by adverse weather, and weather restoration methods may sacrifice high-frequency details critical for analyzing small objects. A natural solution is to apply super-resolution (SR) after weather removal to recover both clarity and fine structures. However, simply cascading restoration and SR struggles to bridge their inherent conflict: weather removal aims to remove high-frequency weather-induced noise, while SR aims to hallucinate high-frequency textures from existing details, leading to inconsistent restoration contents. In this paper, we take deraining as a case study and propose DHGM, a Diffusion-based High-frequency Guided Model for generating clean and high-resolution images. DHGM integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details. Extensive experiments demonstrate that DHGM achieves superior performance over existing methods at lower cost.

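[86] MFI-ResNet: Efficient ResNet Architecture Optimization via MeanFlow Compression and Selective Incubation cs.CV | cs.AI

Nuolin Sun, Linyuan Wang, Haonan Wei, Lei Li, Bin Yan

TL;DR: MFI-ResNet optimizes the ResNet architecture through MeanFlow compression and selective incubation, reducing parameter count while improving performance.

Details

Motivation: Within each stage, a conventional ResNet carries out its feature transformation through multiple residual blocks, which is not parameter-efficient. Inspired by the MeanFlow generative model, the authors use generative flow fields to improve ResNet's efficiency and performance.

Result: On CIFAR-10 and CIFAR-100, MFI-ResNet reduces parameters by 46.28% and 45.59% relative to ResNet-50 while improving accuracy by 0.23% and 0.17%, respectively.

Insight: Generative flow fields can effectively characterize the feature transformation in ResNet, offering a new perspective on the relationship between generative modeling and discriminative learning.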

Abstract: ResNet has achieved tremendous success in computer vision through its residual connection mechanism. ResNet can be viewed as a discretized form of ordinary differential equations (ODEs). From this perspective, the multiple residual blocks within a single ResNet stage essentially perform multi-step discrete iterations of the feature transformation for that stage. The recently proposed flow matching model, MeanFlow, enables one-step generative modeling by learning the mean velocity field to transform distributions. Inspired by this, we propose MeanFlow-Incubated ResNet (MFI-ResNet), which employs a compression-expansion strategy to jointly improve parameter efficiency and discriminative performance. In the compression phase, we simplify the multi-layer structure within each ResNet stage to one or two MeanFlow modules to construct a lightweight meta model. In the expansion phase, we apply a selective incubation strategy to the first three stages, expanding them to match the residual block configuration of the baseline ResNet model, while keeping the last stage in MeanFlow form, and fine-tune the incubated model. Experimental results show that on CIFAR-10 and CIFAR-100 datasets, MFI-ResNet achieves remarkable parameter efficiency, reducing parameters by 46.28% and 45.59% compared to ResNet-50, while still improving accuracy by 0.23% and 0.17%, respectively. This demonstrates that generative flow-fields can effectively characterize the feature transformation process in ResNet, providing a new perspective for understanding the relationship between generative modeling and discriminative learning.
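
The ODE view of a ResNet stage can be made concrete with a scalar toy example. The sketch treats each residual block as one explicit Euler step and then shows that a single step with the mean velocity over the stage reaches the same endpoint. In MeanFlow that mean velocity is learned by a network; here it is simply computed from the multi-step result, which is an assumption made purely for illustration, as are the function names.

```python
def residual_stage(x, velocity, num_blocks):
    """Apply num_blocks residual updates x <- x + h * velocity(x),
    i.e. explicit Euler steps over a unit time interval."""
    step = 1.0 / num_blocks
    for _ in range(num_blocks):
        x = x + step * velocity(x)
    return x


def mean_velocity(x0, velocity, num_blocks):
    """Average velocity over the unit interval that reproduces the stage output.
    A learned module would predict this directly from x0."""
    return residual_stage(x0, velocity, num_blocks) - x0


def one_step_stage(x0, u_mean):
    """Single update using the mean velocity instead of many small steps."""
    return x0 + u_mean
```

For example, with `velocity = lambda x: -x` and four blocks, the multi-step stage maps 1.0 to 0.75 ** 4, and the one-step update with the matching mean velocity lands on the same value. This is the sense in which several residual blocks might be compressed into one module.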

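[87] RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning cs.CV

Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu

TL;DR: RedVTP prunes visual tokens under the guidance of masked response tokens, accelerating diffusion vision-language model inference without training and significantly improving efficiency and throughput.

Details

Motivation: Diffusion vision-language models (DVLMs) support parallel token decoding, but their large number of visual tokens still seriously limits inference efficiency. Existing pruning work focuses on autoregressive VLMs and leaves DVLMs largely unexplored, so a pruning strategy designed for DVLMs is needed.

Result: RedVTP raises the token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and cuts inference latency by up to 64.97% and 21.87%, while maintaining and in some cases improving accuracy.

Insight: Visual token pruning for DVLMs should account for parallel decoding. Attention from masked response tokens is an effective pruning signal, and the stability of importance scores across steps makes efficient pruning possible.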

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across steps, RedVTP prunes the less important visual tokens from the masked tokens after the first inference step, thereby maximizing inference efficiency. Experiments show that RedVTP improves token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising (and in some cases improving) accuracy.
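
The attention-based importance estimate can be sketched as a simple top-k selection. The abstract does not give the exact scoring or aggregation rule, so averaging attention over masked response tokens, the `keep_ratio` parameter, and the function name are all assumptions.

```python
def prune_visual_tokens(attention, keep_ratio):
    """Select the visual tokens to keep after the first inference step.

    attention:  list of rows, one per masked response token; each row holds
                that token's attention weights over the visual tokens
    keep_ratio: fraction of visual tokens to retain (hypothetical parameter)
    Returns the kept visual-token indices in their original order.
    """
    num_visual = len(attention[0])
    # Importance of a visual token = mean attention it receives from the
    # masked response tokens (an assumed aggregation rule).
    importance = [
        sum(row[j] for row in attention) / len(attention) for j in range(num_visual)
    ]
    num_keep = max(1, int(round(keep_ratio * num_visual)))
    ranked = sorted(range(num_visual), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:num_keep])
```

Since the abstract reports that importance scores stay consistent across steps, a selection like this would only need to run once, and later denoising steps would operate on the shorter visual sequence.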

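[88] Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion cs.CV

Xilai Li, Xiaosong Li, Weijun Jiang

TL;DR: The paper proposes UP-Fusion, a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration. A Semantic-Aware Channel Pruning Module (SCPM) and a Geometric Affine Modulation Module (GAM) strengthen key features, and a Text-Guided Channel Perturbation Module (TCPM) refines decoding, significantly improving fusion performance.

Details

Motivation: Unified fusion models suffer from gradient conflicts when modality differences are large, which limits performance. Modality-specific encoders improve fusion quality but sacrifice generalization across tasks. A method is needed that suppresses redundant modal information while preserving the modal discriminability of the feature encoder.

Result: Experiments show that UP-Fusion significantly outperforms existing methods on multi-modality image fusion and downstream tasks.

Insight: Using the semantic capability of a pre-trained model to guide feature selection and perturbation effectively addresses gradient conflicts in multi-modality fusion while improving generalization and performance.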

Abstract: Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities for multi-modality image fusion, but large modality differences often cause gradient conflicts, limiting performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality. However, this strategy reduces generalisation across different fusion tasks. To overcome this limitation, we propose a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration (UP-Fusion). To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we propose the Geometric Affine Modulation Module (GAM), which uses original modal features to apply affine transformations on initial fusion features to maintain the modal discriminability of the feature encoder. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.

[89] Real-Time Drivers’ Drowsiness Detection and Analysis through Deep Learning cs.CV | cs.AI | cs.HC | cs.LG

ANK Zaman, Prosenjit Chatterjee, Rajat Sharma

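TL;DR: The paper presents a real-time driver drowsiness detection system based on deep convolutional neural networks (DCNNs) and OpenCV. It detects drowsiness by analyzing facial features such as eye and mouth movements, raises an alert on detection, and reports accuracies of 99.6% and 97%.

Details

Motivation: Long hours of driving can cause driver fatigue and endanger lives, so a non-invasive, low-cost real-time drowsiness detection system is needed to improve road safety.

Result: The system achieves drowsiness detection accuracies of 99.6% and 97% on the NTHU-DDD and Yawn-Eye-Dataset datasets, respectively.

Insight: 1. Facial feature detection is an effective way to detect drowsiness; 2. combining real-time image processing with deep learning enables high-accuracy detection; 3. the non-invasive design improves the system's practicality and acceptability.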

Abstract: A long road trip is fun for drivers. However, driving for days to meet stringent deadlines at distant destinations can be tedious. Such a scenario forces drivers to cover extra miles and spend extra hours on the road each day without sufficient rest and breaks, which occasionally triggers drowsiness while driving. Drowsiness at the wheel can be life-threatening to the driver and can affect other drivers’ safety; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger an alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs) and OpenCV. Our implemented model takes real-time facial images of a driver from a live camera and uses the OpenCV library in Python to examine the images for facial landmarks such as eye openness and yawn-like mouth movements. The DCNN framework then gathers the data and uses a pre-trained model to detect driver drowsiness from the facial landmarks. If the driver is identified as drowsy, the system issues a continuous alert in real time, embedded in the Smart Car technology. By potentially saving innocent lives on the roadways, the proposed technique offers a non-invasive and inexpensive way to identify drowsiness. Our DCNN-based drowsiness detection model achieves classification accuracies of 99.6% and 97% on the NTHU-DDD dataset and the Yawn-Eye-Dataset, respectively.

[90] CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training cs.CV

Jiahe Qian, Yuhao Shen, Zhangtianyi Chen, Juexiao Zhou, Peisong Wang

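TL;DR: CoTBox-TTT is a test-time training method for medical VQA. It uses a visual chain-of-thought signal to localize question-relevant regions and updates a small set of continuous soft prompts, improving accuracy and evidence grounding under domain shift.

Details

Motivation: Current medical VQA systems perform poorly under domain shift, and retraining or obtaining additional labels at deployment time is impractical, so answers end up weakly grounded in image evidence.

Result: On medical VQA datasets such as pathVQA, CoTBox-TTT raises LLaVA's closed-ended accuracy by 12.3%.

Insight: Dynamically adapting soft prompts together with a visual chain-of-thought signal can effectively improve generalization under domain shift while reducing reliance on additional labels.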

Abstract: Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label-free and plug-and-play with diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on pathVQA.

[91] MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding cs.CV | cs.AI | cs.IR | cs.LG

Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan

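TL;DR: MOON2.0 is a dynamic modality-balanced multimodal representation learning framework. With a mixture-of-experts module, a dual-level alignment method, and an image-text co-augmentation strategy, it addresses modality imbalance, underused alignment relationships, and data noise in e-commerce product understanding.

Details

Motivation: The rapid growth of e-commerce calls for multimodal models that understand rich visual and textual product information. Existing methods fall short in handling modality imbalance, alignment relationships, and noise.

Result: MOON2.0 achieves state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets, and heatmap visualizations show improved multimodal alignment.

Insight: Dynamic modality balancing and co-augmentation are effective tools for multimodal learning, especially for the noisy and imbalanced data found in e-commerce.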

Abstract: The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.

[92] DINO-Detect: A Simple yet Effective Framework for Blur-Robust AI-Generated Image Detection cs.CV | cs.LG

Jialiang Shen, Jiyang Zheng, Yunqi Xue, Huajie Chen, Yu Yao

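TL;DR: The paper proposes DINO-Detect, a blur-robust AI-generated image detection framework based on teacher-student knowledge distillation. With the teacher (DINOv3) weights frozen, the feature and logit responses it produces on sharp images are distilled into a student model, which then keeps its detection performance stable under blur.

Details

Motivation: AI-generated images (AIGI) in real-world settings are often affected by degradations such as motion blur, which sharply reduce the performance of conventional detectors, so a robust detection framework is needed for blurred conditions.

Result: DINO-Detect achieves state-of-the-art performance on both blurred and sharp images, confirming its robustness and generalization ability.

Insight: Distilling from a frozen teacher preserves the teacher's semantic representation ability while adapting the student to detection under blur, offering a new direction for practical AIGI detection.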

Abstract: With growing concerns over image authenticity and digital safety, the field of AI-generated image (AIGI) detection has progressed rapidly. Yet, most AIGI detectors still struggle under real-world degradations, particularly motion blur, which frequently occurs in handheld photography, fast motion, and compressed video. Such blur distorts fine textures and suppresses high-frequency artifacts, causing severe performance drops in real-world settings. We address this limitation with a blur-robust AIGI detection framework based on teacher-student knowledge distillation. A high-capacity teacher (DINOv3), trained on clean (i.e., sharp) images, provides stable and semantically rich representations that serve as a reference for learning. By freezing the teacher to maintain its generalization ability, we distill its feature and logit responses from sharp images to a student trained on blurred counterparts, enabling the student to produce consistent representations under motion degradation. Extensive experiments on benchmarks show that our method achieves state-of-the-art performance under both motion-blurred and clean conditions, demonstrating improved generalization and real-world applicability. Source codes will be released at: https://github.com/JiaLiangShen/Dino-Detect-for-blur-robust-AIGC-Detection.
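
A minimal sketch of a feature-plus-logit distillation objective of the kind this abstract describes may help. The abstract does not specify the loss, so everything concrete here is an assumption: mean squared error for the feature term, temperature-softened KL divergence for the logit term, and the hypothetical names `distill_loss`, `temperature`, and `alpha`.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(t_feat, s_feat, t_logits, s_logits, temperature=2.0, alpha=0.5):
    """Teacher sees the sharp image, student sees the blurred counterpart.

    Feature term: MSE between teacher and student features (assumed choice).
    Logit term:   KL(teacher || student) on softened distributions (assumed choice).
    """
    feat = sum((a - b) ** 2 for a, b in zip(t_feat, s_feat)) / len(t_feat)
    p = softmax(t_logits, temperature)  # frozen teacher distribution
    q = softmax(s_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return alpha * feat + (1 - alpha) * kl


# A student that matches the teacher has zero loss; a mismatched one is penalized.
loss_same = distill_loss([1.0, 2.0], [1.0, 2.0], [0.5, -0.5], [0.5, -0.5])
loss_diff = distill_loss([1.0, 2.0], [0.0, 2.0], [0.5, -0.5], [-0.5, 0.5])
print(loss_same, loss_diff)
```

Only the student would receive gradients from such a loss; the teacher outputs act as fixed targets, which matches the frozen-teacher setup in the abstract.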

[93] MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics cs.CV

Jing Li, Yifan Wang, Jiafeng Yan, Renlong Zhang, Bin Yang

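TL;DR: The paper proposes MdaIF, a multi-degradation-aware image fusion framework driven by a large language model. It addresses the degradation of visible images under adverse weather, which existing methods neglect, and improves fusion performance across multiple degradation scenarios through a mixture-of-experts (MoE) system and a vision-language model (VLM).

Details

Motivation: Existing infrared and visible image fusion methods do not effectively handle visible-image degradation under adverse weather, and their fixed network architectures adapt poorly to different degradation scenarios.

Result: Experiments show that MdaIF outperforms existing methods across multiple degradation scenarios.

Insight: Combining language-driven semantic information with a mixture-of-experts system can markedly improve the robustness and adaptability of image fusion under complex degradation conditions.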

Abstract: Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation of visible images under adverse weather conditions, thereby compromising fusion performance; and 2) rely on fixed network architectures, limiting their adaptability to diverse degradation scenarios. To address these issues, we propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). Given the distinct scattering characteristics of different degradation scenarios (e.g., haze, rain, and snow) in atmospheric transmission, a mixture-of-experts (MoE) system is introduced to tackle image fusion across multiple degradation scenarios. To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, collectively referred to as the semantic prior, we employ a pre-trained vision-language model (VLM) in our framework. Guided by the semantic prior, we propose a degradation-aware channel attention module (DCAM), which employs degradation prototype decomposition to facilitate multi-modal feature interaction in the channel domain. In addition, to achieve effective expert routing, the semantic prior and channel-domain modulated features are utilized to guide the MoE, enabling robust image fusion in complex degradation scenarios. Extensive experiments validate the effectiveness of our MdaIF, demonstrating superior performance over SOTA methods.

[94] ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding cs.CV

Yuan Zhou, Litao Hua, Shilong Jin, Wentao Huang, Haoran Duan

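TL;DR: ReaSon is a video keyframe selection framework that combines reinforcement learning with causal reasoning. It optimizes keyframe selection through a novel Causal Information Bottleneck (CIB) and significantly improves performance on video understanding tasks.

Details

Motivation: Because vision-language models (VLMs) have limited input tokens and relevant information is temporally sparse across video frames, keyframe selection is critical for video understanding. Existing methods fail to satisfy predictive sufficiency and causal necessity at the same time, which motivates the ReaSon framework.

Result: On the NExT-QA, EgoSchema, and Video-MME datasets, ReaSon significantly outperforms existing methods under limited-frame settings.

Insight: Combining causal reasoning with reinforcement learning is an effective route to better keyframe selection, and the Causal Information Bottleneck offers a new optimization direction for video understanding.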

Abstract: Keyframe selection has become essential for video understanding with vision-language models (VLMs) due to limited input tokens and the temporal sparsity of relevant information across video frames. Video understanding often relies on effective keyframes that are not only informative but also causally decisive. To this end, we propose Reinforced Causal Search with Information Bottleneck (ReaSon), a framework that formulates keyframe selection as an optimization problem with the help of a novel Causal Information Bottleneck (CIB), which explicitly defines keyframes as those satisfying both predictive sufficiency and causal necessity. Specifically, ReaSon employs a learnable policy network to select keyframes from a visually relevant pool of candidate frames to capture predictive sufficiency, and then assesses causal necessity via counterfactual interventions. Finally, a composite reward aligned with the CIB principle is designed to guide the selection policy through reinforcement learning. Extensive experiments on NExT-QA, EgoSchema, and Video-MME demonstrate that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, validating its effectiveness and generalization ability.

[95] EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis cs.CV

Yijie Guo, Dexiang Hong, Weidong Chen, Zihan She, Cheng Ye

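TL;DR: EmoVerse is an emotion representation dataset driven by multimodal large language models (MLLMs), designed to support interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. It contains 219k images, provides Background-Attribute-Subject (B-A-S) triplet decomposition and dual annotations (discrete and continuous emotion representations), and comes with a new model that maps visual cues and explains emotion attribution.

Details

Motivation: Progress in visual emotion analysis (VEA) has been limited by the lack of open-source, interpretable datasets. Existing studies usually assign a single discrete emotion label to the whole image, which does not reveal how visual elements shape emotion. EmoVerse fills this gap with multi-layered annotations.

Result: EmoVerse is large (219k images) and supports interpretable emotion analysis. Experiments show that the new model effectively maps visual cues and provides detailed attributions.

Insight: The layered emotion decomposition (B-A-S) and the dual annotation strategy (CES-DES) offer a new approach to interpretable emotion analysis and advance research on explainable high-level emotion understanding.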

Abstract: Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.

[96] Through-Foliage Surface-Temperature Reconstruction for early Wildfire Detection cs.CV

Mohamed Youssef, Lukas Brunner, Klaus Rundhammer, Gerald Czech, Oliver Bimber

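TL;DR: The paper proposes a method that combines signal processing and machine learning to reconstruct surface temperatures through occluding forest vegetation for early wildfire detection. It trains a visual state space model to recover subtle thermal signals from blurred data and uses a latent diffusion model to generate a large volume of realistic data.

Details

Motivation: To enable fully automated drone-based wildfire monitoring that detects ground fires early, without relying on visible smoke or flames.

Result: On simulated data, RMSE is reduced by a factor of 2 to 2.5; in field experiments, RMSE improves 12.8-fold over conventional thermal imaging and 2.6-fold over uncorrected SA imaging.

Insight: 1. Morphological characteristics are essential for classifying thermal signals; 2. the method reconstructs the complete morphology of fire and human thermal signatures even under partial occlusion.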

Abstract: We introduce a novel method for reconstructing surface temperatures through occluding forest vegetation by combining signal processing and machine learning. Our goal is to enable fully automated aerial wildfire monitoring using autonomous drones, allowing for the early detection of ground fires before smoke or flames are visible. While synthetic aperture (SA) sensing mitigates occlusion from the canopy and sunlight, it introduces thermal blur that obscures the actual surface temperatures. To address this, we train a visual state space model to recover the subtle thermal signals of partially occluded soil and fire hotspots from this blurred data. A key challenge was the scarcity of real-world training data. We overcome this by integrating a latent diffusion model into a vector quantized to generate a large volume of realistic surface temperature simulations from real wildfire recordings, which we further expanded through temperature augmentation and procedural thermal forest simulation. On simulated data across varied ambient and surface temperatures, forest densities, and sunlight conditions, our method reduced the RMSE by a factor of 2 to 2.5 compared to conventional thermal and uncorrected SA imaging. In field experiments focused on high-temperature hotspots, the improvement was even more significant, with a 12.8-fold RMSE gain over conventional thermal and a 2.6-fold gain over uncorrected SA images. We also demonstrate our model’s generalization to other thermal signals, such as human signatures for search and rescue. Since simple thresholding is frequently inadequate for detecting subtle thermal signals, the morphological characteristics are equally essential for accurate classification. Our experiments demonstrated another clear advantage: we reconstructed the complete morphology of fire and human signatures, whereas conventional imaging is defeated by partial occlusion.
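
The fold gains quoted in this abstract are read here as the ratio of a baseline RMSE to the RMSE of the reconstruction. The sketch below shows that arithmetic only; the temperature values are made up for illustration and are not the paper's data, and the names `rmse`, `blurred`, and `restored` are hypothetical.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


# Made-up surface temperatures in degrees Celsius: one hotspot among cool soil.
truth = [20.0, 20.0, 300.0, 20.0]
blurred = [60.0, 60.0, 140.0, 60.0]    # thermal blur spreads the hotspot out
restored = [30.0, 30.0, 260.0, 30.0]   # a reconstruction closer to the truth

# Fold gain = baseline RMSE divided by reconstructed RMSE.
gain = rmse(blurred, truth) / rmse(restored, truth)
print(f"{gain:.1f}-fold RMSE gain")
```

Note how the blurred signal both lowers the hotspot peak and raises its surroundings, which is why a plain threshold can miss it.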

[97] Beyond Pixels: Semantic-aware Typographic Attack for Geo-Privacy Protection cs.CV

Jiayi Zhu, Yihao Huang, Yue Cao, Xiaojun Jia, Qing Guo

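TL;DR: The paper proposes protecting geo-privacy with a semantics-aware typographic attack. Deceptive text added outside the image content disrupts geolocation inference by large visual language models (LVLMs), significantly lowering prediction accuracy while preserving visual quality.

Details

Motivation: LVLMs can infer geolocation from images that users share, causing privacy leakage. Conventional adversarial perturbations must distort the image heavily to protect privacy effectively, which visibly degrades visual quality. The paper seeks a solution that protects privacy while preserving visual appearance.

Result: Experiments on three datasets show that the method significantly reduces the geolocation prediction accuracy of five commercial LVLMs while preserving the visual quality of the images.

Insight: A semantics-aware typographic attack can protect geo-privacy effectively without noticeably affecting visual quality, providing a practical strategy against emerging privacy threats.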

Abstract: Large Visual Language Models (LVLMs) now pose a serious yet overlooked privacy threat, as they can infer a social media user’s geolocation directly from shared images, leading to unintended privacy leakage. While adversarial image perturbations provide a potential direction for geo-privacy protection, they require relatively strong distortions to be effective against LVLMs, which noticeably degrade visual quality and diminish an image’s value for sharing. To overcome this limitation, we identify typographical attacks as a promising direction for protecting geo-privacy by adding a text extension outside the visual content. We further investigate which textual semantics are effective in disrupting geolocation inference and design a two-stage, semantics-aware typographical attack that generates deceptive text to protect user privacy. Extensive experiments across three datasets demonstrate that our approach significantly reduces geolocation prediction accuracy of five state-of-the-art commercial LVLMs, establishing a practical and visually-preserving protection strategy against emerging geo-privacy threats.

[98] TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction cs.CV

Yukuo Ma, Cong Liu, Junke Wang, Junqi Liu, Haibin Huang

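TL;DR: TempoMaster generates long videos through next-frame-rate prediction. It first generates a low-frame-rate clip as a coarse blueprint of the video, then progressively raises the frame rate to refine details and motion continuity, while enabling efficient parallel synthesis.

Details

Motivation: Existing long video generation methods struggle with temporal consistency and efficiency; TempoMaster aims to address both through hierarchical frame-rate prediction.

Result: Experiments show that TempoMaster reaches the state of the art in both visual quality and temporal consistency for long video generation.

Insight: Hierarchical frame-rate prediction is an effective way to address efficiency and temporal consistency in long video generation.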

Abstract: We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
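
The coarse-to-fine ordering described in this abstract can be sketched as a schedule of frame indices. The abstract does not give the actual frame-rate levels, so the halving strides used here are an assumption, and `frame_rate_levels` is a hypothetical helper that covers only the ordering, not the generative model.

```python
def frame_rate_levels(num_frames, strides):
    """Return, for each frame-rate level, the frame indices first generated there.

    strides: coarse-to-fine sampling strides, e.g. [8, 4, 2, 1] (assumed schedule).
    Frames within one level would be generated jointly (bidirectional attention),
    while levels are produced one after another (autoregression across frame rates).
    """
    done = set()
    levels = []
    for stride in strides:
        new = [i for i in range(0, num_frames, stride) if i not in done]
        done.update(new)
        levels.append(new)
    return levels


levels = frame_rate_levels(16, [8, 4, 2, 1])
for stride, frames in zip([8, 4, 2, 1], levels):
    print(f"stride {stride}: {frames}")
```

With this schedule the first level fixes the overall timeline using only two frames, and each later level fills in the gaps conditioned on everything generated before it.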

[99] Rank-Aware Agglomeration of Foundation Models for Immunohistochemistry Image Cell Counting cs.CV

Zuqi Huang, Mengxin Tian, Huan Liu, Wentao Li, Baobao Liang

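TL;DR: The paper proposes a rank-aware agglomeration framework that selectively distills knowledge from multiple strong foundation models into a compact student model, CountIHC, and uses a global-to-local patch ranking strategy to tackle multi-class cell counting in immunohistochemistry images.

Details

Motivation: Accurate cell counting in immunohistochemistry images is critical for cancer diagnosis, but existing methods struggle with cell overlap, staining variability, and morphological diversity, and the potential of foundation models remains underexplored.

Result: CountIHC surpasses existing methods across 12 IHC biomarkers and 5 tissue types and agrees closely with pathologists' assessments. Its effectiveness on H&E-stained data further confirms the method's scalability.

Insight: 1) A task-aware teacher selection strategy outperforms conventional task-agnostic approaches; 2) vision-language alignment effectively handles counting of overlapping cells across multiple classes; 3) the potential of foundation models can be further unlocked through knowledge distillation.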

Abstract: Accurate cell counting in immunohistochemistry (IHC) images is critical for quantifying protein expression and aiding cancer diagnosis. However, the task remains challenging due to chromogen overlap, variable biomarker staining, and diverse cellular morphologies. Regression-based counting methods offer advantages over detection-based ones in handling overlapped cells, yet rarely support end-to-end multi-class counting. Moreover, the potential of foundation models remains largely underexplored in this paradigm. To address these limitations, we propose a rank-aware agglomeration framework that selectively distills knowledge from multiple strong foundation models, leveraging their complementary representations to handle IHC heterogeneity and obtain a compact yet effective student model, CountIHC. Unlike prior task-agnostic agglomeration strategies that either treat all teachers equally or rely on feature similarity, we design a Rank-Aware Teacher Selecting (RATS) strategy that models global-to-local patch rankings to assess each teacher’s inherent counting capacity and enable sample-wise teacher selection. For multi-class cell counting, we introduce a fine-tuning stage that reformulates the task as vision-language alignment. Discrete semantic anchors derived from structured text prompts encode both category and quantity information, guiding the regression of class-specific density maps and improving counting for overlapping cells. Extensive experiments demonstrate that CountIHC surpasses state-of-the-art methods across 12 IHC biomarkers and 5 tissue types, while exhibiting high agreement with pathologists’ assessments. Its effectiveness on H&E-stained data further confirms the scalability of the proposed method.
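
For readers unfamiliar with regression-based counting, the sketch below shows how a count is read off class-specific density maps: each map is summed over all pixels. The class names and the tiny 2x2 maps are hypothetical, and `count_from_density` is an illustrative helper rather than part of CountIHC.

```python
def count_from_density(density_maps):
    """Sum each class-specific density map to obtain a per-class cell count.

    density_maps: dict mapping a class name to a 2D list of non-negative values
                  whose total mass equals the number of cells of that class.
    """
    return {cls: sum(sum(row) for row in dmap) for cls, dmap in density_maps.items()}


# Toy 2x2 density maps for two hypothetical cell classes.
maps = {
    "positive": [[0.5, 0.5], [0.0, 1.0]],      # two cells, one spread over 2 pixels
    "negative": [[0.25, 0.25], [0.25, 0.25]],  # one cell spread over 4 pixels
}
counts = count_from_density(maps)
print(counts)
```

Because overlapping cells simply add their mass in the same region, the total stays correct even where individual cells cannot be separated, which is the advantage over detection-based counting that the abstract mentions.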

[100] Fine-Grained Representation for Lane Topology Reasoning cs.CV | cs.AI

Guoqing Xu, Yiheng Li, Yang Yang

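TL;DR: The paper proposes TopoFG, a fine-grained lane topology reasoning framework. Its three phases (HPE, RFD, RBTR) combine global spatial priors with local sequential priors to make lane topology modeling more precise and reliable, reaching state-of-the-art performance on the OpenLane-V2 benchmark.

Details

Motivation: Precise lane topology modeling is essential for autonomous driving, but existing methods built on a single query per lane struggle to model complex lane structures accurately.

Result: On the OpenLane-V2 benchmark, TopoFG achieves state-of-the-art performance (48.0% OLS on subsetA and 45.4% on subsetB).

Insight: Fine-grained query modeling that combines global and local priors copes effectively with complex lane topology; the denoising strategy for boundary-point topology reasoning markedly improves reliability.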

Abstract: Precise modeling of lane topology is essential for autonomous driving, as it directly impacts navigation and control decisions. Existing methods typically represent each lane with a single query and infer topological connectivity based on the similarity between lane queries. However, this kind of design struggles to accurately model complex lane structures, leading to unreliable topology prediction. In view of this, we propose a Fine-Grained lane topology reasoning framework (TopoFG). It divides the procedure from bird’s-eye-view (BEV) features to topology prediction via fine-grained queries into three phases, i.e., Hierarchical Prior Extractor (HPE), Region-Focused Decoder (RFD), and Robust Boundary-Point Topology Reasoning (RBTR). Specifically, HPE extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences to guide subsequent fine-grained query modeling. RFD constructs fine-grained queries by integrating the spatial and sequential priors. It then samples reference points in RoI regions of the mask and applies cross-attention with BEV features to refine the query representations of each lane. RBTR models lane connectivity based on boundary-point query features and further employs a topological denoising strategy to reduce matching ambiguity. By integrating spatial and sequential priors into fine-grained queries and applying a denoising strategy to boundary-point topology reasoning, our method precisely models complex lane structures and delivers trustworthy topology predictions. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoFG achieves new state-of-the-art performance, with an OLS of 48.0% on subsetA and 45.4% on subsetB.

[101] Seg-VAR: Image Segmentation with Visual Autoregressive Modeling cs.CV

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang

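TL;DR: Seg-VAR is a new framework that recasts image segmentation as a conditional autoregressive mask generation problem, combining multi-scale modeling with latent learning to significantly improve segmentation performance.

Details

Motivation: Visual autoregressive modeling (VAR) performs well in image generation, but its potential for segmentation, which requires precise spatial perception, has not been explored.

Result: Seg-VAR outperforms previous discriminative and generative methods on a range of segmentation tasks and validation benchmarks.

Insight: By modeling segmentation as a sequential hierarchical prediction problem, Seg-VAR opens a new path for integrating autoregressive reasoning into spatial-aware vision systems.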

Abstract: While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems. Code will be available at https://github.com/rkzheng99/Seg-VAR.

[102] LoRA-Enhanced Vision Transformer for Single Image based Morphing Attack Detection via Knowledge Distillation from EfficientNet cs.CV

Ria Shekhawat, Sushrut Patwardhan, Raghavendra Ramachandra, Praveen Kumar Chandaliya, Kishor P. Upla

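TL;DR: The paper proposes a single-image morphing attack detection method that combines LoRA with knowledge distillation, using a CNN teacher to guide a ViT student and improving both efficiency and accuracy.

Details

Motivation: Face recognition systems are vulnerable to morphing attacks (synthetic images that blend the biometric features of several people), so efficient and accurate single-image detection methods are needed.

Result: Experiments show that the method outperforms existing techniques in both detection performance and computational efficiency.

Insight: LoRA can substantially reduce the cost of fine-tuning a ViT, and knowledge distillation can improve the detection ability of a small model.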

Abstract: Face Recognition Systems (FRS) are critical for security but remain vulnerable to morphing attacks, where synthetic images blend biometric features from multiple individuals. We propose a novel Single-Image Morphing Attack Detection (S-MAD) approach using a teacher-student framework, where a CNN-based teacher model refines a ViT-based student model. To improve efficiency, we integrate Low-Rank Adaptation (LoRA) for fine-tuning, reducing computational costs while maintaining high detection accuracy. Extensive experiments are conducted on a morphing dataset built from three publicly available face datasets, incorporating ten different morphing generation algorithms to assess robustness. The proposed method is benchmarked against six state-of-the-art S-MAD techniques, demonstrating superior detection performance and computational efficiency.
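
As background for the LoRA component, the sketch below shows the standard low-rank update, in which a frozen weight W is adapted as W + (alpha / r) * B A with only A and B trained. It is a generic illustration in plain Python, not the authors' implementation; the matrix sizes and values are made up, and B is given nonzero entries only so the effect is visible (LoRA normally initializes B to zero).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A for a rank-r adapter.

    w: frozen weight, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2x3 weight
a = [[1.0, 2.0, 3.0]]                   # rank r = 1, shape 1x3
b = [[0.5], [0.0]]                      # shape 2x1
w_eff = lora_weight(w, a, b, alpha=2.0)
print(w_eff)

# Trainable parameters: r * (in + out) instead of in * out.
trainable = len(a) * (len(w[0]) + len(w))
print(trainable, "trainable vs", len(w) * len(w[0]), "full")
```

The saving is negligible at this toy size but grows quickly for the large projection matrices in a ViT, which is the source of the reduced fine-tuning cost the abstract refers to.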

[103] Pixels or Positions? Benchmarking Modalities in Group Activity Recognition cs.CV

Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem

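TL;DR: The paper introduces SoccerNet-GAR, a multimodal dataset built from the matches of the 2022 World Cup, and compares video and tracking data for group activity recognition (GAR). It finds that tracking-based models beat video models in both accuracy and efficiency.

Details

Motivation: GAR has relied mainly on the video modality, while tracking data, a more compact and explicit signal of spatial interactions, remains under-explored. The lack of a standardized multimodal benchmark prevents a fair comparison between modalities.

Result: The tracking model reaches 67.2% balanced accuracy versus 58.1% for the video model, trains 4.25 times faster, and has 438 times fewer parameters (197K vs. 86.3M).

Insight: Tracking data outperforms video for GAR and is more efficient; role-aware modeling is key to capturing tactical structure. This points to new research directions on modality choice.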

Abstract: Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e. tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities lead to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the 64 matches of the football World Cup 2022. Specifically, the broadcast videos and player tracking modalities for 94,285 group activities are synchronized and annotated with 10 categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) competitive video-based classifiers and (ii) tracking-based classifiers leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges and temporal attention. Our tracking model achieves 67.2% balanced accuracy compared to 58.1% for the best video baseline, while training 4.25× faster with 438× fewer parameters (197K vs. 86.3M). This study provides new insights into the relative strengths of pixels and positions for group activity recognition. Overall, it highlights the importance of modality choice and role-aware modeling for GAR.
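
Balanced accuracy, the metric reported in this abstract, is the mean of per-class recall, which keeps frequent classes from dominating the score. The sketch below computes it on made-up labels (the class names "pass" and "shot" are hypothetical, not the dataset's categories) and checks the quoted parameter ratio from the two model sizes given in the abstract.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: 8 samples of one class, 2 of another.
y_true = ["pass"] * 8 + ["shot"] * 2
y_pred = ["pass"] * 8 + ["pass", "shot"]

acc = balanced_accuracy(y_true, y_pred)  # (8/8 + 1/2) / 2
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc, plain)

# Parameter ratio implied by the abstract: 86.3M (video) vs. 197K (tracking).
param_ratio = 86.3e6 / 197e3
print(round(param_ratio))
```

The gap between the balanced and plain scores on the toy labels shows why balanced accuracy is the safer choice when activity classes are unevenly represented.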

[104] OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding cs.CV | cs.AI | cs.LG | cs.RO

Artem Moroz, Vít Zeman, Martin Mikšík, Elizaveta Isianova, Miroslav David

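TL;DR: OPFormer is a unified end-to-end framework that combines object detection and pose estimation, using a foundation model and geometric encoding to achieve high-accuracy 6D pose estimation.

Details

Motivation: Conventional object pose estimation methods perform poorly when models are missing or scenes are complex. A unified framework is needed that seamlessly combines detection and pose estimation and supports both model-based and model-free scenarios.

Result: The system shows high accuracy and efficiency on the BOP benchmarks and applies to both model-based and model-free scenarios.

Insight: Combining a foundation model with geometric encoding can markedly improve the robustness and accuracy of pose estimation, especially when models are missing or scenes are complex.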

Abstract: We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation with a versatile onboarding process. Our pipeline begins with an onboarding stage that generates object representations from either traditional 3D CAD models or, in their absence, by rapidly reconstructing a high-fidelity neural representation (NeRF) from multi-view images. Given a test image, our system first employs the CNOS detector to localize target objects. For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It uniquely learns a comprehensive object representation by jointly encoding multiple template views and enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes robust 2D-3D correspondences to determine the final pose. Evaluated on the challenging BOP benchmarks, our integrated system demonstrates a strong balance between accuracy and efficiency, showcasing its practical applicability in both model-based and model-free scenarios.

[105] Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation cs.CV | cs.AI

Yushe Cao, Dianxi Shi, Xing Fu, Xuechao Zou, Haikuo Peng

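TL;DR: The paper proposes MDiTFace, a multivariate diffusion transformer framework. Through a decoupled attention mechanism and a unified feature processing strategy, it significantly improves the quality of face generation conditioned on semantic masks and text while greatly reducing computational overhead.

Details

Motivation: Conventional multimodal feature fusion methods struggle to achieve effective cross-modal interaction in facial generation, leading to poor results. The paper aims to address this challenge.

Result: Experiments show that MDiTFace outperforms other methods in generation quality and conditional consistency while cutting the computational overhead introduced by the mask condition by over 94%.

Insight: A decoupled attention mechanism with separate dynamic and static computation pathways can markedly improve the efficiency and performance of multimodal generation, offering a new approach for similar tasks.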

Abstract: While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, thereby leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace, a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism segregates internal computations into dynamic and static pathways, enabling caching and reuse of features computed in static pathways after initial calculation, thereby reducing the additional computational overhead introduced by the mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.

[106] Denoising Vision Transformer Autoencoder with Spectral Self-Regularization cs.CV

Xunzhi Xiang, Xingye Tian, Guiyu Zhang, Yabo Chen, Shaofeng Zhang

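TL;DR: The paper proposes a ViT-based autoencoder (Denoising-VAE) that uses a spectral self-regularization strategy to reduce high-frequency noise and improve generative model performance, achieving strong reconstruction and generation results on the ImageNet benchmark.

Details

Motivation: Conventional VAEs carry redundant high-frequency noise in high-dimensional latent spaces, which hurts the training convergence and generation quality of diffusion models. A solution that does not depend on external vision foundation models (VFMs) is needed.

Result: On the ImageNet 256x256 benchmark, the method achieves state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82), and diffusion models converge about 2x faster than with SD-VAE.

Insight: High-frequency noise is a key obstacle to optimizing generative models in high-dimensional latent spaces; spectral regularization can markedly improve training efficiency and quality.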

Abstract: Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.

[107] Medical Knowledge Intervention Prompt Tuning for Medical Image Classification cs.CV

Ye Du, Nanxi Yu, Shujun Wang

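TL;DR: The paper proposes CILMP, a prompt tuning method that combines large language models (LLMs) with vision-language foundation models (VLMs) for medical image classification and improves performance through medical knowledge intervention.

Details

Motivation: Existing prompt tuning methods struggle to distinguish different medical concepts in medical image classification. To address this limitation, the authors use the specialized medical knowledge provided by LLMs to enhance VLM prompt tuning.

Result: Experiments show that CILMP outperforms existing prompt tuning methods across diverse medical image datasets.

Insight: LLMs can serve as an effective source of medical knowledge; using that knowledge to intervene in VLM prompt tuning can markedly improve medical image classification performance.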

Abstract: Vision-language foundation models (VLMs) have shown great potential in feature transfer and generalization across a wide spectrum of medical-related downstream tasks. However, fine-tuning these models is resource-intensive due to their large number of parameters. Prompt tuning has emerged as a viable solution to mitigate memory usage and reduce training time while maintaining competitive performance. Nevertheless, the challenge is that existing prompt tuning methods cannot precisely distinguish different kinds of medical concepts, which miss essentially specific disease-related features across various medical imaging modalities in medical image classification tasks. We find that Large Language Models (LLMs), trained on extensive text corpora, are particularly adept at providing this specialized medical knowledge. Motivated by this, we propose incorporating LLMs into the prompt tuning process. Specifically, we introduce the CILMP, Conditional Intervention of Large Language Models for Prompt Tuning, a method that bridges LLMs and VLMs to facilitate the transfer of medical knowledge into VLM prompts. CILMP extracts disease-specific representations from LLMs, intervenes within a low-rank linear subspace, and utilizes them to create disease-specific prompts. Additionally, a conditional mechanism is incorporated to condition the intervention process on each individual medical image, generating instance-adaptive prompts and thus enhancing adaptability. Extensive experiments across diverse medical image datasets demonstrate that CILMP consistently outperforms state-of-the-art prompt tuning methods, demonstrating its effectiveness. Code is available at https://github.com/usr922/cilmp.

[108] DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry cs.CV

Cheng Liao

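TL;DR: DPVO-QAT++ is a heterogeneous quantization optimization framework that combines learnable scale parameterization, a heterogeneous precision design, and GPU-native kernel fusion to markedly improve the speed and efficiency of Deep Patch Visual Odometry while preserving trajectory accuracy.

Details

Motivation: Existing deep learning-based visual SLAM systems have high computational overhead and are hard to deploy on resource-constrained autonomous platforms.

Result: The framework achieves average FPS increases of 52.1% on TartanAir and 30.1% on EuRoC, while maintaining trajectory accuracy comparable to the original model.

Insight: Heterogeneous quantization design and GPU kernel fusion are effective ways to improve the efficiency of deep learning-based visual SLAM.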

Abstract: Deep learning-based Visual SLAM (vSLAM) systems exhibit exceptional geometric reasoning capabilities, yet their prohibitive computational overhead severely restricts deployment on resource-constrained autonomous platforms. This paper presents a hierarchical quantization optimization framework, DPVO-QAT++ (DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry). Through the synergistic integration of learnable scale parameterization, a heterogeneous precision design for the Visual Odometry (VO) front-end and back-end (front-end floating-point fake quantization with FP16/FP32; back-end full precision), and GPU-native kernel fusion for fake quantization (custom CUDA kernels), our framework significantly reduces memory footprint and increases processing speed while preserving the trajectory accuracy of the original model. On the TartanAir dataset, our framework achieves an average FPS increase of 52.1%, a 29.1% reduction in median latency, and a 64.9% reduction in peak GPU memory reservation, while maintaining trajectory accuracy (ATE) comparable to the original DPVO model across 32 validation sequences. On the EuRoC dataset, it realizes an average FPS increase of 30.1%, a 23.1% reduction in median latency, and a 37.7% reduction in peak GPU memory reservation, maintaining comparable trajectory accuracy (ATE) across 11 validation sequences. Experimental results demonstrate that DPVO-QAT++ effectively bridges the gap between high-precision deep VO and the efficiency requirements for practical deployment, offering a viable engineering paradigm for the application of this technology on real-world embedded platforms. Keywords: Visual Odometry, Heterogeneous Precision Architecture, Quantization-Aware Training, CUDA Kernel Fusion, Scale-Only Training, Deep Patch Visual Odometry, GPU-Native Kernel Fusion.
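
The fake quantization mentioned in the abstract above can be illustrated with a minimal sketch: values are snapped to an integer grid and immediately mapped back to floating point. The function name `fake_quantize`, the symmetric signed 8-bit grid, and the fixed per-tensor scale are assumptions made for illustration; in quantization-aware training the scale would typically be a learnable parameter, and this is not the paper's CUDA kernel.

```python
def fake_quantize(values, scale, num_bits=8):
    """Quantize then dequantize, so the output stays floating point.

    Assumptions (not from the paper): a symmetric signed integer grid and
    a single per-tensor scale supplied by the caller.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    out = []
    for v in values:
        q = round(v / scale)           # snap to the nearest grid point
        q = max(qmin, min(qmax, q))    # clamp to the representable range
        out.append(q * scale)          # map back to a float
    return out
```

With `scale=0.5`, a value such as 100.0 saturates at the top of the 8-bit grid (127 * 0.5), which is the clipping error a well-chosen scale is meant to limit.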

[109] Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis cs.CV

Zeqin Yu, Haotao Xie, Jian Zhang, Jiangqun Ni, Wenkan Su

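TL;DR: The paper proposes Fourier Series-based Tampering Synthesis (FSTS), which synthesizes tampered text image data in a structured and interpretable way and addresses the poor generalization that existing methods suffer because of the distribution gap between synthetic and real data.

Details

Motivation: Existing Text Image Forgery Localization (T-IFL) methods generalize poorly because real-world datasets are limited in scale and synthetic data fails to reflect the complexity of real tampering.

Result: Experiments across four evaluation protocols show that models trained with FSTS data achieve significantly improved generalization on real-world datasets.

Insight: Structured analysis and hierarchical modeling markedly improve the quality of synthetic data and bring it closer to the real tampering distribution. The dataset has been released publicly.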

Abstract: Existing Text Image Forgery Localization (T-IFL) methods often suffer from poor generalization due to the limited scale of real-world datasets and the distribution gap caused by synthetic data that fails to capture the complexity of real-world tampering. To tackle this issue, we propose Fourier Series-based Tampering Synthesis (FSTS), a structured and interpretable framework for synthesizing tampered text images. FSTS first collects 16,750 real-world tampering instances from five representative tampering types, using a structured pipeline that records human-performed editing traces via multi-format logs (e.g., video, PSD, and editing logs). By analyzing these collected parameters and identifying recurring behavioral patterns at both individual and population levels, we formulate a hierarchical modeling framework. Specifically, each individual tampering parameter is represented as a compact combination of basis operation-parameter configurations, while the population-level distribution is constructed by aggregating these behaviors. Since this formulation draws inspiration from the Fourier series, it enables an interpretable approximation using basis functions and their learned weights. By sampling from this modeled distribution, FSTS synthesizes diverse and realistic training data that better reflect real-world forgery traces. Extensive experiments across four evaluation protocols demonstrate that models trained with FSTS data achieve significantly improved generalization on real-world datasets. Dataset is available at https://github.com/ZeqinYu/FSTS.

[110] Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans cs.CV

Hongbin Huang, Junwei Li, Tianxin Xie, Zhuang Li, Cekai Weng

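TL;DR: Hi-Reco is a high-fidelity, real-time conversational digital human system that combines a 3D avatar, expressive speech synthesis, and knowledge-grounded dialogue generation, and uses an asynchronous execution pipeline to achieve low-latency interaction.

Details

Motivation: Digital humans are widely used in interactive applications, but achieving both visual realism and real-time responsiveness remains a major challenge. The paper aims to address this problem.

Result: The system delivers high-fidelity visuals and real-time interaction, and is suitable for immersive applications such as education and entertainment.

Insight: Asynchronous pipeline design is key to low-latency systems, and retrieval-augmented methods markedly improve dialogue quality.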

Abstract: High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.

[111] DensePercept-NCSSD: Vision Mamba towards Real-time Dense Visual Perception with Non-Causal State Space Duality cs.CV

Tushar Anand, Advik Sinha, Abhijit Das

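TL;DR: The paper proposes DensePercept-NCSSD, a real-time, accurate optical flow and disparity estimation model that fuses input images in a non-causal selective state space and improves the efficiency of dense perception tasks.

Details

Motivation: Dense visual perception tasks such as optical flow and disparity estimation face a speed-accuracy trade-off in real-time applications. Conventional methods are often too computationally expensive for real-time use, so a model that is both accurate and efficient is needed.

Result: Experimental validation in real-life scenarios shows that the model generates optical flow and disparity maps efficiently and is suited to real-time 3D dense perception tasks.

Insight: By introducing a non-causal state space and efficient Mamba blocks, the paper shows that real-time speed and high accuracy are both achievable in dense perception tasks, which offers a new direction for model design in related areas.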

Abstract: In this work, we propose an accurate and real-time optical flow and disparity estimation model by fusing pairwise input images in the proposed non-causal selective state space for dense perception tasks. We propose a non-causal Mamba block-based model that is fast and efficient and aptly manages the constraints present in real-time applications. Our proposed model reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results, analysis, and validation in real-life scenarios show that our proposed model can be used for unified real-time and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at https://github.com/vimstereo/DensePerceptNCSSD

[112] BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections cs.CV | cs.AI

Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere

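TL;DR: BridgeEQA is an open-vocabulary Embodied Question Answering (EQA) benchmark focused on real bridge inspection scenarios. It contains 2,200 question-answer pairs across 200 real-world bridge scenes and introduces a new EQA metric, Image Citation Relevance. To close the performance gap of existing models, the paper proposes EMVR, which reasons through sequential navigation over an image-based scene graph.

Details

Motivation: The EQA field lacks benchmarks that faithfully reflect real operating conditions, particularly where multi-scale reasoning, long-range spatial understanding, and complex semantic relationships are required. Infrastructure inspection, such as bridge inspection, is a well-suited domain because it offers standardized condition ratings and rich professional reports.

Result: EMVR shows strong performance over existing vision-language model baselines on BridgeEQA, with advantages in synthesizing evidence across multiple images and aligning answers with condition ratings.

Insight: Infrastructure inspection offers unique evaluation advantages for EQA, and scene graph-based navigation is an effective approach to multi-image reasoning.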

Abstract: Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.

[113] R$^{2}$Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection cs.CV | cs.AI

Shuaike Shen, Ke Liu, Jiaqing Xie, Shangde Gao, Chunhua Shen

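TL;DR: The paper proposes R²Seg, a training-free framework for OOD medical tumor segmentation that markedly improves segmentation performance through a two-stage process of anatomical reasoning and statistical rejection.

Details

Motivation: Existing foundation models perform poorly on out-of-distribution (OOD) medical image segmentation and tend to produce fragmented false positives, so a robust solution that requires no training is needed.

Result: On multi-center and multi-modal tumor segmentation benchmarks, R²Seg substantially improves Dice, specificity, and sensitivity over baseline methods and the original foundation models.

Insight: R²Seg shows the potential of combining high-level reasoning with statistical methods to improve OOD segmentation without training, and offers a new approach to the false positive problem in medical image segmentation.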

Abstract: Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R$^{2}$Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R$^{2}$Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models. Code is available at https://github.com/Eurekashen/R2Seg.
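
The Reject step described in the abstract above relies on two-sample statistical testing. The abstract does not say which test is used, so the sketch below substitutes a simple permutation test on the difference of mean intensities. The function names, the significance level `alpha`, and the fixed random seed are all assumptions for illustration, not the authors' implementation.

```python
import random
from statistics import mean


def permutation_p_value(candidate, reference, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(candidate) - mean(reference))
    pooled = list(candidate) + list(reference)
    n = len(candidate)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def keep_candidate(candidate, reference, alpha=0.05):
    """Retain a candidate region only if it differs significantly from normal tissue."""
    return permutation_p_value(candidate, reference) < alpha
```

Here `candidate` and `reference` stand for intensity samples from a proposed tumor region and from normal tissue; a candidate whose intensities are statistically indistinguishable from the reference would be rejected as a likely false positive.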

[114] HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models cs.CV | cs.AI

Sushant Gautam, Michael A. Riegler, Pål Halvorsen

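TL;DR: HEDGE is a unified hallucination detection framework that combines visual perturbations, semantic clustering, and uncertainty metrics into a reproducible pipeline applicable across multimodal architectures.

Details

Motivation: Although vision-language models (VLMs) support open-ended visual question answering, they are prone to hallucinated answers, so a systematic method is needed to detect and evaluate these hallucinations.

Result: Experiments show that hallucination detectability is highest for unified-fusion models (e.g., Qwen2.5-VL) and lowest for architectures with restricted tokenization (e.g., Med-Gemma). The VASE metric is the most robust across configurations.

Insight: Hallucination detection can be framed as a geometric robustness problem shaped jointly by sampling scale, prompt design, model architecture, and clustering strategy. Concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses.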

Abstract: Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas NLI-based clustering remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ~ 10-15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses. By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE .
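
The abstract above combines sampling, semantic clustering, and uncertainty metrics but does not define VASE. As a generic illustration of how clustered samples can yield an uncertainty score, the sketch below computes the Shannon entropy of the cluster assignments of sampled answers. The function name and the use of plain cluster entropy are assumptions; this is not the VASE metric or the hedge-bench API.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over clusters.

    cluster_ids: one cluster label per sampled answer (hypothetical input,
    e.g. produced by an embedding- or entailment-based clustering step).
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

When every sampled answer falls into the same semantic cluster the entropy is zero; answers scattered over many clusters give a higher value, which is the kind of signal a hallucination detector can threshold.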

[115] X-VMamba: Explainable Vision Mamba cs.CV | cs.LG | math.DS

Mohamed A. Mabrok, Yalda Zafari

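TL;DR: The paper proposes a controllability-based interpretability framework for analyzing the internal state dynamics of vision SSMs (state space models) and revealing how they process spatial information. The framework comprises a Jacobian-based and a Gramian-based method, requires no architectural changes or hyperparameter tuning, and applies across domains.

Details

Motivation: Although SSMs, particularly the Mamba architecture, perform well in sequence modeling, they lack transparent attention-like mechanisms, which makes their spatial information processing hard to understand. The paper proposes a controllability-driven interpretability framework to fill this gap.

Result: Experiments on three medical imaging modalities show that SSMs naturally implement hierarchical feature refinement from low-level textures to clinically meaningful patterns. The analysis also reveals domain-specific controllability signatures and the influence of scanning strategies.

Insight: 1. SSMs have an internal hierarchical feature refinement mechanism; 2. controllability analysis can serve as a unified interpretability paradigm for SSMs; 3. scanning strategies substantially influence attention patterns.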

Abstract: State Space Models (SSMs), particularly the Mamba architecture, have recently emerged as powerful alternatives to Transformers for sequence modeling, offering linear computational complexity while achieving competitive performance. Yet, despite their effectiveness, understanding how these Vision SSMs process spatial information remains challenging due to the lack of transparent, attention-like mechanisms. To address this gap, we introduce a controllability-based interpretability framework that quantifies how different parts of the input sequence (tokens or patches) influence the internal state dynamics of SSMs. We propose two complementary formulations: a Jacobian-based method applicable to any SSM architecture that measures influence through the full chain of state propagation, and a Gramian-based approach for diagonal SSMs that achieves superior speed through closed-form analytical solutions. Both methods operate in a single forward pass with linear complexity, requiring no architectural modifications or hyperparameter tuning. We validate our framework through experiments on three diverse medical imaging modalities, demonstrating that SSMs naturally implement hierarchical feature refinement from diffuse low-level textures in early layers to focused, clinically meaningful patterns in deeper layers. Our analysis reveals domain-specific controllability signatures aligned with diagnostic criteria, progressive spatial selectivity across the network hierarchy, and the substantial influence of scanning strategies on attention patterns. Beyond medical imaging, we articulate applications spanning computer vision, natural language processing, and cross-domain tasks. Our framework establishes controllability analysis as a unified, foundational interpretability paradigm for SSMs across all domains. Code and analysis tools will be made available upon publication
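
To make the closed-form claim for diagonal SSMs in the abstract above concrete, the sketch below uses the textbook infinite-horizon controllability Gramian of a discrete-time linear system x[k+1] = diag(a) x[k] + b u[k], whose entries are W[i][j] = b_i b_j / (1 - a_i a_j) when every |a_i| < 1. This is the standard control-theory definition, assumed here for illustration; Mamba's state matrices are input-dependent, and the paper's exact formulation may differ. Function names are hypothetical.

```python
def gramian_closed_form(a, b):
    """Closed-form controllability Gramian for a stable diagonal system."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    if any(abs(ai) >= 1.0 for ai in a):
        raise ValueError("every diagonal entry must satisfy |a_i| < 1")
    n = len(a)
    return [[b[i] * b[j] / (1.0 - a[i] * a[j]) for j in range(n)]
            for i in range(n)]


def gramian_truncated_sum(a, b, steps=200):
    """Reference value: sum over k of (A^k b)(A^k b)^T, truncated after `steps` terms."""
    n = len(a)
    w = [[0.0] * n for _ in range(n)]
    for k in range(steps):
        col = [(a[i] ** k) * b[i] for i in range(n)]  # A^k b for diagonal A
        for i in range(n):
            for j in range(n):
                w[i][j] += col[i] * col[j]
    return w
```

The closed form replaces the iterative sum with one division per entry, which is the kind of saving that makes a Gramian-based analysis fast for diagonal systems.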

[116] Counting Through Occlusion: Framework for Open World Amodal Counting cs.CV

Safaeid Hossain Arib, Rabeya Akter, Abdul Monaf Chowdhury, Md Jubair Ahmed Sourov, Md Mehedi Hasan

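TL;DR: CountOCC is an amodal counting framework that reconstructs the features of occluded objects through multimodal guidance, addressing the failure of existing counting methods under occlusion.

Details

Motivation: Existing object counting methods degrade markedly under occlusion because backbone networks tend to encode occluding surfaces rather than target objects, which corrupts the feature representations.

Result: CountOCC achieves substantial MAE reductions on FSC 147, CARPK, and CAPTUREReal, demonstrating robustness across diverse scenes.

Insight: Explicitly synthesizing multimodal features for occluded regions is an effective way to solve counting under occlusion.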

Abstract: Object counting has achieved remarkable success on visible instances, yet state-of-the-art (SOTA) methods fail under occlusion, a pervasive challenge in real world deployment. This failure stems from a fundamental architectural limitation where backbone networks encode occluding surfaces rather than target objects, thereby corrupting the feature representations required for accurate enumeration. To address this, we present CountOCC, an amodal counting framework that explicitly reconstructs occluded object features through hierarchical multimodal guidance. Rather than accepting degraded encodings, we synthesize complete representations by integrating spatial context from visible fragments with semantic priors from text and visual embeddings, generating class-discriminative features at occluded locations across multiple pyramid levels. We further introduce a visual equivalence objective that enforces consistency in attention space, ensuring that both occluded and unoccluded views of the same scene produce spatially aligned gradient-based attention maps. Together, these complementary mechanisms preserve discriminative properties essential for accurate counting under occlusion. For rigorous evaluation, we establish occlusion-augmented versions of FSC 147 and CARPK spanning both structured and unstructured scenes. CountOCC achieves SOTA performance on FSC 147 with 26.72% and 20.80% MAE reduction over prior baselines under occlusion in validation and test, respectively. CountOCC also demonstrates exceptional generalization by setting new SOTA results on CARPK with 49.89% MAE reduction and on CAPTUREReal with 28.79% MAE reduction, validating robust amodal counting across diverse visual domains. Code will be released soon.

[117] FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling cs.CV

Kaiser Hamid, Can Cui, Khandakar Ashrafi Akbar, Ziran Wang, Nade Liang

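TL;DR: FSDAM is a framework for few-shot driver attention modeling via vision-language coupling. With only about 100 annotated examples it jointly predicts attention distributions and generates captions, and it performs well in zero-shot settings.

Details

Motivation: Existing driver attention models rely on large-scale annotated datasets, which are costly to collect and annotate. FSDAM aims to solve this through few-shot learning while supporting both attention prediction and semantic explanation generation.

Result: FSDAM achieves competitive attention prediction, generates coherent and context-aware explanations, and generalizes zero-shot across multiple driving benchmarks.

Insight: Few-shot learning and cross-modal alignment are key to efficient, explainable driver attention modeling and show promise for data-constrained scenarios.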

Abstract: Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture where separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive performance on attention prediction and generates coherent, context-aware explanations. The model demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.

Ankita Raj, Chetan Arora

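TL;DR: The paper presents the first study of backdoor attacks on open-vocabulary object detectors (OVODs) and proposes TrAP, a multi-modal prompt tuning attack strategy. TrAP jointly optimizes prompt parameters in the image and text modalities along with visual triggers, implants the backdoor through lightweight learnable prompt tokens, and achieves high attack success rates.

Details

Motivation: As open-vocabulary object detectors gain adoption in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. The paper explores new attack surfaces for OVODs, particularly those introduced by prompt tuning.

Result: Experiments show that TrAP achieves high attack success rates across multiple datasets for both object misclassification and object disappearance attacks, while also improving clean-image performance on downstream datasets relative to the zero-shot setting.

Insight: The paper exposes the security risks that prompt tuning introduces for OVODs and underscores the need for stronger security validation of multimodal models in practice.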

Abstract: Open-vocabulary object detectors (OVODs) unify vision and language to detect arbitrary object categories based on text prompts, enabling strong zero-shot generalization to novel concepts. As these models gain traction in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. In this work, we conduct the first study of backdoor attacks on OVODs and reveal a new attack surface introduced by prompt tuning. We propose TrAP (Trigger-Aware Prompt tuning), a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers. TrAP enables the attacker to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model weights, thus preserving generalization while embedding a hidden backdoor. We adopt a curriculum-based training strategy that progressively shrinks the trigger size, enabling effective backdoor activation using small trigger patches at inference. Experiments across multiple datasets show that TrAP achieves high attack success rates for both object misclassification and object disappearance attacks, while also improving clean image performance on downstream datasets compared to the zero-shot setting.

[119] Direct Visual Grounding by Directing Attention of Visual Tokens cs.CV

Parsa Esmaeilkhani, Longin Jan Latecki

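TL;DR: The paper proposes a new loss function (KLAL) that directly supervises the attention distribution over visual tokens to improve the performance of vision-language models (VLMs) on visual tasks. Experiments show notable gains on geometric tasks, pointing, and referring expression comprehension.

Details

Motivation: In existing VLMs, the visual tokens most relevant to the query receive little attention in the final layers, which leads to wrong answers to visual questions. The paper hypothesizes that directly supervising attention to visual tokens can improve model performance.

Result: Experiments show that KLAL yields notable improvements on geometric tasks, pointing, and referring expression comprehension, on both synthetic and real-world data.

Insight: 1. Directly supervising attention to visual tokens effectively improves VLM performance on visual tasks; 2. the standard NTP loss provides insufficient supervision for visual tokens; 3. commercial VLMs perform poorly on some tasks, such as line tracing, which indicates that their visual capabilities remain limited.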

Abstract: Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is the fact that visual tokens most related to the query receive little to no attention in the final layers of the LLM module of VLMs from the answer tokens, where all tokens are treated equally, in particular, visual and language tokens in the LLM attention layers. This fact may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that a more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual tokens. It directly grounds the answer language tokens in images by directing their attention to the relevant visual tokens. This is achieved by aligning the attention distribution of visual tokens to ground truth attention maps with KL divergence. The ground truth attention maps are obtained from task geometry in synthetic cases or from standard grounding annotations (e.g., bounding boxes or point annotations) in real images, and are used inside the LLM for attention supervision without requiring new labels. The obtained KL attention loss (KLAL) when combined with NTP encourages VLMs to attend to relevant visual tokens while generating answer tokens. This results in notable improvements across geometric tasks, pointing, and referring expression comprehension on both synthetic and real-world data, as demonstrated by our experiments. We also introduce a new dataset to evaluate the line tracing abilities of VLMs. Surprisingly, even commercial VLMs do not perform well on this task.
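
The attention supervision described in the abstract above aligns an attention distribution with a ground-truth map using KL divergence. A minimal sketch of such a term follows. The direction KL(target || attention), the `eps` smoothing constant, and the function names are assumptions, since the abstract does not specify them, and the combination with the NTP loss is omitted.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl_attention_loss(attn, target, eps=1e-8):
    """KL(target || attn) over visual tokens (direction is an assumption)."""
    if len(attn) != len(target):
        raise ValueError("attn and target must cover the same tokens")
    p = normalize(target)   # ground-truth attention map
    q = normalize(attn)     # model attention over visual tokens
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the model's attention already matches the target map and grows as attention mass drifts away from the annotated region, for example a bounding box rasterized onto the visual token grid.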

[120] SAGE: Saliency-Guided Contrastive Embeddings cs.CV

Colton R. Crum, Adam Czajka

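TL;DR: SAGE is a saliency-guided contrastive embedding loss that introduces human saliency priors into the model's latent embedding space and improves classification performance and generalization.

Details

Motivation: Existing methods typically rely on internal model mechanisms to integrate saliency information, but research suggests these mechanisms may be unreliable. SAGE moves saliency guidance from the image space to the latent embedding space to guide training more reliably.

Result: Experiments show that SAGE outperforms existing saliency-based methods in classification in both open- and closed-set scenarios, and generalizes well across backbones and tasks.

Insight: The latent embedding space may be a more reliable place to integrate human saliency priors, and contrastive learning can effectively steer the model toward salient features, improving generalization and performance.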

Abstract: Integrating human perceptual priors into the training of neural networks has been shown to raise model generalization, serve as an effective regularizer, and align models with human expertise for applications in high-risk domains. Existing approaches to integrate saliency into model training often rely on internal model mechanisms, which recent research suggests may be unreliable. Our insight is that many challenges associated with saliency-guided training stem from the placement of the guidance approaches solely within the image space. Instead, we move away from the image space, use the model’s latent space embeddings to steer human guidance during training, and we propose SAGE (Saliency-Guided Contrastive Embeddings): a loss function that integrates human saliency into network training using contrastive embeddings. We apply salient-preserving and saliency-degrading signal augmentations to the input and capture the changes in embeddings and model logits. We guide the model towards salient features and away from non-salient features using a contrastive triplet loss. Additionally, we perform a sanity check on the logit distributions to ensure that the model outputs match the saliency-based augmentations. We demonstrate a boost in classification performance across both open- and closed-set scenarios against SOTA saliency-based methods, showing SAGE’s effectiveness across various backbones, and include experiments to suggest its wide generalization across tasks.
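
The contrastive triplet loss mentioned in the abstract above can be sketched in a few lines. The role assignment used here (anchor = embedding of the original input, positive = embedding of a saliency-preserving augmentation, negative = embedding of a saliency-degrading augmentation), the Euclidean distance, and the margin value are assumptions for illustration; the abstract does not spell out these details.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the positive closer than the negative."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)
```

The loss vanishes once the saliency-preserving view sits closer to the anchor than the saliency-degrading view by at least the margin, which pushes the embedding to depend on the salient regions.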

[121] RoCoISLR: A Romanian Corpus for Isolated Sign Language Recognition cs.CV | cs.LG

Cătălin-Alexandru Rîpanu, Andrei-Theodor Hotnog, Giulia-Stefania Imbrea, Dumitru-Clementin Cercel

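TL;DR: The paper introduces RoCoISLR, the first large-scale standardized dataset for Romanian Isolated Sign Language Recognition (RoISLR), with over 9,000 video samples spanning nearly 6,000 standardized glosses. An evaluation of several state-of-the-art video recognition models shows that transformer-based architectures perform best (Swin Transformer reaches a Top-1 accuracy of 34.1%) and highlights the challenge of long-tail class distributions in low-resource sign languages.

Details

Motivation: Most available sign language datasets focus on American Sign Language (ASL), and no large-scale standardized dataset exists for Romanian isolated sign language, which limits research progress. The authors build RoCoISLR to fill this gap and provide a foundation for further work.

Result: Swin Transformer reaches a Top-1 accuracy of 34.1% on RoCoISLR, and transformer-based architectures outperform convolutional baselines. The experiments also highlight the challenges posed by long-tail distributions in low-resource sign languages.

Insight: 1. Transformer architectures have a clear advantage in low-resource sign language recognition; 2. long-tail distributions are especially pronounced in sign language recognition, and future work needs targeted solutions.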

Abstract: Automatic sign language recognition plays a crucial role in bridging the communication gap between deaf communities and hearing individuals; however, most available datasets focus on American Sign Language. For Romanian Isolated Sign Language Recognition (RoISLR), no large-scale, standardized dataset exists, which limits research progress. In this work, we introduce a new corpus for RoISLR, named RoCoISLR, comprising over 9,000 video samples that span nearly 6,000 standardized glosses from multiple sources. We establish benchmark results by evaluating seven state-of-the-art video recognition models-I3D, SlowFast, Swin Transformer, TimeSformer, Uniformer, VideoMAE, and PoseConv3D-under consistent experimental setups, and compare their performance with that of the widely used WLASL2000 corpus. According to the results, transformer-based architectures outperform convolutional baselines; Swin Transformer achieved a Top-1 accuracy of 34.1%. Our benchmarks highlight the challenges associated with long-tail class distributions in low-resource sign languages, and RoCoISLR provides the initial foundation for systematic RoISLR research.

[122] Enhancing Neuro-Oncology Through Self-Assessing Deep Learning Models for Brain Tumor Unified Model for MRI Segmentation cs.CV

Andrew Zhou

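TL;DR: The paper proposes an uncertainty-aware deep learning framework that augments nnUNet with a voxel-wise uncertainty channel for brain tumor MRI segmentation, providing unified segmentation of the tumor and the surrounding healthy structures together with uncertainty estimates.

Details

Motivation: Accurate brain tumor segmentation is vital for clinical diagnosis and treatment, but existing methods lack uncertainty estimates and ignore healthy structures, which limits clinical use. The study aims to fill this gap.

Result: The uncertainty estimates reach a correlation of 0.750 and an RMSD of 0.047 without hurting tumor segmentation accuracy. The unified model achieves a DSC of 0.81 for brain structures and 0.86 for tumor.

Insight: Visual inspection shows that uncertainty maps offer key insights for evaluating predictions and fixing errors, which supports clinical decisions. The approach provides a new tool for applying AI in neurosurgery.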

Abstract: Accurate segmentation of brain tumors is vital for diagnosis, surgical planning, and treatment monitoring. Deep learning has advanced on benchmarks, but two issues limit clinical use: no uncertainty estimates for errors and no segmentation of healthy brain structures around tumors for surgery. Current methods fail to unify tumor localization with anatomical context and lack confidence scores. This study presents an uncertainty-aware framework augmenting nnUNet with a channel for voxel-wise uncertainty. Trained on BraTS2023, it yields a correlation of 0.750 and RMSD of 0.047 for uncertainty without hurting tumor accuracy. It predicts uncertainty in one pass, with no extra networks or inferences, aiding clinical decisions. For whole-brain context, a unified model combines normal and cancer datasets, achieving a DSC of 0.81 for brain structures and 0.86 for tumor, with robust key-region performance. Combining both innovations gives the first model outputting tumor in natural surroundings plus an overlaid uncertainty map. Visual checks of outputs show uncertainty offers key insights to evaluate predictions and fix errors, helping informed surgical decisions from AI.
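
The DSC values reported in the abstract above refer to the Dice similarity coefficient, 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a ground-truth mask T. A minimal sketch over flattened binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions, not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / size
```

For example, a prediction that marks two voxels, one of which overlaps a single-voxel ground truth, scores 2 * 1 / (2 + 1), or about 0.67.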

[123] MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection cs.CV | cs.AI | eess.IVPDF

Leena Alghamdi, Muhammad Usman, Hafeez Anwar, Abdul Bais, Saeed Anwar

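TL;DR: This paper proposes MSRNet, a multi-scale recursive network for detecting camouflaged objects that blend closely into their surroundings. By combining a Pyramid Vision Transformer with Multi-Granularity Fusion Units, it outperforms existing methods and achieves leading results on several benchmark datasets.

Details

Motivation: Existing methods still fall short when detecting small and multiple camouflaged objects in complex scenes, especially under low light, occlusion, and complex backgrounds.

Result: The model achieves state-of-the-art performance on two benchmark datasets and ranks second on the other two.

Insight: Multi-scale feature extraction combined with recursive refinement markedly improves detection of small and multiple objects; global context understanding is key.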

Abstract: Camouflaged object detection is an emerging and challenging computer vision task that requires identifying and segmenting objects that blend seamlessly into their environments due to high similarity in color, texture, and size. This task is further complicated by low-light conditions, partial occlusion, small object size, intricate background patterns, and multiple objects. While many sophisticated methods have been proposed for this task, current methods still struggle to precisely detect camouflaged objects in complex scenarios, especially with small and multiple objects, indicating room for improvement. We propose a Multi-Scale Recursive Network that extracts multi-scale features via a Pyramid Vision Transformer backbone and combines them via specialized Attention-Based Scale Integration Units, enabling selective feature merging. For more precise object detection, our decoder recursively refines features by incorporating Multi-Granularity Fusion Units. A novel recursive-feedback decoding strategy is developed to enhance global context understanding, helping the model overcome the challenges in this task. By jointly leveraging multi-scale learning and recursive feature optimization, our proposed method achieves performance gains, successfully detecting small and multiple camouflaged objects. Our model achieves state-of-the-art results on two benchmark datasets for camouflaged object detection and ranks second on the remaining two. Our codes, model weights, and results are available at https://github.com/linaagh98/MSRNet.

[124] SAGA: Source Attribution of Generative AI Videos cs.CV | cs.AIPDF

Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran

TL;DR: SAGA is a novel framework designed for attributing AI-generated videos to their specific sources across multiple granular levels. It utilizes a video transformer with a pretrain-and-attribute strategy and introduces Temporal Attention Signatures for interpretability, achieving state-of-the-art performance with minimal labeled data.

Details

Motivation: The rise of hyper-realistic AI-generated videos poses significant misuse risks, surpassing traditional binary detection methods. There’s an urgent need for a scalable solution to identify the exact generative models used.

Result: SAGA achieves state-of-the-art performance using only 0.5% of labeled data per class and provides interpretable insights into temporal differences between generators.

Insight: Multi-granular attribution is crucial for forensic and regulatory applications, and pretraining strategies can significantly reduce reliance on labeled data.

Abstract: The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.

[125] Video Finetuning Improves Reasoning Between Frames cs.CV | cs.AIPDF

Ruiqi Yang, Tian Yun, Zihan Wang, Ellie Pavlick

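TL;DR: This paper studies how video finetuning affects the reasoning ability of multimodal large language models (LLMs) and proposes Visual Chain-of-Thought (vCoT), which generates descriptions of transitional events between frames. Experiments show that vCoT significantly improves image-only models on long-form video question answering but yields only limited gains for video-finetuned models. Video-finetuned models also transfer better to relational reasoning tasks on static images.

Details

Motivation: Multimodal LLMs have progressed rapidly in visual understanding, but extending them from images to video often amounts to naively concatenating frame tokens. This paper examines what video finetuning brings to multimodal LLMs and explores their frame-to-frame reasoning ability.

Result: 1. vCoT significantly improves image models on long-form video question answering; 2. video-finetuned models excel at frame-to-frame reasoning and transfer this ability to static reasoning tasks, outperforming image-model baselines.

Insight: Video finetuning not only improves a model's ability to model temporal information but also strengthens its understanding of relations in static images, suggesting that training for temporal reasoning can generalize to other visual tasks.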

Abstract: Multimodal large language models (LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only LVLMs with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image models’ baselines on relational visual reasoning tasks.

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide

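TL;DR: This paper proposes ViCoKD, a cross-modal knowledge distillation framework that addresses partially overlapping views and limited modalities in multi-view action recognition.

Details

Motivation: The widespread use of multi-sensor systems has driven research on multi-view action recognition, but settings with partially overlapping views and limited modalities remain underexplored.

Result: Experiments on the MultiSensor-Home dataset show that ViCoKD performs well under limited-modality conditions and in some cases even surpasses the teacher model.

Insight: Combining view consistency with cross-modal knowledge distillation is an effective solution for multi-view action recognition.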

Abstract: The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
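
The View-aware Consistency module is described as using a confidence-weighted Jensen-Shannon divergence between the class distributions predicted for two views. The sketch below illustrates that idea under stated assumptions: the weighting by the product of top-class probabilities and the names `view_consistency_loss` and `co_visible` are hypothetical choices for illustration, since the abstract does not give the exact formulation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def view_consistency_loss(p, q, co_visible):
    """Hypothetical consistency term between two views' class distributions."""
    if not co_visible:
        # The action is not visible in both views, so no alignment is enforced.
        return 0.0
    # Assumed confidence weight: product of the top-class probabilities.
    weight = max(p) * max(q)
    return weight * js_divergence(p, q)


view_a = [0.7, 0.2, 0.1]
view_b = [0.6, 0.3, 0.1]
print(view_consistency_loss(view_a, view_b, co_visible=True))
```

Gating on co-visibility reflects the abstract's point that alignment is enforced only when the action is visible across views; in the paper this gate is guided by human-detection masks.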

[127] Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views cs.CV | cs.ROPDF

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng

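TL;DR: This paper proposes EgoLoc, a zero-shot method that localizes the moments of hand-object contact and separation in egocentric videos, addressing the shortcomings of existing techniques in temporal interaction localization (TIL).

Details

Motivation: Existing research mostly focuses on the paradigm of interactive actions (“how to interact”) and has not fully explored the critical moments when the hand contacts and separates from the target object (“when to interact”), which are crucial for mixed reality and robotic motion planning.

Result: On a public dataset and the authors' new benchmarks, EgoLoc performs well and effectively supports downstream tasks in egocentric vision and robotic manipulation.

Insight: EgoLoc's zero-shot nature makes it highly general. It removes the conventional dependence on category annotations and object masks and offers a new approach to temporal interaction localization.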

Abstract: Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., “how to interact”). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., “when to interact”) is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.

[128] Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings cs.CVPDF

Zihao Lin, Zhenshan Shi, Sasa Zhao, Hanwei Zhu, Lingyu Zhu

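TL;DR: This paper proposes a data-driven framework for automatic, interpretable assessment of human creativity from drawings, combining content and style dimensions in a multi-modal, multi-task learning framework.

Details

Motivation: Current creativity assessment relies mainly on subjective expert scoring, which is inefficient and highly subjective. The paper aims to provide an automatic, interpretable assessment method.

Result: Experiments show that the model achieves state-of-the-art performance on the regression task and provides interpretable visualizations consistent with human judgments.

Insight: Creativity assessment can be made automatic and interpretable by combining two complementary dimensions of a drawing, content and style; the conditional learning mechanism helps extract creativity-relevant signals.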

Abstract: Assessing human creativity through visual outputs, such as drawings, plays a critical role in fields including psychology, education, and cognitive science. However, current assessment practices still rely heavily on expert-based subjective scoring, which is both labor-intensive and inherently subjective. In this paper, we propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive understanding that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions. Specifically, we first augment an existing creativity labeled dataset with additional annotations targeting content categories. Based on the enriched dataset, we further propose a multi-modal, multi-task learning framework that simultaneously predicts creativity scores, categorizes content types, and extracts stylistic features. In particular, we introduce a conditional learning mechanism that enables the model to adapt its visual feature extraction by dynamically tuning it to creativity-relevant signals conditioned on the drawing’s stylistic and semantic cues. Experimental results demonstrate that our model achieves state-of-the-art performance compared to existing regression-based approaches and offers interpretable visualizations that align well with human judgments. The code and annotations will be made publicly available at https://github.com/WonderOfU9/CSCA_PRCV_2025

[129] Reconstructing 3D Scenes in Native High Dynamic Range cs.CVPDF

Kaixuan Zhang, Minxian Li, Mingwu Ren, Jiankang Deng, Xiatian Zhu

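TL;DR: The paper proposes NH-3DGS, the first method to reconstruct 3D scenes directly from native HDR data, using a new luminance-chromaticity decomposition to improve dynamic range preservation and reconstruction quality.

Details

Motivation: Professional digital media creation (e.g., filmmaking, virtual production, and photorealistic rendering) requires high dynamic range (HDR) imaging. However, existing 3D scene reconstruction methods mainly rely on low dynamic range (LDR) data, which limits their use in professional workflows.

Result: On synthetic and real multi-view HDR datasets, NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation.

Insight: Native HDR data can be used directly for 3D scene reconstruction, and a luminance-chromaticity decomposition effectively preserves dynamic range, opening the way to professional-grade applications.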

Abstract: High Dynamic Range (HDR) imaging is essential for professional digital media creation, e.g., filmmaking, virtual production, and photorealistic rendering. However, 3D scene reconstruction has primarily focused on Low Dynamic Range (LDR) data, limiting its applicability to professional workflows. Existing approaches that reconstruct HDR scenes from LDR observations rely on multi-exposure fusion or inverse tone-mapping, which increase capture complexity and depend on synthetic supervision. With the recent emergence of cameras that directly capture native HDR data in a single exposure, we present the first method for 3D scene reconstruction that directly models native HDR observations. We propose Native High dynamic range 3D Gaussian Splatting (NH-3DGS), which preserves the full dynamic range throughout the reconstruction pipeline. Our key technical contribution is a novel luminance-chromaticity decomposition of the color representation that enables direct optimization from native HDR camera data. We demonstrate on both synthetic and real multi-view HDR datasets that NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation, enabling professional-grade 3D reconstruction directly from native HDR captures. Code and datasets will be made available.
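
To make the idea of a luminance-chromaticity decomposition concrete, here is a generic sketch for a single linear RGB value. It is not the decomposition used in NH-3DGS, which the abstract does not specify: the Rec. 709 luminance weights, the `eps` stabilizer, and the function names `decompose` and `recompose` are assumptions made for illustration.

```python
REC709 = (0.2126, 0.7152, 0.0722)  # Rec. 709 luminance weights for linear RGB


def decompose(rgb, eps=1e-8):
    """Split a linear HDR RGB triple into scalar luminance and chromaticity."""
    luminance = sum(w * c for w, c in zip(REC709, rgb))
    # Chromaticity is the color with the (possibly very large) intensity divided out.
    chroma = tuple(c / (luminance + eps) for c in rgb)
    return luminance, chroma


def recompose(luminance, chroma, eps=1e-8):
    """Invert decompose(): scale chromaticity back by the luminance."""
    return tuple((luminance + eps) * c for c in chroma)


hdr_pixel = (1200.0, 35.0, 0.02)  # unbounded linear radiance, not clipped to [0, 1]
lum, chroma = decompose(hdr_pixel)
print(lum, chroma, recompose(lum, chroma))
```

The appeal of such a split is that the wide-ranging intensity lives in one scalar while chromaticity stays in a narrow, scale-invariant range, which is one plausible reason a decomposition of this kind helps optimization on unbounded HDR values.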

[130] DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning cs.CV | cs.AIPDF

Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai

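TL;DR: DeepSport is a multimodal large language model (MLLM) that performs comprehensive sports video reasoning through agentic reinforcement learning, addressing the limitations of existing methods in the sports domain.

Details

Motivation: Sports video understanding requires models to perceive high-speed dynamics, understand complex rules, and reason over long temporal contexts, yet existing methods are either limited to a single sport or lack a learned reasoning process.

Result: On a test benchmark of 6.7k questions, DeepSport significantly outperforms both proprietary and open-source models.

Insight: DeepSport handles the complexity of sports video through dynamic reasoning and data distillation, laying a foundation for domain-specific video reasoning.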

Abstract: Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in general domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or rely on training-free paradigms that lack a robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to “think with videos” by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data sources, creating a unified resource of 78k training data. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model’s reasoning process. Extensive experiments on the testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming both proprietary and open-source baselines. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.

[131] Explore How to Inject Beneficial Noise in MLLMs cs.CVPDF

Ruishu Zhu, Sida Huang, Ziheng Jiao, Hongyuan Zhang

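TL;DR: This paper proposes a novel fine-tuning strategy for multimodal large language models (MLLMs) that improves performance by injecting beneficial random noise. The proposed Multimodal Noise Generator (MuNG) dynamically analyzes cross-modal relationships to generate task-adaptive noise, significantly improving cross-modal alignment and downstream task performance.

Details

Motivation: Existing fine-tuning methods do not adequately account for cross-modal heterogeneity, which limits the potential of MLLMs. The authors therefore explore how noise injection can improve model performance.

Result: Validated on QwenVL and LLaVA, MuNG surpasses full-parameter fine-tuning and other methods while adjusting only about 1-2% additional parameters.

Insight: Noise injection can be an effective way to improve cross-modal alignment; dynamic, task-adaptive noise design is key.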

Abstract: Multimodal Large Language Models (MLLMs) have played an increasingly important role in multimodal intelligence. However, the existing fine-tuning methods often ignore cross-modal heterogeneity, limiting their full potential. In this work, we propose a novel fine-tuning strategy by injecting beneficial random noise, which outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, QwenVL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches, while requiring adjustments to only about $1\sim2\%$ additional parameters. The relevant code is uploaded in the supplementary.

[132] Generative Photographic Control for Scene-Consistent Video Cinematic Editing cs.CVPDF

Huiqiang Sun, Liao Shen, Zhan Peng, Kun Wang, Size Wu

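TL;DR: The paper proposes CineCtrl, the first framework to provide fine control over professional camera parameters (e.g., bokeh, shutter speed) in video, overcoming the restriction of existing methods to camera motion control.

Details

Motivation: In cinematic storytelling, the artful manipulation of photographic elements such as depth of field and exposure is crucial for conveying mood and aesthetic appeal, but existing generative video models struggle to control these effects finely.

Result: Experiments show that the model generates high-fidelity videos that precisely realize user-specified photographic camera effects.

Insight: By decoupling camera motion from photographic parameter control, CineCtrl offers greater creative freedom while preserving scene consistency.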

Abstract: Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that leverages simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.

[133] PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos cs.CV | cs.AI | cs.GRPDF

Dianbing Xi, Guoyuan An, Jingsen Zhu, Zhijian Liu, Yuan Liu

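TL;DR: PFAvatar proposes a two-stage method for reconstructing high-quality 3D avatars from “Outfit of the Day” (OOTD) photos, combining a pose-aware diffusion model with a NeRF representation to substantially improve reconstruction speed and detail preservation.

Details

Motivation: The work addresses the decomposition inconsistency and slow speed of existing methods when reconstructing 3D avatars from diverse OOTD photos (varied poses, complex backgrounds, occlusions).

Result: PFAvatar outperforms SOTA methods in reconstruction fidelity, detail preservation, and occlusion handling, and supports downstream applications such as virtual try-on and animation.

Insight: The continuous radiance field of NeRF is better suited than conventional mesh representations to complex occlusions and high-frequency details, offering a new direction for 3D avatar generation.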

Abstract: We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from “Outfit of the Day” (OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48$\times$ speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.

[134] Semi-Supervised High Dynamic Range Image Reconstructing via Bi-Level Uncertain Area Masking cs.CVPDF

Wei Jiang, Jiahao Cui, Yizheng Wu, Zhan Peng, Zhiyu Pan

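TL;DR: The paper proposes a semi-supervised high dynamic range (HDR) image reconstruction method that uses bi-level uncertainty masking to discard unreliable parts of pseudo labels. It outperforms existing annotation-efficient methods and matches fully supervised methods using only 6.7% of the HDR ground truths (GTs).

Details

Motivation: Because LDR-HDR image pairs are hard to obtain, research needs to explore how to achieve high-quality reconstruction with limited HDR annotations.

Result: Using only 6.7% of the HDR GTs, the method outperforms annotation-efficient methods and is comparable to fully supervised methods.

Insight: Uncertainty masking in semi-supervised learning effectively suppresses the confirmation bias introduced by pseudo labels and improves model performance.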

Abstract: Reconstructing high dynamic range (HDR) images from low dynamic range (LDR) bursts plays an essential role in computational photography. Impressive progress has been achieved by learning-based algorithms which require LDR-HDR image pairs. However, these pairs are hard to obtain, which motivates researchers to delve into the problem of annotation-efficient HDR image reconstruction: how to achieve comparable performance with limited HDR ground truths (GTs). This work attempts to address this problem from the view of semi-supervised learning, where a teacher model generates pseudo HDR GTs for the LDR samples without GTs and a student model learns from the pseudo GTs. Nevertheless, confirmation bias, i.e., the student may learn from the artifacts in pseudo HDR GTs, presents an impediment. To remove this impediment, an uncertainty-based masking process is proposed to discard unreliable parts of pseudo GTs at both pixel and patch levels, so that the student learns only from the trusted areas. With this novel masking process, our semi-supervised HDR reconstruction method not only outperforms previous annotation-efficient algorithms, but also achieves comparable performance with up-to-date fully-supervised methods by using only 6.7% HDR GTs.
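
The masking step discards unreliable pseudo-GT regions at both pixel and patch levels. The following sketch shows one way such a bi-level mask could work on a small 2D uncertainty map. The thresholds, the patch size, the use of the patch mean, and the names `bilevel_mask` and `masked_l1` are all illustrative assumptions; the abstract does not describe how uncertainty is estimated or thresholded.

```python
def bilevel_mask(uncertainty, patch=2, pixel_thr=0.5, patch_thr=0.4):
    """Return a 0/1 trust mask from a 2D uncertainty map (list of rows)."""
    h, w = len(uncertainty), len(uncertainty[0])
    # Pixel level: keep pixels whose own uncertainty is below the threshold.
    mask = [[1 if uncertainty[y][x] < pixel_thr else 0 for x in range(w)]
            for y in range(h)]
    # Patch level: drop whole patches whose mean uncertainty is too high.
    for y0 in range(0, h, patch):
        for x0 in range(0, w, patch):
            cells = [(y, x)
                     for y in range(y0, min(y0 + patch, h))
                     for x in range(x0, min(x0 + patch, w))]
            mean_u = sum(uncertainty[y][x] for y, x in cells) / len(cells)
            if mean_u >= patch_thr:
                for y, x in cells:
                    mask[y][x] = 0
    return mask


def masked_l1(pred, pseudo_gt, mask):
    """Mean absolute error over trusted pixels only (student loss sketch)."""
    kept = [abs(p - g)
            for pred_row, gt_row, mask_row in zip(pred, pseudo_gt, mask)
            for p, g, m in zip(pred_row, gt_row, mask_row) if m]
    return sum(kept) / len(kept) if kept else 0.0


u = [[0.1, 0.6, 0.9, 0.8],
     [0.1, 0.1, 0.7, 0.3]]
print(bilevel_mask(u))  # [[1, 0, 0, 0], [1, 1, 0, 0]]
```

In the example, the pixel with uncertainty 0.3 in the right-hand patch is individually reliable but is still discarded because its patch as a whole is not, which is the behavior a second, coarser level adds.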

[135] Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention cs.CVPDF

Taiye Chen, Zihan Ding, Anjian Li, Christina Zhang, Zeqi Xiao

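TL;DR: The paper proposes RAD (Recurrent Autoregressive Diffusion), a new framework that combines LSTM with attention to address memory compression and retrieval in long video generation, improving spatiotemporal consistency.

Details

Motivation: Current video generation models (e.g., diffusion models) suffer from forgetting and spatiotemporal inconsistency in long video generation because they use local attention and lack effective memory.

Result: Experiments on the Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation.

Insight: LSTM proves efficient in sequence modeling, offering a new approach to the memory problem in long video generation.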

Abstract: Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves comparable performance to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to training-inference gap or the lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval, consistently across training and inference time. Experiments on Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.

[136] EndoSight AI: Deep Learning-Driven Real-Time Gastrointestinal Polyp Detection and Segmentation for Enhanced Endoscopic Diagnostics cs.CV | cs.AIPDF

Daniel Cavadia

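TL;DR: EndoSight AI is a deep learning-based system for real-time gastrointestinal polyp detection and segmentation, intended to improve the accuracy and efficiency of endoscopic diagnosis. On the public Hyper-Kvasir dataset it achieves a detection mAP of 88.3% and a segmentation Dice coefficient of up to 69%, with real-time inference above 35 FPS.

Details

Motivation: Precise real-time detection of gastrointestinal polyps is crucial for early diagnosis and prevention of colorectal cancer. Existing methods leave room for improvement in both speed and accuracy.

Result: On the Hyper-Kvasir dataset, the system reaches a detection mAP of 88.3%, a segmentation Dice coefficient of up to 69%, and real-time inference above 35 FPS.

Insight: 1. Deep learning has great potential in endoscopic image analysis; 2. thermal-aware training can improve model robustness; 3. balancing high accuracy with real-time performance is the key challenge.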

Abstract: Precise and real-time detection of gastrointestinal polyps during endoscopic procedures is crucial for early diagnosis and prevention of colorectal cancer. This work presents EndoSight AI, a deep learning architecture developed and evaluated independently to enable accurate polyp localization and detailed boundary delineation. Leveraging the publicly available Hyper-Kvasir dataset, the system achieves a mean Average Precision (mAP) of 88.3% for polyp detection and a Dice coefficient of up to 69% for segmentation, alongside real-time inference speeds exceeding 35 frames per second on GPU hardware. The training incorporates clinically relevant performance metrics and a novel thermal-aware procedure to ensure model robustness and efficiency. This integrated AI solution is designed for seamless deployment in endoscopy workflows, promising to advance diagnostic accuracy and clinical decision-making in gastrointestinal healthcare.

[137] GrOCE: Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models cs.CVPDF

Ning Han, Zhenyu Ge, Feng Han, Yuhua Sun, Chengqing Li

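TL;DR: GrOCE proposes a training-free framework that performs precise online concept erasure through graph-based semantic reasoning, avoiding the performance degradation that conventional methods incur from costly fine-tuning or coarse semantic separation.

Details

Motivation: Existing methods rely on costly fine-tuning or coarse semantic separation, which tends to degrade unrelated concepts and cannot adapt to evolving concept sets. GrOCE aims to solve these problems with efficient, precise concept erasure.

Result: Experiments show that GrOCE outperforms existing methods on concept similarity and image generation quality without retraining.

Insight: Graph structures effectively model concept dependencies; dynamic graph updates and selective edge severing are key to stable erasure.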

Abstract: Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. To alleviate this issue, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and adaptive concept removal through graph-based semantic reasoning. GrOCE models concepts and their interrelations as a dynamic semantic graph, enabling principled reasoning over dependencies and fine-grained isolation of undesired content. It comprises three components: (1) Dynamic Topological Graph Construction for incremental graph building, (2) Adaptive Cluster Identification for multi-hop traversal with similarity-decay scoring, and (3) Selective Edge Severing for targeted edge removal while preserving global semantics. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fréchet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure without retraining.
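
Component (2), multi-hop traversal with similarity-decay scoring, can be pictured as a breadth-first walk from the target concept in which a neighbor's score shrinks with every hop. The sketch below is only an illustration of that general idea: the multiplicative decay, the threshold, the hop limit, and the name `related_concepts` are assumptions, and the toy graph is made up.

```python
from collections import deque


def related_concepts(graph, target, decay=0.8, threshold=0.3, max_hops=3):
    """Breadth-first multi-hop traversal with similarity-decay scoring.

    graph: dict mapping concept -> {neighbor: similarity in [0, 1]}.
    Returns {concept: score} for concepts whose score stays above threshold.
    """
    scores = {target: 1.0}
    queue = deque([(target, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor, sim in graph.get(node, {}).items():
            # Each hop multiplies by the edge similarity and a decay factor.
            score = scores[node] * sim * decay
            if score >= threshold and score > scores.get(neighbor, 0.0):
                scores[neighbor] = score
                queue.append((neighbor, hops + 1))
    return scores


toy_graph = {
    "target": {"a": 0.9, "b": 0.5},
    "a": {"target": 0.9, "c": 0.6},
    "b": {"d": 0.9},
}
print(related_concepts(toy_graph, "target"))
```

Here `d` is strongly linked to `b`, but `b` itself is only weakly tied to the target, so the decayed score of `d` falls below the threshold and it is left out of the cluster. This is the kind of fine-grained isolation that keeps unrelated concepts from being erased along with the target.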

[138] MCAQ-YOLO: Morphological Complexity-Aware Quantization for Efficient Object Detection with Curriculum Learning cs.CV | cs.LGPDF

Yoonjae Seo, Ermal Elbasani, Jaehong Lee

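TL;DR: MCAQ-YOLO proposes an adaptive quantization method based on morphological complexity, using five morphological metrics to dynamically adjust quantization precision and curriculum learning to improve training efficiency.

Details

Motivation: Existing quantization methods usually apply uniform quantization, ignoring the spatial heterogeneity of visual data. MCAQ-YOLO addresses this with morphological complexity-aware quantization to improve the efficiency and accuracy of object detection.

Result: On a safety equipment dataset, the method reaches 85.6% mAP@0.5 at an average of 4.2 bits and a 7.6x compression ratio, outperforming uniform quantization. Cross-dataset validation shows consistent results.

Insight: Morphological complexity correlates strongly with quantization sensitivity. Adaptive quantization can markedly improve efficiency and robustness, especially for compute-constrained, safety-critical tasks.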

Abstract: Most neural network quantization methods apply uniform bit precision across spatial regions, ignoring the heterogeneous structural and textural complexity of visual data. This paper introduces MCAQ-YOLO, a morphological complexity-aware quantization framework for object detection. The framework employs five morphological metrics - fractal dimension, texture entropy, gradient variance, edge density, and contour complexity - to characterize local visual morphology and guide spatially adaptive bit allocation. By correlating these metrics with quantization sensitivity, MCAQ-YOLO dynamically adjusts bit precision according to spatial complexity. In addition, a curriculum-based quantization-aware training scheme progressively increases quantization difficulty to stabilize optimization and accelerate convergence. Experimental results demonstrate a strong correlation between morphological complexity and quantization sensitivity and show that MCAQ-YOLO achieves superior detection accuracy and convergence efficiency compared with uniform quantization. On a safety equipment dataset, MCAQ-YOLO attains 85.6 percent mAP@0.5 with an average of 4.2 bits and a 7.6x compression ratio, yielding 3.5 percentage points higher mAP than uniform 4-bit quantization while introducing only 1.8 ms of additional runtime overhead per image. Cross-dataset validation on COCO and Pascal VOC further confirms consistent performance gains, indicating that morphology-driven spatial quantization can enhance efficiency and robustness for computationally constrained, safety-critical visual recognition tasks.
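
As a rough illustration of spatially adaptive bit allocation, the sketch below maps a per-region complexity score to a bit width and derives a compression ratio from the average. The linear mapping, the 2 to 8 bit range, the 32-bit baseline, and the function names are assumptions for illustration only; the abstract does not state how the five metrics are combined or which baseline the 7.6x figure refers to.

```python
def allocate_bits(complexity, min_bits=2, max_bits=8):
    """Map a complexity score in [0, 1] to an integer bit width (assumed linear)."""
    c = min(max(complexity, 0.0), 1.0)  # clamp out-of-range scores
    return min_bits + round(c * (max_bits - min_bits))


def compression_ratio(avg_bits, baseline_bits=32):
    """Compression relative to an assumed full-precision baseline."""
    return baseline_bits / avg_bits


region_complexity = [0.1, 0.5, 0.9, 0.3]  # hypothetical per-region scores
bits = [allocate_bits(c) for c in region_complexity]
avg_bits = sum(bits) / len(bits)
print(bits, avg_bits, round(compression_ratio(avg_bits), 2))

# Arithmetic check against the reported numbers, assuming a 32-bit baseline:
print(round(compression_ratio(4.2), 2))  # 32 / 4.2 ≈ 7.62
```

Under the 32-bit assumption, an average of 4.2 bits gives a ratio of about 7.62, which is consistent with the reported 7.6x.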

[139] Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach cs.CVPDF

Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

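TL;DR: This paper proposes CCI, a new interpretability method for evaluating contrastive vision-language models such as CLIP, and reveals their background dependence by analyzing the importance of concept regions. It also introduces the COVAR benchmark, which systematically separates foreground and background effects, pointing toward more robust vision-language models.

Details

Motivation: Contrastive vision-language models such as CLIP are strong at zero-shot recognition but vulnerable to spurious correlations such as background over-reliance. Existing benchmarks rely mainly on accuracy and cannot distinguish the sources of model errors.

Result: CCI significantly outperforms existing methods on faithfulness benchmarks, for example with more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. The COVAR benchmark shows that CLIP errors have multiple causes (e.g., viewpoint variation, scale shifts).

Insight: Model errors stem not only from background dependence but also from viewpoint variation, fine-grained confusion, and other factors. Combining CCI and COVAR provides key tools for understanding model behavior and improving robustness.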

Abstract: Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP’s own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we present a comprehensive evaluation of eighteen CLIP variants, offering methodological advances and empirical evidence that chart a path toward more robust VLMs.
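
The mask-and-measure step of CCI can be sketched with toy data: group patches into clusters, zero out one cluster at a time, and record the relative change in the model's score. In the sketch below the cluster labels are given rather than computed from CLIP patch embeddings, and `score_fn` is a stand-in for the model's image-text similarity; both, along with the function names, are assumptions made for illustration.

```python
def cluster_importance(patches, labels, score_fn):
    """Relative drop in score_fn when each cluster of patches is masked out."""
    base = score_fn(patches)
    if base == 0:
        raise ValueError("baseline score is zero; relative change is undefined")
    importance = {}
    for cluster in sorted(set(labels)):
        # Mask a cluster by replacing its patch embeddings with zeros.
        masked = [[0.0] * len(p) if lab == cluster else p
                  for p, lab in zip(patches, labels)]
        importance[cluster] = (base - score_fn(masked)) / base
    return importance


def toy_score(patches, text=(1.0, 0.0)):
    """Stand-in similarity: mean dot product between patches and a text vector."""
    dots = [sum(a * b for a, b in zip(p, text)) for p in patches]
    return sum(dots) / len(dots)


patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]  # cluster 0 plays the object, cluster 1 the background
print(cluster_importance(patches, labels, toy_score))  # {0: 1.0, 1: 0.0}
```

A prediction whose score collapses when a background cluster is masked would be flagged as background-driven, which is the diagnostic the abstract describes obtaining with GroundedSAM.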

[140] Semantic Prioritization in Visual Counterfactual Explanations with Weighted Segmentation and Auto-Adaptive Region Selection cs.CVPDF

Lintong Zhang, Kang Yin, Seong-Whan Lee

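TL;DR: This paper proposes WSAE-Net, which uses a weighted semantic map and an auto-adaptive candidate editing sequence to improve the semantic relevance and computational efficiency of non-generative visual counterfactual explanations.

Details

Motivation: Traditional visual counterfactual explanation methods ignore semantic relevance when replacing regions of the query image, which harms model interpretability and the editing workflow. The paper proposes a new method to address this problem.

Result: Experiments show that the method performs better in both computational efficiency and the semantic relevance of the counterfactual explanations.

Insight: Semantic prioritization and an adaptive editing order can markedly improve the clarity and efficiency of visual counterfactual explanations.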

Abstract: In the domain of non-generative visual counterfactual explanations (CE), traditional techniques frequently involve the substitution of sections within a query image with corresponding sections from distractor images. Such methods have historically overlooked the semantic relevance of the replacement regions to the target object, thereby impairing the model’s interpretability and hindering the editing workflow. Addressing these challenges, the present study introduces an innovative methodology named Weighted Semantic Map with Auto-adaptive Candidate Editing Network (WSAE-Net). It is characterized by two significant advancements: the determination of a weighted semantic map and the auto-adaptive candidate editing sequence. First, the generation of the weighted semantic map is designed to maximize the reduction of non-semantic feature units that need to be computed, thereby optimizing computational efficiency. Second, the auto-adaptive candidate editing sequences are designed to determine the optimal computational order among the feature units to be processed, thereby ensuring the efficient generation of counterfactuals while maintaining the semantic relevance of the replacement feature units to the target object. Through comprehensive experimentation, our methodology demonstrates superior performance, contributing to a more lucid and in-depth understanding of visual counterfactual explanations.

[141] PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching cs.CVPDF

Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu

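TL;DR: PerTouch is a diffusion-based framework for personalized image retouching that uses semantic-level control and a VLM-driven agent to adjust results finely to a user's aesthetic preferences.

Details

Motivation: Image retouching must balance visual quality with users' personalized aesthetic preferences, and existing methods struggle to balance controllability and subjectivity.

Result: Experiments validate the effectiveness of each component and show that PerTouch performs better in personalized image retouching.

Insight: Semantic replacement and parameter perturbation improve semantic boundary perception, while feedback-driven rethinking and scene-aware memory align results more closely with user intent.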

Abstract: Image retouching aims to enhance visual quality while aligning with users’ personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms in the training process. To connect natural language instructions with visual control, we develop a VLM-driven agent that can handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component’s effectiveness and the superior performance of PerTouch in personalized image retouching. Code is available at: https://github.com/Auroral703/PerTouch.

[142] Infinite-Story: A Training-Free Consistent Text-to-Image Generation cs.CVPDF

Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh

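TL;DR: Infinite-Story proposes a training-free text-to-image framework focused on consistent generation for multi-prompt storytelling. It addresses identity and style inconsistency while delivering fast inference and high consistency.

Details

Motivation: In multi-prompt text-to-image generation, identity and style inconsistency are key challenges. Existing methods require fine-tuning or suffer from slow inference, so an efficient, training-free method is needed.

Result: Experiments show state-of-the-art generation quality and consistency with over 6x faster inference (1.72 seconds per image), making the method suitable for real-world visual storytelling.

Insight: Efficient training-free frameworks show promise for consistent generation, and a unified attention guidance mechanism is an effective way to resolve style and identity inconsistency.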

Abstract: We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.

[143] SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias cs.CV | cs.AIPDF

Wenqian Ye, Di Wang, Guangtao Zheng, Bohan Liu, Aidong Zhang

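TL;DR: The paper proposes SAGE, which mitigates spurious bias in CLIP models on multimodal tasks through guided prompt exploration. It requires no training or fine-tuning and significantly improves the robustness and generalization of zero-shot classification.

Details

Motivation: Large vision-language models such as CLIP perform well on multimodal tasks but are prone to spurious bias (e.g., recognizing objects from backgrounds rather than core features), which hurts robustness on out-of-distribution data. Existing methods usually require fine-tuning or prior knowledge, which limits CLIP's out-of-the-box usability.

Result: Experiments on four real-world benchmark datasets and five backbone models show that SAGE significantly improves zero-shot performance and, on out-of-distribution data in particular, outperforms other methods that need no external knowledge or model updates.

Insight: (1) Multimodal spurious bias is a major problem for CLIP's robustness; (2) prompt selection, rather than model adjustment, can effectively mitigate the bias; (3) semantic separation is a key indicator for improving zero-shot generalization.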

Abstract: Large vision-language models, such as CLIP, have shown strong zero-shot classification performance by aligning images and text in a shared embedding space. However, CLIP models often develop multimodal spurious bias, the undesirable tendency to rely on spurious features. For example, CLIP may infer object types in images based on frequently co-occurring backgrounds rather than the object’s core features. This bias significantly impairs the robustness of pre-trained CLIP models on out-of-distribution data, where such cross-modal associations no longer hold. Existing methods for mitigating multimodal spurious bias typically require fine-tuning on downstream data or prior knowledge of the bias, which undermines the out-of-the-box usability of CLIP. In this paper, we first theoretically analyze the impact of multimodal spurious bias in zero-shot classification. Based on this insight, we propose Spuriousness-Aware Guided Exploration (SAGE), a simple and effective method that mitigates spurious bias through guided prompt selection. SAGE requires no training, fine-tuning, or external annotations. It explores a space of prompt templates and selects the prompts that induce the largest semantic separation between classes, thereby improving worst-group robustness. Extensive experiments on four real-world benchmark datasets and five popular backbone models demonstrate that SAGE consistently improves zero-shot performance and generalization, outperforming previous zero-shot approaches without any external knowledge or model updates.
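
Prompt selection by class separation can be sketched as below. This is an illustrative sketch only: the abstract does not define the separation score, so mean pairwise cosine distance is assumed here, and the templates and 2-D "embeddings" are hypothetical toy values rather than CLIP outputs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation(class_embeddings):
    """Mean pairwise cosine distance between class text embeddings (assumed score)."""
    pairs = list(combinations(class_embeddings, 2))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


def select_prompts(template_embeddings, k=1):
    """Return the k templates whose class embeddings are most separated."""
    ranked = sorted(
        template_embeddings,
        key=lambda t: separation(template_embeddings[t]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical class embeddings produced by two prompt templates.
toy = {
    "a photo of a {}.": [[1.0, 0.0], [0.0, 1.0]],  # classes far apart
    "a {} on water.": [[1.0, 0.2], [1.0, 0.3]],    # classes nearly collapsed
}
best = select_prompts(toy, k=1)
```

The template that keeps the class embeddings apart is selected, and no model parameters are touched.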

[144] Beyond Darkness: Thermal-Supervised 3D Gaussian Splatting for Low-Light Novel View Synthesis cs.CVPDF

Qingsen Ma, Chen Zou, Dianyun Wang, Jia Wang, Liuyu Xiang

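TL;DR: DTGS proposes a joint optimization framework that combines thermal supervision with 3D Gaussian Splatting, addressing geometry and color consistency in novel view synthesis under extremely low light.

Details

Motivation: Under extremely low light, conventional 3D Gaussian Splatting pipelines cannot handle underexposed inputs effectively, because enhancing each view independently causes illumination inconsistency and geometric distortion.

Result: On the RGBT-LOW dataset, DTGS significantly outperforms existing methods in radiometric consistency, geometric fidelity, and color stability.

Insight: Thermal imaging can effectively guide geometry and color recovery in low light, and joint optimization outperforms stage-wise processing.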

Abstract: Under extremely low-light conditions, novel view synthesis (NVS) faces severe degradation in terms of geometry, color consistency, and radiometric stability. Standard 3D Gaussian Splatting (3DGS) pipelines fail when applied directly to underexposed inputs, as independent enhancement across views causes illumination inconsistencies and geometric distortion. To address this, we present DTGS, a unified framework that tightly couples Retinex-inspired illumination decomposition with thermal-guided 3D Gaussian Splatting for illumination-invariant reconstruction. Unlike prior approaches that treat enhancement as a pre-processing step, DTGS performs joint optimization across enhancement, geometry, and thermal supervision through a cyclic enhancement-reconstruction mechanism. A thermal supervisory branch stabilizes both color restoration and geometry learning by dynamically balancing enhancement, structural, and thermal losses. Moreover, a Retinex-based decomposition module embedded within the 3DGS loop provides physically interpretable reflectance-illumination separation, ensuring consistent color and texture across viewpoints. To evaluate our method, we construct RGBT-LOW, a new multi-view low-light thermal dataset capturing severe illumination degradation. Extensive experiments show that DTGS significantly outperforms existing low-light enhancement and 3D reconstruction baselines, achieving superior radiometric consistency, geometric fidelity, and color stability under extreme illumination.

[145] Geometry Meets Light: Leveraging Geometric Priors for Universal Photometric Stereo under Limited Multi-Illumination Cues cs.CVPDF

King-Man Tam, Satoshi Ikehata, Yuta Asano, Zhaoyi An, Rei Kawakami

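TL;DR: GeoUniPS is a universal photometric stereo network that addresses unreliable multi-illumination cues by combining synthetic supervision with high-level geometric priors from large-scale 3D reconstruction models.

Details

Motivation: Conventional photometric stereo methods perform poorly when multi-illumination cues are unreliable (e.g., under biased lighting or in shadowed or self-occluded regions), so a method that exploits geometric priors is needed to improve robustness.

Result: GeoUniPS performs strongly across multiple datasets, especially on complex in-the-wild scenes, achieving the best quantitative and qualitative results.

Insight: 3D reconstruction models implicitly encode rich geometric knowledge and can serve as visual-geometry foundation models that support photometric stereo.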

Abstract: Universal Photometric Stereo is a promising approach for recovering surface normals without strict lighting assumptions. However, it struggles when multi-illumination cues are unreliable, such as under biased lighting or in shadowed or self-occluded regions of complex in-the-wild scenes. We propose GeoUniPS, a universal photometric stereo network that integrates synthetic supervision with high-level geometric priors from large-scale 3D reconstruction models pretrained on massive in-the-wild data. Our key insight is that these 3D reconstruction models serve as visual-geometry foundation models, inherently encoding rich geometric knowledge of real scenes. To leverage this, we design a Light-Geometry Dual-Branch Encoder that extracts both multi-illumination cues and geometric priors from the frozen 3D reconstruction model. We also address the limitations of the conventional orthographic projection assumption by introducing the PS-Perp dataset with realistic perspective projection to enable learning of spatially varying view directions. Extensive experiments demonstrate that GeoUniPS delivers state-of-the-art performance across multiple datasets, both quantitatively and qualitatively, especially in complex in-the-wild scenes.

[146] MeanFlow Transformers with Representation Autoencoders cs.CV | cs.AI | cs.LGPDF

Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, Stefano Ermon

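TL;DR: MeanFlow Transformers with Representation Autoencoders (MF-RAE) proposes an efficient training and sampling scheme for MF that operates in the latent space of a Representation Autoencoder (RAE), significantly reducing computational cost and training complexity.

Details

Motivation: Existing MF methods are computationally expensive, unstable, and dependent on complex hyperparameters during training and inference. In high-dimensional data modeling in particular, the SD-VAE decoder accounts for most of the generation cost.

Result: On ImageNet 256, the method reaches a 1-step FID of 2.03, better than vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83%. On ImageNet 512, it reaches a 1-step FID of 3.23 with the lowest GFLOPS.

Insight: Combining MF with a lightweight RAE can significantly improve generation efficiency and reduce compute while avoiding dependence on SD-VAE. The approach shows promise for high-resolution image generation.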

Abstract: MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF’s 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Code is available at https://github.com/sony/mf-rae.
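
For readers unfamiliar with the "long jumps" mentioned above, the following is a brief sketch of the average-velocity formulation as we understand it from the original MeanFlow work; it is not stated in this abstract, and the notation ($z_t$, $v$, $u$, $u_\theta$, decoder $D$) is assumed here for illustration.

```latex
% Average velocity over the interval [r, t] along the flow path z_\tau
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau,
\qquad
z_r = z_t - (t - r)\, u(z_t, r, t).

% One-step sampling: jump from noise (t = 1) to a latent (r = 0),
% then decode with the lightweight RAE decoder D.
z_0 \approx z_1 - u_\theta(z_1, 0, 1), \qquad z_1 \sim \mathcal{N}(0, I),
\qquad \hat{x} = D(z_0).
```

Under this reading, a single network evaluation replaces the many small integration steps of a standard flow or diffusion sampler, which is why the decoder cost becomes the dominant term at inference time.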

[147] SpectralAdapt: Semi-Supervised Domain Adaptation with Spectral Priors for Human-Centered Hyperspectral Image Reconstruction cs.CV | cs.AIPDF

Yufei Wen, Yuting Zhang, Jingdan Kang, Hao Ren, Weibin Cheng

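TL;DR: This paper proposes SpectralAdapt, a semi-supervised domain adaptation (SSDA) framework that addresses the domain gap and data scarcity in human-centered hyperspectral image (HSI) reconstruction.

Details

Motivation: Hyperspectral imaging (HSI) has great potential in healthcare, but human HSI data are scarce because acquisition is costly and technically demanding, which limits progress. To address this, the authors propose an SSDA method that uses general-domain datasets together with limited human HSI data.

Result: Experiments show that SpectralAdapt achieves consistent improvements in spectral fidelity, cross-domain generalization, and training stability.

Insight: SSDA can make efficient use of limited labeled data and abundant unlabeled data, offering a practical solution for hyperspectral imaging in healthcare.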

Abstract: Hyperspectral imaging (HSI) holds great potential for healthcare due to its rich spectral information. However, acquiring HSI data remains costly and technically demanding. Hyperspectral image reconstruction offers a practical solution by recovering HSI data from accessible modalities, such as RGB. While general domain datasets are abundant, the scarcity of human HSI data limits progress in medical applications. To tackle this, we propose SpectralAdapt, a semi-supervised domain adaptation (SSDA) framework that bridges the domain gap between general and human-centered HSI datasets. To fully exploit limited labels and abundant unlabeled data, we enhance spectral reasoning by introducing Spectral Density Masking (SDM), which adaptively masks RGB channels based on their spectral complexity, encouraging recovery of informative regions from complementary cues during consistency training. Furthermore, we introduce Spectral Endmember Representation Alignment (SERA), which derives physically interpretable endmembers from valuable labeled pixels and employs them as domain-invariant anchors to guide unlabeled predictions, with momentum updates ensuring adaptability and stability. These components are seamlessly integrated into SpectralAdapt, a spectral prior-guided framework that effectively mitigates domain shift, spectral degradation, and data scarcity in HSI reconstruction. Experiments on benchmark datasets demonstrate consistent improvements in spectral fidelity, cross-domain generalization, and training stability, highlighting the promise of SSDA as an efficient solution for hyperspectral imaging in healthcare.

[148] REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding cs.CVPDF

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu

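TL;DR: The paper proposes REVISOR, a framework that improves long-form video understanding through multimodal reflection (combining textual and visual information), and designs the DADR mechanism so that the model learns, during reinforcement learning, to locate the relevant video segments accurately.

Details

Motivation: Purely text-based self-reflection has limitations in long-form video understanding, because videos contain dynamic visual information and such mechanisms lack cross-modal interaction.

Result: REVISOR significantly improves the long-form video understanding of MLLMs on four benchmarks (VideoMME and others) without additional supervised fine-tuning or external models.

Insight: Multimodal reflection integrates dynamic visual information more effectively, and the DADR mechanism enforces causal alignment between the reasoning and the video evidence.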

Abstract: Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model’s reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.

[149] Towards 3D Object-Centric Feature Learning for Semantic Scene Completion cs.CVPDF

Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang

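TL;DR: This paper proposes Ocean, an object-centric framework for 3D semantic scene completion that decomposes the scene into individual object instances to improve semantic occupancy prediction, achieving the best performance on multiple benchmarks.

Details

Motivation: Existing 3D semantic scene completion methods usually follow an ego-centric paradigm and overlook object-level details, which leads to semantic and geometric ambiguities in complex environments. The authors propose an object-centric prediction framework to address this problem.

Result: On the SemanticKITTI and SSCBench-KITTI360 benchmarks, Ocean achieves the best mIoU scores, 17.40 and 20.28, respectively.

Insight: Object-centric feature learning can effectively improve semantic scene completion in complex scenes, with a clear advantage in resolving semantic and geometric ambiguities.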

Abstract: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.
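
The mIoU figures quoted above are class-averaged intersection-over-union scores. A minimal sketch of the metric on flattened voxel labels follows; it is a generic illustration, and benchmark protocols may differ in details such as which classes or voxels are ignored.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy example: six voxels, three semantic classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, which average to 0.5.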

[150] Uni-Inter: Unifying 3D Human Motion Synthesis Across Diverse Interaction Contexts cs.CVPDF

Sheng Liu, Yuanzhi Liang, Jiepeng Wang, Sidan Du, Chi Zhang

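TL;DR: Uni-Inter is a unified human motion generation framework that supports diverse interaction scenarios (human-human, human-object, and human-scene). It uses a unified volumetric representation (UIV) to encode heterogeneous entities spatially and reason about their relations, and it generalizes well.

Details

Motivation: Existing methods are usually designed for specific interaction tasks and generalize poorly. Uni-Inter aims to handle motion generation across diverse interaction scenarios within one framework, improving generality and scalability.

Result: Experiments on three representative interaction tasks show that Uni-Inter achieves competitive performance and generalizes to novel combinations of entities.

Insight: Unified modeling of compound interactions offers a scalable direction for motion synthesis in complex environments.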

Abstract: We present Uni-Inter, a unified framework for human motion generation that supports a wide range of interaction scenarios, including human-human, human-object, and human-scene, within a single, task-agnostic architecture. In contrast to existing methods that rely on task-specific designs and exhibit limited generalization, Uni-Inter introduces the Unified Interactive Volume (UIV), a volumetric representation that encodes heterogeneous interactive entities into a shared spatial field. This enables consistent relational reasoning and compound interaction modeling. Motion generation is formulated as joint-wise probabilistic prediction over the UIV, allowing the model to capture fine-grained spatial dependencies and produce coherent, context-aware behaviors. Experiments across three representative interaction tasks demonstrate that Uni-Inter achieves competitive performance and generalizes well to novel combinations of entities. These results suggest that unified modeling of compound interactions offers a promising direction for scalable motion synthesis in complex environments.

[151] uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data cs.CVPDF

Dahyun Chung, Donghyun Shin, Yujin Sung, Seunggi Moon, Jinwoo Jeon

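TL;DR: The paper proposes uCLIP, a lightweight framework that freezes the pretrained image and text encoders and trains only a small projection module, achieving efficient multilingual vision-language alignment for low-resource languages.

Details

Motivation: Existing multilingual vision-language models perform poorly on low-resource languages, mainly because high-quality multilingual image-text data are scarce. A parameter-efficient solution that needs no paired data is therefore required.

Result: On multiple multilingual retrieval benchmarks, the method significantly outperforms existing methods on low-resource languages such as Czech and Finnish.

Insight: Using English representations as semantic anchors enables efficient multilingual alignment without paired data, offering a new approach to inclusive multimodal learning.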

Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.

[152] MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization cs.CVPDF

Zhenying Fang, Richang Hong

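TL;DR: MGCA-Net is a multi-grained category-aware network for open-vocabulary temporal action localization that combines localization with multi-grained classification, significantly improving recognition accuracy for both base and novel categories.

Details

Motivation: Existing temporal action localization methods usually recognize action categories at a single granularity, which lowers recognition accuracy for both base and novel categories. MGCA-Net is proposed to address this problem.

Result: MGCA-Net achieves state-of-the-art performance on the THUMOS’14 and ActivityNet-1.3 benchmarks and also performs strongly on the zero-shot temporal action localization task.

Insight: Multi-grained category awareness effectively improves open-vocabulary temporal action localization, especially for novel categories.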

Abstract: Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address these issues, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier’s awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS’14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
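
The coarse-to-fine step for novel categories can be sketched as a two-stage selection. This is a simplified illustration, not the authors' model: the scores are hypothetical toy values, and the top-k rule for choosing coarse categories is an assumption.

```python
def coarse_categories(video_scores, top_k=2):
    """Video-level stage: keep the top_k categories most likely present in the video."""
    ranked = sorted(video_scores, key=video_scores.get, reverse=True)
    return ranked[:top_k]


def assign_proposals(proposal_scores, coarse):
    """Proposal-level stage: label each proposal with its best coarse category."""
    return [max(coarse, key=lambda c: scores[c]) for scores in proposal_scores]


# Hypothetical similarity scores for three novel categories.
video_scores = {"high jump": 0.7, "long jump": 0.6, "cooking": 0.1}
proposal_scores = [
    {"high jump": 0.2, "long jump": 0.5, "cooking": 0.9},
    {"high jump": 0.8, "long jump": 0.3, "cooking": 0.1},
]

coarse = coarse_categories(video_scores, top_k=2)
labels = assign_proposals(proposal_scores, coarse)
```

The first proposal scores highest on "cooking" in isolation, but that category was ruled out at the video level, so the proposal is assigned to "long jump" instead.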

[153] DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation cs.CV | cs.ROPDF

Yan Gong, Jianli Lu, Yongsheng Gao, Jie Zhao, Xiaojuan Zhang

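TL;DR: DiffPixelFormer is a differential pixel-aware Transformer for RGB-D indoor scene segmentation that improves segmentation by better modeling intra- and inter-modal feature relationships.

Details

Motivation: Existing RGB-D fusion methods often rely on computationally expensive cross-attention and insufficiently model intra- and inter-modal feature relationships, which results in imprecise feature alignment and limited representational power.

Result: On the SUN RGB-D and NYUDv2 benchmarks, DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, respectively, outperforming DFormer-L.

Insight: Fine-grained modeling of intra- and inter-modal relationships is essential for RGB-D segmentation, and a dynamic fusion strategy uses RGB-D information more effectively according to scene characteristics.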

Abstract: Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues, enabling fine-grained, pixel-level cross-modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB-D information according to scene characteristics. Extensive experiments on the SUN RGB-D and NYUDv2 benchmarks demonstrate that DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer-L by 1.78% and 2.75%, respectively. Code is available at https://github.com/gongyan1/DiffPixelFormer.

[154] ViSS-R1: Self-Supervised Reinforcement Video Reasoning cs.CVPDF

Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu

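TL;DR: This paper proposes the ViSS-R1 framework, which improves the video reasoning of multimodal large language models (MLLMs) through self-supervised reinforcement learning (the Pretext-GRPO algorithm), avoiding the inefficient use of visual information in conventional text-centric approaches.

Details

Motivation: Conventional R1 methods over-rely on text reasoning in video tasks and neglect rich visual information, which easily leads to shortcut learning and hallucination. The authors aim to improve video understanding with a visual-centric approach.

Result: On six widely used video reasoning and understanding benchmarks, Pretext-GRPO and ViSS-R1 perform strongly, validating the effectiveness and superiority of the method.

Insight: Introducing visual-centric reinforcement learning and self-supervised tasks can significantly improve MLLM performance on complex video reasoning while reducing reliance on sparse visual cues.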

Abstract: Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which forces the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM’s R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our code and models will be made publicly available.
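
The pretext reward and its use inside a GRPO-style update can be sketched as follows. This is a minimal illustration under stated assumptions: the transformation names are hypothetical, the reward is a simple 0/1 match, and the group-relative advantage uses the population standard deviation (implementations differ on this detail).

```python
import statistics


def pretext_reward(predicted_transform, applied_transform):
    """Positive reward only when the model identifies the applied transformation."""
    return 1.0 if predicted_transform == applied_transform else 0.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardise each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One video was transformed with a (hypothetical) 90-degree rotation;
# the policy sampled four responses for the pretext question.
applied = "rotate_90"
predictions = ["rotate_90", "flip", "rotate_90", "shuffle"]

rewards = [pretext_reward(p, applied) for p in predictions]
advantages = group_advantages(rewards)
```

Responses that solve the pretext task receive positive advantages and the others negative ones, so the policy is pushed toward actually inspecting the visual input.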

[155] FGNet: Leveraging Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation cs.CV | cs.IRPDF

Zhenghua Li, Hang Chen, Zihao Sun, Kai Li, Xiaolin Hu

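TL;DR: The paper proposes FGNet, a new framework that uses a Feature-Guided Attention module to transfer features from Segment Anything 2 (SAM2) to the electron microscopy (EM) image domain in order to improve neuron segmentation.

Details

Motivation: Neuron segmentation in EM images faces complex morphologies, low signal-to-noise ratios, and scarce annotations, which limit the accuracy and generalization of existing methods. The authors aim to address these problems by leveraging the knowledge that visual foundation models such as SAM2 acquire from pre-training on large-scale natural images.

Result: Experiments show that FGNet performs close to SOTA methods with the SAM2 weights frozen and significantly outperforms existing methods after fine-tuning on EM data.

Insight: The study validates that combining representations pre-trained on natural images with targeted domain-adaptive guidance can effectively address the specific challenges of neuron segmentation.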

Abstract: Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on these challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.

[156] Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving cs.CVPDF

Jiacheng Tang, Mingyue Feng, Jiachao Liu, Yaonong Wang, Jian Pu

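TL;DR: The paper proposes AdaptiveAD, a new architecture whose dual-branch design decouples scene perception from ego status, improving the generalization of end-to-end autonomous driving systems.

Details

Motivation: Existing end-to-end autonomous driving systems with modular designs over-rely on ego status, which limits generalization and leaves scene understanding insufficient. The paper aims to fix this design flaw.

Result: On the nuScenes dataset, AdaptiveAD achieves state-of-the-art open-loop planning performance and significantly improves generalization.

Insight: Decoupling ego status from scene perception is key to improving the generalization of autonomous driving systems, and a multi-context fusion strategy effectively prevents ego status from being used as an information shortcut.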

Abstract: Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.

[157] Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations cs.CV | cs.AI | cs.LGPDF

Yehonatan Elisha, Seffi Cohen, Oren Barkan, Noam Koenigstein

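TL;DR: The paper proposes the RFxG taxonomy, which reorganizes saliency-map explanations along two axes, reference frame (pointwise vs. contrastive) and granularity (fine-grained vs. coarse-grained), and introduces four new metrics for a more complete assessment of explanation quality.

Details

Motivation: Existing saliency-map explanation methods lack unified evaluation criteria and clearly stated goals, so they are hard to evaluate and apply against what users actually need.

Result: Experiments show that existing evaluation metrics focus overwhelmingly on pointwise faithfulness and neglect contrastive reasoning and semantic granularity. The new metrics assess explanation quality more comprehensively.

Insight: Saliency-map evaluation should weigh user intent and faithfulness along multiple dimensions rather than a single pointwise metric.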

Abstract: Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods. We address this gap by introducing the Reference-Frame × Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes: (1) Reference-Frame, distinguishing between pointwise (“Why this prediction?”) and contrastive (“Why this and not an alternative?”) explanations; and (2) Granularity, ranging from fine-grained class-level (e.g., “Why Husky?”) to coarse-grained group-level (e.g., “Why Dog?”) interpretations. Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets. By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.
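As a rough illustration of the two RFxG axes, the sketch below computes the scalar that a saliency method would be asked to explain in each of the four cells. The function name `rfxg_target`, the toy logits, and the use of log-sum-exp to score a group of classes are assumptions made for this example, not details taken from the paper.

```python
import math

def rfxg_target(logits, target, contrast=None):
    """Return the scalar to be explained.

    target / contrast: lists of class indices. One index means class-level
    (fine-grained); several indices mean group-level (coarse-grained).
    contrast=None gives a pointwise target; otherwise a contrastive one.
    """
    def group_score(indices):
        # Assumption: a group is scored by log-sum-exp of its member logits,
        # which reduces to the plain logit for a single class.
        return math.log(sum(math.exp(logits[i]) for i in indices))

    score = group_score(target)
    if contrast is None:
        return score                      # pointwise: "Why this prediction?"
    return score - group_score(contrast)  # contrastive: "Why this and not that?"

# Toy logits for classes [husky, malamute, cat, car] (made-up numbers).
logits = [2.0, 1.0, 0.5, -1.0]
why_husky = rfxg_target(logits, [0])                    # pointwise, class-level
why_husky_not_malamute = rfxg_target(logits, [0], [1])  # contrastive, class-level
why_dog = rfxg_target(logits, [0, 1])                   # pointwise, group-level
why_dog_not_cat = rfxg_target(logits, [0, 1], [2])      # contrastive, group-level
```

Any gradient- or perturbation-based saliency method could then be applied to the chosen scalar instead of a single class logit.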

[158] MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images cs.CVPDF

Doanh C. Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K. Nguyen

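TL;DR: MergeSlide is a framework for lifelong learning on whole slide images (WSIs) that uses model merging and task-to-class prompt-aligned inference to reduce catastrophic forgetting and improve performance.

Details

Motivation: Whole slide images are very large, so processing and transferring them is costly; lifelong learning can reduce this resource consumption.

Result: Experiments on six TCGA datasets show that MergeSlide outperforms rehearsal-based continual learning and vision-language zero-shot baselines.

Insight: Prompt engineering for vision-language models and an orthogonal merging strategy are effective tools for lifelong learning, balancing performance against forgetting.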

Abstract: Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.
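The two-stage TCP inference described above can be sketched roughly as follows. The embeddings are made-up three-dimensional vectors, and the names `tcp_predict`, `task_A`, `A1`, and so on are hypothetical; in the paper the embeddings would come from a vision-language pathology foundation model, which this sketch does not reproduce.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tcp_predict(image_emb, task_prompts, class_prompts):
    # Stage 1: pick the task whose task-level prompt embedding is closest.
    task = max(task_prompts, key=lambda t: cosine(image_emb, task_prompts[t]))
    # Stage 2: score only that task's class-aware prompts.
    candidates = class_prompts[task]
    label = max(candidates, key=lambda c: cosine(image_emb, candidates[c]))
    return task, label

# Toy prompt embeddings (illustrative values only).
task_prompts = {"task_A": [1.0, 0.0, 0.0], "task_B": [0.0, 1.0, 0.0]}
class_prompts = {
    "task_A": {"A1": [0.9, 0.1, 0.3], "A2": [0.9, 0.1, -0.3]},
    "task_B": {"B1": [0.1, 0.9, 0.3], "B2": [0.1, 0.9, -0.3]},
}
prediction = tcp_predict([0.2, 0.8, -0.4], task_prompts, class_prompts)
```

Restricting the second stage to one task's classes is what lets class-incremental inference proceed without a known task identity.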

[159] PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking cs.CVPDF

Seungjae Kim, SeungJoon Lee, MyeongAh Cho

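TL;DR: PlugTrack adaptively fuses the Kalman filter with data-driven motion predictors for multi-object tracking (MOT), using multi-perceptive motion analysis to improve performance.

Details

Motivation: Existing MOT methods typically use either a Kalman filter or a data-driven motion predictor; the former cannot handle non-linear motion, while the latter suffers from limited generalization and computational overhead. The study finds that the two are complementary in real-world scenes, which calls for an adaptive fusion method.

Result: Significant gains on MOT17/MOT20 and state-of-the-art results on DanceTrack.

Insight: Real-world motion contains both linear and non-linear patterns, and adaptively fusing the two kinds of predictor improves tracking robustness.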

Abstract: Multi-object tracking (MOT) predominantly follows the tracking-by-detection paradigm, where Kalman filters serve as the standard motion predictor due to computational efficiency but inherently fail on non-linear motion patterns. Conversely, recent data-driven motion predictors capture complex non-linear dynamics but suffer from limited domain generalization and computational overhead. Through extensive analysis, we reveal that even in datasets dominated by non-linear motion, the Kalman filter outperforms data-driven predictors in up to 34% of cases, demonstrating that real-world tracking scenarios inherently involve both linear and non-linear patterns. To leverage this complementarity, we propose PlugTrack, a novel framework that adaptively fuses the Kalman filter with data-driven motion predictors through multi-perceptive motion understanding. Our approach employs multi-perceptive motion analysis to generate adaptive blending factors. PlugTrack achieves significant performance gains on MOT17/MOT20 and state-of-the-art results on DanceTrack without modifying existing motion predictors. To the best of our knowledge, PlugTrack is the first framework to bridge classical and modern motion prediction paradigms through adaptive fusion in MOT.
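A minimal sketch of what an adaptive blending factor might look like is given below. The abstract does not say how PlugTrack computes its factors, so the error-ratio heuristic in `blending_factor` is purely an assumption standing in for the learned multi-perceptive analysis; only the convex combination in `blend` reflects the general idea of fusing two predictors.

```python
def blending_factor(kf_errors, dd_errors):
    """Weight for the Kalman prediction, in [0, 1].

    Assumption: a simple error-ratio heuristic stands in for the learned,
    multi-perceptive motion analysis described in the abstract.
    """
    kf_err = sum(kf_errors) / len(kf_errors)
    dd_err = sum(dd_errors) / len(dd_errors)
    total = kf_err + dd_err
    if total == 0:
        return 0.5  # both predictors were perfect; split evenly
    return dd_err / total  # lower Kalman error -> larger Kalman weight

def blend(kf_pred, dd_pred, alpha):
    # Convex combination of the two predicted states, element by element.
    return [alpha * k + (1.0 - alpha) * d for k, d in zip(kf_pred, dd_pred)]

# Recent per-frame errors of each predictor on one track (made-up numbers).
alpha = blending_factor(kf_errors=[1.0, 1.0], dd_errors=[3.0, 3.0])
fused_center = blend([10.0, 20.0], [14.0, 24.0], alpha)
```

Because the fusion only consumes the two predictions, neither predictor needs to be modified, which matches the plug-in framing of the abstract.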

[160] DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection cs.CVPDF

Jiazhen Yan, Ziqiang Li, Fan Wang, Boyu Wang, Zhangjie Fu

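TL;DR: DGS-Net applies distillation-guided gradient surgery during CLIP fine-tuning to preserve pre-trained priors while suppressing task-irrelevant components, substantially improving AI-generated image detection.

Details

Motivation: With the rapid progress of generative models such as GANs and diffusion models, AI-generated images have proliferated, raising concerns about misinformation, privacy violations, and eroding trust in digital media. Large multimodal models such as CLIP offer strong transferable representations, but fine-tuning them tends to cause catastrophic forgetting, which reduces cross-domain generalization.

Result: Experiments on 50 generative models show that DGS-Net outperforms state-of-the-art methods by 6.6% on average, with clear gains in detection performance and cross-domain generalization.

Insight: Combining gradient surgery with knowledge distillation effectively mitigates catastrophic forgetting and improves transferability and task adaptation.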

Abstract: The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
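The projection step mentioned in the abstract, removing the part of a task gradient that lies along a harmful direction, can be illustrated with plain vector arithmetic. The vectors below are invented, and how DGS-Net actually identifies harmful and beneficial directions is not specified here; this sketch shows only the orthogonal projection and a sign check against a hypothetical beneficial direction.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(grad, harmful):
    """Remove the component of grad that lies along the harmful direction."""
    denom = dot(harmful, harmful)
    if denom == 0:
        return list(grad)  # no direction to remove
    scale = dot(grad, harmful) / denom
    return [g - scale * h for g, h in zip(grad, harmful)]

task_grad = [1.0, 1.0, 0.0]
harmful_dir = [1.0, 0.0, 0.0]     # hypothetical direction that erodes the prior
beneficial_dir = [0.0, 1.0, 1.0]  # hypothetical direction from a frozen encoder

safe_grad = project_out(task_grad, harmful_dir)
# A positive dot product means the update still agrees with the teacher signal.
agrees_with_teacher = dot(safe_grad, beneficial_dir) > 0
```

After projection the update has no component along the harmful direction, so a step along it cannot move the weights that way to first order.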

[161] Learning Implicit Neural Degradation Representation for Unpaired Image Dehazing cs.CVPDF

Shuaibin Fan, Senming Zhong, Wenchao Yan, Minglong Xue

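TL;DR: The paper proposes an unsupervised dehazing method based on implicit neural degradation representation. By combining channel-independent and channel-dependent mechanisms with an implicit neural representation and a dense residual enhancement module, it improves dehazing in complex scenes.

Details

Motivation: In complex scenes, existing dehazing methods struggle to balance fine-grained feature representation of inhomogeneous haze with global consistency modeling, and they need a more effective way to learn a common representation of spatially varying haze.

Result: Experiments show that the method achieves competitive dehazing performance on several public and real-world datasets.

Insight: An implicit neural representation with dense residual enhancement captures complex haze degradation patterns more efficiently and reduces dependence on explicit feature extraction, which suits complex scenes.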

Abstract: Image dehazing is an important task in computer vision, aiming to restore clear, detail-rich visual content from haze-affected images. However, when dealing with complex scenes, existing methods often struggle to balance fine-grained feature representation of inhomogeneous haze distribution against global consistency modeling. To better learn the common degradation representation of haze across spatial variations, we propose an unsupervised dehazing method based on implicit neural degradation representation. First, inspired by the Kolmogorov-Arnold representation theorem, we propose a mechanism that combines channel-independent and channel-dependent mechanisms, which efficiently enhances the ability to learn nonlinear dependencies and in turn achieves good visual perception in complex scenes. Moreover, we design an implicit neural representation that models haze degradation as a continuous function, eliminating redundant information and the dependence on explicit feature extraction and physical models. To further learn the implicit representation of haze features, we also design a dense residual enhancement module that eliminates redundant information, achieving high-quality image restoration. Experimental results show that our method achieves competitive dehazing performance on various public and real-world datasets. This project code will be available at https://github.com/Fan-pixel/NeDR-Dehaze.

[162] Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining cs.CVPDF

Zhaocheng Yu, Kui Jiang, Junjun Jiang, Xianming Liu, Guanglu Sun

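TL;DR: The paper proposes the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. It combines macro-semantic textual priors (CLIP) with micro-structural visual priors (DINOv2) and adds a progressive Priors Fusion Injection (PFI) and a Hierarchical Mamba Module (HMM), removing rain streaks while substantially improving the fidelity of semantic and spatial details.

Details

Motivation: Rain significantly degrades computer vision systems, and existing deraining methods still fall short on the fidelity of semantic and spatial details. The paper addresses this with a new network architecture that combines different priors and strengthens feature representation.

Result: MPHM achieves a 0.57 dB PSNR gain on the Rain200H dataset and generalizes well to real-world rainy scenes.

Insight: Combining macro-semantic and micro-structural priors through progressive fusion and hierarchical feature representation effectively improves image deraining.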

Abstract: Rain significantly degrades the performance of computer vision systems, particularly in applications like autonomous driving and video surveillance. While existing deraining methods have made considerable progress, they often struggle with fidelity of semantic and spatial details. To address these limitations, we propose the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. This novel architecture synergistically integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. To alleviate potential conflicts between heterogeneous priors, we devise a progressive Priors Fusion Injection (PFI) that strategically injects complementary cues at different decoder levels. Meanwhile, we equip the backbone network with an elaborate Hierarchical Mamba Module (HMM) to facilitate robust feature representation, featuring a Fourier-enhanced dual-path design that concurrently addresses global context modeling and local detail recovery. Comprehensive experiments demonstrate MPHM’s state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset while delivering superior generalization on real-world rainy scenarios.
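To put the reported 0.57 dB PSNR gain in context, the sketch below computes PSNR from mean squared error and converts a dB gain into an MSE ratio. The pixel values are invented for illustration, and the conversion assumes the gain applies to a single image; benchmark figures are usually averaged over many images, so the ratio is only indicative.

```python
import math

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE of the better result divided by MSE of the baseline for a PSNR gain."""
    return 10.0 ** (-gain_db / 10.0)

example = psnr([100, 100], [110, 90])  # MSE = 100
ratio = mse_ratio_for_gain(0.57)       # about 0.877, i.e. roughly 12% lower MSE
```

Because PSNR is logarithmic, a gain of well under 1 dB can still correspond to a noticeable reduction in squared error.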

[163] A Lightweight 3D Anomaly Detection Method with Rotationally Invariant Features cs.CVPDF

Hanzhe Liang, Jie Zhou, Can Gao, Bingyang Guo, Jinbao Wang

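TL;DR: The paper proposes RIF, a lightweight 3D anomaly detection method that uses rotationally invariant feature processing and the CTF-Net network to resolve the feature inconsistency caused by changes in the orientation and position of point clouds.

Details

Motivation: In 3D anomaly detection, changes in the orientation and position of point cloud data can make features inconsistent and hurt detection. The paper aims to solve this problem.

Result: P-AUROC improves by 17.7% on the Anomaly-ShapeNet dataset and by 1.6% on the Real3D-AD dataset.

Insight: Rotationally invariant feature processing substantially improves the performance and generalization of 3D anomaly detection and suits industrial applications.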

Abstract: 3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. First, to remove the adverse effect of these variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7%, and also achieves the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.

[164] CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model cs.CVPDF

Yuqi Zhang, Guanying Chen, Jiaxing Chen, Chuanyu Fu, Chuan Huang

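TL;DR: CloseUpShot is a diffusion-based framework that synthesizes close-up novel views from sparse inputs via point-conditioned video diffusion. Hierarchical warping and occlusion-aware noise suppression address the problems of conventional pixel warping in close-up scenes, and global structure guidance strengthens geometric consistency. Experiments demonstrate its advantage in close-up view synthesis.

Details

Motivation: Reconstructing 3D scenes and synthesizing novel views from sparse input views is very challenging, especially in close-up scenes, where conventional methods struggle to capture fine-grained detail. Video diffusion models show strong temporal reasoning, which offers a way to address this.

Result: Experiments on multiple datasets show that CloseUpShot clearly outperforms existing methods on close-up view synthesis, validating its design.

Insight: 1. Point-conditioned diffusion models can substantially improve close-up view synthesis from sparse inputs; 2. geometric consistency and dynamic noise suppression are key to handling sparse inputs in close-up scenes.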

Abstract: Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations and struggle to capture fine-grained details in close-up scenarios, where input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, clearly validating the effectiveness of our design.

[165] Region-Point Joint Representation for Effective Trajectory Similarity Learning cs.CV | cs.IR | cs.LGPDF

Hao Long, Silin Zhou, Lisi Chen, Shuo Shang

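TL;DR: The paper proposes RePo, which jointly represents trajectories with region and point features, combining spatial context with fine-grained movement patterns to substantially improve trajectory similarity learning.

Details

Motivation: Existing learning-based methods have reduced computational complexity but fail to exploit the full range of trajectory information when modeling similarity.

Result: Across all evaluation metrics, RePo improves average accuracy over SOTA baselines by 22.2%.

Insight: Trajectory similarity modeling must consider both regional context and fine-grained movement patterns; a joint multi-scale feature representation is key.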

Abstract: Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose RePo, a novel method that jointly encodes Region-wise and Point-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences; spatial context is captured by structural features, and semantic context is enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experimental results show that RePo achieves an average accuracy improvement of 22.2% over SOTA baselines across all evaluation metrics.
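The first region-wise step, mapping GPS points to a grid sequence, might look roughly like the sketch below. The cell size, the row-major cell numbering, and the decision to collapse consecutive repeats of the same cell are all assumptions for illustration; the abstract does not specify how RePo builds its grid.

```python
def to_grid_sequence(points, min_lat, min_lon, cell_deg, n_cols):
    """Map (lat, lon) points to a sequence of grid-cell ids.

    Assumptions: a uniform grid in degrees, row-major cell numbering, and
    consecutive points that fall in the same cell are merged into one step.
    """
    sequence = []
    for lat, lon in points:
        row = int((lat - min_lat) // cell_deg)
        col = int((lon - min_lon) // cell_deg)
        cell = row * n_cols + col
        if not sequence or sequence[-1] != cell:
            sequence.append(cell)
    return sequence

# A short made-up trajectory on a grid with 0.01-degree cells, 100 columns wide.
trajectory = [(30.005, 120.005), (30.006, 120.006),
              (30.015, 120.005), (30.015, 120.025)]
cells = to_grid_sequence(trajectory, min_lat=30.0, min_lon=120.0,
                         cell_deg=0.01, n_cols=100)
```

The resulting cell sequence is much shorter than the raw GPS stream, which is what makes region-level structural and visual features practical to attach.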

[166] VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language cs.CV | cs.CRPDF

Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang

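TL;DR: The paper proposes VEIL, a framework that bypasses the safety guardrails of text-to-video (T2V) models through visual exploitation of implicit language, successfully generating policy-violating videos.

Details

Motivation: Existing jailbreak attacks on T2V models usually add adversarial perturbations to obviously unsafe prompts and are easy to detect. The paper finds that prompts that look benign but carry rich implicit cues can induce models to generate policy-violating videos.

Result: VEIL's effectiveness is validated on 7 T2V models, with a 23% improvement in average attack success rate on commercial models.

Insight: The work exposes a weakness in how T2V models handle implicit cues and points to new research directions for model safety defenses.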

Abstract: Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend against. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models’ cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger’s effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate on commercial models.

[167] Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack cs.CV | cs.AIPDF

Chenyang Li, Wenbing Tang, Yihao Huang, Sinong Simon Zhan, Ming Hu

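TL;DR: The paper proposes ILA, a black-box framework that attacks VLN agents by manipulating indoor lighting, revealing their fragility under realistic lighting changes.

Details

Motivation: Existing adversarial evaluations of VLN mostly rely on unrealistic texture perturbations and have little practical relevance. Lighting, an important attribute of indoor scenes, has been overlooked.

Result: Experiments show that ILA significantly raises the failure rate of VLN models and lowers trajectory efficiency, revealing their sensitivity to lighting changes.

Insight: Realistic lighting changes are an important potential threat to VLN agents; future research should pay more attention to adversarial robustness against intrinsic scene attributes.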

Abstract: Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.

[168] MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation cs.CVPDF

Junjie Yang, Yuhao Yan, Gang Wu, Yuxuan Wang, Ruoyu Liang

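TL;DR: MedGEN-Bench is a comprehensive benchmark for multimodal medical generation. It targets the limitations of existing medical visual benchmarks by introducing contextually intertwined instructions and open-ended generative outputs, together with a new three-tier evaluation framework.

Details

Motivation: Current medical visual benchmarks use ambiguous queries, oversimplify diagnostic reasoning, and overlook image generation capabilities, which limits the use of AI systems in clinical workflows.

Result: The authors systematically evaluate 10 compositional frameworks, 3 unified models, and 5 vision-language models.

Insight: With contextually intertwined instructions and open-ended generation tasks, MedGEN-Bench probes the cross-modal reasoning and generation abilities of medical AI systems in a way that is closer to real clinical needs.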

Abstract: As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.

[169] Automated Road Distress Detection Using Vision Transformers and Generative Adversarial Networks cs.CV | cs.AIPDF

Cesar Portocarrero Rodriguez, Laura Vandeweyen, Yosuke Yamamoto

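TL;DR: The paper proposes an automated road distress detection approach that combines Vision Transformers and GANs: GANs augment the training data, and the transformer-based MaskFormer model is shown to outperform a conventional CNN approach.

Details

Motivation: US road infrastructure is in poor condition, and traditional inspection methods are costly and inefficient. As real-time visual data from autonomous vehicles becomes more available, computer vision offers a promising route to road distress detection.

Result: GAN-generated data improves model performance; MaskFormer outperforms the CNN on both mAP50 and IoU.

Insight: Synthetic data can significantly improve model generalization, and transformer-based models perform better on complex vision tasks.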

Abstract: The American Society of Civil Engineers has graded America's infrastructure condition as a C, with the road system receiving a dismal D. Roads are vital to regional economic viability, yet their management, maintenance, and repair processes remain inefficient, relying on outdated manual or laser-based inspection methods that are both costly and time-consuming. With the increasing availability of real-time visual data from autonomous vehicles, there is an opportunity to apply computer vision (CV) methods for advanced road monitoring, providing insights to guide infrastructure rehabilitation efforts. This project explores the use of state-of-the-art CV techniques for road distress segmentation. It begins by evaluating synthetic data generated with Generative Adversarial Networks (GANs) to assess its usefulness for model training. The study then applies Convolutional Neural Networks (CNNs) for road distress segmentation and subsequently examines the transformer-based model MaskFormer. Results show that GAN-generated data improves model performance and that MaskFormer outperforms the CNN model in two metrics: mAP50 and IoU.
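For readers unfamiliar with the two reported metrics, the sketch below computes intersection over union (IoU) for a pair of binary masks. In mAP50, a predicted segment counts as a match when its IoU with a ground-truth segment is at least 0.5. The tiny masks are invented for illustration, and returning 1.0 when both masks are empty is a convention chosen for this example.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0

predicted_crack = [[1, 1, 0],
                   [0, 1, 0]]
labeled_crack = [[1, 1, 0],
                 [0, 0, 0]]

iou = mask_iou(predicted_crack, labeled_crack)  # 2 shared pixels out of 3
counts_as_match_at_50 = iou >= 0.5
```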

[170] Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification cs.CVPDF

Rifen Lin, Alex Jinpeng Wang, Jiawei Mo, Min Li

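TL;DR: The paper proposes CSIP-ReID, a skeleton-based, motion-aware pretraining framework for video-based person re-identification (ReID) that improves performance by learning to align skeleton sequences with video frames.

Details

Motivation: Existing video ReID methods mostly rely on text-video pairs, but text struggles to capture fine-grained temporal dynamics such as motion, whereas skeleton data represents spatiotemporal information more effectively.

Result: The method clearly outperforms existing approaches on video ReID datasets such as MARS and LS-VID and on skeleton-only ReID tasks such as BIWI, confirming the framework's generalization.

Insight: Skeleton data is better suited than text to representing dynamic information in video, which advances multimodal representation learning for ReID.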

Abstract: Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) they lack genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at the sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.
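The first-stage idea, aligning skeleton and visual features at the sequence level with contrastive learning, can be illustrated with a generic InfoNCE-style loss. This is a standard formulation chosen for the sketch, not necessarily the exact loss used in CSIP-ReID; the two-dimensional embeddings and the temperature of 0.07 are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(skeleton_embs, visual_embs, temperature=0.07):
    """Average contrastive loss; row i of each list describes the same sequence.

    Assumption: embeddings are already L2-normalised, so the dot product is
    the cosine similarity.
    """
    n = len(skeleton_embs)
    total = 0.0
    for i in range(n):
        logits = [dot(skeleton_embs[i], visual_embs[j]) / temperature
                  for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the true pair
    return total / n

skeleton = [[1.0, 0.0], [0.0, 1.0]]
visual_aligned = [[1.0, 0.0], [0.0, 1.0]]
visual_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(skeleton, visual_aligned)
loss_swapped = info_nce(skeleton, visual_swapped)
```

The loss is near zero when each skeleton sequence is closest to its own video and large when the pairing is wrong, which is the signal that pulls the two modalities together.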

[171] THIR: Topological Histopathological Image Retrieval cs.CVPDF

Zahra Tabatabaei, Jon Sporring

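TL;DR: THIR is an unsupervised medical image retrieval framework based on topological data analysis. It uses Betti numbers from persistent homology to characterize the structural patterns of histopathological images and retrieves similar images efficiently without any training.

Details

Motivation: Breast cancer is one of the leading causes of death among women worldwide, so early diagnosis and accurate decision making are critical. Current deep learning depends on large annotated datasets and expensive compute, whereas THIR aims to provide a fast, scalable, training-free, unsupervised solution.

Result: On the BreaKHis dataset, THIR outperforms existing supervised and unsupervised methods and processes the entire dataset in under 20 minutes on a standard CPU.

Insight: Topological features can serve as an efficient, interpretable tool for medical image retrieval, especially when annotated data is scarce or compute is limited.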

Abstract: According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis, specifically Betti numbers derived from persistent homology, to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. The similarity retrieval is then performed by computing the distances between these topological descriptors, efficiently returning the top-K most relevant matches. Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state-of-the-art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training-free solution for clinical image retrieval.
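As a simplified illustration of "encoding the evolution of loops", the sketch below counts loops (the first Betti number) in a thresholded grayscale image at several thresholds and compares two such curves with an L1 distance. This is a Betti-curve stand-in written with flood fill, not the cubical persistence pipeline the paper uses; the thresholds, the tiny image, and all function names are assumptions.

```python
def count_loops(binary):
    """Count holes in a binary image: background regions enclosed by foreground.

    Foreground pixels are treated as 8-connected, so the background is
    flood-filled with 4-connectivity. The image is padded so that all outer
    background forms a single region, which is not a loop.
    """
    height, width = len(binary), len(binary[0])
    padded = ([[0] * (width + 2)]
              + [[0] + list(row) + [0] for row in binary]
              + [[0] * (width + 2)])
    seen = [[False] * (width + 2) for _ in range(height + 2)]
    regions = 0
    for y in range(height + 2):
        for x in range(width + 2):
            if padded[y][x] or seen[y][x]:
                continue
            regions += 1
            seen[y][x] = True
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height + 2 and 0 <= nx < width + 2
                            and not padded[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return regions - 1  # discount the outer background

def betti1_curve(gray, thresholds):
    # Sublevel sets: a pixel is foreground once its intensity is <= threshold.
    return [count_loops([[1 if v <= t else 0 for v in row] for row in gray])
            for t in thresholds]

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Dark ring (10) around a brighter centre (200) on a white background (255).
gray = [[255, 255, 255, 255, 255],
        [255,  10,  10,  10, 255],
        [255,  10, 200,  10, 255],
        [255,  10,  10,  10, 255],
        [255, 255, 255, 255, 255]]
descriptor = betti1_curve(gray, thresholds=[5, 50, 220])
```

Retrieval then reduces to ranking database images by `l1_distance` between descriptors, which needs no training and runs on a CPU.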

[172] GenTract: Generative Global Tractography cs.CVPDF

Alec Sargood, Lemuel Puglisi, Elinor Thompson, Mirco Musolesi, Daniel C. Alexander

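TL;DR: GenTract is a generative global tractography method that frames tractography as a generative task and produces complete, anatomically plausible streamlines directly from dMRI data. It performs well on both high-resolution and low-resolution or noisy data, with precision well above existing methods.

Details

Motivation: Local tractography methods are prone to error accumulation and high false positive rates, while global methods are computationally expensive. GenTract aims to combine the strengths of both with a generative model.

Result: GenTract's precision is 2.1 times that of the next-best method, TractOracle, and its advantage is even larger on low-resolution and noisy data.

Insight: Generative modeling can overcome the limitations of traditional methods, offering tractography that is both precise and computationally efficient.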

Abstract: Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract’s performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1x higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez

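TL;DR: The paper examines how to make effective use of larger decoder-only models and visual information to improve extreme multi-label classification (XMC), and proposes the ViXML framework.

Details

Motivation: XMC must balance efficiency and performance over a very large label space. Existing methods mostly rely on small encoder models; the authors aim to improve performance further with larger decoder models and visual information.

Result: Experiments show that ViXML performs well on four public text datasets and their image-enhanced versions, with a P@1 gain of up to 8.21%. With visual information, small encoders outperform text-only decoders in most cases.

Insight: Visual information matters greatly for XMC performance; in the authors' words, an image is worth billions of parameters. ViXML offers an efficient solution for multimodal XMC.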

Abstract: Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a decoder with a few billion parameters can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms a text-only decoder in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals’ effectiveness, surpassing the previous state-of-the-art by up to +8.21% in P@1 on the largest dataset. ViXML’s code is available at https://github.com/DiegoOrtego/vixml.
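The retrieval formulation mentioned above, XMC as maximum inner product search over label embeddings, together with the P@k metric, can be sketched as follows. This brute-force version scores every label, whereas systems at extreme scale would typically use approximate search; the embeddings and label names are invented, and nothing here reflects ViXML's actual encoders or image pooling.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_labels(query_emb, label_embs, k):
    """Rank labels by inner product with the query and keep the best k."""
    ranked = sorted(label_embs,
                    key=lambda name: dot(query_emb, label_embs[name]),
                    reverse=True)
    return ranked[:k]

def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predictions that are truly relevant (P@k)."""
    hits = sum(1 for label in predicted[:k] if label in relevant)
    return hits / k

# Toy label embeddings (illustrative values only).
label_embs = {
    "label_a": [0.9, 0.1],
    "label_b": [0.2, 0.8],
    "label_c": [0.5, 0.5],
}
top2 = top_k_labels([1.0, 0.2], label_embs, k=2)
p_at_1 = precision_at_k(top2, relevant={"label_a"}, k=1)
p_at_2 = precision_at_k(top2, relevant={"label_a"}, k=2)
```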

[174] Video Spatial Reasoning with Object-Centric 3D Rollout cs.CVPDF

Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li

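TL;DR: The paper proposes Object-Centric 3D Rollout (OCR), a method that improves the video spatial reasoning of multi-modal large language models (MLLMs) in dynamic 3D scenes. Structured perturbations and a rollout-based training strategy yield significant gains.

Details

Motivation: Existing MLLMs are limited in video spatial reasoning: they tend to focus on objects explicitly mentioned in the query and ignore contextual cues. The paper aims to make models reason more holistically about object locations, orientations, and inter-object relationships in 3D scenes.

Result: The 3B-parameter model reaches 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations show that OCR outperforms prior rollout strategies such as T-GRPO and NoisyRollout.

Insight: Structured perturbations combined with rollout-based training help the model capture spatial relationships across the whole dynamic 3D scene rather than only the objects named in the query.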

Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning (the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes) remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).

[175] Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection cs.CVPDF

Soyul Lee, Seungmin Baek, Dongbo Min

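TL;DR: MonoDLGD is a difficulty-aware label-guided denoising framework that adaptively perturbs and reconstructs ground-truth labels to improve monocular 3D object detection.

Details

Motivation: Monocular 3D object detection is limited by inherently ambiguous depth information. Existing methods overlook instance-level detection difficulty (occlusion, distance, truncation), which leads to suboptimal results.

Result: On the KITTI benchmark, MonoDLGD achieves state-of-the-art performance across all difficulty levels.

Insight: 1. Detection difficulty (such as occlusion and distance) strongly affects monocular 3D object detection; 2. Geometric supervision and joint training effectively improve model robustness.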

Abstract: Monocular 3D object detection is a cost-effective solution for applications like autonomous driving and robotics, but remains fundamentally ill-posed due to inherently ambiguous depth cues. Recent DETR-based methods attempt to mitigate this through global attention and auxiliary depth prediction, yet they still struggle with inaccurate depth estimates. Moreover, these methods often overlook instance-level detection difficulty, such as occlusion, distance, and truncation, leading to suboptimal detection performance. We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Specifically, MonoDLGD applies stronger perturbations to easier instances and weaker ones to harder cases, and then reconstructs them to provide explicit geometric supervision. By jointly optimizing label reconstruction and 3D object detection, MonoDLGD encourages geometry-aware representation learning and improves robustness to varying levels of object complexity. Extensive experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.
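
The abstract states only the direction of the perturbation schedule: stronger noise for easy instances, weaker noise for hard ones. The sketch below is one hypothetical way to express that idea; the linear mapping from uncertainty to noise scale, the `max_scale` value, and the multiplicative uniform noise are all assumptions and not MonoDLGD's actual formulation.

```python
import random
from typing import List, Sequence


def noise_scale(uncertainty: float, max_scale: float = 0.4) -> float:
    """Map detection uncertainty in [0, 1] to a noise scale (assumed linear).

    Low uncertainty (easy instance) gives a large scale;
    high uncertainty (hard instance) gives a small scale.
    """
    u = min(max(uncertainty, 0.0), 1.0)
    return max_scale * (1.0 - u)


def perturb_label(label: Sequence[float],
                  uncertainty: float,
                  rng: random.Random,
                  max_scale: float = 0.4) -> List[float]:
    """Scale each label value by a random factor in [1 - s, 1 + s]."""
    s = noise_scale(uncertainty, max_scale)
    return [v * (1.0 + rng.uniform(-s, s)) for v in label]
```

In a denoising setup, the perturbed labels would be fed to the decoder, which is trained to reconstruct the original ground truth; that reconstruction step is outside this sketch.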

[176] RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection cs.CVPDF

Junhee Lee, ChaeBeen Bang, MyoungChul Kim, MyeongAh Cho

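TL;DR: RefineVAD is a two-module framework that combines temporal motion and semantic structure. Motion-aware Temporal Attention and Recalibration (MoTAR) and Category-Oriented Refinement (CORE) explicitly model how anomalous events evolve and which semantic category they resemble, improving weakly supervised video anomaly detection.

Details

Motivation: Existing weakly supervised video anomaly detection methods typically treat all anomalies as a single category, overlooking their diverse semantic and temporal characteristics, and so fail to capture complex real-world anomalies. Inspired by how humans perceive anomalies, the paper proposes a dual reasoning framework over motion and semantics.

Result: Experiments on the WVAD benchmark show that RefineVAD significantly outperforms existing methods, demonstrating the importance of using semantic context to guide feature refinement.

Insight: Anomaly detection needs to attend both to how an event evolves (“how”) and to which semantic category it resembles (“what”); modeling both captures anomalies more completely.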

Abstract: Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both “how” motion evolves and “what” semantic category it resembles. Extensive experiments on the WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

[177] End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer cs.CVPDF

Yonghui Yu, Jiahang Cai, Xun Wang, Wenwu Yang

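TL;DR: The paper proposes PAVE-Net, the first end-to-end method for multi-frame 2D human pose estimation. A spatial encoder and a spatiotemporal pose decoder eliminate heuristic operations and significantly improve accuracy and efficiency.

Details

Motivation: Existing multi-person video pose estimation methods typically use a two-stage pipeline (detection plus temporal modeling) that relies on heuristic operations such as detection, RoI cropping, and NMS, which limits accuracy and efficiency.

Result: On PoseTrack2017, mAP improves by 6.0 over prior image-based end-to-end methods; accuracy is competitive with two-stage methods, with significant efficiency gains.

Insight: End-to-end methods remove the bottleneck of heuristic operations, and pose-aware attention is key to cross-frame association.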

Abstract: Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet

[178] Hybrid-Domain Adaptative Representation Learning for Gaze Estimation cs.CVPDF

Qida Tan, Hongyu Yang, Wenchao Du

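TL;DR: The paper proposes Hybrid-domain Adaptative Representation Learning (HARL), a framework that addresses cross-domain performance degradation in appearance-based gaze estimation. Feature alignment with high-quality near-eye images and a sparse graph fusion module yield a robust gaze representation.

Details

Motivation: Existing appearance-based gaze estimation methods degrade significantly in cross-domain evaluation, mainly because of gaze-irrelevant factors such as expressions, wearables, and image quality. The paper aims to solve this by learning a robust gaze representation.

Result: Achieves state-of-the-art accuracy of 5.02°, 3.36°, and 9.26° on the EyeDiap, MPIIFaceGaze, and Gaze360 datasets, respectively, and performs well in cross-dataset evaluation.

Insight: Combining feature alignment with high-quality near-eye images and geometric constraints markedly improves the robustness and cross-domain performance of gaze estimation.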

Abstract: Appearance-based gaze estimation, aiming to predict accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (HARL for short) framework that exploits multi-source hybrid datasets to learn robust gaze representation. More specifically, we propose to disentangle gaze-relevant representation from low-quality facial images by aligning features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which adds almost no computational or inference cost. Additionally, we analyze the effect of head-pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head-pose, leading to a dense and robust gaze representation. Extensive experiments on the EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of 5.02°, 3.36°, and 9.26°, respectively, and presents competitive performance in cross-dataset evaluation. The code is available at https://github.com/da60266/HARL.

[179] MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection cs.CVPDF

Junjie Wu, Guohong Fu

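TL;DR: MMD-Thinker is a two-stage framework that detects multimodal misinformation through adaptive multi-dimensional thinking, addressing the insufficient reasoning and reasoning biases of general-purpose multimodal large language models in this task.

Details

Motivation: Multimodal misinformation is cheap to create and highly deceptive, which threatens society. Existing general-purpose multimodal large language models lack task-specific knowledge for detection, and their reasoning ability and thinking modes are limited.

Result: MMD-Thinker achieves SOTA performance on both in-domain and out-of-domain benchmark datasets while keeping inference and token usage flexible.

Insight: Combining task-tailored thinking modes with a reinforcement learning strategy effectively improves the accuracy and adaptability of multimodal misinformation detection.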

Abstract: Multimodal misinformation floods social media and continues to evolve in the era of AI-generated content (AIGC). This emerging misinformation, with its low creation cost and high deceptiveness, poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow a uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode leads detectors down a suboptimal path for judgment, struggling to keep pace with the fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop a tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage a reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, which encompasses more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the realm of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available on GitHub.

[180] GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models cs.CV | cs.AIPDF

Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang

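TL;DR: GeoX-Bench is a benchmark for cross-view geo-localization and pose estimation that evaluates large multimodal models on these tasks and provides a large amount of data for fine-tuning.

Details

Motivation: Large multimodal models (LMMs) perform well on many tasks, but their abilities in cross-view geo-localization and pose estimation remain underexplored, even though these abilities matter for navigation, autonomous driving, and outdoor robotics.

Result: Current LMMs perform well on geo-localization but decline significantly on pose estimation; instruction tuning significantly improves their cross-view geo-sense abilities.

Insight: Cross-view geo-localization and pose estimation are important directions for future LMMs; the complexity of pose estimation in particular exposes the models' limitations.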

Abstract: Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive Benchmark designed to explore and evaluate the capabilities of LMMs in cross-view Geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark results demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and that instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.

[181] Building Egocentric Procedural AI Assistant: Methods, Benchmarks, and Challenges cs.CVPDF

Junlong Li, Huaiyuan Xu, Sijie Cheng, Kejun Wu, Kim-Hui Yap

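TL;DR: The paper introduces the Egocentric Procedural AI Assistant (EgoProceAssist), which supports daily procedural tasks from a first-person view, and defines three core tasks: error detection, procedural learning, and question answering. Through a review of existing techniques, datasets, and evaluation metrics, plus comparative experiments with existing VLM-based methods, it identifies future research challenges and directions.

Details

Motivation: Recent advances in vision language models (VLMs) and egocentric perception research have driven demand for AI assistants that support procedural tasks. The paper aims to fill the research gap in this area.

Result: The experiments reveal the limitations of existing methods on egocentric tasks and provide a basis for improvement.

Insight: Egocentric tasks require stronger contextual understanding and interaction capabilities; future research should attend to dataset diversity and task complexity.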

Abstract: Driven by recent advances in vision language models (VLMs) and egocentric perception research, we introduce the concept of an egocentric procedural AI assistant (EgoProceAssist) tailored to support daily procedural tasks step by step from a first-person view. In this work, we start by identifying three core tasks: egocentric procedural error detection, egocentric procedural learning, and egocentric procedural question answering. These tasks define the essential functions of EgoProceAssist within a new taxonomy. Specifically, our work encompasses a comprehensive review of current techniques, relevant datasets, and evaluation metrics across these three core areas. To clarify the gap between the proposed EgoProceAssist and existing VLM-based AI assistants, we introduce novel experiments and provide a comprehensive evaluation of representative VLM-based methods. Based on these findings and our technical analysis, we discuss the challenges ahead and suggest future research directions. Furthermore, an exhaustive list of the works covered in this study is publicly available in an active repository that continuously collects the latest work: https://github.com/z1oong/Building-Egocentric-Procedural-AI-Assistant

Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang

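TL;DR: The paper introduces SpatialSky-Bench, a benchmark that evaluates the spatial intelligence of vision-language models (VLMs) in UAV navigation, and develops the SpatialSky-Dataset and the Sky-VLM model, which significantly improve VLM performance in UAV scenarios.

Details

Motivation: The spatial intelligence of existing vision-language models (VLMs) in UAV navigation has not been studied sufficiently, and their behavior in dynamic environments is uncertain.

Result: Sky-VLM achieves state-of-the-art performance on all benchmark tasks, demonstrating its effectiveness in UAV scenarios.

Insight: The study exposes the shortcomings of existing VLMs in complex UAV environments and underlines the importance of models tailored to specific scenarios.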

Abstract: Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories (Environmental Perception and Scene Understanding) divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.

[183] Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models cs.CVPDF

Noam Tsfaty, Avishai Weizman, Liav Cohen, Moshe Tshuva, Yehudit Aperstein

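TL;DR: The paper proposes a dual-encoder framework that combines convolutional and transformer representations to detect abnormal events in surveillance videos using video-level supervision, reaching 90.7% AUC on the UCF-Crime dataset.

Details

Motivation: Abnormal event detection in surveillance video usually depends on scarce labeled data, and anomaly types are diverse. The paper addresses this challenge using only video-level supervision.

Result: Reaches 90.7% AUC on the UCF-Crime dataset, outperforming existing methods.

Insight: Combining the local feature extraction of convolutions with the global modeling of transformers learns anomaly patterns more effectively from weak supervision signals.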

Abstract: We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.
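
Top-k pooling is a common way to turn per-segment anomaly scores into one video-level score that can be trained with video-level labels. The sketch below shows that general technique only: averaging the two backbones' segment scores before pooling is an assumption made for illustration, since the abstract does not say how or where the paper fuses the convolutional and transformer branches.

```python
from typing import List, Sequence


def top_k_mean(scores: Sequence[float], k: int) -> float:
    """Mean of the k highest segment scores (k is clipped to the sequence length)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(scores))
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / k


def video_score(conv_scores: Sequence[float],
                transformer_scores: Sequence[float],
                k: int) -> float:
    """Video-level anomaly score from two per-segment score streams.

    Assumption: the streams are fused by simple averaging per segment.
    """
    if len(conv_scores) != len(transformer_scores):
        raise ValueError("both streams must cover the same segments")
    fused: List[float] = [(c + t) / 2.0
                          for c, t in zip(conv_scores, transformer_scores)]
    return top_k_mean(fused, k)
```

Because only the k highest-scoring segments contribute, a long video with a brief anomaly can still receive a high video-level score.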

[184] SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting cs.CVPDF

Zihan Li, Tengfei Wang, Wentian Gan, Hao Zhan, Xin Wang

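TL;DR: SF-Recon reconstructs lightweight building surfaces directly from multi-view images, avoiding the cumbersome mesh simplification step of conventional pipelines. It combines 3D Gaussian Splatting (3DGS) with multi-view consistency optimization to produce a structurally accurate, lightweight building mesh.

Details

Motivation: Conventional multi-view geometry pipelines rely on dense reconstruction and mesh simplification, which is inefficient and quality-sensitive. SF-Recon aims to simplify this pipeline and generate lightweight building surface models directly.

Result: Experiments on the SF dataset show that the reconstructed building models have substantially fewer vertices and faces while maintaining computational efficiency and structural accuracy.

Insight: By combining 3DGS with structural optimization, SF-Recon shows the potential of generating lightweight building surfaces directly from multi-view images, avoiding the complexity and inefficiency of conventional methods.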

Abstract: Lightweight building surface models are crucial for digital cities, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website: https://lzh282140127-cell.github.io/SF-Recon-project/

[185] Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space cs.CVPDF

Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu

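TL;DR: The paper proposes jointly optimizing multi-person mesh recovery in camera space, addressing the lack of scene consistency in existing pseudo-ground-truth generation, and builds DTO-Humans, a large-scale, high-quality multi-person dataset.

Details

Motivation: In multi-person human mesh recovery, pseudo-ground-truth generation is usually single-person-centric, which causes conflicting depths and scales between individuals. The paper aims to fix this and improve scene consistency.

Result: Experiments show that the method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery.

Insight: Joint optimization and scene consistency improve the accuracy of multi-person mesh recovery, which highlights the importance of depth cues and metric-scale information in multi-person tasks.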

Abstract: Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.

[186] TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing cs.CVPDF

Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim

TL;DR: TabFlash is an MLLM designed for efficient table understanding, utilizing progressive question conditioning and token focusing to generate informative and compact visual features.

Details

Motivation: Existing MLLMs struggle with redundant visual representations and lack question-specific focus in table understanding tasks.

Result: Achieves SOTA performance with reduced computational costs (27% fewer FLOPs and 30% less memory usage).

Insight: Adapting feature generation to question-specific needs and reducing redundancy significantly improves table understanding efficiency.

Abstract: Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer’s capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% fewer FLOPs and 30% less memory usage compared to the second-best MLLM.

[187] CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving cs.CVPDF

Enhui Ma, Lijun Zhou, Tao Tang, Jiahuan Zhang, Junpeng Jiang

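TL;DR: The paper proposes CorrectAD, a self-correcting agentic system that improves the robustness of end-to-end autonomous driving planners, especially on long-tail cases. Its core idea is to correct failure cases automatically using diffusion-based video generation and structured 3D layouts.

Details

Motivation: Current end-to-end planning methods are not robust to the long-tail problem, particularly rare but safety-critical failure cases. To address this, the authors explore using diffusion-based video generation and 3D layouts to build a fully automated self-correcting system.

Result: Evaluated on two datasets (nuScenes and an in-house dataset), CorrectAD corrects 62.5% and 49.8% of failure cases and reduces collision rates by 39% and 27%, respectively.

Insight: Combining diffusion-based video generation with structured 3D layouts effectively improves the robustness of autonomous driving planning, particularly in rare but safety-critical scenarios.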

Abstract: End-to-end planning methods are the de facto standard of the current autonomous driving system, while the robustness of the data-driven approaches suffers due to the notorious long-tail problem (i.e., rare but safety-critical failure cases). In this work, we explore whether recent diffusion-based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self-correct such failure cases. We first introduce an agent to simulate the role of product manager, dubbed PM-Agent, which formulates data requirements to collect data similar to the failure cases. Then, we use a generative model that can simulate both data collection and annotation. However, existing generative models struggle to generate high-fidelity data conditioned on 3D layouts. To address this, we propose DriveSora, which can generate spatiotemporally consistent videos aligned with the 3D annotations requested by PM-Agent. We integrate these components into our self-correcting agentic system, CorrectAD. Importantly, our pipeline is end-to-end model-agnostic and can be applied to improve any end-to-end planner. Evaluated on both nuScenes and a more challenging in-house dataset across multiple end-to-end planners, CorrectAD corrects 62.5% and 49.8% of failure cases, reducing collision rates by 39% and 27%, respectively.

[188] DriveLiDAR4D: Sequential and Controllable LiDAR Scene Generation for Autonomous Driving cs.CVPDF

Kaiwen Cai, Xinze Liu, Xia Zhou, Hengtong Hu, Jie Xiang

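TL;DR: DriveLiDAR4D is a LiDAR generation method that sequentially generates temporally consistent point cloud scenes; its LiDAR4DNet model enables highly controllable foreground objects and realistic backgrounds.

Details

Motivation: Existing LiDAR point cloud generation methods lack sequential generation and fine control over foreground objects and backgrounds, which limits their applicability to real autonomous driving systems.

Result: Performs strongly on the nuScenes and KITTI datasets, with FRD and FVD scores that significantly surpass the previous best method, UniScene.

Insight: The method shows how important controllability over both time and space in LiDAR scene generation is for developing autonomous driving systems.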

Abstract: The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations, including the lack of sequential generation capabilities and the inability to produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluated DriveLiDAR4D on the nuScenes and KITTI datasets, where we achieved an FRD score of 743.13 and an FVD score of 16.96 on the nuScenes dataset, surpassing the current state-of-the-art (SOTA) method, UniScene, with a performance boost of 37.2% in FRD and 24.1% in FVD, respectively.

[189] Computer Vision based group activity detection and action spotting cs.CV | cs.AIPDF

Narthana Sivalingam, Santhirarajah Sivasthigan, Thamayanthi Mahendranathan, G. M. R. I. Godaliyadda, M. P. B. Ekanayake

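TL;DR: The paper presents a computer vision framework that combines deep learning with graph-based relational reasoning for group activity detection and action recognition. Using Mask R-CNN, Actor Relation Graphs, and GCNs, it achieves strong recognition across different scene types.

Details

Motivation: Group activity detection in multi-person scenes is challenging because of complex human interactions, occlusions, and appearance changes over time. The paper aims to address these problems with a robust recognition method.

Result: Experiments on the Collective Activity dataset show improved recognition performance in both crowded and non-crowded scenarios.

Insight: The study shows the potential of combining segmentation, feature extraction, and relational graph reasoning for multi-person video understanding.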

Abstract: Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.
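
The relational reasoning step can be illustrated with a small, dependency-free sketch: build a relation graph from pairwise dot-product similarity between actor features, then apply one graph convolution. This covers only the dot-product variant of appearance similarity; the row-wise softmax normalization, the omission of positional relations, and all function names are assumptions for illustration rather than the paper's implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def relation_graph(features: Matrix) -> Matrix:
    """Row-normalized relation graph from dot-product similarity.

    Entry [i][j] is the softmax (over j) of the dot product between
    the features of actor i and actor j.
    """
    n = len(features)
    graph: Matrix = []
    for i in range(n):
        sims = [sum(a * b for a, b in zip(features[i], features[j]))
                for j in range(n)]
        peak = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in sims]
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph


def gcn_layer(graph: Matrix, features: Matrix, weight: Matrix) -> Matrix:
    """One graph convolution: ReLU(graph @ features @ weight)."""
    n, dim, out_dim = len(features), len(weight), len(weight[0])
    # Aggregate neighbor features according to the relation graph.
    agg = [[sum(graph[i][j] * features[j][d] for j in range(n))
            for d in range(dim)] for i in range(n)]
    # Linear projection followed by ReLU.
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(dim)))
             for o in range(out_dim)] for i in range(n)]
```

Per-actor outputs of such a layer could feed an individual-action classifier, and a pooled version of them a group-activity classifier, which matches the two prediction levels described in the abstract.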

[190] Semi-Supervised Multi-Task Learning for Interpretable Quality Assessment of Fundus Images cs.CV | cs.AIPDF

Lucas Gabriel Telesco, Danila Nejamkin, Estefanía Mata, Francisco Filizzola, Kevin Wignall

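TL;DR: The paper proposes a semi-supervised multi-task learning approach that improves the interpretability and performance of retinal image quality assessment (RIQA). Combining manual labels with pseudo-labels reduces the dependence on extensive manual annotation.

Details

Motivation: Existing retinal image quality assessment tools usually classify only overall image quality and cannot report detailed acquisition defects to guide recapture, mainly because detailed annotations are expensive.

Result: The multi-task model outperforms single-task baselines (F1: 0.875 vs. 0.863 on EyeQ; 0.778 vs. 0.763 on DeepDRiD) and performs close to expert level on a newly annotated EyeQ subset.

Insight: Pseudo-label noise is consistent with the variability among expert assessments, which suggests that the semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions at no extra annotation cost.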

Abstract: Retinal image quality assessment (RIQA) supports computer-aided diagnosis of eye diseases. However, most tools classify only overall image quality, without indicating acquisition defects to guide recapture. This gap is mainly due to the high cost of detailed annotations. In this paper, we aim to mitigate this limitation by introducing a hybrid semi-supervised learning approach that combines manual labels for overall quality with pseudo-labels of quality details within a multi-task framework. Our objective is to obtain more interpretable RIQA models without requiring extensive manual labeling. Pseudo-labels are generated by a Teacher model trained on a small dataset and then used to fine-tune a pre-trained model in a multi-task setting. Using a ResNet-18 backbone, we show that these weak annotations improve quality assessment over single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, and 0.778 vs. 0.763 on DeepDRiD), matching or surpassing existing methods. The multi-task model achieved performance statistically comparable to the Teacher for most detail prediction tasks (p > 0.05). In a newly annotated EyeQ subset released with this paper, our model performed similarly to experts, suggesting that pseudo-label noise aligns with expert variability. Our main finding is that the proposed semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions (illumination, clarity, contrast). This enhances interpretability at no extra manual labeling cost and offers clinically actionable outputs to guide image recapture.

[191] Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA) cs.CV | cs.AIPDF

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins

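TL;DR: DTPQA is a visual question answering (VQA) benchmark designed to evaluate the perception capabilities of vision-language models (VLMs) in traffic scenes. It contains synthetic and real data and is annotated with object distances.

Details

Motivation: For vision-language models to be used in safety-critical domains such as automated driving, they need strong perception capabilities, especially at long range. A dedicated benchmark is therefore needed to evaluate pure perception without interference from other skills such as reasoning.

Result: The dataset and its generation scripts are provided and can be used to produce more data of the same kind. Experiments show that DTPQA can effectively evaluate VLM perception in traffic scenes, particularly at long range.

Insight: DTPQA fills a gap in existing benchmarks for traffic perception evaluation, especially at long range. The distance annotations offer a new angle for studying how VLM performance degrades.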

Abstract: The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
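
A minimal sketch of the per-distance analysis the abstract describes. The field names (`image_path`, `question`, `answer`, `distance_m`), the exact-match scoring, and the bin edges at 20 m and 30 m are assumptions for illustration; none of this is taken from the released DTPQA scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One DTPQA-style record; the field names are hypothetical."""
    image_path: str
    question: str
    answer: str        # ground-truth answer
    distance_m: float  # distance of the queried object from the camera


def accuracy_by_distance(samples, predictions, bin_edges=(20.0, 30.0)):
    """Exact-match accuracy per distance bin.

    With the default edges the bins are [0, 20), [20, 30) and [30, inf),
    indexed 0, 1 and 2.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for sample, pred in zip(samples, predictions):
        # Count how many edges the distance has reached; that count is the bin index.
        bin_idx = sum(sample.distance_m >= edge for edge in bin_edges)
        totals[bin_idx] += 1
        hits[bin_idx] += int(pred.strip().lower() == sample.answer.strip().lower())
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Comparing the accuracy of the last bin with the first gives a rough view of how much a model degrades at long range.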

[192] What Color Is It? A Text-Interference Multimodal Hallucination Benchmark cs.CV

Jinkun Zhao, Lei Huang, Wenjun Wu

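TL;DR: The paper introduces the “What Color Is It” dataset for testing text-interference visual hallucination in the color perception of multimodal large models (MLMs), and examines the underlying causes and possible solutions.

Details

Motivation: As MLMs advance rapidly, their visual perception, and color perception in particular, remains susceptible to interference from textual information, which leads to hallucination. The authors build a dedicated benchmark dataset to study this problem.

Result: The benchmark confirms that MLMs are susceptible to text interference in color perception, which causes visual hallucination and exposes a limitation of these models.

Insight: When fusing textual and visual information, MLMs may suffer cross-modal interference, particularly in low-level vision tasks such as color recognition. Future work needs to improve model robustness to reduce such hallucinations.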

Abstract: With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the “What Color Is It” dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.

[193] VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task cs.CV

Xingming Long, Jie Zhang, Shiguang Shan, Xilin Chen

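TL;DR: The paper proposes VOPE, a method for evaluating hallucination of large vision-language models (LVLMs) in voluntary imagination tasks. It finds that current LVLMs perform poorly on these tasks and that existing hallucination mitigation methods have limited effect.

Details

Motivation: Existing research focuses on hallucination in factual description tasks, which prohibit any output absent from the image, and overlooks voluntary imagination tasks such as story writing, where models are expected to generate novel content. VOPE aims to fill this gap.

Result: Experiments show that (1) most LVLMs hallucinate heavily in voluntary imagination tasks and perform poorly in presence evaluation on imagined objects; (2) existing hallucination mitigation methods show limited effect in these tasks.

Insight: Hallucination in voluntary imagination tasks is a direction in need of study, and new methods are required to address the limitations of current models.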

Abstract: Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE), a novel method to assess LVLMs’ hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model’s interpretation and the object’s presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.

[194] Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline cs.CV | cs.AI

Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu

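TL;DR: The paper proposes Foresee, a training-free pipeline based on multimodal large language models (MLLMs) for image forgery analysis and localization that outperforms existing methods.

Details

Motivation: Existing image forgery detection methods generalize poorly and offer limited interpretability, while MLLM approaches that rely on large-scale training are computationally expensive. The paper aims to exploit the potential of vanilla MLLMs with an efficient training-free solution.

Result: Experiments show that Foresee performs well across a variety of forgery types, with higher localization accuracy, richer textual explanations, and stronger generalization.

Insight: A training-free MLLM approach can be effective for image forgery analysis, offering a new direction for the forensic domain.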

Abstract: With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.

[195] Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling cs.CV | cs.AI | cs.LG

Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou

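TL;DR: The paper proposes SliDer, a framework that uses vision-language models (VLMs) to convert slide images into editable SVG, addressing the failure of existing geometric raster-vectorization methods to recover high-level semantic structure.

Details

Motivation: Multimedia documents such as slides and posters are often distributed in static raster formats, which removes the ability to edit and customize them. Existing methods rely on low-level geometric primitives (such as curves and polygons) and cannot recover high-level structure and semantic information.

Result: SliDer achieves a reconstruction LPIPS of 0.069 and is preferred by human evaluators in 82.9% of cases over the strongest zero-shot VLM baseline.

Insight: The semantic understanding of VLMs can address the high-level structure problem in document vectorization, and iterative refinement during inference improves reconstruction quality.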

Abstract: Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.

[196] Language-Guided Invariance Probing of Vision-Language Models cs.CV

Jae Joong Lee

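TL;DR: The paper introduces Language-Guided Invariance Probing (LGIP), a benchmark that evaluates the robustness of vision-language models (VLMs) under linguistic perturbations, focusing on invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips.

Details

Motivation: VLMs such as CLIP and OpenCLIP perform strongly on zero-shot tasks, but how reliably they respond to controlled linguistic perturbations is unclear. The paper aims to fill this gap.

Result: EVA02-CLIP and large OpenCLIP variants strike a favorable balance between invariance and sensitivity, whereas SigLIP and SigLIP2 show much larger invariance error and often prefer the flipped captions.

Insight: Standard retrieval metrics fail to capture these linguistic robustness problems. LGIP provides a model-agnostic diagnostic that reveals model behavior beyond conventional accuracy.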

Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
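
The abstract names three summary statistics (an invariance error, a semantic sensitivity gap, and a positive-rate statistic) without giving formulas. The sketch below shows one plausible reading; the function name and all three definitions are assumptions made for illustration and may differ from the benchmark's own.

```python
def lgip_style_stats(orig_scores, paraphrase_scores, flip_scores):
    """Summarize image-text matching scores for one model.

    Each argument is a list of floats aligned by caption index:
    the score of the original caption, of its paraphrase, and of its
    semantic flip. The formulas are illustrative assumptions.
    """
    n = len(orig_scores)
    if n == 0 or len(paraphrase_scores) != n or len(flip_scores) != n:
        raise ValueError("score lists must be non-empty and equally long")
    # Assumed invariance error: mean absolute score change under paraphrase.
    invariance_error = sum(abs(o - p) for o, p in zip(orig_scores, paraphrase_scores)) / n
    # Assumed sensitivity gap: mean margin of the original over the flipped caption.
    sensitivity_gap = sum(o - f for o, f in zip(orig_scores, flip_scores)) / n
    # Assumed positive rate: fraction of cases where the original beats the flip.
    positive_rate = sum(o > f for o, f in zip(orig_scores, flip_scores)) / n
    return invariance_error, sensitivity_gap, positive_rate
```

Under this reading, a robust model would show a low invariance error together with a positive gap and a positive rate close to 1.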

[197] Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems cs.CV

Jeffrey Wen, Rizwan Ahmad, Philip Schniter

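TL;DR: The paper proposes an asymptotically minimax approach to multi-target conformal prediction that yields tight prediction intervals with joint marginal coverage, with applications to imaging inverse problems.

Details

Motivation: Uncertainty quantification in imaging inverse problems remains challenging, especially in safety-critical applications. Existing conformal prediction work handles only a scalar estimation target, while practical applications often involve multiple targets.

Result: Experiments on synthetic and MRI data demonstrate benefits over existing multi-target conformal prediction methods.

Insight: A minimax approach can give better uncertainty quantification in multi-target conformal prediction and is suited to complex imaging inverse problems.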

Abstract: In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.
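
For context, a simple way to obtain joint marginal coverage over several targets is split conformal prediction with a max-of-normalized-residuals score. The sketch below shows that generic baseline only; it is not the minimax method of the paper, and the function name and the fixed per-target `scales` are assumptions.

```python
import math


def joint_conformal_halfwidths(cal_preds, cal_truths, scales, alpha=0.1):
    """Half-widths for K simultaneous prediction intervals.

    cal_preds, cal_truths: calibration predictions and ground truths,
    one length-K list per sample. scales: one positive normalizer per
    target (an assumed input). Under exchangeability, all K intervals
    pred[k] +/- halfwidth[k] hold jointly with probability >= 1 - alpha.
    """
    n = len(cal_preds)
    # One score per calibration sample: the worst normalized error over targets.
    scores = sorted(
        max(abs(t - p) / s for p, t, s in zip(pred, truth, scales))
        for pred, truth in zip(cal_preds, cal_truths)
    )
    # Finite-sample corrected quantile index used in split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        raise ValueError("too few calibration samples for this alpha")
    q = scores[rank - 1]
    return [q * s for s in scales]
```

Because one shared quantile is rescaled per target, the relative interval widths are fixed by `scales`; how to choose that trade-off across targets is the kind of question a minimax formulation addresses.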

[198] BootOOD: Self-Supervised Out-of-Distribution Detection via Synthetic Sample Exposure under Neural Collapse cs.CV | cs.LG

Yuanchao Wang, Tian Qin, Eduardo Valle, Bruno Abrahao

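TL;DR: BootOOD is a self-supervised OOD detection method that synthesizes pseudo-OOD features and, building on the Neural Collapse phenomenon, uses a radius-based classifier on feature norms to improve OOD detection.

Details

Motivation: Existing OOD detectors struggle with semantically similar OOD samples and often depend on labeled data or external OOD samples. BootOOD aims for strong OOD detection using only ID data, particularly in semantically challenging settings.

Result: On CIFAR-10, CIFAR-100, and ImageNet-200, BootOOD outperforms prior post-hoc methods and training-based methods without outlier exposure, is competitive with state-of-the-art outlier-exposure approaches, and maintains or improves ID accuracy.

Insight: The consistency of feature norms under Neural Collapse offers a natural way to separate ID from OOD samples, and works especially well when the two are semantically close.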

Abstract: Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.
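
The decision rule implied by "OOD samples have smaller feature norms than ID features" can be sketched as a plain threshold on the L2 norm. BootOOD learns this separation with an auxiliary head; the fixed threshold, the function names, and the `keep_fraction` parameter below are simplifying assumptions for illustration.

```python
import math


def feature_norm(feature):
    """L2 norm of one feature vector."""
    return math.sqrt(sum(x * x for x in feature))


def fit_radius(id_features, keep_fraction=0.95):
    """Choose a radius so that about keep_fraction of ID norms are >= radius."""
    norms = sorted(feature_norm(f) for f in id_features)
    cut = int(len(norms) * (1 - keep_fraction))
    return norms[cut]


def is_ood(feature, radius):
    """Flag a sample as OOD when its feature norm falls inside the radius."""
    return feature_norm(feature) < radius
```

A usage pattern would be to call `fit_radius` on held-out ID features once, then apply `is_ood` to each test feature.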

[199] Robust Defense Strategies for Multimodal Contrastive Learning: Efficient Fine-tuning Against Backdoor Attacks cs.CV | cs.AI

Md. Iqbal Hossain, Afia Sajeeda, Neeresh Kumar Perla, Ming Shao

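TL;DR: The paper proposes an efficient defense against backdoor attacks on multimodal contrastive learning models such as CLIP. An image segmentation “oracle” identifies potential triggers and victim samples, and two algorithms rectify the poisoned model, improving robustness.

Details

Motivation: Multimodal deep learning models such as CLIP show great potential across applications but are vulnerable to backdoor attacks. Existing defenses typically require training from scratch or fine-tuning on large datasets and do not pinpoint the specific affected labels, so an efficient and precise defense is needed.

Result: On visual recognition benchmarks, the method improves the robustness of CLIP models against backdoor attacks, demonstrating its effectiveness.

Insight: (1) Using an “oracle” as external supervision is an efficient means of backdoor detection; (2) precise localization and fine-tuning on a compact dataset reduce the computational cost of defense; (3) the approach may extend to the defense of other multimodal models.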

Abstract: The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are not safe from adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning using a large dataset without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can identify the backdoor trigger and pinpoint the victim samples and labels in an efficient manner. To that end, an image segmentation “oracle” is introduced as the supervisor for the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) differentiating between the knowledge of CLIP and that of the oracle to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we can rectify the poisoned CLIP model to negate backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate that our strategy is effective in CLIP-based backdoor defense.

[200] Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation cs.CV

Ziyang Huang, Jiagang Chen, Jin Liu, Shunping Ji

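TL;DR: Opt3DGS proposes a two-stage optimization framework that improves 3D Gaussian Splatting (3DGS) optimization through adaptive exploration and curvature-guided exploitation, addressing entrapment in local optima and insufficient convergence quality.

Details

Motivation: 3DGS excels at novel view synthesis, but its optimization is prone to local optima and poor convergence quality, which limits performance.

Result: Experiments on multiple benchmark datasets show that Opt3DGS achieves state-of-the-art rendering quality without modifying the underlying 3DGS representation.

Insight: Staged optimization that combines global exploration with local curvature information offers an efficient approach to complex optimization problems.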

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a leading framework for novel view synthesis, yet its core optimization challenges remain underexplored. We identify two key issues in 3DGS optimization: entrapment in suboptimal local optima and insufficient convergence quality. To address these, we propose Opt3DGS, a robust framework that enhances 3DGS through a two-stage optimization process of adaptive exploration and curvature-guided exploitation. In the exploration phase, an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method enhances global search to escape local optima. In the exploitation phase, a Local Quasi-Newton Direction-guided Adam optimizer leverages curvature information for precise and efficient convergence. Extensive experiments on diverse benchmark datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization process without modifying its underlying representation.
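
The exploration phase builds on Stochastic Gradient Langevin Dynamics. A minimal sketch of one standard SGLD update is shown below; the adaptive weighting that Opt3DGS adds is not described in the abstract and is not reproduced, and the `temperature` parameter and function name are assumptions.

```python
import math
import random


def sgld_step(params, grads, step_size, temperature, rng):
    """One plain SGLD update: a gradient step plus Gaussian noise.

    The noise standard deviation sqrt(2 * step_size * temperature) is the
    usual discretization of Langevin dynamics; temperature = 0 reduces the
    update to ordinary gradient descent.
    """
    noise_std = math.sqrt(2.0 * step_size * temperature)
    return [
        p - step_size * g + noise_std * rng.gauss(0.0, 1.0)
        for p, g in zip(params, grads)
    ]
```

The injected noise is what lets the iterate leave a poor local optimum; annealing `temperature` toward zero hands over to a deterministic optimizer for the exploitation phase.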

[201] Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images cs.CV

Yinuo Xu, Yan Cui, Mingyao Li, Zhi Huang

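TL;DR: NuClass is a framework inspired by the pathologist workflow that annotates cells through multi-scale integration of nuclear morphology and tissue microenvironment context, addressing the difficulty of combining local detail with global context and the lack of high-quality annotations.

Details

Motivation: Existing tile-based methods capture nuclear morphology detail but struggle to incorporate broader tissue context, and available annotations are typically coarse-grained and unevenly distributed, which limits fine-grained cell annotation.

Result: On three fully held-out cohorts, NuClass achieves up to 96% F1 for its best-performing class, outperforming baselines.

Insight: Multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathology foundation models and reliable cell-level phenotype prediction.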

Abstract: Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell’s function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain. To address these limitations, we introduce NuClass, a pathologist workflow inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability. To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.
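
One common form of a learnable gate that balances two feature streams is a sigmoid-weighted convex combination. The sketch below assumes a single scalar gate computed from the concatenated local and global features; the abstract does not specify NuClass's gating design, so the function and parameter names are hypothetical.

```python
import math


def gated_fusion(local_feat, global_feat, gate_weights, gate_bias=0.0):
    """Blend local and global features with a scalar sigmoid gate.

    gate_weights has one weight per entry of the concatenated
    [local_feat, global_feat] vector. Returns (fused, gate), where
    fused = gate * local + (1 - gate) * global.
    """
    concat = list(local_feat) + list(global_feat)
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [gate * l + (1.0 - gate) * g for l, g in zip(local_feat, global_feat)]
    return fused, gate
```

A gate near 1 trusts the nuclear-morphology crop, while a gate near 0 leans on the surrounding tissue context.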

[202] ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement cs.CV

Xin Xu, Hao Liu, Wei Liu, Wei Wang, Jiayi Wu

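TL;DR: The paper proposes the ICLR framework, which uses a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL) to address the interaction between chrominance and luminance in low-light image enhancement and improve natural color restoration.

Details

Motivation: In low-light image enhancement, distributional differences between the chrominance and luminance branches limit complementary feature extraction, and traditional pixel-wise losses cause gradient conflicts in weakly correlated regions. The authors propose a new interaction framework to address these issues.

Result: Experiments on multiple datasets show that ICLR outperforms state-of-the-art methods in low-light image enhancement, with more natural color restoration.

Insight: (1) Dynamic interaction between chrominance and luminance is key; (2) balancing gradient conflicts can improve robustness; (3) covariance constraints are an effective optimization tool.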

Abstract: The Low-Light Image Enhancement (LLIE) task aims at improving contrast while restoring details and textures for images captured in low-light conditions. The HVI color space has made significant progress in this task by enabling precise decoupling of chrominance and luminance. However, in the interaction between the chrominance and luminance branches, the substantial distributional differences between the two that are prevalent in natural images limit complementary feature extraction, and luminance errors are propagated to chrominance channels through the nonlinear parameter. Furthermore, for interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses exploit strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework including a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information along two dimensions, fusion and enhancement. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining the covariance of the chrominance branches. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.

[203] Tissue Aware Nuclei Detection and Classification Model for Histopathology Images cs.CV

Kesi Xu, Eleni Chiou, Ali Varamesh, Laura Acqualagna, Nasir Rajpoot

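TL;DR: The paper proposes TAND, a framework conditioned on tissue masks for joint nuclei detection and classification in histopathology images. It uses point-level supervision and spatial feature-wise linear modulation, and improves recognition of tissue-dependent cell types.

Details

Motivation: Existing methods rely on detailed expert annotations and make insufficient use of tissue context, which limits accuracy. TAND aims to reduce annotation burden and strengthen tissue-context awareness.

Result: TAND surpasses comparable methods on the PUMA benchmark, with notable improvements on tissue-dependent cell types such as epithelium, endothelium, and stroma.

Insight: Tissue context matters for nuclei classification, and lightweight conditioning such as Spatial-FiLM can reduce annotation requirements and improve performance.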

Abstract: Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.
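
Feature-wise Linear Modulation (FiLM) scales and shifts features as `gamma * x + beta`. A spatial variant lets `gamma` and `beta` differ per pixel. The sketch below derives them as probability-weighted mixtures of per-tissue-class values; that mixing rule, the single-channel layout, and all names are assumptions, since the abstract does not give the multi-scale Spatial-FiLM details.

```python
def spatial_film(features, tissue_probs, gammas, betas):
    """Per-pixel FiLM on a single-channel H x W feature grid.

    features: H x W nested list of floats.
    tissue_probs: H x W nested list of per-class probability lists.
    gammas, betas: one scale and one shift per tissue class (assumed).
    """
    out = []
    for feat_row, prob_row in zip(features, tissue_probs):
        out_row = []
        for x, probs in zip(feat_row, prob_row):
            # Mix the per-class scale and shift by the tissue probabilities.
            gamma = sum(p * g for p, g in zip(probs, gammas))
            beta = sum(p * b for p, b in zip(probs, betas))
            out_row.append(gamma * x + beta)
        out.append(out_row)
    return out
```

With this form, a pixel that is confidently one tissue type receives that type's scale and shift, while ambiguous pixels receive a blend.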

[204] Alpha Divergence Losses for Biometric Verification cs.CV | cs.AI

Dimitrios Koutsianos, Ladislav Mosner, Yannis Panagakis, Themos Stafylakis

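TL;DR: The paper proposes two margin-based α-divergence losses (Q-Margin and A3M) that improve face and speaker verification, particularly at low false acceptance rates (FAR).

Details

Motivation: Margin-based softmax losses such as CosFace and ArcFace drive performance in face and speaker verification but do not induce sparse solutions. α-divergence losses can induce sparsity, but how to combine them with an angular margin is not straightforward.

Result: Experiments on IJB-B, IJB-C, and VoxCeleb show that the new methods significantly outperform baselines, especially at low FAR.

Insight: α-divergence losses can improve verification performance, and the sparse solutions they induce may strengthen robustness, which suits high-security applications.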

Abstract: Performance in face and speaker verification is largely driven by margin-based softmax losses like CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M, caused by the interplay of penalized logits and sparsity, with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount.

[205] CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding cs.CV

Shrenik Patel, Daivik Patel

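TL;DR: CacheFlow is a training-free method that combines Dynamic Token Dropping (DTD) with a compressive long-term memory, reducing computation in long-form video question answering while preserving answer accuracy.

Details

Motivation: In long-form video question answering (VQA), current vision-language models (VLMs) are inefficient because attention and key-value caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. CacheFlow aims to solve this.

Result: On offline and streaming VQA benchmarks, CacheFlow outperforms strong baselines while processing up to 87% fewer tokens.

Insight: CacheFlow's dual mechanism (dynamic pruning and compressive memory) keeps VLMs both efficient and context-aware, offering a practical solution for long-form video understanding.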

Abstract: Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one’s keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block’s full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% fewer tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
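
The token-dropping and packing steps described in the abstract can be sketched as follows. The similarity threshold of 0.9, the patch-by-patch alignment between frames, and the function names are assumptions; the recurrent key summarizer and the retrieval stage are not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def drop_redundant_tokens(prev_frame, curr_frame, threshold=0.9):
    """Keep patch tokens whose similarity to the same patch in the
    previous frame is below threshold; near-duplicates are dropped."""
    return [
        tok for prev, tok in zip(prev_frame, curr_frame)
        if cosine(prev, tok) < threshold
    ]


def pack_blocks(tokens, block_size):
    """Group surviving tokens into consecutive blocks of block_size.
    The final block may be shorter; padding is left out of this sketch."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
```

Static background patches score close to 1 against the previous frame and are dropped, so only changed regions reach the block memory.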

[206] Part-X-MLLM: Part-aware 3D Multimodal Large Language Model cs.CV

Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin

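TL;DR: Part-X-MLLM is a native 3D multimodal large language model that unifies diverse 3D tasks (such as part detection, semantic description, and editing) by encoding them as a single autoregressive sequence in a structured grammar, which then drives downstream geometry modules.

Details

Motivation: Existing interfaces for 3D multimodal tasks lack unification and structured expressiveness, making part-level generation and editing hard to support efficiently. Part-X-MLLM addresses this by combining a language model with structured output.

Result: Experiments show state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface.

Insight: Combining a structured grammar with a language model provides a natural and efficient way to control complex 3D tasks, and could extend to more geometry engines and scenarios.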

Abstract: We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/

[207] PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image cs.CV | cs.RO

Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

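TL;DR: PhysX-Anything is a simulation-ready physical 3D generative framework that, from a single in-the-wild image, produces high-quality 3D assets with explicit geometry, articulation, and physical attributes, filling a gap left by current 3D generation methods.

Details

Motivation: Most existing 3D generation methods overlook physical and articulation properties, limiting their use in embodied AI. A framework that directly generates simulation-ready 3D assets is needed.

Result: The method performs strongly on PhysX-Mobility and in-the-wild images, and the generated assets can be used directly for contact-rich robotic policy learning.

Insight: Through explicit modeling of physical attributes and an efficient geometry representation, PhysX-Anything provides a useful tool for embodied AI and physics-based simulation.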

Abstract: 3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.

[208] Distribution Matching Distillation Meets Reinforcement Learning cs.CV

Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin

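TL;DR: The paper proposes DMDR, a framework that combines reinforcement learning (RL) with Distribution Matching Distillation (DMD) to improve few-step diffusion models, in some cases beyond the multi-step teacher.

Details

Motivation: DMD distills a multi-step diffusion model into a few-step one, but the few-step model's performance is capped by the teacher. To break this limit, the authors incorporate RL to improve the few-step model's generation ability and mode coverage.

Result: Experiments show that DMDR achieves leading visual quality and prompt coherence among few-step methods and can even exceed the multi-step teacher.

Insight: RL can guide the mode coverage process more effectively, while the DMD loss serves as a more effective regularizer for RL. Combining the two unlocks the capacity of the few-step generator.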

Abstract: Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.

[209] OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation cs.CV | cs.LGPDF

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon

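TL;DR: OlmoEarth is a multimodal, spatio-temporal foundation model with a self-supervised learning formulation, masking strategy, and loss designed for Earth observation data; it performs strongly across many tasks.

Details

Motivation: Earth observation data is spatial, sequential, and highly multimodal, which makes it hard for conventional methods to model effectively. OlmoEarth aims to provide a strong foundation model for this domain.

Result: Compared with 12 other foundation models, OlmoEarth is best on 15 of 24 tasks when evaluating embeddings and on 19 of 29 tasks with full fine-tuning.

Insight: OlmoEarth shows the potential of designing foundation models for a specific domain; the released source code and pre-trained weights should help further research and applications.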

Abstract: Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings, OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world’s biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.

[210] Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting cs.CV | cs.LGPDF

Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu

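TL;DR: GS-Light is a training-free multi-view extension for text-guided relighting of 3D scenes represented with 3DGS. It fuses lighting priors with view-geometry constraints to generate high-quality relit images.

Details

Motivation: Conventional 3D scene relighting methods usually rely on complex training or on processing multiple views separately. GS-Light aims to provide a more efficient solution that better matches user expectations, using a training-free extension of a diffusion model to multi-view inputs.

Result: GS-Light performs well on indoor and outdoor scenes and improves on existing baselines in both quantitative and qualitative evaluations, including multi-view consistency, imaging quality, and user studies.

Insight: 1. Combining text prompts with geometric constraints improves the accuracy of the lighting direction. 2. A training-free extension of a diffusion model works well for multi-view tasks. 3. Fine-tuning the 3DGS scene is a key step toward high-quality 3D scene relighting.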

Abstract: We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.

[211] TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models cs.CVPDF

Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang

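TL;DR: TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video generation models, filling the gap left by existing benchmarks on higher-order reasoning. VideoTPO is a test-time optimization strategy that needs no additional training or data and significantly improves reasoning performance.

Details

Motivation: Current evaluation of video generation models focuses on visual fidelity and temporal coherence and overlooks higher-order reasoning. TiViBench aims to fill this gap and to advance research on physical plausibility and logical consistency in video generation models.

Result: Commercial models (e.g., Sora 2, Veo 3.1) show stronger reasoning potential, while the potential of open-source models remains hindered by limited training scale and data diversity. VideoTPO significantly improves reasoning performance.

Insight: Reasoning has considerable potential in video generation models, but its development is limited by data scale and diversity; test-time optimization strategies can improve performance efficiently.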

Abstract: The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3’s chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.

[212] UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity cs.CV | cs.AI | cs.LGPDF

Junwei Yu, Trevor Darrell, XuDong Wang

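TL;DR: UnSAMv2 is a self-supervised method that addresses the limited control of the Segment Anything Model (SAM) over segmentation granularity. By introducing a granularity control embedding and abundant mask-granularity pairs, it enables segmentation at any granularity without human annotations and substantially improves performance.

Details

Motivation: SAM offers limited control over segmentation granularity: users often have to adjust prompts manually or pick from pre-generated masks to reach the desired level of detail. This is time-consuming and ambiguous, since the same prompt may correspond to several plausible masks. Supervised solutions are infeasible because dense annotation is too costly. UnSAMv2 aims to solve this with self-supervised learning.

Result: Across more than 11 benchmarks, UnSAMv2 substantially improves SAM-2: NoC90 drops from 5.69 to 4.75, 1-IoU rises from 58.0 to 73.1, and AR1000 rises from 49.6 to 68.3.

Insight: Self-supervised learning with a small amount of unlabeled data can substantially unlock the potential of vision foundation models, especially for controlling segmentation granularity.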

Abstract: The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.

[213] Segment Anything Across Shots: A Method and Benchmark cs.CVPDF

Hengrui Hu, Kaining Ying, Henghui Ding

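TL;DR: The paper proposes Segment Anything Across Shots (SAAS), a method for multi-shot semi-supervised video object segmentation (MVOS), and addresses the limitations of existing methods at shot transitions with a data augmentation strategy and a new benchmark, Cut-VOS.

Details

Motivation: Existing video object segmentation (VOS) methods mainly target single-shot videos and cannot handle shot discontinuities in multi-shot scenes, which limits their practical use.

Result: Extensive experiments on YouMVOS and Cut-VOS show that SAAS performs strongly in multi-shot settings and outperforms existing methods.

Insight: By mimicking shot transitions in data augmentation and model design, the method markedly improves segmentation in multi-shot videos and fills a gap in existing research.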

Abstract: This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. The existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, thereby limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA) which enables cross-shot generalization with single-shot data to alleviate the severe annotated multi-shot data sparsity, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.

[214] Scaling Spatial Intelligence with Multimodal Foundation Models cs.CV | cs.AI | cs.LG | cs.MM | cs.ROPDF

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu

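TL;DR: The paper studies the shortcomings of multimodal foundation models in spatial intelligence and presents the SenseNova-SI model family, which improves spatial intelligence by systematically curating a high-quality dataset (SenseNova-SI-8M). Experiments show strong results on multiple benchmarks, and the paper examines key questions such as data scaling and generalization.

Details

Motivation: Current multimodal foundation models still show clear deficiencies in spatial intelligence; data at scale and a rigorous methodology are needed to improve them.

Result: SenseNova-SI performs strongly on multiple spatial intelligence benchmarks (e.g., 68.7% on VSI-Bench, 43.3% on MMSI) while maintaining strong general multimodal understanding (84.9% on MMBench-En).

Insight: Data scaling and diverse training data can markedly improve spatial intelligence, but they carry risks of overfitting and language shortcuts; a preliminary study suggests that spatial chain-of-thought reasoning has potential.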

Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.

cs.CL [Back]

[215] TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy cs.CL | cs.AIPDF

James McCammon

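TL;DR: TimeStampEval proposes a simple method that combines fuzzy matching with LLM verification, markedly improving the accuracy of non-verbatim quote search while reducing inference cost.

Details

Motivation: Traditional fuzzy matching performs poorly when aligning texts that are semantically identical but syntactically different, particularly when aligning official records with speech-to-text transcripts.

Result: The method improves fuzzy match accuracy by up to 50 percentage points, halves latency, and cuts cost per correct result by up to 96%; it remains robust on long transcripts.

Insight: Prompt design and reasoning budget strongly affect LLM performance; combining fuzzy matching with LLM verification is an efficient way to handle non-verbatim alignment.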

Abstract: Traditional fuzzy matching often fails when searching for quotes that are semantically identical but syntactically different across documents, a common issue when aligning official written records with speech-to-text transcripts. We introduce TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. Our simple two-stage method dramatically improves retrieval accuracy while cutting inference costs by over 90%. The motivating use case is an automated long-form podcast that assembles Congressional Record clips into AI-hosted narration. The technical challenge: given a sentence-timestamped transcript and a target quote that may differ due to transcription or editorial drift, return exact start and end boundaries. Standard algorithms handle verbatim text but break under fuzzier variants. Evaluating six modern LLMs on a 2,800-sentence (120k-token) transcript revealed four key findings. (1) Prompt design matters more than model choice: placing the query before the transcript and using compact formatting improved accuracy by 3-20 points while reducing token count by 30-40%. (2) Off-by-one errors form a distinct category, showing models understand the task but misplace boundaries. (3) A modest reasoning budget (600-850 tokens) raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. (4) Our “Assisted Fuzzy” approach (RapidFuzz pre-filtering followed by LLM verification on short snippets) improves fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. Extended tests on ten transcripts (50k-900k tokens, 1989-2025) confirm robustness to transcript length, vocabulary drift, and domain change, maintaining 95-100% rejection accuracy for absent targets.
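The two-stage idea in finding (4) can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: it uses the standard-library `difflib.SequenceMatcher` in place of RapidFuzz, and `toy_verifier` is a hypothetical stand-in for the LLM verification call. The `(start_ms, end_ms, text)` transcript layout and the function names are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

# Assumed layout of a sentence-timestamped transcript: (start_ms, end_ms, text).
Sentence = Tuple[int, int, str]


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefilter(quote: str, transcript: List[Sentence], top_k: int = 3) -> List[int]:
    """Stage 1: cheap fuzzy ranking; keep the indices of the top_k closest sentences."""
    ranked = sorted(
        range(len(transcript)),
        key=lambda i: similarity(quote, transcript[i][2]),
        reverse=True,
    )
    return ranked[:top_k]


def locate(
    quote: str,
    transcript: List[Sentence],
    verify: Callable[[str, str], bool],
    top_k: int = 3,
) -> Optional[Tuple[int, int]]:
    """Stage 2: run the (expensive) verifier only on the short candidate snippets.

    Returns (start_ms, end_ms) of the first verified candidate, or None when
    every candidate is rejected, i.e. the quote is treated as absent.
    """
    for i in prefilter(quote, transcript, top_k):
        start_ms, end_ms, text = transcript[i]
        if verify(quote, text):
            return (start_ms, end_ms)
    return None


def toy_verifier(quote: str, snippet: str) -> bool:
    """Hypothetical stand-in for an LLM judgment: a strict similarity threshold."""
    return similarity(quote, snippet) >= 0.8


transcript = [
    (0, 1500, "The committee will come to order."),
    (1500, 4200, "We are here today to discuss the budget resolution."),
    (4200, 6000, "I yield back the balance of my time."),
]
found = locate("We're here today to discuss the budget resolution", transcript, toy_verifier)
missing = locate("Completely unrelated text about astronomy", transcript, toy_verifier)
```

The point of the split is that the verifier only ever sees a handful of short snippets instead of the full transcript, which is where the reported latency and cost savings would plausibly come from.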

[216] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling cs.CLPDF

MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen

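TL;DR: MiroThinker v1.0 is an open-source research agent that improves performance along three dimensions: model, context, and interaction scaling. It puts particular emphasis on interaction scaling, using environment feedback to refine reasoning trajectories, and achieves efficient tool calling across multiple tasks.

Details

Motivation: Existing agents mainly scale model size or context length and overlook interaction depth. MiroThinker aims to close this gap through interaction scaling (environment feedback and multi-turn interaction) to further improve reasoning and information seeking.

Result: On four benchmarks, GAIA, HLE, BrowseComp, and BrowseComp-ZH, the 72B model reaches up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial models.

Insight: Interaction scaling shows predictable scaling behavior analogous to model size and context length, which suggests it is a key dimension for building next-generation research agents.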

Abstract: We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

[217] On the Notion that Language Models Reason cs.CL | cs.AIPDF

Bertram Højer

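TL;DR: The paper examines whether language models genuinely reason, argues that current definitions of reasoning are inconsistent with how language models are trained and generate text, and proposes that they are better described as statistical pattern matchers than as genuine reasoners.

Details

Motivation: The aim is to clarify whether language models genuinely reason, so as to avoid misunderstandings and misplaced expectations about their capabilities.

Result: The paper argues that the "reasoning" outputs of language models reflect statistical regularities rather than logical mechanisms, and come with no guarantee of logical consistency.

Insight: The work stresses the importance of accurately describing the computational processes of NLP systems, to avoid overestimating or misunderstanding what language models can do.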

Abstract: Language models (LMs) are said to be exhibiting reasoning, but what does this entail? We assess definitions of reasoning and how key papers in the field of natural language processing (NLP) use the notion and argue that the definitions provided are not consistent with how LMs are trained, process information, and generate new tokens. To illustrate this incommensurability we assume the view that transformer-based LMs implement an implicit finite-order Markov kernel mapping contexts to conditional token distributions. In this view, reasoning-like outputs correspond to statistical regularities and approximate statistical invariances in the learned kernel rather than the implementation of explicit logical mechanisms. This view is illustrative of the claim that LMs are “statistical pattern matchers” and not genuine reasoners and provides a perspective that clarifies why reasoning-like outputs arise in LMs without any guarantees of logical consistency. This distinction is fundamental to how epistemic uncertainty is evaluated in LMs. We invite a discussion on the importance of how the computational processes of the systems we build and analyze in NLP research are described.

[218] Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches cs.CLPDF

Namu Park, Giridhar Kaushik Ramachandran, Kevin Lybarger, Fei Xia, Ozlem Uzuner

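TL;DR: Using a newly annotated corpus of radiology reports, the paper compares traditional machine-learning methods (e.g., logistic regression, support vector machines) with large language models (e.g., GPT-4o, GPT-OSS-20B) on identifying follow-up imaging. LLMs with optimized prompts perform best, but traditional methods remain competitive.

Details

Motivation: Few domain-specific datasets exist for evaluating LLMs on radiology tasks, which limits their use in clinical natural language processing. The authors aim to fill this gap and, by comparing methods, to provide a reference for future work.

Result: GPT-4o (Advanced) performs best (F1 = 0.832), followed by GPT-OSS-20B (Advanced; F1 = 0.828). Traditional methods such as LR and SVM also do well (F1 = 0.776 and 0.775), showing that prompt-optimized LLMs approach human-level agreement (inter-annotator F1 = 0.846) while traditional methods remain competitive.

Insight: 1. LLMs perform well on this specific task, but prompt optimization is key. 2. Traditional machine-learning methods retain advantages in compute cost and interpretability and suit resource-constrained settings.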

Abstract: Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
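The abstract reports F1 with 95% confidence intervals from non-parametric bootstrapping but does not spell out the procedure. A common variant, the percentile bootstrap over reports, can be sketched as below. The function names, the resample count, and the toy labels are assumptions for illustration; the paper's exact resampling unit and settings are not stated in the abstract.

```python
import random
from typing import Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for labels in {0, 1}; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) via the percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping label/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * (n_resamples - 1))]
    upper = scores[int((1 - alpha / 2) * (n_resamples - 1))]
    return f1_score(y_true, y_pred), lower, upper


# Toy labels (not from the paper): 4 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]
point, lower, upper = bootstrap_f1_ci(y_true, y_pred, n_resamples=500)
```

With intervals like these, the small gap between GPT-4o and GPT-OSS-20B can be judged against sampling noise rather than read off the point estimates alone.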

[219] MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers cs.CLPDF

Fernanda Bufon Färber, Iago Alves Brito, Julia Soares Dollis, Pedro Schindler Freire Brasil Ribeiro, Rafael Teixeira Sousa

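TL;DR: MedPT is a large-scale medical question-answering dataset for Brazilian Portuguese, with more than 380,000 real patient-doctor question-answer pairs, refined through multi-stage curation and augmented with LLM-driven annotation. On a medical specialty classification task, a fine-tuned model reaches a 94% F1-score, reflecting the dataset's semantic depth and cultural specificity.

Details

Motivation: LLM development focuses on high-resource languages, and simple translation fails to capture unique clinical and cultural nuances (such as endemic diseases), which limits the applicability of LLMs in other languages. The authors introduce MedPT to address this.

Result: On a 20-class medical specialty classification task, a fine-tuned 1.7B-parameter model reaches a 94% F1-score. Qualitative error analysis shows that misclassifications reflect genuine clinical ambiguities, which is evidence of the dataset's semantic depth.

Insight: Culturally specific medical datasets such as MedPT are essential for developing equitable, accurate, and culturally aware medical technologies. LLM-driven annotation also offers a workable way to handle large-scale datasets.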

Abstract: While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages, creating a critical barrier for others as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus for Brazilian Portuguese, comprising 384,095 authentic question-answer pairs from patient-doctor interactions. The dataset underwent a meticulous multi-stage curation protocol, using a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries. We further augmented the corpus via LLM-driven annotation, classifying questions into seven semantic types to capture user intent. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, like the natural asymmetry in patient-doctor communication. To validate its utility, we benchmark a medical specialty routing task: fine-tuning a 1.7B parameter model achieves an outstanding 94% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset’s deep semantic richness. We publicly release MedPT to foster the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.

[220] ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts cs.CL | cs.LGPDF

Karthikeyan K, Raghuveer Thirukovalluru, David Carlson

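TL;DR: ClinStructor is an LLM-based pipeline that converts clinical free text into structured question-answer pairs, improving transparency and controllability with only a modest drop in predictive performance.

Details

Motivation: The rich information in clinical notes is usually unstructured, which leads to bias, poor generalization, and low interpretability.

Result: On ICU mortality prediction, performance drops by only 2-3% in AUC, while model transparency and controllability improve substantially.

Insight: Structuring unstructured clinical text is an important foundation for building reliable, interpretable, and generalizable clinical models.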

Abstract: Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences), and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question-answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2-3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

[221] Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support cs.CLPDF

Eric Hua Qing Zhang, Julia Ive

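TL;DR: The paper optimizes GPT-2 with a multi-component reinforcement learning approach, improving the context and emotion awareness of therapeutic dialogue generation and supporting the effectiveness of reinforcement learning for healthcare dialogue systems.

Details

Motivation: COVID-19 has made mental health services harder to access, and pre-trained language models lack the contextual and emotional awareness needed to generate appropriate therapeutic responses.

Result: Reinforcement learning improves on baseline GPT-2 across BLEU, ROUGE, and METEOR, and emotion accuracy reaches 99.34% versus 66.96% for baseline GPT-2.

Insight: Reinforcement learning can effectively improve language models on specific tasks such as therapeutic dialogue, but human clinical oversight is still needed to ensure safety and professionalism.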

Abstract: Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessary for appropriate therapeutic responses. This paper investigated the application of supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance GPT-2’s capacity for therapeutic dialogue generation. The methodology restructured input formats to enable simultaneous processing of contextual information and emotional states alongside user input, employing a multi-component reward function that aligned model outputs with professional therapist responses and annotated emotions. Results demonstrated improvements through reinforcement learning over baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581). LLM evaluation confirmed high contextual relevance and professionalism, while reinforcement learning achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2. These findings demonstrate reinforcement learning’s effectiveness in developing therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.

[222] Additive Large Language Models for Semi-Structured Text cs.CL | cs.LGPDF

Karthikeyan K, Raghuveer Thirukovalluru, David Carlson

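TL;DR: The paper proposes CALM, a framework that addresses the opacity of large language models in clinical text classification. CALM uses an additive model to classify semi-structured text interpretably, with performance comparable to conventional LLM classifiers and better trust and clinical meaningfulness.

Details

Motivation: LLMs perform well at clinical text classification, but their opaque predictions hinder adoption in research and clinical practice, where investigators and physicians need to know which parts of a patient's record drive risk signals.

Result: CALM matches conventional LLM classifiers in performance while offering better interpretability, trust, and clinical meaningfulness, and it supports quality-assurance checks and auditing during model development.

Insight: CALM's additive structure gives clear explanations and also enables intuitive visualizations for clinical use, improving the model's practicality and credibility in healthcare.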

Abstract: Large Language Models have advanced clinical text classification, but their opaque predictions remain a critical barrier to practical adoption in research and clinical settings where investigators and physicians need to understand which parts of a patient’s record drive risk signals. To address this challenge, we introduce CALM, short for Classification with Additive Large Language Models, an interpretable framework for semi-structured text where inputs are composed of semantically meaningful components, such as sections of an admission note or question-answer fields from an intake form. CALM predicts outcomes as the additive sum of each component’s contribution, making these contributions part of the forward computation itself and enabling faithful explanations at both the patient and population level. The additive structure also enables clear visualizations, such as component-level risk curves similar to those used in generalized additive models, making the learned relationships easier to inspect and communicate. Although CALM expects semi-structured inputs, many clinical documents already have this form, and similar structure can often be automatically extracted from free-text notes. CALM achieves performance comparable to conventional LLM classifiers while improving trust, supporting quality-assurance checks, and revealing clinically meaningful patterns during model development and auditing.
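The additive structure described in the abstract can be illustrated with a toy sketch: each component of a semi-structured note gets its own scalar contribution, and the prediction is a sigmoid of their sum plus a bias. In CALM the per-component contribution would come from a language model; here `toy_scorer` is a hypothetical keyword-based stand-in, and the keyword weights, component names, and bias are invented for illustration only.

```python
import math
from typing import Callable, Dict, Tuple


def additive_predict(
    components: Dict[str, str],
    scorer: Callable[[str, str], float],
    bias: float = 0.0,
) -> Tuple[float, Dict[str, float]]:
    """Score each component independently, then sum the contributions into one logit.

    Because the logit is a plain sum, the returned per-component contributions
    are exactly the terms of the forward computation, not a post-hoc attribution.
    """
    contributions = {name: scorer(name, text) for name, text in components.items()}
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


def toy_scorer(name: str, text: str) -> float:
    """Hypothetical stand-in for an LLM-derived contribution: keyword weights."""
    weights = {"sepsis": 1.2, "ventilator": 0.9, "stable": -0.8}
    lowered = text.lower()
    return sum(w for keyword, w in weights.items() if keyword in lowered)


note = {
    "history": "Admitted with sepsis.",
    "plan": "Patient stable on room air.",
}
risk, parts = additive_predict(note, toy_scorer, bias=-0.5)
```

Inspecting `parts` shows which section raised or lowered the risk, which is the kind of component-level explanation the abstract refers to.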

[223] InData: Towards Secure Multi-Step, Tool-Based Data Analysis cs.CL | cs.LGPDF

Karthikeyan K, Raghuveer Thirukovalluru, Bhuwan Dhingra, David Edwin Carlson

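TL;DR: The paper proposes a secure approach to multi-step, tool-based data analysis: LLMs are barred from generating code or accessing data directly and must interact through a predefined set of verified tools. The authors also introduce the InData dataset to evaluate LLMs' multi-step tool-based reasoning and find that current models fall short on complex tasks.

Details

Motivation: LLM agents for data analysis typically generate code and execute it directly on databases, which poses significant security risks for sensitive data. The authors restrict direct data access and require the use of secure, predefined tools instead.

Result: Large models (e.g., gpt-oss-120b) do well on Easy tasks (97.3% accuracy), but accuracy drops sharply on Hard tasks (69.6%), showing that current LLMs still lack robust multi-step tool-based reasoning.

Insight: The paper exposes the limits of current LLMs on complex multi-step tool-based reasoning and points to directions for future improvement.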

Abstract: Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To reduce this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs’ multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels (Easy, Medium, and Hard), capturing increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.
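The restriction described above, where the model never touches rows directly and can only compose calls to verified tools, might look roughly like the sketch below. The `ToolBox` class, the tool names, and the handle scheme are all hypothetical and are not taken from the InData release; the sketch only illustrates the access pattern.

```python
from typing import Any, Callable, Dict, List


class ToolBox:
    """Keeps the rows private and exposes only a fixed whitelist of operations.

    Tools that produce intermediate tables return an opaque handle instead of
    the rows themselves, so a multi-step plan is a chain of tool calls.
    """

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._views: Dict[str, List[Dict[str, Any]]] = {"base": list(rows)}
        self._tools: Dict[str, Callable[..., Any]] = {
            "filter_equals": self._filter_equals,
            "count": self._count,
            "mean": self._mean,
        }

    def call(self, tool: str, **kwargs: Any) -> Any:
        """Single entry point for the agent; anything off the whitelist is refused."""
        if tool not in self._tools:
            raise PermissionError(f"tool not allowed: {tool}")
        return self._tools[tool](**kwargs)

    def _filter_equals(self, view: str, column: str, value: Any) -> str:
        rows = [r for r in self._views[view] if r[column] == value]
        handle = f"view{len(self._views)}"
        self._views[handle] = rows
        return handle  # the agent receives a handle, never the rows

    def _count(self, view: str) -> int:
        return len(self._views[view])

    def _mean(self, view: str, column: str) -> float:
        values = [r[column] for r in self._views[view]]
        if not values:
            raise ValueError("cannot average an empty view")
        return sum(values) / len(values)


# A two-step question: "What is the mean age of ICU patients?"
box = ToolBox([
    {"ward": "ICU", "age": 70},
    {"ward": "ICU", "age": 50},
    {"ward": "ER", "age": 30},
])
icu = box.call("filter_equals", view="base", column="ward", value="ICU")
icu_count = box.call("count", view=icu)
icu_mean_age = box.call("mean", view=icu, column="age")
```

Harder questions in this setting require longer chains of such calls, which matches the abstract's observation that accuracy falls as reasoning complexity grows.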

[224] Improving LLM’s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization cs.CL | cs.LGPDF

Hadi Sheikhi, Chenyang Huang, Osmar R. Zaïane

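TL;DR: The paper proposes an entity anonymization technique that increases the reliance of large language models (LLMs) on external knowledge in knowledge graph-based dialogue generation, and validates it experimentally.

Details

Motivation: Although LLMs do well on many NLP tasks, in knowledge graph-based dialogue generation (KG-DG) they tend to rely on internal knowledge rather than the external knowledge graph, so generated content drifts away from the provided information.

Result: Experiments on the OpenDialKG dataset show that the method improves LLMs' use of external knowledge.

Insight: Entity anonymization is a simple but effective strategy that reduces LLMs' reliance on internal knowledge and strengthens their integration of external information.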

Abstract: Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from provided knowledge graphs, even when they are given a flawlessly retrieved knowledge graph. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs’ attachment to external knowledge.
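The abstract does not describe the anonymization scheme in detail. One plausible reading is that entity names in the knowledge graph and dialogue are swapped for neutral placeholders before prompting, so the model cannot fall back on what it memorized about them, and the placeholders are mapped back afterwards. The sketch below follows that reading; the `ENT_n` placeholder format and all function names are assumptions, not the authors' method.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_mapping(triples: List[Triple]) -> Dict[str, str]:
    """Assign each distinct entity a placeholder in order of first appearance."""
    mapping: Dict[str, str] = {}
    for head, _relation, tail in triples:
        for entity in (head, tail):
            if entity not in mapping:
                mapping[entity] = f"ENT_{len(mapping)}"
    return mapping


def anonymize(text: str, mapping: Dict[str, str]) -> str:
    """Replace entity names with placeholders in a single pass.

    Longer names come first in the alternation so that a long name is not
    partially replaced by a shorter name it contains.
    """
    if not mapping:
        return text
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def deanonymize(text: str, mapping: Dict[str, str]) -> str:
    """Map placeholders in a generated response back to the original names."""
    inverse = {placeholder: name for name, placeholder in mapping.items()}
    return re.sub(r"ENT_\d+", lambda m: inverse.get(m.group(0), m.group(0)), text)


triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Interstellar"),
]
mapping = build_mapping(triples)
masked_question = anonymize("Who directed Inception?", mapping)
restored_answer = deanonymize("ENT_1 directed it, and also ENT_2.", mapping)
```

With names masked, a correct answer has to be read off the provided triples, which is the behavior the attachment metric is meant to reward.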

[225] On the Entropy Calibration of Language Models cs.CL | cs.AI | cs.LG | stat.MLPDF

Steven Cao, Gregory Valiant, Percy Liang

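TL;DR: The paper studies entropy calibration in language models, finds that calibration improves very little with model scale (larger models accumulate error at a similar rate), and shows that calibration without a log-loss penalty is possible in theory.

Details

Motivation: To study entropy calibration in language models, in particular how error accumulation affects the quality of generated text.

Result: Larger models are miscalibrated in much the same way as smaller ones and accumulate error at a similar rate.

Insight: Miscalibration is unlikely to be fixed simply by scaling up models; more sophisticated calibration methods are needed.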

Abstract: We study the problem of entropy calibration, which asks whether a language model’s entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution – in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
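The quantities in this abstract can be made concrete with a one-step toy example. The paper works with entropy over whole generations; the sketch below collapses that to a single categorical distribution, and the probabilities are invented. It compares the model's entropy with its log loss (cross-entropy) on the "human" distribution and shows how top-k truncation lowers entropy while raising log loss.

```python
import math
from typing import List


def entropy(q: List[float]) -> float:
    """Shannon entropy of the model distribution q, in nats."""
    return -sum(x * math.log(x) for x in q if x > 0)


def log_loss(p: List[float], q: List[float]) -> float:
    """Expected log loss of model q on data drawn from p (cross-entropy, in nats).

    Infinite when q gives zero probability to an outcome that p can produce.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i == 0:
                return math.inf
            total -= p_i * math.log(q_i)
    return total


def truncate_top_k(q: List[float], k: int) -> List[float]:
    """Keep the k most probable outcomes and renormalize; the rest get zero mass."""
    keep = sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k]
    mass = sum(q[i] for i in keep)
    return [q[i] / mass if i in keep else 0.0 for i in range(len(q))]


human = [0.7, 0.2, 0.1]  # toy "human text" distribution (assumed)
model = [0.5, 0.3, 0.2]  # toy model distribution (assumed)

model_entropy = entropy(model)        # entropy of what the model generates
model_loss = log_loss(human, model)   # log loss of the model on human text
truncated = truncate_top_k(model, k=2)
truncated_entropy = entropy(truncated)
truncated_loss = log_loss(human, truncated)
```

In this toy case the model's entropy exceeds its log loss, so it is miscalibrated in the direction the abstract describes. Truncation brings the entropy down, but the truncated model assigns zero probability to an outcome that humans do produce, so its log loss becomes infinite: an extreme version of the tradeoff that the paper asks whether one can avoid.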

[226] A Reasoning Paradigm for Named Entity Recognition cs.CL

Hui Huang, Yanping Chen, Ruizhang Huang, Chuan Lin, Yongbin Qin

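TL;DR: The paper proposes ReasoningNER, a reasoning paradigm that shifts named entity recognition (NER) from implicit pattern matching to explicit reasoning. By generating task-relevant chains of thought (CoT) and optimizing the reasoning process, it substantially improves performance in zero-shot and low-resource settings.

Details

Motivation: Existing generative large language models (LLMs) rely on implicit semantic pattern matching for NER and lack a verifiable reasoning mechanism, which leads to poor performance in zero-shot and low-resource settings. The paper proposes an explicit reasoning framework to fill this gap.

Result: ReasoningNER achieves SOTA performance in the zero-shot setting, outperforming GPT-4 by 12.3 percentage points in F1, and demonstrates strong cognitive ability and generalization.

Insight: Explicit reasoning can substantially improve the interpretability and generalization of NER, with particular promise in resource-scarce settings.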

Abstract: Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This “cognitive shortcutting” leads to suboptimal performance and brittle generalization, especially in zero-shot and low-resource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, which contains task-relevant reasoning chains. Then, these are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal. This stage ensures explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on the F1 score. Analytical results also demonstrate its great potential to advance research in reasoning-oriented information extraction. Our code is available at https://github.com/HuiResearch/ReasoningIE.

[227] Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations cs.CL | cs.HC

Eunkyu Park, Wesley Hanwen Deng, Vasudha Varadarajan, Mingxi Yan, Gunhee Kim

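TL;DR: The paper studies the role of Chain-of-Thought (CoT) explanations in transparency and their potential to mislead. It finds that users tend to overlook reasoning errors when outcomes look plausible, and that delivery tone further affects error detection.

Details

Motivation: Explanations are promoted as tools for transparency, but they can also foster confirmation bias: users may assume the reasoning is correct because the output looks plausible. The paper examines this double-edged role of CoT explanations in moral scenarios.

Result: Users often equate trust with outcome agreement, sustaining reliance even when the reasoning is flawed; a confident tone suppresses error detection while maintaining reliance.

Insight: CoT explanations are both a transparency tool and a source of misdirection; NLP systems should provide explanations that encourage critical scrutiny rather than blind trust.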

Abstract: Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.

[228] CURE: Cultural Understanding and Reasoning Evaluation - A Framework for “Thick” Culture Alignment Evaluation in LLMs cs.CL | cs.HC

Truong Vo, Sanmi Koyejo

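TL;DR: The paper introduces the CURE framework for evaluating the cultural understanding and reasoning abilities of large language models (LLMs) in culturally diverse settings, addressing shortcomings of existing evaluation methods.

Details

Motivation: Existing evaluations of LLM cultural competence are oversimplified: they focus on de-contextualized correctness or forced-choice judgments and fail to assess cultural understanding and reasoning in realistic situations.

Result: Thin (simplified) evaluation overestimates models' cultural competence and yields unstable results, whereas the CURE framework exposes differences in reasoning depth, reduces variance, and provides more stable signals.

Insight: Evaluating cultural competence requires contextualized, multi-dimensional metrics; simplified methods can mask a model's true performance.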

Abstract: Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Existing methods focus on de-contextualized correctness or forced-choice judgments, overlooking the cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of a model’s response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.

[229] Exploring Parameter-Efficient Fine-Tuning and Backtranslation for the WMT 25 General Translation Task cs.CL

Felipe Fujita, Hideyuki Takada

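TL;DR: The paper explores combining backtranslation with parameter-efficient fine-tuning to improve neural machine translation on a small Japanese corpus. The combined approach outperforms either technique used alone.

Details

Motivation: To explore how combining backtranslation and fine-tuning can improve machine translation quality with a small Japanese corpus, particularly for low-resource language pairs.

Result: Backtranslation raises COMET from 0.460 to 0.468; fine-tuning raises it to 0.589; combining the two reaches 0.597.

Insight: Even with limited data, the combined use of backtranslation and targeted fine-tuning can markedly improve translation quality, offering an effective strategy for low-resource language pairs.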

Abstract: In this paper, we explore the effectiveness of combining fine-tuning and backtranslation on a small Japanese corpus for neural machine translation. Starting from a baseline English→Japanese model (COMET = 0.460), we first apply backtranslation (BT) using synthetic data generated from monolingual Japanese corpora, yielding a modest increase (COMET = 0.468). Next, we fine-tune (FT) the model on a genuine small parallel dataset drawn from diverse Japanese news and literary corpora, achieving a substantial jump to COMET = 0.589 when using Mistral 7B. Finally, we integrate both backtranslation and fine-tuning, first augmenting the small dataset with BT-generated examples and then adapting via FT, which further boosts performance to COMET = 0.597. These results demonstrate that, even with limited training data, the synergistic use of backtranslation and targeted fine-tuning on Japanese corpora can significantly enhance translation quality, outperforming each technique in isolation. This approach offers a lightweight yet powerful strategy for improving low-resource language pairs.
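The reported scores can be put side by side with a few lines of arithmetic. The sketch below uses only the four COMET values quoted in the abstract; the dictionary keys and the function name `gains_over_baseline` are illustrative.

```python
# COMET scores as reported in the abstract.
scores = {"baseline": 0.460, "BT": 0.468, "FT": 0.589, "BT+FT": 0.597}

def gains_over_baseline(scores, baseline_key="baseline"):
    """Return {setup: (absolute gain, relative gain in percent)} vs. the baseline."""
    base = scores[baseline_key]
    return {
        name: (round(value - base, 3), round(100 * (value - base) / base, 1))
        for name, value in scores.items()
        if name != baseline_key
    }

for name, (absolute, relative) in gains_over_baseline(scores).items():
    print(f"{name}: +{absolute:.3f} COMET ({relative:.1f}% relative)")
```

By these numbers, fine-tuning accounts for most of the improvement (+0.129), while backtranslation adds about +0.008 whether applied to the baseline or on top of fine-tuning.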

[230] LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models cs.CL | cs.AI

Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert

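TL;DR: The paper proposes LLMLagBench, a benchmark for systematically identifying the temporal boundaries of LLM training data, so that reliance on outdated information does not go unnoticed and hurt accuracy in reasoning tasks.

Details

Motivation: LLMs are trained up to a specific temporal cutoff and cannot know about later events. When this limitation is unknown or ignored, a model may unknowingly blend outdated time-sensitive information with general knowledge, compromising the accuracy of its reasoning.

Result: LLMLagBench can effectively identify the temporal boundaries of LLM training data, helping characterize the limits of a model's knowledge freshness.

Insight: The temporal knowledge boundary of an LLM can affect its reasoning; systematic evaluation tools are needed to ensure that information is current and reliable.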

Abstract: Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM’s training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.
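One simple way to turn per-period accuracy on recent-event questions into a boundary estimate is a single change-point fit. The sketch below is an illustration under assumptions: the abstract does not describe the benchmark's actual estimation procedure, and the function names and toy accuracy values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def estimate_cutoff(periods, accuracy):
    """Return the first period of the low-knowledge segment.

    Fits a two-level step function to the accuracy series by trying every
    split point and keeping the one with the smallest squared error.
    """
    if len(periods) != len(accuracy) or len(accuracy) < 2:
        raise ValueError("need two aligned series with at least two points")
    best_k = min(
        range(1, len(accuracy)),
        key=lambda k: sse(accuracy[:k]) + sse(accuracy[k:]),
    )
    return periods[best_k]

# Hypothetical accuracy on questions about events from each month.
periods = ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"]
accuracy = [0.90, 0.85, 0.88, 0.20, 0.10, 0.15]
print(estimate_cutoff(periods, accuracy))
```

A step fit like this assumes a sharp drop in knowledge; real accuracy curves may decay gradually, in which case a more careful model would be needed.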

[231] PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection cs.CL

Bingbing Wang, Zhixin Bai, Zhengda Jin, Zihan Wang, Xintong Song

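TL;DR: To address pseudo-multimodality and user homogeneity in Multimodal Conversational Stance Detection (MCSD), the paper introduces U-MStance, the first user-centric dataset, and the PRISM model. PRISM builds personalized features from users' historical behavior and combines multimodal alignment with a mutual task reinforcement mechanism, yielding significant gains in stance detection.

Details

Motivation: Existing MCSD research suffers from pseudo-multimodality (visual cues appear only in source posts) and user homogeneity, neglecting how personal traits shape stance expression.

Result: On the U-MStance dataset, PRISM significantly outperforms baseline models, validating the effectiveness of user-centric modeling and multimodal alignment.

Insight: User personas and multimodal context alignment are crucial for stance detection; a mutual task reinforcement mechanism further improves performance.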

Abstract: The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users’ attitudes toward specific targets within complex discussions. However, existing studies remain limited by: 1) pseudo-multimodality, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and 2) user homogeneity, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce U-MStance, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose PRISM, a Persona-Reasoned multImodal Stance Model for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.

[232] AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing cs.CL

Qingyu Zhang, Chunlei Xin, Xuanang Chen, Yaojie Lu, Hongyu Lin

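TL;DR: The paper proposes the AI-Salesman framework to address the strategic brittleness and factual hallucination of LLMs in goal-driven persuasive dialogue such as telemarketing. It improves performance with a new dataset, TeleSalesCorpus, and a dual-stage architecture (Bayesian-supervised reinforcement learning and a Dynamic Outline-Guided Agent).

Details

Motivation: Goal-driven persuasive dialogue such as telemarketing requires sophisticated multi-turn planning and strict factual faithfulness, which even state-of-the-art LLMs struggle with. The main challenges are the lack of task-specific data and the strategic brittleness and factual hallucination of directly applied LLMs.

Result: AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, demonstrating its effectiveness in complex persuasive scenarios.

Insight: 1. Task-specific datasets such as TeleSalesCorpus are crucial for LLM performance in a specific domain; 2. Bayesian-supervised reinforcement learning can mitigate the impact of noisy data; 3. Dynamic strategic guidance (DOGA) markedly improves LLM robustness in multi-turn dialogue.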

Abstract: Goal-driven persuasive dialogue, exemplified by applications like telemarketing, requires sophisticated multi-turn planning and strict factual faithfulness, which remains a significant challenge for even state-of-the-art Large Language Models (LLMs). A lack of task-specific data often limits previous works, and direct LLM application suffers from strategic brittleness and factual hallucination. In this paper, we first construct and release TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain. We then propose AI-Salesman, a novel framework featuring a dual-stage architecture. For the training stage, we design a Bayesian-supervised reinforcement learning algorithm that learns robust sales strategies from noisy dialogues. For the inference stage, we introduce the Dynamic Outline-Guided Agent (DOGA), which leverages a pre-built script library to provide dynamic, turn-by-turn strategic guidance. Moreover, we design a comprehensive evaluation framework that combines fine-grained metrics for key sales skills with the LLM-as-a-Judge paradigm. Experimental results demonstrate that our proposed AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, showcasing its effectiveness in complex persuasive scenarios.

[233] Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding cs.CL | cs.CV

Pinxue Guo, Chongruo Wu, Xinyu Zhou, Lingyi Hong, Zhaoyu Chen

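TL;DR: The paper proposes VBackChecker, a novel reference-free hallucination detection framework that checks the consistency of MLLM-generated responses with visual inputs through backward visual grounding. It uses a pixel-level Grounding LLM to improve interpretability and performance, and achieves SOTA on the new benchmark R^2-HalBench.

Details

Motivation: Multimodal large language models (MLLMs) are powerful on cross-modal tasks but suffer from severe hallucinations. Improving their reliability calls for a reference-free hallucination detection method.

Result: 1. VBackChecker surpasses existing methods in hallucination detection and rivals GPT-4o; 2. It improves by over 10% on the pixel-level grounding task.

Insight: Backward visual grounding improves both the interpretability and the performance of hallucination detection; the high-quality data generation pipeline (R-Instruct) and the benchmark (R^2-HalBench) are key to its success.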

Abstract: Multimodal Large Language Models (MLLMs) have unlocked powerful cross-modal capabilities, but still significantly suffer from hallucinations. As such, accurate detection of hallucinations in MLLMs is imperative for ensuring their reliability in practical applications. To this end, guided by the principle of “Seeing is Believing”, we introduce VBackChecker, a novel reference-free hallucination detection framework that verifies the consistency of MLLM-generated responses with visual inputs, by leveraging a pixel-level Grounding LLM equipped with reasoning and referring segmentation capabilities. This reference-free framework not only effectively handles rich-context scenarios, but also offers interpretability. To facilitate this, an innovative pipeline is accordingly designed for generating instruction-tuning data (R-Instruct), featuring rich-context descriptions, grounding masks, and hard negative samples. We further establish R^2-HalBench, a new hallucination benchmark for MLLMs, which, unlike previous benchmarks, encompasses real-world, rich-context descriptions from 18 MLLMs with high-quality annotations, spanning diverse object-, attribute-, and relationship-level details. VBackChecker outperforms prior complex frameworks and achieves state-of-the-art performance on R^2-HalBench, even rivaling GPT-4o’s capabilities in hallucination detection. It also surpasses prior methods in the pixel-level grounding task, achieving over a 10% improvement. All code, data, and models are available at https://github.com/PinxueGuo/VBackChecker.

[234] CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic cs.CL

Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang

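TL;DR: CriticSearch is a fine-grained credit-assignment framework that gives search agents dense, turn-level feedback through a retrospective critic mechanism, addressing the training instability caused by sparse rewards.

Details

Motivation: Existing search agent pipelines rely on reinforcement-learning-based optimization, where sparse outcome rewards lead to inefficient exploration and unstable training. CriticSearch addresses this with dense feedback.

Result: On multi-hop reasoning benchmarks, CriticSearch outperforms existing baselines, with faster convergence, improved training stability, and higher performance.

Insight: Dense turn-level feedback can mitigate the sparse-reward problem, improving search agents' exploration efficiency and training stability.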

Abstract: Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.
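The difference between a sparse outcome reward and dense turn-level rewards can be sketched as follows. This is a toy illustration only: the boolean critic judgments, the `turn_weight` parameter, and the way the two signals are combined are assumptions, since the abstract does not specify how critic assessments are converted into rewards.

```python
def sparse_rewards(num_turns, outcome_correct):
    """Outcome-only reward: every turn gets 0 except the final one."""
    rewards = [0.0] * num_turns
    rewards[-1] = 1.0 if outcome_correct else 0.0
    return rewards

def dense_rewards(turn_judgments, outcome_correct, turn_weight=0.5):
    """Turn-level rewards from a (hypothetical) retrospective critic.

    turn_judgments: one boolean per turn, True if the critic judged the
    turn helpful in hindsight. Each turn gets +/- turn_weight, and the
    outcome reward is added to the final turn.
    """
    rewards = [turn_weight if ok else -turn_weight for ok in turn_judgments]
    rewards[-1] += 1.0 if outcome_correct else 0.0
    return rewards

# A three-turn search trajectory whose second query was judged unhelpful.
print(sparse_rewards(3, outcome_correct=True))
print(dense_rewards([True, False, True], outcome_correct=True))
```

With the sparse signal, the unhelpful second turn is indistinguishable from the first; the dense signal assigns it a negative reward, which is the kind of credit assignment the abstract refers to.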

[235] MME-RAG: Multi-Manager-Expert Retrieval-Augmented Generation for Fine-Grained Entity Recognition in Task-Oriented Dialogues cs.CL | cs.AI

Liang Xue, Haoyu Liu, Yajun Tian, Xinyu Zhong, Yang Liu

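TL;DR: MME-RAG is a Multi-Manager-Expert Retrieval-Augmented Generation framework that decomposes fine-grained entity recognition into two stages, type-level judgment and span-level extraction. It combines lightweight managers with specialized experts to achieve domain adaptation without additional training.

Details

Motivation: Current large language models face challenges in domain adaptation and retrieval controllability. Fine-grained entity recognition is increasingly important in task-oriented dialogue, and existing methods fall short.

Result: Experiments on CrossNER, MIT-Movie, and other datasets show that MME-RAG outperforms baselines in most domains; ablation studies confirm the key role of the hierarchical structure and KeyInfo retrieval.

Insight: Hierarchical decomposition and semantically aligned retrieval are central to cross-domain generalization and robustness; MME-RAG offers a scalable and interpretable solution for task-oriented dialogue.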

Abstract: Fine-grained entity recognition is crucial for reasoning and decision-making in task-oriented dialogues, yet current large language models (LLMs) continue to face challenges in domain adaptation and retrieval controllability. We introduce MME-RAG, a Multi-Manager-Expert Retrieval-Augmented Generation framework that decomposes entity recognition into two coordinated stages: type-level judgment by lightweight managers and span-level extraction by specialized experts. Each expert is supported by a KeyInfo retriever that injects semantically aligned, few-shot exemplars during inference, enabling precise and domain-adaptive extraction without additional training. Experiments on CrossNER, MIT-Movie, MIT-Restaurant, and our newly constructed multi-domain customer-service dataset demonstrate that MME-RAG performs better than recent baselines in most domains. Ablation studies further show that both the hierarchical decomposition and KeyInfo-guided retrieval are key drivers of robustness and cross-domain generalization, establishing MME-RAG as a scalable and interpretable solution for adaptive dialogue understanding.

[236] Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing cs.CL | cs.AI

Mengying Wang, Chenhui Ma, Ao Jiao, Tuo Liang, Pengjun Lu

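TL;DR: The paper proposes SerenQA, a framework for evaluating the ability of large language models (LLMs) to uncover unexpected insights in knowledge graph question answering (KGQA), with a focus on drug repurposing.

Details

Motivation: Existing KGQA systems are typically optimized to return highly relevant but predictable answers and lack the capacity to surface surprising, novel ones. The paper aims to fill this gap and evaluate the potential of LLMs on scientific KGQA tasks.

Result: Experiments show that although modern LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable insights, leaving room for future improvement.

Insight: LLMs show promise in KGQA but still need to improve at serendipitous discovery, especially in exploratory tasks in scientific domains.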

Abstract: Large Language Models (LLMs) have greatly advanced knowledge graph question answering (KGQA), yet existing systems are typically optimized for returning highly relevant but predictable answers. A missing yet desired capacity is to exploit LLMs to suggest surprising and novel (“serendipitous”) answers. In this paper, we formally define the serendipity-aware KGQA task and propose the SerenQA framework to evaluate LLMs’ ability to uncover unexpected insights in scientific KGQA tasks. SerenQA includes a rigorous serendipity metric based on relevance, novelty, and surprise, along with an expert-annotated benchmark derived from the Clinical Knowledge Graph, focused on drug repurposing. Additionally, it features a structured evaluation pipeline encompassing three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Our experiments reveal that while state-of-the-art LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable discoveries, underscoring significant room for future improvement. Our curated resources and extended version are released at: https://cwru-db-group.github.io/serenQA.

[237] SGuard-v1: Safety Guardrail for Large Language Models cs.CL | cs.AI | cs.CR

JoonHo Lee, HyeonMin Cho, Jaewoong Yun, Hyunjae Lee, JunKyu Lee

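TL;DR: SGuard-v1 is a lightweight safety guardrail for large language models (LLMs) with two components, ContentFilter and JailbreakFilter, which detect harmful content and adversarial prompts respectively.

Details

Motivation: As LLMs are widely deployed, their safety and their defense against adversarial prompts have become important concerns. SGuard-v1 aims to provide a lightweight, efficient solution that reduces deployment overhead and improves safety.

Result: SGuard-v1 performs strongly on public and proprietary safety benchmarks while remaining lightweight, reducing deployment overhead.

Insight: A lightweight design and multilingual support make SGuard-v1 easy to deploy, while its multi-class predictions and confidence scores give downstream applications greater transparency.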

Abstract: We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two components according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.

[238] QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs cs.CL

Maria Tseytlin, Paul Roit, Omri Abend, Ido Dagan, Ayal Klein

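TL;DR: QA-Noun is a question-answer-based framework focused on noun-centered semantic relations. It defines nine question templates that capture explicit and implicit roles of nouns, and combines with QA-SRL for fine-grained decomposition of sentence meaning.

Details

Motivation: QA-based semantic approaches handle predicate-argument relations well, but noun-centered semantic relations remain understudied. QA-Noun aims to fill this gap and provide a more complete semantic decomposition.

Result: QA-Noun achieves near-complete coverage of AMR's noun arguments and surfaces additional implied relations. Combined with QA-SRL, it yields over 130% higher granularity than FactScore and DecompScore.

Insight: QA-Noun extends the coverage of the QA-based semantic framework, providing a finer-grained and scalable semantic decomposition for cross-text alignment.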

Abstract: Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained, facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR’s noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.

[239] TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction cs.CL

Jie Zhang, Bo Tang, Wanzi Shao, Wenqiang Wei, Jihao Zhao

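TL;DR: TAdaRAG is a novel RAG framework that constructs task-adaptive knowledge graphs on the fly to address the information loss and irrelevant details of traditional RAG, improving performance and generalization.

Details

Motivation: Because of the input context window, traditional RAG truncates external knowledge into small chunks, causing information loss that leads to hallucinated responses and broken reasoning chains. In addition, retrieving unstructured knowledge introduces irrelevant details that hurt reasoning accuracy.

Result: On six public benchmarks and a real-world business benchmark (NowNewsQA), TAdaRAG outperforms existing methods across diverse domains and long-text tasks, demonstrating strong generalization and practical effectiveness.

Insight: Combining on-the-fly knowledge graph construction with task adaptation effectively addresses information loss and irrelevant details, improving reasoning and response quality.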

Abstract: Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.

[240] Mitigating Length Bias in RLHF through a Causal Lens cs.CL | cs.AI

Hyeonji Kim, Sujeong Oh, Sanghack Lee

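TL;DR: The paper analyzes and mitigates length bias in RLHF through a causal lens, proposing a counterfactual data augmentation method that trains the reward model to assess content quality independently of verbosity.

Details

Motivation: RLHF-trained reward models exhibit length bias: they favor longer responses, conflating verbosity with quality. The goal is to remove this bias so that the model attends to content quality.

Result: Experiments show that the method reduces length bias in reward assignment and leads the policy model to produce more concise, content-focused outputs.

Insight: A causal framework combined with data augmentation can markedly improve the robustness and content sensitivity of reward models in RLHF.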

Abstract: Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias – a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.
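Length-divergent pairs with similar content also suggest a simple diagnostic: check how often a reward model scores the longer response higher. The sketch below is an assumption-laden illustration rather than the paper's evaluation protocol; `longer_preferred_rate` and the toy reward functions are hypothetical, and length is measured in characters for simplicity (tokens would be the more usual unit).

```python
def longer_preferred_rate(pairs, reward_fn):
    """Fraction of length-divergent pairs where the longer response wins.

    pairs: iterable of (response_a, response_b) strings with similar
    content but different length. Ties count as half a win, so a reward
    model that ignores length on such pairs scores about 0.5.
    """
    wins, total = 0.0, 0
    for a, b in pairs:
        if len(a) == len(b):
            continue  # not length-divergent, skip
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        total += 1
        r_long, r_short = reward_fn(longer), reward_fn(shorter)
        if r_long > r_short:
            wins += 1.0
        elif r_long == r_short:
            wins += 0.5
    return wins / total if total else float("nan")

pairs = [
    ("Paris.", "The capital of France is, of course, the city of Paris."),
    ("4", "After adding two and two together, the result is 4."),
]

def length_biased(response):
    return float(len(response))  # toy reward that only counts characters

def length_blind(response):
    return 1.0  # toy reward that ignores length entirely

print(longer_preferred_rate(pairs, length_biased))
print(longer_preferred_rate(pairs, length_blind))
```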

[241] MMWOZ: Building Multimodal Agent for Task-oriented Dialogue cs.CL

Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao

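TL;DR: The paper introduces MMWOZ, a multimodal dialogue dataset extended from MultiWOZ 2.3, and proposes MATE, a novel multimodal model for building task-oriented multimodal dialogue systems.

Details

Motivation: Existing task-oriented dialogue systems rely mainly on natural language and back-end APIs, but in real-world scenarios front-end GUIs are widespread and customized APIs are often absent, which limits applicability. The paper aims to bridge this gap with a multimodal approach.

Result: Initial experiments with MATE on the MMWOZ dataset indicate the potential of multimodal task-oriented dialogue systems.

Insight: Task-oriented dialogue systems that combine GUIs with multimodality may become an important direction for practical applications.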

Abstract: Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.

[242] Group-Aware Reinforcement Learning for Output Diversity in Large Language Models cs.CL | cs.AI | cs.LG

Oron Anschel, Alon Shoshan, Adam Botach, Shunit Haviv Hakimi, Asaf Gendler

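TL;DR: The paper proposes Group-Aware Policy Optimization (GAPO), which computes rewards at the group level to address mode collapse in large language models (LLMs) and increase the diversity of generated responses.

Details

Motivation: LLMs are prone to mode collapse, repeatedly generating the same few responses, which limits output diversity. The authors aim to address this with group-level reward optimization.

Result: GAPO produces more diverse outputs while maintaining accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro).

Insight: Group-level reward optimization is an effective way to increase output diversity and applies to both open-ended and closed-ended tasks. Future work could explore reward designs for other group-level properties.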

Abstract: Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
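A group-level, frequency-aware reward can be sketched in a few lines. The specific reward below (one over the number of times a valid completion appears in the group) is an illustrative assumption, since the abstract does not give the paper's formula; the mean and standard deviation normalization follows the usual GRPO-style group-relative advantage.

```python
from collections import Counter
import statistics

def frequency_aware_rewards(completions, valid_answers):
    """Reward each valid completion by 1 / (its count within the group).

    The reward of one completion depends on the whole group: repeating an
    answer dilutes its reward, so each distinct valid answer contributes a
    total reward of 1 no matter how often it is sampled. Invalid
    completions receive 0.
    """
    counts = Counter(c for c in completions if c in valid_answers)
    return [1.0 / counts[c] if c in valid_answers else 0.0 for c in completions]

def group_relative_advantages(rewards):
    """Normalize rewards within the group (zero mean, unit std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Toy prompt: "name a primary color" sampled five times.
group = ["red", "red", "red", "blue", "green"]
valid = {"red", "blue", "yellow"}

rewards = frequency_aware_rewards(group, valid)
advantages = group_relative_advantages(rewards)
for completion, r, a in zip(group, rewards, advantages):
    print(f"{completion:>5}: reward={r:.3f} advantage={a:+.3f}")
```

Under this toy reward, the rarely sampled valid answer gets the largest advantage, the over-sampled one a smaller reward, and the invalid one the lowest, which pushes the policy toward more uniform coverage of valid completions.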

[243] Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data cs.CL | cs.AI | cs.CV

Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu

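TL;DR: Uni-MoE-2.0-Omni is a language-centric omnimodal large model. Through a dynamic-capacity MoE design, a progressive training strategy, and a carefully curated multimodal data matching technique, it substantially advances multimodal understanding, reasoning, and generation.

Details

Motivation: To build an efficient and broadly capable omnimodal model that balances computational efficiency and capability in multimodal tasks and achieves stronger performance on language-centric multimodal tasks.

Result: The model performs strongly across 85 benchmarks, surpassing models such as Qwen2.5-Omni, with notable gains in video understanding, omnimodal understanding, and audiovisual reasoning.

Insight: Dynamic MoE and progressive training offer clear advantages in multimodal tasks, and high-quality data matching is especially important for generation tasks.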

Abstract: We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee’s Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.

[244] Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing cs.CL | cs.AI

Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai

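TL;DR: The paper proposes the NOTAM semantic parsing task and addresses the complex semantic understanding and domain knowledge integration that NOTAMs require by building the high-quality Knots dataset with a multi-agent collaborative framework. Experiments demonstrate the approach's effectiveness in aviation text understanding and processing.

Details

Motivation: NOTAMs (Notices to Air Missions) are a critical channel for conveying flight safety information, but their complex linguistic structure and implicit reasoning make automated parsing challenging. Existing research focuses mainly on surface-level tasks such as classification and named entity recognition and lacks deep semantic understanding.

Result: The proposed approach substantially improves aviation text understanding and processing and offers valuable insights for automated NOTAM analysis systems.

Insight: 1. NOTAM semantic parsing requires domain knowledge; 2. Multi-agent collaboration can improve annotation quality; 3. LLM prompt optimization is crucial for complex semantic tasks.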

Abstract: Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.

[245] Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing cs.CL

Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao

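TL;DR: Reason-KE++ is an SFT+RL framework that uses process-level alignment to make LLMs more faithful to newly edited knowledge, addressing the tendency of earlier methods to optimize only the outcome while ignoring the reasoning process.

Details

Motivation: Current SFT methods such as Reason-KE show a "faithfulness gap" on multi-hop reasoning tasks: the model tends to mimic the format rather than actually reason, so parametric priors override the new facts and produce factual hallucinations.

Result: Reaches 95.48% accuracy on MQUAKE-CF-3k (+5.28%), confirming the importance of process alignment for complex tasks.

Insight: Outcome-only RL collapses reasoning integrity, whereas process alignment is key to building trustworthy LLMs, especially on multi-hop reasoning tasks.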

Abstract: Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a “faithfulness gap”: they optimize for format mimicry rather than sound reasoning. This gap enables the LLM’s powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning “Houston” from “NASA” despite an explicit edit). To solve this core LLM alignment problem, we propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.

[246] On the Brittleness of LLMs: A Journey around Set Membership cs.CL

Lea Hergert, Gábor Berend, Mario Szegedy, Gyorgy Turan, Márk Jelasity

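TL;DR: Although LLMs achieve superhuman performance on complex reasoning tasks, they are brittle and unpredictable on basic tasks such as set membership queries.

Details

Motivation: To study the paradox of LLMs failing on simple tasks and to expose the limits of their reliability and interpretability.

Result: LLM performance on this basic task is unstable and the failure modes are unpredictable, suggesting that the models' grasp of the set concept is fragmented.

Insight: Large-scale experiments on a deliberately simple problem offer a new methodology for LLM evaluation and reveal the limits of model understanding.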

Abstract: Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries – among the most fundamental forms of reasoning – using tasks like “Is apple an element of the set {pear, plum, apple, raspberry}?”. We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models’ “understanding” of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
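
To make the experimental setup concrete, here is a minimal sketch of how a grid of set membership queries could be generated across phrasings and element orderings. The templates, the `build_prompts` helper, and the `max_orderings` cap are illustrative assumptions, not the paper's actual prompts or code.

```python
from itertools import permutations

# Hypothetical prompt templates; the paper's actual phrasings are not listed in the abstract.
TEMPLATES = [
    "Is {item} an element of the set {{{members}}}?",
    "Does the set {{{members}}} contain {item}?",
]

def build_prompts(item, members, max_orderings=6):
    """Return (prompt, expected_answer) pairs over phrasings and element orderings."""
    expected = item in members  # the ground truth is trivial to compute
    prompts = []
    for order in list(permutations(members))[:max_orderings]:
        for template in TEMPLATES:
            text = template.format(item=item, members=", ".join(order))
            prompts.append((text, expected))
    return prompts

prompts = build_prompts("apple", ["pear", "plum", "apple", "raspberry"])
```

Because the expected answer is known for every prompt, any disagreement between a model's answers across orderings or phrasings can be counted directly as a failure.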

[247] Evidence of Phase Transitions in Small Transformer-Based Language Models cs.CL | cs.AI

Noah Hong, Tao Hong

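TL;DR: This paper investigates whether phase transitions occur in small Transformer-based language models, whether they can be detected directly in linear training space, and whether they appear early in training. Using vocabulary-usage and statistical analyses, it finds that such transitions do exist, offering new insight into the nonlinear dynamics of language model training.

Details

Motivation: Phase transitions have been proposed as the origin of emergent abilities in large language models, but prior work has focused on large models. This paper tests whether the same phenomena occur in small models and whether they can be detected more directly in linear training space.

Result: The study finds a distinct transition point that is invisible in standard loss or validation curves but is revealed by vocabulary-based and statistical analyses.

Insight: The study suggests that phase-transition reorganizations are a general feature of language model training, observable even in small models, and underscores the importance of tailored metrics for revealing phase-transition behavior.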

Abstract: Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors.
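
The vocabulary probes described above could be computed per checkpoint roughly as follows. This is only a sketch: the function names, the use of a type-token ratio for vocabulary diversity, and the Fano factor (variance-to-mean ratio) as the Poisson versus sub-Poisson indicator are assumptions, since the abstract does not specify the exact statistics used.

```python
from statistics import mean, pvariance

def vocab_metrics(sample, known_words):
    """Summarise one non-empty text sample generated at a training checkpoint."""
    words = sample.split()
    correct = [w for w in words if w.lower() in known_words]
    return {
        "avg_word_length": mean(len(w) for w in words),
        "correct_fraction": len(correct) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

def fano_factor(counts):
    """Variance-to-mean ratio: about 1 for Poisson counts, below 1 for sub-Poisson."""
    return pvariance(counts) / mean(counts)

# Toy sample with one misspelled word ("teh") against a tiny reference vocabulary.
m = vocab_metrics("the cat sat on teh mat", {"the", "cat", "sat", "on", "mat"})
```

Plotting these quantities against the linear training step, rather than the loss, is the kind of view in which an abrupt change would show up as a transition point.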

[248] LLM Reinforcement in Context cs.CL | cs.CR

Thomas Rivasseau

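TL;DR: This paper proposes strengthening LLM alignment with "interruptions": control sentences inserted into the user input every fixed number of tokens to keep the model from drifting away from its safety objectives.

Details

Motivation: Current LLM alignment research mostly improves robustness to adversarial attacks through training and prompting, but lacks alignment methods that scale with user input length. Because jailbreak probability has been shown to increase with input length, new alignment mechanisms are needed.

Result: The paper suggests that this approach can reduce the probability of jailbreaks in long conversations or with large inputs.

Insight: Dynamically inserting control sentences may strengthen alignment without degrading model performance, suggesting a new direction for LLM safety alignment.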

Abstract: Current Large Language Model alignment research mostly focuses on improving model robustness against adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There is a lack of appropriate research into means of strengthening alignment which also scale with user input length. We propose interruptions as a possible solution to this problem. Interruptions are control sentences added to the user input approximately every x tokens for some arbitrary x. We suggest that this can be generalized to the Chain-of-Thought process to prevent scheming.
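
The interruption mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions: tokens are approximated by whitespace-separated words rather than a model tokenizer, and the function name and control sentence are made up for the example.

```python
def add_interruptions(user_input, control_sentence, every_x_tokens):
    """Insert control_sentence after every `every_x_tokens` whitespace tokens."""
    if every_x_tokens <= 0:
        raise ValueError("every_x_tokens must be positive")
    tokens = user_input.split()
    pieces = []
    for start in range(0, len(tokens), every_x_tokens):
        chunk = tokens[start:start + every_x_tokens]
        pieces.append(" ".join(chunk))
        # Only interrupt after full chunks, not after a short final remainder.
        if len(chunk) == every_x_tokens:
            pieces.append(control_sentence)
    return " ".join(pieces)

guarded = add_interruptions("a b c d e", "[Stay within policy.]", 2)
# guarded == "a b [Stay within policy.] c d [Stay within policy.] e"
```

The number of inserted control sentences grows linearly with input length, which is the scaling property the abstract asks for.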

[249] Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing cs.CL | cs.LO

Hayden Moore, Asfahan Shah

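TL;DR: This paper studies how robust large language models (LLMs) are, in autoformalization, to natural-language inputs that are semantically similar but worded differently, and finds that model outputs are sensitive to small input changes.

Details

Motivation: LLMs perform well at autoformalization but can still produce inaccurate or unverifiable formalizations. Prior work showed that LLMs are sensitive to semantically similar paraphrases; this study examines how that phenomenon plays out in autoformalization.

Result: Small changes to the input can lead to significantly different model outputs, revealing the instability of LLMs on autoformalization tasks.

Insight: Model robustness still needs improvement; the observed sensitivity calls for extra care when generating and using formalizations.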

Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and the Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

[250] BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals cs.CL

Ruiyu Wang, Yuzhang Xie, Xiao Hu, Carl Yang, Jiaying Lu

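TL;DR: BioMedJImpact is a large-scale biomedical dataset that combines bibliometric indicators, collaboration features, and LLM-derived AI engagement indicators to analyze how collaboration intensity and AI engagement jointly shape a journal's scientific impact.

Details

Motivation: Existing open resources rarely capture how collaboration structures and AI research jointly shape journal prestige in biomedicine, so a comprehensive dataset and methodological framework is needed to support in-depth analysis of journal impact.

Result: Two findings: 1. Journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact; 2. AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings.

Insight: 1. AI engagement is becoming an important indicator of journal prestige; 2. Collaboration diversity significantly boosts scientific impact; 3. LLM pipelines enable efficient, scalable analysis of scientific impact.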

Abstract: Assessing journal impact is central to scholarly communication, yet existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape venue prestige in biomedicine. We present BioMedJImpact, a large-scale, biomedical-oriented dataset designed to advance journal-level analysis of scientific impact and AI engagement. Built from 1.74 million PubMed Central articles across 2,744 journals, BioMedJImpact integrates bibliometric indicators, collaboration features, and LLM-derived semantic indicators for AI engagement. Specifically, the AI engagement feature is extracted through a reproducible three-stage LLM pipeline that we propose. Using this dataset, we analyze how collaboration intensity and AI engagement jointly influence scientific impact across pre- and post-pandemic periods (2016-2019, 2020-2023). Two consistent trends emerge: journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings. To further validate the three-stage LLM pipeline we proposed for deriving the AI engagement feature, we conduct human evaluation, confirming substantial agreement in AI relevance detection and consistent subfield classification. Together, these contributions demonstrate that BioMedJImpact serves as both a comprehensive dataset capturing the intersection of biomedicine and AI, and a validated methodological framework enabling scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Code is available at https://github.com/JonathanWry/BioMedJImpact.

[251] NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation cs.CL | cs.AI

Kang Yin, Hye-Bin Shin

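TL;DR: The paper introduces NeuroLex, a lightweight domain language model for EEG reports. Trained specifically to capture the domain-specific language of EEG reporting, it outperforms general-purpose language models and provides a foundation for brain-computer interface applications.

Details

Motivation: General-purpose language models struggle to capture the domain-specific language and diagnostic patterns in EEG reports, which motivates a lightweight language model designed specifically for them.

Result: NeuroLex outperforms general models of the same scale in perplexity, extraction and summarization accuracy, label efficiency, and robustness to negation and factual hallucination.

Insight: NeuroLex offers a foundation for interpretable, language-driven neural decoding in brain-computer interface applications, bridging biomedical text modeling and EEG-specific tasks.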

Abstract: Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.

[252] From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models cs.CL | cs.CV

Wenxin Zhu, Andong Chen, Yuchen Song, Kehai Chen, Conghui Zhu

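TL;DR: This paper systematically reviews Multimodal Chain-of-Thought (MCoT): it analyzes the background, methods, and underlying mechanisms, summarizes evaluation metrics and application scenarios, and discusses current challenges and future research directions.

Details

Motivation: Enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs) is a current research focus, yet existing models suffer from opaque reasoning paths and insufficient generalization.

Result: Summarizes existing evaluation benchmarks and metrics and the application scenarios of MCoT.

Insight: MCoT promises to improve reasoning transparency and output interpretability in the multimodal domain, but challenges such as generalization and adaptability to complex tasks remain.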

Abstract: With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on “Multimodal Chain-of-Thought” (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.

[253] Visual Room 2.0: Seeing is Not Understanding for MLLMs cs.CL

Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang

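TL;DR: The paper proposes Visual Room 2.0, a hierarchical benchmark for examining perception-cognition alignment in multimodal large language models (MLLMs). It finds that MLLMs are stronger at perception than cognition and that cognitive ability scales with model size.

Details

Motivation: To examine whether MLLMs truly understand the visual content they see rather than merely describing its surface, extending Searle's Chinese Room argument to the multimodal domain.

Result: MLLMs show stronger perceptual than cognitive ability (8.0%↑); cognition scales with model size, whereas perception does not consistently improve with larger models.

Insight: "Seeing ≠ Understanding" can be treated as a testable hypothesis, offering a new paradigm spanning perceptual processing to cognitive reasoning in MLLMs.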

Abstract: Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle’s Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition. Evaluating 10 state-of-the-art (SoTA) MLLMs, we highlight three key findings: (1) MLLMs exhibit stronger perceptual competence than cognitive ability (8.0%↑); (2) cognition appears not causally dependent on perception-based reasoning; and (3) cognition scales with model size, but perception does not consistently improve with larger variants. This work operationalizes Seeing ≠ Understanding as a testable hypothesis, offering a new paradigm from perceptual processing to cognitive reasoning in MLLMs. Our dataset is available at https://huggingface.co/datasets/LHK2003/PCBench.

[254] Fine-Tuned LLMs Know They Don’t Know: A Parameter-Efficient Approach to Recovering Honesty cs.CL

Zeyu Shi, Ziming Wang, Tianyu Chen, Shiqi Gao, Haoyi Zhou

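TL;DR: This paper proposes HCNR, a parameter-efficient method that repairs the honest expression of fine-tuned large language models (LLMs) by restoring key neurons. Experiments on four QA tasks and five LLM families show that HCNR recovers 33.25% of the compromised honesty with substantially better efficiency.

Details

Motivation: Supervised fine-tuning (SFT) improves LLM performance on specific tasks but undermines honesty. Existing methods assume that SFT destroys the model's ability to recognize its knowledge boundaries and therefore adjust parameters globally; the authors find instead that fine-tuned LLMs retain this ability and that only their capacity to express it is damaged.

Result: Experiments on four QA tasks and five LLM families show that HCNR recovers 33.25% of the compromised honesty, with at least a 2.23x speedup and over 10x less data than baseline methods.

Insight: Fine-tuned LLMs still recognize their knowledge boundaries; the problem lies in impaired expression. HCNR's localized repair offers a new route to restoring model honesty efficiently.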

Abstract: The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models’ ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.

[255] Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training cs.CL

Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu

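TL;DR: Spark-Prover-X1 is a 7B-parameter model whose formal theorem-proving ability is improved through a three-stage training framework. Key innovations include diverse training data, a CoT-augmented task, and GRPO optimization, yielding state-of-the-art performance among similarly sized open-source models on multiple benchmarks.

Details

Motivation: LLMs show promise in automated theorem proving, but progress is limited by the scarcity of high-quality formal language data. Spark-Prover-X1 aims to strengthen the reasoning of lightweight models through diverse data and an efficient training framework.

Result: Spark-Prover-X1-7B performs strongly on multiple benchmarks, for example solving 27 problems on PutnamBench and reaching a 24.0% pass rate on CombiBench (both pass@32).

Insight: Diverse data and a progressively refined training framework can significantly improve the formal reasoning of lightweight LLMs, offering an effective path for resource-constrained settings.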

Abstract: Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a “CoT-augmented state prediction” task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover’s capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.

[256] BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models cs.CL

Chuyuan Li, Giuseppe Carenini

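TL;DR: This paper introduces BeDiscovER, a benchmark for evaluating the discourse-level understanding of modern large language models, comprising 52 datasets across 5 tasks. It finds that frontier models do well on temporal reasoning but still fall short on document-level reasoning and certain semantic phenomena.

Details

Motivation: As the reasoning abilities of large language models advance, a new benchmark is needed to evaluate their discourse understanding comprehensively, particularly in multi-task and multilingual settings.

Result: Frontier models perform strongly on temporal reasoning but show clear weaknesses in document-level reasoning and some semantic phenomena, such as rhetorical relation recognition.

Insight: Current large language models remain limited in handling complex discourse phenomena and multilingual tasks; future research should focus more on semantic and contextual understanding.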

Abstract: We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with 52 individual datasets in total. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., “just”), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in the arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

[257] Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study cs.CL

Zhichao He, Mouxiao Bian, Jianhong Zhu, Jiayuan Chen, Yunqiu Wang

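TL;DR: This study evaluates the zero-shot ability of large language models (LLMs) to identify whether randomized controlled trials (RCTs) adhere to the CONSORT reporting guidelines. Overall performance is modest: models identify compliant items accurately but perform poorly on non-compliant and not-applicable items.

Details

Motivation: The CONSORT guidelines are an important standard for assessing the transparency and quality of RCTs, but manual verification is slow and laborious. The study investigates whether LLMs can automate this process and improve efficiency.

Result: Gemini-2.5-Flash and DeepSeek-R1 perform best (macro F1 of 0.634), but the ability to identify non-compliant and not-applicable items is markedly weak (F1 scores rarely exceed 0.400).

Insight: LLMs can serve as preliminary screening tools for identifying compliant items, but they cannot yet reliably detect reporting omissions or methodological flaws, so human appraisal remains necessary.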

Abstract: The Consolidated Standards of Reporting Trials (CONSORT) statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials (RCTs). Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a gold-standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen’s Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score > 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
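
The two headline metrics can be illustrated with a short, dependency-free sketch. The label names and toy predictions below are invented for illustration; they show how a model that labels every item "compliant" can look accurate while scoring a low macro F1 and a kappa of zero, which is the kind of class disparity the abstract describes.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance (undefined when chance agreement is 1)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

labels = ["compliant", "non-compliant", "not applicable"]
y_true = ["compliant"] * 4 + ["non-compliant", "not applicable"]
y_pred = ["compliant"] * 6  # toy model that always answers "compliant"

f1 = macro_f1(y_true, y_pred, labels)
kappa = cohen_kappa(y_true, y_pred)
```

In this toy case raw accuracy is 4/6, but the per-class F1 scores are 0.8, 0, and 0, and the agreement is no better than chance.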

[258] Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction cs.CL | cs.AI

Quanjiang Guo, Sijie Wang, Jinchuan Zhang, Ben Zhang, Zhao Kang

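TL;DR: The paper proposes Agent-Event-Coder (AEC), a multi-agent framework that treats event extraction as a code-generation process and improves zero-shot event extraction through task decomposition and collaborative verification.

Details

Motivation: Zero-shot event extraction (ZSEE) is challenging for large language models (LLMs); direct prompting often yields incomplete or structurally invalid outputs.

Result: Experiments across five diverse domains and six LLMs show that AEC consistently outperforms prior zero-shot baselines.

Insight: Framing event extraction as code generation and improving it through iterative refinement and deterministic validation is a representative application of multi-agent collaboration to NLP tasks.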

Abstract: Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs–such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks–retrieval, planning, coding, and verification–each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC-GQJ/Agent-Event-Coder.
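
The idea of representing an event schema as an executable class and validating extractions deterministically can be sketched as follows. The `AttackEvent` schema, the `validate` helper, and the example sentence are hypothetical illustrations, not taken from the AEC codebase.

```python
from dataclasses import dataclass, fields

@dataclass
class AttackEvent:
    """Hypothetical event schema: one field per argument role."""
    trigger: str
    attacker: str
    target: str

def validate(schema_cls, extraction, source_text):
    """Return a list of error messages; an empty list means the extraction passes."""
    expected = {f.name for f in fields(schema_cls)}
    got = set(extraction)
    errors = [f"missing argument: {name}" for name in sorted(expected - got)]
    errors += [f"unknown argument: {name}" for name in sorted(got - expected)]
    trigger = extraction.get("trigger")
    if trigger is not None and trigger not in source_text:
        errors.append(f"trigger not found in text: {trigger}")
    return errors

text = "Rebels attacked the convoy at dawn."
draft = {"trigger": "attacked", "attacker": "Rebels"}
first_pass = validate(AttackEvent, draft, text)    # ["missing argument: target"]
second_pass = validate(AttackEvent, {**draft, "target": "the convoy"}, text)  # []
```

In an iterative setup, the error list from one pass would be fed back to the coding agent as precise feedback until the list comes back empty.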

[259] A Comparative Analysis of Recurrent and Attention Architectures for Isolated Sign Language Recognition cs.CL

Nigar Alishzade, Gulchin Abdullayeva

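TL;DR: This study compares a recurrent architecture (ConvLSTM) with an attention architecture (Vanilla Transformer) on isolated sign language recognition. The attention-based Transformer clearly outperforms ConvLSTM in both Top-1 and Top-5 accuracy, while ConvLSTM is more computationally efficient.

Details

Motivation: To compare the performance of recurrent networks and attention mechanisms on sign language recognition and inform architecture selection in practice.

Result: The Transformer reaches Top-1 accuracy of 76.8% on AzSLD and 88.3% on WLASL, outperforming ConvLSTM; ConvLSTM is more computationally efficient but less accurate.

Insight: The Transformer is stronger in accuracy and signer independence, whereas ConvLSTM has advantages in computational efficiency and temporal modeling; these results guide architecture selection in practice.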

Abstract: This study presents a systematic comparative analysis of recurrent and attention-based neural architectures for isolated sign language recognition. We implement and evaluate two representative models, ConvLSTM and Vanilla Transformer, on the Azerbaijani Sign Language Dataset (AzSLD) and the Word-Level American Sign Language (WLASL) dataset. Our results demonstrate that the attention-based Vanilla Transformer consistently outperforms the recurrent ConvLSTM in both Top-1 and Top-5 accuracy across datasets, achieving up to 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL. The ConvLSTM, while more computationally efficient, lags in recognition accuracy, particularly on smaller datasets. These findings highlight the complementary strengths of each paradigm: the Transformer excels in overall accuracy and signer independence, whereas the ConvLSTM offers advantages in computational efficiency and temporal modeling. The study provides a nuanced analysis of these trade-offs, offering guidance for architecture selection in sign language recognition systems depending on application requirements and resource constraints.

[260] TCM-5CEval: Extended Deep Evaluation Benchmark for LLM’s Comprehensive Clinical Research Competence in Traditional Chinese Medicine cs.CL

Tianai Huang, Jiayuan Chen, Lu Lu, Pengcheng Chen, Tianbin Li

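TL;DR: TCM-5CEval is an extended, in-depth benchmark for comprehensively evaluating the clinical research competence of large language models (LLMs) in Traditional Chinese Medicine (TCM). It covers five key dimensions and reveals notable weaknesses in classical text interpretation and reasoning stability.

Details

Motivation: Although LLMs perform well in general domains, their use in highly specialized, culturally rich fields such as TCM requires careful evaluation. TCM-5CEval builds on the earlier TCM-3CEval to provide a more comprehensive evaluation tool.

Result: Models handle foundational knowledge well but struggle with classical text interpretation; option-order permutation tests show that all models exhibit substantial reasoning instability and positional sensitivity.

Insight: TCM-5CEval provides a detailed diagnostic tool for LLMs in TCM and exposes fundamental weaknesses in their reasoning, pointing to directions for further research.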

Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the “In-depth Challenge for Comprehensive TCM Abilities” special track.
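
Permutation-based consistency testing can be sketched as below. This is an illustrative assumption about the procedure, not the benchmark's implementation: `answer_fn` stands in for a model call that returns the index of its chosen option, and the toy `always_first` function simulates pure positional bias.

```python
from itertools import permutations

def permutation_consistency(question, options, correct, answer_fn):
    """Fraction of option orderings for which answer_fn picks the correct option text."""
    orderings = list(permutations(options))
    hits = 0
    for order in orderings:
        choice_index = answer_fn(question, list(order))  # index into the shown order
        hits += order[choice_index] == correct
    return hits / len(orderings)

# Toy stand-in for a model with pure positional bias: it always picks the first option.
def always_first(question, options):
    return 0

score = permutation_consistency("placeholder question", ["A", "B", "C"], "A", always_first)
```

A model that tracks content rather than position would score 1.0 here, while the biased stand-in is correct only in the orderings where the right answer happens to come first.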

[261] Seeing isn’t Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms cs.CL

Tyler Loakman, Joseph James, Chenghua Lin

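TL;DR: This paper studies how well vision language models (VLMs) interpret spectrograms and waveforms, and finds that their performance is close to random guessing, indicating that specific knowledge is needed rather than paired multimodal data alone.

Details

Motivation: With the development of large language models (LLMs) and vision language models (VLMs), the authors assess how these models perform on cross-modal tasks, specifically interpreting spectrograms and waveforms of speech.

Result: VLMs perform poorly at interpreting spectrograms and waveforms and rarely exceed chance even after fine-tuning.

Insight: Successfully interpreting spectrograms and waveforms requires specific parametric knowledge, not just paired multimodal samples.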

Abstract: With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
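
Selecting distractors by phonemic edit distance could look like the sketch below. It assumes the three nearest candidates are chosen, which the abstract does not state, and the toy lexicon uses simplified, made-up phoneme transcriptions rather than the dataset's own.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]

def pick_distractors(target, lexicon, k=3):
    """Return the k words whose phoneme sequences are closest to the target's."""
    candidates = [w for w in lexicon if w != target]
    candidates.sort(key=lambda w: (edit_distance(lexicon[target], lexicon[w]), w))
    return candidates[:k]

# Toy lexicon with simplified, made-up phoneme transcriptions.
lexicon = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "bat": ["b", "ae", "t"],
    "cut": ["k", "ah", "t"],
    "dog": ["d", "ao", "g"],
}
distractors = pick_distractors("cat", lexicon)  # ["bat", "cap", "cut"]
```

With one correct answer and three distractors, chance accuracy on the multiple-choice task is 25%, which is the baseline against which "rarely above chance" should be read.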

[262] Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts cs.CLPDF

Siyu Zhu, Mouxiao Bian, Yue Xie, Yongyu Tang, Zhikang Yu

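TL;DR: Using PEDIASBench, a systematic evaluation framework, the paper studies how large language models (LLMs) perform in real-world pediatric clinical settings, finding that they do well on foundational knowledge and some tasks but remain limited in complex reasoning, real-time adaptation, and humanistic care.

Details

Motivation: With the rapid rise of LLMs in medicine, whether they can function as competent pediatricians has become a key question. The study aims to assess the practical capabilities of LLMs in pediatric clinical practice and to provide directions for future improvement.

Result: LLMs perform well on foundational questions (e.g., Qwen3-235B-A22B exceeds 90% accuracy on licensing-level questions), but performance drops by about 15% as task complexity increases. In dynamic diagnosis and treatment, DeepSeek-R1 scores highest in case reasoning (mean 0.58), yet most models struggle to adapt to real-time patient changes. On medical ethics and safety tasks, Qwen2.5-72B performs best (92.05% accuracy), though humanistic sensitivity remains limited.

Insight: Current LLMs cannot yet perform pediatric care independently, but they show promise for decision support, medical education, and patient communication. Future work should focus on multimodal integration and clinical feedback mechanisms to improve safety, interpretability, and human-AI collaboration.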

Abstract: With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.

[263] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction cs.CLPDF

Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen

TL;DR: Mem-PAL introduces PAL-Bench, a benchmark for evaluating personalized dialogue assistants in long-term interactions, and proposes H²Memory, a hierarchical memory framework to enhance personalized responses.

Details

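Motivation: Existing dialogue assistants often overlook users' individual traits and subjective preferences in long-term interactions, and so cannot meet the need for personalized service. New benchmarks and methods are therefore needed to improve the quality of personalized dialogue.

Result: Experiments show that H²Memory improves personalization and response quality on both PAL-Bench and an external dataset.

Insight: Hierarchical memory and retrieval-augmented generation are an effective direction for long-term personalized dialogue, and synthetic data can supplement real data.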

Abstract: With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.

[264] Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation cs.CL | cs.CVPDF

Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J

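TL;DR: The paper proposes TAI, a multimodal framework that combines LLMs and latent diffusion models to translate Indian poetry and generate images for it, improving its global accessibility.

Details

Motivation: Because of its linguistic complexity and cultural depth, Indian poetry is hard to understand for non-native speakers and readers unfamiliar with its context. Existing research has largely overlooked poetry in Indian languages, leaving resources scarce. This paper aims to fill that gap.

Result: TAI outperforms existing baselines on poem image generation, and both human and quantitative evaluations confirm its effectiveness.

Insight: A multimodal approach that combines translation and image generation can improve the understanding and dissemination of complex cultural content, while filling a gap for low-resource languages in poetry.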

Abstract: Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader’s experience.

cs.SE [Back]

[265] Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? cs.SE | cs.AI | cs.CL | cs.LGPDF

Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang

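TL;DR: Live-SWE-agent is the first software engineering agent that evolves autonomously and continuously at runtime, with no offline training; by refining its own scaffold while solving real problems, it improves solve rate and generalization.

Details

Motivation: Current LLM agents can solve software engineering problems, but their design space is costly and hard to exhaust, and existing self-improving agents require offline training, which limits their generalization.

Result: Reaches a 75.4% solve rate on SWE-bench Verified, outperforming existing open-source agents, and a 45.8% solve rate on SWE-Bench Pro, the best known result.

Insight: Live self-evolution can improve a software agent's performance and generalization, offering a new idea for designing adaptive agents.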

Abstract: Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.

cs.LG [Back]

[266] Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models cs.LG | cs.AI | cs.CLPDF

Alexis Roger, Gwen Legate, Kashif Rasul, Yuriy Nevmyvaka, Irina Rish

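TL;DR: This paper studies the effect of tokenizer design and pretraining in time series models, finding that tokenizer configuration governs the model's representational capacity and stability, while pretraining improves optimization efficiency and alignment, with the largest benefit at small vocabulary sizes.

Details

Motivation: Tokenization and pretraining are key components of time series modeling, but their specific effects have not been studied systematically. This paper aims to fill that gap.

Result: Well-designed tokenizers combined with pretraining work best at small vocabulary sizes, while misaligned tokenization can diminish or even invert the benefits of pretraining.

Insight: The interplay between tokenizer design and pretraining is crucial in time series modeling, especially in multi-modal forecasting where the vocabulary is shared across modalities.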

Abstract: Tokenization and transfer learning are two critical components in building state of the art time series foundation models for forecasting. In this work, we systematically study the effect of tokenizer design, specifically scaling and quantization strategies, on model performance, alongside the impact of pretraining versus random initialization. We show that tokenizer configuration primarily governs the representational capacity and stability of the model, while transfer learning influences optimization efficiency and alignment. Using a combination of empirical training experiments and theoretical analyses, we demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively, particularly at smaller vocabulary sizes. Conversely, misaligned tokenization can diminish or even invert the benefits of pretraining. These findings highlight the importance of careful tokenization in time series modeling and suggest that combining small, efficient vocabularies with pretrained weights is especially advantageous in multi-modal forecasting settings, where the overall vocabulary must be shared across modalities. Our results provide concrete guidance for designing tokenizers and leveraging transfer learning in discrete representation learning for continuous signals.
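
To make "scaling and quantization strategies" concrete, here is a minimal sketch of one possible tokenizer: mean-absolute scaling followed by uniform binning. This particular combination, the clipping range, and the function names are assumptions for illustration; the abstract does not specify which strategies the paper compares.

```python
import numpy as np


def tokenize(series, vocab_size=16, clip=3.0):
    """Scale by the mean absolute value, clip, and map to integer bin ids."""
    series = np.asarray(series, dtype=float)
    scale = np.mean(np.abs(series))
    scale = scale if scale > 0 else 1.0            # guard against all-zero input
    scaled = np.clip(series / scale, -clip, clip)
    # vocab_size bins need vocab_size - 1 interior edges.
    edges = np.linspace(-clip, clip, vocab_size + 1)[1:-1]
    tokens = np.digitize(scaled, edges)            # values in 0 .. vocab_size - 1
    return tokens, scale


def detokenize(tokens, scale, vocab_size=16, clip=3.0):
    """Map bin ids back to bin centers and undo the scaling."""
    width = 2 * clip / vocab_size
    centers = -clip + width * (np.asarray(tokens) + 0.5)
    return centers * scale


series = np.array([1.0, -2.0, 3.0, 0.5])
tokens, scale = tokenize(series, vocab_size=16)
recon = detokenize(tokens, scale, vocab_size=16)
print(tokens, recon)
```

In this sketch the vocabulary size directly sets the bin width, so a smaller vocabulary trades reconstruction precision for a more compact token space, which is the kind of trade-off the abstract studies.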

[267] H-Model: Dynamic Neural Architectures for Adaptive Processing cs.LG | cs.CLPDF

Dmytro Hospodarchuk

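TL;DR: The paper proposes H-Model, a neural network architecture that dynamically adjusts its internal structure, using a routing mechanism to achieve adaptive computation; the goal is to explore adaptable and more interpretable networks rather than to optimize performance on existing benchmarks.

Details

Motivation: Inspired by dynamic reasoning processes, the author wants to design a neural network that adjusts its computational structure according to the input data, rather than merely optimizing the performance of existing models.

Result: Because of limited computing resources and data, the study is preliminary, but initial observations show the architecture's promise.

Insight: H-Model offers a new research direction for dynamically adaptive and interpretable network design, but its potential needs further validation with more resources.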

Abstract: This article explores the design and experimentation of a neural network architecture capable of dynamically adjusting its internal structure based on the input data. The proposed model introduces a routing mechanism that allows each layer to influence how its outputs are propagated through the network, enabling iterative and adaptive computation. This concept is loosely inspired by the idea of thought processes and dynamic reasoning, where information flow is conditioned not only on the data itself, but also on the internal state of the system. It is important to note that this work does not aim to compete with state-of-the-art language models in terms of performance. Instead, it presents a conceptual prototype, an architectural framework that opens up a new direction for exploring adaptable and potentially more interpretable networks. The goal is not optimization of existing benchmarks but rather the proposal of a system that can learn not only representations, but also the structure of computation itself. Due to practical constraints in computing resources and data, this study remains a preliminary investigation. Nevertheless, initial observations show promise, and the architecture’s full potential can only be evaluated in future experiments under more favorable computational conditions.

[268] Reasoning: From Reflection to Solution cs.LG | cs.AI | cs.CLPDF

Zixi Li

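TL;DR: This paper examines the nature of reasoning, arguing that reasoning is iterative operator application in a state space until convergence to a fixed point; through theoretical analysis and a working system (OpenLM), it reaches 76% accuracy on the OpenXOR task, where existing LLMs reach 0%.

Details

Motivation: Although large language models (LLMs) perform well on many benchmarks, the author questions whether they truly reason or merely pattern-match over reasoning traces. This motivates redefining and studying the nature of reasoning.

Result: OpenLM reaches 76% accuracy on OpenXOR, while existing LLMs fail completely (0%).

Insight: Genuine reasoning requires iteration and dynamic updates in a state space, not just static pattern matching.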

Abstract: What is reasoning? This question has driven centuries of philosophical inquiry, from Aristotle’s syllogisms to modern computational complexity theory. In the age of large language models achieving superhuman performance on benchmarks like GSM8K (95% accuracy) and HumanEval (90% pass@1), we must ask: have these systems learned to reason, or have they learned to pattern-match over reasoning traces? This paper argues for a specific answer: reasoning is iterative operator application in state spaces, converging to fixed points. This definition is not merely philosophical; it has concrete architectural implications that explain both the failures of current systems and the path to genuine reasoning capabilities. Our investigation begins with a puzzle (OpenXOR), progresses through theory (OpenOperator), and culminates in a working solution (OpenLM) that achieves 76% accuracy where state-of-the-art LLMs achieve 0%. This is not about criticizing existing systems, but about understanding what reasoning requires and building architectures that provide it.
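
The phrase "iterative operator application in state spaces, converging to fixed points" can be illustrated with a generic fixed-point loop. This is not the paper's OpenOperator or OpenLM; the graph-reachability operator and the toy `edges` graph below are assumptions chosen only because they converge in a few steps.

```python
def iterate_to_fixed_point(operator, state, max_steps=1000):
    """Apply operator repeatedly until the state stops changing.
    Returns the fixed point and the number of updates that changed the state."""
    for step in range(max_steps):
        new_state = operator(state)
        if new_state == state:
            return state, step
        state = new_state
    raise RuntimeError("no fixed point reached within max_steps")


# Hypothetical toy state space: sets of nodes in a small directed graph.
edges = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set(), "e": {"a"}}


def expand(reached):
    """One operator application: add every node one hop from the current set."""
    return frozenset(reached) | {nxt for node in reached for nxt in edges[node]}


reachable, steps = iterate_to_fixed_point(expand, frozenset({"a"}))
print(sorted(reachable), steps)
```

The answer (the set of nodes reachable from "a") is not produced in one pass; it emerges only once repeated application of the same operator stops changing the state.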

[269] Better LLM Reasoning via Dual-Play cs.LG | cs.AI | cs.CLPDF

Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie

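TL;DR: The paper proposes PasoDoble, a dual-play adversarial learning framework that trains two models, one specializing in generating questions (Proposer) and one in solving them (Solver), improving the reasoning ability of large language models (LLMs) without external supervision.

Details

Motivation: Existing LLMs still rely on external supervision (e.g., human annotation) for reasoning tasks, and adversarial learning such as self-play offers a way to reduce that reliance. Applying adversarial learning directly to LLMs, however, runs into reward hacking and training instability.

Result: Experiments show that PasoDoble improves LLM reasoning performance while requiring no external supervision during training.

Insight: Dual-play adversarial learning is an effective unsupervised approach that drives performance gains through internal competition, and is particularly suited to LLM reasoning tasks.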

Abstract: Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions’ quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver’s limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.

[270] P1: Mastering Physics Olympiads with Reinforcement Learning cs.LG | cs.AI | cs.CLPDF

Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang

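TL;DR: The P1 family of models, trained with reinforcement learning, performs strongly on physics Olympiads; in particular, P1-235B-A22B achieves gold-medal performance at the International Physics Olympiad and generalizes well to a range of other reasoning tasks.

Details

Motivation: The science-grade reasoning of large language models (LLMs) still needs improvement, particularly in physics, which binds symbols to reality and underpins most modern technologies. The team therefore uses reinforcement learning to push past the limits of reasoning on physics problems.

Result: P1-235B-A22B achieves gold-medal performance at IPhO 2025 and wins 12 gold medals across 13 international/regional physics competitions; P1-30B-A3B earns a silver medal. P1-235B-A22B+PhysicsMinions ranks first overall on IPhO 2025.

Insight: P1 models perform well not only in physics reasoning but also on other reasoning tasks such as math and coding, suggesting potential across domains.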

Abstract: Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning, the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift: it binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, which especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalizability of the P1 series.

[271] A neural optimization framework for free-boundary diffeomorphic mapping problems and its applications cs.LG | cs.CV | cs.GR | math.CV | math.DGPDF

Zhehao Xu, Lok Ming Lui

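TL;DR: The paper proposes SBN-Opt, a neural optimization framework for free-boundary diffeomorphic mapping problems; by embedding the LSQC energy into a multiscale mesh-spectral architecture, it gives explicit control over local geometric distortion and outperforms traditional numerical algorithms in experiments.

Details

Motivation: Free-boundary diffeomorphic mapping is central to surface mapping, but it is difficult because the boundary is unconstrained and local bijectivity must be preserved under large deformation. LSQC theory offers a mathematical remedy, but the conventional numerical algorithm requires landmark conditioning and cannot be used in gradient-based optimization.

Result: In experiments on density-equalizing maps and inconsistent surface registration, SBN-Opt outperforms traditional numerical algorithms.

Insight: Combining a neural surrogate with a multiscale architecture gives an efficient optimization route for complex free-boundary problems, and explicit control of distortion adds flexibility for applications.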

Abstract: Free-boundary diffeomorphism optimization is a core ingredient in the surface mapping problem but remains notoriously difficult because the boundary is unconstrained and local bijectivity must be preserved under large deformation. Numerical Least-Squares Quasiconformal (LSQC) theory, with its provable existence, uniqueness, similarity-invariance and resolution-independence, offers an elegant mathematical remedy. However, the conventional numerical algorithm requires landmark conditioning and cannot be applied to gradient-based optimization. We propose a neural surrogate, the Spectral Beltrami Network (SBN), that embeds LSQC energy into a multiscale mesh-spectral architecture. Next, we propose the SBN-guided optimization framework, SBN-Opt, which optimizes free-boundary diffeomorphisms with explicitly controllable local geometric distortion. Extensive experiments on density-equalizing maps and inconsistent surface registration demonstrate the superiority of SBN-Opt over traditional numerical algorithms.

[272] MPCM-Net: Multi-scale network integrates partial attention convolution with Mamba for ground-based cloud image segmentation cs.LG | cs.CVPDF

Penghui Niu, Jiashuai She, Taotao Cai, Yajuan Zhang, Ping Zhang

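TL;DR: MPCM-Net is a multi-scale network that integrates partial attention convolutions with the Mamba architecture to improve the accuracy and computational efficiency of ground-based cloud image segmentation. It addresses limitations of existing methods, and the authors release a high-quality dataset, CSRC.

Details

Motivation: Ground-based cloud image segmentation is important for photovoltaic power forecasting, but existing deep learning methods suffer from insufficient multi-scale feature extraction, inefficient attention mechanisms, and decoders that fail to establish global dependencies.

Result: MPCM-Net performs strongly on the CSRC dataset, achieving the best balance between segmentation accuracy and inference speed.

Insight: 1. Partial attention convolution makes feature extraction more efficient; 2. combining the Mamba architecture with SSHD improves global dependencies; 3. the release of the CSRC dataset addresses shortcomings of existing datasets.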

Abstract: Ground-based cloud image segmentation is a critical research domain for photovoltaic power forecasting. Current deep learning approaches primarily focus on encoder-decoder architectural refinements. However, existing methodologies exhibit several limitations: (1) they rely on dilated convolutions for multi-scale context extraction, lacking the partial feature effectiveness and interoperability of inter-channel; (2) attention-based feature enhancement implementations neglect accuracy-throughput balance; and (3) the decoder modifications fail to establish global interdependencies among hierarchical local features, limiting inference efficiency. To address these challenges, we propose MPCM-Net, a Multi-scale network that integrates Partial attention Convolutions with Mamba architectures to enhance segmentation accuracy and computational efficiency. Specifically, the encoder incorporates MPAC, which comprises: (1) an MPC block with ParCM and ParSM that enables global spatial interaction across multi-scale cloud formations, and (2) an MPA block combining ParAM and ParSM to extract discriminative features with reduced computational complexity. On the decoder side, an M2B is employed to mitigate contextual loss through an SSHD that maintains linear complexity while enabling deep feature aggregation across spatial and scale dimensions. As a key contribution to the community, we also introduce and release a dataset CSRC, which is a clear-label, fine-grained segmentation benchmark designed to overcome the critical limitations of existing public datasets. Extensive experiments on CSRC demonstrate the superior performance of MPCM-Net over state-of-the-art methods, achieving an optimal balance between segmentation accuracy and inference speed. The dataset and source code will be available at https://github.com/she1110/CSRC.

[273] Stratified Knowledge-Density Super-Network for Scalable Vision Transformers cs.LG | cs.AI | cs.CVPDF

Longhua Li, Lei Qi, Xin Geng

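TL;DR: The paper proposes a stratified knowledge-density super-network and two new methods (WPAC and PIAD) for efficiently compressing and expanding Vision Transformer models, improving knowledge concentration and stratified organization.

Details

Motivation: Training and deploying multiple Vision Transformer models for different resource constraints is costly and inefficient, calling for a flexible and efficient alternative.

Result: WPAC outperforms existing pruning criteria in knowledge concentration, and combined with PIAD it offers a strong alternative to state-of-the-art model compression and model expansion methods.

Insight: By organizing knowledge hierarchically and storing it compactly, Vision Transformer models can adapt more flexibly to different resource budgets, offering a new idea for efficient deployment.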

Abstract: Training and deploying multiple vision transformer (ViT) models for different resource constraints is costly and inefficient. To address this, we propose transforming a pre-trained ViT into a stratified knowledge-density super-network, where knowledge is hierarchically organized across weights. This enables flexible extraction of sub-networks that retain maximal knowledge for varying model sizes. We introduce Weighted PCA for Attention Contraction (WPAC), which concentrates knowledge into a compact set of critical weights. WPAC applies token-wise weighted principal component analysis to intermediate features and injects the resulting transformation and inverse matrices into adjacent layers, preserving the original network function while enhancing knowledge compactness. To further promote stratified knowledge organization, we propose Progressive Importance-Aware Dropout (PIAD). PIAD progressively evaluates the importance of weight groups, updates an importance-aware dropout list, and trains the super-network under this dropout regime to promote knowledge stratification. Experiments demonstrate that WPAC outperforms existing pruning criteria in knowledge concentration, and the combination with PIAD offers a strong alternative to state-of-the-art model compression and model expansion methods.
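
The idea of injecting a transformation and its inverse into adjacent layers without changing the network function can be sketched in a simplified setting. The example below assumes two purely linear layers with no nonlinearity, bias, or attention structure between them, uses an uncentered weighted second moment, and draws the token weights at random; all of these are illustrative assumptions rather than the WPAC procedure itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, n_tokens = 8, 6, 4, 200

W1 = rng.normal(size=(d_hid, d_in))    # first linear layer
W2 = rng.normal(size=(d_out, d_hid))   # second linear layer
X = rng.normal(size=(n_tokens, d_in))  # token inputs
token_weights = rng.uniform(0.1, 1.0, size=n_tokens)  # hypothetical importance

# Token-weighted second moment of the intermediate features.
H = X @ W1.T
cov = (H * token_weights[:, None]).T @ H / token_weights.sum()

# Orthogonal basis, columns sorted by decreasing weighted variance.
eigvals, eigvecs = np.linalg.eigh(cov)
P = eigvecs[:, np.argsort(eigvals)[::-1]]

# Fold the rotation into layer 1 and its inverse (P is orthogonal) into layer 2.
W1_rot = P.T @ W1
W2_rot = W2 @ P

Y_ref = X @ W1.T @ W2.T
Y_rot = X @ W1_rot.T @ W2_rot.T
full_error = float(np.max(np.abs(Y_ref - Y_rot)))


def truncated_output(k):
    """Output of the sub-network that keeps only the first k hidden units."""
    return X @ W1_rot[:k].T @ W2_rot[:, :k].T


print(full_error)
```

Because the rotation sorts hidden directions by weighted variance, keeping only the leading hidden units, as in `truncated_output(k)`, gives a smaller sub-network that retains the highest-variance directions, while the full rotated network reproduces the original outputs up to floating-point error.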

[274] Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models cs.LG | cs.AI | cs.CVPDF

Fei Song, Yi Li, Rui Wang, Jiahuan Zhou, Changwen Zheng

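TL;DR: The paper proposes a doubly debiased test-time prompt tuning method that addresses the optimization bias arising from both the model and the data when tuning prompts for vision-language models at test time.

Details

Motivation: Test-time prompt tuning generalizes well in zero-shot settings, but relying solely on unlabeled test data can induce prompt optimization bias and hurt downstream performance. The paper analyzes the causes of this bias from the model and data perspectives.

Result: Experiments on 15 benchmark datasets show that the method outperforms baselines, validating its effectiveness in mitigating prompt optimization bias.

Insight: The paper traces prompt optimization bias to the model side (overconfident but incorrect predictions) and the data side (misalignment between visual and textual modalities), and improves robustness through a doubly debiased approach.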

Abstract: Test-time prompt tuning for vision-language models has demonstrated impressive generalization capabilities under zero-shot settings. However, tuning the learnable prompts solely based on unlabeled test data may induce prompt optimization bias, ultimately leading to suboptimal performance on downstream tasks. In this work, we analyze the underlying causes of prompt optimization bias from both the model and data perspectives. In terms of the model, the entropy minimization objective typically focuses on reducing the entropy of model predictions while overlooking their correctness. This can result in overconfident yet incorrect outputs, thereby compromising the quality of prompt optimization. On the data side, prompts affected by optimization bias can introduce misalignment between visual and textual modalities, which further aggravates the prompt optimization bias. To this end, we propose a Doubly Debiased Test-Time Prompt Tuning method. Specifically, we first introduce a dynamic retrieval-augmented modulation module that retrieves high-confidence knowledge from a dynamic knowledge base using the test image feature as a query, and uses the retrieved knowledge to modulate the predictions. Guided by the refined predictions, we further develop a reliability-aware prompt optimization module that incorporates a confidence-based weighted ensemble and cross-modal consistency distillation to impose regularization constraints during prompt tuning. Extensive experiments across 15 benchmark datasets involving both natural distribution shifts and cross-datasets generalization demonstrate that our method outperforms baselines, validating its effectiveness in mitigating prompt optimization bias.
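
The point that entropy minimization rewards confidence rather than correctness can be checked with a small numeric sketch. The two probability vectors and the `true_class` label are hypothetical values chosen for illustration.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-12)))


true_class = 0  # hypothetical ground-truth label, unseen at test time

confident_wrong = np.array([0.02, 0.96, 0.02])  # peaked on the wrong class
hesitant_right = np.array([0.50, 0.30, 0.20])   # correct but uncertain

h_wrong = entropy(confident_wrong)
h_right = entropy(hesitant_right)
print(h_wrong, h_right)
```

An objective that only minimizes entropy would prefer the first prediction even though it is incorrect, which is the model-side bias the abstract describes.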

[275] Toward Dignity-Aware AI: Next-Generation Elderly Monitoring from Fall Detection to ADL cs.LG | cs.CV | cs.CYPDF

Xun Shao, Aoba Otani, Yuto Hirasuka, Runji Cai, Seng W. Loke

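TL;DR: This paper proposes a next-generation elderly monitoring system based on privacy preservation and edge computing, aiming to extend from fall detection to full Activities of Daily Living (ADL) recognition. The authors present preliminary federated learning results under non-IID (not independent and identically distributed) conditions and outline challenges and directions for future research.

Details

Motivation: As societies age, supporting older adults' independent living without intruding on privacy has become an important problem. Current AI systems focus mainly on the single task of fall detection and neglect the broader need for ADL recognition.

Result: Experiments indicate that the approach is feasible on edge devices and in federated learning settings. ADL datasets are still being collected, however, and further research is needed.

Insight: Future directions include tackling domain shift, data scarcity, and privacy risks to reach fuller ADL monitoring in smart-room environments.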

Abstract: This position paper envisions a next-generation elderly monitoring system that moves beyond fall detection toward the broader goal of Activities of Daily Living (ADL) recognition. Our ultimate aim is to design privacy-preserving, edge-deployed, and federated AI systems that can robustly detect and understand daily routines, supporting independence and dignity in aging societies. At present, ADL-specific datasets are still under collection. As a preliminary step, we demonstrate feasibility through experiments using the SISFall dataset and its GAN-augmented variants, treating fall detection as a proxy task. We report initial results on federated learning with non-IID conditions, and embedded deployment on Jetson Orin Nano devices. We then outline open challenges such as domain shift, data scarcity, and privacy risks, and propose directions toward full ADL monitoring in smart-room environments. This work highlights the transition from single-task detection to comprehensive daily activity recognition, providing both early evidence and a roadmap for sustainable and human-centered elderly care AI.
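
Non-IID conditions in federated learning are often simulated by giving each client a skewed share of every class. The sketch below uses a Dirichlet split for this; the abstract does not say how the authors partitioned their data, so the method, the `alpha` value, and the toy labels are all assumptions.

```python
import numpy as np


def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with class proportions drawn
    from a Dirichlet distribution; smaller alpha gives more skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        # Convert cumulative shares into split points within this class.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices


# Hypothetical toy labels: 0 = ordinary activity, 1 = fall.
labels = np.array([0] * 300 + [1] * 100)
parts = dirichlet_partition(labels, n_clients=4, alpha=0.3)
print([len(p) for p in parts])
```

Each sample lands on exactly one client, but the class mix differs from client to client, which is the setting in which federated averaging tends to be harder than with IID data.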

[276] Simple Vision-Language Math Reasoning via Rendered Text cs.LG | cs.CVPDF

Matvey Skripkin, Elizaveta Goncharova, Andrey Kuznetsov

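TL;DR: The paper proposes a lightweight yet effective pipeline for training vision-language models: LaTeX-encoded equations are rendered into images and paired with structured chain-of-thought prompts to solve math problems.

Details

Motivation: Current multimodal models do poorly on math reasoning, mainly because of weak visual understanding of mathematical symbols and structure. The paper aims to address this with a simple text-to-vision augmentation.

Result: The method shows gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA, and matches or surpasses open-source and proprietary math-focused vision-language solvers.

Insight: Rendering fidelity and prompt design are the primary drivers of performance. Even a simple method can do well on math reasoning.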

Abstract: We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence - showing gains on tasks such as MMMU, ChartQA, and DocVQA of up to 20%.

[277] Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs cs.LG | cs.CVPDF

Arya Narang

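TL;DR: This paper studies how much a multimodal approach that combines images with text (such as dish names) improves calorie estimation; the multimodal model improves on the image-only approach by 1.25%, reducing error by 1.06 kcal.

Details

Motivation: To determine whether short text inputs (such as dish names) significantly improve calorie estimation accuracy relative to an image-only approach.

Result: The multimodal model lowers the mean absolute error (MAE) of calorie estimation from 84.76 kcal to 83.70 kcal, a reduction of 1.06 kcal.

Insight: Short text can complement visual information, bringing a slight but statistically significant gain in multimodal tasks.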

Abstract: This paper determines the extent to which short textual inputs (in this case, names of dishes) can improve calorie estimation compared to an image-only baseline model and whether any improvements are statistically significant. It uses the TensorFlow library and the Nutrition5k dataset (curated by Google) to train both an image-only CNN and a multimodal CNN that accepts both text and an image as input. The MAE of calorie estimations was reduced by 1.06 kcal from 84.76 kcal to 83.70 kcal (1.25% improvement) when using the multimodal model.
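
The reported figures are internally consistent, as a short check of the arithmetic shows. The variable names are illustrative; the two MAE values are taken from the abstract.

```python
mae_image_only = 84.76   # kcal, image-only baseline (from the abstract)
mae_multimodal = 83.70   # kcal, image + dish name (from the abstract)

absolute_reduction = mae_image_only - mae_multimodal
relative_improvement = 100 * absolute_reduction / mae_image_only

print(f"{absolute_reduction:.2f} kcal, {relative_improvement:.2f}%")
```

The relative improvement is computed against the image-only baseline, which matches the 1.25% stated in the abstract.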

[278] Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental modelling cs.LG | cs.CVPDF

Julia Peters, Karin Mora, Miguel D. Mahecha, Chaonan Ji, David Montero

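TL;DR: This paper proposes a multimodal representation learning framework at high spatio-temporal resolution that yields a unified feature representation of Earth observation data, overcoming the fixed spatial or temporal scales of existing models.

Details

Motivation: Existing Earth observation foundation models typically operate at fixed spatial or temporal scales, which limits their use for ecological analyses that need high spatio-temporal resolution.

Result: The learned embeddings show high spatial and semantic consistency across heterogeneous landscapes, and in modelling Gross Primary Production (GPP) they encode ecologically meaningful patterns and retain sufficient temporal fidelity.

Insight: The two-stage design makes it easy to add new sensors while retaining pretrained encoders, and delivers both high spatio-temporal resolution and ecological relevance.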

Abstract: Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.

[279] Fast 3D Surrogate Modeling for Data Center Thermal Management cs.LG | cs.AI | cs.CV | eess.SYPDF

Soumyendu Sarkar, Antonio Guillen-Perez, Zachariah J Carmichael, Avisek Naug, Refik Mert Cam

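TL;DR: The paper proposes a vision-based 3D surrogate modeling framework for data center thermal management. It models temperature and airflow dynamics directly on a 3D voxelized representation, enabling real-time prediction that generalizes across data center configurations, with large gains in speed and energy efficiency.

Details

Motivation: Traditional CFD solvers are accurate but too computationally expensive for real-time use. Real-time temperature prediction, which would help reduce data center energy consumption and carbon emissions, therefore needs an efficient surrogate model.

Result: The surrogate models generalize across data center configurations, cut prediction time from hours to hundreds of milliseconds (up to 20,000x speedup), and enable energy savings of 7%.

Insight: Vision-based 3D surrogate models offer a clear advantage for thermal management: they keep accuracy high while enabling real-time control, supporting more sustainable data centers.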

Abstract: Reducing energy consumption and carbon emissions in data centers by enabling real-time temperature prediction is critical for sustainability and operational efficiency. Achieving this requires accurate modeling of the 3D temperature field to capture airflow dynamics and thermal interactions under varying operating conditions. Traditional thermal CFD solvers, while accurate, are computationally expensive and require expert-crafted meshes and boundary conditions, making them impractical for real-time use. To address these limitations, we develop a vision-based surrogate modeling framework that operates directly on a 3D voxelized representation of the data center, incorporating server workloads, fan speeds, and HVAC temperature set points. We evaluate multiple architectures, including 3D CNN U-Net variants, a 3D Fourier Neural Operator, and 3D vision transformers, to map these thermal inputs to high-fidelity heat maps. Our results show that the surrogate models generalize across data center configurations and achieve up to 20,000x speedup (hundreds of milliseconds vs. hours). This fast and accurate estimation of hot spots and temperature distribution enables real-time cooling control and workload redistribution, leading to substantial energy savings (7%) and reduced carbon footprint.

[280] Improving a Hybrid Graphsage Deep Network for Automatic Multi-objective Logistics Management in Supply Chain cs.LG | cs.CVPDF

Mehdi Khaleghi, Nastaran Khaleghi, Sobhan Sheykhivand, Sebelan Danishvar

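TL;DR: The paper proposes a hybrid GraphSAGE network (H-GSN) for multi-task prediction in supply chain logistics management, covering shipment type, shipment status, traffic status, logistics ID, and logistics delay. It reports high accuracy (up to 100%) across three datasets, which the authors present as a way to improve supply chain resilience and sustainability.

Details

Motivation: Logistics management is central to the resilience and sustainability of a supply chain. Traditional methods struggle to predict several objectives, such as shipment type and status, efficiently, so an automated solution is needed.

Result: On three datasets (Smart Logistics, DataCo, Shipping), the method achieves average accuracies between 97.8% and 100%, supporting its effectiveness.

Insight: Handling multi-task logistics prediction with a deep learning model such as GraphSAGE can substantially improve management efficiency and sustainability.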

Abstract: Systematic logistics, conveyance amenities and facilities, and warehousing information play a key role in fostering profitable development in a supply chain. The aim of transformation in industries is to improve the resilience of the supply chain. Resilience policies are required for companies to positively affect collaboration with logistics service providers. Reduced air pollutant emissions are a lasting benefit of efficient logistics and transportation management in a supply chain. The management of shipment type is a significant factor in analyzing the sustainability of logistics and the supply chain. An automatic approach to predicting shipment type, logistics delay, and traffic status is required to improve the efficiency of supply chain management. A hybrid GraphSAGE network (H-GSN) is proposed in this paper for multi-task logistics management in a supply chain. The objectives are shipment type, shipment status, traffic status, logistics ID, and logistics delay, evaluated on three supply chain logistics databases available on Kaggle: DataCo, Shipping, and Smart Logistics. Average accuracies of 97.8% and 100% are obtained for predicting 10 kinds of logistics ID and 3 types of traffic status, respectively, on the Smart Logistics dataset. Average accuracies of 98.7% and 99.4% are obtained for shipment type prediction on DataCo and logistics delay prediction on the Shipping database, respectively. The evaluation metrics for different logistics scenarios confirm the efficiency of the proposed method in improving the resilience and sustainability of the supply chain.

[281] Transformers vs. Recurrent Models for Estimating Forest Gross Primary Production cs.LG | cs.AI | cs.CVPDF

David Montero, Miguel D. Mahecha, Francesco Martinuzzi, César Aybar, Anne Klosterhalfen

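TL;DR: The paper compares a transformer (GPT-2) and a recurrent model (LSTM) for predicting forest Gross Primary Production (GPP). LSTM performs better overall, while GPT-2 excels during extreme events. The study also examines how input window length affects performance and identifies radiation as the dominant predictor.

Details

Motivation: Current approaches to monitoring forest GPP, such as Eddy Covariance towers, have limited spatial coverage, and traditional remote sensing methods struggle to capture complex temporal dynamics. Deep learning opens new opportunities for multimodal GPP prediction, but comparative evaluations of state-of-the-art models are scarce.

Result: LSTM performs better overall, but GPT-2 is stronger during extreme events. LSTM reaches similar accuracy with substantially shorter input windows. Radiation is the dominant predictor.

Insight: Model architecture, input window length, and multimodal inputs jointly determine GPP prediction performance, which can guide the development and selection of future deep learning frameworks.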

Abstract: Monitoring the spatiotemporal dynamics of forest CO$_2$ uptake (Gross Primary Production, GPP), remains a central challenge in terrestrial ecosystem research. While Eddy Covariance (EC) towers provide high-frequency estimates, their limited spatial coverage constrains large-scale assessments. Remote sensing offers a scalable alternative, yet most approaches rely on single-sensor spectral indices and statistical models that are often unable to capture the complex temporal dynamics of GPP. Recent advances in deep learning (DL) and data fusion offer new opportunities to better represent the temporal dynamics of vegetation processes, but comparative evaluations of state-of-the-art DL models for multimodal GPP prediction remain scarce. Here, we explore the performance of two representative models for predicting GPP: 1) GPT-2, a transformer architecture, and 2) Long Short-Term Memory (LSTM), a recurrent neural network, using multivariate inputs. Overall, both achieve similar accuracy. But, while LSTM performs better overall, GPT-2 excels during extreme events. Analysis of temporal context length further reveals that LSTM attains similar accuracy using substantially shorter input windows than GPT-2, highlighting an accuracy-efficiency trade-off between the two architectures. Feature importance analysis reveals radiation as the dominant predictor, followed by Sentinel-2, MODIS land surface temperature, and Sentinel-1 contributions. Our results demonstrate how model architecture, context length, and multimodal inputs jointly determine performance in GPP prediction, guiding future developments of DL frameworks for monitoring terrestrial carbon dynamics.

[282] A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts cs.LG | cs.CVPDF

C. César Claros Olivares, Austin J. Brockmeier

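TL;DR: The paper systematically compares OOD detection methods across CLIP-stratified regimes using AURC and AUGRC as metrics. It finds that the learned feature space largely determines OOD detection efficacy and offers statistically grounded guidance for method selection.

Details

Motivation: To systematically evaluate how different OOD detection methods hold up under changes in representation paradigm (CNNs vs. ViTs) and training paradigm, and to give practitioners statistically grounded guidance for choosing a method.

Result: (1) Probabilistic scores (e.g., MSR, GEN) perform best for misclassification detection; (2) under stronger shifts, geometry-aware scores (e.g., NNGuide, fDBD) are more effective on CNNs; (3) on ViTs, GradNorm and KPCA Reconstruction Error remain consistently competitive.

Insight: (1) OOD detection efficacy depends heavily on the learned feature space; (2) different model families (CNN vs. ViT) call for different detection methods; (3) a simple PCA projection can improve several detectors.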

Abstract: We present a systematic comparison of out-of-distribution (OOD) detection methods across CLIP-stratified regimes using AURC and AUGRC as primary metrics. Experiments cover two representation paradigms: CNNs trained from scratch and a fine-tuned Vision Transformer (ViT), evaluated on CIFAR-10/100, SuperCIFAR-100, and TinyImageNet. Using a multiple-comparison-controlled, rank-based pipeline (Friedman test with Conover-Holm post-hoc) and Bron-Kerbosch cliques, we find that the learned feature space largely determines OOD efficacy. For both CNNs and ViTs, probabilistic scores (e.g., MSR, GEN) dominate misclassification (ID) detection. Under stronger shifts, geometry-aware scores (e.g., NNGuide, fDBD, CTM) prevail on CNNs, whereas on ViTs GradNorm and KPCA Reconstruction Error remain consistently competitive. We further show a class-count-dependent trade-off for Monte-Carlo Dropout (MCD) and that a simple PCA projection improves several detectors. These results support a representation-centric view of OOD detection and provide statistically grounded guidance for method selection under distribution shift.
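As a reading aid for the AURC metric named in the abstract, here is a minimal sketch of one common empirical definition of the area under the risk-coverage curve: samples are sorted by decreasing confidence, the error rate (risk) is computed at each coverage level, and the risks are averaged. The function name `aurc` and the toy inputs are illustrative assumptions; the paper's exact estimator, and its AUGRC variant, may differ.

```python
def aurc(confidences, errors):
    """Empirical area under the risk-coverage curve (lower is better).

    confidences: one confidence score per sample (higher = more trusted).
    errors: 1 if the prediction for that sample is wrong, else 0.
    """
    if len(confidences) != len(errors) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    # Accept the most confident predictions first.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])

    cumulative_errors = 0
    risks = []
    for rank, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        # Risk among the `rank` most confident samples (coverage = rank / n).
        risks.append(cumulative_errors / rank)

    # Average risk over all coverage levels.
    return sum(risks) / len(risks)


# A score that ranks wrong predictions last yields a lower AURC
# than one that ranks them first.
good = aurc([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
bad = aurc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
```

A confidence score is useful for selective prediction when its AURC is low, which is why the metric rewards ranking errors below correct predictions rather than raw accuracy alone.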

[283] Selecting Fine-Tuning Examples by Quizzing VLMs cs.LG | cs.CVPDF

Tenghao Ji, Eytan Adar

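TL;DR: The paper proposes QZLoRA, a framework that uses the QuizRank method to automatically select high-quality images for LoRA fine-tuning, yielding more representative, higher-quality generated images.

Details

Motivation: When fine-tuning text-to-image diffusion models, choosing good training examples from datasets of uneven quality is difficult. Good examples help ensure the generated images match the target concept, but automated selection methods are lacking.

Result: Experiments show that QZLoRA produces more representative photorealistic or stylized images with fewer samples.

Insight: Combining automated visual reasoning with parameter-efficient fine-tuning is a promising way to improve the adaptability of generative models.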

Abstract: A challenge in fine-tuning text-to-image diffusion models for specific topics is to select good examples. Fine-tuning from image sets of varying quality, such as Wikipedia Commons, will often produce poor output. However, training images that do exemplify the target concept (e.g., a female Mountain Bluebird) help ensure that the generated images are similarly representative (e.g., have the prototypical blue wings and gray chest). In this work, we propose QZLoRA, a framework to select images for low-rank adaptation (LoRA). The approach leverages QuizRank, a method to automatically rank images by treating them as an 'educational intervention' and 'quizzing' a VLM. We demonstrate that QZLoRA can produce better aligned, photorealistic images with fewer samples. We also show that these fine-tuned models can produce stylized images (i.e., illustrations) that are similarly representative. Our results highlight the promise of combining automated visual reasoning with parameter-efficient fine-tuning for topic-adaptive generative modeling.

[284] Variation-Bounded Loss for Noise-Tolerant Learning cs.LG | cs.CVPDF

Jialiang Wang, Xiong Zhou, Xianming Liu, Gangfeng Hu, Deming Zhai

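TL;DR: The paper proposes Variation-Bounded Loss (VBL), a new family of robust loss functions built on the variation ratio. Theoretical analysis shows that a smaller variation ratio improves robustness to noisy labels, and experiments on several datasets support its effectiveness.

Details

Motivation: The harm that noisy labels do to supervised learning is a long-standing problem. Existing robust loss functions are effective, but the variation ratio has not been studied systematically. The authors introduce it as a new property for measuring robustness and use it to design more flexible loss functions.

Result: Experiments on several datasets show that VBL improves robustness to noisy labels while remaining flexible and practical.

Insight: The variation ratio is an important indicator of a loss function's robustness. It can guide the design of more flexible robust losses and relax the reliance on the symmetric condition.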

Abstract: Mitigating the negative impact of noisy labels has been a perennial issue in supervised learning. Robust loss functions have emerged as a prevalent solution to this problem. In this work, we introduce the Variation Ratio as a novel property related to the robustness of loss functions, and propose a new family of robust loss functions, termed Variation-Bounded Loss (VBL), which is characterized by a bounded variation ratio. We provide theoretical analyses of the variation ratio, proving that a smaller variation ratio leads to better robustness. Furthermore, we reveal that the variation ratio provides a feasible method to relax the symmetric condition and offers a more concise path to achieve the asymmetric condition. Based on the variation ratio, we reformulate several commonly used loss functions into a variation-bounded form for practical applications. Experiments on various datasets demonstrate the effectiveness and flexibility of our approach.

[285] BSO: Binary Spiking Online Optimization Algorithm cs.LG | cs.CVPDF

Yu Liang, Yu Yang, Wenjie Wei, Ammar Belatreche, Shuai Wang

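TL;DR: The paper proposes BSO, a new online training algorithm for Binary Spiking Neural Networks (BSNNs) that significantly reduces training memory. It updates weights directly through flip signals, so latent weights need not be stored, and it adds T-BSO, a variant that accounts for temporal dynamics.

Details

Motivation: BSNNs offer efficiency advantages for resource-constrained computing, but existing training methods carry a large memory overhead because they store latent weights and process multiple time steps.

Result: Experiments show that BSO and T-BSO achieve better optimization performance than existing BSNN training methods.

Insight: BSO cuts memory requirements by simplifying the weight update mechanism, and T-BSO's use of temporal dynamics improves performance further, showing that online training and temporal dynamics combine well.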

Abstract: Binary Spiking Neural Networks (BSNNs) offer promising efficiency advantages for resource-constrained computing. However, their training algorithms often require substantial memory overhead due to latent weights storage and temporal processing requirements. To address this issue, we propose Binary Spiking Online (BSO) optimization algorithm, a novel online training algorithm that significantly reduces training memory. BSO directly updates weights through flip signals under the online training framework. These signals are triggered when the product of gradient momentum and weights exceeds a threshold, eliminating the need for latent weights during training. To enhance performance, we propose T-BSO, a temporal-aware variant that leverages the inherent temporal dynamics of BSNNs by capturing gradient information across time steps for adaptive threshold adjustment. Theoretical analysis establishes convergence guarantees for both BSO and T-BSO, with formal regret bounds characterizing their convergence rates. Extensive experiments demonstrate that both BSO and T-BSO achieve superior optimization performance compared to existing training methods for BSNNs. The codes are available at https://github.com/hamings1/BSO.
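The flip rule described in the abstract (a binary weight flips when the product of gradient momentum and the weight exceeds a threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exponential-moving-average momentum, the names `bso_step`, `beta`, and `threshold`, and their default values are all hypothetical. The official code is at the repository linked in the abstract.

```python
def bso_step(weights, grads, momentum, beta=0.9, threshold=0.05):
    """One flip-based update for binary weights in {-1, +1}.

    Assumption: momentum is an exponential moving average of gradients.
    No latent real-valued weights are kept; only the binary weights and
    the momentum buffer persist between steps.
    """
    # Update the gradient momentum (hypothetical EMA form).
    new_momentum = [beta * m + (1.0 - beta) * g
                    for m, g in zip(momentum, grads)]

    # Flip a weight when momentum * weight exceeds the threshold, i.e. the
    # accumulated gradient consistently points in the weight's own direction,
    # so moving against the gradient means changing sign.
    new_weights = [-w if m * w > threshold else w
                   for w, m in zip(weights, new_momentum)]
    return new_weights, new_momentum


weights = [1, -1, 1]
momentum = [0.0, 0.0, 0.0]
weights, momentum = bso_step(weights, [1.0, 1.0, -1.0], momentum)
```

In this toy call only the first weight flips: its sign agrees with the momentum and the product clears the threshold, while the other two products are negative. The temporal-aware T-BSO variant, which adapts the threshold from gradient information across time steps, is not covered by this sketch.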

[286] Linear time small coresets for k-mean clustering of segments with applications cs.LG | cs.CG | cs.CVPDF

David Denisov, Shlomi Dolev, Dan Felmdan, Michael Segal

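TL;DR: The paper proposes the first coreset construction for k-means clustering that provably handles arbitrary input segments. For constant k and ε, the coreset has size O(log² n) and is computable in O(nd) time, which suits real-time applications such as video tracking.

Details

Motivation: Existing methods are limited in how efficiently they handle k-means clustering of segment data, especially where real-time or distributed computation is required. The paper aims for an efficient and general coreset construction.

Result: Experiments show that the method substantially improves computational efficiency while preserving clustering accuracy, making it suitable for real-time applications such as video tracking.

Insight: Coresets offer an efficient route to clustering segment data, particularly where fast processing is needed, as in distributed or real-time computation.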

Abstract: We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.
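To make the objective concrete, the sketch below evaluates the segment cost $D(S,x) = \int_{p \in S} |p - x| \, dp$ numerically with a midpoint rule, then sums the minimum cost over centers to get $D(\mathcal{S}, X)$. It only illustrates the cost function from the abstract; it is not the coreset construction. The function names and the `samples` parameter are assumptions made for illustration.

```python
import math


def segment_to_center(a, b, x, samples=1000):
    """Approximate D(S, x): the integral of |p - x| over segment S = [a, b].

    The segment is parameterized as p(t) = a + t * (b - a), t in [0, 1],
    so dp = |b - a| dt. The integral is estimated with a midpoint rule.
    """
    length = math.dist(a, b)
    total = 0.0
    for i in range(samples):
        t = (i + 0.5) / samples
        p = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        total += math.dist(p, x)
    return length * total / samples


def clustering_cost(segments, centers, samples=1000):
    """D(S, X): each segment is charged to its cheapest center."""
    return sum(
        min(segment_to_center(a, b, x, samples) for x in centers)
        for a, b in segments
    )


# A segment of length 2 on the x-axis, with two candidate centers.
segments = [((0.0, 0.0), (2.0, 0.0))]
cost = clustering_cost(segments, [(0.0, 0.0), (1.0, 0.0)])
```

For this segment, the center at the endpoint costs 2 (the integral of s from 0 to 2) and the center at the midpoint costs 1, so the clustering cost is 1. Evaluating this cost is linear in the number of segments per center set, which is the work a small coreset is meant to avoid repeating on the full input.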

[287] Real-time prediction of breast cancer sites using deformation-aware graph neural network cs.LG | cs.CVPDF

Kyunghyun Lee, Yong-Min Shin, Minwoo Shin, Jihun Kim, Sunghwan Lim

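TL;DR: The paper proposes a graph neural network model that predicts the location of deformed breast cancer sites in real time during biopsy, addressing the high cost and long procedure times of conventional MRI-guided biopsy.

Details

Motivation: Conventional MRI-guided biopsy is slow and expensive. Indirect MRI-guided biopsy has been proposed as an alternative, but accurate real-time modeling of breast deformation remains difficult. The paper targets this problem.

Result: Validated on phantom and real patient datasets, the model predicts cancer node displacement to within 0.2 mm (RMSE), reaches a Dice similarity coefficient of 0.977 for spatial overlap, and runs more than 4,000 times faster than conventional finite element simulation.

Insight: The model shows the potential of graph neural networks in medical image analysis. By combining physics-based simulation with a data-driven model, it achieves both high accuracy and real-time speed, offering a new tool for clinical diagnosis.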

Abstract: Early diagnosis of breast cancer is crucial, enabling the establishment of appropriate treatment plans and markedly enhancing patient prognosis. While direct magnetic resonance imaging-guided biopsy demonstrates promising performance in detecting cancer lesions, its practical application is limited by prolonged procedure times and high costs. To overcome these issues, an indirect MRI-guided biopsy that allows the procedure to be performed outside of the MRI room has been proposed, but it still faces challenges in creating an accurate real-time deformable breast model. In our study, we tackled this issue by developing a graph neural network (GNN)-based model capable of accurately predicting deformed breast cancer sites in real time during biopsy procedures. An individual-specific finite element (FE) model was developed by incorporating magnetic resonance (MR) image-derived structural information of the breast and tumor to simulate deformation behaviors. A GNN model was then employed, designed to process surface displacement and distance-based graph data, enabling accurate prediction of overall tissue displacement, including the deformation of the tumor region. The model was validated using phantom and real patient datasets, achieving an accuracy within 0.2 millimeters (mm) for cancer node displacement (RMSE) and a dice similarity coefficient (DSC) of 0.977 for spatial overlap with actual cancerous regions. Additionally, the model enabled real-time inference and achieved a speed-up of over 4,000 times in computational cost compared to conventional FE simulations. The proposed deformation-aware GNN model offers a promising solution for real-time tumor displacement prediction in breast biopsy, with high accuracy and real-time capability. Its integration with clinical procedures could significantly enhance the precision and efficiency of breast cancer diagnosis.
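The two evaluation metrics quoted in the abstract can be sketched as follows. This is a generic illustration under assumptions: RMSE is taken over per-node Euclidean position errors, and the Dice similarity coefficient is computed on sets of voxel (or node) identifiers. The paper may define either metric slightly differently, and the function names are hypothetical.

```python
import math


def displacement_rmse(predicted, actual):
    """Root mean squared Euclidean error between predicted and true node positions."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [math.dist(p, q) ** 2 for p, q in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def dice_coefficient(region_a, region_b):
    """Dice similarity coefficient: 2 |A ∩ B| / (|A| + |B|) for two sets."""
    a, b = set(region_a), set(region_b)
    if not a and not b:
        return 1.0  # Two empty regions overlap perfectly by convention.
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy values: one node that is off by a 3-4-5 triangle, and two regions
# sharing two of their three voxels.
error_mm = displacement_rmse([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)])
overlap = dice_coefficient({1, 2, 3}, {2, 3, 4})
```

A Dice score of 1.0 means the predicted and actual tumor regions coincide exactly, so the reported 0.977 indicates near-complete overlap.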

[288] Uncovering and Mitigating Transient Blindness in Multimodal Model Editing cs.LG | cs.AI | cs.CVPDF

Xiaoqi Han, Ru Li, Ran Yi, Hongye Tan, Zhuomin Liang

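TL;DR: The paper proposes a comprehensive evaluation framework for multimodal model editing, uncovers a phenomenon it calls transient blindness, and improves editing by balancing cross-modal representations with adversarial losses.

Details

Motivation: Existing evaluation methods rely on low-similarity or random inputs, which hides overfitting and fails to measure the real effect of multimodal model edits.

Result: The method outperforms existing baselines in experiments, reducing transient blindness and improving locality by 17% on average.

Insight: In multimodal editing, the disproportionate effect of edits on textual tokens can cause transient blindness, and locality-aware adversarial losses mitigate it.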

Abstract: Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, which obscures overfitting. We propose a comprehensive locality evaluation framework, covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visuals. Token analysis shows edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.

q-bio.NC [Back]

[289] Predicting upcoming visual features during eye movements yields scene representations aligned with human visual cortex q-bio.NC | cs.CVPDF

Sushrut Thorat, Adrien Doerig, Alexander Kroner, Carmen Amme, Tim C. Kietzmann

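TL;DR: The paper proposes Glimpse Prediction Networks (GPNs), a self-supervised approach that learns the temporal regularities of human active vision by predicting the visual features of the next fixation, producing scene representations aligned with human visual cortex.

Details

Motivation: Scenes are complex but structured collections of parts, such as objects and surfaces, with spatial and semantic relations among them. Building effective scene representations requires learning these relations, and the temporal regularities of active vision provide a way to do so.

Result: The scene representations learned by GPNs align strongly with human fMRI responses. GPNs outperform matched controls trained with explicit semantic objectives and match or exceed strong modern vision baselines.

Insight: Self-supervised learning from the temporal regularities of active vision can produce scene representations aligned with biological vision, offering a new perspective on the human visual system.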

Abstract: Scenes are complex, yet structured collections of parts, including objects and surfaces, that exhibit spatial and semantic relations to one another. An effective visual system therefore needs unified scene representations that relate scene parts to their location and their co-occurrence. We hypothesize that this structure can be learned self-supervised from natural experience by exploiting the temporal regularities of active vision: each fixation reveals a locally-detailed glimpse that is statistically related to the previous one via co-occurrence and saccade-conditioned spatial regularities. We instantiate this idea with Glimpse Prediction Networks (GPNs) – recurrent models trained to predict the feature embedding of the next glimpse along human-like scanpaths over natural scenes. GPNs successfully learn co-occurrence structure and, when given relative saccade location vectors, show sensitivity to spatial arrangement. Furthermore, recurrent variants of GPNs were able to integrate information across glimpses into a unified scene representation. Notably, these scene representations align strongly with human fMRI responses during natural-scene viewing across mid/high-level visual cortex. Critically, GPNs outperform architecture- and dataset-matched controls trained with explicit semantic objectives, and match or exceed strong modern vision baselines, leaving little unique variance for those alternatives. These results establish next-glimpse prediction during active vision as a biologically plausible, self-supervised route to brain-aligned scene representations learned from natural visual experience.

cs.HC [Back]

[290] Accepted with Minor Revisions: Value of AI-Assisted Scientific Writing cs.HC | cs.AI | cs.CLPDF

Sanchaita Hazra, Doeun Lee, Bodhisattwa Prasad Majumder, Sachin Kumar

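TL;DR: The study evaluates the potential of large language models (LLMs) to assist scientific writing, focusing on abstract composition. AI-generated abstracts can reach the level of human-written ones with minimal revision, and editing behavior is driven largely by authors' perceptions of AI authorship.

Details

Motivation: Although LLMs are increasingly applied across domains, their role in scientific writing is not well understood, particularly for tasks that demand precision and multimodal synthesis.

Result: After editing, AI-generated abstracts are comparable to human-written ones. Editing behavior is strongly affected by source disclosure, while reviewer decisions are unaffected by the source.

Insight: Source disclosure matters in collaborative scientific writing. Perceptions of AI authorship, rather than objective quality, drive much of the observed editing behavior.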

Abstract: Large Language Models have seen expanding application across domains, yet their effectiveness as assistive tools for scientific writing – an endeavor requiring precision, multimodal synthesis, and domain expertise – remains insufficiently understood. We examine the potential of LLMs to support domain experts in scientific writing, with a focus on abstract composition. We design an incentivized randomized controlled trial with a hypothetical conference setup where participants with relevant expertise are split into an author and reviewer pool. Inspired by methods in behavioral science, our novel incentive structure encourages authors to edit the provided abstracts to an acceptable quality for a peer-reviewed submission. Our 2x2 between-subject design expands into two dimensions: the implicit source of the provided abstract and the disclosure of it. We find authors make most edits when editing human-written abstracts compared to AI-generated abstracts without source attribution, often guided by higher perceived readability in AI generation. Upon disclosure of source information, the volume of edits converges in both source treatments. Reviewer decisions remain unaffected by the source of the abstract, but bear a significant correlation with the number of edits made. Careful stylistic edits, especially in the case of AI-generated abstracts, in the presence of source information, improve the chance of acceptance. We find that AI-generated abstracts hold potential to reach comparable levels of acceptability to human-written ones with minimal revision, and that perceptions of AI authorship, rather than objective quality, drive much of the observed editing behavior. Our findings reverberate the significance of source disclosure in collaborative scientific writing.

[291] Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering cs.HC | cs.CV | cs.LG | cs.SDPDF

Tianyu Xu, Jihan Li, Penghe Zu, Pranav Sahay, Maruchi Kim

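TL;DR: SAMOSA is a novel on-device XR system that renders spatially accurate sound by dynamically adapting to the physical environment, addressing the difficulty existing XR spatial audio methods have in adapting to diverse physical scenes in real time.

Details

Motivation: In Extended Reality (XR), accurate sound rendering is essential for lifelike virtual experiences. Existing methods struggle to adapt to diverse physical scenes in real time, causing a mismatch between visual and auditory cues that disrupts user immersion.

Result: A technical evaluation using acoustic metrics across various room configurations and sound types, together with an expert evaluation (N=12), supports SAMOSA's feasibility and efficacy in enhancing XR auditory realism.

Insight: Fusing multimodal scene representations and adapting dynamically to the environment are key to realistic XR audio rendering. SAMOSA offers an efficient and scalable acoustic rendering approach for future XR systems.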

Abstract: In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA’s feasibility and efficacy in enhancing XR auditory realism.

[292] Trust in Vision-Language Models: Insights from a Participatory User Workshop cs.HC | cs.AI | cs.CVPDF

Agnese Chiatti, Lara Piccolo, Sara Bernardini, Matteo Matteucci, Viola Schiaffonati

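TL;DR: The paper examines user trust in Vision-Language Models (VLMs). Preliminary results from a participatory user workshop offer insights for future research on how to measure and improve that trust.

Details

Motivation: As VLMs are deployed more widely, users need tools to judge when these systems can be trusted. How user trust in VLMs builds and evolves, however, remains an open problem.

Result: Preliminary findings suggest that user trust in VLMs is a dynamic process, and that future studies should tailor trust metrics and participant engagement strategies to the specific context.

Insight: User trust in AI systems should be studied through participatory design work with users rather than by relying on AI models as judges for experimental validation. This offers a new perspective on trust in user-VLM interaction.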

Abstract: With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants’ engagement to fit the case of user-VLM interaction.

eess.IV [Back]

[293] Slow - Motion Video Synthesis for Basketball Using Frame Interpolation eess.IV | cs.CVPDF

Jiantang Huang

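TL;DR: The paper presents a real-time slow-motion video synthesis system for basketball. It fine-tunes the Real-Time Intermediate Flow Estimation (RIFE) network on the SportsSloMo dataset to generate high-quality, basketball-specific interpolated frames.

Details

Motivation: Basketball broadcast footage is usually captured at 30-60 fps, which limits viewers' ability to appreciate rapid plays such as dunks and crossovers.

Result: The fine-tuned RIFE model reaches a mean PSNR of 34.3 dB and SSIM of 0.949, outperforming Super SloMo by 2.1 dB and the baseline RIFE by 1.3 dB.

Insight: Task-specific adaptation is crucial for sports slow-motion generation, and RIFE offers a good balance between accuracy and speed.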

Abstract: Basketball broadcast footage is traditionally captured at 30-60 fps, limiting viewers’ ability to appreciate rapid plays such as dunks and crossovers. We present a real-time slow-motion synthesis system that produces high-quality basketball-specific interpolated frames by fine-tuning the recent Real-Time Intermediate Flow Estimation (RIFE) network on the SportsSloMo dataset. Our pipeline isolates the basketball subset of SportsSloMo, extracts training triplets, and fine-tunes RIFE with human-aware random cropping. We compare the resulting model against Super SloMo and the baseline RIFE model using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) on held-out clips. The fine-tuned RIFE attains a mean PSNR of 34.3 dB and SSIM of 0.949, outperforming Super SloMo by 2.1 dB and the baseline RIFE by 1.3 dB. A lightweight Gradio interface demonstrates end-to-end 4x slow-motion generation on a single RTX 4070 Ti Super at approximately 30 fps. These results indicate that task-specific adaptation is crucial for sports slow-motion, and that RIFE provides an attractive accuracy-speed trade-off for consumer applications.
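For readers unfamiliar with the PSNR figures quoted above, the sketch below computes Peak Signal-to-Noise Ratio between a reference frame and an interpolated frame using the standard definition, 10 * log10(MAX^2 / MSE). It assumes 8-bit frames flattened to lists of pixel values; the function name and toy frames are illustrative and not taken from the paper's code.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel intensities in [0, max_value].
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("frames must be non-empty and of equal size")

    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # Identical frames: PSNR is unbounded.
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels, each off by 10 intensity levels, give an MSE of 100.
value_db = psnr([100, 100], [110, 90])
```

Because PSNR is logarithmic in the mean squared error, a gain of a few dB corresponds to a substantial reduction in pixel-level error, which is why differences such as 1.3 dB and 2.1 dB are reported as meaningful.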

[294] Multimodal RGB-HSI Feature Fusion with Patient-Aware Incremental Heuristic Meta-Learning for Oral Lesion Classification eess.IV | cs.CVPDF

Rupam Mukherjee, Rajkumar Daniel, Soujanya Hazra, Shirin Dasgupta, Subhamoy Mandal

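TL;DR: The paper proposes an oral lesion classifier that combines multimodal RGB-HSI feature fusion with a patient-aware incremental heuristic meta-learner, improving classification robustness in low-resource settings.

Details

Motivation: In low-resource settings, early detection of oral cancer and potentially malignant disorders is difficult because annotated data are limited. The paper aims to improve classification through multimodal feature fusion and meta-learning.

Result: On an unseen patient split, the framework achieves a macro F1 of 66.23% and an accuracy of 64.56%, which the authors attribute to hyperspectral reconstruction and meta-learning improving robustness.

Insight: Multimodal feature fusion and patient-aware meta-learning can ease the shortage of annotated data and offer a more reliable approach to oral cancer screening.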

Abstract: Early detection of oral cancer and potentially malignant disorders is challenging in low-resource settings due to limited annotated data. We present a unified four-class oral lesion classifier that integrates deep RGB embeddings, hyperspectral reconstruction, handcrafted spectral-textural descriptors, and demographic metadata. A pathologist-verified subset of oral cavity images was curated and processed using a fine-tuned ConvNeXt-v2 encoder, followed by RGB-to-HSI reconstruction into 31-band hyperspectral cubes. Haemoglobin-sensitive indices, texture features, and spectral-shape measures were extracted and fused with deep and clinical features. Multiple machine-learning models were assessed with patient-wise validation. We further introduce an incremental heuristic meta-learner (IHML) that combines calibrated base classifiers through probabilistic stacking and patient-level posterior smoothing. On an unseen patient split, the proposed framework achieved a macro F1 of 66.23% and an accuracy of 64.56%. Results demonstrate that hyperspectral reconstruction and uncertainty-aware meta-learning substantially improve robustness for real-world oral lesion screening.

[295] MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging eess.IV | cs.CVPDF

Fan Li, Arun Iyengar, Lanyu Xu

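TL;DR: MTMed3D is a Transformer-based multi-task model that jointly performs detection, segmentation, and classification in 3D medical imaging, improving efficiency while keeping performance competitive.

Details

Motivation: Most AI techniques for medical imaging use single-task models, which overlook information shared across tasks and are inefficient in practice. MTMed3D uses multi-task learning to improve efficiency and practical usefulness.

Result: On the BraTS datasets, MTMed3D achieves promising results on all three tasks and outperforms prior work on detection in particular. The multi-task model significantly reduces computational cost and speeds up inference.

Insight: With a shared encoder, multi-task learning exploits information shared across tasks and reduces computational overhead while maintaining performance comparable to single-task models.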

Abstract: In the field of medical imaging, AI-assisted techniques such as object detection, segmentation, and classification are widely employed to alleviate the workload of physicians and doctors. However, single-task models are predominantly used, overlooking the shared information across tasks. This oversight leads to inefficiencies in real-life applications. In this work, we propose MTMed3D, a novel end-to-end Multi-task Transformer-based model to address the limitations of single-task models by jointly performing 3D detection, segmentation, and classification in medical imaging. Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders. The proposed framework was evaluated on the BraTS 2018 and 2019 datasets, achieving promising results across all three tasks, especially in detection, where our method achieves better results than prior works. Additionally, we compare our multi-task model with equivalent single-task variants trained separately. Our multi-task model significantly reduces computational costs and achieves faster inference speed while maintaining comparable performance to the single-task models, highlighting its efficiency advantage. To the best of our knowledge, this is the first work to leverage Transformers for multi-task learning that simultaneously covers detection, segmentation, and classification tasks in 3D medical imaging, presenting its potential to enhance diagnostic processes. The code is available at https://github.com/fanlimua/MTMed3D.git.

[296] BrainNormalizer: Anatomy-Informed Pseudo-Healthy Brain Reconstruction from Tumor MRI via Edge-Guided ControlNet eess.IV | cs.CVPDF

Min Gu Kwak, Yeonju Lee, Hairong Wang, Jing Li

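TL;DR: BrainNormalizer is an anatomy-informed diffusion framework that uses an edge-guided ControlNet to reconstruct pseudo-healthy brain images from tumor MRI without paired data.

Details

Motivation: Brain tumors deform the surrounding anatomy, and a tumor-free reference image of the same subject cannot be obtained in clinical practice, which complicates diagnosis and treatment planning.

Result: On the BraTS2020 dataset, the method achieves strong quantitative and qualitative performance, producing anatomically plausible and structurally coherent reconstructions.

Insight: With edge guidance and anatomical constraints, BrainNormalizer provides reference images for clinical use and supports analysis of tumor-induced deformation.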

Abstract: Brain tumors are among the most clinically significant neurological diseases and remain a major cause of morbidity and mortality due to their aggressive growth and structural heterogeneity. As tumors expand, they induce substantial anatomical deformation that disrupts both local tissue organization and global brain architecture, complicating diagnosis, treatment planning, and surgical navigation. Yet a subject-specific reference of how the brain would appear without tumor-induced changes is fundamentally unobtainable in clinical practice. We present BrainNormalizer, an anatomy-informed diffusion framework that reconstructs pseudo-healthy MRIs directly from tumorous scans by conditioning the generative process on boundary cues extracted from the subject’s own anatomy. This boundary-guided conditioning enables anatomically plausible pseudo-healthy reconstruction without requiring paired non-tumorous and tumorous scans. BrainNormalizer employs a two-stage training strategy. The pretrained diffusion model is first adapted through inpainting-based fine-tuning on tumorous and non-tumorous scans. Next, an edge-map-guided ControlNet branch is trained to inject fine-grained anatomical contours into the frozen decoder while preserving learned priors. During inference, a deliberate misalignment strategy pairs tumorous inputs with non-tumorous prompts and mirrored contralateral edge maps, leveraging hemispheric correspondence to guide reconstruction. On the BraTS2020 dataset, BrainNormalizer achieves strong quantitative performance and qualitatively produces anatomically plausible reconstructions in tumor-affected regions while retaining overall structural coherence. BrainNormalizer provides clinically reliable anatomical references for treatment planning and supports new research directions in counterfactual modeling and tumor-induced deformation analysis.

[297] Inertia-Informed Orientation Priors for Event-Based Optical Flow Estimation eess.IV | cs.CV

Pritam P. Karmokar, William J. Beksi

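TL;DR: The paper proposes a hybrid contrast maximization (CM) method that combines visual and inertial motion cues for event-based optical flow estimation. Using orientation maps as priors improves robustness and convergence.

Details

Motivation: Event cameras directly encode scene motion by virtue of their working principle, but the temporally dense and spatially sparse nature of events makes optical flow estimation difficult. Existing CM methods remain highly non-convex optimization problems, which the authors aim to ease by introducing inertial information.

Result: Experiments on the MVSEC, DSEC, and ECD datasets show that the method surpasses the state of the art in optical flow accuracy.

Insight: Directional priors derived from inertial information can substantially improve the robustness and convergence of event-based optical flow estimation.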

Abstract: Event cameras, by virtue of their working principle, directly encode motion within a scene. Many learning-based and model-based methods estimate event-based optical flow; however, the temporally dense yet spatially sparse nature of events poses significant challenges. To address these issues, contrast maximization (CM) is a prominent model-based optimization methodology that estimates the motion trajectories of events within an event volume by optimally warping them. Since its introduction, the CM framework has undergone a series of refinements by the computer vision community. Nonetheless, it remains a highly non-convex optimization problem. In this paper, we introduce a novel biologically-inspired hybrid CM method for event-based optical flow estimation that couples visual and inertial motion cues. Concretely, we propose the use of orientation maps, derived from camera 3D velocities, as priors to guide the CM process. The orientation maps provide directional guidance and constrain the space of estimated motion trajectories. We show that this orientation-guided formulation leads to improved robustness and convergence in event-based optical flow estimation. The evaluation of our approach on the MVSEC, DSEC, and ECD datasets yields superior accuracy scores over the state of the art.
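
The contrast maximization idea in this abstract can be illustrated with a minimal sketch. It assumes a single global flow vector, uses the variance of the image of warped events as the contrast measure, and stands in for the paper's orientation maps with one hypothetical prior angle (`prior_angle`, `max_dev`) that prunes a grid search. The function names and the grid search are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def contrast(events, flow, shape, t_ref=0.0):
    """Variance of the image of warped events (IWE) for one global flow.

    events: array with rows (x, y, t); flow: (vx, vy) in pixels per second.
    """
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    # Warp each event back to the reference time along the candidate flow.
    xw = np.rint(x - (t - t_ref) * flow[0]).astype(int)
    yw = np.rint(y - (t - t_ref) * flow[1]).astype(int)
    inside = (xw >= 0) & (xw < shape[1]) & (yw >= 0) & (yw < shape[0])
    iwe = np.zeros(shape)
    np.add.at(iwe, (yw[inside], xw[inside]), 1.0)
    return iwe.var()

def best_flow(events, shape, speeds, prior_angle=None,
              max_dev=np.pi / 4, n_angles=16):
    """Grid search over flows, optionally restricted by an orientation prior.

    prior_angle and max_dev are hypothetical stand-ins for an orientation map.
    """
    best, best_score = None, -np.inf
    for angle in np.linspace(-np.pi, np.pi, n_angles, endpoint=False):
        if prior_angle is not None:
            # Smallest signed angular difference to the prior direction.
            diff = np.angle(np.exp(1j * (angle - prior_angle)))
            if abs(diff) > max_dev:
                continue
        for s in speeds:
            flow = (s * np.cos(angle), s * np.sin(angle))
            score = contrast(events, flow, shape)
            if score > best_score:
                best, best_score = flow, score
    return best, best_score
```

Restricting the candidate directions shrinks the search space, which is one simple way to picture how a directional prior could help a non-convex objective.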

cs.IR [Back]

[298] Attention Grounded Enhancement for Visual Document Retrieval cs.IR | cs.CL | cs.CV

Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu, Meiguang Jin

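TL;DR: The paper proposes AGREE, a framework that uses cross-modal attention as a local supervision signal to improve visual document retrieval models, helping them identify relevant document regions more accurately and thereby improving retrieval performance.

Details

Motivation: Existing visual document retrieval methods are trained mainly on global relevance labels and lack supervision on the specific regions that support a match. As a result, models struggle to capture implicit semantic connections, which limits their ability to handle non-extractive queries.

Result: On the ViDoRe V2 benchmark, AGREE significantly outperforms a baseline that relies on global supervision alone. Quantitative and qualitative analyses show better alignment between queries and document regions, yielding more accurate and interpretable retrieval.

Insight: Local supervision signals can effectively improve visual document retrieval, especially for non-extractive queries, where the model moves from surface-level matching toward deeper semantic alignment.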

Abstract: Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.

cs.AR [Back]

[299] TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space cs.AR | cs.CV

Wenxuan Miao, Yulin Sun, Aiyue Chen, Jing Lin, Yiwu Yao

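TL;DR: TIMERIPPLE proposes a lightweight, adaptive reuse strategy based on spatio-temporal correlations in the latent space, substantially accelerating self-attention in video diffusion transformers while preserving video quality.

Details

Motivation: Existing video generation models based on vDiT suffer from high inference latency because of self-attention. Prior work accelerates inference by reducing redundant self-attention computation but overlooks the spatio-temporal correlations inherent in video streams. This paper aims to optimize self-attention by exploiting those correlations in the latent space.

Result: The method achieves 85% computational savings across 4 vDiTs with almost no loss in video quality (<0.06% loss on VBench).

Insight: Redundancy in self-attention can be exploited through the spatio-temporal correlations in video streams, without complex sparsity-pattern design; correlation analysis in the latent space offers a new route to efficient computation.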

Abstract: The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT); however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly leverage sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) compared to state-of-the-art techniques over 4 vDiTs, while preserving almost identical video quality (<0.06% loss on VBench).

[300] Neo: Real-Time On-Device 3D Gaussian Splatting with Reuse-and-Update Sorting Acceleration cs.AR | cs.CV

Changhun Oh, Seongryong Oh, Jinwoo Hwang, Yoonsung Kim, Hardik Sharma

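TL;DR: Neo proposes a reuse-and-update sorting algorithm and a hardware accelerator that substantially reduce computation and memory bandwidth pressure in 3D Gaussian Splatting rendering, enabling real-time, high-quality rendering on resource-constrained devices.

Details

Motivation: Existing 3D Gaussian Splatting (3DGS) rendering methods struggle to reach high frame rates on resource-constrained devices, especially at high resolution. The high memory bandwidth demand of the sorting stage is the main bottleneck.

Result: Experiments show that Neo achieves up to 10.0x and 5.6x higher throughput than state-of-the-art edge GPU and ASIC solutions, respectively, while reducing DRAM traffic by 94.5% and 81.3%.

Insight: Exploiting temporal redundancy to optimize sorting is an effective way to improve real-time 3D rendering performance, especially on resource-constrained devices.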

Abstract: 3D Gaussian Splatting (3DGS) rendering in real-time on resource-constrained devices is essential for delivering immersive augmented and virtual reality (AR/VR) experiences. However, existing solutions struggle to achieve high frame rates, especially for high-resolution rendering. Our analysis identifies the sorting stage in the 3DGS rendering pipeline as the major bottleneck due to its high memory bandwidth demand. This paper presents Neo, which introduces a reuse-and-update sorting algorithm that exploits temporal redundancy in Gaussian ordering across consecutive frames, and devises a hardware accelerator optimized for this algorithm. By efficiently tracking and updating Gaussian depth ordering instead of re-sorting from scratch, Neo significantly reduces redundant computations and memory bandwidth pressure. Experimental results show that Neo achieves up to 10.0x and 5.6x higher throughput than state-of-the-art edge GPU and ASIC solutions, respectively, while reducing DRAM traffic by 94.5% and 81.3%. These improvements make high-quality and low-latency on-device 3D rendering more practical.
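
The reuse-and-update idea, keeping the previous frame's depth order and repairing it rather than sorting from scratch, can be sketched with a plain insertion sort. This is a stand-in chosen for illustration: the abstract does not describe Neo's actual algorithm or hardware, and `reuse_and_update_order` is a hypothetical name.

```python
def reuse_and_update_order(prev_order, depths):
    """Re-sort Gaussian indices by depth, starting from last frame's order.

    Insertion sort costs O(n + k), where k is the number of inversions, so a
    nearly sorted starting order needs only a few element moves.
    Returns the new order and the number of shifts performed.
    """
    order = list(prev_order)
    shifts = 0
    for i in range(1, len(order)):
        idx = order[i]
        j = i - 1
        # Move entries that are now farther away one slot to the right.
        while j >= 0 and depths[order[j]] > depths[idx]:
            order[j + 1] = order[j]
            j -= 1
            shifts += 1
        order[j + 1] = idx
    return order, shifts
```

When the camera moves little between frames, most Gaussians keep their relative order, so the shift count stays far below the cost of a full sort.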

[301] QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention cs.AR | cs.CV | cs.LG

Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang

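TL;DR: QUILL is an algorithm-architecture co-design that turns deformable attention into cache-friendly, single-pass work, substantially improving hardware efficiency and performance.

Details

Motivation: Deformable transformers perform well on detection tasks, but their irregular memory access and low arithmetic intensity make them map poorly to hardware. QUILL aims to solve this problem.

Result: QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency.

Insight: By converting sparsity into locality, and locality into utilization, QUILL delivers end-to-end performance gains.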

Abstract: Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer, forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W’’m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within <=0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality, and locality into utilization, QUILL delivers consistent, end-to-end speedups.

cs.CY [Back]

[302] Automatic generation of DRI Statements cs.CY | cs.CL

Maurice Flechtner

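TL;DR: The paper proposes an automated method for generating DRI (Deliberative Reason Index) statements, using natural language processing (NLP) and large language models (LLMs) to substantially reduce manual effort and providing a replicable template for social science research.

Details

Motivation: The traditional DRI statement generation process is complex and time-consuming, which limits its use in assessing the quality of group deliberation.

Result: The method substantially reduces manual intervention and makes statement generation more efficient.

Insight: The work demonstrates the potential of generative AI in social science research methods and offers a reference for similar studies.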

Abstract: Assessing the quality of group deliberation is essential for improving our understanding of deliberative processes. The Deliberative Reason Index (DRI) offers a sophisticated metric for evaluating group reasoning, but its implementation has been constrained by the complex and time-consuming process of statement generation. This thesis introduces an innovative, automated approach to DRI statement generation that leverages advanced natural language processing (NLP) and large language models (LLMs) to substantially reduce the human effort involved in survey preparation. Key contributions are a systematic framework for automated DRI statement generation and a methodological innovation that significantly lowers the barrier to conducting comprehensive deliberative process assessments. In addition, the findings provide a replicable template for integrating generative artificial intelligence into social science research methodologies.

astro-ph.IM [Back]

[303] Towards Mitigating Systematics in Large-Scale Surveys via Few-Shot Optimal Transport-Based Feature Alignment astro-ph.IM | cs.CV | cs.LG

Sultan Hassan, Sambatra Andrianomena, Benjamin D. Wandelt

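TL;DR: The paper proposes a few-shot, optimal-transport-based feature alignment method to reduce the impact of systematics in large-scale surveys on pre-trained models.

Details

Motivation: Systematics contaminate observables and cause distribution shifts, which undermines the effectiveness of pre-trained models in labeling such observables. Because systematics are hard to model and remove directly, an indirect way of adjusting the feature distribution is needed.

Result: Experiments show that optimal transport effectively aligns OOD features with few samples, which suits realistic settings for extracting information from large-scale surveys.

Insight: As an alignment loss, optimal transport is advantageous for handling distribution shift, particularly when the correspondence between ID and OOD samples is unknown and data are limited.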

Abstract: Systematics contaminate observables, leading to distribution shifts relative to theoretically simulated signals, posing a major challenge for using pre-trained models to label such observables. Since systematics are often poorly understood and difficult to model, removing them directly and entirely may not be feasible. To address this challenge, we propose a novel method that aligns learned features between in-distribution (ID) and out-of-distribution (OOD) samples by optimizing a feature-alignment loss on the representations extracted from a pre-trained ID model. We first experimentally validate the method on the MNIST dataset using possible alignment losses, including mean squared error and optimal transport, and subsequently apply it to large-scale maps of neutral hydrogen. Our results show that optimal transport is particularly effective at aligning OOD features when parity between ID and OOD samples is unknown, even with limited data, mimicking real-world conditions in extracting information from large-scale surveys. Our code is available at https://github.com/sultan-hassan/feature-alignment-for-OOD-generalization.

cs.RO [Back]

[304] Tactile Data Recording System for Clothing with Motion-Controlled Robotic Sliding cs.RO | cs.CV | cs.HC | cs.LG | cs.MM

Michikuni Eguchi, Takekazu Kitagishi, Yuichi Hiroi, Takefumi Hiraki

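TL;DR: The paper proposes a robotic-arm-based system for collecting tactile data from clothing. By precisely controlling speed and direction, it builds a motion-labeled, multimodal tactile database and improves the identification accuracy of clothing tactile features.

Details

Motivation: The tactile sensation of clothing is critical to wearer comfort, but tools for systematically collecting tactile data, particularly during sliding motion, are lacking.

Result: Experiments show that adding motion-related parameters improves the identification accuracy of machine learning models for audio and acceleration data.

Insight: Motion-related labels play an important role in characterizing the tactile sensation of clothing, and the system provides a new tool for future research on fabric perception and reproduction.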

Abstract: The tactile sensation of clothing is critical to wearer comfort. To reveal physical properties that make clothing comfortable, systematic collection of tactile data during sliding motion is required. We propose a robotic arm-based system for collecting tactile data from intact garments. The system performs stroking measurements with a simulated fingertip while precisely controlling speed and direction, enabling creation of motion-labeled, multimodal tactile databases. Machine learning evaluation showed that including motion-related parameters improved identification accuracy for audio and acceleration data, demonstrating the efficacy of motion-related labels for characterizing clothing tactile sensation. This system provides a scalable, non-destructive method for capturing tactile data of clothing, contributing to future studies on fabric perception and reproduction.

[305] Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy: A Review cs.RO | cs.CV

Vinit Mehta, Charu Sharma, Karthick Thiyagarajan

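TL;DR: This review examines the integration of large language models (LLMs) with 3D vision for robotic perception and autonomy, analyzing methods, applications, challenges, and future research directions.

Details

Motivation: Rapid progress in artificial intelligence and robotics has made the combination of LLMs and 3D vision a key route to enhancing robotic perception.

Result: The review demonstrates the potential of LLMs combined with 3D vision for robotic perception and proposes future research directions.

Insight: Combining LLMs with 3D vision gives robots joint natural language and spatial understanding, driving the development of more intelligent autonomous systems.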

Abstract: With the rapid advancement of artificial intelligence and robotics, the integration of Large Language Models (LLMs) with 3D vision is emerging as a transformative approach to enhancing robotic sensing technologies. This convergence enables machines to perceive, reason and interact with complex environments through natural language and spatial understanding, bridging the gap between linguistic intelligence and spatial perception. This review provides a comprehensive analysis of state-of-the-art methodologies, applications and challenges at the intersection of LLMs and 3D vision, with a focus on next-generation robotic sensing technologies. We first introduce the foundational principles of LLMs and 3D data representations, followed by an in-depth examination of 3D sensing technologies critical for robotics. The review then explores key advancements in scene understanding, text-to-3D generation, object grounding and embodied agents, highlighting cutting-edge techniques such as zero-shot 3D segmentation, dynamic scene synthesis and language-guided manipulation. Furthermore, we discuss multimodal LLMs that integrate 3D data with touch, auditory and thermal inputs, enhancing environmental comprehension and robotic decision-making. To support future research, we catalog benchmark datasets and evaluation metrics tailored for 3D-language and vision tasks. Finally, we identify key challenges and future research directions, including adaptive model architectures, enhanced cross-modal alignment and real-time processing capabilities, which pave the way for more intelligent, context-aware and autonomous robotic sensing systems.

Cheng Peng, Zhenzhe Zhang, Cheng Chi, Xiaobao Wei, Yanhao Zhang

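TL;DR: PIGEON proposes a vision-language model (VLM)-based object navigation method that makes efficient decisions by selecting Points of Interest (PoI). It achieves state-of-the-art performance with zero-shot transfer, with further gains from Reinforcement Learning with Verifiable Reward (RLVR).

Details

Motivation: Existing object navigation methods struggle to balance decision frequency with intelligence, resulting in decisions that lack foresight or actions that are discontinuous.

Result: The method achieves SOTA performance on classic object navigation benchmarks, and RLVR further strengthens semantic reasoning during real-time navigation.

Insight: Combining a VLM with PoI selection is an effective way to improve decision efficiency and semantic alignment, and PoI-based decisions enable the generation of RLVR data suitable for simulators.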

Abstract: Navigating to a specified object in an unknown environment is a fundamental yet challenging capability of embodied intelligence. However, current methods struggle to balance decision frequency with intelligence, resulting in decisions lacking foresight or discontinuous actions. In this work, we propose PIGEON: Point of Interest Guided Exploration for Object Navigation with VLM, maintaining a lightweight and semantically aligned snapshot memory during exploration as semantic input for the exploration strategy. We use a large Visual-Language Model (VLM), named PIGEON-VL, to select Points of Interest (PoI) formed during exploration and then employ a lower-level planner for action output, increasing the decision frequency. Additionally, this PoI-based decision-making enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments on classic object navigation benchmarks demonstrate that our zero-shot transfer method achieves state-of-the-art performance, while RLVR further enhances the model’s semantic guidance capabilities, enabling deep reasoning during real-time navigation.

stat.AP [Back]

[307] Scalable Vision-Guided Crop Yield Estimation stat.AP | cs.CV

Harrison H. Li, Medhanie Irgau, Nabil Janmohamed, Karen Solveig Rieckmann, David B. Lobell

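TL;DR: The paper proposes a method based on prediction-powered inference (PPI) that combines field photos with a small amount of ground-truth data (such as crop cuts) to improve the precision and uncertainty quantification of crop yield estimates, especially in data-scarce regions.

Details

Motivation: Existing crop yield data collection methods, such as crop cuts, are time-consuming and costly, which limits the efficiency of large-scale agricultural monitoring and decision making. Low-cost field photos can supplement ground-truth measurements, improving precision while reducing cost.

Result: On rice and maize field data from sub-Saharan Africa, the PPI method improves estimation precision, increasing the effective sample size by as much as 73% for rice and by 12-23% for maize. Even in zones with as few as 20 fields, it shows significant improvement over the baseline.

Insight: Low-cost field photos can supplement ground-truth measurements to improve the precision and reliability of crop yield estimates, supporting agricultural insurance and investment in sustainable agriculture. The approach is especially valuable in data-scarce regions.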

Abstract: Precise estimation and uncertainty quantification for average crop yields are critical for agricultural monitoring and decision making. Existing data collection methods, such as crop cuts in randomly sampled fields at harvest time, are relatively time-consuming. Thus, we propose an approach based on prediction-powered inference (PPI) to supplement these crop cuts with less time-consuming field photos. After training a computer vision model to predict the ground truth crop cut yields from the photos, we learn a “control function” that recalibrates these predictions with the spatial coordinates of each field. This enables fields with photos but not crop cuts to be leveraged to improve the precision of zone-wide average yield estimates. Our control function is learned by training on a dataset of nearly 20,000 real crop cuts and photos of rice and maize fields in sub-Saharan Africa. To improve precision, we pool training observations across different zones within the same first-level subdivision of each country. Our final PPI-based point estimates of the average yield are provably asymptotically unbiased and cannot increase the asymptotic variance beyond that of the natural baseline estimator – the sample average of the crop cuts – as the number of fields grows. We also propose a novel bias-corrected and accelerated (BCa) bootstrap to construct accompanying confidence intervals. Even in zones with as few as 20 fields, the point estimates show significant empirical improvement over the baseline, increasing the effective sample size by as much as 73% for rice and by 12-23% for maize. The confidence intervals are accordingly shorter at minimal cost to empirical finite-sample coverage. This demonstrates the potential for relatively low-cost images to make area-based crop insurance more affordable and thus spur investment into sustainable agricultural practices.
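
The core of prediction-powered inference for a mean can be sketched in a few lines: average the model predictions on fields that only have photos, then correct them with the average prediction error measured on fields that also have crop cuts. The sketch below is the basic PPI mean estimator under the assumption that the two sets of fields are independent samples. It omits the paper's control function, pooling across zones, and BCa bootstrap, and the function name is hypothetical.

```python
import numpy as np

def ppi_mean(y_labeled, pred_labeled, pred_unlabeled):
    """Prediction-powered estimate of a mean outcome and its standard error.

    y_labeled:      measured yields on fields with crop cuts
    pred_labeled:   model predictions on those same fields
    pred_unlabeled: model predictions on fields with photos only
    """
    y = np.asarray(y_labeled, dtype=float)
    f_lab = np.asarray(pred_labeled, dtype=float)
    f_unl = np.asarray(pred_unlabeled, dtype=float)
    rectifier = y - f_lab  # measured bias of the predictions
    estimate = f_unl.mean() + rectifier.mean()
    # The two samples are assumed independent, so their variances add.
    se = np.sqrt(f_unl.var(ddof=1) / f_unl.size
                 + rectifier.var(ddof=1) / y.size)
    return estimate, se
```

The rectifier term is what keeps the estimate unbiased when the vision model is systematically off; the better the predictions track the crop cuts, the smaller its variance and the larger the gain over the plain sample average.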

stat.CO [Back]

[308] Bregman geometry-aware split Gibbs sampling for Bayesian Poisson inverse problems stat.CO | cs.CV | eess.IV | stat.ML

Elhadji Cisse Faye, Mame Diarra Fall, Nicolas Dobigeon, Eric Barat

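TL;DR: The paper proposes a Bregman geometry-aware Bayesian framework that solves Poisson inverse problems with a Monte Carlo sampling algorithm. It uses a Bregman divergence based on the Burg entropy to enable efficient sampling, and experiments show reconstruction quality competitive with optimization- and sampling-based methods.

Details

Motivation: Poisson inverse problems are hard to solve efficiently with traditional methods because of challenges such as non-Lipschitz gradients and positivity constraints. The paper aims to provide a geometry-aware Bayesian framework that handles these difficulties better.

Result: Experiments on denoising, deblurring, and positron emission tomography (PET) show that the method is competitive in reconstruction quality with existing optimization- and sampling-based approaches.

Insight: A geometry-aware Bayesian framework built on Bregman divergences can solve Poisson inverse problems more efficiently while preserving the intrinsic structure of the problem.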

Abstract: This paper proposes a novel Bayesian framework for solving Poisson inverse problems by devising a Monte Carlo sampling algorithm which accounts for the underlying non-Euclidean geometry. To address the challenges posed by the Poisson likelihood – such as non-Lipschitz gradients and positivity constraints – we derive a Bayesian model which leverages exact and asymptotically exact data augmentations. In particular, the augmented model incorporates two sets of splitting variables both derived through a Bregman divergence based on the Burg entropy. Interestingly, the resulting augmented posterior distribution is characterized by conditional distributions which benefit from natural conjugacy properties and preserve the intrinsic geometry of the latent and splitting variables. This allows for efficient sampling via Gibbs steps, which can be performed explicitly for all conditionals, except the one incorporating the regularization potential. For the latter, we resort to a Hessian Riemannian Langevin Monte Carlo (HRLMC) algorithm which is well suited to handle priors with explicit or easily computable score functions. By operating on a mirror manifold, this Langevin step ensures that the sampling satisfies the positivity constraints and more accurately reflects the underlying problem structure. Performance results obtained on denoising, deblurring, and positron emission tomography (PET) experiments demonstrate that the method achieves competitive performance in terms of reconstruction quality compared to optimization- and sampling-based approaches.
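
For reference, the Bregman divergence generated by the Burg entropy has a standard closed form, also known as the Itakura-Saito divergence. The derivation below uses the textbook definitions for positive vectors $x, y \in \mathbb{R}^n_{>0}$; how the paper scales or combines this divergence in its splitting terms is not stated in the abstract, so the notation here is an assumption.

```latex
\[
\phi(x) = -\sum_{i=1}^{n} \log x_i ,
\qquad
\nabla \phi(y)_i = -\frac{1}{y_i} ,
\]
\[
D_\phi(x, y)
  = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle
  = \sum_{i=1}^{n} \left( \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \right) .
\]
```

The divergence is finite only for strictly positive arguments and grows without bound as any $x_i$ approaches zero, which is consistent with the positivity constraints the abstract mentions.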

cs.CR [Back]

[309] ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models cs.CR | cs.AI | cs.CL

Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang

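TL;DR: ForgeDAN is a novel evolutionary framework that generates semantically coherent and effective adversarial prompts to bypass the safety safeguards of aligned large language models, using multi-strategy textual perturbations, semantic fitness evaluation, and dual-dimensional jailbreak judgment.

Details

Motivation: Existing automated jailbreak generation methods, such as AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. ForgeDAN aims to address these limitations.

Result: ForgeDAN achieves high success rates when generating adversarial prompts while maintaining naturalness and stealth, outperforming existing SOTA methods.

Insight: By combining semantic coherence with effective attack strategies, an evolutionary framework can bypass the safeguards of aligned LLMs more effectively.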

Abstract: The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches, e.g., AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across character, word, and sentence-level operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.

[310] AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models cs.CR | cs.AI | cs.CV

Jiayu Li, Yunhan Zhao, Xiang Zheng, Zonghuan Xu, Yige Li

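TL;DR: The paper proposes AttackVLA, a framework for unified evaluation of adversarial and backdoor attacks on Vision-Language-Action (VLA) models. It addresses the gap in targeted attacks left by existing methods and introduces BackdoorVLA, which achieves precise attacks with high success rates.

Details

Motivation: The multimodal integration in VLA models introduces new safety vulnerabilities, but a unified evaluation framework is missing and the effectiveness of existing attacks, especially in real-world scenarios, has not been validated.

Result: BackdoorVLA achieves an average targeted attack success rate of 58.4% and reaches 100% on selected tasks.

Insight: Existing attacks mostly induce untargeted failures or static action states, while the threat of targeted attacks, especially over long-horizon action sequences, is underestimated. Future work should strengthen research on securing VLA systems.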

Abstract: Vision-Language-Action (VLA) models enable robots to interpret natural-language instructions and perform diverse tasks, yet their integration of perception, language, and control introduces new safety vulnerabilities. Despite growing interest in attacking such models, the effectiveness of existing techniques remains unclear due to the absence of a unified evaluation framework. One major issue is that differences in action tokenizers across VLA architectures hinder reproducibility and fair comparison. More importantly, most existing attacks have not been validated in real-world scenarios. To address these challenges, we propose AttackVLA, a unified framework that aligns with the VLA development lifecycle, covering data construction, model training, and inference. Within this framework, we implement a broad suite of attacks, including all existing attacks targeting VLAs and multiple adapted attacks originally developed for vision-language models, and evaluate them in both simulation and real-world settings. Our analysis of existing attacks reveals a critical gap: current methods tend to induce untargeted failures or static action states, leaving targeted attacks that drive VLAs to perform precise long-horizon action sequences largely unexplored. To fill this gap, we introduce BackdoorVLA, a targeted backdoor attack that compels a VLA to execute an attacker-specified long-horizon action sequence whenever a trigger is present. We evaluate BackdoorVLA in both simulated benchmarks and real-world robotic settings, achieving an average targeted success rate of 58.4% and reaching 100% on selected tasks. Our work provides a standardized framework for evaluating VLA vulnerabilities and demonstrates the potential for precise adversarial manipulation, motivating further research on securing VLA-based embodied systems.

[311] SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization cs.CR | cs.CV

Xuankun Rong, Wenke Huang, Tingfeng Wang, Daiguo Zhou, Bo Du

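TL;DR: SafeGRPO is a self-rewarded multimodal safety alignment framework that improves the safety reasoning of multimodal large language models through rule-governed reward construction.

Details

Motivation: Multimodal large language models (MLLMs) perform well at reasoning and instruction following, but their expanded modality space introduces compositional safety risks arising from text-image interactions. The fragile safety awareness of current models needs improvement.

Result: SafeGRPO substantially improves multimodal safety awareness, compositional robustness, and reasoning stability across diverse benchmarks without sacrificing general capabilities.

Insight: Rule-governed reward mechanisms can effectively improve the safety of multimodal models, and structured reasoning can strengthen safety awareness without hurting performance.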

Abstract: Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities, yet their expanded modality space introduces new compositional safety risks that emerge from complex text-image interactions. Such cross-modal couplings can produce unsafe semantics even when individual inputs are benign, exposing the fragile safety awareness of current MLLMs. While recent works enhance safety by guiding models to reason about potential risks, unregulated reasoning traces may compromise alignment; although Group Relative Policy Optimization (GRPO) offers self-rewarded refinement without human supervision, it lacks verifiable signals for reasoning safety. To address this, we propose SafeGRPO, a self-rewarded multimodal safety alignment framework that integrates rule-governed reward construction into GRPO, enabling interpretable and verifiable optimization of reasoning safety. Built upon the constructed SafeTag-VL-3K dataset with explicit visual, textual, and combined safety tags, SafeGRPO performs step-guided safety thinking to enforce structured reasoning and behavior alignment, substantially improving multimodal safety awareness, compositional robustness, and reasoning stability across diverse benchmarks without sacrificing general capabilities.

cs.AI [Back]

[312] CLINB: A Climate Intelligence Benchmark for Foundational Models cs.AI | cs.CL

Michelle Chen Huebscher, Katharine Mach, Aleksandar Stanić, Markus Leippold, Ben Gaiarin

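TL;DR: The paper introduces the CLINB benchmark for evaluating how large language models (LLMs) handle knowledge in the climate change domain. It finds that frontier models excel at knowledge synthesis but fall short on evidential support.

Details

Motivation: Reliable methods for evaluating how LLMs handle complex, specialized knowledge are lacking, and climate change, as a representative case, needs a standardized benchmark.

Result: Frontier models show PhD-level knowledge synthesis, but the quality of evidential support (such as references and images) varies, with substantial hallucination rates.

Insight: The gap between knowledge synthesis and verifiable evidential support is a key challenge for deploying AI in scientific workflows, and CLINB offers an important tool for building trustworthy AI.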

Abstract: Evaluating how Large Language Models (LLMs) handle complex, specialized knowledge remains a critical challenge. We address this through the lens of climate change by introducing CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks with clear requirements for knowledge quality and evidential support. CLINB relies on a dataset of real users’ questions and evaluation rubrics curated by leading climate scientists. We implement and validate a model-based evaluation process and evaluate several frontier models. Our findings reveal a critical dichotomy. Frontier models demonstrate remarkable knowledge synthesis capabilities, often exhibiting PhD-level understanding and presentation quality. They outperform “hybrid” answers curated by domain experts assisted by weaker models. However, this performance is countered by failures in grounding. The quality of evidence varies, with substantial hallucination rates for references and images. We argue that bridging this gap between knowledge synthesis and verifiable attribution is essential for the deployment of AI in scientific workflows and that reliable, interpretable benchmarks like CLINB are needed to progress towards building trustworthy AI systems.

[313] Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy cs.AI | cs.CL | cs.LO

Andrea Brunello, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

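TL;DR: The paper proposes a new evaluation approach to reveal the true capabilities of LLMs in natural language to first-order logic (NL-FOL) translation, and shows experimentally that dialogue-oriented LLMs perform strongly on the task.

Details

Motivation: Existing evaluation protocols and datasets may underestimate the actual capabilities of large language models (LLMs) in NL-FOL translation. A more accurate evaluation method is needed to distinguish genuine logical understanding from superficial pattern recognition.

Result: Experiments show that dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.

Insight: Existing evaluation methods may underestimate LLM capabilities, especially on tasks that require deep logical understanding; future evaluation design should place more emphasis on semantic-level analysis.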

Abstract: Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs’ actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.

[314] WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance cs.AI | cs.CL

Genglin Liu, Shijie Geng, Sha Li, Hejie Cui, Sarah Zhang

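TL;DR: The paper proposes WebCoach, a model-agnostic self-evolving framework that uses cross-session memory to improve the long-term planning and learning of web browsing agents, raising task success rates.

Details

Motivation: Existing web browsing agents tend to repeat errors and cannot learn from past experience across sessions, which limits their long-term robustness and sample efficiency.

Result: On the WebVoyager benchmark with a 38B model, WebCoach raises the task success rate from 47% to 61% while reducing or maintaining the average number of steps.

Insight: Cross-session memory and self-evolution improve agents' long-term performance; smaller base models with WebCoach reach performance comparable to the same web agent using GPT-4o.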

Abstract: Multimodal LLM-powered agents have recently demonstrated impressive capabilities in web navigation, enabling agents to complete complex browsing tasks across diverse domains. However, current agents struggle with repetitive errors and lack the ability to learn from past experiences across sessions, limiting their long-term robustness and sample efficiency. We introduce WebCoach, a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining. WebCoach consists of three key components: (1) a WebCondenser, which standardizes raw navigation logs into concise summaries; (2) an External Memory Store, which organizes complete trajectories as episodic experiences; and (3) a Coach, which retrieves relevant experiences based on similarity and recency, and decides whether to inject task-specific advice into the agent via runtime hooks. This design empowers web agents to access long-term memory beyond their native context window, improving robustness in complex browsing tasks. Moreover, WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories, enabling agents to improve over time without retraining. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents across three different LLM backbones. With a 38B model, it increases task success rates from 47% to 61% while reducing or maintaining the average number of steps. Notably, smaller base models with WebCoach achieve performance comparable to the same web agent using GPT-4o.
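
The Coach component is described as retrieving experiences "based on similarity and recency" and then deciding whether to inject advice. A minimal sketch of that kind of retrieval follows. The scoring rule (cosine similarity multiplied by an exponential recency decay), the `half_life` and `min_score` parameters, and the episode fields are all hypothetical choices made for illustration; the abstract does not specify WebCoach's actual scoring.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(memory, query_vec, now, k=2, half_life=7.0, min_score=0.3):
    """Rank stored episodes by similarity times a recency decay.

    memory: list of dicts with hypothetical keys 'vec', 'day', 'advice'.
    Episodes scoring below min_score are dropped, so the result may be
    empty, in which case no advice would be injected.
    """
    scored = []
    for ep in memory:
        age = now - ep["day"]
        recency = 0.5 ** (age / half_life)  # halves every half_life days
        score = cosine(query_vec, ep["vec"]) * recency
        if score >= min_score:
            scored.append((score, ep["advice"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [advice for _, advice in scored[:k]]
```

The threshold plays the role of the "decide whether to inject" step: irrelevant or stale episodes are filtered out instead of being passed to the agent.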

[315] STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization cs.AI | cs.CL | cs.LGPDF

Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan

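TL;DR: STEP is a success-rate-aware, trajectory-efficient policy optimization method that dynamically allocates sampling effort and optimizes at the step level, substantially improving the efficiency and stability of online reinforcement learning.

Details

Motivation: Multi-turn interaction remains challenging for online reinforcement learning. The common trajectory-level optimization is inefficient and can produce misleading learning signals: it samples tasks uniformly, penalizes correct actions inside failed trajectories, and incurs high sampling costs.

Result: Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability, converging faster and generalizing better under the same sampling budget.

Insight: The success-rate-aware mechanism identifies task difficulty and allocates resources dynamically, while step-level optimization avoids the redundancy and misleading signals of trajectory-level optimization, which improves performance.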

Abstract: Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
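
The abstract describes a smoothed per-task success-rate record that steers more sampling toward harder tasks. A minimal sketch of that idea follows, assuming an exponential moving average for the smoothing and an allocation proportional to each task's failure rate. Both choices, along with the function names and the `alpha` and `floor` parameters, are illustrative assumptions rather than the paper's actual formulas.

```python
def update_success_rate(record, task, succeeded, alpha=0.3):
    """Exponential moving average of a task's success rate (prior of 0.5 if unseen)."""
    prev = record.get(task, 0.5)
    record[task] = (1 - alpha) * prev + alpha * (1.0 if succeeded else 0.0)
    return record[task]


def allocate_rollouts(record, budget, floor=1):
    """Split a rollout budget across tasks in proportion to their failure rate."""
    weights = {task: 1.0 - rate for task, rate in record.items()}
    total = sum(weights.values())
    if total == 0:
        # Every task is already solved; fall back to the minimum allocation.
        return {task: floor for task in record}
    # Rounding and the floor mean the counts may not sum exactly to the budget.
    return {
        task: max(floor, round(budget * w / total))
        for task, w in weights.items()
    }
```

The floor keeps easy tasks from being dropped entirely, so their success-rate estimates continue to be refreshed.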

[316] Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment cs.AI | cs.CL | cs.CYPDF

Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park, Meeyoung Cha

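TL;DR: The paper studies the uncertainty of large language models (LLMs) in moral dilemmas. It introduces "dropout" at inference time to quantify and modulate a model's moral uncertainty, which improves agreement between humans and LLMs on moral decisions.

Details

Motivation: Humans show significant uncertainty when facing moral dilemmas, whereas prior work shows that LLM responses tend to be overconfident. As AI systems are increasingly used in ethical decision-making scenarios, understanding their moral reasoning and uncertainty is essential for building reliable AI systems.

Result: The dropout mechanism significantly increases total entropy, mainly through a rise in mutual information, while conditional entropy stays largely unchanged. It also significantly improves human-LLM moral alignment.

Insight: Deliberately modulating an LLM's moral uncertainty can bring its decisions closer to human preferences, particularly by reducing overconfidence in morally complex scenarios.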

Abstract: Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via “dropout” at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs’ confidence in morally complex scenarios.
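
The abstract relates total entropy, conditional entropy, and mutual information without stating the formula. A common decomposition for stochastic forward passes (for example with dropout enabled at inference) is total = conditional + mutual information, where total is the entropy of the averaged prediction and conditional is the average per-pass entropy. The sketch below implements that standard decomposition for a binary choice; whether the paper uses exactly this form is an assumption, and the function names are illustrative.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def decompose_uncertainty(probs):
    """probs: P(choice A) from each stochastic forward pass on the same dilemma."""
    mean_p = sum(probs) / len(probs)
    total = binary_entropy(mean_p)  # entropy of the averaged prediction
    conditional = sum(binary_entropy(p) for p in probs) / len(probs)  # mean per-pass entropy
    return {
        "total": total,
        "conditional": conditional,
        "mutual_information": total - conditional,  # disagreement between passes
    }
```

Under this decomposition, passes that are each confident but disagree with one another yield high mutual information, while passes that are each unsure but agree yield high conditional entropy.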

[317] Value-Aligned Prompt Moderation via Zero-Shot Agentic Rewriting for Safe Image Generation cs.AI | cs.CR | cs.CV | cs.LGPDF

Xin Zhao, Xiaojun Chen, Bingshan Liu, Zeyao Liu, Zhendong Zhao

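TL;DR: VALOR is a modular, zero-shot agentic framework for safer text-to-image generation. Using layered prompt analysis and alignment with human values, it substantially reduces unsafe outputs while preserving generation quality.

Details

Motivation: Generative vision-language models such as Stable Diffusion excel at creative media synthesis but can also produce unsafe or inappropriate content. Existing defenses often trade generation quality against safety. VALOR aims to address this with an efficient, low-cost approach to safe generation.

Result: Experiments show that on adversarial, ambiguous, and value-sensitive prompts, VALOR reduces unsafe outputs by up to 100% while preserving the usefulness and creativity of the prompts.

Insight: VALOR shows how modular design and dynamic instructions can align generative models with safety requirements without sacrificing generation quality. The approach is scalable and suited to open-world deployment.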

Abstract: Generative vision-language models like Stable Diffusion demonstrate remarkable capabilities in creative media synthesis, but they also pose substantial risks of producing unsafe, offensive, or culturally inappropriate content when prompted adversarially. Current defenses struggle to align outputs with human values without sacrificing generation quality or incurring high costs. To address these challenges, we introduce VALOR (Value-Aligned LLM-Overseen Rewriter), a modular, zero-shot agentic framework for safer and more helpful text-to-image generation. VALOR integrates layered prompt analysis with human-aligned value reasoning: a multi-level NSFW detector filters lexical and semantic risks; a cultural value alignment module identifies violations of social norms, legality, and representational ethics; and an intention disambiguator detects subtle or indirect unsafe implications. When unsafe content is detected, prompts are selectively rewritten by a large language model under dynamic, role-specific instructions designed to preserve user intent while enforcing alignment. If the generated image still fails a safety check, VALOR optionally performs a stylistic regeneration to steer the output toward a safer visual domain without altering core semantics. Experiments across adversarial, ambiguous, and value-sensitive prompts show that VALOR significantly reduces unsafe outputs by up to 100.00% while preserving prompt usefulness and creativity. These results highlight VALOR as a scalable and effective approach for deploying safe, aligned, and helpful image generation systems in open-world settings.

[318] TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models cs.AI | cs.CV | cs.LGPDF

Wenhao Zhou, Hao Zheng, Rong Zhao

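TL;DR: The paper introduces TopoPerception, a new benchmark for evaluating the global visual perception of large vision-language models (LVLMs). Unlike conventional benchmarks, TopoPerception is free of local shortcuts, and it reveals severe deficits in the global perception of current LVLMs.

Details

Motivation: LVLMs, which combine a visual encoder with a pre-trained language model, have been studied extensively, but the visual perception module has become a bottleneck. Conventional evaluations may overestimate model ability because of local shortcuts, so a more rigorous evaluation method is needed.

Result: Even on the coarsest-granularity task, no state-of-the-art model performs better than random chance, indicating a lack of global perception. Notably, within model families, more capable models achieve lower accuracy.

Insight: Scaling up models alone does not solve the global perception problem; new training paradigms or architectures may be needed. TopoPerception offers a direction and a tool for improving LVLMs.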

Abstract: Large Vision-Language Models (LVLMs) typically align visual features from an encoder with a pre-trained Large Language Model (LLM). However, this makes the visual perception module a bottleneck, which constrains the overall capabilities of LVLMs. Conventional evaluation benchmarks, while rich in visual semantics, often contain unavoidable local shortcuts that can lead to an overestimation of models’ perceptual abilities. Here, we introduce TopoPerception, a benchmark that leverages topological properties to rigorously evaluate the global visual perception capabilities of LVLMs across various granularities. Since topology depends on the global structure of an image and is invariant to local features, TopoPerception enables a shortcut-free assessment of global perception, fundamentally distinguishing it from semantically rich tasks. We evaluate state-of-the-art models on TopoPerception and find that even at the coarsest perceptual granularity, all models perform no better than random chance, indicating a profound inability to perceive global visual features. Notably, a consistent trend emerges within model families: more powerful models with stronger reasoning capabilities exhibit lower accuracy. This suggests that merely scaling up models is insufficient to address this deficit and may even exacerbate it. Progress may require new training paradigms or architectures. TopoPerception not only exposes a critical bottleneck in current LVLMs but also offers a lens and direction for improving their global visual perception. The data and code are publicly available at: https://github.com/Wenhao-Zhou/TopoPerception.

[319] End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction cs.AI | cs.CVPDF

Xi Li, Nicholas Matsumoto, Ujjwal Pasupulety, Atharva Deo, Cherine Yang

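TL;DR: The paper presents Frame-to-Outcome (F2O), an end-to-end AI system for surgical gesture sequence recognition and postoperative clinical outcome prediction. Using transformer-based modeling and frame-wise classification, it converts surgical videos into gesture sequences and uncovers patterns associated with postoperative outcomes.

Details

Motivation: Linking fine-grained intraoperative behavior to patient outcomes is a longstanding challenge; an automatic, interpretable method is needed to enable data-driven surgical feedback and clinical decision support.

Result: 1. Gesture detection reaches an AUC of 0.80 at the frame level and 0.81 at the video level;
2. F2O-derived features predict postoperative outcomes with accuracy comparable to human annotations (0.79 vs. 0.75);
3. Key gesture patterns associated with erectile function recovery were identified.

Insight: F2O not only recognizes surgical gestures automatically but also provides data to support clinical decisions, demonstrating the potential of AI for surgical behavior analysis and outcome prediction.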

Abstract: Fine-grained analysis of intraoperative behavior and its impact on patient outcomes remains a longstanding challenge. We present Frame-to-Outcome (F2O), an end-to-end system that translates tissue dissection videos into gesture sequences and uncovers patterns associated with postoperative outcomes. Leveraging transformer-based spatial and temporal modeling and frame-wise classification, F2O robustly detects consecutive short (2 seconds) gestures in the nerve-sparing step of robot-assisted radical prostatectomy (AUC: 0.80 frame-level; 0.81 video-level). F2O-derived features (gesture frequency, duration, and transitions) predicted postoperative outcomes with accuracy comparable to human annotations (0.79 vs. 0.75; overlapping 95% CI). Across 25 shared features, effect size directions were concordant with small differences ( 0.07), and strong correlation (r = 0.96, p < 1e-14). F2O also captured key patterns linked to erectile function recovery, including prolonged tissue peeling and reduced energy use. By enabling automatic interpretable assessment, F2O establishes a foundation for data-driven surgical feedback and prospective clinical decision support.
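
The abstract names three families of gesture features: frequency, duration, and transitions. As a rough illustration of how such features can be derived from a frame-wise label sequence, the sketch below collapses consecutive identical labels into segments and counts them. The function name, the `fps` parameter, and the gesture labels used in the usage note are hypothetical; the paper's actual feature definitions may differ.

```python
from collections import Counter
from itertools import groupby


def gesture_features(frame_labels, fps=30):
    """Summarize a frame-wise gesture label sequence."""
    # Collapse runs of identical labels into (label, run_length_in_frames) segments.
    segments = [(label, len(list(run))) for label, run in groupby(frame_labels)]
    # Frequency: number of separate occurrences of each gesture.
    frequency = Counter(label for label, _ in segments)
    # Duration: total seconds spent in each gesture.
    duration_s = Counter()
    for label, n_frames in segments:
        duration_s[label] += n_frames / fps
    # Transitions: counts of consecutive gesture pairs.
    transitions = Counter(
        (a, b) for (a, _), (b, _) in zip(segments, segments[1:])
    )
    return frequency, duration_s, transitions
```

For example, calling `gesture_features(["peel"] * 4 + ["cut"] * 2 + ["peel"] * 2, fps=2)` treats the two "peel" runs as separate occurrences and records one "peel" to "cut" transition and one "cut" to "peel" transition.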

[320] Adaptive Diagnostic Reasoning Framework for Pathology with Multimodal Large Language Models cs.AI | cs.CV | cs.LGPDF

Yunqi Hong, Johnson Kao, Liam Edwards, Nein-Tzu Liu, Chung-Yen Huang

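TL;DR: RECAP-PATH is an interpretable framework for pathology diagnosis that uses a self-learning paradigm to shift multimodal large language models from passive pattern recognition to evidence-linked diagnostic reasoning. At its core is a two-phase learning process that generates cancer diagnoses without large labeled datasets or weight updates, delivering substantial gains in diagnostic accuracy.

Details

Motivation: Current AI tools in pathology lack human-readable reasoning, which limits clinical adoption. RECAP-PATH aims to make AI more auditable and reliable through evidence-linked reasoning.

Result: On breast and prostate cancer datasets, RECAP-PATH produces rationales aligned with expert assessment and substantially outperforms baselines in diagnostic accuracy.

Insight: By uniting the visual understanding and reasoning abilities of multimodal large language models, RECAP-PATH demonstrates a generalizable path toward evidence-linked interpretation and a new direction for clinically trustworthy AI.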

Abstract: AI tools in pathology have improved screening throughput, standardized quantification, and revealed prognostic patterns that inform treatment. However, adoption remains limited because most systems still lack the human-readable reasoning needed to audit decisions and prevent errors. We present RECAP-PATH, an interpretable framework that establishes a self-learning paradigm, shifting off-the-shelf multimodal large language models from passive pattern recognition to evidence-linked diagnostic reasoning. At its core is a two-phase learning process that autonomously derives diagnostic criteria: diversification expands pathology-style explanations, while optimization refines them for accuracy. This self-learning approach requires only small labeled sets and no white-box access or weight updates to generate cancer diagnoses. Evaluated on breast and prostate datasets, RECAP-PATH produced rationales aligned with expert assessment and delivered substantial gains in diagnostic accuracy over baselines. By uniting visual understanding with reasoning, RECAP-PATH provides clinically trustworthy AI and demonstrates a generalizable path toward evidence-linked interpretation.

[321] AURA: Development and Validation of an Augmented Unplanned Removal Alert System using Synthetic ICU Videos cs.AI | cs.CVPDF

Junhyuk Seo, Hyeyoon Moon, Kyu-Hwan Jung, Namkee Oh, Taerim Kim

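TL;DR: The paper presents AURA, a vision-based system for real-time detection of unplanned extubation risk, developed on synthetic ICU videos to sidestep the ethical and privacy barriers to obtaining real data.

Details

Motivation: Unplanned extubation (UE) in the ICU poses serious safety risks, but real-time detection has been held back by the lack of annotated video data, so a privacy-preserving solution is needed.

Result: Expert assessment confirmed the realism of the synthetic data; the system achieved high accuracy in collision detection and moderate performance in agitation recognition.

Insight: The work demonstrates a new route to privacy-preserving patient safety monitoring built on synthetic data, with potential for ICU deployment.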

Abstract: Unplanned extubation (UE) remains a critical patient safety concern in intensive care units (ICUs), often leading to severe complications or death. Real-time UE detection has been limited, largely due to the ethical and privacy challenges of obtaining annotated ICU video data. We propose Augmented Unplanned Removal Alert (AURA), a vision-based risk detection system developed and validated entirely on a fully synthetic video dataset. By leveraging text-to-video diffusion, we generated diverse and clinically realistic ICU scenarios capturing a range of patient behaviors and care contexts. The system applies pose estimation to identify two high-risk movement patterns: collision, defined as hand entry into spatial zones near airway tubes, and agitation, quantified by the velocity of tracked anatomical keypoints. Expert assessments confirmed the realism of the synthetic data, and performance evaluations showed high accuracy for collision detection and moderate performance for agitation recognition. This work demonstrates a novel pathway for developing privacy-preserving, reproducible patient safety monitoring systems with potential for deployment in intensive care settings.
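
The two movement patterns in the abstract reduce to simple geometry on pose keypoints: collision is a hand keypoint entering a zone near the airway tube, and agitation is keypoint velocity. The sketch below assumes a rectangular zone in pixel coordinates and a single tracked keypoint per frame. The zone shape, the function names, and the use of mean speed as the agitation measure are illustrative assumptions, not details from the paper.

```python
import math


def in_zone(point, zone):
    """zone = (x_min, y_min, x_max, y_max) in pixel coordinates."""
    x, y = point
    x_min, y_min, x_max, y_max = zone
    return x_min <= x <= x_max and y_min <= y <= y_max


def collision_frames(hand_track, airway_zone):
    """Indices of frames where the hand keypoint lies inside the zone."""
    return [i for i, p in enumerate(hand_track) if in_zone(p, airway_zone)]


def mean_speed(track, fps):
    """Average keypoint speed in pixels per second over a track."""
    if len(track) < 2:
        return 0.0
    # Distance moved between consecutive frames, converted to a per-second rate.
    dists = [math.dist(a, b) for a, b in zip(track, track[1:])]
    return sum(dists) / len(dists) * fps
```

A deployed system would still need thresholds on both signals and some temporal smoothing to avoid alerting on single-frame pose errors; the abstract does not specify either.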

[322] Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models cs.AI | cs.CVPDF

Guoyan Wang, Yanyan Huang, Chunlin Chen, Lifeng Wang, Yuxiang Sun

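TL;DR: Yanyun-3 is a general-purpose framework based on vision-language models (VLMs) that, for the first time, enables autonomous cross-platform operation across three heterogeneous strategy game environments. By combining the multimodal reasoning of Qwen2.5-VL with the precise execution of UI-TARS, Yanyun-3 performs core tasks such as target localization, combat resource allocation, and area control. The study finds that a hybrid strategy mixing multi-image and video data substantially outperforms full fusion, improving both performance and efficiency.

Details

Motivation: Automating cross-platform strategy games requires agents that adapt to diverse user interfaces and dynamic battlefield conditions. Although VLMs perform well at multimodal reasoning, their use in complex human-computer interaction scenarios such as strategy games remains largely unexplored.

Result: Compared with full fusion, the hybrid MV+S strategy cuts inference time by 63% and raises the BLEU-4 score from 4.81% to 62.41% (about 12.98x).

Insight: Structured organization of multimodal data effectively improves VLM performance and sheds light on the interplay between static perception and dynamic reasoning in embodied intelligence.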

Abstract: Automated operation in cross-platform strategy games demands agents with robust generalization across diverse user interfaces and dynamic battlefield conditions. While vision-language models (VLMs) have shown considerable promise in multimodal reasoning, their application to complex human-computer interaction scenarios, such as strategy gaming, remains largely unexplored. Here, we introduce Yanyun-3, a general-purpose agent framework that, for the first time, enables autonomous cross-platform operation across three heterogeneous strategy game environments. By integrating the vision-language reasoning of Qwen2.5-VL with the precise execution capabilities of UI-TARS, Yanyun-3 successfully performs core tasks including target localization, combat resource allocation, and area control. Through systematic ablation studies, we evaluate the effects of various multimodal data combinations (static images, multi-image sequences, and videos) and propose the concept of combination granularity to differentiate between intra-sample fusion and inter-sample mixing strategies. We find that a hybrid strategy, which fuses multi-image and video data while mixing in static images (MV+S), substantially outperforms full fusion: it reduces inference time by 63% and boosts the BLEU-4 score roughly 13-fold (from 4.81% to 62.41%, approximately 12.98x). Operating via a closed-loop pipeline of screen capture, model inference, and action execution, the agent demonstrates strong real-time performance and cross-platform generalization. Beyond providing an efficient solution for strategy game automation, our work establishes a general paradigm for enhancing VLM performance through structured multimodal data organization, offering new insights into the interplay between static perception and dynamic reasoning in embodied intelligence.

[323] MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements cs.AI | cs.CVPDF

SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang

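TL;DR: MEGA-GUI is a multi-stage framework for grounding GUI elements. It splits the task into coarse ROI selection and fine-grained element grounding, and adds a bidirectional ROI zoom algorithm and a context-aware rewriting agent, achieving higher grounding accuracy than existing methods.

Details

Motivation: Existing GUI grounding systems are mostly single-stage or single-model, lack modularity, and perform poorly under visual clutter and semantically ambiguous instructions. A more flexible and robust solution is needed.

Result: MEGA-GUI reaches 73.18% accuracy on ScreenSpot-Pro and 68.63% on OSWorld-G, surpassing previously reported results.

Insight: Vision-language models have complementary strengths at different visual scales, and a modular multi-stage design can exploit this to improve task performance.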

Abstract: Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
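
Any coarse-to-fine grounding pipeline needs coordinate bookkeeping: a point predicted inside a cropped and resized ROI must be mapped back to screen pixels, and the ROI must be resizable in both directions. The sketch below shows only that bookkeeping. It is not the paper's bidirectional ROI zoom algorithm, which the abstract does not describe, and the function names and the `(left, top, width, height)` ROI convention are assumptions.

```python
def crop_to_screen(point, roi, crop_size):
    """Map a point predicted inside a resized ROI crop back to screen pixels.

    roi = (left, top, width, height) on the screen;
    crop_size = (width, height) of the image shown to the model.
    """
    px, py = point
    left, top, width, height = roi
    crop_w, crop_h = crop_size
    return (left + px * width / crop_w, top + py * height / crop_h)


def zoom_roi(roi, factor, screen_size):
    """Scale an ROI about its center, clamped to the screen.

    factor < 1 zooms in (smaller ROI); factor > 1 zooms out (larger ROI).
    """
    left, top, width, height = roi
    screen_w, screen_h = screen_size
    new_w = min(screen_w, width * factor)
    new_h = min(screen_h, height * factor)
    cx, cy = left + width / 2, top + height / 2
    # Shift the box back inside the screen if scaling pushed it past an edge.
    new_left = min(max(0, cx - new_w / 2), screen_w - new_w)
    new_top = min(max(0, cy - new_h / 2), screen_h - new_h)
    return (new_left, new_top, new_w, new_h)
```

Zooming out is useful when the first ROI guess cuts off the target element; zooming in helps when the target is too small relative to the crop for the fine-grained stage to localize.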

[324] MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications cs.AI | cs.CV | cs.ET | cs.NIPDF

Gagan Raj Gupta, Anshul Kumar, Manish Rai, Apu Chakraborty, Ashutosh Modi

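TL;DR: The paper proposes MM-Telco, a comprehensive suite of multimodal benchmarks and models designed for the telecom domain. It targets the domain-specific challenges LLMs face in telecom applications and provides a range of text-based and image-based tasks.

Details

Motivation: Large language models (LLMs) hold great potential in telecom, but their deployment is limited by domain-specific challenges. A dedicated benchmark and model suite is needed to accelerate the adaptation of LLMs to telecom.

Result: Experiments show that models fine-tuned on the MM-Telco dataset improve significantly, and they also expose weaknesses of current multimodal LLMs.

Insight: MM-Telco supports practical telecom use cases and points to directions for further research and development of multimodal LLMs.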

Abstract: Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces a range of text-based and image-based tasks that address practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas of current state-of-the-art multimodal LLMs, thus guiding further development and research.

[325] DAP: A Discrete-token Autoregressive Planner for Autonomous Driving cs.AI | cs.CVPDF

Bowen Ye, Bin Zhang, Hang Zhao

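TL;DR: The paper proposes DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories for efficient data scaling and better performance; combined with reinforcement-learning-based fine-tuning, it performs strongly.

Details

Motivation: In autonomous driving, predicting the ego trajectory alone suffers from sparse supervision and only weakly constrains how scene evolution shapes ego motion, which limits performance scaling. A more comprehensive planning approach is therefore needed.

Result: With 160M parameters, DAP achieves state-of-the-art open-loop metrics and performs strongly on the closed-loop NAVSIM benchmark.

Insight: Jointly forecasting BEV semantics and ego trajectories constrains the planning task more comprehensively, and the discrete-token autoregressive design offers a compact yet scalable planning paradigm.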

Abstract: Gaining sustainable performance improvement with scaling data and model budget remains a pivotal yet unresolved challenge in autonomous driving. While autoregressive models have exhibited promising data-scaling efficiency in planning tasks, predicting ego trajectories alone suffers from sparse supervision and weakly constrains how scene evolution should shape ego motion. Therefore, we introduce DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories, thereby enforcing comprehensive representation learning and allowing predicted dynamics to directly condition ego motion. In addition, we incorporate reinforcement-learning-based fine-tuning, which preserves supervised behavior cloning priors while injecting reward-guided improvements. Despite a compact 160M parameter budget, DAP achieves state-of-the-art performance on open-loop metrics and delivers competitive closed-loop results on the NAVSIM benchmark. Overall, the fully discrete-token autoregressive formulation operating on both rasterized BEV and ego actions provides a compact yet scalable planning paradigm for autonomous driving.

cs.DC [Back]

[326] Range Asymmetric Numeral Systems-Based Lightweight Intermediate Feature Compression for Split Computing of Deep Neural Networks cs.DC | cs.AI | cs.CVPDF

Mingyu Sung, Suhwan Im, Vikas Palakonda, Jae-Mo Kang

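TL;DR: The paper proposes a lightweight intermediate-feature compression framework based on Range Asymmetric Numeral Systems (rANS) to address the communication bottleneck in distributed deep learning inference; it substantially reduces transmission overhead without complex probability modeling or network modifications.

Details

Motivation: Split computing faces a communication bottleneck when transmitting intermediate features, especially between resource-constrained edge devices and cloud servers. Existing methods usually require complex probability modeling or network modifications, which limits practical use.

Result: Across diverse neural architectures (e.g., ResNet, VGG16) and datasets (CIFAR100, ImageNet), the method maintains near-baseline accuracy; its broad applicability is also validated on natural language processing tasks with Llama2 7B/13B.

Insight: The method offers a practical way to deploy complex AI systems in bandwidth-constrained environments and demonstrates the potential of lightweight compression.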

Abstract: Split computing distributes deep neural network inference between resource-constrained edge devices and cloud servers but faces significant communication bottlenecks when transmitting intermediate features. To this end, in this paper, we propose a novel lightweight compression framework that leverages Range Asymmetric Numeral Systems (rANS) encoding with asymmetric integer quantization and sparse tensor representation to reduce transmission overhead dramatically. Specifically, our approach combines asymmetric integer quantization with a sparse representation technique, eliminating the need for complex probability modeling or network modifications. The key contributions include: (1) a distribution-agnostic compression pipeline that exploits inherent tensor sparsity to achieve bandwidth reduction with minimal computational overhead; (2) an approximate theoretical model that optimizes tensor reshaping dimensions to maximize compression efficiency; and (3) a GPU-accelerated implementation with sub-millisecond encoding/decoding latency. Extensive evaluations across diverse neural architectures (ResNet, VGG16, MobileNetV2, SwinT, DenseNet121, EfficientNetB0) demonstrate that the proposed framework consistently maintains near-baseline accuracy across CIFAR100 and ImageNet benchmarks. Moreover, we validated the framework’s effectiveness on advanced natural language processing tasks by employing Llama2 7B and 13B on standard benchmarks such as MMLU, HellaSwag, ARC, PIQA, Winogrande, BoolQ, and OpenBookQA, demonstrating its broad applicability beyond computer vision. Furthermore, this method addresses a fundamental bottleneck in deploying sophisticated artificial intelligence systems in bandwidth-constrained environments without compromising model performance.
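
The two preprocessing steps named in the abstract, asymmetric integer quantization and sparse tensor representation, can be illustrated on a flat list of activations. The sketch below uses a textbook scale and zero-point quantizer and an (index, value) sparse format. Both are generic choices assumed for illustration, and the rANS entropy-coding stage that would follow is not shown. All function names are hypothetical.

```python
def quantize_asymmetric(values, bits=8):
    """Map floats to integers in [0, 2**bits - 1] using a scale and zero point."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 when all values are equal
    zero_point = round(-lo / scale)    # integer that represents the float 0.0
    q = [min(levels, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Approximate inverse of quantize_asymmetric."""
    return [(x - zero_point) * scale for x in q]


def to_sparse(q, zero_point):
    """Keep only entries that differ from the quantized zero."""
    return [(i, x) for i, x in enumerate(q) if x != zero_point]


def from_sparse(pairs, length, zero_point):
    """Rebuild the dense quantized list from (index, value) pairs."""
    dense = [zero_point] * length
    for i, x in pairs:
        dense[i] = x
    return dense
```

Post-activation feature maps often contain many exact zeros, so the sparse form can be much shorter than the dense one before any entropy coding is applied; the receiving side reverses the steps with `from_sparse` and `dequantize`.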

eess.AS [Back]

[327] How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer eess.AS | cs.CLPDF

Minu Kim, Ji Sub Um, Hoirin Kim

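TL;DR: The paper studies how self-supervised learning (SSL) speech models handle tonal languages (Burmese, Thai, Lao, and Vietnamese) and finds that the temporal span over which a model attends to tone varies with the downstream task.

Details

Motivation: Tone is central to many languages, yet how well SSL speech models capture tone remains underexplored, especially in low-resource languages other than Mandarin.

Result: The temporal span over which models attend to tone is related to each language's own tone characteristics (about 100 ms for Burmese and Thai, about 180 ms for Lao and Vietnamese), but the downstream task significantly affects this span.

Insight: The nature of the downstream task strongly shapes how SSL speech models represent tone, and the effect is more pronounced under low-resource conditions.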

Abstract: Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems: Burmese, Thai, Lao, and Vietnamese, to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.

[328] VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing eess.AS | cs.CL | cs.SDPDF

Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun

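TL;DR: VoiceCraft-X is an autoregressive neural codec language model that unifies speech editing and zero-shot text-to-speech (TTS) synthesis across 11 languages. It uses the Qwen3 large language model for cross-lingual text processing and introduces a novel token reordering mechanism.

Details

Motivation: Existing speech synthesis and editing systems are mostly monolingual and treat the tasks separately, lacking unified cross-lingual, multi-task capability. VoiceCraft-X aims to provide a single framework for multiple languages and tasks.

Result: VoiceCraft-X generates high-quality speech in 11 languages and supports seamless editing, performing well even when per-language data is limited.

Insight: Unified autoregressive approaches show strong potential for complex multilingual speech tasks, and cross-lingual processing is possible without phoneme annotations.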

Abstract: We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.

Table of Contents

cs.CV [Back]

[1] Psychological stress during Examination and its estimation by handwriting in answer script cs.CVPDF

[2] Real-time pothole detection with onboard sensors and camera on vehicles cs.CVPDF

[3] A Method for Identifying Farmland System Habitat Types Based on the Dynamic-Weighted Feature Fusion Network Model cs.CVPDF

[4] Task-Aware 3D Affordance Segmentation via 2D Guidance and Geometric Refinement cs.CV | cs.AI | eess.IVPDF

[5] Do Blind Spots Matter for Word-Referent Mapping? A Computational Study with Infant Egocentric Video cs.CV | cs.AIPDF

[6] GROVER: Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion cs.CV | cs.AIPDF

[7] Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models cs.CV | cs.AI | cs.MAPDF

[8] Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing cs.CV | cs.AIPDF

[9] Defending Unauthorized Model Merging via Dual-Stage Weight Protection cs.CV | cs.CRPDF

[10] FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision cs.CVPDF

[11] Lacking Data? No worries! How synthetic images can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus) cs.CVPDF

[12] Advancing Annotat3D with Harpia: A CUDA-Accelerated Library For Large-Scale Volumetric Data Segmentation cs.CV | cs.DCPDF

[13] Prompt Triage: Structured Optimization Enhances Vision-Language Model Performance on Medical Imaging Benchmarks cs.CV | cs.AIPDF

[14] PI-NAIM: Path-Integrated Neural Adaptive Imputation Model cs.CV | cs.AIPDF

[15] Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models cs.CVPDF

[16] From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing cs.CVPDF

[17] Evaluation of Attention Mechanisms in U-Net Architectures for Semantic Segmentation of Brazilian Rock Art Petroglyphs cs.CVPDF

[18] From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology cs.CVPDF

[19] LithoSeg: A Coarse-to-Fine Framework for High-Precision Lithography Segmentation cs.CV | cs.NIPDF

[20] Enhancing Road Safety Through Multi-Camera Image Segmentation with Post-Encroachment Time Analysis cs.CV | cs.LG | cs.SIPDF

[21] LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension cs.CVPDF

[22] Null-Space Diffusion Distillation for Efficient Photorealistic Lensless Imaging cs.CVPDF

[23] Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark cs.CVPDF

[24] GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory cs.CV | cs.AIPDF

[25] VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation cs.CVPDF

[26] Improved Masked Image Generation with Knowledge-Augmented Token Representations cs.CVPDF

[27] Calibrated Multimodal Representation Learning with Missing Modalities cs.CV | cs.LG | cs.MMPDF

[28] SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images cs.CVPDF

[29] DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging cs.CV | cs.AIPDF

[30] PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling cs.CV | cs.AI | cs.DCPDF

[31] MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity cs.CV | cs.AI | cs.DBPDF

[32] DCA-LUT: Deep Chromatic Alignment with 5D LUT for Purple Fringing Removal cs.CV | eess.IVPDF

[33] Learning to Hear by Seeing: It’s Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound cs.CVPDF

[34] Point Cloud Quantization through Multimodal Prompting for 3D Understanding cs.CVPDF

[35] Supervised Multilabel Image Classification Using Residual Networks with Probabilistic Reasoning cs.CVPDF

[36] SemanticStitch: Enhancing Image Coherence through Foreground-Aware Seam Carving cs.CVPDF

[37] Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillation cs.CVPDF

[38] Sparse by Rule: Probability-Based N:M Pruning for Spiking Neural Networks cs.CVPDF

[39] DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT cs.CVPDF

[40] Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models cs.CVPDF

[41] Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection cs.CVPDF

[42] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images cs.CV | cs.AIPDF

[43] RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving cs.CVPDF

[44] OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description cs.CV | cs.AIPDF

[45] Compression and Inference of Spiking Neural Networks on Resource-Constrained Hardware cs.CVPDF

[46] MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering cs.CVPDF

[47] Breaking the Modality Wall: Time-step Mixup for Efficient Spiking Knowledge Transfer from Static to Event Domain cs.CVPDF

[48] FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing cs.CVPDF

[49] Codebook-Centric Deep Hashing: End-to-End Joint Learning of Semantic Hash Centers and Neural Hash Function cs.CV | cs.LGPDF

[50] Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective cs.CV | cs.AIPDF

[51] MixAR: Mixture Autoregressive Image Generation cs.CV | cs.LGPDF

[52] Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System cs.CV | cs.HCPDF

[53] Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation cs.CVPDF

[54] OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs cs.CVPDF

[55] LSS3D: Learnable Spatial Shifting for Consistent and High-Quality 3D Generation from Single-Image cs.CVPDF

[56] GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction cs.CVPDF

[57] A Novel AI-Driven System for Real-Time Detection of Mirror Absence, Helmet Non-Compliance, and License Plates Using YOLOv8 and OCR cs.CV | cs.AIPDF

[58] Mixture of States: Routing Token-Level Dynamics for Multimodal Generation cs.CVPDF

[59] FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention cs.CVPDF

[60] Suppressing VLM Hallucinations with Spectral Representation Filtering cs.CV | cs.LGPDF

[61] Model Inversion Attack Against Deep Hashing cs.CV | cs.AIPDF

[62] Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets cs.CVPDF

[63] Prompt-Conditioned FiLM and Multi-Scale Fusion on MedSigLIP for Low-Dose CT Quality Assessment cs.CV | cs.AI | eess.IVPDF

[64] A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation cs.CVPDF

[65] CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models cs.CV | cs.AIPDF

[66] ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks cs.CVPDF

[67] TM-UNet: Token-Memory Enhanced Sequential Modeling for Efficient Medical Image Segmentation cs.CVPDF

[68] D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs cs.CV | cs.CL | cs.LGPDF

[69] One target to align them all: LiDAR, RGB and event cameras extrinsic calibration for Autonomous Driving cs.CVPDF

[70] Rethinking Bias in Generative Data Augmentation for Medical AI: a Frequency Recalibration Method cs.CV | cs.AIPDF

[71] Learning Time in Static Classifiers cs.CV | cs.AI | cs.LGPDF

[72] SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models cs.CVPDF

[73] Ground Plane Projection for Improved Traffic Analytics at Intersections cs.CV | cs.AI | cs.LGPDF

[74] Explainable AI-Generated Image Detection RewardBench cs.CVPDF

[75] Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning cs.CVPDF

[76] Fast Reasoning Segmentation for Images and Videos cs.CVPDF

[77] Changes in Real Time: Online Scene Change Detection with Multi-View Fusion cs.CVPDF

[78] Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models cs.CVPDF