cs.CV [Total: 184]
cs.CL [Total: 75]
cs.CY [Total: 3]
cs.SD [Total: 1]
q-fin.TR [Total: 1]
physics.geo-ph [Total: 1]
eess.IV [Total: 19]
eess.SP [Total: 1]
cs.RO [Total: 4]
cs.AI [Total: 13]
cs.CR [Total: 2]
cs.LG [Total: 15]
math.LO [Total: 1]
cs.GR [Total: 4]
physics.soc-ph [Total: 1]
cs.DL [Total: 1]
q-bio.QM [Total: 1]
cs.HC [Total: 3]
cs.IR [Total: 2]
cs.IT [Total: 1]
eess.AS [Total: 1]

cs.CV

[1] Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras cs.CV

Miao Fan, Quanxin Zheng, Shengtong Xu, Linghe Kong, Haoyi Xiong

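TL;DR: The paper proposes MRC-VMap, an end-to-end neural network that uses multi-view roadside cameras to generate high-definition vectorized maps at low cost, overcoming the limitations of traditional offline (LiDAR) and online (onboard camera) methods.

Details

Motivation: Traditional vectorized map construction is either expensive (LiDAR) or limited in performance (onboard cameras), and performs poorly at complex intersections in particular. The paper aims to provide a low-cost, high-performance alternative based on roadside cameras.

Result: Experiments on 4,000 intersections across 4 major Chinese cities show that MRC-VMap outperforms existing online methods and approaches the accuracy of high-cost LiDAR-based methods.

Insight: Multi-view roadside cameras are a promising direction for low-cost, efficient vectorized map construction, especially at complex intersections.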

Abstract: Vectorized maps are indispensable for precise navigation and the safe operation of autonomous vehicles. Traditional methods for constructing these maps fall into two categories: offline techniques, which rely on expensive, labor-intensive LiDAR data collection and manual annotation, and online approaches that use onboard cameras to reduce costs but suffer from limited performance, especially at complex intersections. To bridge this gap, we introduce MRC-VMap, a cost-effective, vision-centric, end-to-end neural network designed to generate high-definition vectorized maps directly at intersections. Leveraging existing roadside surveillance cameras, MRC-VMap directly converts time-aligned, multi-directional images into vectorized map representations. This integrated solution lowers the need for additional intermediate modules–such as separate feature extraction and Bird’s-Eye View (BEV) conversion steps–thus reducing both computational overhead and error propagation. Moreover, the use of multiple camera views enhances mapping completeness, mitigates occlusions, and provides robust performance under practical deployment constraints. Extensive experiments conducted on 4,000 intersections across 4 major metropolitan areas in China demonstrate that MRC-VMap not only outperforms state-of-the-art online methods but also achieves accuracy comparable to high-cost LiDAR-based approaches, thereby offering a scalable and efficient solution for modern autonomous navigation systems.

Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das

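TL;DR: The paper presents a comprehensive survey of talking head generation (THG), covering 2D-based, 3D-based, NeRF-based, and diffusion-based methods, analyzing datasets, evaluation metrics, and loss functions, and discussing future research directions.

Details

Motivation: THG is widely used in digital avatars, video dubbing, and related applications, but it faces challenges such as reliance on pre-trained models and extreme pose handling, which calls for a systematic summary and an exploration of future directions.

Result: The survey identifies challenges in THG (such as reliance on pre-trained models) and future trends (such as modular architectures and hybrid models).

Insight: Future THG research should focus on multilingual datasets, hybrid models, and the design of new loss functions.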

Abstract: Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, parameter-driven techniques and many other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: https://github.com/VineetKumarRakesh/thg.

[3] Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis cs.CV | cs.AI | I.2.7; I.2.10; I.4

Charlton Teo

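TL;DR: The paper evaluates how effective multimodal large language models (MLLMs) are at tennis video analysis, focusing on whether they can classify actions and identify action sequences within a rally, and explores ways to improve their performance.

Details

Motivation: Existing tennis analysis research leaves a gap in understanding the sequence of actions in a match. MLLMs can process multimodal data and are therefore suited to filling this gap and advancing sports analytics.

Result: Experiments indicate that MLLMs show promise for tennis action classification and sequence recognition, and that the proposed improvements further raise performance.

Insight: Multimodal LLMs have broad potential in sports video analysis, and combining them with traditional models may be an effective way to improve performance.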

Abstract: The use of Large Language Models (LLMs) in recent years has also given rise to the development of Multimodal LLMs (MLLMs). These new MLLMs allow us to process images, videos and even audio alongside textual inputs. In this project, we aim to assess the effectiveness of MLLMs in analysing sports videos, focusing mainly on tennis videos. Despite research done on tennis analysis, there remains a gap in models that are able to understand and identify the sequence of events in a tennis rally, which would be useful in other fields of sports analytics. As such, we will mainly assess the MLLMs on their ability to fill this gap - to classify tennis actions, as well as their ability to identify these actions in a sequence of tennis actions in a rally. We further looked into ways we can improve the MLLMs’ performance, including different training methods and even using them together with other traditional models.

[4] Enhancing Sports Strategy with Video Analytics and Data Mining: Automated Video-Based Analytics Framework for Tennis Doubles cs.CV | cs.LG

Jia Wei Chen

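TL;DR: The paper proposes an automated video analytics framework for tennis doubles that combines natural language grounding with pose estimation to significantly reduce manual annotation effort, and shows that CNN-based models outperform pose-based methods at predicting shot types, player positioning, and formations.

Details

Motivation: Because of its complex tactics and dynamics, tennis doubles lacks automated analysis tools, which makes it hard to capture and analyze key match data efficiently.

Result: Experiments on tennis doubles data show that CNN models clearly outperform pose-based methods on the prediction tasks and effectively capture complex visual and contextual features.

Insight: The framework offers a new direction for automated tactical analysis of tennis doubles, demonstrates the potential of CNNs for analyzing complex sports data, and may inform the analysis of similar sports.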

Abstract: We present a comprehensive video-based analytics framework for tennis doubles that addresses the lack of automated analysis tools for this strategically complex sport. Our approach introduces a standardised annotation methodology encompassing player positioning, shot types, court formations, and match outcomes, coupled with a specialised annotation tool designed to meet the unique requirements of tennis video labelling. The framework integrates advanced machine learning techniques including GroundingDINO for precise player localisation through natural language grounding and YOLO-Pose for robust pose estimation. This combination significantly reduces manual annotation effort whilst improving data consistency and quality. We evaluate our approach on doubles tennis match data and demonstrate that CNN-based models with transfer learning substantially outperform pose-based methods for predicting shot types, player positioning, and formations. The CNN models effectively capture complex visual and contextual features essential for doubles tennis analysis. Our integrated system bridges advanced analytical capabilities with the strategic complexities of tennis doubles, providing a foundation for automated tactical analysis, performance evaluation, and strategic modelling in professional tennis.

[5] Modeling Urban Food Insecurity with Google Street View Images cs.CV | cs.LG

David Li

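TL;DR: The paper proposes a two-stage method for modeling urban food insecurity from Google Street View images, using feature extraction followed by gated attention for image aggregation, as a potential supplementary tool for urban planners and policymakers.

Details

Motivation: Food insecurity is a significant social and public health issue, and existing survey-based methods are difficult to scale, which motivates automated approaches based on street-level images.

Result: The model's predictive power falls slightly short, but it shows potential value as a supplement to existing methods.

Insight: Street-level images can serve as a new data source for identifying food insecurity, although model performance needs further improvement.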

Abstract: Food insecurity is a significant social and public health issue that plagues many urban metropolitan areas around the world. Existing approaches to identifying food insecurity rely primarily on qualitative and quantitative survey data, which is difficult to scale. This project seeks to explore the effectiveness of using street-level images in modeling food insecurity at the census tract level. To do so, we propose a two-step process of feature extraction and gated attention for image aggregation. We evaluate the effectiveness of our model by comparing against other model architectures, interpreting our learned weights, and performing a case study. While our model falls slightly short in terms of its predictive power, we believe our approach still has the potential to supplement existing methods of identifying food insecurity for urban planners and policymakers.

[6] OBSER: Object-Based Sub-Environment Recognition for Zero-Shot Environmental Inference cs.CV | cs.AI | cs.LG | stat.ML

Won-Seok Choi, Dong-Sig Han, Suhyung Choi, Hyeonseo Yang, Byoung-Tak Zhang

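TL;DR: OBSER is a novel Bayesian framework that infers three fundamental relationships between sub-environments and objects from object distributions, enabling zero-shot environment understanding.

Details

Motivation: Traditional scene-based environment recognition struggles with zero-shot inference in open-world and real-world settings.

Result: OBSER reliably performs zero-shot inference in open-world and photorealistic environments and outperforms scene-based methods on chained retrieval tasks.

Insight: Environment recognition based on object distributions is better suited to zero-shot tasks than traditional scene-based methods.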

Abstract: We present the Object-Based Sub-Environment Recognition (OBSER) framework, a novel Bayesian framework that infers three fundamental relationships between sub-environments and their constituent objects. In the OBSER framework, metric and self-supervised learning models estimate the object distributions of sub-environments on the latent space to compute these measures. Both theoretically and empirically, we validate the proposed framework by introducing the ($\epsilon,\delta$) statistically separable (EDS) function which indicates the alignment of the representation. Our framework reliably performs inference in open-world and photorealistic environments and outperforms scene-based methods in chained retrieval tasks. The OBSER framework enables zero-shot recognition of environments to achieve autonomous environment understanding.

[7] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation cs.CV | cs.AI | cs.CL | cs.MM

Yi-Chun Chen, Arnav Jhala

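TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through vision-language alignment.

Details

Motivation: The work addresses the mismatch between AI-generated game content and game narratives, as well as the limited diversity caused by imbalanced training data distributions.

Result: The dataset serves as a resource for improving procedural content generation methods and provides a baseline for object detection in low-resolution, non-photorealistic images.

Insight: Semantically annotated low-resolution datasets are important for AI generation of narratively consistent visual content.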

Abstract: GameTileNet is a dataset designed to provide semantic labels for low-resolution digital game art, advancing procedural content generation (PCG) and related AI research as a vision-language alignment task. Large Language Models (LLMs) and image-generative AI models have enabled indie developers to create visual assets, such as sprites, for game interactions. However, generating visuals that align with game narratives remains challenging due to inconsistent AI outputs, requiring manual adjustments by human artists. The diversity of visual representations in automatically generated game content is also limited because of the imbalance in distributions across styles for training data. GameTileNet addresses this by collecting artist-created game tiles from OpenGameArt.org under Creative Commons licenses and providing semantic annotations to support narrative-driven content generation. The dataset introduces a pipeline for object detection in low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images. TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through visual-language alignment.

[8] Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding cs.CV | cs.AI

Chenglin Li, Qianglong Chen, fengtao, Yin Zhang

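TL;DR: The paper proposes Temporal Search (TS), a training-free framework that improves the long-video understanding of multimodal large language models (MLLMs) by iteratively refining temporal intervals.

Details

Motivation: Existing MLLMs perform poorly on long-video understanding, mainly because they perceive temporal intervals inefficiently and cannot dynamically adjust their temporal focus the way humans do.

Result: The method improves MLLMs' understanding of long videos while avoiding the high memory consumption and loss of key information associated with dense uniform sampling.

Insight: The model's generation confidence across temporal intervals is highly correlated with prediction accuracy, an observation that provides the basis for dynamically focusing on temporal intervals.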

Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in video understanding tasks. However, they continue to struggle with long-form videos because of an inefficient perception of temporal intervals. Unlike humans, who can dynamically adjust their temporal focus to locate query-relevant moments, current MLLMs often rely on dense, uniform sampling across the video timeline, leading to high memory consumption and a risk of missing crucial information. To address this challenge, we introduce Temporal Search, a training-free framework that enables MLLMs to explore temporal regions for improved long video understanding iteratively. TS is based on a key observation: the model’s generation confidence across different temporal intervals is highly correlated with prediction accuracy. TS operates through two main iterative stages. First, the MLLM proposes a temporal interval that is likely to contain task-relevant information. Then, it samples a fixed number of frames from the interval, regardless of length, and feeds them into the model to produce a refined response and confidence score. TS refines the focus of the model by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos. Additionally, keyframe-level descriptions are collected to facilitate cross-interval perception throughout the video. To further improve efficiency, we introduce TS-BFS, a best-first search strategy over a tree. Each node represents a candidate interval and is expanded via two methods: self-driven proposals and uniform partitioning. Nodes are scored based on confidence and self-evaluation, and the most promising one is selected for continued exploration.
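
The best-first search over candidate intervals described in the abstract can be illustrated with a small sketch. This is not the authors' TS-BFS implementation: it expands nodes by uniform partitioning only (the self-driven proposals are left out), and `score_fn`, `toy_confidence`, `min_length`, and `max_expansions` are hypothetical names standing in for the MLLM's confidence and self-evaluation score and for the search budget.

```python
import heapq
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def uniform_partition(interval: Interval, parts: int = 2) -> List[Interval]:
    """Split an interval into equally sized child intervals."""
    start, end = interval
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def best_first_interval_search(
    video: Interval,
    score_fn: Callable[[Interval], float],
    min_length: float = 10.0,
    max_expansions: int = 16,
) -> Tuple[Interval, float]:
    """Return the highest-scoring interval seen within the expansion budget."""
    best_interval, best_score = video, score_fn(video)
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(-best_score, video)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, interval = heapq.heappop(frontier)
        if interval[1] - interval[0] <= min_length:
            continue  # too short to split further
        for child in uniform_partition(interval):
            child_score = score_fn(child)
            if child_score > best_score:
                best_interval, best_score = child, child_score
            heapq.heappush(frontier, (-child_score, child))
    return best_interval, best_score


def toy_confidence(interval: Interval, target: float = 130.0) -> float:
    """Hypothetical stand-in for model confidence: higher for short
    intervals that contain the moment of interest."""
    start, end = interval
    if not (start <= target < end):
        return 0.0
    return min(1.0, 10.0 / (end - start))


result = best_first_interval_search((0.0, 640.0), toy_confidence)
```

With the toy score, the search narrows a 640-second video down to the 10-second interval containing the target moment after six expansions. In a real system each call to `score_fn` would involve sampling frames from the interval and querying the model, which is why the expansion budget matters.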

[9] DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction cs.CV | cs.AI | cs.RO | I.4.8; I.2.7; I.2.10

Zhiyi Hou, Enhui Ma, Fang Li, Zhiyi Lai, Kalok Ho

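TL;DR: DriveMRP proposes a Bird's-Eye View (BEV) based motion simulation method that synthesizes high-risk motion data (DriveMRP-10K) to strengthen the motion risk prediction of vision-language models (VLMs), and designs a VLM-agnostic framework, DriveMRP-Agent, that significantly improves risk prediction performance.

Details

Motivation: Motion risk prediction for autonomous driving in long-tail scenarios is challenged by limited data coverage and the uncertainty of dynamic environments. The paper explores whether synthesized high-risk motion data can improve VLM capabilities.

Result: After fine-tuning on DriveMRP-10K, accident recognition accuracy rises from 27.13% to 88.03%; in zero-shot evaluation on a real-world high-risk dataset, accuracy rises from 29.42% to 68.50%.

Insight: Synthetic data can significantly improve VLM risk prediction in long-tail scenarios, and the method generalizes well.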

Abstract: Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle’s future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we aim to explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLM) by synthesizing high-risk motion data. Specifically, we introduce a Bird’s-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that by fine-tuning with DriveMRP-10K, our DriveMRP-Agent framework can significantly improve the motion risk prediction performance of multiple VLM baselines, with the accident recognition accuracy soaring from 27.13% to 88.03%. Moreover, when tested via zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting the accuracy from base_model’s 29.42% to 68.50%, which showcases the strong generalization capabilities of our method in real-world scenarios.

[10] Multimodal image registration for effective thermographic fever screening cs.CV

C. Y. N. Dwith, Pejhman Ghassemi, Joshua Pfefer, Jon Casamento, Quanzeng Wang

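TL;DR: The paper proposes a multimodal registration method for infrared and white-light images that accurately localizes the canthi regions, improving the accuracy of thermographic fever screening.

Details

Motivation: During infectious disease pandemics, infrared thermograph (IRT) fever screening is a fast, non-invasive screening method, but it requires accurate localization of the canthi regions to be effective.

Result: Registration accuracy is within 2.7 mm, which enables accurate localization of the canthi regions.

Insight: Multimodal image registration can significantly improve the accuracy of thermographic fever screening, particularly for localizing the canthi regions.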

Abstract: Fever screening based on infrared thermographs (IRTs) is a viable mass screening approach during infectious disease pandemics, such as Ebola and SARS, for temperature monitoring in public places like hospitals and airports. IRTs have been found to be powerful, quick and non-invasive methods to detect elevated temperatures. Moreover, regions medially adjacent to the inner canthi (called the canthi regions in this paper) are preferred sites for fever screening. Accurate localization of the canthi regions can be achieved through multi-modal registration of infrared (IR) and white-light images. We proposed a registration method through a coarse-fine registration strategy using different registration models based on landmarks and edge detection on eye contours. We evaluated the registration accuracy to be within 2.7 mm, which enables accurate localization of the canthi regions.

[11] CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning cs.CV

Andrew Kiruluta, Preethi Raju, Priscilla Burity

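TL;DR: The paper proposes CSAT (Compressed Sensing Attention Transformer), which uses compressed sensing to reduce the complexity of the attention mechanism so that cross-modal computation in vision-language models can be handled efficiently.

Details

Motivation: Existing vision-language models (vLLMs) hit memory and latency bottlenecks on long videos and complex language descriptions because of the high computational complexity of standard attention, so an efficient alternative is needed.

Result: Validation on standard benchmarks indicates that CSAT significantly reduces computational resource consumption while maintaining model performance.

Insight: Visual and language representations are inherently compressible (for example, temporal redundancy in video and sparse cross-modal grounding in language), which makes them well suited to compressed sensing.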

Abstract: Vision-Language Models (vLLMs) have emerged as powerful architectures for joint reasoning over visual and textual inputs, enabling breakthroughs in image captioning, cross modal retrieval, and multimodal dialogue. However, as these models scale to longer video sequences and richer language descriptions, the quadratic complexity of the standard attention mechanism presents a fundamental computational bottleneck. This challenge is exacerbated in vLLMs, where attention must be computed not only within modalities but also across them, leading to prohibitive memory and latency costs. In this work, we introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. By projecting high dimensional key and value representations into a lower-dimensional subspace via random measurement matrices and reconstructing the attention outputs using sparse recovery algorithms, CSAT significantly reduces attention complexity while maintaining semantic fidelity. Applied to vLLMs, CSAT exploits the inherent compressibility of both visual and textual representations especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse. In contrast to LLMs, which must often model entangled symbolic dependencies, vLLMs benefit from structured sparsity in alignment and scene composition, making them particularly well-suited to compressed attention. We provide a formal mathematical treatment of CSAT, demonstrate its integration into vision language pipelines, and validate its performance on standard benchmarks, highlighting its promise as a scalable, interpretable, and resource efficient solution for next generation multimodal transformers.
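
To make the projection step concrete, the following is a minimal sketch of attention over randomly projected keys and values. It covers only the random-measurement part of the idea and omits the sparse recovery stage that the abstract describes, so it should not be read as the CSAT architecture itself. The function names, the Gaussian measurement matrix, and its 1/sqrt(m) scaling are assumptions made for illustration.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (p x q) and b (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def projected_attention(q: Matrix, k: Matrix, v: Matrix,
                        num_measurements: int, seed: int = 0) -> Matrix:
    """Attention computed against m random measurements of the n keys/values."""
    rng = random.Random(seed)
    n, d = len(k), len(k[0])
    scale = 1.0 / math.sqrt(num_measurements)
    # Random Gaussian measurement matrix of shape (m x n); an assumption here.
    phi = [[rng.gauss(0.0, 1.0) * scale for _ in range(n)]
           for _ in range(num_measurements)]
    k_small = matmul(phi, k)  # (m x d) compressed keys
    v_small = matmul(phi, v)  # (m x d_v) compressed values
    scores = matmul(q, transpose(k_small))  # (n_q x m) instead of (n_q x n)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v_small)


queries = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4]]
keys = [[float(i), 0.5, -0.1 * i] for i in range(5)]
values = [[1.0 * i, 2.0 - i] for i in range(5)]
output = projected_attention(queries, keys, values, num_measurements=3)
```

The score matrix has one column per measurement rather than one per key, which is where the saving comes from when the number of measurements is much smaller than the sequence length.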

[12] VR-YOLO: Enhancing PCB Defect Detection with Viewpoint Robustness Based on YOLO cs.CV | eess.IV

Hengyi Zhu, Linye Wei, He Li

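TL;DR: VR-YOLO is a PCB defect detection algorithm built on YOLOv8 that improves robustness to viewpoint changes through diversified scene enhancement and a key object focus mechanism.

Details

Motivation: Traditional PCB defect detection algorithms impose strict requirements on image angle, orientation, and similar factors, which limits practical use. VR-YOLO aims to improve generalization and viewpoint robustness in real-world scenarios.

Result: mAP reaches 98.9% on the original test images and 94.7% under viewpoint shifts, a significant improvement over the baseline model.

Insight: Viewpoint robustness is especially important for small-object detection, and combining data augmentation with attention mechanisms is an effective way to improve detection performance.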

Abstract: The integration of large-scale circuits and systems emphasizes the importance of automated defect detection of electronic components. The YOLO image detection model has been used to detect PCB defects and it has become a typical AI-assisted case of traditional industrial production. However, conventional detection algorithms have stringent requirements for the angle, orientation, and clarity of target images. In this paper, we propose an enhanced PCB defect detection algorithm, named VR-YOLO, based on the YOLOv8 model. This algorithm aims to improve the model’s generalization performance and enhance viewpoint robustness in practical application scenarios. We first propose a diversified scene enhancement (DSE) method by expanding the PCB defect dataset by incorporating diverse scenarios and segmenting samples to improve target diversity. A novel key object focus (KOF) scheme is then presented by considering angular loss and introducing an additional attention mechanism to enhance fine-grained learning of small target features. Experimental results demonstrate that our improved PCB defect detection approach achieves a mean average precision (mAP) of 98.9% for the original test images, and 94.7% for the test images with viewpoint shifts (horizontal and vertical shear coefficients of $\pm 0.06$ and rotation angle of $\pm 10$ degrees), showing significant improvements compared to the baseline YOLO model with negligible additional computational cost.
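
The viewpoint shifts quoted in the abstract (shear coefficients of up to 0.06 and rotations of up to 10 degrees) can be pictured as an affine transform applied to image coordinates. The sketch below is only an illustration: the order of operations (shear, then rotation), the rotation about the coordinate origin rather than the image centre, and the function names are assumptions, not details taken from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def shift_point(point: Point, shear_x: float, shear_y: float,
                angle_deg: float) -> Point:
    """Apply horizontal/vertical shear, then rotate about the origin."""
    x, y = point
    sheared_x = x + shear_x * y
    sheared_y = y + shear_y * x
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (sheared_x * cos_t - sheared_y * sin_t,
            sheared_x * sin_t + sheared_y * cos_t)


def shift_box(box: Box, shear_x: float, shear_y: float,
              angle_deg: float) -> Box:
    """Transform the four corners and return their axis-aligned envelope."""
    x_min, y_min, x_max, y_max = box
    corners: List[Point] = [(x_min, y_min), (x_max, y_min),
                            (x_max, y_max), (x_min, y_max)]
    moved = [shift_point(c, shear_x, shear_y, angle_deg) for c in corners]
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    return (min(xs), min(ys), max(xs), max(ys))


# A defect box under the largest shift mentioned in the abstract.
shifted = shift_box((100.0, 50.0, 120.0, 60.0), 0.06, 0.06, 10.0)
```

Taking the envelope of the transformed corners keeps labels axis-aligned, at the cost of slightly looser boxes, which matters for the small targets typical of PCB defects.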

[13] Concept-based Adversarial Attack: a Probabilistic Perspective cs.CV | cs.AI

Andi Zhang, Xuan Ding, Steven McDonagh, Samuel Kaski

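TL;DR: The paper proposes a concept-based adversarial attack framework from a probabilistic perspective that operates on an entire concept rather than a single image, generating diverse adversarial examples while keeping the original concept recognizable.

Details

Motivation: Traditional adversarial attacks are limited to perturbing a single image and do not model the concept as a whole. The authors represent a concept with a probabilistic generative model or a set of images in order to generate more diverse adversarial examples.

Result: Experiments show that concept-based adversarial attacks produce more diverse adversarial examples, effectively preserve the original concept, and achieve higher attack efficiency.

Insight: The work highlights the importance of concept modeling in adversarial attacks and offers a new perspective for designing more robust classifiers.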

Abstract: We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept – represented by a probabilistic generative model or a set of images – to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.

[14] Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models cs.CV

Jiahuan Zhang, Shunwen Bai, Tianheng Wang, Kaiwen Guo, Kai Han

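TL;DR: The paper proposes a new evaluation framework that tests vision-language models (VLMs) on spatial deformation reasoning tasks, revealing the limitations of existing models in 2D-to-3D spatial deformation reasoning.

Details

Motivation: Humans naturally have the spatial reasoning ability to form and manipulate images and structures of objects in space, and researchers are trying to give VLMs similar abilities. It remains unclear, however, whether these models truly understand and manipulate spatial objects.

Result: Experiments show that almost no model demonstrates plausible spatial deformation reasoning; even after targeted training and mainstream reasoning enhancement methods, models still perform poorly on 3D spatial deformation reasoning.

Insight: The paper exposes the limitations of current VLMs in spatial deformation reasoning, shows that a large gap to human ability remains, and provides a benchmark and directions for future work.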

Abstract: Humans naturally possess the spatial reasoning ability to form and manipulate images and structures of objects in space. There is an increasing effort to endow Vision-Language Models (VLMs) with similar spatial reasoning capabilities. However, it remains unclear whether these models truly understand and manipulate spatial objects or not. To address this question, we propose a new evaluation framework aimed at assessing the performance of VLMs in spatial deformation reasoning tasks. Specifically, we construct a benchmark for spatial deformation reasoning from 2D to 3D. Leveraging our data engine, we can generate unlimited evaluation problem pairs with infinite steps, without any data leakage. We explore whether the model can effectively perform spatial deformation reasoning from two directions: forward reasoning (given the operations, find the final state) and reverse reasoning (given the final state, determine the operations). We adopt a ladder competition format, using the number of deformation steps as the level classification criterion, with the goal of exploring the boundaries of the model’s deformation reasoning capabilities. Interestingly, the benchmarking results reveal that almost no model demonstrates plausible spatial deformation reasoning abilities. Furthermore, even after applying targeted training and mainstream reasoning enhancement methods, the models are still unable to perform well on 3D spatial deformation reasoning.

[15] Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers cs.CV | cs.AI | cs.CL | I.4; I.2

Yusuf Shihata

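TL;DR: The paper proposes Gated Recursive Fusion (GRF), a new architecture that achieves deep fusion in multimodal learning with linear complexity, addressing the high computational cost of traditional cross-attention models.

Details

Motivation: Multimodal learning must balance deep fusion against computational scalability. Traditional cross-attention models perform strongly, but their quadratic complexity limits their use when many modalities are involved, so a model that fuses efficiently while keeping linear complexity is needed.

Result: GRF achieves competitive performance on the CMU-MOSI benchmark, and visualizations show that it produces structured, class-separable representations.

Insight: Through its recurrent design and gating mechanism, GRF achieves efficient deep multimodal fusion while keeping linear complexity, offering a new option for high-modality settings.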

Abstract: Multimodal learning faces a fundamental tension between deep, fine-grained fusion and computational scalability. While cross-attention models achieve strong performance through exhaustive pairwise fusion, their quadratic complexity is prohibitive for settings with many modalities. We address this challenge with Gated Recurrent Fusion (GRF), a novel architecture that captures the power of cross-modal attention within a linearly scalable, recurrent pipeline. Our method processes modalities sequentially, updating an evolving multimodal context vector at each step. The core of our approach is a fusion block built on Transformer Decoder layers that performs symmetric cross-attention, mutually enriching the shared context and the incoming modality. This enriched information is then integrated via a Gated Fusion Unit (GFU) a GRU-inspired mechanism that dynamically arbitrates information flow, enabling the model to selectively retain or discard features. This stateful, recurrent design scales linearly with the number of modalities, O(n), making it ideal for high-modality environments. Experiments on the CMU-MOSI benchmark demonstrate that GRF achieves competitive performance compared to more complex baselines. Visualizations of the embedding space further illustrate that GRF creates structured, class-separable representations through its progressive fusion mechanism. Our work presents a robust and efficient paradigm for powerful, scalable multimodal representation learning.
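
The stateful, one-modality-at-a-time update can be sketched with a GRU-style gate. The abstract does not give the Gated Fusion Unit's equations, so the parameter-free gate below, the `gate_bias` knob, and the function names are illustrative assumptions; the symmetric cross-attention block that precedes the gate in the paper is left out entirely.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(context: Sequence[float], incoming: Sequence[float],
                 gate_bias: float = 0.0) -> List[float]:
    """GRU-style blend: the gate z decides, per feature, how much of the
    incoming modality replaces the running context."""
    updated = []
    for c, m in zip(context, incoming):
        z = sigmoid(c + m + gate_bias)  # update gate in (0, 1)
        candidate = math.tanh(m)        # bounded candidate state
        updated.append((1.0 - z) * c + z * candidate)
    return updated


def fuse_modalities(modalities: Sequence[Sequence[float]],
                    gate_bias: float = 0.0) -> List[float]:
    """Fold modalities into a single context vector, one update each."""
    context = [0.0] * len(modalities[0])
    for features in modalities:  # a single pass, so cost grows linearly
        context = gated_update(context, features, gate_bias)
    return context


# Three hypothetical modality embeddings (for example text, audio, video).
fused = fuse_modalities([[0.5, -1.0], [2.0, 0.1], [0.0, 0.3]])
```

Because each modality triggers exactly one update of the shared context, adding a modality adds one step rather than a new set of pairwise interactions, which is the source of the O(n) scaling mentioned in the abstract.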

[16] Leveraging the Structure of Medical Data for Improved Representation Learning cs.CV | cs.LG

Andrea Agostini, Sonia Laguna, Alain Ryser, Samuel Ruiperez-Campillo, Moritz Vandenhirtz

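TL;DR: The paper proposes a self-supervised learning framework that exploits the inherent structure of medical data (such as multi-view chest X-rays) for representation learning without textual supervision, and performs well on small-scale medical datasets.

Details

Motivation: Medical data such as chest X-rays are typically limited in size and sparsely annotated, but have rich internal structure (such as multi-view imaging). The work is motivated by the question of how to use this structure to improve representation learning.

Result: On MIMIC-CXR, the method shows strong performance compared with supervised objectives and with baselines trained without leveraging data structure.

Insight: The internal structure of medical data (such as multi-view imaging) can be a powerful tool for self-supervised learning, especially when data is scarce.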

Abstract: Building generalizable medical AI systems requires pretraining strategies that are data-efficient and domain-aware. Unlike internet-scale corpora, clinical datasets such as MIMIC-CXR offer limited image counts and scarce annotations, but exhibit rich internal structure through multi-view imaging. We propose a self-supervised framework that leverages the inherent structure of medical datasets. Specifically, we treat paired chest X-rays (i.e., frontal and lateral views) as natural positive pairs, learning to reconstruct each view from sparse patches while aligning their latent embeddings. Our method requires no textual supervision and produces informative representations. Evaluated on MIMIC-CXR, we show strong performance compared to supervised objectives and baselines being trained without leveraging structure. This work provides a lightweight, modality-agnostic blueprint for domain-specific pretraining where data is structured but scarce.

Marius Neuhalfen, Jonathan Grzymisch, Manuel Sanchez-Gestido

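TL;DR: The paper proposes the VISY-REVE pipeline, which augments image datasets with novel views synthesized in real time to validate vision-based navigation algorithms, avoiding the complex setup and slow runtime of traditional validation methods. It also introduces the Boresight Deviation Distance, a camera pose distance metric better suited to view synthesis.

Details

Motivation: Traditional validation methods (such as synthetic rendering or robotic testbed acquisition) suffer from complex setup and slow runtime, so a more efficient validation method is needed.

Result: The pipeline enables real-time, robust validation of vision-based navigation algorithms and improves dataset utilization.

Insight: View synthesis can efficiently extend the coverage of a dataset, and the new pose distance metric is better adapted to view synthesis tasks.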

Abstract: This work introduces VISY-REVE: a novel pipeline to validate image processing algorithms for Vision-Based Navigation. Traditional validation methods such as synthetic rendering or robotic testbed acquisition suffer from difficult setup and slow runtime. Instead, we propose augmenting image datasets in real-time with synthesized views at novel poses. This approach creates continuous trajectories from sparse, pre-existing datasets in open or closed-loop. In addition, we introduce a new distance metric between camera poses, the Boresight Deviation Distance, which is better suited for view synthesis than existing metrics. Using it, a method for increasing the density of image datasets is developed.

Guang Yang

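TL;DR: FreqCross is a multi-modal fusion network that combines spatial RGB features, frequency-domain artifacts, and radial energy distribution patterns to detect images generated by Stable Diffusion 3.5. With its three-branch architecture, it reaches 97.8% accuracy on synthetic image detection, 5.2% higher than existing methods.

Details

Motivation: The rapid progress of diffusion models such as Stable Diffusion 3.5 has made synthetic images highly photorealistic, and existing detection methods struggle to keep up, so more robust detection is needed.

Result: The model reaches 97.8% accuracy on a dataset of 10,000 paired real and synthetic images, exceeding baselines by 5.2%. Frequency analysis shows that synthetic images have distinct signatures in the 0.1–0.4 normalised frequency range.

Insight: Images generated by diffusion models have distinguishable spectral signatures in specific frequency bands, and combining multi-modal information can significantly improve detection performance.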

Abstract: The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8% accuracy, outperforming state-of-the-art baselines by 5.2%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1–0.4 normalised frequency range, providing theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.
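
A radial energy profile of the kind mentioned in the abstract can be computed by binning spectral power by distance from the zero-frequency component. The sketch below uses a naive 2D DFT so that it runs with the standard library alone (a real pipeline would use an FFT routine). The number of bins, the normalisation, and the function names are assumptions and not the paper's exact feature definition.

```python
import cmath
import math
from typing import List

Image = List[List[float]]


def dft2_power(img: Image) -> List[List[float]]:
    """Squared magnitude of the 2D DFT (naive O(N^4) version, small inputs only)."""
    rows, cols = len(img), len(img[0])
    power = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for m in range(rows):
                for n in range(cols):
                    angle = -2.0 * math.pi * (u * m / rows + v * n / cols)
                    acc += img[m][n] * cmath.exp(1j * angle)
            power[u][v] = abs(acc) ** 2
    return power


def radial_energy_profile(img: Image, num_bins: int = 4) -> List[float]:
    """Fraction of spectral energy in each ring of normalised radial frequency."""
    rows, cols = len(img), len(img[0])
    power = dft2_power(img)
    max_radius = math.sqrt(0.5)  # corner of the [-0.5, 0.5) frequency square
    profile = [0.0] * num_bins
    for u in range(rows):
        for v in range(cols):
            # Map DFT indices to signed normalised frequencies.
            fu = (u if u < rows / 2 else u - rows) / rows
            fv = (v if v < cols / 2 else v - cols) / cols
            radius = math.hypot(fu, fv)
            idx = min(int(radius / max_radius * num_bins), num_bins - 1)
            profile[idx] += power[u][v]
    total = sum(profile)
    return [p / total for p in profile] if total > 0 else profile


flat = [[1.0] * 4 for _ in range(4)]
checker = [[1.0 if (m + n) % 2 == 0 else -1.0 for n in range(4)] for m in range(4)]
flat_profile = radial_energy_profile(flat)
checker_profile = radial_energy_profile(checker)
```

A constant image puts all of its energy in the lowest ring, while a pixel-level checkerboard puts it in the highest, so the profile acts as a compact summary of where an image's spectral energy sits. A vector like this is the sort of input a small multi-layer perceptron branch could consume.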

[19] Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis cs.CV

Haiqing Li, Yuzhi Guo, Feng Jiang, Thao M. Dang, Hehuan Ma

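TL;DR: TG-MILNet is a non-invasive scoliosis detection method based on gait videos. Its text-guided multi-instance learning network, combined with Dynamic Time Warping and temporal attention, significantly improves detection performance.

Details

Motivation: Traditional scoliosis detection methods (such as X-ray) carry radiation risks and depend on clinical expertise, which makes them hard to use for large-scale screening.

Result: TG-MILNet achieves state-of-the-art performance on the Scoliosis1K dataset, and is particularly strong at handling class imbalance and detecting borderline cases.

Insight: Combining text guidance with multi-instance learning offers a new approach to medical image analysis, and the work underlines the importance of temporal modeling in gait analysis.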

Abstract: Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLM) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at https://github.com/lhqqq/TG-MILNet
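
Dynamic Time Warping, which the abstract uses to handle temporal misalignment between gait sequences, is a standard dynamic programme. The sketch below computes the classic DTW distance between two one-dimensional sequences. The scalar per-frame feature and the absolute-difference cost are assumptions for illustration, and the clustering into gait phases that the paper builds on top of DTW is not shown.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW: minimum cumulative cost of aligning a with b, where each
    element may be matched to one or more elements of the other sequence."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same gait cycle, once at normal speed and once with a slow start.
normal = [0.0, 1.0, 2.0, 1.0, 0.0]
slow_start = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
distance = dtw_distance(normal, slow_start)
```

The two sequences differ in length and timing but trace the same shape, so their DTW distance is zero, whereas a frame-by-frame comparison would not even be defined. This is the property that makes DTW useful for comparing gait cycles recorded at different walking speeds.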

[20] Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification cs.CV | cs.LGPDF

Faisal Ahmed, Mohammad Alfrad Nobel Bhuiyan

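TL;DR: This paper presents the first comparison of two feature extraction methods, Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA), on retinal image classification. XGBoost reaches similar accuracy with both feature sets, suggesting that they capture different structural features of the images.

Details

Motivation: To compare how local texture features (HOG) and global topological features (TDA) perform in medical image classification, particularly for detecting retinal disease.

Result: XGBoost reaches 94.29% (HOG) and 94.18% (TDA) accuracy on the binary classification task, and 74.41% and 74.69% respectively on the multi-class task.

Insight: HOG and TDA each capture different structural features of the image, both provide effective feature representations for medical image classification, and both are suitable for integration into deep learning pipelines.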

Abstract: We present the first comparative study of two fundamentally distinct feature extraction techniques: Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA), for medical image classification using retinal fundus images. HOG captures local texture and edge patterns through gradient orientation histograms, while TDA, using cubical persistent homology, extracts high-level topological signatures that reflect the global structure of pixel intensities. We evaluate both methods on the large APTOS dataset for two classification tasks: binary detection (normal versus diabetic retinopathy) and five-class diabetic retinopathy severity grading. From each image, we extract 26244 HOG features and 800 TDA features, using them independently to train seven classical machine learning models with 10-fold cross-validation. XGBoost achieved the best performance in both cases: 94.29 percent accuracy (HOG) and 94.18 percent (TDA) on the binary task; 74.41 percent (HOG) and 74.69 percent (TDA) on the multi-class task. Our results show that both methods offer competitive performance but encode different structural aspects of the images. This is the first work to benchmark gradient-based and topological features on retinal imagery. The techniques are interpretable, applicable to other medical imaging domains, and suitable for integration into deep learning pipelines.

[21] Markerless Stride Length estimation in Athletic using Pose Estimation with monocular vision cs.CVPDF

Patryk Skorupski, Cosimo Distante, Pier Luigi Mazzeo

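TL;DR: This paper proposes a markerless, computer-vision-based stride length estimation method that combines a pose estimation algorithm with monocular vision for monitoring athletes' training.

Details

Motivation: Traditional stride length measurement relies on markers or manual calculation, which is inefficient and limited. The study aims to develop an automated, markerless computer vision system that gives coaches an efficient performance monitoring tool.

Result: In tests with three different athletes, the system successfully estimated stride length, supporting its usefulness in training.

Insight: Monocular vision combined with pose estimation can measure stride length efficiently without markers, offering a new approach to sports analysis.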

Abstract: Performance measures in athletics, such as stride length and running pace, can be estimated with simple techniques, for example by dividing the distance run by the number of steps or by using markers printed on the track. Monitoring individual performance is essential for helping coaching staff establish a proper training schedule for each athlete. The aim of this paper is to investigate a computer vision-based approach for estimating stride length and speed transition from video sequences and for assessing video analysis processing among athletes. Using well-known image processing methods, such as the probabilistic Hough transform, combined with a human pose detection algorithm, we estimate the leg joint positions of runners. By then applying a homography transformation, we can estimate the runner's stride length. Experiments on various race videos with three different runners demonstrated that the proposed system represents a useful tool for coaching and training. This suggests its potential value in measuring and monitoring the gait parameters of athletes.
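
The homography step can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: it assumes a 3x3 homography `H` that maps image pixels to track-plane coordinates in metres is already known (for example, from track lines), and that the two pixel positions are successive ground contacts of the same foot. All names are hypothetical.

```python
import math


def apply_homography(H, point):
    """Map an image point (x, y) to plane coordinates with a 3x3 homography H."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (xh / w, yh / w)  # divide by w to leave homogeneous coordinates


def stride_length(H, contact_a, contact_b):
    """Distance on the track plane between two foot contacts given in pixels."""
    ax, ay = apply_homography(H, contact_a)
    bx, by = apply_homography(H, contact_b)
    return math.hypot(bx - ax, by - ay)
```

With a pure scaling homography of 1 cm per pixel, two contacts 200 pixels apart correspond to a stride of about 2 m. A real camera view would need a full projective matrix to remove perspective distortion.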

[22] Look-Back: Implicit Visual Re-focusing in MLLM Reasoning cs.CV | cs.LGPDF

Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin

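TL;DR: Look-Back is an implicit visual re-focusing method. An analysis of MLLM attention patterns shows that models can spontaneously re-attend to visual input in the later stages of reasoning. The method requires no explicit injection of visual information and significantly improves the model's reasoning and perception capabilities.

Details

Motivation: Existing MLLMs rely excessively on textual information during reasoning and neglect the integration of visual input. This paper aims to use an implicit method to guide the model to re-attend to visual input on its own, enabling more efficient cross-modal reasoning.

Result: On multiple multimodal benchmarks, Look-Back significantly improves the model's reasoning and perception capabilities.

Insight: MLLMs have an intrinsic capacity for visual fusion reasoning and can achieve efficient cross-modal interaction without explicit constraints.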

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to “look back” at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model’s reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.

[23] Intelligent Histology for Tumor Neurosurgery cs.CVPDF

Xinhai Hou, Akhil Kondepudi, Cheng Jiang, Yiwei Lyu, Samir Harake

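TL;DR: This paper presents an innovative method called Intelligent Histology that combines artificial intelligence (AI) with stimulated Raman histology (SRH) for rapid intraoperative tissue analysis in tumor neurosurgery.

Details

Motivation: Traditional intraoperative pathology workflows such as H&E histology are slow, resource-intensive, and lack real-time digital imaging, so they cannot meet the needs of modern neurosurgery.

Result: Intelligent Histology has demonstrated transformative potential across multiple neurosurgical specialties, including neurosurgical oncology and spine oncology.

Insight: Future work can develop AI foundation models from multi-institutional datasets and incorporate clinical and radiologic data for multimodal learning to predict patient outcomes.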

Abstract: The importance of rapid and accurate histologic analysis of surgical tissue in the operating room has been recognized for over a century. Our standard-of-care intraoperative pathology workflow is based on light microscopy and H&E histology, which is slow, resource-intensive, and lacks real-time digital imaging capabilities. Here, we present an emerging and innovative method for intraoperative histologic analysis, called Intelligent Histology, that integrates artificial intelligence (AI) with stimulated Raman histology (SRH). SRH is a rapid, label-free, digital imaging method for real-time microscopic tumor tissue analysis. SRH generates high-resolution digital images of surgical specimens within seconds, enabling AI-driven tumor histologic analysis, molecular classification, and tumor infiltration detection. We review the scientific background, clinical translation, and future applications of intelligent histology in tumor neurosurgery. We focus on the major scientific and clinical studies that have demonstrated the transformative potential of intelligent histology across multiple neurosurgical specialties, including neurosurgical oncology, skull base, spine oncology, pediatric tumors, and peripheral nerve tumors. Future directions include the development of AI foundation models through multi-institutional datasets, incorporating clinical and radiologic data for multimodal learning, and predicting patient outcomes. Intelligent histology represents a transformative intraoperative workflow that can reinvent real-time tumor analysis for 21st century neurosurgery.

[24] Detection of Rail Line Track and Human Beings Near the Track to Avoid Accidents cs.CV | cs.LG | 68T10 | I.2.10; I.4.8PDF

Mehrab Hosain, Rajiv Kapoor

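TL;DR: This paper proposes a YOLOv5-based method for detecting railway tracks and recognizing people near them, aiming to improve railway safety and prevent accidents by processing real-time video data.

Details

Motivation: Safety is especially important in railway environments, particularly given the potential accidents caused by moving objects such as pedestrians near the track. Existing technology needs more efficient real-time detection methods to improve safety.

Result: In a comprehensive evaluation, the method significantly outperforms existing techniques in accuracy, supporting its effectiveness for railway safety.

Insight: The study indicates that deep learning models combining real-time detection with alerts can significantly improve railway safety and provide an important reference for future accident prevention strategies.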

Abstract: This paper presents an approach for rail line detection and the identification of human beings in proximity to the track, utilizing the YOLOv5 deep learning model to mitigate potential accidents. The technique incorporates real-time video data to identify railway tracks with impressive accuracy and recognizes nearby moving objects within a one-meter range, specifically targeting the identification of humans. This system aims to enhance safety measures in railway environments by providing real-time alerts for any detected human presence close to the track. The integration of a functionality to identify objects at a longer distance further fortifies the preventative capabilities of the system. With a precise focus on real-time object detection, this method is poised to deliver significant contributions to the existing technologies in railway safety. The effectiveness of the proposed method is demonstrated through a comprehensive evaluation, yielding a remarkable improvement in accuracy over existing methods. These results underscore the potential of this approach to revolutionize safety measures in railway environments, providing a substantial contribution to accident prevention strategies.

[25] LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection cs.CV | cs.AI | I.2.10; I.4.8; I.5PDF

Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring

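TL;DR: LATTE proposes a new generated-image detection method that models latent trajectory embeddings across the denoising process rather than single-step reconstruction error, significantly improving detection performance.

Details

Motivation: As the quality of images generated by diffusion models improves, traditional generated-image detection methods struggle to keep up, and more robust detection techniques are needed.

Result: Outperforms baselines on benchmarks such as GenImage and DiffusionFake, and performs strongly in cross-generator and cross-dataset settings.

Insight: Latent trajectory embeddings can capture subtle differences between generated and real images, offering a new approach to generated-image detection.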

Abstract: The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This can erode trust in digital media, making it critical to develop generalizable detectors for generated images. Recent methods leverage diffusion denoising cues, but mainly focus on single-step reconstruction errors, ignoring the inherent sequential nature of the denoising process. In this work, we propose LATTE - Latent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across several denoising timesteps. By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images. Each latent is refined by employing our latent-visual feature refinement module and aggregated into a unified representation. Afterwards, it is fused with the visual features and finally passed into a lightweight classifier. Our experiments demonstrate that LATTE surpasses the baselines on several established benchmarks, such as GenImage and DiffusionFake. Moreover, it demonstrates strong performance in cross-generator and cross-dataset settings, highlighting the potential of using the trajectory of latent embeddings for generated image detection. The code is available at the following link: https://github.com/AnaMVasilcoiu/LATTE-Diffusion-Detector.

[26] Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations cs.CV | cs.CL | cs.LGPDF

Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Yezhou Yang

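TL;DR: This paper proposes a psychological taxonomy for analyzing the hallucination behaviours of vision-language models (VLMs), covering the known sycophancy behaviour and a newly identified authority bias behaviour, and uses the AIpsych benchmark to reveal the relationship between model behaviour and psychological tendencies.

Details

Motivation: Existing research mostly explains VLM hallucination through technical or externally driven factors, overlooking the possibility that it mirrors human cognitive biases. This paper re-examines the problem from a psychological perspective.

Result: Experiments show that as model size increases, sycophantic tendencies strengthen while authority bias decreases; a human subject study confirms key differences between VLM and human behaviour.

Insight: Hallucination behaviour may reflect human-like psychological biases, psychological principles are valuable for model evaluation, and scaling up may come at the cost of response integrity.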

Abstract: Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors and may neglect the possibility that hallucination behaviours mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs’ hallucination behaviours, including sycophancy, logical inconsistency, and a newly identified VLM behaviour: authority bias. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation. The benchmark is available at https://github.com/lxrswdd/AIpsych.

[27] A Vision-Based Closed-Form Solution for Measuring the Rotation Rate of an Object by Tracking One Point cs.CVPDF

Daniel Raviv, Juan D. Yepes, Eiki M. Martinson

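TL;DR: This paper proposes a vision-based closed-form solution that measures the rotation rate of a rigid body by tracking a single point on it, without relying on the object's shape or prior knowledge of the scene.

Details

Motivation: To remove the need for multi-feature tracking or complex models in traditional methods: tracking a single feature point is enough to measure the rotation rate accurately, which suits real-time processing.

Result: Simulations and real video data validate the method, which accurately measures the rotation rate and distinguishes between different rigid bodies.

Insight: The core insight is that the rotation rate can be computed from a single feature point, which simplifies traditional approaches and suits real-time applications and scene segmentation.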

Abstract: We demonstrate that, under orthographic projection and with a camera fixated on a point located on a rigid body, the rotation of that body can be analytically obtained by tracking only one other feature in the image. With some exceptions, any tracked point, regardless of its location on the body, yields the same value of the instantaneous rotation rate. The proposed method is independent of the shape of the 3D object and does not require a priori knowledge about the scene. This algorithm is suited for parallel processing and can achieve segmentation of the scene by distinguishing points that do not belong to the same rigid body, simply because they do not produce the same value of the rotation. This paper presents an analytical derivation, simulation results, and results from real video data.

[28] Subject Invariant Contrastive Learning for Human Activity Recognition cs.CV | cs.LGPDF

Yavuz Yarici, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

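TL;DR: The paper proposes SICL, a self-supervised method for human activity recognition (HAR) that re-weights negative pairs drawn from the same subject to suppress subject-specific information and improve generalization to new subjects.

Details

Motivation: HAR sensor signals undergo significant domain shift because of differences between subjects, so traditional contrastive learning methods generalize poorly to unseen subjects.

Result: On three public datasets (UTD-MHAD, MMAct, DARai), SICL improves performance by up to 11% over traditional contrastive learning methods.

Insight: SICL applies beyond the self-supervised setting and extends to multimodal and supervised learning frameworks, showing broad applicability.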

Abstract: The high cost of annotating data makes self-supervised approaches, such as contrastive learning methods, appealing for Human Activity Recognition (HAR). Effective contrastive learning relies on selecting informative positive and negative samples. However, HAR sensor signals are subject to significant domain shifts caused by subject variability. These domain shifts hinder model generalization to unseen subjects by embedding subject-specific variations rather than activity-specific features. As a result, human activity recognition models trained with contrastive learning often struggle to generalize to new subjects. We introduce Subject-Invariant Contrastive Learning (SICL), a simple yet effective loss function to improve generalization in human activity recognition. SICL re-weights negative pairs drawn from the same subject to suppress subject-specific cues and emphasize activity-specific information. We evaluate our loss function on three public benchmarks: UTD-MHAD, MMAct, and DARai. We show that SICL improves performance by up to 11% over traditional contrastive learning methods. Additionally, we demonstrate the adaptability of our loss function across various settings, including multiple self-supervised methods, multimodal scenarios, and supervised learning frameworks.
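
The idea of re-weighting same-subject negatives can be sketched on top of a standard InfoNCE loss. The abstract does not give the SICL formula, so the weighting form below, the default weight, and every name are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_info_nce(anchor, positive, negatives, neg_subjects, anchor_subject,
                      same_subject_weight=2.0, temperature=0.1):
    """InfoNCE loss for one anchor with re-weighted same-subject negatives.

    same_subject_weight is a hypothetical knob: a value of 1.0 recovers plain
    InfoNCE, and other values change how strongly negatives that share the
    anchor's subject are pushed away.
    """
    pos_term = math.exp(cosine(anchor, positive) / temperature)
    neg_term = 0.0
    for vec, subject in zip(negatives, neg_subjects):
        weight = same_subject_weight if subject == anchor_subject else 1.0
        neg_term += weight * math.exp(cosine(anchor, vec) / temperature)
    return -math.log(pos_term / (pos_term + neg_term))
```

Setting the weight above 1.0 increases the penalty for embedding same-subject, different-activity samples close together, which is one plausible way to discourage the encoder from relying on subject identity.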

[29] LACONIC: A 3D Layout Adapter for Controllable Image Creation cs.CVPDF

Léopold Maillard, Tom Durand, Adrien Ramanana Rahary, Maks Ovsjanikov

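TL;DR: This paper proposes LACONIC, a method that plugs an adapter into a pretrained text-to-image diffusion model to enable 3D layout control, improving 3D consistency and supporting multi-modal image generation and editing.

Details

Motivation: Existing generative methods usually rely on 2D controls such as images or text and struggle to keep a scene's 3D geometric structure consistent. The paper aims to solve this by giving the model 3D awareness.

Result: The method performs well with a modest amount of data, generalizes well, generates semantically rich and 3D-consistent images, and supports flexible editing.

Insight: Incorporating 3D layout information into a generative model can markedly improve the geometric consistency and editing flexibility of generated images while keeping the model lightweight, broadening the range of applications.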

Abstract: Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect the consistent three-dimensional geometric structure underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable amount of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.

[30] Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders cs.CV | cs.AIPDF

Song Mao, Yang Chen, Pinglong Cai, Ding Wang, Guohang Yan

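TL;DR: The paper studies redundancy in multimodal large language models that use multiple vision encoders, proposes two metrics, CUR and IG, to quantify it, and shows experimentally that current multi-encoder designs are inefficient.

Details

Motivation: Multimodal large language models often adopt multiple vision encoders to improve visual understanding, but experiments show that the gains are limited and performance can even drop, indicating encoder redundancy. The study aims to analyze this phenomenon systematically and propose solutions.

Result: Experiments show that some vision encoders contribute very little, or even negatively, to model performance, confirming that redundancy is widespread. The proposed metrics can serve as diagnostic tools for optimizing multimodal architecture design.

Insight: Current multi-encoder designs are inefficient; the new CUR and IG metrics help optimize model architectures and avoid the performance degradation caused by redundant encoders.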

Abstract: Multimodal Large Language Models (MLLMs) increasingly adopt multiple vision encoders to capture diverse visual information, ranging from coarse semantics to fine-grained details. While this approach is intended to enhance visual understanding capability, we observe that the performance gains from adding encoders often diminish and can even lead to performance degradation, a phenomenon we term encoder redundancy. This paper presents a systematic investigation into this issue. Through comprehensive ablation studies on state-of-the-art multi-encoder MLLMs, we empirically demonstrate that significant redundancy exists. To quantify each encoder’s unique contribution, we propose a principled metric: the Conditional Utilization Rate (CUR). Building on CUR, we introduce the Information Gap (IG) to capture the overall disparity in encoder utility within a model. Our experiments reveal that certain vision encoders contribute little, or even negatively, to overall performance, confirming substantial redundancy. These findings highlight critical inefficiencies in current multi-encoder designs and establish that our proposed metrics can serve as valuable diagnostic tools for developing more efficient and effective multimodal architectures.

[31] Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification cs.CVPDF

Xinyue Xin, Ming Li, Yan Wu, Xiang Li, Peng Zhang

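TL;DR: This paper proposes SKDNet-SSR, a network that combines statistical-based dynamic sample rectification with dual-frequency gate-selected knowledge distillation for classifying dual-frequency polarimetric synthetic aperture radar (PolSAR) images, addressing the negative effect of regional consistency on information learning and the use of dual-frequency data.

Details

Motivation: Collaborative classification of dual-frequency PolSAR images is an important but challenging research direction. The main difficulties are the negative effect of regional consistency on classification information learning and the rational use of dual-frequency data.

Result: Experiments on four measured dual-frequency PolSAR datasets show that SKDNet-SSR outperforms other related methods.

Insight: Dynamically rectifying samples with statistical methods and using knowledge distillation for complementary learning on dual-frequency data can effectively improve PolSAR image classification performance.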

Abstract: The collaborative classification of dual-frequency PolSAR images is a meaningful but challenging research topic. The effect of regional consistency on classification information learning and the rational use of dual-frequency data are the two main difficulties for dual-frequency collaborative classification. To tackle these problems, a selected knowledge distillation network with statistical-based sample rectification (SKDNet-SSR) is proposed in this article. First, in addition to applying CNN and ViT as local and global feature extractors, a statistical-based dynamic sample rectification (SDSR) module is designed to avoid the impact of poor regional consistency on the spatial information learning process. Specifically, based on the fact that the PolSAR covariance matrix conforms to the complex Wishart distribution, SDSR first dynamically evaluates the sample purity, and then performs pixel selection and pixel generation to remove noisy pixels, thereby avoiding feature interaction between informative pixels and noisy pixels and improving the classification feature extraction process. Next, a dual-frequency gate-selected distillation (DGSD) module is constructed to emphasize the advantages of different frequency bands and perform complementary learning on dual-frequency data. It uses the dominant single-frequency branch on each sample as the teacher model to train the dual-frequency student model, enabling the student model to learn the optimal results and realizing complementary utilization of dual-frequency data on different terrain objects. Comprehensive experiments on four measured dual-frequency PolSAR datasets demonstrate that the proposed SKDNet-SSR outperforms other related methods.

[32] ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization cs.CV | cs.LGPDF

Haosheng Gan, Berk Tinaz, Mohammad Shahab Sepehri, Zalan Fabian, Mahdi Soltanolkotabi

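TL;DR: ConceptMix++ proposes a framework that improves the fairness of text-to-image (T2I) benchmarking through iterative prompt optimization, revealing that existing evaluation methods may underestimate models' true generative capabilities.

Details

Motivation: Existing T2I evaluation methods rely on fixed prompts, which may underestimate models' generative capabilities and lead to unfair model comparisons.

Result: Experiments show that optimized prompts significantly improve models' compositional generation performance and reveal model potential that the original evaluation underestimated.

Insight: The benefit of prompt optimization varies markedly across visual concepts such as spatial relationships and shapes, suggesting that existing evaluation methods may systematically underestimate model performance in these categories.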

Abstract: Current text-to-image (T2I) benchmarks evaluate models on rigid prompts, potentially underestimating true generative capabilities due to prompt sensitivity and creating biases that favor certain models while disadvantaging others. We introduce ConceptMix++, a framework that disentangles prompt phrasing from visual generation capabilities by applying iterative prompt optimization. Building on ConceptMix, our approach incorporates a multimodal optimization pipeline that leverages vision-language model feedback to refine prompts systematically. Through extensive experiments across multiple diffusion models, we show that optimized prompts significantly improve compositional generation performance, revealing previously hidden model capabilities and enabling fairer comparisons across T2I models. Our analysis reveals that certain visual concepts – such as spatial relationships and shapes – benefit more from optimization than others, suggesting that existing benchmarks systematically underestimate model performance in these categories. Additionally, we find strong cross-model transferability of optimized prompts, indicating shared preferences for effective prompt phrasing across models. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities, while our framework provides more accurate assessment and insights for future development.

[33] NOVO: Unlearning-Compliant Vision Transformers cs.CVPDF

Soumya Roy, Soumya Banerjee, Vinay Verma, Soumik Dasgupta, Deepak Gupta

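TL;DR: NOVO is a vision transformer architecture that requires no fine-tuning for unlearning: it simulates unlearning during training and can remove information about specific classes on the fly while avoiding performance degradation.

Details

Motivation: Existing machine unlearning methods rely on fine-tuning, which is costly and can degrade performance; an efficient unlearning solution that needs no fine-tuning is needed.

Result: Experiments show that NOVO outperforms both fine-tuning-based and fine-tuning-free methods across various datasets, architectures, and resolutions.

Insight: Embedding an unlearning mechanism in the model design enables efficient selective forgetting without hurting performance.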

Abstract: Machine unlearning (MUL) refers to the problem of making a pre-trained model selectively forget some training instances or class(es) while retaining performance on the remaining dataset. Existing MUL research involves fine-tuning using a forget and/or retain set, making it expensive and/or impractical, and often causing performance degradation in the unlearned model. We introduce NOVO, an unlearning-aware vision transformer-based architecture that can directly perform unlearning for future unlearning requests without any fine-tuning over the requested set. The proposed model is trained by simulating unlearning during the training process itself. It involves randomly separating class(es)/sub-class(es) present in each mini-batch into two disjoint sets: a proxy forget-set and a retain-set, and the model is optimized so that it is unable to predict the forget-set. Forgetting is achieved by withdrawing keys, making unlearning on-the-fly and avoiding performance degradation. The model is trained jointly with learnable keys and original weights, ensuring withholding a key irreversibly erases information, validated by membership inference attack scores. Extensive experiments on various datasets, architectures, and resolutions confirm NOVO’s superiority over both fine-tuning-free and fine-tuning-based methods.

[34] MolVision: Molecular Property Prediction with Vision Language Models cs.CVPDF

Deepan Adak, Yogesh Singh Rawat, Shruti Vyas

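TL;DR: MolVision proposes a multimodal approach that uses vision-language models (VLMs) and combines molecular structure images with textual descriptions, significantly improving molecular property prediction.

Details

Motivation: Traditional molecular property prediction methods rely mainly on textual representations such as SMILES/SELFIES, which can be ambiguous and lack structural information. MolVision addresses this by introducing visual information.

Result: The results show that visual information alone is insufficient, but multimodal fusion significantly improves generalization; combining it with LoRA fine-tuning improves performance further.

Insight: The complementarity of visual information and textual descriptions is key to improving molecular property prediction, and efficient fine-tuning strategies such as LoRA are particularly important in multimodal learning.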

Abstract: Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adapting the vision encoder to molecular images in conjunction with LoRA further improves performance. The code and data are available at https://molvision.github.io/MolVision/.

[35] Zero-shot Inexact CAD Model Alignment from a Single Image cs.CVPDF

Pattaramanee Arsomngern, Sasikarn Khwanmuang, Matthias Nießner, Supasorn Suwajanakorn

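TL;DR: The paper proposes a weakly supervised 9-DoF alignment method for matching inexact 3D models to a single image; it requires no pose annotations and generalizes to unseen categories.

Details

Motivation: Existing methods rely on supervised training with images and pose annotations, which limits the range of objects they can handle. The paper addresses this with a weakly supervised method that supports a broader set of categories.

Result: On the ScanNet25k dataset, the method improves mean alignment accuracy by 4.3% over weakly supervised baselines and is the only weakly supervised approach to surpass the supervised ROCA; on the new SUN2CAD dataset, it achieves SOTA results without training on those categories.

Insight: Weakly supervised methods can achieve effective 3D alignment using foundation features and self-supervised techniques, with strong generalization, advancing single-image 3D reconstruction.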

Abstract: One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensure multi-view consistency and overcome symmetry ambiguities inherent in foundation features using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.

[36] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection cs.CVPDF

Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Yaqi Wang, Chengfeng Zhou

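TL;DR: This paper proposes CPKD, a new generative framework for surgical phase recognition in endoscopic submucosal dissection that combines denoising diffusion principles with clinical prior knowledge and performs better than or comparably to existing methods.

Details

Motivation: Gastrointestinal malignancies are a leading cause of cancer-related death, and reliable surgical phase recognition is the bottleneck for clinical adoption of computer-assisted systems in endoscopic submucosal dissection (ESD). Existing methods rely on multi-stage refinement architectures and leave room for improvement.

Result: Evaluations on ESD820, Cholec80, and multi-center datasets show that CPKD performs better than or comparably to state-of-the-art methods, validating its effectiveness.

Insight: Diffusion-based generative models can be used for surgical phase recognition, and introducing clinical prior knowledge markedly improves logical consistency, offering a new approach for complex endoscopic workflows.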

Abstract: Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection (ESD) has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center datasets demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.

[37] Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model cs.CV | cs.AIPDF

Wooseok Shin, Jisu Kang, Hyeonki Jeong, Jin Sob Kim, Sung Won Han

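TL;DR: This paper proposes SemiOVS, a new semi-supervised semantic segmentation framework that uses an open-vocabulary segmentation model to make effective use of out-of-distribution unlabeled images, significantly improving model performance.

Details

Motivation: In real-world scenarios, the many available unlabeled images may come from a different distribution than the target dataset (OOD). Using them directly for semi-supervised learning can produce inaccurate pseudo-labels and hurt training.

Result: SemiOVS achieves SOTA results on the Pascal VOC and Context datasets; in particular, in the 92-label setting on Pascal VOC it outperforms PrevMatch and SemiVL by 3.5 and 3.0 mIoU, respectively.

Insight: Pseudo-labeling OOD unlabeled images with an open-vocabulary model can significantly improve performance in few-label scenarios, offering a new approach for practical applications.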

Abstract: In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications. The code is available at https://github.com/wooseok-shin/SemiOVS
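
The gains above are reported in mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch over flat lists of per-pixel class labels. It is a simplification: benchmark implementations typically accumulate a confusion matrix over the whole dataset and handle ignore labels, and the function name is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For predictions `[0, 0, 1, 1]` against ground truth `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Reported mIoU values are usually this quantity expressed as a percentage.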

[38] Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations cs.CV

Hai Huang, Yan Xia, Sashuai Zhou, Hanting Wang, Shulei Wang

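TL;DR: This paper proposes a new method that uses unified representations to adapt domain generalization methods to multimodal domain generalization (MMDG); through synchronized multimodal improvements and disentanglement of modality information, it substantially improves generalization to unseen target domains.

Details

Motivation: Traditional domain generalization (DG) methods mainly target single-modal data and are hard to apply directly to multimodal settings, because differences between modalities can lead to inconsistent generalization directions. The key challenge in MMDG is maintaining cross-modal consistency in unseen target domains.

Result: Experiments on benchmark datasets, including EPIC-Kitchens and Human-Animal-Cartoon, show that the method performs well on multimodal domain generalization and significantly outperforms traditional methods.

Insight: Unified representations and modality-information disentanglement effectively address cross-modal inconsistency in multimodal domain generalization, offering a new approach to model generalization in multimodal tasks.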

Abstract: Domain Generalization (DG) aims to enhance model robustness in unseen or distributionally shifted target domains through training exclusively on source domains. Although existing DG techniques, such as data manipulation, learning strategies, and representation learning, have shown significant progress, they predominantly address single-modal data. With the emergence of numerous multi-modal datasets and increasing demand for multi-modal tasks, a key challenge in Multi-modal Domain Generalization (MMDG) has emerged: enabling models trained on multi-modal sources to generalize to unseen target distributions within the same modality set. Due to the inherent differences between modalities, directly transferring methods from single-modal DG to MMDG typically yields sub-optimal results. These methods often exhibit randomness during generalization due to the invisibility of target domains and fail to consider inter-modal consistency. Applying these methods independently to each modality in the MMDG setting before combining them can lead to divergent generalization directions across different modalities, resulting in degraded generalization capabilities. To address these challenges, we propose a novel approach that leverages Unified Representations to map different paired modalities together, effectively adapting DG methods to MMDG by enabling synchronized multi-modal improvements within the unified space. Additionally, we introduce a supervised disentanglement framework that separates modal-general and modal-specific information, further enhancing the alignment of unified representations. Extensive experiments on benchmark datasets, including EPIC-Kitchens and Human-Animal-Cartoon, demonstrate the effectiveness and superiority of our method in enhancing multi-modal domain generalization.

[39] MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion cs.CV

Peilin Tao, Hainan Cui, Diantao Tu, Shuhan Shen

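TL;DR: The paper proposes MGSfM, a new global motion averaging framework designed for multi-camera systems; with a decoupled rotation averaging module and a hybrid translation averaging module, it substantially improves the robustness and efficiency of SfM.

Details

Motivation: The fixed relative pose constraints of multi-camera systems benefit SfM, but traditional global SfM lacks robustness because of its optimization framework, so a more efficient solution is needed.

Result: On large-scale datasets, MGSfM matches or exceeds the accuracy of incremental SfM while significantly improving efficiency, making it a robust solution for multi-camera SfM.

Insight: 1. The fixed constraints of multi-camera systems can substantially improve SfM; 2. Hierarchical and hybrid strategies are key to improving global SfM performance.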

Abstract: Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at https://github.com/3dv-casia/MGSfM/.

[40] Personalized Image Generation from an Author Writing Style cs.CV | cs.AI

Sagar Gandhi, Vishal Gandhi

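TL;DR: This paper proposes a novel pipeline that takes structured summaries of an author's writing style (Author Writing Sheets, AWS), uses a large language model to generate text prompts from them, and renders personalized images with a diffusion model. Experiments show it is effective in terms of visual style match and distinctiveness.

Details

Motivation: Translating an author's literary style into visual form is an emerging challenge; the paper aims to achieve this with generative AI, supporting applications in creative assistance and cross-modal understanding.

Result: The generated images score well on style match (4.08/5) and moderately on visual distinctiveness; the pipeline captures mood and atmosphere effectively but is limited in expressing abstract elements.

Insight: The paper demonstrates the potential of text-to-image generation for personalization while pointing out the challenge of visualizing abstract content, providing direction for follow-up research.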

Abstract: Translating nuanced, textually-defined authorial writing styles into compelling visual representations presents a novel challenge in generative AI. This paper introduces a pipeline that leverages Author Writing Sheets (AWS) - structured summaries of an author’s literary characteristics - as input to a Large Language Model (LLM, Claude 3.7 Sonnet). The LLM interprets the AWS to generate three distinct, descriptive text-to-image prompts, which are then rendered by a diffusion model (Stable Diffusion 3.5 Medium). We evaluated our approach using 49 author styles from Reddit data, with human evaluators assessing the stylistic match and visual distinctiveness of the generated images. Results indicate a good perceived alignment between the generated visuals and the textual authorial profiles (mean style match: $4.08/5$), with images rated as moderately distinctive. Qualitative analysis further highlighted the pipeline’s ability to capture mood and atmosphere, while also identifying challenges in representing highly abstract narrative elements. This work contributes a novel end-to-end methodology for visual authorial style personalization and provides an initial empirical validation, opening avenues for applications in creative assistance and cross-modal understanding.

[41] Source-Free Domain Adaptation via Multi-view Contrastive Learning cs.CV | cs.AI

Amirfarhad Farhadi, Naser Mozayani, Azadeh Zamanifar

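TL;DR: This paper proposes a source-free unsupervised domain adaptation (SFUDA) method, which requires no source-domain data; it uses multi-view contrastive learning and a reliable sample selection module to address low-quality prototype samples and incorrectly assigned pseudo-labels, significantly improving classification performance.

Details

Motivation: In real-world settings, privacy concerns restrict access to source-domain data, while existing domain adaptation methods rely on source-domain labels. SFUDA can handle domain discrepancies but faces two major challenges: low-quality prototype samples and incorrect pseudo-label assignment.

Result: On the VisDA 2017, Office-Home, and Office-31 datasets, classification accuracy improves by about 2% over the second-best method and by about 6% over the average of 13 state-of-the-art methods.

Insight: By jointly applying contrastive learning and sample selection, SFUDA can effectively mitigate key challenges in domain adaptation, offering a viable solution for privacy-constrained scenarios.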

Abstract: Domain adaptation has become a widely adopted approach in machine learning due to the high costs associated with labeling data. It is typically applied when access to a labeled source domain is available. However, in real-world scenarios, privacy concerns often restrict access to sensitive information, such as fingerprints, bank account details, and facial images. A promising solution to this issue is Source-Free Unsupervised Domain Adaptation (SFUDA), which enables domain adaptation without requiring access to labeled target domain data. Recent research demonstrates that SFUDA can effectively address domain discrepancies; however, two key challenges remain: (1) the low quality of prototype samples, and (2) the incorrect assignment of pseudo-labels. To tackle these challenges, we propose a method consisting of three main phases. In the first phase, we introduce a Reliable Sample Memory (RSM) module to improve the quality of prototypes by selecting more representative samples. In the second phase, we employ a Multi-View Contrastive Learning (MVCL) approach to enhance pseudo-label quality by leveraging multiple data augmentations. In the final phase, we apply a noisy label filtering technique to further refine the pseudo-labels. Our experiments on three benchmark datasets - VisDA 2017, Office-Home, and Office-31 - demonstrate that our method achieves approximately 2 percent and 6 percent improvements in classification accuracy over the second-best method and the average of 13 well-known state-of-the-art approaches, respectively.

Zhao Wang, Bowen Chen, Yotaro Shimose, Sota Moriyama, Heng Wang

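TL;DR: MIMO is a reflective multimodal agent framework for automatically generating ad banners; it combines a hierarchical multimodal agent system with an iterative refinement loop and significantly outperforms existing methods.

Details

Motivation: Commercial ad design requires high quality, structured layouts, and consistent brand elements; existing generative models can produce high-quality images but struggle to meet these requirements.

Result: Experiments show that MIMO significantly outperforms diffusion-based and LLM-based baselines in real-world banner design scenarios.

Insight: An agentic framework with iterative refinement can substantially improve generative models on complex commercial design tasks.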

Abstract: Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity – they require structured layouts, precise typography, consistent branding, and more. In this paper, we introduce MIMO (Mirror In-the-Model), an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural language based prompt and logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion and LLM-based baselines in real-world banner design scenarios.

[43] De-Fake: Style based Anomaly Deepfake Detection cs.CV | cs.AI

Sudev Kumar Padhi, Harshit Kumar, Umesh Kashyap, Sk. Subidh Ali

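TL;DR: This paper proposes SafeVision, a deepfake detection method based on style anomaly detection that focuses on face-swap deepfakes and provides efficient detection while preserving privacy.

Details

Motivation: As deepfake technology spreads, face swapping is widely used for illicit purposes such as spreading false information and damaging reputations; existing detection methods based on pixel-level features or facial landmarks struggle to detect it, and data privacy is a serious concern.

Result: Experiments show that SafeVision effectively detects face-swap forgeries across diverse scenarios, with high reliability and scalability.

Insight: Style features are a new dimension for deepfake detection; combined with privacy-preserving mechanisms, they offer a new approach to forgery detection in real-world settings.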

Abstract: Detecting deepfakes involving face-swaps presents a significant challenge, particularly in real-world scenarios where anyone can perform face-swapping with freely available tools and apps without any technical knowledge. Existing deepfake detection methods rely on facial landmarks or inconsistencies in pixel-level features and often struggle with face-swap deepfakes, where the source face is seamlessly blended into the target image or video. The prevalence of face-swap is evident in everyday life, where it is used to spread false information, damage reputations, manipulate political opinions, create non-consensual intimate deepfakes (NCID), and exploit children by enabling the creation of child sexual abuse material (CSAM). Even prominent public figures are not immune to its impact, with numerous deepfakes of them circulating widely across social media platforms. Another challenge faced by deepfake detection methods is the creation of datasets that encompass a wide range of variations, as training models require substantial amounts of data. This raises privacy concerns, particularly regarding the processing and storage of personal facial data, which could lead to unauthorized access or misuse. Our key idea is to identify these style discrepancies to detect face-swapped images effectively without accessing the real facial image. We perform comprehensive evaluations using multiple datasets and face-swapping methods, which showcases the effectiveness of SafeVision in detecting face-swap deepfakes across diverse scenarios. SafeVision offers a reliable and scalable solution for detecting face-swaps in a privacy preserving manner, making it particularly effective in challenging real-world applications. To the best of our knowledge, SafeVision is the first deepfake detection using style features while providing inherent privacy protection.

[44] DESign: Dynamic Context-Aware Convolution and Efficient Subnet Regularization for Continuous Sign Language Recognition cs.CV | cs.AI

Sheng Liu, Yiheng Yu, Yuan Feng, Min Xu, Zhelun Jin

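TL;DR: The paper proposes DESign, a framework for continuous sign language recognition (CSLR) that uses Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization CTC (SR-CTC) to capture temporal and contextual dependencies, significantly improving recognition performance.

Details

Motivation: Current CSLR methods struggle with diverse samples, and existing dynamic convolution methods focus on spatial modeling while neglecting temporal dynamics and contextual dependencies, which limits performance.

Result: DESign is validated on the PHOENIX14, PHOENIX14-T, and CSL-Daily datasets, where it significantly outperforms existing methods.

Insight: Dynamically adapting convolutional weights and regularizing CTC alignment paths effectively improve the generalization and robustness of CSLR models.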

Abstract: Current continuous sign language recognition (CSLR) methods struggle with handling diverse samples. Although dynamic convolutions are ideal for this task, they mainly focus on spatial modeling and fail to capture the temporal dynamics and contextual dependencies. To address this, we propose DESign, a novel framework that incorporates Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that constitute signs and uniquely adapts convolutional weights in a fine-grained manner based on contextual information, enabling the model to better generalize across diverse signing behaviors and boost recognition accuracy. Furthermore, we observe that existing methods still rely on only a limited number of frames for parameter updates during training, indicating that CTC learning overfits to a dominant path. To address this, SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting. A classifier-sharing strategy in SR-CTC further strengthens multi-scale consistency. Notably, SR-CTC introduces no inference overhead and can be seamlessly integrated into existing CSLR models to boost performance. Extensive ablations and visualizations further validate the effectiveness of the proposed methods. Results on mainstream CSLR datasets (i.e., PHOENIX14, PHOENIX14-T, CSL-Daily) demonstrate that DESign achieves state-of-the-art performance.

[45] Be the Change You Want to See: Revisiting Remote Sensing Change Detection Practices cs.CV | cs.AI

Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc

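TL;DR: This paper revisits the design space of remote sensing change detection and finds that fundamental design choices (such as backbone, pre-training strategy, and training configuration) can improve performance even more than new architectural components. With these choices optimized, even a simple architecture can match or surpass SOTA performance.

Details

Motivation: Existing methods rely heavily on new architectural components while overlooking the potential of fundamental design choices. The paper explores how these choices affect performance and provides guidance for future methods.

Result: The optimized simple model matches or surpasses SOTA performance on six datasets, and the design guidelines also benefit other methods.

Insight: Optimizing core design choices is just as important as architectural innovation, providing a practical design foundation for future research.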

Abstract: Remote sensing change detection aims to localize semantic changes between images of the same location captured at different times. In the past few years, newer methods have attributed enhanced performance to the additions of new and complex components to existing architectures. Most fail to measure the performance contribution of fundamental design choices such as backbone selection, pre-training strategies, and training configurations. We claim that such fundamental design choices often improve performance even more significantly than the addition of new architectural components. Due to that, we systematically revisit the design space of change detection models and analyse the full potential of a well-optimised baseline. We identify a set of fundamental design choices that benefit both new and existing architectures. Leveraging this insight, we demonstrate that when carefully designed, even an architecturally simple model can match or surpass state-of-the-art performance on six challenging change detection datasets. Our best practices generalise beyond our architecture and also offer performance improvements when applied to related methods, indicating that the space of fundamental design choices has been underexplored. Our guidelines and architecture provide a strong foundation for future methods, emphasizing that optimizing core components is just as important as architectural novelty in advancing change detection performance. Code: https://github.com/blaz-r/BTC-change-detection

[46] Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos cs.CV

Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai

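TL;DR: MTID is a new diffusion-based method that addresses procedure planning in instructional videos through latent-space temporal interpolation and a task-aware masking mechanism, generating temporally coherent action sequences.

Details

Motivation: Procedure planning in instructional videos requires generating coherent, task-aligned action sequences from start and end visual observations. Existing methods rely on text-level supervision and struggle to capture the complex temporal relationships among actions.

Result: Experiments on three benchmark datasets show that MTID outperforms existing methods on most metrics, achieving better action planning performance.

Insight: By combining temporal interpolation with a task-aware masking mechanism, MTID better models temporal relationships among actions and generates sequences that better fit task requirements, offering a new approach to procedure planning.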

Abstract: In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model’s capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID.

[47] Unlearning the Noisy Correspondence Makes CLIP More Robust cs.CV | cs.MM

Haochen Han, Alex Jinpeng Wang, Peijun Ye, Fangming Liu

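TL;DR: The paper proposes the NCU framework, which improves robustness by unlearning the noisy correspondences (NC) that a pre-trained vision-language model (VLM) has learned.

Details

Motivation: As the data used for vision-language models grows rapidly, data quality and noisy correspondence (NC) have become problems that cannot be ignored. Traditional methods are too resource-intensive to meet practical needs, so a more efficient way to directly remove the effects of NC from pre-trained models is needed.

Result: Validated on the CLIP model, NCU surpasses a robust pre-training method on zero-shot transfer while requiring lower computational overhead.

Insight: Directly unlearning noisy knowledge from a pre-trained model is an efficient and practical new way to improve model robustness.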

Abstract: The data appetite for Vision-Language Models (VLMs) has continuously scaled up from the early millions to billions today, which faces an untenable trade-off with data quality and inevitably introduces Noisy Correspondence (NC) samples. Undoubtedly, such semantically unrelated data significantly impairs the performance of VLMs. Previous efforts mainly address this challenge by estimating refined alignment for more precise guidance. However, such resource-intensive pipelines that train VLMs from scratch struggle to meet realistic data demands. In this paper, we present a brand new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. Specifically, we propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs’ robustness by forgetting learned noisy knowledge. The key to NCU is learning the hardest negative information, which can provide explicit unlearning direction for both false positives and false negatives. Such twin goals unlearning process can be formalized into one unified optimal transport objective for fast fine-tuning. We validate our approach with the prevailing CLIP model over various downstream tasks. Remarkably, NCU surpasses the robust pre-trained method on zero-shot transfer while with lower computational overhead. The code will be released upon acceptance.

[48] Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach cs.CV | cs.AI

Leyan Xue, Zongbo Han, Guangyu Wang, Qinghua Hu, Mingyue Cheng

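TL;DR: Using stochastic multi-crop augmentation, the paper addresses CLIP's over-reliance on global image patterns at the expense of local details in vision-language tasks, proposing a simple and effective solution called "D&D".

Details

Motivation: Traditional prompt engineering relies mainly on coarse-grained category labels and ignores fine-grained local semantics. CLIP has a strong bias toward global image patterns, which makes it hard for the model to process localized visual descriptors.

Result: D&D performs well under zero-shot, few-shot, and test-time adaptation settings.

Insight: CLIP's bias toward global patterns is one of its performance bottlenecks; constraining its receptive field can substantially improve its handling of local details.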

Abstract: Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP’s strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to “See Both the Forest and the Trees.” Specifically, we employ stochastic multi-crop augmentation to activate CLIP’s latent capacity for localized feature analysis. By cropping only partial regions, the approach effectively constrains the model’s receptive field and recalibrates its attention mechanism, thereby mitigating its inherent bias. We evaluate the proposed method under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that D&D achieves promising performance.
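
To make the idea of stochastic multi-crop augmentation concrete, here is a minimal sketch that samples random partial regions of an image and averages per-crop embeddings. It is only an illustration under stated assumptions: the helper names `sample_crop_boxes` and `average_embeddings`, the crop-scale range, and the use of a plain mean to combine crops are all hypothetical, since the abstract does not say how D&D sizes or aggregates its crops.

```python
import random


def sample_crop_boxes(width, height, num_crops=8, scale=(0.3, 0.9), seed=None):
    """Sample random crop boxes as (left, upper, right, lower) tuples.

    Each crop covers a random fraction (drawn from `scale`) of the image's
    width and height, so every crop is a partial region of the image.
    """
    rng = random.Random(seed)
    boxes = []
    for _ in range(num_crops):
        frac = rng.uniform(*scale)
        w = max(1, int(width * frac))
        h = max(1, int(height * frac))
        left = rng.randint(0, width - w)   # inclusive bounds keep the box inside
        upper = rng.randint(0, height - h)
        boxes.append((left, upper, left + w, upper + h))
    return boxes


def average_embeddings(embeddings):
    """Element-wise mean of equally sized embedding vectors (assumed aggregation)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


if __name__ == "__main__":
    for box in sample_crop_boxes(224, 224, num_crops=3, seed=0):
        print(box)
    print(average_embeddings([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

The box layout matches the `(left, upper, right, lower)` convention used by common image libraries, so each box could be passed to a crop call before encoding the region with an image encoder.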

[49] Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds cs.CV

Matthias Zeller, Vardeep S. Sandhu, Benedikt Mersch, Jens Behley, Michael Heidingsfeld

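TL;DR: This paper proposes Radar Velocity Transformer, a new transformer-based method for single-scan moving object segmentation in sparse radar point clouds, using Doppler velocity information to improve segmentation accuracy.

Details

Motivation: Self-driving vehicles must perceive surrounding moving objects in real time to ensure safety. LiDAR and cameras typically need to process temporal sequences to extract motion information, whereas radar directly provides Doppler velocity, a property that existing methods do not fully exploit.

Result: The network runs faster than the sensor frame rate and achieves better moving object segmentation than other methods using only single-scan radar data.

Insight: Radar Doppler velocity can be used directly for single-scan moving object segmentation, and transformer architectures perform well on sparse point cloud tasks.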

Abstract: The awareness about moving objects in the surroundings of a self-driving vehicle is essential for safe and reliable autonomous navigation. The interpretation of LiDAR and camera data achieves exceptional results but typically requires to accumulate and process temporal sequences of data in order to extract motion information. In contrast, radar sensors, which are already installed in most recent vehicles, can overcome this limitation as they directly provide the Doppler velocity of the detections and, hence incorporate instantaneous motion information within a single measurement. In this paper, we tackle the problem of moving object segmentation in noisy radar point clouds. We also consider differentiating parked from moving cars, to enhance scene understanding. Instead of exploiting temporal dependencies to identify moving objects, we develop a novel transformer-based approach to perform single-scan moving object segmentation in sparse radar scans accurately. The key to our Radar Velocity Transformer is to incorporate the valuable velocity information throughout each module of the network, thereby enabling the precise segmentation of moving and non-moving objects. Additionally, we propose a transformer-based upsampling, which enhances the performance by adaptively combining information and overcoming the limitation of interpolation of sparse point clouds. Finally, we create a new radar moving object segmentation benchmark based on the RadarScenes dataset and compare our approach to other state-of-the-art methods. Our network runs faster than the frame rate of the sensor and shows superior segmentation results using only single-scan radar data.

[50] Information-Bottleneck Driven Binary Neural Network for Change Detection cs.CV

Kaijie Yin, Zhiyuan Zhang, Shu Kong, Tian Gao, Chengzhong Xu

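TL;DR: This paper proposes BiCD, the first binary neural network designed specifically for change detection; by introducing an auxiliary objective based on the Information Bottleneck principle, it improves the representational power and feature separability of binary networks and significantly improves detection performance.

Details

Motivation: Conventional binarization methods directly quantize weights and activations in change detection models, which limits representational power and yields accuracy far below that of full-precision networks. BiCD is proposed to address this.

Result: BiCD performs well on street-view and remote sensing datasets, setting a new benchmark for change detection with binary neural networks and achieving state-of-the-art performance in this domain.

Insight: The performance bottleneck of binary networks can be substantially eased by introducing information-theoretic objectives and auxiliary modules, suggesting a new approach to binarization design in other tasks.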

Abstract: In this paper, we propose Binarized Change Detection (BiCD), the first binary neural network (BNN) designed specifically for change detection. Conventional network binarization approaches, which directly quantize both weights and activations in change detection models, severely limit the network’s ability to represent input data and distinguish between changed and unchanged regions. This results in significantly lower detection accuracy compared to real-valued networks. To overcome these challenges, BiCD enhances both the representational power and feature separability of BNNs, improving detection performance. Specifically, we introduce an auxiliary objective based on the Information Bottleneck (IB) principle, guiding the encoder to retain essential input information while promoting better feature discrimination. Since directly computing mutual information under the IB principle is intractable, we design a compact, learnable auxiliary module as an approximation target, leading to a simple yet effective optimization strategy that minimizes both reconstruction loss and standard change detection loss. Extensive experiments on street-view and remote sensing datasets demonstrate that BiCD establishes a new benchmark for BNN-based change detection, achieving state-of-the-art performance in this domain.

[51] Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding cs.CV | cs.AI

Namho Kim, Junhwa Kim

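TL;DR: The paper proposes a multimodal fusion framework based on GRUs and cross-modal attention for fine-grained video classification; it outperforms unimodal baselines on violence detection and emotion estimation tasks.

Details

Motivation: Fine-grained video classification requires analyzing spatio-temporal and semantic information together; a single modality often cannot capture complex video content, so multimodal fusion is needed.

Result: On the DVD (violence detection) and Aff-Wild2 (emotion estimation) datasets, the method significantly outperforms unimodal baselines, with cross-modal attention and feature augmentation contributing notably to the gains.

Insight: 1. Cross-modal attention effectively captures relationships between modalities; 2. Feature augmentation and autoencoding techniques can alleviate data scarcity and improve generalization.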

Abstract: Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained using a combination of classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance.

[52] PhenoBench: A Comprehensive Benchmark for Cell Phenotyping cs.CV

Jerome Luescher, Nora Koreuber, Jannik Franzen, Fabian H. Reith, Claudia Winklmayr

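TL;DR: This paper presents PhenoBench, a comprehensive benchmark for cell phenotyping that includes PhenoCell, a new H&E dataset, and evaluation code, revealing the limitations of existing foundation models on practical tasks.

Details

Motivation: Existing foundation models lack a unified performance evaluation for cell phenotyping, and their behavior on complex tasks in particular has not been adequately tested.

Result: Existing models achieve macro F1 scores as low as 0.20 on PhenoCell, far below their scores on earlier benchmarks (above 0.70 on Lizard and PanNuke), indicating a more challenging task.

Insight: Through generalization tests under technical and medical domain shifts, PhenoBench reveals the shortcomings of existing models on real clinical tasks and provides an important reference for future research.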

Abstract: Digital pathology has seen the advent of a wealth of foundational models (FM), yet to date their performance on cell phenotyping has not been benchmarked in a unified manner. We therefore propose PhenoBench: A comprehensive benchmark for cell phenotyping on Hematoxylin and Eosin (H&E) stained histopathology images. We provide both PhenoCell, a new H&E dataset featuring 14 granular cell types identified by using multiplexed imaging, and ready-to-use fine-tuning and benchmarking code that allows the systematic evaluation of multiple prominent pathology FMs in terms of dense cell phenotype predictions in different generalization scenarios. We perform extensive benchmarking of existing FMs, providing insights into their generalization behavior under technical vs. medical domain shifts. Furthermore, while FMs achieve macro F1 scores > 0.70 on previously established benchmarks such as Lizard and PanNuke, on PhenoCell, we observe scores as low as 0.20. This indicates a much more challenging task not captured by previous benchmarks, establishing PhenoCell as a prime asset for future benchmarking of FMs and supervised models alike. Code and data are available on GitHub.
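
The scores quoted above are macro F1 values, which weight every cell type equally regardless of how often it occurs. A minimal sketch of that metric follows; the function name `macro_f1` and the convention of averaging over every class seen in either the labels or the predictions are assumptions for illustration, and the benchmark's own evaluation code may differ in such details.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("empty input")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    y_true = ["a", "a", "b", "b", "c"]
    y_pred = ["a", "b", "b", "b", "c"]
    # Per-class F1: a = 2/3, b = 4/5, c = 1, so the macro average is about 0.822.
    print(macro_f1(y_true, y_pred))
```

Because each class contributes equally, a model that fails on a few rare cell types can score much lower on macro F1 than on plain accuracy.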

[53] CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation cs.CV

Elena Bueno-Benito, Mariella Dimiccoli

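TL;DR: This paper proposes CLOT, a new optimal transport (OT)-based framework for unsupervised action segmentation that improves segmentation through a multi-level cyclic feature learning mechanism.

Details

Motivation: Although the existing method ASOT performs well in unsupervised action segmentation, it lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations.

Result: Experiments on four benchmark datasets confirm the benefits of cyclic learning for unsupervised action segmentation.

Insight: By introducing segment-level supervision and a cross-attention mechanism, CLOT achieves better segmentation performance, demonstrating the effectiveness of multi-level cyclic learning in unsupervised tasks.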

Abstract: Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions on the action ordering, and it is able to decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework that introduces a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
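
For readers unfamiliar with how an OT problem turns a frame-to-action cost matrix into soft pseudo-labels, the sketch below runs plain entropic OT (Sinkhorn iterations) with uniform marginals. This is a generic textbook formulation chosen for illustration, not the specific objective used by ASOT or CLOT; the function name `sinkhorn`, the `epsilon` value, and the toy cost matrix are assumptions.

```python
import math


def sinkhorn(cost, epsilon=0.1, num_iters=200):
    """Entropic optimal transport with uniform marginals.

    cost[i][j] is the cost of assigning frame i to action j. Returns a
    transport plan whose rows can be read as soft pseudo-labels per frame.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over frames
    b = [1.0 / m] * m  # uniform mass over actions
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Four frames, two actions: early frames are cheap for action 0, late ones for action 1.
    cost = [[0.0, 1.0], [0.1, 0.9], [0.9, 0.1], [1.0, 0.0]]
    plan = sinkhorn(cost)
    pseudo_labels = [max(range(len(row)), key=row.__getitem__) for row in plan]
    print(pseudo_labels)  # [0, 0, 1, 1]
```

The uniform column marginal forces every action to receive the same total mass, which is one reason methods in this area often relax or modify the balanced formulation.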

[54] Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition cs.CV | cs.AI

Redwan Sony, Parisa Farmanifard, Arun Ross, Anil K. Jain

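TL;DR: This paper compares generic foundation models (such as CLIP and BLIP) with domain-specific face recognition models (such as AdaFace and ArcFace) and finds that domain-specific models outperform zero-shot foundation models, while foundation models improve when given contextual information. The main contributions are model fusion and improved explainability.

Details

Motivation: To study the performance gap between generic foundation models and domain-specific models on face recognition, and to explore fusion and explainability methods that improve task performance.

Result: Domain-specific models outperform zero-shot foundation models, but foundation models improve with larger face crops; score-level fusion significantly improves the true match rate at low false match rates; foundation models can provide explainability support.

Insight: Fusing domain-specific models with foundation models can improve task performance, and foundation models can supply context and explanations for decisions, compensating for the limitations of domain-specific models.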

Abstract: In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVa, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we are able to report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improves on over-segmented face images than tightly cropped faces thereby suggesting the importance of contextual clues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250 while the TMR of domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., “Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour...”). In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., “Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity…and the pair is likely of the same person”), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.
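
Finding (c) refers to score-level fusion, where the similarity scores of two matchers are combined for each image pair. The abstract does not state the fusion rule, so the sketch below uses one common choice as an assumption: min-max normalization of each matcher's scores followed by a weighted sum. The names `min_max_normalize`, `fuse_scores`, and `weight_a` are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; constant inputs map to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted-sum fusion of two matchers' scores for the same image pairs.

    scores_a[i] and scores_b[i] must refer to the same comparison. Each list
    is normalized separately so that differently scaled matchers are comparable.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    norm_a = min_max_normalize(scores_a)
    norm_b = min_max_normalize(scores_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(norm_a, norm_b)]


if __name__ == "__main__":
    face_model_scores = [0.21, 0.85, 0.40]       # e.g. cosine similarities
    foundation_model_scores = [0.70, 0.95, 0.30]
    fused = fuse_scores(face_model_scores, foundation_model_scores, weight_a=0.6)
    print([round(s, 3) for s in fused])
```

In practice the normalization statistics and the weight would be set on a held-out set, and a decision threshold on the fused score would be chosen to hit the target false match rate.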

[55] [Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor](https://arxiv.org/abs/2507.03542) cs.CV

Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun

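TL;DR: The paper analyzes the quality of visual descriptors and proposes metrics beyond accuracy to reveal what makes a ‘good’ visual descriptor.

Details

Motivation: Visual descriptors are widely used in visual concept discovery and image classification, but their effectiveness depends on many factors, and a systematic evaluation method is currently lacking.

Result: The new metrics reveal how different descriptor generation strategies interact with foundation model properties, offering a new perspective for studying descriptor effectiveness.

Insight: Descriptor quality depends not only on accuracy but also on factors such as semantic clarity and alignment with pre-training data. These findings provide theoretical support for optimizing the inputs to vision-language models.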

Abstract: Text-based visual descriptors, ranging from simple class names to more descriptive phrases, are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM’s pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics, Global Alignment and CLIP Similarity, that move beyond accuracy. These metrics allow us to shed light on how different descriptor generation strategies interact with foundation model properties, offering insights into ways of studying descriptor effectiveness beyond accuracy evaluations.

[56] SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications cs.CV | cs.AI | cs.LGPDF

Yana Hasson, Pauline Luc, Liliane Momeni, Maks Ovsjanikov, Guillaume Le Moing

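TL;DR: This paper introduces the SciVid benchmark, which evaluates video foundation models (ViFMs) on tasks across scientific domains, showing their potential as general-purpose, domain-agnostic approaches and revealing the limitations of existing models.

Details

Motivation: Many spatiotemporal foundation models have emerged in scientific fields, but they are usually domain-specific and evaluated only within a narrow scope. This paper asks whether video foundation models can transfer knowledge effectively across scientific domains and compete with domain-specific methods.

Result: Experiments show that ViFMs achieve SOTA results in several scientific applications, confirming that cross-domain knowledge transfer is effective, while also revealing the limitations of existing models on specific tasks.

Insight: 1. Video foundation models have potential as general-purpose tools in scientific domains, but their generality needs further improvement. 2. Simple readout modules are enough for effective transfer, offering a lightweight solution for cross-domain applications.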

Abstract: In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five Scientific Video tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.

[57] Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation cs.CV | cs.AI | cs.CLPDF

Tao Tang, Shijie Xu, Yiting Wu, Zhixiang Lu

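TL;DR: Causal-SAM-LLM is a new framework that uses large language models (LLMs) as causal reasoners to improve the generalization of medical image segmentation. Two innovations, Linguistic Adversarial Disentanglement and Test-Time Causal Intervention, significantly improve robustness on unseen domains.

Details

Motivation: Medical image segmentation models perform poorly on unseen domains, mainly because they rely on spurious correlations between anatomical content and domain-specific imaging styles. Causal-SAM-LLM aims to address this challenge by improving generalization through causal reasoning.

Result: Under cross-scanner, cross-modality, and cross-anatomy settings, Causal-SAM-LLM significantly outperforms the baselines, improving the Dice score by up to 6.2 points and reducing the Hausdorff distance by 15.8 mm, while using less than 9% of the full model's trainable parameters.

Insight: This work offers a new approach to building robust, efficient, and interactively controllable medical AI systems and demonstrates the potential of LLMs in causal reasoning.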

Abstract: The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model’s features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician’s natural language command to modulate the segmentation decoder’s features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model’s trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.

[58] From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis cs.CV | cs.AI | cs.LGPDF

Amir Hojjati, Lu Li, Ibrahim Hameed, Anis Yazidi, Pedro G. Lind

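TL;DR: The paper proposes EEG-VJEPA, a new method that applies the Video Joint Embedding Predictive Architecture (V-JEPA) to EEG classification. By treating EEG signals as video-like sequences, it learns spatiotemporal representations and outperforms existing methods in classification accuracy and interpretability.

Details

Motivation: EEG signal analysis is hampered by limited labeled data, high dimensionality, and a lack of scalable models. Existing self-supervised learning methods usually focus on only spatial or only temporal features, which leads to suboptimal representations.

Result: EEG-VJEPA achieves higher classification accuracy than existing SOTA methods on the TUH dataset and captures physiologically relevant spatiotemporal patterns.

Insight: EEG-VJEPA not only improves classification performance but also provides interpretable embeddings, which may support human-AI collaboration in clinical diagnosis.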

Abstract: EEG signals capture brain activity with high temporal and low spatial resolution, supporting applications such as neurological diagnosis, cognitive monitoring, and brain-computer interfaces. However, effective analysis is hindered by limited labeled data, high dimensionality, and the absence of scalable models that fully capture spatiotemporal dependencies. Existing self-supervised learning (SSL) methods often focus on either spatial or temporal features, leading to suboptimal representations. To this end, we propose EEG-VJEPA, a novel adaptation of the Video Joint Embedding Predictive Architecture (V-JEPA) for EEG classification. By treating EEG as video-like sequences, EEG-VJEPA learns semantically meaningful spatiotemporal representations using joint embeddings and adaptive masking. To our knowledge, this is the first work that exploits V-JEPA for EEG classification and explores the visual concepts learned by the model. Evaluations on the publicly available Temple University Hospital (TUH) Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification accuracy. Beyond classification accuracy, EEG-VJEPA captures physiologically relevant spatial and temporal signal patterns, offering interpretable embeddings that may support human-AI collaboration in diagnostic workflows. These findings position EEG-VJEPA as a promising framework for scalable, trustworthy EEG analysis in real-world clinical settings.

[59] Dynamic Multimodal Prototype Learning in Vision-Language Models cs.CVPDF

Xingyu Zhu, Shuo Wang, Beier Zhu, Miaoge Li, Yunfan Li

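TL;DR: The paper proposes ProtoMM, a training-free framework that improves test-time adaptation of vision-language models through multimodal prototype learning, significantly improving performance on zero-shot tasks.

Details

Motivation: Existing vision-language models rely mainly on textual prototypes for test-time adaptation (TTA) and overlook the semantic ambiguity in class names, so the prototypes cannot fully capture visual concepts, which limits performance.

Result: On ImageNet and its variant datasets, ProtoMM improves average accuracy by 1.03% over state-of-the-art methods.

Insight: Dynamically updating multimodal prototypes lets the model keep learning from data and generalize better to unseen scenarios, offering a new approach to test-time adaptation of vision-language models.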

Abstract: With the increasing attention to pre-trained vision-language models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially in test-time adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows. This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.

[60] On the rankability of visual embeddings cs.CVPDF

Ankit Sonthalia, Arnas Uselis, Seong Joon Oh

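TL;DR: The paper studies whether visual embedding models capture continuous, ordinal attributes along linear directions (called “rank axes”). It finds that many embedding models are naturally rankable for these attributes and that a small number of samples is enough to recover meaningful rank axes.

Details

Motivation: To investigate whether visual embedding models naturally capture ordinal attributes (such as age and aesthetics) along linear directions, so that images can be ranked without full supervision.

Result: Many embedding models are inherently rankable, and a small number of samples (or even just two extreme examples) often suffices to recover meaningful rank axes.

Insight: The findings open new possibilities for image ranking tasks and suggest new research directions for the learning and structure of embedding models.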

Abstract: We study whether visual embedding models capture continuous, ordinal attributes along linear directions, which we term rank axes. We define a model as rankable for an attribute if projecting embeddings onto such an axis preserves the attribute’s order. Across 7 popular encoders and 9 datasets with attributes like age, crowd count, head pose, aesthetics, and recency, we find that many embeddings are inherently rankable. Surprisingly, a small number of samples, or even just two extreme examples, often suffice to recover meaningful rank axes, without full-scale supervision. These findings open up new use cases for image ranking in vector databases and motivate further study into the structure and learning of rankable embeddings. Our code is available at https://github.com/aktsonthalia/rankable-vision-embeddings.
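
The idea of recovering a rank axis from two extreme examples can be sketched in a few lines. This is an illustration only: the abstract does not give the procedure, so taking the axis as the normalized difference between the two extreme embeddings is an assumption, and the 3-D embeddings are hypothetical toy values rather than encoder outputs.

```python
def rank_axis(low_embedding, high_embedding):
    """Unit vector pointing from the low-attribute example to the high one."""
    diff = [h - l for l, h in zip(low_embedding, high_embedding)]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0:
        raise ValueError("the two extreme examples must differ")
    return [x / norm for x in diff]


def rank_by_axis(embeddings, axis):
    """Return item indices sorted by their projection onto the axis."""
    projections = [sum(e * a for e, a in zip(emb, axis)) for emb in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: projections[i])


# Toy 3-D embeddings (hypothetical); the attribute grows along the first coordinate.
embeddings = [
    [0.9, 0.1, 0.3],  # item 0
    [0.1, 0.2, 0.3],  # item 1
    [0.5, 0.0, 0.4],  # item 2
    [0.3, 0.3, 0.2],  # item 3
]
axis = rank_axis(embeddings[1], embeddings[0])  # item 1 = lowest, item 0 = highest
order = rank_by_axis(embeddings, axis)
```

Under the paper's definition, a model would count as rankable for the attribute if the resulting `order` agrees with the true attribute order of the items.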

[61] SAMed-2: Selective Memory Enhanced Medical Segment Anything Model cs.CVPDF

Zhiling Yan, Sifan Song, Dingjie Song, Yiwei Li, Rong Zhou

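TL;DR: SAMed-2 is a foundation model for medical image segmentation built on the SAM-2 architecture. By introducing a temporal adapter and a confidence-driven memory mechanism, it addresses noise in medical data and forgetting in continual learning.

Details

Motivation: Medical image segmentation faces challenges including complex data, noisy annotations, and continual learning across modalities, so directly transferring general-purpose segmentation models works poorly.

Result: On 10 external datasets and in multi-task scenarios, SAMed-2 outperforms existing baselines.

Insight: A memory mechanism combined with cross-modality adaptation enables more robust continual learning in medical image segmentation.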

Abstract: Recent “segment anything” efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: https://github.com/ZhilingYan/Medical-SAM-Bench.

[62] Sign Spotting Disambiguation using Large Language Models cs.CV | cs.AIPDF

JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

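TL;DR: The paper proposes a training-free framework built on large language models (LLMs): dictionary matching with dynamic time warping and cosine similarity is followed by context-aware disambiguation by an LLM, significantly improving sign spotting quality.

Details

Motivation: Sign spotting in continuous sign language video suffers from vocabulary inflexibility and ambiguity. A method that requires no retraining is needed to improve spotting quality and address data scarcity.

Result: Experiments show that the method outperforms traditional approaches on synthetic and real-world sign language datasets, with higher accuracy and sentence fluency.

Insight: The study highlights the potential of LLMs for sign spotting: they can significantly improve performance without fine-tuning, offering a new path for scaling sign language translation.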

Abstract: Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method’s superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
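
The dictionary-matching step can be illustrated with a small sketch of dynamic time warping (DTW) that uses cosine distance between frames. How the paper combines DTW with cosine similarity is not stated in the abstract, so this pairing, the function names, and the 2-D toy features are assumptions for illustration, and the LLM disambiguation stage is not covered.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def dtw_cost(query, template):
    """Classic DTW alignment cost using (1 - cosine similarity) as frame distance."""
    n, m = len(query), len(template)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - cosine_similarity(query[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(query, dictionary):
    """Return the dictionary gloss whose template has the lowest DTW cost."""
    return min(dictionary, key=lambda gloss: dtw_cost(query, dictionary[gloss]))


# Hypothetical 2-D frame features; a real system would use learned features.
dictionary = {
    "HELLO": [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
    "THANKS": [[0.0, 1.0], [-1.0, 1.0], [-1.0, 0.0]],
}
# Same motion as "HELLO" but performed more slowly (repeated frames).
query = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
match = best_match(query, dictionary)
```

Because matching is done against a dictionary, adding a new sign only requires adding a new template, which is the vocabulary flexibility the abstract describes.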

[63] Computationally efficient non-Intrusive pre-impact fall detection system cs.CVPDF

Praveen Jesudhas, Raghuveera T, Shiney Jeyaraj

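TL;DR: The paper proposes a non-intrusive, computationally efficient pre-impact fall detection system that uses video data and simple neural network models, significantly reducing computational cost while retaining high accuracy.

Details

Motivation: Existing pre-impact fall detection systems are accurate but are either intrusive or computationally demanding, which limits widespread use. A non-intrusive and computationally efficient solution is therefore needed.

Result: The system requires about 18 times less computation than existing models at an accuracy of 88%, making it suitable for large-scale deployment in industrial and residential safety.

Insight: Non-intrusive design and computational efficiency are key to wide adoption of pre-impact fall detection systems. Simplifying the model and the feature extraction can balance performance and cost.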

Abstract: Existing pre-impact fall detection systems have high accuracy; however, they are either intrusive to the subject or require heavy computational resources for fall detection, resulting in prohibitive deployment costs. These factors limit the global adoption of existing fall detection systems. In this work we present a pre-impact fall detection system that is both non-intrusive and computationally efficient at deployment. Our system utilizes video data of the locality available through cameras, thereby requiring no specialized equipment to be worn by the subject. Further, the fall detection system utilizes minimal fall-specific features and simple neural network models, designed to reduce the computational cost of the system. A minimal set of fall-specific features is derived from the skeletal data after observing the relative position of the human skeleton during a fall. These features are shown to have different distributions for fall and non-fall scenarios, demonstrating their discriminative capability. A Long Short-Term Memory (LSTM) based network is selected, and the network architecture and training parameters are designed after evaluating performance on standard datasets. The computation requirement of the pre-impact fall detection system is about 18 times lower than that of existing modules, with a comparable accuracy of 88%. Given the low computation requirements and high accuracy levels, the proposed system is suitable for wider adoption in engineering systems related to industrial and residential safety.

[64] Less is More: Empowering GUI Agent with Context-Aware Simplification cs.CV | cs.AI | cs.HC | cs.LGPDF

Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou

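TL;DR: The paper proposes SimpAgent, a context-aware simplification framework that improves the performance and efficiency of GUI agents through element pruning and history compression.

Details

Motivation: Current vision-based GUI agents neglect the challenges of context modeling and suffer from interference by unrelated elements and redundant history.

Result: SimpAgent reduces FLOPs by 27% and performs strongly across a range of GUI navigation tasks.

Insight: Context simplification is key to improving GUI agent efficiency, and masking and consistency guidance are effective means of optimization.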

Abstract: The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agents and summarize: 1) the high density and loose relations of element context highlight the existence of many unrelated elements and their negative influence; 2) the high redundancy of history context reveals the inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI Agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents the intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With the above components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.

[65] Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps cs.CVPDF

Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, Hao Wang

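TL;DR: The paper proposes S3PO-GS, an RGB-only outdoor 3D Gaussian Splatting SLAM method that addresses the lack of geometric priors and the scale drift of existing methods, achieving high-precision tracking and scene reconstruction through a self-consistent tracking module and a patch-based dynamic mapping module.

Details

Motivation: Existing 3DGS SLAM methods either lack geometric priors in outdoor scenes or introduce separate tracking modules that cause scale drift, so a more robust solution is needed.

Result: On the Waymo, KITTI, and DL3DV datasets, S3PO-GS achieves SOTA performance in novel view synthesis and tracking accuracy.

Insight: Combining geometric priors with global scale consistency lets 3DGS achieve more robust SLAM performance in complex outdoor scenes.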

Abstract: 3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: https://3dagentworld.github.io/S3PO-GS/.

[66] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays cs.CVPDF

Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi

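TL;DR: This paper presents ChestGPT, a deep-learning framework that combines the EVA ViT with the Llama 2 LLM to classify and localize diseases in chest X-rays, fusing vision and language models to improve diagnostic efficiency and accuracy.

Details

Motivation: Global demand for radiologists is growing rapidly while supply falls short, creating an urgent need for technological assistance. Advances in computer vision and language models make it possible to close this gap.

Result: On the VinDr-CXR dataset, the method achieves an F1 score of 0.76 and successfully generates bounding boxes around regions of interest.

Insight: Combining vision and language models can significantly improve the efficiency and accuracy of medical image analysis and give radiologists a valuable assistive tool.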

Abstract: The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists’ capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists’ workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.

[67] StreamDiT: Real-Time Streaming Text-to-Video Generation cs.CV | cs.AI | cs.LG | eess.IVPDF

Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao

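TL;DR: StreamDiT is a real-time streaming text-to-video generation model. It trains with flow matching and a moving buffer, uses varying time embeddings and window attention, and achieves high-quality real-time video stream generation, with multistep distillation further improving performance.

Details

Motivation: Existing text-to-video models can usually only generate short videos offline, which limits their use in interactive and real-time applications. StreamDiT aims to solve this by generating video streams in real time.

Result: StreamDiT achieves real-time generation at 16 FPS on a single GPU (512p resolution), and its performance is validated with quantitative metrics and human evaluation.

Insight: Streaming training and multistep distillation enable high-quality real-time video generation, opening new possibilities for interactive applications and video stream generation.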

Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To put the proposed method into practice, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g., streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website.

[68] FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed cs.CV | cs.AI | cs.LGPDF

Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero

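TL;DR: The paper proposes FastDINOv2, a pre-training strategy based on a frequency filtering curriculum and Gaussian noise patching augmentation that significantly improves the training speed and robustness of DINOv2.

Details

Motivation: Large-scale vision foundation models such as DINOv2 require costly compute and are hard to reproduce quickly. The paper aims to reduce compute cost while strengthening robustness through an efficient pre-training strategy.

Result: Pre-training time is reduced by 1.6x and FLOPs by 2.25x, while robustness on ImageNet-C matches the baseline and linear probing performance remains competitive.

Insight: Data curriculum design and augmentation strategies are effective ways to improve the efficiency and robustness of self-supervised learning models.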

Abstract: Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning, which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum, with low-frequency content seen first, and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models' robustness. The code is available at https://github.com/KevinZ0217/fast_dinov2

[69] Zero Memory Overhead Approach for Protecting Vision Transformer Parameters cs.CVPDF

Fereshteh Baradaran, Mohsen Raji, Azadeh Baradaran, Arezoo Baradaran, Reihaneh Akbarifard

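TL;DR: The paper proposes a zero-memory-overhead method that exploits the fact that the least significant bit (LSB) of ViT parameters is not critical: the LSB is replaced with a parity bit for fault detection, and affected parameters are set to zero when a fault is detected, improving model reliability.

Details

Motivation: As ViTs become common in safety-critical applications such as autonomous driving, bit-flip faults in parameters stored in memory can cause functional errors, so an efficient fault tolerance technique is needed to ensure reliability.

Result: The method improves the robustness of ViT models to bit-flip faults by up to three orders of magnitude without introducing any extra memory overhead.

Insight: The LSBs of ViT parameters have little effect on model performance. Exploiting this property enables efficient fault detection and tolerance and provides a zero-overhead solution for safety-critical applications.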

Abstract: Vision Transformers (ViTs) have demonstrated superior performance over Convolutional Neural Networks (CNNs) in various vision-related tasks such as classification, object detection, and segmentation due to their use of self-attention mechanisms. As ViTs become more popular in safety-critical applications like autonomous driving, ensuring their correct functionality becomes essential, especially in the presence of bit-flip faults in their parameters stored in memory. In this paper, a fault tolerance technique is introduced to protect ViT parameters against bit-flip faults with zero memory overhead. Since the least significant bits of parameters are not critical for model accuracy, replacing the LSB with a parity bit provides an error detection mechanism without imposing any overhead on the model. When faults are detected, affected parameters are masked by zeroing out, as most parameters in ViT models are near zero, effectively preventing accuracy degradation. This approach enhances reliability across ViT models, improving the robustness of parameters to bit-flips by up to three orders of magnitude, making it an effective zero-overhead solution for fault tolerance in critical applications.
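
The parity-in-LSB scheme described above can be sketched for a single parameter. The abstract does not specify the number format or parity convention, so the 32-bit IEEE 754 floats, the even-parity choice, and the helper names below are assumptions made for illustration.

```python
import struct


def float_to_bits(x):
    """Bit pattern of x as a 32-bit IEEE 754 float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def parity(value):
    """Return 1 if value has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


def protect(x):
    """Overwrite the LSB with the parity of the upper 31 bits (even parity overall)."""
    upper = float_to_bits(x) & 0xFFFFFFFE
    return upper | parity(upper)


def recover(word):
    """Return the stored parameter, or 0.0 if the parity check fails."""
    if parity(word) != 0:
        return 0.0  # odd parity: fault detected, mask the parameter
    return bits_to_float(word)


weight = 0.0123
stored = protect(weight)
clean = recover(stored)          # differs from weight by at most the LSB
faulty = stored ^ (1 << 30)      # simulate a bit flip in the top exponent bit
masked = recover(faulty)
```

In this sketch the unprotected faulty word would decode to a value on the order of 1e36, while the parity check replaces it with zero. A single parity bit only detects an odd number of flipped bits per parameter.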

[70] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition cs.CV | cs.ROPDF

Jiuhong Xiao, Yang Zhou, Giuseppe Loianno

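TL;DR: This paper proposes Query-based Adaptive Aggregation (QAA) for multi-dataset joint training in visual place recognition (VPR). QAA improves cross-dataset generalization by enhancing the information capacity of the feature aggregation layer.

Details

Motivation: Existing VPR methods are usually trained on a single dataset, which makes models prone to overfitting and limits generalization. Multi-dataset joint training helps build universal models, but differences among datasets can saturate the information capacity of conventional feature aggregation layers and hurt performance.

Result: Experiments show that QAA achieves balanced generalization across datasets and outperforms existing methods while maintaining peak performance comparable to dataset-specific models. Ablation studies examine QAA's mechanisms and scalability.

Insight: QAA's success suggests that enhancing the information capacity of the aggregation layer can effectively relieve performance saturation in multi-dataset joint training, offering a new approach to building universal VPR models.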

Abstract: Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA’s mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Code will be publicly released.

[71] Hierarchical Semantic-Visual Fusion of Visible and Near-infrared Images for Long-range Haze Removal cs.CV | cs.AIPDF

Yi Li, Xiaoxiong Wang, Jiawei Wang, Yi Chang, Kai Cao

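TL;DR: The paper proposes a Hierarchical Semantic-Visual Fusion (HSVF) framework that combines visible and near-infrared images for long-range haze removal, addressing the residual haze that existing methods leave in distant scenes.

Details

Motivation: Existing dehazing methods mainly target short-range scenes. At long range, severe scattering causes signal loss, and details are hard to recover from visible images alone. Near-infrared images penetrate haze better, but existing fusion methods neglect the haze embedded in visible images.

Result: Experiments show that HSVF outperforms existing methods on real-world long-range dehazing and produces high-contrast, texture-rich results.

Insight: Semantic information plays an important role in long-range haze removal and can serve as a prior that improves restoration, while multimodal fusion (visible plus near-infrared) supplies complementary detail and robustness.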

Abstract: While image dehazing has advanced substantially in the past decade, most efforts have focused on short-range scenarios, leaving long-range haze removal under-explored. As distance increases, intensified scattering leads to severe haze and signal loss, making it impractical to recover distant details solely from visible images. Near-infrared, with superior fog penetration, offers critical complementary cues through multimodal fusion. However, existing methods focus on content integration while often neglecting haze embedded in visible images, leading to results with residual haze. In this work, we argue that the infrared and visible modalities not only provide complementary low-level visual features, but also share high-level semantic consistency. Motivated by this, we propose a Hierarchical Semantic-Visual Fusion (HSVF) framework, comprising a semantic stream to reconstruct haze-free scenes and a visual stream to incorporate structural details from the near-infrared modality. The semantic stream first acquires haze-robust semantic prediction by aligning modality-invariant intrinsic representations. Then the shared semantics act as strong priors to restore clear and high-contrast distant scenes under severe haze degradation. In parallel, the visual stream focuses on recovering lost structural details from near-infrared by fusing complementary cues from both visible and near-infrared images. Through the cooperation of dual streams, HSVF produces results that exhibit both high-contrast scenes and rich texture details. Moreover, we introduce a novel pixel-aligned visible-infrared haze dataset with semantic labels to facilitate benchmarking. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches in real-world long-range haze removal.

[72] Deconfounding Causal Inference through Two-Branch Framework with Early-Forking for Sensor-Based Cross-Domain Activity Recognition cs.CVPDF

Di Xiong, Lei Zhang, Shuoyuan Wang, Dongzhou Cheng, Wenbo Huang

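TL;DR: This paper proposes a causality-inspired two-branch framework that addresses distribution shift in cross-domain activity recognition by disentangling causal and non-causal features in sensor data.

Details

Motivation: Existing domain-generalization methods focus mainly on statistical dependence and neglect the underlying causal mechanism. The paper aims to improve cross-domain activity recognition by disentangling causal features.

Result: On several public HAR benchmarks, the method significantly outperforms eleven related baselines under cross-person, cross-dataset, and cross-position settings.

Insight: Disentangling causal features is key to cross-domain activity recognition, and explicitly modeling causal mechanisms can significantly improve generalization.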

Abstract: Recently, domain generalization (DG) has emerged as a promising solution to mitigate the distribution-shift issue in sensor-based human activity recognition (HAR) scenarios. However, most existing DG-based works have merely focused on modeling statistical dependence between sensor data and activity labels, neglecting the importance of the intrinsic causal mechanism. Intuitively, every sensor input can be viewed as a mixture of causal (category-aware) and non-causal (domain-specific) factors, where only the former affects activity classification judgment. In this paper, by casting such DG-based HAR as a causal inference problem, we propose a causality-inspired representation learning algorithm for cross-domain activity recognition. To this end, an early-forking two-branch framework is designed, where two separate branches are respectively responsible for learning causal and non-causal features, while an independence-based Hilbert-Schmidt Information Criterion is employed to implicitly disentangle them. Additionally, an inhomogeneous domain sampling strategy is designed to enhance disentanglement, while a category-aware domain perturbation layer is performed to prevent representation collapse. Extensive experiments on several public HAR benchmarks demonstrate that our causality-inspired approach significantly outperforms eleven related state-of-the-art baselines under cross-person, cross-dataset, and cross-position settings. Detailed ablation and visualization analyses reveal the underlying causal mechanism, indicating its effectiveness, efficiency, and universality in cross-domain activity recognition scenarios.
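
The independence measure named in the abstract appears to correspond to what is usually called the Hilbert-Schmidt Independence Criterion (HSIC); that reading is an assumption. The sketch below computes the standard biased empirical HSIC for scalar samples with Gaussian kernels. It only illustrates the statistic, not the paper's training loss, and the kernel width and toy data are hypothetical.

```python
import math


def gaussian_kernel_matrix(xs, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(x_i - x_j)^2 / (2 sigma^2)) for scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def center(k):
    """Double-center a square matrix: H K H with H = I - (1/n) 11^T."""
    n = len(k)
    row_means = [sum(row) / n for row in k]
    col_means = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[k[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
            for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    kc = center(gaussian_kernel_matrix(xs, sigma))
    lc = center(gaussian_kernel_matrix(ys, sigma))
    # trace(Kc @ Lc) written out as a double sum over matrix entries
    trace = sum(kc[i][j] * lc[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
dependent = [x * x for x in xs]      # fully determined by xs
constant = [1.0] * len(xs)           # carries no information about xs
hsic_dependent = hsic(xs, dependent)
hsic_constant = hsic(xs, constant)
```

A value near zero indicates little measured dependence, so minimizing such a statistic between two feature branches is one way to push them toward independence.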

[73] Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection cs.CV | 68T10 | I.4; I.5; J.6PDF

Hanzhe Liang, Jie Zhang, Tao Dai, Linlin Shen, Jinbao Wang

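TL;DR: The paper proposes a Down-Up Sampling Network (DUS-Net) that reconstructs high-precision point clouds by preserving the geometric structure of group centers, addressing the challenge that large-scale, complex structures pose for 3D anomaly detection.

Details

Motivation: Existing reconstruction-based methods struggle with high-precision point clouds, especially when the structure is large-scale and complex. The study therefore aims to improve 3D anomaly detection by preserving geometric structure.

Result: On the Real3D-AD and Anomaly-ShapeNet datasets, the method reaches SOTA performance with an object-level AUROC of 79.9% and 79.5% and a point-level AUROC of 71.2% and 84.7%, respectively.

Insight: By preserving the geometric structure of group centers, DUS-Net performs well in 3D anomaly detection and is more robust on complex point clouds.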

Abstract: Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network (Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.

Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li

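TL;DR: EchoMimicV3 is a unified multi-task, multi-modal human animation model that achieves efficient, high-quality generation with 1.3B parameters, addressing high computational cost and task fragmentation.

Details

Motivation: Current large-scale video generation models are computationally expensive, and different tasks require different models. EchoMimicV3 aims for efficient, general-purpose human animation.

Result: With 1.3B parameters, EchoMimicV3 reaches generation quality comparable to models with 10 times as many parameters.

Insight: With task unification and multi-modal fusion, a small model can also generate efficiently and at high quality.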

Abstract: Human animation has recently advanced rapidly, achieving increasingly realistic and vivid results, especially with the integration of large-scale video generation models. However, the slow inference speed and high computational cost of these large models bring significant challenges for practical applications. Additionally, various tasks in human animation, such as lip-syncing, audio-driven full-body animation, and video generation from start and end frames, often require different specialized models. The introduction of large video models has not alleviated this dilemma. This raises an important question: Can we make human animation Faster, Higher in quality, Stronger in generalization, and make various tasks Together in one model? To address this, we dive into video generation models and discover that the devil lies in the details: Inspired by MAE, we propose a novel unified Multi-Task paradigm for human animation, treating diverse generation tasks as spatial-temporal local reconstructions, requiring modifications only on the input side; Given the interplay and division among multi-modal conditions including text, image, and audio, we introduce a multi-modal decoupled cross-attention module to fuse multiple modalities in a divide-and-conquer manner; We propose a new SFT+Reward alternating training paradigm, enabling the minimal model with 1.3B parameters to achieve generation quality comparable to models with 10 times the parameter count. Through these innovations, our work paves the way for efficient, high-quality, and versatile digital human generation, addressing both performance and practicality challenges in the field. Extensive experiments demonstrate that EchoMimicV3 outperforms existing models in both facial and semi-body video generation, providing precise text-based control for creating videos in a wide range of scenarios.

[75] Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs cs.CVPDF

Haifeng Zhao, Yufei Zhang, Leilei Ma, Shuo Xu, Dengdi Sun

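TL;DR: The paper proposes OTDRG, an optimal transport (OT)-based framework for radiology report generation that aligns image features with disease labels to address the limited clinical efficacy of general-purpose large language models (LLMs).

Details

Motivation: In radiology report generation, general-purpose LLMs favor linguistic fluency over clinical efficacy and struggle to capture the relationship between images and text, which limits their clinical utility.

Result: On the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance on both natural language generation (NLG) and clinical efficacy (CE) metrics.

Insight: Optimal transport effectively narrows the cross-modal gap between images and text, and combining it with a disease prediction module can significantly improve practical utility in the medical domain.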

Abstract: Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency than on clinical effectiveness, and lack the ability to effectively capture the relationship between X-ray images and their corresponding texts, thus resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. The core component of OTDRG is Alignment & Fine-Tuning, where OT uses the encoded label features and image visual features to minimize cross-modal distances and then integrates image and text features for LLM fine-tuning. Additionally, we design a novel disease prediction module to predict disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.
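
To illustrate what aligning two feature sets with optimal transport can look like, here is a minimal sketch of entropy-regularized OT solved with Sinkhorn iterations. It is not OTDRG's actual procedure: the cosine cost, uniform marginals, the regularization strength `eps`, and the random `image_feats` and `label_feats` arrays are all assumptions made for the example.

```python
import numpy as np


def cosine_cost(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cost 1 - cosine similarity between rows of a and rows of b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T


def sinkhorn(cost, row_mass, col_mass, eps=0.5, n_iters=1000):
    """Entropic OT plan via alternating row/column scaling (Sinkhorn)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)      # match row marginals
        v = col_mass / (kernel.T @ u)    # match column marginals
    return u[:, None] * kernel * v[None, :]


# Hypothetical features: 6 image tokens and 4 disease-label embeddings.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(6, 16))
label_feats = rng.normal(size=(4, 16))

cost = cosine_cost(image_feats, label_feats)
row_mass = np.full(6, 1.0 / 6.0)
col_mass = np.full(4, 1.0 / 4.0)
plan = sinkhorn(cost, row_mass, col_mass)
ot_distance = float(np.sum(plan * cost))  # transport cost under the plan
print(f"OT alignment cost: {ot_distance:.4f}")
```

In a training setup of this kind, `ot_distance` would typically serve as a differentiable alignment loss; the abstract does not specify which solver or cost function the paper uses.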

[76] Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation cs.CV | cs.AIPDF

Ha-Hieu Pham, Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Ulas Bagci, Min Xu

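TL;DR: The paper proposes CSDS, a semi-supervised histopathology image segmentation framework that improves segmentation with few labels by disentangling stain and structural representations.

Details

Motivation: Variability in H&E staining and diversity in tissue morphology make automated histopathology segmentation difficult, and annotated data are limited. A method is needed that segments accurately in this low-label setting.

Result: CSDS improves Dice scores by up to 1.2% on GlaS and 0.7% on CRAG with 5% labeled data, and by 0.7% and 1.4% with 10% labeled data.

Insight: Disentangling stain and structural features is key to better histopathology segmentation, especially with few labels. The dynamic uncertainty modules further improve robustness.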

Abstract: Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (H&E) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Our code and pre-trained models are available at https://github.com/hieuphamha19/CSDS.
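
The Exponential Moving Average (EMA) teacher update mentioned in the abstract above follows a simple rule, sketched below for a single student. This is an illustrative assumption, not the CSDS code: the abstract does not say how the two students are combined into the shared teacher, and the `decay` value and the parameter dictionaries here are hypothetical.

```python
import numpy as np


def ema_update(teacher: dict, student: dict, decay: float = 0.99) -> dict:
    """One EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Hypothetical parameters; a real model would hold many weight tensors.
teacher = {"w": np.zeros(3), "b": np.zeros(1)}
student = {"w": np.ones(3), "b": np.array([2.0])}

# With a fixed student, the teacher approaches it geometrically:
# after k steps it equals (1 - decay**k) * student.
for _ in range(10):
    teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher changes slowly, its pseudo-labels tend to be more stable than those of a student that is updated by gradient descent at every step.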

[77] DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering cs.CVPDF

Rongjia Zheng, Qing Zhang, Chengjiang Long, Wei-Shi Zheng

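TL;DR: To address the noisy-input problem of existing diffusion models in generative inverse rendering, DNF-Intrinsic proposes a deterministic, noise-free diffusion approach that predicts intrinsic properties directly via flow matching and shows superior performance in experiments.

Details

Motivation: Existing diffusion models for generative inverse rendering rely on noisy inputs, which lowers the quality of the predicted intrinsic properties, even though an image's structure and appearance information is crucial for inverse rendering.

Result: On synthetic and real-world datasets, DNF-Intrinsic clearly outperforms existing methods.

Insight: Using the source image directly instead of a noise input can significantly improve the robustness and output quality of inverse rendering.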

Abstract: Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to constrain that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods.

[78] VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision cs.CV | cs.ROPDF

Kezhong Liu, Yiwen Zhou, Mozi Chen, Jianhua He, Jingao Xu

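TL;DR: This paper proposes a mmWave radar scene flow estimation method supervised by visual-inertial (VI) sensors. It combines VI and mmWave radar data to remove the reliance on expensive 3D LiDAR and outperforms existing SOTA methods in smoke-filled environments.

Details

Motivation: Current scene flow estimation for mmWave radar relies on expensive 3D LiDAR data. VI sensors are widespread, but on their own they cannot capture 3D motion. The paper aims to combine VI and mmWave radar data for a low-cost, efficient scene flow estimation solution.

Result: Experiments show that, in smoke-filled environments, the method even outperforms existing SOTA methods that rely on expensive 3D LiDAR.

Insight: Fusing low-cost, widely available VI data with mmWave radar can solve scene flow estimation efficiently, particularly in adverse conditions such as smoke, and shows the potential of VI data for radar tasks.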

Abstract: This work proposes a scene flow estimation framework for mmWave radar, supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are more accessible, visual images alone cannot capture the 3D motions of moving objects, making it difficult to supervise their scene flow. Moreover, the temporal drift of the VI rigid transformation also degrades the scene flow estimation of static points. To address these challenges, we propose a drift-free rigid transformation estimator that fuses kinematic model-based ego-motions with neural network-learned results. It provides strong supervision signals to radar-based rigid transformation and infers the scene flow of static points. Then, we develop an optical-mmWave supervision extraction module that extracts the supervision signals of radar rigid transformation and scene flow. It strengthens the supervision by learning the scene flow of dynamic points with the joint constraints of optical and mmWave radar measurements. Extensive experiments demonstrate that, in smoke-filled environments, our method even outperforms state-of-the-art (SOTA) approaches using costly LiDARs.

[79] Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study cs.CV | cs.AIPDF

Kai Ye, Tianyi Chen, Zhen Wang

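TL;DR: The paper comprehensively evaluates eight perturbation-based protection methods against diffusion model personalization, covering the portrait and artwork domains, and offers practical guidance for method selection.

Details

Motivation: As diffusion models are increasingly used for image generation and personalization, privacy breaches and content misuse have become more pressing, so effective adversarial protection methods need study.

Result: The results provide practical guidance for method selection, and the code is publicly released.

Insight: Methods perform differently across domains, so a protection method should be chosen to fit the actual use case.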

Abstract: With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation-based protection methods (AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC) across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: https://github.com/vkeilo/DiffAdvPerturbationBench.

[80] CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning cs.CVPDF

Jeonghyo Song, Kimin Yun, DaeUng Jo, Jinyoung Kim, Youngjoon Yoo

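TL;DR: This paper proposes a new Chain-of-Thought (CoT)-based framework for improving OOD detection in road scenes, addressing the weaknesses of existing methods on dense, distant, and large foreground objects.

Details

Motivation: OOD detection in road scenes is critical to the reliability of semantic segmentation models, especially in complex driving environments. Although LLMs such as GPT-4 have advanced CoT reasoning, applying it to visual OOD detection remains underexplored.

Result: The framework outperforms existing methods on standard benchmarks and on a newly defined subset of the RoadAnomaly dataset, providing more robust and interpretable OOD detection.

Insight: CoT reasoning can significantly improve the performance and interpretability of visual tasks such as OOD detection, especially in complex scenes.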

Abstract: Effective Out-of-Distribution (OOD) detection is critical for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.

[81] LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts cs.CVPDF

Aleksandr Gushchin, Maksim Smirnov, Dmitriy Vatolin, Anastasia Antsiferova

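TL;DR: LEHA-CVQAD is a large-scale, richly annotated video quality assessment dataset of 6,240 clips that supports generalized assessment of compression artifacts. The paper also proposes a new metric, Rate-Distortion Alignment Error (RDAE), which measures how well VQA models preserve bitrate-quality ordering.

Details

Motivation: Existing video quality assessment datasets lack scale and diversity and do not support generalized assessment of compression artifacts. LEHA-CVQAD fills this gap and provides a new evaluation metric.

Result: Tests show that existing VQA metrics perform poorly on RDAE and have lower correlations, underscoring how challenging and useful the dataset is.

Insight: LEHA-CVQAD provides a benchmark for video quality assessment of compression artifacts, and RDAE offers a new direction for model optimization.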

Abstract: We propose the LEHA-CVQAD (Large-scale Enriched Human-Annotated) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. 59 source videos are encoded with 186 codec-preset variants; 1.8M pairwise ratings and 1.5k MOS ratings are fused into a single quality scale; part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset's challenges and utility. The open part and the results of LEHA-CVQAD are available at https://aleksandrgushchin.github.io/lcvqad/

[82] NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models cs.CV | cs.RO | eess.IVPDF

Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li

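TL;DR: NRSeg is a noise-resilient learning framework for BEV semantic segmentation that uses synthetic data to improve model performance and introduces three components, PGCM, BiDPP, and HLSE, to handle noise and non-mutual exclusivity.

Details

Motivation: BEV semantic segmentation is critical for autonomous driving, but in practice labeled data are limited and homogeneously distributed. Synthetic data add diversity, but their generation noise hinders model learning, so a noise-resilient method is needed.

Result: NRSeg improves mIoU by 13.8% and 11.4% on unsupervised and semi-supervised BEV segmentation tasks, respectively, reaching SOTA performance.

Insight: Noise in synthetic data can be handled effectively through quantitative evaluation and a robust learning framework, while non-mutual exclusivity in BEV tasks needs a dedicated module.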

Abstract: Bird's-Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, though pivotal for real-world applications, underperforms due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn-yu/NRSeg.
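
The Dirichlet branch described in the abstract above relies on evidential deep learning for uncertainty quantification. A minimal sketch of the widely used subjective-logic formulation follows; it is an assumption that NRSeg uses this exact form, and the `evidence` values are made up for illustration. Non-negative evidence e gives Dirichlet parameters alpha = e + 1, expected class probabilities alpha / S, and an uncertainty mass K / S, where S is the sum of alpha over K classes.

```python
import numpy as np


def dirichlet_from_evidence(evidence: np.ndarray):
    """Map non-negative evidence (..., K) to expected probs and uncertainty."""
    alpha = evidence + 1.0                        # Dirichlet concentration
    strength = alpha.sum(axis=-1, keepdims=True)  # S = sum_k alpha_k
    probs = alpha / strength                      # expected class probabilities
    num_classes = evidence.shape[-1]
    uncertainty = num_classes / strength[..., 0]  # u = K / S, in (0, 1]
    return probs, uncertainty


# Two hypothetical BEV cells: no evidence at all, and strong evidence for class 0.
evidence = np.array([[0.0, 0.0, 0.0],
                     [27.0, 0.0, 0.0]])
probs, uncertainty = dirichlet_from_evidence(evidence)
print(probs)
print(uncertainty)
```

A cell with no evidence gets uniform probabilities and maximal uncertainty, which is the property that makes this kind of output useful for down-weighting noisy synthetic samples.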

[83] Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing cs.CVPDF

Seungjin Jung, Kanghee Lee, Yonghyun Jeong, Haeun Noh, Jungmin Lee

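TL;DR: The paper proposes a new domain-generalizable face anti-spoofing method that jointly aligns the weights and biases of decision boundaries through feature orthogonal decomposition and group-wise scaling risk minimization, significantly improving generalization to unseen target domains.

Details

Motivation: Existing domain-generalizable face anti-spoofing methods align only the weights of local decision boundaries. The unaligned bias terms lead to inconsistent classification thresholds and degrade performance on unseen target domains.

Result: The method achieves SOTA performance on multiple benchmark datasets, with clear gains in accuracy and generalization stability and reduced bias misalignment.

Insight: Aligning both weights and biases is essential for domain generalization; orthogonally decomposing the feature space effectively separates domain-invariant from domain-specific features.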

Abstract: Domain Generalizable Face Anti-Spoofing (DGFAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
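
The abstract above says FOD uses Gram-Schmidt orthogonalization to split the feature space into domain-invariant and domain-specific subspaces. The sketch below shows the generic operation only, under assumptions: the `domain_dirs` matrix stands in for learned domain-specific directions, and projecting features onto the orthogonal complement is one plausible way to obtain an invariant part, not necessarily the paper's.

```python
import numpy as np


def gram_schmidt(vectors: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Orthonormalize the rows of `vectors` with modified Gram-Schmidt."""
    basis = []
    for v in vectors:
        w = v.astype(float)          # astype returns a fresh copy
        for q in basis:
            w -= (w @ q) * q         # remove the component along q
        norm = np.linalg.norm(w)
        if norm > tol:               # skip linearly dependent rows
            basis.append(w / norm)
    return np.array(basis)


def remove_subspace(features: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project features (n, d) onto the orthogonal complement of span(basis)."""
    return features - (features @ basis.T) @ basis


rng = np.random.default_rng(0)
domain_dirs = rng.normal(size=(3, 8))   # hypothetical domain-specific directions
features = rng.normal(size=(5, 8))      # hypothetical backbone features

basis = gram_schmidt(domain_dirs)
invariant = remove_subspace(features, basis)
```

After the projection, every row of `invariant` is orthogonal to each domain-specific direction, which is the orthogonality constraint the abstract describes enforcing during training.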

[84] Habitat Classification from Ground-Level Imagery Using Deep Neural Networks cs.CV | I.2.10; I.4.8; I.5.4; I.2.1PDF

Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August

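TL;DR: The study explores habitat classification from ground-level imagery with deep neural networks, compares CNN and ViT models, finds that ViTs outperform CNNs in classification and interpretability, and confirms that supervised contrastive learning helps distinguish visually similar habitats.

Details

Motivation: Traditional habitat assessment relies on expert surveys, which are costly. AI-driven remote sensing has been used for habitat classification but is limited by sensors and resolution. Ground-level imagery captures finer structural information but remains underexplored.

Result: ViT models outperform CNNs on Top-3 accuracy (91%) and MCC (0.66), and supervised contrastive learning significantly reduces misclassification among visually similar habitats.

Insight: Ground-level imagery combined with ViTs and supervised contrastive learning enables effective habitat classification, offering a scalable approach for conservation and land-use decisions.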

Abstract: Habitat assessment at local scales – critical for enhancing biodiversity and guiding conservation priorities – often relies on expert field survey that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. While most AI-driven habitat mapping depends on remote sensing, it is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above and remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models – convolutional neural networks (CNNs) and vision transformers (ViTs) – under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space. Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at the national scale.
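
The two headline metrics in the abstract above, Top-3 accuracy and the Matthews correlation coefficient (MCC), can be computed as in the following minimal sketch. The multiclass MCC uses the standard confusion-count formula; the tiny `scores` and label arrays are invented for illustration and have nothing to do with the Countryside Survey data.

```python
import math

import numpy as np


def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


def multiclass_mcc(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> float:
    """Multiclass MCC: (c*s - sum p_k t_k) / sqrt((s^2 - sum p_k^2)(s^2 - sum t_k^2))."""
    t = np.bincount(y_true, minlength=num_classes).astype(float)  # true counts
    p = np.bincount(y_pred, minlength=num_classes).astype(float)  # predicted counts
    s = float(len(y_true))                                        # total samples
    c = float(np.sum(y_true == y_pred))                           # correct samples
    numerator = c * s - float(p @ t)
    denominator = math.sqrt((s * s - float(p @ p)) * (s * s - float(t @ t)))
    return numerator / denominator if denominator > 0 else 0.0


# Toy example with 3 classes (hypothetical values).
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([2, 1, 0])
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)

y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 2, 2, 2])
mcc = multiclass_mcc(y_true, y_pred, num_classes=3)
```

MCC accounts for class imbalance in a way that plain accuracy does not, which is presumably why the paper reports it alongside Top-3 accuracy for 18 habitat classes of uneven frequency.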

[85] Exploring Kolmogorov-Arnold Network Expansions in Vision Transformers for Mitigating Catastrophic Forgetting in Continual Learning cs.CVPDF

Zahid Ullah, Jihie Kim

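TL;DR: The paper proposes introducing Kolmogorov-Arnold Networks (KANs) into Vision Transformers (ViTs) in place of conventional multilayer perceptrons (MLPs) to address catastrophic forgetting in continual learning. Experiments show that KAN-based ViTs retain knowledge of earlier tasks and significantly outperform MLP-based ViTs.

Details

Motivation: Catastrophic forgetting in continual learning (CL) is a key challenge for ViTs that use MLPs to learn global representations. A method is needed that retains learned knowledge while adapting to new tasks in dynamic environments.

Result: Experiments show that KAN-based ViTs significantly reduce catastrophic forgetting in continual learning, with better knowledge retention and task adaptation than MLP-based ViTs.

Insight: The local plasticity of KANs offers a new angle on catastrophic forgetting in continual learning, and combining ViTs with KANs may improve adaptability in dynamic environments.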

Abstract: Continual learning (CL), the ability of a model to learn new tasks without forgetting previously acquired knowledge, remains a critical challenge in artificial intelligence, particularly for vision transformers (ViTs) utilizing Multilayer Perceptrons (MLPs) for global representation learning. Catastrophic forgetting, where new information overwrites prior knowledge, is especially problematic in these models. This research proposes replacing MLPs in ViTs with Kolmogorov-Arnold Networks (KANs) to address this issue. KANs leverage local plasticity through spline-based activations, ensuring that only a subset of parameters is updated per sample, thereby preserving previously learned knowledge. The study investigates the efficacy of KAN-based ViTs in CL scenarios across benchmark datasets (MNIST, CIFAR100), focusing on their ability to retain accuracy on earlier tasks while adapting to new ones. Experimental results demonstrate that KAN-based ViTs significantly mitigate catastrophic forgetting, outperforming traditional MLP-based ViTs in knowledge retention and task adaptation. This novel integration of KANs into ViTs represents a promising step toward more robust and adaptable models for dynamic environments.

[86] PresentAgent: Multimodal Agent for Presentation Video Generation cs.CVPDF

Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang

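TL;DR: PresentAgent is a multimodal agent system that turns long-form documents into presentation videos with synchronized visual and spoken content, mimicking human-style presentations. With a modular pipeline and a new evaluation framework, PresentEval, it approaches human-level quality in content fidelity, visual clarity, and audience comprehension.

Details

Motivation: Existing methods only generate static slides or text summaries and cannot produce synchronized audio-visual content. PresentAgent fills this gap, aiming to turn static text into dynamic, effective presentation videos.

Result: Experiments on a dataset of 30 document-video pairs show that PresentAgent approaches human-level quality on all evaluation metrics.

Insight: Controllable multimodal agents have great potential for turning static text into dynamic presentation content, helped by modular design and goal-driven evaluation frameworks.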

Abstract: We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.

[87] T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images cs.CV | cs.AIPDF

Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago

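TL;DR: T-SYNTH is a large-scale synthetic breast imaging dataset generated by physics simulation, intended to address data scarcity and annotation difficulty in medical imaging algorithm development.

Details

Motivation: Medical imaging algorithm development faces scarce data and hard-to-obtain pixel-level annotations; synthetic data can alleviate this.

Result: Initial experiments indicate that T-SYNTH shows promise for augmenting real data and improving performance on breast imaging detection tasks.

Insight: Synthetic data can help compensate for the shortage of real data in medical imaging and advance algorithm development.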

Abstract: One of the key impediments for developing and assessing robust medical imaging algorithms is limited access to large-scale datasets with suitable annotations. Synthetic data generated with plausible physical and biological constraints may address some of these data limitations. We propose the use of physics simulations to generate synthetic images with pixel-level segmentation annotations, which are notoriously difficult to obtain. Specifically, we apply this approach to breast imaging analysis and release T-SYNTH, a large-scale open-source dataset of paired 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. Our initial experimental results indicate that T-SYNTH images show promise for augmenting limited real patient datasets for detection tasks in DM and DBT. Our data and code are publicly available at https://github.com/DIDSR/tsynth-release.

Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma

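TL;DR: The paper proposes MTU3D, a framework that integrates active perception with 3D vision-language learning and improves 3D scene understanding and navigation through spatial memory construction and a unified grounding-and-exploration objective.

Details

Motivation: Existing 3D vision-language models mostly focus on object grounding in static reconstructions and lack active exploration. The paper aims to combine visual grounding with exploration to improve an agent's understanding and navigation efficiency in 3D environments.

Result: MTU3D performs strongly on multiple benchmarks, improving success rate by 14%, 23%, 9%, and 2% on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively.

Insight: Combining active exploration with visual grounding is key to improving an embodied agent's environment understanding and navigation.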

Abstract: Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines Vision-Language-Exploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 23%, 9%, and 2% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.

[89] Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation cs.CV | cs.ROPDF

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo

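TL;DR: DIVER is an end-to-end driving framework that combines reinforcement learning with diffusion models to generate diverse, feasible trajectories, addressing the conservative behavior and mode collapse of imitation learning.

Details

Motivation: Current end-to-end autonomous driving methods rely mainly on imitation learning from single expert demonstrations, which yields conservative, homogeneous behavior and limits generalization in complex scenarios.

Result: Experiments on the NAVSIM, Bench2Drive, and nuScenes datasets confirm that DIVER significantly improves trajectory diversity and safety.

Insight: Combining reinforcement learning with diffusion models can effectively mitigate mode collapse in imitation learning and increase trajectory diversity.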

Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

[90] Consistent and Invariant Generalization Learning for Short-video Misinformation Detection cs.CV | cs.MMPDF

Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen

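TL;DR: The paper proposes DOCTOR, a method that tackles domain generalization in short-video misinformation detection through consistency learning and invariance learning, using cross-modal feature interpolation and a diffusion model to improve performance.

Details

Motivation: Short-video misinformation detection has attracted wide attention in the multi-modal field, but existing models generalize poorly from source domains to target domains, mainly because of modality dependence and domain bias. The paper aims to solve these problems through cross-modal learning and domain invariance.

Result: Experiments show that DOCTOR performs well in cross-domain detection, confirming its effectiveness.

Insight: Domain biases accumulate during cross-modal fusion and need to be mitigated through consistency learning and invariant feature extraction; diffusion models show potential for retaining core features.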

Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in videos and their accompanying audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we offer two insights into the characteristics of different domains: (1) Detection in different domains may rely mainly on different modalities (i.e., focusing mainly on video or audio). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We use cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multi-modal learning; (2) We design a diffusion model that adds noise to retain core multi-modal features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.

[91] Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic cs.CV | cs.AIPDF

Jianwei Tang, Hong Yang, Tengyue Chen, Jian-Fang Hu

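TL;DR: The paper proposes two memory banks, STAB and ACB, which address action-transition learning and action-characteristic learning respectively, and combines them with the AAA strategy for better human motion prediction.

Details

Motivation: In human motion prediction, transitions between actions are not smooth and action characteristics are hard to learn, which makes predictions unreasonable and inconsistent.

Result: Outperforms the previous state of the art on four datasets.

Insight: 1. Handling action transitions and action characteristics separately is more effective; 2. Combining soft searching with adaptive attention adjustment improves prediction quality.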

Abstract: Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard due to the varying transition speeds of different actions. Secondly, the action characteristic is difficult to learn because of the similarity of some actions. These issues cause the predicted results to be unreasonable and inconsistent. To address them, we propose two memory banks, the Soft-transition Action Bank (STAB) and Action Characteristic Bank (ACB). The STAB stores the action transition information. It is equipped with the novel soft searching approach, which encourages the model to focus on multiple possible action categories of observed motions. The ACB records action characteristics, which provide more prior information for predicting certain actions. To better fuse the features retrieved from the two banks, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms the previous state-of-the-art. The demo and code are available at https://hyqlat.github.io/STABACB.github.io/.
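
The soft searching idea in this abstract can be illustrated with a minimal sketch. It assumes the bank is a list of key/value vectors and that "soft" means a softmax-weighted read over all entries rather than picking the single best match. The function names, the dot-product similarity, and the `temperature` parameter are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_search(query, keys, values, temperature=1.0):
    """Return (weights, blended value) for a softmax-weighted bank lookup.

    Hypothetical stand-in for a memory-bank read: every entry contributes
    in proportion to its dot-product similarity with the query.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores, temperature)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended


# Toy bank with three action categories (all numbers are made up).
keys = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, blended = soft_search([1.0, 0.2], keys, values, temperature=0.5)
```

Lowering `temperature` pushes the weights towards a hard arg-max lookup, while raising it spreads attention over more candidate categories.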

[92] VICI: VLM-Instructed Cross-view Image-localisation cs.CVPDF

Xiaohan Zhang, Tavis Shore, Chen Chen, Oscar Mendez, Simon Hadfield

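TL;DR: This paper presents a high-performing solution to the UAVM 2025 Challenge that uses two-stage retrieval and re-ranking to improve the matching of narrow-FOV street-level images to satellite imagery.

Details

Motivation: Panoramic cross-view geo-localisation is nearing saturation, while real-world queries are more often narrow-FOV images with unknown camera parameters. This work explores the best achievable performance under these constraints.

Result: On the University-1652 dataset, R@1 and R@10 reach \topone% and \topten% respectively (placeholder macros left unresolved in the abstract).

Insight: Optimised retrieval and re-ranking strategies can substantially improve performance on practical geo-localisation tasks, especially under viewpoint and scale variations.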

Abstract: In this paper, we present a high-performing solution to the UAVM 2025 Challenge, which focuses on matching narrow FOV street-level images to corresponding satellite imagery using the University-1652 dataset. As panoramic Cross-View Geo-Localisation nears peak performance, it becomes increasingly important to explore more practical problem formulations. Real-world scenarios rarely offer panoramic street-level queries; instead, queries typically consist of limited-FOV images captured with unknown camera parameters. Our work prioritises discovering the highest achievable performance under these constraints, pushing the limits of existing architectures. Our method begins by retrieving candidate satellite image embeddings for a given query, followed by a re-ranking stage that selectively enhances retrieval accuracy within the top candidates. This two-stage approach enables more precise matching, even under the significant viewpoint and scale variations inherent in the task. Through experimentation, we demonstrate that our approach achieves competitive results, specifically attaining R@1 and R@10 retrieval rates of \topone% and \topten% respectively. This underscores the potential of optimised retrieval and re-ranking strategies in advancing practical geo-localisation performance. Code is available at https://github.com/tavisshore/VICI.
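
The retrieve-then-re-rank pattern and the R@k metric mentioned here can be sketched as follows. This is a generic illustration under stated assumptions: embeddings are plain lists of floats, stage one uses cosine similarity, and stage two uses an arbitrary scoring callable standing in for a costlier model. The gallery ids and re-ranker scores are made up and do not come from the VICI code base.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, gallery, k):
    """Stage 1: rank every gallery embedding by cosine similarity, keep top k ids."""
    ranked = sorted(gallery, key=lambda gid: cosine(query, gallery[gid]), reverse=True)
    return ranked[:k]


def rerank(candidates, score_fn):
    """Stage 2: reorder only the shortlisted ids with a second scorer."""
    return sorted(candidates, key=score_fn, reverse=True)


def recall_at_k(rankings, truths, k):
    """Fraction of queries whose true match appears in the first k results."""
    hits = sum(1 for ranking, truth in zip(rankings, truths) if truth in ranking[:k])
    return hits / len(truths)


# Toy gallery of satellite embeddings (hypothetical values).
gallery = {"sat_a": [1.0, 0.0], "sat_b": [0.8, 0.6], "sat_c": [0.0, 1.0]}
query = [0.9, 0.1]

shortlist = retrieve(query, gallery, k=2)

# Hypothetical re-ranker scores; in practice a heavier model would produce them.
fine_scores = {"sat_a": 0.2, "sat_b": 0.9}
final = rerank(shortlist, lambda gid: fine_scores[gid])
```

Because the second stage only touches the shortlist, its cost grows with `k` rather than with the gallery size, which is what makes an expensive re-ranker affordable.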

[93] Integrated Gaussian Processes for Robust and Adaptive Multi-Object Tracking cs.CV | stat.AP | stat.MEPDF

Fred Lydeard, Bashar I. Ahmad, Simon Godsill

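TL;DR: The paper proposes two robust and adaptive multi-object trackers (GaPP-Class and GaPP-ReaCtion) that combine an integrated Gaussian process with a non-homogeneous Poisson process and significantly reduce track breaks.

Details

Motivation: To address track breaks, dynamically changing scenes, and object class inference in multi-object tracking, the paper seeks an efficient and adaptive tracking method.

Result: Experiments on synthetic and real data show that the proposed algorithms outperform other state-of-the-art trackers, with track breaks reduced by around 30% on real radar data.

Insight: 1. Combining the integrated Gaussian process with the non-homogeneous Poisson process provides flexibility and robustness;
2. On-line learning and the track revival/stitching mechanism significantly improve tracking performance.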

Abstract: This paper presents a computationally efficient multi-object tracking approach that can minimise track breaks (e.g., in challenging environments and against agile targets), learn the measurement model parameters on-line (e.g., in dynamically changing scenes) and infer the class of the tracked objects, if joint tracking and kinematic behaviour classification is sought. It capitalises on the flexibilities offered by the integrated Gaussian process as a motion model and the convenient statistical properties of non-homogeneous Poisson processes as a suitable observation model. This can be combined with the proposed effective track revival / stitching mechanism. We accordingly introduce the two robust and adaptive trackers, Gaussian and Poisson Process with Classification (GaPP-Class) and GaPP with Revival and Classification (GaPP-ReaCtion). They employ an appropriate particle filtering inference scheme that efficiently integrates track management and hyperparameter learning (including the object class, if relevant). GaPP-ReaCtion extends GaPP-Class with the addition of a Markov Chain Monte Carlo kernel applied to each particle permitting track revival and stitching (e.g., within a few time steps after deleting a trajectory). Performance evaluation and benchmarking using synthetic and real data show that GaPP-Class and GaPP-ReaCtion outperform other state-of-the-art tracking algorithms. For example, GaPP-ReaCtion significantly reduces track breaks (e.g., by around 30% from real radar data and markedly more from simulated data).

[94] PromptSR: Cascade Prompting for Lightweight Image Super-Resolution cs.CVPDF

Wenyang Liu, Chen Cai, Jianjun Gao, Kejun Wu, Yi Wang

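TL;DR: PromptSR is a prompt-based lightweight image super-resolution method whose cascade prompting block combines global and local information to overcome the limited receptive field of window-based self-attention.

Details

Motivation: Lightweight Vision Transformers for image super-resolution are constrained by the receptive field of window-based self-attention, and enlarging the window sharply increases computational cost. PromptSR aims to resolve this tension.

Result: Experiments show that PromptSR outperforms state-of-the-art lightweight SR methods while remaining competitive in computational complexity.

Insight: Cascaded prompting combines global information with local detail efficiently, suggesting a new direction for lightweight Transformer design.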

Abstract: Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at https://github.com/wenyang001/PromptSR.
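
The cost argument in this abstract can be made concrete with a small counting sketch. It counts query-key pairs only, assumes square non-overlapping windows of side `m`, and models anchors as a grid downscaled by an integer factor. The function names and the anchor layout are assumptions for illustration, not the PromptSR implementation.

```python
def window_attention_pairs(height, width, m):
    """Query-key pairs when attention is restricted to m x m windows."""
    if height % m or width % m:
        raise ValueError("feature map must be divisible by the window side")
    tokens_per_window = m * m
    num_windows = (height // m) * (width // m)
    # Every token attends to every token in its own window.
    return num_windows * tokens_per_window ** 2


def anchor_attention_pairs(height, width, scale):
    """Query-key pairs when every token attends to a downscaled anchor grid."""
    anchors = (height // scale) * (width // scale)
    return height * width * anchors


small = window_attention_pairs(64, 64, 8)
large = window_attention_pairs(64, 64, 16)
anchored = anchor_attention_pairs(64, 64, 8)
```

Doubling the window side quadruples the pair count (`large == 4 * small`), whereas in this toy setting the anchor variant reaches every region of the map at the same pair count as the 8 x 8 windows.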

[95] Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge cs.CV | cs.AIPDF

Linshen Liu, Boyan Su, Junyue Jiang, Guanlin Wu, Cong Guo

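TL;DR: The paper presents EMC2, an edge-based Mixture of Experts (MoE) collaborative computing system for efficient, high-accuracy 3D object detection in autonomous driving, which improves both speed and accuracy over baseline methods.

Details

Motivation: Autonomous driving needs low-latency, high-accuracy 3D object detection, which existing methods struggle to deliver together on resource-constrained edge devices.

Result: On KITTI, average accuracy improves by 3.58% and inference is sped up by 159.06% relative to 15 baseline methods on Jetson platforms; gains on nuScenes are similar.

Insight: A Mixture of Experts architecture combined with edge-oriented optimizations can substantially improve both real-time performance and accuracy in autonomous driving tasks.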

Abstract: This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs.
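
Routing on visibility and distance, as described in this abstract, might look roughly like the sketch below. Everything specific here is hypothetical: the `Detection` fields, the expert names, and the thresholds are placeholders chosen for illustration, and a learned router could replace the hand-written rule.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    distance_m: float  # range to the object in metres
    visibility: float  # unoccluded fraction of the object, in [0, 1]


def route(det, near_m=30.0, min_visibility=0.5):
    """Pick an expert name from visibility and distance (hypothetical rule)."""
    if det.visibility < min_visibility:
        return "occlusion_expert"
    if det.distance_m > near_m:
        return "long_range_expert"
    return "default_expert"


def dispatch(detections):
    """Group detections by the expert chosen for each of them."""
    batches = {}
    for det in detections:
        batches.setdefault(route(det), []).append(det)
    return batches


# Made-up scene with three objects.
scene = [Detection(12.0, 0.9), Detection(55.0, 0.8), Detection(20.0, 0.3)]
batches = dispatch(scene)
```

Grouping by expert before inference lets each expert run once per frame on its own batch, which matters on devices where kernel launches are expensive.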

[96] Driver-Net: Multi-Camera Fusion for Assessing Driver Take-Over Readiness in Automated Vehicles cs.CV | cs.AI | cs.ET | cs.LG | cs.ROPDF

Mahdi Rezaei, Mohsen Azarmi

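TL;DR: Driver-Net is a deep learning framework that fuses multi-camera inputs, capturing synchronised visual cues from the driver's head, hands, and body posture to assess take-over readiness in automated vehicles in real time.

Details

Motivation: Safe transfer of control in automated vehicles requires an accurate assessment of driver readiness, and conventional methods that focus on a single visual cue (such as head pose or eye gaze) are limited.

Result: On the University of Leeds Driving Simulator dataset, Driver-Net reaches an accuracy of up to 95.8%, outperforming existing approaches.

Insight: Multimodal and multi-view fusion is essential for assessing driver state, and the approach offers a practical real-time, non-intrusive solution aligned with upcoming safety standards.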

Abstract: Ensuring safe transition of control in automated vehicles requires an accurate and timely assessment of driver readiness. This paper introduces Driver-Net, a novel deep learning framework that fuses multi-camera inputs to estimate driver take-over readiness. Unlike conventional vision-based driver monitoring systems that focus on head pose or eye gaze, Driver-Net captures synchronised visual cues from the driver’s head, hands, and body posture through a triple-camera setup. The model integrates spatio-temporal data using a dual-path architecture, comprising a Context Block and a Feature Block, followed by a cross-modal fusion strategy to enhance prediction accuracy. Evaluated on a diverse dataset collected from the University of Leeds Driving Simulator, the proposed method achieves an accuracy of up to 95.8% in driver readiness classification. This performance significantly enhances existing approaches and highlights the importance of multimodal and multi-view fusion. As a real-time, non-intrusive solution, Driver-Net contributes meaningfully to the development of safer and more reliable automated vehicles and aligns with new regulatory mandates and upcoming safety standards.

[97] Pedestrian Intention Prediction via Vision-Language Foundation Models cs.CV | cs.AI | cs.ET | cs.LG | cs.ROPDF

Mohsen Azarmi, Mahdi Rezaei, He Wang

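TL;DR: The paper explores vision-language foundation models (VLFMs) for predicting pedestrian crossing intention, integrating multimodal data through hierarchical prompt templates to improve prediction accuracy.

Details

Motivation: Conventional vision-based methods for crossing intention prediction struggle with generalisability, context understanding, and causal reasoning, so the study investigates VLFMs to address these limitations.

Result: Experiments on JAAD, PIE, and FU-PIP show prediction accuracy gains of up to 19.8%, with an automatic prompt engineering framework adding a further 12.5%.

Insight: VLFMs outperform conventional vision-based methods for pedestrian intention prediction, particularly in generalisation and contextual understanding, which makes them a promising basis for autonomous driving applications.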

Abstract: Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, physical cue observations, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts improves prediction accuracy by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded 12.5% further accuracy gains. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.
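
A layered prompt that folds in ego-vehicle speed, its change over time, and a time horizon could be assembled as in the sketch below. The template wording, the field names, and the 1 km/h tolerance are all illustrative assumptions; the paper's actual templates are not reproduced here.

```python
def describe_speed(speeds_kmh, tolerance=1.0):
    """Summarise how ego speed changed between the first and last frame."""
    delta = speeds_kmh[-1] - speeds_kmh[0]
    if delta > tolerance:
        return "accelerating"
    if delta < -tolerance:
        return "decelerating"
    return "holding steady"


def build_prompt(scene, cues, speeds_kmh, horizon_s):
    """Stack task, scene, pedestrian, vehicle, and timing lines into one prompt."""
    lines = [
        "Task: decide whether the pedestrian will cross in front of the vehicle.",
        f"Scene: {scene}",
        "Pedestrian cues: " + "; ".join(cues),
        f"Ego vehicle: {speeds_kmh[-1]:.0f} km/h, "
        f"{describe_speed(speeds_kmh)} over the observed frames.",
        f"Time horizon: predict the intention for the next {horizon_s:.1f} s.",
        "Answer with 'crossing' or 'not crossing'.",
    ]
    return "\n".join(lines)


# Hypothetical observation window.
prompt = build_prompt(
    scene="urban street, zebra crossing ahead",
    cues=["standing at the kerb", "looking towards the vehicle"],
    speeds_kmh=[38.0, 33.0, 27.0],
    horizon_s=1.5,
)
```

Keeping each layer on its own line makes it easy to ablate one source of context at a time, for example dropping the ego-vehicle line to measure how much speed information contributes.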

[98] Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation cs.CVPDF

Fernando Gabriela Garcia, Spencer Burns, Ryan Shaw, Hunter Young

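TL;DR: The paper proposes Hi-SSLVLM, a model that improves text-to-image generation through a hierarchical self-supervised learning strategy, addressing the limits of conventional methods in fine-grained control and spatial relationships, and reports strong results on several benchmarks.

Details

Motivation: Conventional text-to-image models depend on large annotated datasets and struggle to control visual attributes and spatial relationships precisely in complex prompts. Hi-SSLVLM aims to reduce reliance on annotation through self-supervised learning while improving generation quality and controllability.

Result: It performs strongly on benchmarks such as Gemini-2.0-Flash, and human evaluation confirms its advantages in prompt fidelity, compositional accuracy, and aesthetic quality.

Insight: Self-supervised learning can greatly reduce dependence on annotated data, and hierarchical alignment together with internal planning effectively improves semantic consistency and controllability.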

Abstract: This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM’s superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component. Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM’s enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.

[99] LVLM-Composer’s Explicit Planning for Image Generation cs.CVPDF

Spencer Ramsey, Jeffrey Lee, Amina Grant

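TL;DR: LVLM-Composer uses explicit planning and a multi-stage training paradigm to improve compositional accuracy and controllability in complex text-to-image generation.

Details

Motivation: Current large vision-language models fall short in compositional understanding and visual planning when generating images from complex text descriptions, especially in rendering multiple objects, attributes, and spatial relationships accurately.

Result: On the LongBench-T2I benchmark, LVLM-Composer significantly outperforms existing methods in object accuracy, composition fidelity, and pose accuracy.

Insight: Explicit planning and modular design are key to better image generation for complex scenes, and multi-stage training strengthens compositional reasoning.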

Abstract: The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. We propose a multi-stage training paradigm, featuring Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction, to instill robust compositional reasoning. Extensive experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer’s superior performance across critical compositional dimensions including object accuracy, composition fidelity, and pose accuracy, significantly outperforming state-of-the-art baselines. An in-depth ablation study further validates the indispensable contribution of our proposed modules, while human evaluations confirm the perceptual superiority of our generated images. LVLM-Composer represents a significant step towards truly controllable and compositionally accurate open-ended text-to-image generation.

[100] Voyaging into Unbounded Dynamic Scenes from a Single View cs.CVPDF

Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, René Vidal

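TL;DR: The paper proposes DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting process in order to generate unbounded dynamic scenes from a single view.

Details

Motivation: Dynamic scene generation has wide applications in augmented/virtual reality and robotics, but existing methods rely on multi-view training, which limits the range of camera movement. The paper aims to remove this limitation.

Result: Experiments show that the model generates unbounded scenes with consistent motion, and that the generated content can be controlled with scene prompts.

Insight: Treating pixels as rays and combining them with point cloud information is an effective way to generate 3D-consistent dynamic scenes from a single view.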

Abstract: This paper studies the problem of generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Since the scene is changing over time, different generated views need to be consistent with the underlying 3D motions. While previous works learn such consistency by training from multiple views, the generated scene regions are bounded to be close to the training views with limited camera movements. To address this issue, we propose DynamicVoyager that reformulates the dynamic scene generation as a scene outpainting process for new dynamic content. As 2D outpainting models can hardly generate 3D consistent motions from only 2D pixels at a single view, we consider pixels as rays to enrich the pixel input with the ray context, so that the 3D motion consistency can be learned from the ray information. More specifically, we first map the single-view video input to a dynamic point cloud with the estimated video depths. Then we render the partial video at a novel view and outpaint the video with ray contexts from the point cloud to generate 3D consistent motions. We employ the outpainted video to update the point cloud, which is used for scene outpainting from future novel views. Experiments show that our model is able to generate unbounded scenes with consistent motions along fly-through cameras, and the generated contents can be controlled with scene prompts.
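
Two of the geometric steps named in this abstract, lifting a depth map to a point cloud and treating each pixel as a ray, can be sketched under a standard pinhole camera assumption. The intrinsics (`fx`, `fy`, `cx`, `cy`) and the toy depth map are hypothetical, and the sketch handles one frame only; it is not the DynamicVoyager pipeline.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit direction of the camera ray through pixel (u, v)."""
    direction = ((u - cx) / fx, (v - cy) / fy, 1.0)
    norm = math.sqrt(sum(c * c for c in direction))
    return tuple(c / norm for c in direction)


def unproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (rows of z values) into camera-space points.

    Pixels with non-positive depth are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 2x2 depth map with two valid pixels (made-up values).
depth = [[2.0, 0.0],
         [0.0, 4.0]]
cloud = unproject(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
centre_ray = pixel_ray(0, 0, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

For a video, running `unproject` per frame with per-frame depth would give a time-indexed point cloud, and `pixel_ray` supplies the per-pixel ray context that a 2D outpainting model lacks.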

[101] MoReMouse: Monocular Reconstruction of Laboratory Mouse cs.CVPDF

Yuan Zhong, Jingxiang Sun, Liang An, Yebin Liu

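TL;DR: MoReMouse is a monocular dense 3D reconstruction network for laboratory mice that improves reconstruction accuracy and robustness through a high-fidelity synthetic dataset, a transformer-based feedforward architecture, and geodesic-based embeddings.

Details

Motivation: Laboratory mice are crucial to biomedical research, but accurate 3D surface motion reconstruction remains challenging because of their complex non-rigid deformations and textureless appearance. The lack of structured 3D datasets further limits progress.

Result: Experiments show that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness.

Insight: Synthetic data and strong semantic priors address the complex deformation and missing texture in mouse 3D reconstruction, providing a new tool for biomedical research.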

Abstract: Laboratory mice play a crucial role in biomedical research, yet accurate 3D mouse surface motion reconstruction remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders the progress beyond sparse keypoint tracking. To narrow the gap, we present MoReMouse, the first monocular dense 3D reconstruction network tailored for laboratory mice. To achieve this goal, we highlight three key designs. First, we construct the first high-fidelity dense-view synthetic dataset for mice, by rendering our self-designed realistic Gaussian mouse avatar. Second, MoReMouse adopts a transformer-based feedforward architecture with triplane representation, achieving high-quality 3D surface generation from a single image. Third, we create geodesic-based continuous correspondence embeddings on mouse surface, which serve as strong semantic priors to improve reconstruction stability and surface consistency. Extensive quantitative and qualitative experiments demonstrate that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness. Video results are available at https://zyyw-eric.github.io/MoreMouse-webpage/.

Sangbum Choi, Kyeongryeol Go

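TL;DR: ZERO is a zero-shot multi-prompt object detection model designed for large-scale industrial deployment; it detects objects from multimodal prompts (textual and visual cues) and performs well with little annotated data.

Details

Motivation: Object detection in industrial settings must handle diverse and dynamic targets, and conventional methods need large amounts of annotated data. ZERO reduces this dependence through multimodal prompts and zero-shot learning.

Result: Strong performance on the RF20VL-fsod benchmark supports ZERO's efficiency and adaptability under limited annotation budgets.

Insight: ZERO demonstrates the potential of multimodal prompts and zero-shot learning for industrial object detection, pointing to prompt-driven, data-centric AI as a direction for dynamic environments.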

Abstract: Recent advances in artificial intelligence have led to the emergence of foundation models, large-scale pre-trained neural networks that serve as versatile starting points for a wide range of downstream tasks. In this work, we present ZERO, a zero-shot multi-prompt object detection model specifically designed for robust, production-ready deployment across diverse industrial domains. ZERO integrates direct image input with multiple user-defined prompts, which can include both textual and visual cues, and processes them through dedicated encoders to generate accurate detection outputs. The model architecture is optimized for scalability, with a total of 1.033 TFLOPS and 622.346 million parameters, and is trained using a domain-specific image database exceeding one billion images. For the CVPR 2025 Foundational Few-Shot Object Detection (FSOD) Challenge, we introduce a domain-specific fine-tuning strategy that emphasizes prompt diversity and conservative pseudo-labeling, enabling effective adaptation to new domains with minimal supervision. Our approach demonstrates practical advantages in flexibility, efficiency, and real-world applicability, achieving strong performance on the RF20VL-fsod benchmark despite limited annotation budgets. The results highlight the potential of prompt-driven, data-centric AI for scalable and adaptive object detection in dynamic industrial environments.

[103] Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices cs.CVPDF

Guangrui Bai, Hailong Yan, Wenhai Liu, Yahui Deng, Erbao Dong

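TL;DR: LiteIE is an ultra-lightweight unsupervised low-light image enhancement framework for mobile devices that combines high performance with low computational overhead.

Details

Motivation: Existing deep learning methods rely on large networks and labeled data, which makes real-time operation on resource-constrained mobile devices difficult, so a more efficient solution is needed.

Result: On the LOL dataset, LiteIE reaches 19.04 dB PSNR (1.4 dB above SOTA) with only 0.07% of the SOTA's parameters, and it processes 4K images in real time (30 FPS) on a mobile processor.

Insight: A lightweight, unsupervised design can solve low-light enhancement efficiently and is well suited to deployment on edge devices.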

Abstract: Real-time low-light image enhancement on mobile and embedded devices requires models that balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, limiting their deployment on resource-constrained platforms. In this paper, we propose LiteIE, an ultra-lightweight unsupervised enhancement framework that eliminates dependence on large-scale supervision and generalizes well across diverse conditions. We design a backbone-agnostic feature extractor with only two convolutional layers to produce compact image features enhancement tensors. In addition, we develop a parameter-free Iterative Restoration Module, which reuses the extracted features to progressively recover fine details lost in earlier enhancement steps, without introducing any additional learnable parameters. We further propose an unsupervised training objective that integrates exposure control, edge-aware smoothness, and multi-scale color consistency losses. On the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. On a Snapdragon 8 Gen 3 mobile processor, LiteIE runs at 30 FPS for 4K images with just 58 parameters, enabling real-time deployment on edge devices. These results establish LiteIE as an efficient and practical solution for low-light enhancement on resource-limited platforms.

[104] SeqTex: Generate Mesh Textures in Video Sequence cs.CV | cs.AI | cs.GRPDF

Ze Yuan, Xin Yu, Yangtian Sun, Yuan-Chen Guo, Yan-Pei Cao

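TL;DR: SeqTex is an end-to-end framework that uses pretrained video foundation models to generate complete UV texture maps directly, avoiding the problems that existing methods incur from multi-view image generation followed by post-processing.

Details

Motivation: Existing methods generate 3D textures through multi-stage pipelines, which leads to error accumulation and spatial inconsistency. The lack of high-quality 3D texture datasets also limits generalization.

Result: Experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation, with better 3D consistency and texture-geometry alignment.

Insight: Recasting texture generation as a sequence problem and drawing on the priors of pretrained video models can substantially improve generation quality and efficiency.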

Abstract: Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps – an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.

Shenxi Liu, Kan Li, Mingyang Zhao, Yuhang Tian, Bin Li

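TL;DR: M3-Med is a benchmark for multi-lingual, multi-modal, and multi-hop reasoning in medical instructional video understanding, designed to address the single-language focus and shallow reasoning of existing benchmarks.

Details

Motivation: Current multimodal understanding benchmarks are largely confined to English and lack deep reasoning tasks, so they do not meet the needs of professional domains such as medical education.

Result: Experiments show that current models, including large language models, perform far below human experts on multi-hop reasoning tasks, highlighting their limitations in deep reasoning within specialized domains.

Insight: M3-Med exposes the shortcomings of existing AI models in deep cross-modal understanding within specialized domains and offers a new direction for future research.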

Abstract: With the rapid progress of artificial intelligence (AI) in multi-modal understanding, there is increasing potential for video comprehension technologies to support professional domains such as medical education. However, existing benchmarks suffer from two primary limitations: (1) Linguistic Singularity: they are largely confined to English, neglecting the need for multilingual resources; and (2) Shallow Reasoning: their questions are often designed for surface-level information retrieval, failing to properly assess deep multi-modal integration. To address these limitations, we present M3-Med, the first benchmark for Multi-lingual, Multi-modal, and Multi-hop reasoning in Medical instructional video understanding. M3-Med consists of medical questions paired with corresponding video segments, annotated by a team of medical experts. A key innovation of M3-Med is its multi-hop reasoning task, which requires a model to first locate a key entity in the text, then find corresponding visual evidence in the video, and finally synthesize information across both modalities to derive the answer. This design moves beyond simple text matching and poses a substantial challenge to a model’s deep cross-modal understanding capabilities. We define two tasks: Temporal Answer Grounding in Single Video (TAGSV) and Temporal Answer Grounding in Video Corpus (TAGVC). We evaluated several state-of-the-art models and Large Language Models (LLMs) on M3-Med. The results reveal a significant performance gap between all models and human experts, especially on the complex multi-hop questions where model performance drops sharply. M3-Med effectively highlights the current limitations of AI models in deep cross-modal reasoning within specialized domains and provides a new direction for future research.

[106] MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation cs.CVPDF

Weilun Feng, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiangqi Li

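TL;DR: MPQ-DMv2 is a flexible mixed-precision quantization framework for extremely low-bit (2-4 bit) diffusion models. It addresses the performance degradation that existing quantization methods suffer from outliers, suboptimal initialization, and optimization strategy, and it significantly improves generation performance.

Details

Motivation: Diffusion models excel at visual generation, but their high computational complexity limits use on edge devices. Existing quantization methods degrade severely at extremely low bit widths (2-4 bits), so better quantizers and optimization strategies are needed.

Result: Experiments show that MPQ-DMv2 outperforms existing methods across generation tasks and architectures, especially at extremely low bit widths.

Insight: 1. Combining mixed precision with a residual design handles outliers effectively; 2. Temporal consistency distillation is key for low-bit quantized diffusion models.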

Abstract: Diffusion models have demonstrated remarkable performance on vision generation tasks. However, the high computational complexity hinders their wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization. Directly applying these methods will cause severe performance degradation. We identify that the existing quantization framework suffers from the outlier-unfriendly quantizer design, suboptimal initialization, and optimization strategy. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for a uniform quantizer. We propose Flexible Z-Order Residual Mixed Quantization that utilizes an efficient binary residual branch for flexible quant steps to handle salient error. For the optimization framework, we theoretically analyze the convergence and optimality of the LoRA module and propose Object-Oriented Low-Rank Initialization to use prior quantization error for informative initialization. We then propose Memory-based Temporal Relation Distillation to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between quantized and full-precision model. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.
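
Why a salient outlier hurts a uniform quantizer, and how a one-bit residual branch can recover part of the error, can be seen in a small numeric sketch. It uses plain min-max uniform quantization and a generic sign-times-scale residual with the scale set to the mean absolute residual. This is a textbook-style stand-in chosen for illustration, not the paper's Flexible Z-Order Residual Mixed Quantization, and the weight values are made up.

```python
def uniform_quantize(xs, bits):
    """Min-max uniform quantization to 2**bits levels, returned dequantized."""
    lo, hi = min(xs), max(xs)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((x - lo) / step) * step for x in xs]


def add_binary_residual(xs, qs):
    """Correct each quantized value by +/- alpha, one extra bit per element.

    alpha is the mean absolute residual, shared across the whole tensor.
    """
    residuals = [x - q for x, q in zip(xs, qs)]
    alpha = sum(abs(r) for r in residuals) / len(residuals)
    return [q + (alpha if r >= 0 else -alpha) for q, r in zip(qs, residuals)]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Five small weights plus one salient outlier (hypothetical values).
weights = [-0.9, -0.1, 0.0, 0.05, 0.2, 6.0]
coarse = uniform_quantize(weights, bits=2)
refined = add_binary_residual(weights, coarse)
base_err = mse(weights, coarse)
res_err = mse(weights, refined)
```

The outlier stretches the quantization range so that most small weights collapse onto one level; with this choice of `alpha`, the residual branch lowers the mean squared error by exactly `alpha ** 2`.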

[107] Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis cs.CVPDF

Zhipeng Li, Kegang Wang, Hanguang Xiao, Xingyue Liu, Feizhong Zhou

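TL;DR: The paper introduces DLCN, a large-scale dataset for remote physiological signal measurement (rPPG) under dynamic lighting conditions at night, filling a gap in the field, and analyses the limitations of existing algorithms through experiments.

Details

Motivation: Current rPPG algorithms perform well under ideal lighting, but their effectiveness in nighttime scenes with dynamic lighting has not been studied adequately. The lack of suitable datasets has held the area back.

Result: The DLCN dataset offers high diversity and realism and exposes the limitations of existing rPPG algorithms under complex lighting.

Insight: Dynamic nighttime lighting poses new challenges for rPPG algorithms, and more robust methods are needed for real-world use.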

Abstract: Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at https://github.com/dalaoplan/Happp-rPPG-Toolkit.

Yuanhe Tian, Chen Su, Junwen Duan, Yan Song

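TL;DR: This paper proposes a graph-based cross-modal feature modeling method for visual question answering on medical images. A dynamic graph structure that fuses CT slices and question tokens significantly improves answer precision.

Details

Motivation: Medical VQA methods usually extract visual and textual features independently, overlooking the spatial continuity and inter-slice correlations of CT data, which leads to inaccurate answers. A method that models cross-modal feature interaction effectively is therefore needed.

Result: On the M3D-VQA benchmark, the proposed method outperforms baselines across multiple evaluation metrics and shows stronger reasoning ability.

Insight: A cross-modal graph structure can model the complex relationships between medical images and natural language, and dynamic feature fusion offers a new direction for multimodal learning.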

Abstract: Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.

Hanshi Wang, Jin Gao, Weiming Hu, Zhipeng Zhang

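TL;DR: MambaFusion is the first work to show that a pure Mamba block can perform efficient dense global fusion while maintaining top performance in multi-modal 3D object detection. With height-fidelity LiDAR encoding and a Hybrid Mamba Block, it reaches a state-of-the-art 75.0 NDS on the nuScenes validation set while remaining fast.

Details

Motivation: Existing fusion strategies cannot simultaneously achieve efficiency, long-range modeling, and complete retention of scene information, which limits performance.

Result: The method achieves SOTA performance of 75.0 NDS on the nuScenes validation set, surpassing methods that use high-resolution inputs, with faster inference than most recent state-of-the-art methods.

Insight: Directly applying linear-complexity methods can fail because of information loss; alignment needs to be improved using scene properties such as height.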

Abstract: We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion while guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retention of complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with the top-tier NDS score of 75.0 on the nuScenes validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed than most recent state-of-the-art methods.

Xiao Zhang, Johan Bos

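TL;DR: This paper proposes a multi-modal framework that uses vision-language models (VLMs) and retrieval-augmented generation (RAG) to convert tombstone images into structured semantic representations (TMRs), substantially improving the accuracy and robustness of tombstone content parsing.

Details

Motivation: Tombstones are artifacts of rich historical and cultural value, but they face preservation challenges such as physical erosion and environmental damage. Traditional OCR-based methods have limited parsing accuracy, so a more effective multi-modal approach is needed to protect this cultural heritage.

Result: The model achieves an F1 score of 89.5 on tombstone content parsing, far above the 36.1 of traditional OCR methods, and remains robust across diverse cultures and degradation conditions.

Insight: 1. Multi-modal methods markedly improve the accuracy of cultural heritage digitization; 2. retrieval-augmented generation compensates for the limits of single-modality parsing; 3. the framework can serve as a reference for other artifact digitization tasks.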

Abstract: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model’s robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.

[111] Transferring Visual Explainability of Self-Explaining Models through Task Arithmetic cs.CV | cs.AI | cs.LG

Yuya Yoshikawa, Ryotaro Shimizu, Takahiro Kawashima, Yuki Saito

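TL;DR: This paper proposes a method that uses a task arithmetic framework to transfer the visual explainability of self-explaining models from a source domain to a target domain, avoiding high labeling and computational costs, and validates it on multiple datasets.

Details

Motivation: Self-explaining models produce predictions and explanations together in image classification, but they are costly to train. This paper aims to cut that cost through transfer while maintaining or improving explanation quality.

Result: Experiments show that, except for transfers between some less-related domains, explainability transfers successfully, and the explanation quality obtained with a single inference is comparable to that of Kernel SHAP, which requires 150 model inferences.

Insight: Explainability vectors learned on a large dataset such as ImageNet are general and can improve explanation quality across multiple target domains.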

Abstract: In scenarios requiring both prediction and explanation efficiency for image classification, self-explaining models that perform both tasks in a single inference are effective. However, their training incurs substantial labeling and computational costs. This study aims to tackle the issue by proposing a method to transfer the visual explainability of self-explaining models, learned in a source domain, to a target domain based on a task arithmetic framework. Specifically, we construct a self-explaining model by extending image classifiers based on a vision-language pretrained model. We then define an explainability vector as the difference between model parameters trained on the source domain with and without explanation supervision. Based on the task arithmetic framework, we impart explainability to a model trained only on the prediction task in the target domain by applying the explainability vector. Experimental results on various image classification datasets demonstrate that, except for transfers between some less-related domains, visual explainability can be successfully transferred from source to target domains, improving explanation quality in the target domain without sacrificing classification accuracy. Furthermore, we show that the explainability vector learned on a large and diverse dataset like ImageNet, extended with explanation supervision, exhibits universality and robustness, improving explanation quality on nine out of ten different target datasets. We also find that the explanation quality achieved with a single model inference is comparable to that of Kernel SHAP, which requires 150 model inferences.
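The parameter arithmetic described in this abstract can be sketched in a few lines. The sketch assumes model parameters are stored as a flat mapping from names to floats; the scaling coefficient `coeff` is a common task-arithmetic convention and is an assumption here, not something the abstract states. All names and values are hypothetical.

```python
def explainability_vector(theta_with_expl, theta_without_expl):
    """Difference between source-domain parameters trained with and without explanation supervision."""
    return {k: theta_with_expl[k] - theta_without_expl[k] for k in theta_with_expl}


def apply_vector(theta_target, vector, coeff=1.0):
    """Add the (optionally scaled) vector to a target-domain model trained on prediction only."""
    return {k: theta_target[k] + coeff * vector.get(k, 0.0) for k in theta_target}


# Toy parameters: two scalar weights per model (hypothetical values).
source_with = {"w": 1.5, "b": 0.2}
source_without = {"w": 1.0, "b": 0.5}
target_pred_only = {"w": 2.0, "b": -0.1}

vec = explainability_vector(source_with, source_without)
target_explainable = apply_vector(target_pred_only, vec)
print(target_explainable)
```

In practice the mapping would hold tensors rather than floats, but the arithmetic is element-wise in the same way.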

[112] Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers cs.CV

Jung-Ho Hong, Ho-Joong Kim, Kyu-Sung Jeon, Seong-Whan Lee

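TL;DR: This paper proposes the comprehensive information bottleneck (CoIBA), which applies the information bottleneck in multiple targeted layers to reveal the globally relevant information in the decision process of vision Transformers (ViTs), addressing the shortcomings of existing single-layer analyses.

Details

Motivation: Existing information-bottleneck-based feature attribution methods analyze the contribution of input variables to a decision in a single layer only, ignoring evidence distributed across layers. A method that combines relevant information from multiple layers is needed to explain ViT decisions more fully.

Result: Experiments show that CoIBA significantly improves the faithfulness of feature attributions.

Insight: Sharing information across multiple targeted layers is key to revealing the ViT decision process; it avoids the over-compression seen in single-layer analysis and offers a new perspective for interpretability research.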

Abstract: The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. In this paper, we introduce a comprehensive information bottleneck (CoIBA), which discovers the relevant information in each targeted layer to explain the decision-making process. Our core idea is applying information bottleneck in multiple targeted layers to estimate the comprehensive information by sharing a parametric damping ratio across the layers. Leveraging this shared ratio complements the over-compressed information to discover the omitted clues of the decision by sharing the relevant information across the targeted layers. We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Therefore, CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA.
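A rough sketch of noise injection with a damping ratio shared across layers is given below. It assumes the commonly used bottleneck form z = λ·r + (1 − λ)·ε with Gaussian noise ε drawn from per-layer statistics; the abstract does not give the exact formula, so this form, the per-element layout, and all names are assumptions.

```python
import random


def inject_noise(features, damping, mean, std, rng):
    """Per-element bottleneck: z = lam * r + (1 - lam) * eps, with eps ~ N(mean, std^2)."""
    return [lam * r + (1.0 - lam) * rng.gauss(mean, std)
            for r, lam in zip(features, damping)]


def apply_shared_bottleneck(layer_features, damping, layer_stats, seed=0):
    """Apply the SAME damping ratios to every targeted layer (the shared-ratio idea)."""
    rng = random.Random(seed)
    return [inject_noise(feats, damping, mean, std, rng)
            for feats, (mean, std) in zip(layer_features, layer_stats)]


# Two hypothetical layers with three features each; damping near 1 keeps a feature,
# damping near 0 replaces it with noise.
layers = [[2.0, 5.0, 1.0], [-1.0, 3.0, 0.5]]
stats = [(0.0, 1.0), (0.0, 1.0)]
damping = [1.0, 0.0, 0.5]
print(apply_shared_bottleneck(layers, damping, stats))
```

In an attribution setting the damping ratios would be optimized rather than fixed, and the learned ratios would then be read as the attribution map.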

Wei Wang, Dou Quan, Chonghua Lv, Shuang Wang, Ning Huyan

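TL;DR: This paper proposes RegistrationMamba, a Mamba architecture based on state space models that integrates multi-expert feature learning to address cross-modal remote sensing image registration. With multi-directional cross-scanning and multi-level feature aggregation, it performs well in texture-limited scenes.

Details

Motivation: Cross-modal remote sensing image registration faces significant nonlinear radiometric variations and limited textures, and existing methods (CNNs and Transformers) fall short in global feature capture or computational complexity. The paper aims to provide an efficient and accurate registration framework.

Result: On cross-modal remote sensing images of varying resolutions, RegistrationMamba shows superior performance and robustness compared with existing methods.

Insight: The paper shows the potential of state-space-model methods for remote sensing image registration, and the dynamic fusion strategy of multi-expert feature learning may generalize to other tasks.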

Abstract: Cross-modal remote sensing image (CRSI) registration is critical for multi-modal image applications. However, CRSI mainly faces two challenges: significant nonlinear radiometric variations between cross-modal images and limited textures that hinder discriminative information extraction. Existing methods mainly adopt convolutional neural networks (CNNs) or Transformer architectures to extract discriminative features for registration. However, CNNs with the local receptive field fail to capture global contextual features, and Transformers have high computational complexity, which restricts their application to high-resolution CRSI. To solve these issues, this paper proposes RegistrationMamba, a novel Mamba architecture based on state space models (SSMs) integrating multi-expert feature learning for improving the accuracy of CRSI registration. Specifically, RegistrationMamba employs a multi-directional cross-scanning strategy to capture global contextual relationships with linear complexity. To enhance the performance of RegistrationMamba under texture-limited scenarios, we propose a multi-expert feature learning (MEFL) strategy to capture features from various augmented image variants through multiple feature experts. MEFL leverages a learnable soft router to dynamically fuse the features from multiple experts, thereby enriching feature representations and improving registration performance. Notably, MEFL can be seamlessly integrated into various frameworks, substantially boosting registration performance. Additionally, RegistrationMamba integrates a multi-level feature aggregation (MFA) module to extract fine-grained local information and enable effective interaction between global and local features. Extensive experiments on CRSI with varying image resolutions have demonstrated that RegistrationMamba has superior performance and robustness compared to state-of-the-art methods.
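The soft-router fusion mentioned in this abstract can be sketched as a softmax-weighted sum over expert features. This is a generic mixture-style sketch, assuming the router emits one logit per expert; the paper's actual router and expert design are not specified in the abstract, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(expert_features, router_logits):
    """Weighted sum of per-expert feature vectors using soft router weights."""
    weights = softmax(router_logits)
    dim = len(expert_features[0])
    fused = [sum(w * feat[i] for w, feat in zip(weights, expert_features))
             for i in range(dim)]
    return fused, weights


# Three hypothetical experts, each seeing a different augmented variant of the image.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = fuse_experts(features, router_logits=[0.0, 0.0, 0.0])
print(fused, weights)
```

With equal logits every expert contributes equally; a trained router would produce input-dependent logits so that the weighting changes per image.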

[114] Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion cs.CV

Tongyan Hua, Lutao Jiang, Ying-Cong Chen, Wufan Zhao

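TL;DR: Sat2City is a novel framework that combines sparse voxel grids with latent diffusion models to generate detailed 3D city structures from a single satellite image.

Details

Motivation: Existing methods rely mainly on neural rendering, which struggles to produce detailed 3D structures at large scale because of the structural ambiguity inherent in limited 2D observations.

Result: On a synthesized large-scale 3D city dataset, Sat2City generates 3D structures with higher fidelity than existing city generation models.

Insight: Combining sparse representations with diffusion models enables efficient generation of high-quality 3D city structures from 2D satellite images and addresses structural ambiguity.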

Abstract: Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) a cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization, and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitioning. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.

[115] A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields cs.CV

Aoxiang Fan, Corentin Dumery, Nicolas Talabot, Pascal Fua

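TL;DR: This paper proposes a view-consistent sampling method that uses low-level color features and high-level distilled features to build view-consistency distributions, implicitly regularizing NeRF training; combined with a depth-pushing loss, it improves NeRF performance on real-world data.

Details

Motivation: NeRF's performance on real-world data is limited by depth estimation models, which require expensive 3D supervision to train and generalize poorly. The paper therefore proposes a method that does not rely on fixed depth estimates.

Result: Experiments on multiple public datasets show that the method significantly outperforms existing techniques on novel view synthesis.

Insight: Combining view-consistency distributions with a depth-pushing loss gives an effective regularizer for NeRF training, particularly for unbounded outdoor scenes, and reduces the impact of depth estimation errors.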

Abstract: Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularizations have proven to be the most effective ones. However, depth estimation models not only require expensive 3D supervision in training, but also suffer from generalization issues. As a result, the depth estimations can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimations to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the projected 2D pixel-locations from per-ray sampled 3D points. By sampling from the view-consistency distributions, an implicit regularization is imposed on the training of NeRF. We also utilize a depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularizations for eliminating the failure modes. Extensive experiments conducted on various scenes from public datasets demonstrate that our proposed method can generate significantly better novel view synthesis results than state-of-the-art NeRF variants as well as different depth regularization methods.

[116] MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture cs.CV

Guandong Li, Mengxia Ye

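TL;DR: The paper proposes MVNet, a network architecture that combines the strengths of 3D-CNNs, Transformers, and Mamba for hyperspectral image classification, addressing high-dimensional data, limited training samples, and spectral redundancy.

Details

Motivation: Hyperspectral image classification faces high-dimensional data, limited training samples, and spectral redundancy, which tend to cause overfitting and poor generalization.

Result: On the IN, UP, and KSC datasets, MVNet outperforms mainstream methods in both classification accuracy and computational efficiency.

Insight: By combining the strengths of several techniques, MVNet significantly improves hyperspectral image classification performance while reducing computational complexity.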

Abstract: Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. This paper proposes a novel MVNet network architecture that integrates 3D-CNN’s local feature extraction, Transformer’s global modeling, and Mamba’s linear complexity sequence modeling capabilities, achieving efficient spatial-spectral feature extraction and fusion. MVNet features a redesigned dual-branch Mamba module, including a State Space Model (SSM) branch and a non-SSM branch employing 1D convolution with SiLU activation, enhancing modeling of both short-range and long-range dependencies while reducing computational latency in traditional Mamba. The optimized HSI-MambaVision Mixer module overcomes the unidirectional limitation of causal convolution, capturing bidirectional spatial-spectral dependencies in a single forward pass through decoupled attention that focuses on high-value features, alleviating parameter redundancy and the curse of dimensionality. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data.

[117] Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models cs.CV | cs.AI | cs.IR | I.2.10

Huy Hoan Le, Van Sy Thinh Nguyen, Thi Le Chi Dang, Vo Thanh Khang Nguyen, Truong Thanh Hung Nguyen

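TL;DR: This paper proposes a multi-agent deep research system built on multimodal large language models (MLLMs) for multimedia content verification; a six-stage pipeline and a diverse set of tools detect multimedia misinformation.

Details

Motivation: As multimedia misinformation proliferates, traditional verification methods struggle with complex cases. The authors aim to improve verification efficiency and accuracy by combining a multi-agent framework and MLLMs with specialized tools.

Result: On a sample from the challenge dataset, the system verified content authenticity, extracted precise geolocation and timing information, and traced sources across multiple platforms, making it applicable to real-world multimedia verification.

Insight: Combining a multi-agent framework with MLLMs can significantly improve the comprehensiveness and reliability of multimedia verification, especially for complex and varied misinformation.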

Abstract: This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.

[118] SFOOD: A Multimodal Benchmark for Comprehensive Food Attribute Analysis Beyond RGB with Spectral Insights cs.CV

Zhenbo Xu, Jinghan Yang, Gong Huang, Jiqing Feng, Liu Liu

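TL;DR: This paper presents SFOOD, the first large-scale spectral food benchmark suite, which fills a research gap in food attribute analysis and highlights the importance of spectral data for analyzing food properties.

Details

Motivation: Existing computer vision research focuses mainly on food categories and lacks comprehensive study of other food attributes (such as sweetness and weight), which RGB cameras struggle to perceive accurately.

Result: Evaluations find that (i) large-scale models are still poor at digitizing food, and (ii) spectral data are crucial for analyzing food properties such as sweetness.

Insight: Food has become one of the most challenging objects of study in computer vision, and spectral data open a new dimension for food attribute analysis.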

Abstract: With the rise and development of computer vision and LLMs, intelligence is everywhere, especially for people and cars. However, for the enormous range of food attributes (such as origin, quantity, weight, quality, and sweetness), existing research still focuses mainly on the study of categories. The reason is the lack of a large and comprehensive benchmark for food. Besides, many food attributes (such as sweetness, weight, and fine-grained categories) are challenging to perceive accurately through RGB cameras alone. To fill this gap and promote the development of intelligent food analysis, in this paper, we build the first large-scale spectral food (SFOOD) benchmark suite. We invested substantial manpower and equipment costs to organize existing food datasets and collect hyperspectral images of hundreds of foods, and we used instruments to experimentally determine food attributes such as sweetness and weight. The resulting benchmark consists of 3,266 food categories and 2,351k data points for 17 main food categories. Extensive evaluations find that: (i) Large-scale models are still poor at digitizing food. Compared to people and cars, food has gradually become one of the most difficult objects to study; (ii) Spectrum data are crucial for analyzing food properties (such as sweetness). Our benchmark will be open source and continuously iterated for different food analysis tasks.

[119] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge cs.CV | cs.RO

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu

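TL;DR: DreamVLA is a new vision-language-action (VLA) model that integrates comprehensive world knowledge forecasting (dynamic, spatial, and semantic information) to improve generalization and reasoning in robot manipulation. Using dynamic-region-guided knowledge prediction and a block-wise structured attention mechanism, the model achieves efficient action planning and conditional distribution modeling.

Details

Motivation: Existing VLA models that combine image generation with action prediction suffer from redundant information and lack comprehensive world knowledge. DreamVLA aims to address this by integrating dynamic, spatial, and semantic information, thereby improving robot manipulation performance.

Result: DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.

Insight: By mimicking human multimodal reasoning chains, DreamVLA shows the importance of comprehensive world knowledge for robot manipulation and underlines the need for information disentanglement in complex tasks.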

Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
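The block-wise structured attention described in this abstract can be illustrated by building a boolean attention mask. The sketch assumes each token carries a group label and that only attention between different knowledge groups (dynamic, spatial, semantic) is blocked, while all other pairs remain visible; the exact masking layout in DreamVLA is not given in the abstract, so the group names and rule are assumptions.

```python
KNOWLEDGE_GROUPS = {"dynamic", "spatial", "semantic"}  # hypothetical group labels


def build_block_mask(groups):
    """Return allowed[i][j]: True if query token i may attend to key token j.

    Attention between two different knowledge groups is masked out;
    every other pair (same group, or involving context tokens) is allowed.
    """
    n = len(groups)
    allowed = [[True] * n for _ in range(n)]
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            if gi in KNOWLEDGE_GROUPS and gj in KNOWLEDGE_GROUPS and gi != gj:
                allowed[i][j] = False
    return allowed


# One shared context token followed by one query token per knowledge type.
token_groups = ["context", "dynamic", "spatial", "semantic"]
mask = build_block_mask(token_groups)
for row in mask:
    print(["x" if ok else "." for ok in row])
```

A mask like this would typically be applied by setting disallowed attention logits to negative infinity before the softmax.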

[120] CoT-lized Diffusion: Let’s Reinforce T2I Generation Step-by-step cs.CV

Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang

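TL;DR: The paper proposes CoT-Diff, a framework that tightly couples multimodal large language model (MLLM)-driven 3D layout planning with the diffusion process to enable text-to-image generation with step-by-step reasoning. The method significantly improves spatial alignment and compositional fidelity in complex scenes.

Details

Motivation: Existing text-to-image models struggle to align spatial layout precisely with the input text in complex scenes, and even layout-based methods underperform because generation is decoupled from layout planning.

Result: Experiments show that CoT-Diff significantly outperforms existing methods on 3D Scene benchmarks, exceeding the state-of-the-art method by 34.7% in complex scene spatial accuracy.

Insight: The paper shows the potential of end-to-end coupling between layout planning and the generation process for improving text-to-image quality, offering a new direction for complex scene generation.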

Abstract: Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.

[121] BiVM: Accurate Binarized Neural Network for Efficient Video Matting cs.CV

Haotong Qin, Xianglong Liu, Xudong Ma, Lei Ke, Yulun Zhang

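TL;DR: BiVM is an efficient binarized neural network for video matting. It improves accuracy with elastic shortcuts and evolvable topologies, reduces computation by sparsifying decoder features, and substantially lowers computation and storage costs.

Details

Motivation: Real-time video matting is hard to deploy widely on edge devices because of computational limits, and existing binarization methods are limited in both accuracy and efficiency. BiVM addresses these issues by improving the encoder and decoder.

Result: BiVM outperforms other binarized video matting networks, reduces computation and storage costs by 14.3x and 21.6x respectively, and is also evaluated on ARM CPU hardware.

Insight: Combining information bottleneck theory with sparsification can markedly improve accuracy and efficiency in binarized networks, offering a new direction for real-time video processing on edge devices.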

Abstract: Deep neural networks for real-time video matting suffer significant computational limitations on edge devices, hindering their adoption in widespread applications such as online conferences and short-form video production. Binarization emerges as one of the most common compression approaches with compact 1-bit parameters and efficient bitwise operations. However, accuracy and efficiency limitations exist in the binarized video matting network due to its degenerated encoder and redundant decoder. Following a theoretical analysis based on the information bottleneck principle, the limitations are mainly caused by the degradation of prediction-relevant information in the intermediate features and the redundant computation in prediction-irrelevant areas. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. First, we present a series of binarized computation structures with elastic shortcuts and evolvable topologies, enabling the constructed encoder backbone to extract high-quality representation from input videos for accurate prediction. Second, we sparsify the intermediate features of the binarized decoder by masking homogeneous parts, allowing the decoder to focus on representation with diverse details while alleviating the computation burden for efficient inference. Furthermore, we construct a localized binarization-aware mimicking framework with the information-guided strategy, prompting matting-related representation in full-precision counterparts to be accurately and fully utilized. Comprehensive experiments show that the proposed BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin. Moreover, our BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively. We also evaluate BiVM on ARM CPU hardware.
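The "1-bit parameters and efficient bitwise operations" mentioned in this abstract can be illustrated with a generic sign binarization and an XNOR-popcount dot product. This is a textbook-style sketch of binarized arithmetic, not BiVM's architecture; the function names and sample vectors are hypothetical.

```python
def binarize(values):
    """Pack signs into an int bitmask (bit i = 1 means +1, 0 means -1); alpha = mean(|v|)."""
    alpha = sum(abs(v) for v in values) / len(values)
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits, alpha


def xnor_popcount_dot(bits_a, bits_b, n):
    """Dot product of two length-n vectors with entries in {-1, +1}, stored as bitmasks."""
    agree = bin(~(bits_a ^ bits_b) & ((1 << n) - 1)).count("1")
    return 2 * agree - n  # each agreement contributes +1, each disagreement -1


weights = [0.5, -1.0, 0.25, -0.25]
activations = [1.0, 2.0, -3.0, -4.0]
w_bits, w_alpha = binarize(weights)
a_bits, a_alpha = binarize(activations)

# Scaled binary approximation of the real-valued dot product.
approx = w_alpha * a_alpha * xnor_popcount_dot(w_bits, a_bits, len(weights))
print(w_bits, a_bits, approx)
```

The bitwise form replaces multiply-accumulate with XOR and a population count, which is where the computation savings of binarized networks come from; the approximation error it introduces is what accuracy-oriented designs try to reduce.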

[122] Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions cs.CV

Konstantinos Foteinos, Jorgen Cani, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou

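TL;DR: This paper presents a comprehensive survey of vision-based hand gesture recognition (VHGR), covering methods, datasets, challenges, and future research directions, and gives researchers systematic guidance.

Details

Motivation: The rapid development of deep learning models and available datasets has caused a surge of research interest in VHGR, but a well-structured, complete survey that helps researchers choose suitable tools and methods efficiently is still missing.

Result: The paper summarizes the main methods, datasets, and evaluation metrics in VHGR and identifies the key challenges in the field.

Insight: Future research is likely to focus on both general computer vision issues and VHGR-specific obstacles, in particular improving model generalization and real-time performance.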

Abstract: The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the ever-important field of vision-based hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction using cameras. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the right combination of data, model, and approach for each task. The current survey aims to fill this gap by presenting a comprehensive overview of this aspect of computer vision. With a systematic research methodology that identifies the state-of-the-art works and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to choose the right strategy for delving into a certain VHGR task. Starting with the methodology used for study selection, literature retrieval, and the analytical framing, the survey identifies and organizes key VHGR approaches using a taxonomy-based format in various dimensions such as input modality and application domain. The core of the survey provides an in-depth analysis of state-of-the-art techniques across three primary VHGR tasks: static gesture recognition, isolated dynamic gesture recognition, and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. Additionally, the study reviews commonly used datasets - emphasizing annotation schemes - and evaluates standard performance metrics. It concludes by identifying major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.

[123] A Training-Free Style-Personalization via Scale-wise Autoregressive Model cs.CV

Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang

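TL;DR: This paper proposes a training-free framework for style-personalized image generation that uses a scale-wise autoregressive model to control content and style information at inference time. Its core design has three paths (content, style, and generation), each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics.

Details

Motivation: Conventional style personalization methods usually require training, which limits flexibility and efficiency. This paper aims to achieve efficient, flexible style control with a training-free framework.

Result: Experiments show that the method is competitive with fine-tuned baselines in style fidelity and prompt fidelity, with faster inference and more flexible deployment.

Insight: Early-to-middle generation steps are pivotal in shaping both style and content, and query features mainly encode content information; targeted mechanisms built on these findings improve generation quality.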

Abstract: We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design–content, style, and generation–each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central contribution of this work is a step-wise and attention-wise intervention analysis. Through systematic prompt and feature injection, we find that early-to-middle generation steps play a pivotal role in shaping both content and style, and that query features predominantly encode content-specific information. Guided by these insights, we introduce two targeted mechanisms: Key Stage Attention Sharing, which aligns content and style during the semantically critical steps, and Adaptive Query Sharing, which reinforces content semantics in later steps through similarity-aware query blending. Extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.

[124] MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization cs.CV | cs.AIPDF

Zhendong Xiao, Wu Wei, Shujie Ji, Shan Yang, Changhao Chen

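TL;DR: The paper proposes MVL-Loc, a new framework that uses pretrained vision-language models (VLMs) and multimodal data to perform 6-DoF camera relocalization across multiple scenes, addressing the limited generalization and robustness of traditional methods in diverse environments.

Details

Motivation: Traditional deep-learning-based camera relocalization methods usually target a single scene and generalize poorly to diverse environments, which limits their practicality in modern applications such as AR, MR, and autonomous driving.

Result: Experiments on the 7Scenes and Cambridge Landmarks datasets show that MVL-Loc achieves state-of-the-art performance in multi-scene camera relocalization, with improved accuracy in both position and orientation estimates.

Insight: 1. Pretrained knowledge in VLMs can substantially improve generalization in camera relocalization; 2. Natural language, used as a guiding tool, helps model the semantics and spatial relationships of complex scenes; 3. Fusing multimodal data is critical to improving performance.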

Abstract: Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera’s position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc’s robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
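
The positional and orientational accuracy mentioned above is commonly scored as a translation error plus a rotation angle. The following is a minimal sketch of that scoring, not the paper's evaluation code: the function name `pose_error` and the (w, x, y, z) quaternion layout are assumptions made for illustration.

```python
import numpy as np

def pose_error(t_pred, q_pred, t_true, q_true):
    """Return (translation error, rotation error in degrees) for one 6-DoF pose.

    Hypothetical helper: quaternions are assumed to be in (w, x, y, z) order.
    """
    t_err = float(np.linalg.norm(np.asarray(t_pred, float) - np.asarray(t_true, float)))
    q_pred = np.asarray(q_pred, float) / np.linalg.norm(q_pred)
    q_true = np.asarray(q_true, float) / np.linalg.norm(q_true)
    # |dot| treats q and -q as the same rotation; clip guards against rounding.
    d = np.clip(abs(float(np.dot(q_pred, q_true))), 0.0, 1.0)
    r_err = float(np.degrees(2.0 * np.arccos(d)))
    return t_err, r_err
```

Taking the absolute value of the dot product matters because a quaternion and its negation encode the same orientation; without it, identical rotations could be reported as 360 degrees apart.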

[125] FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection cs.CVPDF

Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Wei-Shi Zheng

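TL;DR: The paper proposes FA (Forced prompt leArning), a CLIP-based framework that improves OOD detection by forcing the prompt to learn diversified descriptions of the ID classes, without external auxiliary data.

Details

Motivation: Existing CLIP-based methods for OOD detection usually rely on OOD-related knowledge or external datasets and generalize poorly. The authors aim to exploit ID knowledge to improve OOD detection.

Result: Without any external data, FA significantly outperforms existing methods in OOD detection while keeping the same number of trainable parameters as CoOp.

Insight: Diversified descriptions drawn from ID knowledge can significantly improve OOD detection performance without relying on external data or intricate OOD-related knowledge.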

Abstract: Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/0xFAFA/FA.

[126] Grounded Gesture Generation: Language, Motion, and Space cs.CV | cs.AI | cs.RO | 68T07, 68T42 | I.2.7; I.2.6; H.5.2PDF

Anna Deichler, Jim O’Regan, Teo Guichoux, David Johansson, Jonas Beskow

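TL;DR: The paper presents a multimodal dataset and framework for grounded gesture generation that combines language, motion, and spatial information, bridging the gap between environmental grounding and motion generation in existing work.

Details

Motivation: Existing human motion generation models usually focus either on descriptive motion (such as walking and object interaction) or on isolated co-speech gesture synthesis, and neglect to combine motion with environmental grounding, which limits progress toward embodied communicative agents.

Result: The dataset contains over 7.7 hours of synchronized motion, speech, and 3D scene information, providing a standardized resource for grounded gesture generation.

Insight: By combining gesture modeling with spatial grounding, the framework lays a new foundation for research on embodied communicative agents.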

Abstract: Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multimodal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction. Project page: https://groundedgestures.github.io/

[127] A Data-Driven Novelty Score for Diverse In-Vehicle Data Recording cs.CVPDF

Philipp Reis, Joshua Ransiek, David Petri, Jacob Langner, Eric Sax

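TL;DR: The paper proposes a real-time data selection method that uses a dynamic scoring system based on the Mean Shift algorithm to pick novel image frames from in-vehicle data, reducing data redundancy and improving model performance.

Details

Motivation: Autonomous driving datasets are usually biased toward common scenes and objects, while rare events (novel cases) are underrepresented. This imbalance weakens model generalization and compromises safety.

Result: Experiments show that reducing redundancy in the training dataset improves model performance, and that more aggressive filtering becomes more effective as data redundancy increases.

Insight: Novelty-driven data selection outperforms random sampling, avoiding overfitting and improving generalization; a dynamically updated model of normal content is key to efficient novelty detection.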

Abstract: High-quality datasets are essential for training robust perception systems in autonomous driving. However, real-world data collection is often biased toward common scenes and objects, leaving novel cases underrepresented. This imbalance hinders model generalization and compromises safety. The core issue is the curse of rarity. Over time, novel events occur infrequently, and standard logging methods fail to capture them effectively. As a result, large volumes of redundant data are stored, while critical novel cases are diluted, leading to biased datasets. This work presents a real-time data selection method focused on object-level novelty detection to build more balanced and diverse datasets. The method assigns a data-driven novelty score to image frames using a novel dynamic Mean Shift algorithm. It models normal content based on mean and covariance statistics to identify frames with novel objects, discarding those with redundant elements. The main findings show that reducing the training dataset size with this method can improve model performance, whereas higher redundancy tends to degrade it. Moreover, as data redundancy increases, more aggressive filtering becomes both possible and beneficial. While random sampling can offer some gains, it often leads to overfitting and unpredictability in outcomes. The proposed method supports real-time deployment at 32 frames per second, with a runtime that stays constant over time. By continuously updating the definition of normal content, it enables efficient detection of novelties in a continuous data stream.
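
To illustrate the idea of scoring frames against mean and covariance statistics of normal content, here is a minimal sketch that uses a Mahalanobis distance against a running Gaussian model. It is a simplified stand-in, not the paper's dynamic Mean Shift algorithm; the class `RunningNovelty`, the `momentum` and `eps` parameters, and the assumption that each frame is already reduced to a feature vector are all hypothetical.

```python
import numpy as np

class RunningNovelty:
    """Score feature vectors against a running Gaussian model of 'normal' content."""

    def __init__(self, dim, momentum=0.05, eps=1e-3):
        self.mean = np.zeros(dim)
        self.cov = np.eye(dim)
        self.momentum = momentum  # hypothetical update rate for the running statistics
        self.eps = eps            # ridge term that keeps the covariance invertible

    def score(self, x):
        """Mahalanobis distance of x from the current model of normal content."""
        diff = np.asarray(x, float) - self.mean
        cov = self.cov + self.eps * np.eye(len(diff))
        return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

    def update(self, x):
        """Exponentially weighted update of mean and covariance."""
        diff = np.asarray(x, float) - self.mean
        self.mean = self.mean + self.momentum * diff
        self.cov = (1 - self.momentum) * self.cov + self.momentum * np.outer(diff, diff)


def select_frames(features, threshold):
    """Return indices of frames whose novelty score exceeds the threshold."""
    model = RunningNovelty(dim=features.shape[1])
    kept = []
    for i, x in enumerate(features):
        if model.score(x) > threshold:
            kept.append(i)
        model.update(x)  # every frame updates the model, kept or not
    return kept
```

Each frame costs a fixed amount of work regardless of how many frames came before, which is the property a streaming selector needs for constant runtime.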

[128] MambaVideo for Discrete Video Tokenization with Channel-Split Quantization cs.CVPDF

Dawit Mureja Argaw, Xian Liu, Joon Son Chung, Ming-Yu Liu, Fitsum Reda

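TL;DR: The paper proposes MambaVideo, a new discrete video tokenization method that combines a Mamba-based encoder-decoder architecture with a channel-split quantization scheme to substantially improve performance on video generation tasks.

Details

Motivation: Because video data is high-dimensional, efficient discrete video tokenization is essential for autoregressive generative models. Existing sequence-based tokenizers have limitations, calling for a more efficient and powerful solution.

Result: It outperforms causal 3D convolution-based and Transformer-based methods across multiple datasets, setting a new state of the art.

Insight: Combining the Mamba architecture with channel-split quantization offers a new approach to discrete video tokenization and significantly improves the efficiency and performance of generation tasks.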

Abstract: Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.

[129] VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents cs.CV | cs.CLPDF

Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang

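TL;DR: VLM2Vec-V2 proposes a unified multimodal embedding framework that goes beyond existing models' restriction to natural images by adding support for videos and visual documents, and validates its performance with a new benchmark and experiments.

Details

Motivation: Existing multimodal embedding models mainly target natural images and lack support for videos and visual documents, which limits their use in real-world scenarios such as AI agents and multimodal search. This paper aims to close that gap.

Result: VLM2Vec-V2 performs strongly on the new video and document retrieval tasks and also surpasses prior baselines on the original image benchmarks.

Insight: The study sheds light on the generalizability of multimodal embedding models and identifies effective strategies for unified embedding learning, advancing scalable representation learning.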

Abstract: Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

[130] QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation cs.CVPDF

Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen

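TL;DR: QR-LoRA proposes a structured parameter fine-tuning framework based on QR decomposition that separates visual attributes through the orthogonal matrix Q and the upper triangular matrix R, reducing trainable parameters and avoiding entanglement between content and style.

Details

Motivation: When existing text-to-image models are fine-tuned with techniques such as LoRA, unstructured modifications of weight matrices often entangle content and style attributes. QR-LoRA aims to solve this through structured parameter updates.

Result: Experiments show that QR-LoRA achieves better disentanglement in content-style fusion tasks with higher parameter efficiency.

Insight: The structured nature of QR decomposition offers a new approach to efficient fine-tuning of generative models; the orthogonal and upper triangular matrices effectively support the separation and encoding of attributes.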

Abstract: Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.
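
The update structure described above (frozen Q and R, trainable $\Delta R$) can be sketched in a few lines of NumPy. This is an illustrative sketch only: applying QR directly to a full weight matrix, constraining $\Delta R$ to be upper triangular, and the helper name `qr_lora_weight` are assumptions, not details confirmed by the abstract.

```python
import numpy as np

def qr_lora_weight(W, delta_R):
    """Rebuild an adapted weight from frozen Q, R and a trainable delta on R.

    Hypothetical helper: returns (adapted weight, Q, R).
    """
    Q, R = np.linalg.qr(W)       # frozen factors: Q has orthonormal columns, R is upper triangular
    delta_R = np.triu(delta_R)   # assumption: the update keeps R's upper-triangular structure
    return Q @ (R + delta_R), Q, R


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
content_delta = np.triu(rng.standard_normal((4, 4))) * 0.01  # stand-in for a trained content update
style_delta = np.triu(rng.standard_normal((4, 4))) * 0.01    # stand-in for a trained style update

# Merging two adaptations is a plain sum of their deltas on top of the shared frozen factors.
W_merged, Q, R = qr_lora_weight(W, content_delta + style_delta)
```

Because both adaptations share the same frozen Q, merging reduces to adding their deltas, which is the property the abstract relies on when combining content and style.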

[131] HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction cs.CV | cs.AIPDF

Jiaqi Cui, Lu Wen, Yuchen Fei, Bo Liu, Luping Zhou

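TL;DR: HiLa proposes a hierarchical vision-language collaboration framework that improves cancer survival prediction through optimal prompt learning and hierarchical interaction.

Details

Motivation: Existing cancer survival prediction methods rely on sparse slide-level labels, and existing vision-language models fail to fully exploit hierarchical multimodal information.

Result: It achieves SOTA performance on three TCGA datasets.

Insight: Hierarchical vision-language collaboration models the hierarchical structure of WSIs more effectively and improves prediction performance.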

Abstract: Survival prediction using whole-slide images (WSIs) is crucial in cancer research. Despite notable success, existing approaches are limited by their reliance on sparse slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. Recently, vision language (VL) models, which incorporate additional language supervision, have emerged as a promising solution. However, VL-based survival prediction remains largely unexplored due to two key challenges. First, current methods often rely on only one simple language prompt and basic cosine similarity, which fails to learn fine-grained associations between multi-faceted linguistic information and visual features within WSI, resulting in inadequate vision-language alignment. Second, these methods primarily exploit patch-level information, overlooking the intrinsic hierarchy of WSIs and their interactions, causing ineffective modeling of hierarchical interactions. To tackle these problems, we propose a novel Hierarchical vision-Language collaboration (HiLa) framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. At each level, a series of language prompts describing various survival-related attributes are constructed and aligned with visual features via Optimal Prompt Learning (OPL). This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts, thereby improving vision-language alignment. Furthermore, we introduce two modules, i.e., Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL) to maximize hierarchical cooperation by promoting interactions and consistency between patch and region levels. Experiments on three TCGA datasets demonstrate our SOTA performance.

[132] Learn 3D VQA Better with Active Selection and Reannotation cs.CVPDF

Shengli Zhou, Yang Liu, Feng Zheng

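TL;DR: This paper proposes a multi-turn interactive active learning strategy to improve model training and annotation quality in 3D visual question answering (3D VQA). By selecting data according to semantic uncertainty and actively requesting reannotation, it markedly reduces the impact of misleading labels and lowers training costs.

Details

Motivation: In 3D VQA, free-form answers often lead to inaccurate annotations, and the small scale of the data amplifies the negative effect of misleading annotations. Existing active learning methods cannot resolve this problem.

Result: Experiments show that the method significantly improves model performance and halves training costs.

Insight: Selection based on semantic uncertainty, combined with reannotation, effectively addresses annotation quality problems in 3D VQA while improving efficiency.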

Abstract: 3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models’ semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at https://github.com/fz-zsl/AQuA.

[133] Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts cs.CV | cs.AI | cs.ROPDF

Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang

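TL;DR: Proposes SMoEStereo, a new framework that combines Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules to improve the robustness and cross-domain performance of stereo matching.

Details

Motivation: Current stereo matching networks perform poorly across domains, mainly because of domain shifts and imbalanced disparity distributions across datasets. Vision Foundation Models (VFMs) can improve robustness, but integrating them efficiently remains a challenge.

Result: It shows excellent cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation.

Insight: Dynamically selecting expert modules and using a lightweight decision network can significantly improve the robustness and efficiency of stereo matching, especially in cross-domain settings.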

Abstract: Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model’s robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt to varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.

[134] LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction cs.CV | cs.AIPDF

Yixin Yan, Yang Li, Yuanfan Wang, Xiaozhou Zhou, Beihao Xia

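TL;DR: LTMSformer is a lightweight multi-agent trajectory prediction framework that captures temporal-spatial dependencies with a local trend-aware attention mechanism and a high-order motion state encoder, outperforming baseline methods.

Details

Motivation: Modeling the complex temporal-spatial dependencies in multi-agent trajectory prediction is challenging, and existing methods often overlook local temporal dependency and high-order motion state attributes.

Result: On the Argoverse 1 dataset, the method reduces minADE by about 4.35%, minFDE by 8.74%, and MR by 20% relative to HiVT-64, and it is more accurate than HiVT-128 with a 68% smaller model.

Insight: Local temporal dependency and high-order motion state information are critical for trajectory prediction, and a lightweight design can balance high performance with high efficiency.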

Abstract: It has been challenging to model the complex temporal-spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, yet most studies overlook it. Besides, learning the high-order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal-spatial interaction features for multi-modal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high-order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi-Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experimental results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT-64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT-128 with a 68% reduction in model size.
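
The high-order motion state attributes named above (acceleration, jerk, heading) can be derived from a raw position track by finite differences. The sketch below shows one plausible way to compute them; it is not the paper's Motion State Encoder, and the function name `motion_state` and the default time step of 0.1 s are assumptions for illustration.

```python
import numpy as np

def motion_state(positions, dt=0.1):
    """Finite-difference velocity, acceleration, jerk and heading for one agent track.

    positions: array of shape (T, 2) holding x, y coordinates per time step.
    dt: sampling interval in seconds (assumed value, adjust to the dataset).
    """
    positions = np.asarray(positions, float)
    vel = np.diff(positions, axis=0) / dt        # shape (T-1, 2)
    acc = np.diff(vel, axis=0) / dt              # shape (T-2, 2)
    jerk = np.diff(acc, axis=0) / dt             # shape (T-3, 2)
    heading = np.arctan2(vel[:, 1], vel[:, 0])   # shape (T-1,), radians
    return vel, acc, jerk, heading
```

Each differencing step shortens the sequence by one sample, so a real encoder would need to pad or truncate these arrays to a common length before stacking them as input features.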

[135] MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding cs.CVPDF

Zhicheng Zhang, Wuyou Xia, Chenxi Zhao, Zhou Yan, Xiaoqiang Liu

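TL;DR: The paper proposes MODA, a novel attention mechanism that addresses attention allocation in multimodal learning, using duplex modality spaces and adaptive masked attention to improve multimodal perception, cognition, and emotion understanding.

Details

Motivation: Existing multimodal large language models (MLLMs) mainly focus on language-centric tuning. In high-level tasks that require fine-grained cognition and emotion understanding, attention is poorly allocated across mixed multimodal tokens, leading to inconsistent cross-modal attention and layer-by-layer attention decay.

Result: Experiments on 21 benchmark datasets verify MODA's effectiveness in perception, cognition, and emotion tasks.

Insight: By decoupling modality alignment from token mixing, MODA achieves greater flexibility and better performance on multimodal tasks.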

Abstract: Multimodal large language models (MLLMs) have recently shown strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning and pay less attention to how multimodal tokens are mixed through attention, which poses challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling the interaction between visual and language modality. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model’s flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available at https://zzcheng.top/MODA.

Xixi Wan, Aihua Zheng, Bo Jiang, Beibei Wang, Chenglong Li

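TL;DR: UGG-ReID is a robust multi-modal object re-identification method that estimates local and sample-level aleatoric uncertainty and models their dependencies, reducing noise interference and promoting effective multi-modal fusion.

Details

Motivation: Existing methods focus on improving identification performance but overlook the uncertainty introduced by intra-modal noise and inter-modal conflicts, especially under fine-grained local occlusion and frame loss, which poses a challenge for multi-modal learning.

Result: It performs well on five multi-modal object ReID datasets and is significantly better than existing methods in noise immunity.

Insight: Uncertainty modeling is essential in multi-modal learning and can effectively improve noise immunity and robustness.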

Abstract: Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. Existing methods primarily aim to improve identification performance, but often overlook the uncertainty arising from inherent defects, such as intra-modal noise and inter-modal conflicts. This uncertainty is particularly significant in the case of fine-grained local occlusion and frame loss, which becomes a challenge in multi-modal learning. To address the above challenge, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code will be made public upon acceptance.

[137] VectorLLM: Human-like Extraction of Structured Building Contours via Multimodal LLMs cs.CVPDF

Tao Zhang, Shiqing Wei, Shihao Chen, Wenling Yu, Muying Luo

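TL;DR: VectorLLM is a multimodal large language model that directly regresses the corner points of building contours, significantly improving the extraction of structured building contours from remote sensing images and showing excellent zero-shot generalization.

Details

Motivation: Traditional methods rely on complex multi-stage pipelines (pixel segmentation, vectorization, polygon refinement), which limits their scalability and practicality. VectorLLM uses the reasoning ability of large language models to directly mimic the human annotation process, simplifying the pipeline and improving performance.

Result: It improves AP by 5.6, 7.1, and 13.6 on the WHU, WHU-Mix, and CrowdAI datasets, respectively, and shows zero-shot performance on unseen objects such as aircraft, water bodies, and oil tanks.

Insight: The topological reasoning ability of LLMs can significantly benefit vector extraction from remote sensing images and generalizes well, making it suitable for unified modeling of contour extraction for diverse objects.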

Abstract: Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point by corner-point regression of building contours directly, mimicking human annotators’ labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to improve spatial understanding capability. Through comprehensive exploration of training strategies including pretraining, supervised fine-tuning, and preference optimization across the WHU, WHU-Mix, and CrowdAI datasets, VectorLLM significantly outperforms the previous SOTA methods by 5.6 AP, 7.1 AP, and 13.6 AP, respectively, on the three datasets. Remarkably, VectorLLM exhibits strong zero-shot performance on unseen objects including aircraft, water bodies, and oil tanks, highlighting its potential for unified modeling of diverse remote sensing object contour extraction tasks. Overall, this work establishes a new paradigm for vector extraction in remote sensing, leveraging the topological reasoning capabilities of LLMs to achieve both high accuracy and exceptional generalization. All code and weights will be released to promote community development.

[138] What’s Making That Sound Right Now? Video-centric Audio-Visual Localization cs.CV | cs.AI | cs.MM | cs.SD | eess.ASPDF

Hahyeon Choi, Junhoo Lee, Nojun Kwak

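TL;DR: The paper proposes the AVATAR benchmark and the TAVLO model, which focus on audio-visual localization in video and emphasize the importance of temporal dynamics.

Details

Motivation: Existing studies focus on image-level audio-visual associations, ignore temporal dynamics, and assume overly simplified scenarios. These limitations motivate a more comprehensive video-centric localization approach.

Result: Experiments show that conventional methods struggle to track temporal variations, whereas TAVLO achieves robust and precise audio-visual alignment in complex scenarios.

Insight: Temporal dynamics are critical in audio-visual localization, and the video-centric approach sets a new standard for future research.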

Abstract: Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios – Single-sound, Mixed-sound, Multi-entity, and Off-screen – enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.

[139] ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing cs.CVPDF

Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Zhuo Zheng

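TL;DR: ChangeBridge is a spatiotemporal diffusion model with multimodal controls for remote sensing image generation; by modeling the spatiotemporal evolution from pre-event to post-event states, it generates high-fidelity images of future scenarios.

Details

Motivation: Existing generative methods have not explored simulating future scenarios from given scenario images, a capability with wide applications in areas such as urban planning and land management.

Result: Experiments show that ChangeBridge generates high-fidelity future scenario images consistent with the event and the event-driven background changes.

Insight: With multimodal control conditions, spatiotemporal generative models can simulate complex scene changes more flexibly, offering a new approach to remote sensing image generation.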

Abstract: Recent advancements in generative methods, especially diffusion models, have made great progress in remote sensing image synthesis. Despite these advancements, existing methods have not explored the simulation of future scenarios based on given scenario images. This simulation capability has wide applications for urban planning, land management, and beyond. In this work, we propose ChangeBridge, a conditional spatiotemporal diffusion model. Given pre-event images and conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), ChangeBridge can synthesize post-event images. The core idea behind ChangeBridge is to model the noise-to-image diffusion model as a pre-to-post diffusion bridge. Conditioned on multimodal controls, ChangeBridge leverages a stochastic Brownian-bridge diffusion, directly modeling the spatiotemporal evolution between pre-event and post-event states. To the best of our knowledge, ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing. Experimental results demonstrate that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Code will be available.
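
For readers unfamiliar with Brownian bridges, the sketch below samples the textbook bridge between two fixed endpoints: the mean interpolates linearly and the variance vanishes at both ends. It is a generic illustration, not ChangeBridge's actual noise schedule or conditioning; the function name `brownian_bridge_sample` and the `sigma` parameter are assumptions, with `x0` standing in for a pre-event image and `x1` for a post-event image.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample the bridge state at time t in [0, 1], pinned to x0 at t=0 and x1 at t=1."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = np.asarray(x0, float)
    x1 = np.asarray(x1, float)
    mean = (1.0 - t) * x0 + t * x1            # linear interpolation between the endpoints
    std = sigma * np.sqrt(t * (1.0 - t))      # zero at both ends, largest at t = 0.5
    return mean + std * rng.standard_normal(x0.shape)
```

Unlike a standard noise-to-image diffusion, both endpoints here are data, so the process starts at one image and ends at another rather than ending in pure noise.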

[140] Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini Challenge cs.CVPDF

Alper Bahcekapili, Duygu Arslan, Umut Ozdemir, Berkay Ozkirli, Emre Akbas

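TL;DR: This paper focuses on tumor grade segmentation of colorectal cancer (CRC) in digital histopathology images. It promotes automated solutions by organizing an ICIP challenge on the publicly available METU CCTGS dataset, in which 6 of 39 teams outperformed the Swin Transformer baseline.

Details

Motivation: Colorectal cancer is a common and deadly cancer worldwide, yet its histopathological grading still depends on subjective judgment, suffers from observer variability, and is limited by a shortage of trained pathologists, so automated, standardized solutions are urgently needed.

Result: Of the 39 teams, 6 achieved an F-score above the Swin Transformer baseline (62.92 F-score).

Insight: Automated pathological grading methods could address the subjectivity and resource shortages in CRC diagnosis, and the challenge and its dataset provide an important benchmark for follow-up research.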

Abstract: Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using the publicly available METU CCTGS dataset. The dataset comprises 103 whole-slide images with expert pixel-level annotations for five tissue classes. Participants submitted segmentation masks via Codalab, evaluated using metrics such as macro F-score and mIoU. Among 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). This paper presents an overview of the challenge, dataset, and the top-performing methods.
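
The two metrics named above, macro F-score and mIoU, can both be read off a per-class confusion matrix. The sketch below shows the usual definitions; it is not the challenge's official scoring script, and the function name `macro_f1_and_miou` as well as the convention of scoring a class that never appears as 0 are assumptions.

```python
import numpy as np

def macro_f1_and_miou(y_true, y_pred, num_classes):
    """Compute macro-averaged F1 and mean IoU from integer label maps."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # rows: ground truth, columns: prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    # Assumed convention: a class absent from both maps contributes 0 to the average.
    f1 = np.divide(2 * tp, f1_den, out=np.zeros_like(tp), where=f1_den > 0)
    iou = np.divide(tp, iou_den, out=np.zeros_like(tp), where=iou_den > 0)
    return float(f1.mean()), float(iou.mean())
```

Both metrics average over classes rather than pixels, so a rare tissue class weighs as much as a common one, which is usually the reason they are preferred over plain pixel accuracy for imbalanced segmentation.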

[141] TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation cs.CVPDF

Changsong Lei, Yaqian Liang, Shaofeng Wang, Jiajia Dai, Yong-Jin Liu

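TL;DR: Proposes TeethGenerator, a two-stage framework for generating paired 3D teeth models (pre- and post-orthodontic), addressing the scarcity of clinical data for training tooth arrangement models.

Details

Motivation: Digital orthodontics is an important application of computer vision in medicine, but collecting paired 3D teeth model data is time-consuming and costly, which holds back the development of tooth arrangement neural networks.

Result: Experiments show that the synthetic data matches the distribution of real data and that combining it with real data for training significantly improves tooth arrangement performance.

Insight: Generating synthetic data fills the gap left by scarce clinical data and offers a new solution for digital orthodontics.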

Abstract: Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset are available at https://github.com/lcshhh/teeth_generator.

[142] A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets cs.CVPDF

Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan

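TL;DR: The paper proposes a block-based diffusion method that automatically generates counterfactual datasets to address the compositional reasoning weaknesses of vision-language models, significantly improving performance.

Details

Motivation: Vision-language models (VLMs) perform poorly on compositional reasoning tasks because high-quality image-text data is lacking. Diverse, high-fidelity data needs to be generated automatically to address this.

Result: VLMs fine-tuned on the counterfactual datasets show significantly better visual reasoning performance and achieve SOTA results on multiple benchmarks.

Insight: Combining automated data generation with a specialized loss design can relieve the compositional reasoning bottleneck in VLMs while reducing data requirements.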

Abstract: Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as “puzzle pieces” coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.

[143] Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning cs.CV | cs.AIPDF

Feng Yue, Zhaoxing Zhang, Junming Jiao, Zhengyu Liang, Shiwen Cao

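TL;DR: Tempo-R0 is a Video-MLLM for temporal video grounding (TVG) based on multimodal temporal sensing reinforcement learning. By combining efficient temporal sensing with reinforcement learning, it significantly improves performance.

Details

Motivation: Videos carry large volumes of information with much redundancy, making it hard for conventional methods to locate relevant temporal segments accurately in TVG. A model is needed that understands the whole video and allocates attention efficiently.

Result: On the QVHighlights test set and its corrected version, it outperforms SOTA methods by around 3.5%.

Insight: Combining self-adaptive attention allocation with reinforcement learning can significantly improve a model's temporal perception and boundary localization in TVG.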

Abstract: Temporal Video Grounding (TVG), which requires pinpointing relevant temporal segments in a video based on a language query, has always been a highly challenging task in the field of video understanding. Videos often have a larger volume of information and redundancy than texts or images. Models should present a comprehensive understanding of the whole video to accurately retrieve query-relevant clips. We thus propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task via multimodal temporal sensing reinforcement. Specifically, during the preprocessing stage of our pipeline, we employ a Self-adaptive Attention Allocation (SAA) method based on frame content variation to efficiently use the MLLM’s limited attention. The Explicit Timestamp-modal Aligned (ETA) method is also utilized to strengthen our model’s capability to perceive the boundaries of events in the video. In the fine-tuning part of our pipeline, we apply Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) to TVG to foster the model’s temporal reasoning by not only accepting relevant video-query pairs but also refusing irrelevant ones. Experiments demonstrate that our method outperforms SOTA solutions by around 3.5% on both the original QVHighlights testbench and its corrected version with more reasonable ground truth annotations.

[144] Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations cs.CVPDF

Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang

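TL;DR: The paper proposes a simple spatial-temporal decoupled framework for identity consistency in text-to-video generation. Using semantic prompt optimization and a stage-wise decoupled generation paradigm, it balances spatial layout and temporal dynamics and significantly improves video quality.

Details

Motivation: Current end-to-end text-to-video frameworks face a spatial-temporal trade-off: optimizing spatial layout can sacrifice temporal consistency, while emphasizing dynamics can disrupt spatial structure. The authors propose spatial-temporal decoupling to balance the two.

Result: Experiments show strong performance in identity consistency, text relevance, and video quality; the method took second place in the 2025 ACM MultiMedia Challenge.

Insight: Spatial-temporal decoupling is a simple yet effective way to handle the multi-objective trade-off in text-to-video generation, particularly the balance between identity consistency and motion consistency.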

Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and stage-wise decoupled generation paradigm. The former module decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in 2025 ACM MultiMedia Challenge.

[145] Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model cs.CV | cs.AI | cs.LGPDF

Anbang Wang, Marawan Elbatel, Keyuan Liu, Lizhuo Lin, Meng Lan

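TL;DR: The paper proposes GeoSapiens, a framework that combines a human-centric foundation model with a geometry-guided loss function for dental landmark detection from few annotations, outperforming existing methods.

Details

Motivation: Dental landmark detection is essential for clinical diagnosis, but manual annotation is time-consuming and variable, and data scarcity and high annotation cost make conventional deep learning methods hard to apply.

Result: On an anterior teeth landmark dataset, GeoSapiens achieves a success detection rate 8.18% higher than the best existing method at a strict 0.5 mm threshold.

Insight: A geometry-guided loss function makes effective use of limited annotations and improves the model's grasp of relationships among anatomical structures, making it suitable for medical image analysis tasks.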

Abstract: Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model’s capacity to capture critical geometric relationships among anatomical structures. Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, exceeding the leading approach by 8.18% in success detection rate at a strict 0.5 mm threshold, a standard widely recognized in dental diagnostics. Code is available at: https://github.com/xmed-lab/GeoSapiens.
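
The success detection rate at a distance threshold can be sketched as follows. This assumes landmarks are given as coordinate tuples already expressed in millimetres and that a detection counts as successful when its Euclidean distance to the ground truth is at most the threshold; the function name and these conventions are illustrative, not taken from the paper.

```python
import math


def success_detection_rate(predicted, ground_truth, threshold_mm=0.5):
    """Fraction of landmarks whose prediction lies within threshold_mm
    (Euclidean distance) of the ground-truth position.

    Both inputs are equal-length sequences of coordinate tuples in mm.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        1
        for p, g in zip(predicted, ground_truth)
        if math.dist(p, g) <= threshold_mm
    )
    return hits / len(predicted)
```

A tighter threshold makes the metric stricter: the same predictions that score well at 2 mm can score poorly at 0.5 mm.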

[146] From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach cs.CV | cs.AI | cs.CLPDF

Mihai Masala, Marius Leordeanu

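TL;DR: The paper proposes a shared vision-language representation based on graphs of events in space and time for generating explainable long-form video descriptions, and uses this self-supervised neuro-analytical approach to train end-to-end student models.

Details

Motivation: Current video description datasets mostly contain short captions and lack long-form paragraph descriptions in natural language, and existing methods struggle to explain the relationship between vision and language.

Result: The method is validated on multiple datasets; the generated descriptions are coherent, rich, and relevant.

Insight: 1. Graphs of events in space and time offer a new way to handle the complex relationship between vision and language; 2. explainable methods improve the transparency and trustworthiness of generated descriptions; 3. self-supervised learning reduces dependence on annotated data.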

Abstract: The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using both standard evaluation metrics, human annotations and consensus from ensembles of state-of-the-art VLMs.

[147] An analysis of vision-language models for fabric retrieval cs.CVPDF

Francesco Giuliari, Asif Khan Pattan, Mohamed Lamine Mekhalfi, Fabio Poiesi

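TL;DR: The paper studies vision-language models (VLMs) for zero-shot text-to-image retrieval on fabric samples. An automated annotation pipeline generates two types of textual descriptions, and three models are evaluated. Structured descriptions significantly improve retrieval accuracy, but zero-shot retrieval remains challenging in fine-grained domains.

Details

Motivation: In specialized domains such as manufacturing, product information typically pairs visual samples with textual descriptions. Effective cross-modal retrieval is essential for information retrieval and recommendation systems, but public datasets are lacking and zero-shot retrieval underperforms on fine-grained tasks.

Result: Structured, attribute-rich descriptions significantly improve retrieval accuracy, especially for visually complex fabric classes. The Perception Encoder performs best thanks to its strong feature alignment, but zero-shot retrieval remains challenging in fine-grained domains.

Insight: In industrial applications, combining technical textual descriptions with advanced VLMs can improve cross-modal retrieval, but domain-adapted methods are still needed for fine-grained tasks.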

Abstract: Effective cross-modal retrieval is essential for applications like information retrieval and recommendation systems, particularly in specialized domains such as manufacturing, where product information often consists of visual samples paired with a textual description. This paper investigates the use of Vision Language Models (VLMs) for zero-shot text-to-image retrieval on fabric samples. We address the lack of publicly available datasets by introducing an automated annotation pipeline that uses Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. We produce these descriptions to evaluate retrieval performance across three Vision-Language Models: CLIP, LAION-CLIP, and Meta’s Perception Encoder. Our experiments demonstrate that structured, attribute-rich descriptions significantly enhance retrieval accuracy, particularly for visually complex fabric classes, with the Perception Encoder outperforming other models due to its robust feature alignment capabilities. However, zero-shot retrieval remains challenging in this fine-grained domain, underscoring the need for domain-adapted approaches. Our findings highlight the importance of combining technical textual descriptions with advanced VLMs to optimize cross-modal retrieval in industrial applications.
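
Zero-shot text-to-image retrieval of this kind usually reduces to ranking image embeddings by similarity to a text embedding. The sketch below assumes embeddings have already been produced by some encoder and are plain lists of floats; cosine similarity and Recall@k are common choices here, but the abstract does not state which similarity or metric the authors used, and all function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_images(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the text."""
    return sorted(
        range(len(image_embs)),
        key=lambda i: cosine(text_emb, image_embs[i]),
        reverse=True,
    )


def recall_at_k(text_embs, image_embs, targets, k):
    """Fraction of text queries whose target image index is in the top k."""
    hits = sum(
        1
        for query, target in zip(text_embs, targets)
        if target in rank_images(query, image_embs)[:k]
    )
    return hits / len(text_embs)
```

No model weights are updated at any point, which is what makes the setting zero-shot: retrieval quality depends entirely on how well the pretrained text and image embeddings align.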

[148] Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite cs.CV | cs.CLPDF

Yanco Amor Torterolo-Orta, Jaione Macicior-Mitxelena, Marina Miguez-Lamanuzzi, Ana García-Serrano

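TL;DR: This paper presents the GRESEL team's experiments in the IberLEF 2025 shared task PastReader, comparing three text transcription approaches (a web-based OCR service, a traditional OCR engine, and a compact multimodal model). Run on consumer-grade hardware, the experiments produced acceptable results, and the team plans further technical improvements.

Details

Motivation: The study aims to address the transcription of historical Spanish texts and, through participation in the shared task, to compare the strengths and weaknesses of different approaches as a reference for later research.

Result: The results are acceptable but leave room for improvement: the approaches are reasonably effective at transcribing historical Spanish texts, but accuracy needs to improve further.

Insight: Text transcription on consumer-grade hardware is feasible, but the algorithms need further optimization to improve performance. Multimodal models may be a direction for future research.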

Abstract: This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de España (BNE).

[149] Vision-Language Models Can’t See the Obvious cs.CVPDF

Yasser Dahou, Ngoc Dung Huynh, Phuc H. Le-Khac, Wamiq Reyaz Para, Ankit Singh

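TL;DR: The paper introduces the Saliency Benchmark (SalBench) to assess how well large vision-language models (LVLMs) detect visually salient features that humans perceive easily, and finds that they perform poorly.

Details

Motivation: Current large vision-language models excel at complex tasks but perform poorly on low-level visual features that humans find obvious, which calls for systematic evaluation.

Result: Even the advanced GPT-4o reaches only 47.6% accuracy on a simple task, showing LVLMs' weakness in recognizing salient features.

Insight: The study exposes a shortcoming of LVLMs on low-level visual tasks and stresses that future models need to align better with human attention mechanisms.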

Abstract: We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLM) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. Our SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes, and naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLM: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLM using SalBench and our findings reveal a surprising limitation: LVLM struggle to identify seemingly obvious visual anomalies, with even the advanced GPT-4o achieving only 47.6% accuracy on such a simple task. SalBench will be an important step in measuring the capabilities of LVLM that align with the subtle definition of human attention.

[150] ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding cs.CV | cs.CLPDF

Jianjiang Yang, Ziyan Huang, Yanshu Li

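TL;DR: ReLoop proposes a closed-loop training framework that reduces hallucinations in multimodal large language models through three consistency feedback mechanisms, enforcing semantic reversibility and visual consistency.

Details

Motivation: Multimodal large language models perform well in open-ended visual question answering but remain prone to hallucinated outputs. Existing methods rely on external verification or post-hoc correction and lack an internal validation mechanism during training.

Result: Hallucination rates drop significantly across multiple benchmarks, validating the method's effectiveness.

Insight: With closed-loop training and multimodal consistency feedback, the model can correct its outputs internally and reduce hallucinations.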

Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in open-ended visual question answering, they remain vulnerable to hallucinations. These are outputs that contradict or misrepresent input semantics, posing a critical challenge to the reliability and factual consistency. Existing methods often rely on external verification or post-hoc correction, lacking an internal mechanism to validate outputs directly during training. To bridge this gap, we propose ReLoop, a unified closed-loop training framework that encourages multimodal consistency for cross-modal understanding in MLLMs. ReLoop adopts a ring-shaped structure that integrates three complementary consistency feedback mechanisms, obliging MLLMs to “seeing twice and thinking backwards”. Specifically, ReLoop employs the frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and an attention supervision module for attention alignment. These components collectively enforce semantic reversibility, visual consistency, and interpretable attention, enabling the model to correct its outputs during training. Extensive evaluations and analyses demonstrate the effectiveness of ReLoop in reducing hallucination rates across multiple benchmarks, establishing a robust method for hallucination mitigation in MLLMs. We will release our source code and data in the camera-ready version.

[151] Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation cs.CV | cs.CLPDF

Jianjiang Yang, Ziyan Huang

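TL;DR: The paper offers a cognitively inspired perspective that reinterprets “hallucinations” in text-to-image generation as trajectory drift in a latent alignment space. It quantifies alignment tension during generation with a three-axis tension model (the Hallucination Tri-Space) and a dynamic Alignment Risk Code (ARC), and develops a lightweight controller, TM-ARC, to reduce hallucinations.

Details

Motivation: Existing text-to-image (T2I) diffusion models still “hallucinate” when generating images, meaning the generated content diverges from the text semantics. The paper aims to understand and control these hallucinations in order to improve generation quality.

Result: On standard T2I benchmarks, TM-ARC significantly reduces hallucinations while preserving image quality and diversity.

Insight: Hallucinations are not random noise but a sign of imbalanced tension across multiple axes during generation; dynamic control can make text-to-image generation more precise.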

Abstract: Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent “hallucinations”, where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three key axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
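
One way to read the three ARC summaries (magnitude, direction, imbalance) is sketched below for a three-component tension vector. The concrete formulas are assumptions made for illustration: magnitude as the Euclidean norm, dominant axis as the component with the largest absolute value, and imbalance as the gap between the largest and smallest absolute components. The abstract does not give the paper's definitions.

```python
import math

# Axis names follow the abstract; their order in the vector is an assumption.
AXES = ("semantic coherence", "structural alignment", "knowledge grounding")


def summarize_arc(arc):
    """Summarize a 3-component tension vector.

    Returns (magnitude, dominant_axis, imbalance) under the
    illustrative definitions described in the lead-in.
    """
    if len(arc) != len(AXES):
        raise ValueError("expected one component per axis")
    strengths = [abs(x) for x in arc]
    magnitude = math.sqrt(sum(x * x for x in arc))
    dominant_axis = AXES[strengths.index(max(strengths))]
    imbalance = max(strengths) - min(strengths)
    return magnitude, dominant_axis, imbalance
```

Under this reading, a controller could intervene only on the dominant axis when the magnitude exceeds some threshold, which matches the abstract's description of targeted, axis-specific interventions.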

[152] Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models cs.CV | cs.CLPDF

Eunseop Yoon, Hee Suk Yoon, Mark A. Hasegawa-Johnson, Chang D. Yoo

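TL;DR: The paper proposes a framework for training video large language models (Video-LLMs) to recognize and decline questions that fall outside the video's content, addressing the inability of current models to refuse irrelevant questions.

Details

Motivation: Existing Video-LLMs are trained mainly on questions generated directly from video content, but in practice users may ask questions beyond the information in the video, so models need to assess question relevance.

Result: The study shows that even the best-performing Video-LLMs need dedicated training to refuse irrelevant questions, and the new framework significantly improves this ability.

Insight: Video-LLMs need not only video understanding but also the ability to judge whether a question is relevant, which is essential in real-world use.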

Abstract: In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions, not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions. To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and appropriately decline to answer when the question exceeds the scope of the video, as well as an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.

[153] From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection cs.CV | cs.AIPDF

Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Ying Deng

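TL;DR: The paper examines copyright protection for AI-generated art, proposes three criteria for determining a distinctive artistic style, and develops the ArtBulb framework and the AICD dataset to address the technical challenges of copyright judgment.

Details

Motivation: Current legal frameworks lack systematic standards and reliable methods for evaluating the copyright of AI-generated art, which makes protecting AI art copyrights difficult.

Result: Experiments show that ArtBulb outperforms existing models in both quantitative and qualitative evaluations.

Insight: The legal and technical communities need to collaborate more closely, and AI art copyright deserves broader societal attention.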

Abstract: Current legal frameworks consider AI-generated works eligible for copyright protection when they meet originality requirements and involve substantial human intellectual input. However, systematic legal standards and reliable evaluation methods for AI art copyrights are lacking. Through comprehensive analysis of legal precedents, we establish three essential criteria for determining distinctive artistic style: stylistic consistency, creative uniqueness, and expressive accuracy. To address these challenges, we introduce ArtBulb, an interpretable and quantifiable framework for AI art copyright judgment that combines a novel style description-based multimodal clustering method with multimodal large language models (MLLMs). We also present AICD, the first benchmark dataset for AI art copyright annotated by artists and legal experts. Experimental results demonstrate that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Our work aims to bridge the gap between the legal and technological communities and bring greater attention to the societal issue of AI art copyrights.

[154] Model Compression using Progressive Channel Pruning cs.CV | cs.AIPDF

Jinyang Guo, Weichen Zhang, Wanli Ouyang, Dong Xu

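TL;DR: The paper proposes a new Progressive Channel Pruning (PCP) framework that iteratively selects and prunes a small number of channels from several layers, significantly improving CNN compression.

Details

Motivation: Conventional channel pruning methods typically prune each layer once, layer by layer, which can cause a large accuracy loss. PCP uses a progressive strategy to reduce the impact of pruning on model performance.

Result: Experiments on two benchmark datasets show that PCP outperforms existing methods under both supervised learning and transfer learning settings.

Insight: A progressive pruning strategy balances model compression against retained performance; in transfer learning in particular, using data from multiple domains can further improve results.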

Abstract: In this work, we propose a simple but effective channel pruning framework called Progressive Channel Pruning (PCP) to accelerate Convolutional Neural Networks (CNNs). In contrast to the existing channel pruning methods that prune channels only once per layer in a layer-by-layer fashion, our new progressive framework iteratively prunes a small number of channels from several selected layers, which consists of a three-step attempting-selecting-pruning pipeline in each iteration. In the attempting step, we attempt to prune a pre-defined number of channels from one layer by using any existing channel pruning methods and estimate the accuracy drop for this layer based on the labelled samples in the validation set. In the selecting step, based on the estimated accuracy drops for all layers, we propose a greedy strategy to automatically select a set of layers that will lead to less overall accuracy drop after pruning these layers. In the pruning step, we prune a small number of channels from these selected layers. We further extend our PCP framework to prune channels for the deep transfer learning methods like Domain Adversarial Neural Network (DANN), in which we effectively reduce the data distribution mismatch in the channel pruning process by using both labelled samples from the source domain and pseudo-labelled samples from the target domain. Our comprehensive experiments on two benchmark datasets demonstrate that our PCP framework outperforms the existing channel pruning approaches under both supervised learning and transfer learning settings.
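
The attempting-selecting-pruning loop described above can be sketched as follows. Here `estimate_drop` is a hypothetical callback standing in for "tentatively prune `step` channels from this layer with any existing method and measure the accuracy drop on validation data"; the greedy rule (take the layers with the smallest estimated drop), the parameter names, and the stopping rule are assumptions for illustration rather than the authors' implementation.

```python
def progressive_channel_pruning(channels, estimate_drop, target_total,
                                step=2, layers_per_iter=2, min_channels=1):
    """Iteratively prune a few channels from a few layers at a time.

    channels:      dict mapping layer name -> current channel count
    estimate_drop: hypothetical callback (layer, count, step) -> estimated
                   accuracy drop if `step` channels were pruned from layer
    target_total:  stop once at least this many channels are removed
    Returns a new dict of channel counts; the input dict is not modified.
    """
    channels = dict(channels)
    remaining = target_total
    while remaining > 0:
        # Attempting: estimate the accuracy drop for every prunable layer.
        drops = {
            name: estimate_drop(name, count, step)
            for name, count in channels.items()
            if count - step >= min_channels
        }
        if not drops:
            break  # nothing left that can be pruned
        # Selecting: greedily keep the layers with the smallest drops.
        selected = sorted(drops, key=drops.get)[:layers_per_iter]
        # Pruning: remove a small number of channels from each selected layer.
        for name in selected:
            if remaining <= 0:
                break
            channels[name] -= step
            remaining -= step
    return channels
```

Because the drops are re-estimated every iteration, a layer that was cheap to prune early on can become expensive later, and the loop then shifts to other layers instead of exhausting one layer in a single pass.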

[155] PointGAC: Geometric-Aware Codebook for Masked Point Cloud Modeling cs.CVPDF

Abiao Li, Chenlei Lv, Yuming Fang, Yifan Zuo, Jian Zhang

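TL;DR: PointGAC proposes a clustering-based masked point cloud modeling method that learns more general feature representations through online k-means and a codebook update mechanism.

Details

Motivation: Existing masked point cloud modeling methods reconstruct the coordinates or features of masked regions by regression, which tends to over-constrain the model to learn details and makes it hard to capture general features.

Result: Experiments validate the method's effectiveness on multiple downstream tasks.

Insight: Aligning masked features with cluster centers improves feature generalization, and the codebook maintenance mechanism further increases the efficiency of semantic feature learning.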

Abstract: Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinate or feature of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose PointGAC, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. First, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure drives the codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from a proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC
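
To illustrate how a codebook can drift toward cluster centers through online k-means, here is a minimal sketch using the classic running-mean update (each incoming feature moves its nearest codebook vector by 1/count of the way toward it). This is the generic sequential k-means rule, swapped in for illustration; the abstract does not specify PointGAC's update rule or its codebook maintenance mechanism, and the function names are hypothetical.

```python
def nearest_code(codebook, feature):
    """Index of the codebook vector closest to `feature` (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - f) ** 2 for c, f in zip(codebook[k], feature)),
    )


def online_kmeans_update(codebook, counts, feature):
    """Assign `feature` to its nearest code and move that code toward it.

    codebook: list of vectors (lists of floats), updated in place
    counts:   number of features assigned to each code so far, updated in place
    Returns the index of the code the feature was assigned to.
    """
    k = nearest_code(codebook, feature)
    counts[k] += 1
    rate = 1.0 / counts[k]  # running-mean step size
    codebook[k] = [c + rate * (f - c) for c, f in zip(codebook[k], feature)]
    return k
```

The returned index is the cluster assignment, which is the kind of target a student branch could be trained to predict for masked patches instead of regressing raw coordinates or features.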

[156] Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration cs.CV | cs.AI | cs.CLPDF

Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi

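TL;DR: The paper proposes a new historical document restoration method (AutoHDR) and an accompanying dataset (FPHDR). A three-stage pipeline of OCR-assisted localization, vision-language context prediction, and patch-based image restoration significantly improves restoration quality and supports human-machine collaboration.

Details

Motivation: Historical documents are cultural heritage but are often badly damaged by tears, water erosion, and oxidation. Existing restoration methods are mostly single-modality or limited in size and cannot meet practical needs, so a more comprehensive solution is needed.

Result: AutoHDR raises OCR accuracy on severely damaged documents from 46.83% to 84.05%, and to 94.25% with human-machine collaboration.

Insight: The modular design and human-machine collaboration offer an efficient and flexible solution for cultural heritage restoration and could be extended to other document types.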

Abstract: Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians’ restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR’s remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.

[157] RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction cs.CVPDF

Johannes Künzel, Anna Hilsmann, Peter Eisert

TL;DR: RIPE is a reinforcement learning-based framework for weakly-supervised keypoint extraction, requiring only binary labels for training. It uses a hyper-column approach and an auxiliary loss to improve performance, achieving competitive results with minimal supervision.

Details

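Motivation: Conventional keypoint extraction methods depend on extensive manual annotation or complex data augmentation, which limits generalization. RIPE aims to reduce the dependence on annotation through weakly supervised learning and to improve robustness and generalization.

Result: RIPE is competitive on standard datasets, approaching fully supervised methods while greatly reducing the dependence on labeled data.

Insight: RIPE demonstrates the potential of weakly supervised learning for keypoint extraction and offers a new direction for follow-up research.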

Abstract: We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder’s intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/RIPE.

[158] CMET: Clustering guided METric for quantifying embedding quality cs.CVPDF

Sourav Ghosh, Chayan Maitra, Rajat K. De

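TL;DR: The paper proposes CMET, a new metric for quantifying embedding quality. It uses a local and a global score (CMET_L and CMET_G) to assess how well an embedding preserves the shape of the data. CMET outperforms existing methods on a variety of datasets and has low computational complexity.

Details

Motivation: Existing metrics for embedding quality have high time and space complexity and struggle to measure fully how well an embedding preserves local and global structure. An efficient and reliable quantitative metric is therefore needed.

Result: Experiments on synthetic, biological, and image datasets show that CMET outperforms existing methods and remains stable across data sizes and hyper-parameter choices.

Insight: Through its clustering-guided design, CMET provides an efficient, general-purpose tool for assessing embedding quality across diverse data scenarios.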

Abstract: Due to rapid advancements in technology, datasets are available from various domains. In order to carry out more relevant and appropriate analysis, it is often necessary to project the dataset into a higher or lower dimensional space as required. Projecting the data in a higher-dimensional space helps in unfolding intricate patterns, enhancing the performance of the underlying models. On the other hand, dimensionality reduction is helpful in denoising data while capturing maximal information, as well as reducing execution time and memory. In this context, it is not always statistically evident whether the transformed embedding retains the local and global structure of the original data. Most of the existing metrics that are used for comparing the local and global shape of the embedding against the original one are highly expensive in terms of time and space complexity. In order to address this issue, the objective of this study is to formulate a novel metric, called Clustering guided METric (CMET), for quantifying embedding quality. It serves the purpose of quantitative comparison between an embedding and the original data. CMET consists of two scores, viz., CMET_L and CMET_G, that measure the degree of local and global shape preservation capability, respectively. The efficacy of CMET has been demonstrated on a wide variety of datasets, including four synthetic, two biological, and two image datasets. Results reflect the favorable performance of CMET against the state-of-the-art methods. Its capability to handle both small and large data, low algorithmic complexity, better and stable performance across all kinds of data, and robustness to different choices of hyper-parameters make CMET a reliable metric.

[159] Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning cs.CV | cs.CLPDF

Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin

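TL;DR: This paper proposes a two-stage training paradigm (large-scale linguistic fine-tuning followed by multimodal reinforcement learning) that transfers the reasoning ability of LLMs to MLLMs, achieving advanced visual reasoning and revealing three key findings. The resulting model, OVR, reaches SOTA on several benchmarks.

Details

Motivation: Use the cognitive behaviors of LLMs to improve the visual reasoning of MLLMs, filling a gap in multimodal reasoning.

Result: OVR performs strongly on benchmarks including MATH500 (95.3%), MathVision (51.8%), and MathVerse (54.6%).

Insight: 1. Behavior transfer emerges early in the linguistic cold start; 2. Cold start memorizes visual behaviors, while reinforcement learning selects the effective patterns; 3. Transfer favors high-utility behaviors such as visual reflection.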

Abstract: The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

[160] HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding cs.CV | cs.AIPDF

Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu

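TL;DR: HV-MMBench is a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on human-centric video understanding. It addresses the shortcomings of existing benchmarks in task diversity, data formats, and video duration.

Details

Motivation: Existing human-centric video benchmarks focus too heavily on video generation quality and action recognition. They lack a comprehensive assessment of perceptual and cognitive abilities, and their question formats and metrics are overly simple.

Result: HV-MMBench provides a more comprehensive, accurate, and robust evaluation tool for MLLMs on human-centric video understanding.

Insight: The study shows that task diversity and data complexity in a benchmark matter for assessing the real capabilities of MLLMs. Future work could extend to higher-order reasoning tasks.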

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 min) durations, supporting systematic analysis of models’ temporal reasoning abilities across diverse contextual lengths.

[161] RainShift: A Benchmark for Precipitation Downscaling Across Geographies cs.CVPDF

Paula Harder, Luca Schmidt, Francis Pelletier, Nicole Ludwig, Matthew Chantry

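TL;DR: RainShift is a dataset and benchmark for evaluating precipitation downscaling under geographic distribution shift. It exposes the limits of existing methods when generalizing across regions.

Details

Motivation: Earth System Models (ESMs) cannot be run at high enough resolution for local risk assessment because of computational cost. Deep learning downscaling methods need retraining for each geographic region, yet high-resolution data is unevenly available across the globe, so the geographic generalization of these models needs to be assessed.

Result: Experiments show that existing models degrade substantially in out-of-distribution regions (for example, between the Global North and the Global South). Expanding the training data alone does not fully solve the problem, but approaches such as data alignment improve generalization.

Insight: Geographic distribution shift is a key factor limiting the generalization of downscaling models. Future work should consider strategies such as data alignment to reduce inequities in access to climate information worldwide.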

Abstract: Earth System Models (ESM) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk-assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area, which demands high-resolution observational data that is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches including GANs and diffusion models in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.

[162] Boosting Temporal Sentence Grounding via Causal Inference cs.CV | cs.MMPDF

Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao

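TL;DR: This paper proposes CICR, a new framework based on causal inference that tackles spurious correlations between videos and textual queries in temporal sentence grounding. It improves robustness through textual causal intervention and visual counterfactual reasoning.

Details

Motivation: Existing temporal sentence grounding methods often overlook spurious correlations between videos and textual queries. These correlations stem from inherent biases in the textual data and from the model overfitting to salient or repetitive patterns in the video, which leads to unreliable predictions and poor generalization to out-of-distribution samples.

Result: Experiments on public datasets demonstrate the superiority of the method.

Insight: Causal inference can effectively identify and remove spurious correlations in temporal sentence grounding, offering a new way to address bias in multimodal tasks.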

Abstract: Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Although existing studies have made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model’s tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning (CICR), which utilizes causal inference to eliminate spurious correlations and enhance the model’s robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.

[163] Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning cs.CV | cs.AI | cs.MM | cs.SD | eess.ASPDF

Yingshan Liang, Keyu Fan, Zhicheng Du, Yiran Wang, Qingyang Shi

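TL;DR: Hear-Your-Click is an interactive video-to-audio (V2A) generation framework that produces the sound of a specific object when the user clicks on it in a video frame, addressing the limitations of existing methods in complex scenes.

Details

Motivation: Existing V2A methods rely on global video information and struggle to generate audio for specific objects or regions in complex scenes. This work aims to provide more precise, interactive audio generation.

Result: Experiments show that the framework outperforms existing methods on multiple metrics, offering more precise control and better generation performance.

Insight: Through object-level alignment and an interactive design, Hear-Your-Click offers a new approach to V2A generation in complex scenes.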

Abstract: Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods, which rely on global video information, struggle with complex scenes and often fail to generate audio tailored to specific objects or regions in the videos. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework that enables users to generate sounds for specific objects in the videos by simply clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with corresponding audio segments. Furthermore, we tailor two data augmentation strategies: Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), aimed at enhancing the model’s sensitivity to the segmented objects. To effectively measure the audio-visual correspondence, we design a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improved generation performance across various metrics. Project Page: https://github.com/SynapGrid/Hear-Your-Click

[164] Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading cs.CVPDF

Qinkai Yu, Wei Zhou, Hantao Liu, Yanyu Xu, Meng Wang

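TL;DR: This paper proposes AOR-DR, an autoregressive ordinal regression method that addresses the long-tailed distribution and ambiguous category boundaries in diabetic retinopathy (DR) grading. Combining a diffusion process with clinical knowledge yields a marked performance gain.

Details

Motivation: Assessing the severity of diabetic retinopathy is vital for timely treatment, but existing methods perform poorly because of uneven data distribution and ambiguous category boundaries.

Result: On four large public datasets, AOR-DR outperforms six state-of-the-art ordinal regression methods.

Insight: 1) Combining a diffusion process with autoregression makes effective use of ordinal information. 2) Using global features simplifies the model design and improves performance.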

Abstract: As a long-term complication of diabetes, diabetic retinopathy (DR) progresses slowly, potentially taking years to threaten vision. An accurate and robust evaluation of its severity is vital to ensure prompt management and care. Ordinal regression leverages the underlying inherent order between categories to achieve superior performance beyond traditional classification. However, there exist challenges leading to lower DR classification performance: 1) The uneven distribution of DR severity levels, characterized by a long-tailed pattern, adds complexity to the grading process. 2) The ambiguity in defining category boundaries introduces additional challenges, making the classification process more complex and prone to inconsistencies. This work proposes a novel autoregressive ordinal regression method called AOR-DR to address the above challenges by leveraging the clinical knowledge of inherent ordinal information in DR grading dataset settings. Specifically, we decompose the DR grading task into a series of ordered steps by fusing the prediction of the previous steps with extracted image features as conditions for the current prediction step. Additionally, we exploit the diffusion process to facilitate conditional probability modeling, enabling the direct use of continuous global image features for autoregression without relearning contextual information from patch-level features. This ensures the effectiveness of the autoregressive process and leverages the capabilities of pre-trained large-scale foundation models. Extensive experiments were conducted on four large-scale publicly available color fundus datasets, demonstrating our model’s effectiveness and superior performance over six recent state-of-the-art ordinal regression methods. The implementation code is available at https://github.com/Qinkaiyu/AOR-DR.

[165] TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation cs.CVPDF

Zonglin Lyu, Chen Chen

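TL;DR: TLB-VFI is an efficient temporal-aware latent Brownian bridge diffusion model for video frame interpolation. By extracting rich temporal information while reducing the parameter count, it improves both performance and efficiency.

Details

Motivation: Existing diffusion models for video frame interpolation either cannot extract temporal information (image-based) or are too large (video-based). TLB-VFI aims to remove these efficiency and performance bottlenecks.

Result: On the most challenging datasets, FID improves by 20%, with 3x fewer parameters, a 2.3x speedup, and 9000x less training data.

Insight: 1. Effective extraction of temporal information is crucial for video frame interpolation; 2. Latent-space diffusion models can keep performance while reducing the parameter count; 3. Optical flow guidance can greatly reduce the training data required.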

Abstract: Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves 20% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3x fewer parameters. Such a parameter reduction results in a 2.3x speedup. By incorporating optical flow guidance, our method requires 9000x less training data and achieves over 20x fewer parameters than video-based diffusion models. Codes and results are available at our project page: https://zonglinl.github.io/tlbvfi_page.
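
For readers unfamiliar with the term, a Brownian bridge is a diffusion process pinned at both endpoints. The following minimal sketch samples the textbook scalar bridge between two values; it is not the paper's latent-space parameterization, which the abstract does not specify. The names `horizon` and `sigma` are illustrative assumptions.

```python
import math
import random


def bridge_std(t, horizon=1.0, sigma=1.0):
    """Standard deviation of a Brownian bridge at time t in [0, horizon]."""
    return sigma * math.sqrt(t * (horizon - t) / horizon)


def brownian_bridge_sample(x_start, x_end, t, horizon=1.0, sigma=1.0, rng=random):
    """Draw one sample at time t from a bridge pinned at x_start and x_end.

    The mean interpolates linearly between the endpoints, and the noise
    vanishes at t = 0 and t = horizon, so both endpoints are hit exactly.
    """
    mean = (1.0 - t / horizon) * x_start + (t / horizon) * x_end
    return mean + bridge_std(t, horizon, sigma) * rng.gauss(0.0, 1.0)
```

The noise level peaks halfway between the endpoints, so a sample at `t = horizon / 2` is the most uncertain, while samples near either endpoint stay close to the pinned values.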

[166] AI for the Routine, Humans for the Complex: Accuracy-Driven Data Labelling with Mixed Integer Linear Programming cs.CV | cs.SEPDF

Mohammad Hossein Amini, Mehrdad Sabetzadeh, Shiva Nejati

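TL;DR: OPAL is a data labelling method that combines human assistance with mixed-integer linear programming (MILP) to reach high label accuracy with minimal manual labelling. Experiments show that OPAL clearly outperforms baseline methods in labelling accuracy and efficiency.

Details

Motivation: Deep learning (DL) suffers from a scarcity of accurately labelled data. Semi-supervised methods reduce the need for manual labelling, but testing still requires highly accurate labels for reliable model verification.

Result: Across seven datasets, OPAL reaches an average labelling accuracy of 98.8% while cutting manual labelling by more than half. Its accuracy on the automated validation task is also clearly higher than that of the baselines.

Insight: OPAL shows the potential of optimization-driven approaches to data labelling, substantially lowering manual cost when high accuracy is required.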

Abstract: The scarcity of accurately labelled data remains a major challenge in deep learning (DL). Many DL approaches rely on semi-supervised methods, which focus on constructing large datasets that require only a minimal amount of human-labelled data. Since DL training algorithms can tolerate moderate label noise, it has generally been acceptable for the accuracy of labels in large training datasets to fall well short of a perfect 100%. However, when it comes to testing DL models, achieving high label accuracy (as close to 100% as possible) is paramount for reliable verification. In this article, we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. We evaluate OPAL for two tasks in the context of testing vision systems: automatic labelling of test data and automated validation of test data. Our evaluation, based on more than 2500 experiments performed on seven datasets, comparing OPAL with eight baseline methods, shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8%, just 1.2% below perfect accuracy, while cutting manual labelling by more than half. Further, OPAL significantly outperforms automated labelling baselines in labelling accuracy across all seven datasets, with large effect sizes, when all methods are provided with the same manual-labelling budget. For automated test-input validation, on average, OPAL reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the SOTA validation baselines. Finally, we show that augmenting OPAL with an active learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.
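
The idea of minimizing labelling effort subject to an accuracy target can be illustrated with a toy 0-1 selection problem. This is a simplified sketch, not OPAL's MILP formulation, which the abstract does not spell out. It assumes each item has a model confidence equal to the probability that its automatic label is correct, that a manual label is always correct, and that every manual label costs the same. Under those assumptions the optimum is found by sorting, so no solver is needed. The function name is hypothetical.

```python
def choose_manual_labels(confidences, target_accuracy):
    """Pick the fewest items to label by hand so expected accuracy meets the target.

    confidences[i] is the assumed probability that the automatic label of
    item i is correct; a manual label is assumed to be correct.
    Returns the sorted indices of the items to label manually.
    """
    n = len(confidences)
    # Manually labelling item i gains (1 - confidence), so the least
    # confident items are always the best ones to hand to a human.
    order = sorted(range(n), key=lambda i: confidences[i])
    for k in range(n + 1):
        manual, automatic = order[:k], order[k:]
        expected_correct = len(manual) + sum(confidences[i] for i in automatic)
        if expected_correct >= target_accuracy * n:
            return sorted(manual)
    return sorted(order)  # Only reached when the target exceeds 1.0.
```

For example, with confidences `[0.99, 0.6, 0.95, 0.7]` and a target of 0.9, labelling only the item with confidence 0.6 by hand raises the expected number of correct labels from 3.24 to 3.64, which clears the required 3.6.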

[167] Robust Incomplete-Modality Alignment for Ophthalmic Disease Grading and Diagnosis via Labeled Optimal Transport cs.CVPDF

Qinkai Yu, Jianyang Xie, Yitian Zhao, Cheng Chen, Lijun Zhang

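TL;DR: This paper proposes a robust incomplete-modality alignment framework (RIMA) that uses optimal transport to handle missing modalities in multimodal ophthalmic diagnosis, markedly improving diagnostic accuracy.

Details

Motivation: In practice, multimodal ophthalmic data often lacks some modalities because healthcare resources are unevenly distributed. Existing approaches such as modality imputation and distillation are limited in feature reconstruction or in their data requirements.

Result: On three large multimodal ophthalmic datasets, the model shows superior performance in both complete and missing-modality settings, achieving SOTA.

Insight: Optimal transport effectively captures cross-modal feature relationships, and an asymmetric fusion strategy designed around the characteristics of each modality is key to robust diagnosis.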

Abstract: Multimodal ophthalmic imaging-based diagnosis integrates color fundus images with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources often results in real-world clinical scenarios encountering incomplete multimodal data, which significantly compromises diagnostic accuracy. Existing commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1) Imputation methods struggle with accurately reconstructing key lesion features, since OCT lesions are localized, while fundus images vary in style. 2) Distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework capable of robustly handling missing modalities in the task of ophthalmic diagnostics. By considering the distinctive feature characteristics of OCT and fundus images, we emphasize the alignment of semantic features within the same category and explicitly learn soft matching between modalities, allowing the missing modality to utilize existing modality information, achieving robust cross-modal feature alignment under the missing modality. Specifically, we leverage Optimal Transport for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of OCT and fundus modalities. Extensive evaluations on three large ophthalmic multimodal datasets demonstrate our model’s superior performance under various modality-incomplete scenarios, achieving SOTA performance in both complete modality and inter-modality incompleteness conditions. Code is available at https://github.com/Qinkaiyu/RIMA

Britty Baby, Vinkle Srivastav, Pooja P. Jain, Kun Yuan, Pietro Mascagni

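TL;DR: This paper proposes CVS-AdaptNet, a multi-label, multi-modal framework for automated recognition of the Critical View of Safety (CVS). Aligning image embeddings with textual descriptions markedly improves CVS recognition.

Details

Motivation: Assessing CVS during laparoscopic cholecystectomy is a complex and challenging task. Traditional methods depend on costly spatial annotations. This work explores how text can serve as a training and inference tool for multi-modal surgical foundation models to automate CVS recognition.

Result: CVS-AdaptNet reaches 57.6 mAP, about 6 points above the ResNet50 image-only baseline (51.5 mAP), demonstrating the advantage of the multi-modal framework.

Insight: Multi-modal methods, and textual prompts in particular, can markedly improve recognition on fine-grained surgical tasks, suggesting a new route for adapting generalist models to specialized tasks.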

Abstract: The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models that learn from costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet’s multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help in analysing the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: https://github.com/CAMMA-public/CVS-AdaptNet
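
Multi-label classification with positive and negative prompts can be sketched as follows. This is an illustrative reading of the abstract, not the released CVS-AdaptNet code. It assumes each criterion has one positive and one negative prompt embedding, and that the probability for a criterion comes from a two-way softmax over the image's cosine similarity to the two prompts. The function names and the `temperature` value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def criterion_probabilities(image_emb, prompt_pairs, temperature=0.07):
    """Return one independent probability per criterion.

    prompt_pairs is a list of (positive_embedding, negative_embedding).
    A two-way softmax over the scaled similarities equals a sigmoid of
    their difference, so each criterion is scored as a binary decision.
    """
    probs = []
    for positive, negative in prompt_pairs:
        s_pos = cosine(image_emb, positive) / temperature
        s_neg = cosine(image_emb, negative) / temperature
        probs.append(1.0 / (1.0 + math.exp(-(s_pos - s_neg))))
    return probs
```

Because every criterion gets its own positive/negative pair, the probabilities do not have to sum to one, which is what distinguishes this multi-label setup from the usual multi-class prompt matching.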

Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy

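TL;DR: This paper proposes MML-SurgAdapt, a framework that uses Vision-Language Models (CLIP) and Single Positive Multi-Label learning to handle multiple surgical computer vision tasks in a unified way, substantially reducing annotation requirements.

Details

Motivation: Traditional surgical AI models are usually built for a single task, lack flexibility, and need large amounts of annotation. A unified solution is proposed to address multi-task learning with partially annotated data.

Result: On the Cholec80, Endoscapes2023, and CholecT50 datasets, the framework performs comparably to task-specific models and outperforms existing SPML frameworks.

Insight: Natural language supervision combined with learning from partial annotations enables unified multi-task modeling and substantially lowers annotation cost, suggesting a path toward scaling up surgical AI.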

Abstract: Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms the existing SPML frameworks for the task. By reducing the required labels by 23%, our approach proposes a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt

[170] Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning cs.CVPDF

Ricardo Cardoso, Plinio Moreno

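TL;DR: This paper proposes a deep learning method that estimates physical properties of objects, such as mass, from RGB-D vision and depth sensor data. It combines point-cloud and RGB data, uses synthetic data to strengthen training, and clearly outperforms existing methods.

Details

Motivation: In robotics, an object's inertial mass is crucial for tasks such as grasping, manipulation, and simulation. Mass estimation from vision sensors alone has received little attention, so an effective method is needed to fill this gap.

Result: Experiments show that the method clearly outperforms existing benchmarks on multiple metrics.

Insight: Synthetic data augmentation combined with multimodal (RGB plus depth) information can markedly improve the accuracy of mass estimation, especially when real data is limited.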

Abstract: Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object’s mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.

[171] INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling cs.CV | cs.AIPDF

Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou

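TL;DR: This paper proposes INTER, which reduces hallucinations in large vision-language models (LVLMs) through interaction guidance sampling. It needs no additional data or training and improves results by up to 3.4% across several tasks.

Details

Motivation: Humans avoid cognitive hallucinations by drawing on multimodal interaction information. LVLMs lack this ability, so they generate responses that look plausible but are inconsistent with the visual content.

Result: On six benchmarks, INTER achieves an average improvement of up to 3.4% over the state-of-the-art decoding strategy.

Insight: LVLMs show human-like, though weaker, cognitive behavior on multimodal interactions, and explicit guidance can markedly improve their output.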

Abstract: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans’ ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER: Interaction Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.

[172] MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation cs.CVPDF

Yucheng Wang, Dan Xu

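TL;DR: MoDiT is a new framework that combines the 3DMM with a diffusion Transformer to generate highly consistent 3D motion coefficients, addressing the temporal jittering, identity drift, and unnatural blinking of existing methods.

Details

Motivation: Existing audio-driven talking head generation methods, such as those based on GANs or UNet-based diffusion models, suffer from temporal jittering, identity drift, and unnatural blinking, and cannot meet the needs of applications such as virtual assistants, games, and films.

Result: MoDiT effectively reduces temporal jittering, preserves identity consistency, and produces natural blinking.

Insight: Combining the 3DMM with a diffusion Transformer gives talking head generation stronger spatio-temporal constraints and more natural modeling of dynamic behavior.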

Abstract: Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering. (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improved lip synchronization using Wav2Lip results, thereby preserving identity consistency. (iii) A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.

[173] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting cs.CV | cs.ROPDF

Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu

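TL;DR: VOTE is an efficient and general optimization framework for vision-language-action (VLA) models. A tokenizer-free fine-tuning method and an ensemble voting strategy substantially improve inference speed and generalization.

Details

Motivation: Existing VLA models generalize poorly to new objects or unfamiliar environments, and existing remedies usually add computational overhead (for example, depth estimation or diffusion techniques). VOTE explores efficient action prediction that does not need costly extra components.

Result: In experiments, VOTE achieves 35x faster inference and 145 Hz throughput while reaching SOTA performance.

Insight: Simplifying the model structure and introducing ensemble decisions can markedly improve the efficiency and generalization of VLA models without relying on costly additional visual representations.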

Abstract: Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computational overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In detail, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35$\times$ faster inference and 145 Hz throughput. All the details and codes will be open-sourced.
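
One way an ensemble voting step over action predictions could work is sketched below. The abstract does not describe the voting rule, so everything here is an assumption made for illustration: earlier predictions for the same timestep are split by cosine similarity to the newest prediction, and the larger group is averaged. The function names and the `threshold` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def vote_action(current, earlier_predictions, threshold=0.5):
    """Average the majority group of action vectors.

    Predictions whose cosine similarity to `current` is at least
    `threshold` side with it; the rest form the opposing group.
    Ties go to the group that contains the current prediction.
    """
    agree, disagree = [current], []
    for action in earlier_predictions:
        group = agree if cosine(current, action) >= threshold else disagree
        group.append(action)
    chosen = agree if len(agree) >= len(disagree) else disagree
    return [sum(component) / len(chosen) for component in zip(*chosen)]
```

Averaging only the majority group keeps a single outlying prediction from dragging the executed action away from the consensus.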

[174] VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems cs.CV | cs.LGPDF

Aadi Srivastava, Vignesh Natarajkumar, Utkarsh Bheemanaboyna, Devisree Akashapu, Nagraj Gaonkar

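TL;DR: VERITAS is a framework for detecting and explaining AI-generated images. It accurately detects whether a small 32x32 image is AI-generated and explains the classification through artifact localization and semantic reasoning, producing explanations that humans can understand.

Details

Motivation: As content generated by GANs and diffusion models becomes widespread, the line between AI-generated and real images blurs, raising concerns about content authenticity and integrity. Existing detection methods lack transparency, so users struggle to understand the basis of a classification.

Result: Experiments show that VERITAS accurately detects AI-generated images and also provides clear explanations, advancing the transparency of AI systems.

Insight: Combining artifact localization with semantic reasoning is an effective way to make AI-generated image detection more transparent and offers a reference for similar future work.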

Abstract: The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at https://github.com/V-i-g-n-e-s-h-N/VERITAS

[175] 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture cs.CVPDF

Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu

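TL;DR: This paper proposes 4DSloMo, a new method that achieves 4D reconstruction of high-speed scenes from low-frame-rate cameras through asynchronous capture and a generative model, avoiding the need for expensive high-speed cameras.

Details

Motivation: Conventional 4D reconstruction systems are limited to low frame rates (below 30 FPS) and cannot directly reconstruct scenes with high-speed motion. To address this, the paper proposes a scheme for high-speed 4D reconstruction using low-frame-rate cameras.

Result: Experiments show that the method outperforms synchronous capture for high-speed 4D reconstruction and significantly improves reconstruction quality.

Insight: Combining asynchronous capture with a generative model enables high-frame-rate 4D reconstruction without high-speed cameras, offering a cost-effective solution for high-speed scene analysis.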

Abstract: Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
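
The frame-rate arithmetic behind the staggered capture scheme can be illustrated with a minimal Python sketch. It assumes that camera groups start at evenly spaced offsets within one base frame period; the function names and the group counts (4 and 8) are illustrative assumptions chosen to reproduce the 100-200 FPS range quoted in the abstract, not details taken from the paper.

```python
from fractions import Fraction


def effective_fps(base_fps, num_groups):
    """Equivalent frame rate when num_groups camera groups are evenly staggered."""
    return base_fps * num_groups


def staggered_timestamps(base_fps, num_groups, frames_per_group):
    """Sorted capture times (in seconds) across all groups.

    Assumption: group g starts g / (base_fps * num_groups) seconds late,
    then records at the shared base frame rate.
    """
    period = Fraction(1, base_fps)   # time between frames of a single camera
    offset = period / num_groups     # start-time shift between adjacent groups
    times = [
        g * offset + k * period
        for g in range(num_groups)
        for k in range(frames_per_group)
    ]
    return sorted(times)


# With a 25 FPS base rate, 4 groups give 100 FPS and 8 groups give 200 FPS.
fps_4_groups = effective_fps(25, 4)
fps_8_groups = effective_fps(25, 8)

# The merged timeline of 4 groups has a uniform spacing of 1/100 s.
times = staggered_timestamps(25, 4, 3)
gaps = {later - earlier for earlier, later in zip(times, times[1:])}
```

The trade-off noted in the abstract is visible here: each timestamp is covered by only one group, so the number of viewpoints per instant drops by the same factor that the frame rate rises.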

[176] Differential Attention for Multimodal Crisis Event Analysis cs.CVPDF

Nusrat Munia, Junfeng Zhu, Olfa Nasraoui, Abdullah-Al-Zubaer Imran

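TL;DR: This paper improves classification of multimodal crisis-event data by combining vision language models (VLMs) with advanced fusion strategies; LLaVA-generated text and an improved attention mechanism significantly boost model performance.

Details

Motivation: Social networks are an important source of information during crisis events, but extracting useful information from noisy multimodal data and integrating heterogeneous data remain challenging.

Result: Experiments on the CrisisMMD benchmark dataset show that the method outperforms existing models in classification accuracy, yielding more reliable and interpretable models for disaster-response tasks.

Insight: Differential Attention improves performance, while Guided Cross Attention remains highly effective at aligning multimodal features; introducing pretrained vision language models is key to the performance gains.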

Abstract: Social networks can be a valuable source of information during crisis events. In particular, users can post a stream of multimodal data that can be critical for real-time humanitarian response. However, extracting meaningful information from this large and noisy data stream and integrating heterogeneous data remain formidable challenges. In this work, we explore vision language models (VLMs) and advanced fusion strategies to enhance the classification of crisis data in three different tasks. We incorporate LLaVA-generated text to improve text-image alignment. Additionally, we leverage Contrastive Language-Image Pretraining (CLIP)-based vision and text embeddings, which, without task-specific fine-tuning, outperform traditional models. To further refine multimodal fusion, we employ Guided Cross Attention (Guided CA) and combine it with the Differential Attention mechanism to enhance feature alignment by emphasizing critical information while filtering out irrelevant content. Our results show that while Differential Attention improves classification performance, Guided CA remains highly effective in aligning multimodal features. Extensive experiments on the CrisisMMD benchmark dataset demonstrate that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy, contributing to more reliable and interpretable models for three different tasks that are crucial for disaster response. Our code is available at https://github.com/Munia03/Multimodal_Crisis_Event.

[177] Semantic Frame Interpolation cs.CVPDF

Yijia Hong, Jiangning Zhang, Ran Yi, Yuji Wang, Weijian Cao

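TL;DR: This paper proposes a new Semantic Frame Interpolation (SFI) task and develops SemFi, a model built on Wan2.1 that adds a Mixture-of-LoRA module to support generation at multiple frame rates. It also releases SFI-300K, the first dataset and benchmark dedicated to SFI.

Details

Motivation: Traditional frame interpolation is confined to small frame gaps and no text control, while existing large video models can only generate a fixed number of frames with unstable quality. The paper aims to fill these gaps by proposing a more flexible frame interpolation task.

Result: Experiments show that SemFi performs well on the SFI-300K dataset, particularly in multi-frame-rate generation and consistency.

Insight: With text control and multi-frame-rate support, the SFI task broadens the application scenarios of frame interpolation and offers a new direction for video generation research.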

Abstract: Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers have utilized large video models represented by Wan to endow frame-to-frame capabilities. However, these models can only generate a fixed number of frames and often fail to produce satisfactory results for certain frame lengths, while this setting lacks a clear official definition and a well-established benchmark. In this paper, we first propose a new practical Semantic Frame Interpolation (SFI) task from the perspective of academic definition, which covers the above two settings and supports inference at multiple frame rates. To achieve this goal, we propose a novel SemFi model building upon Wan2.1, which incorporates a Mixture-of-LoRA module to ensure the generation of high-consistency content that aligns with control conditions across various frame length limitations. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark specifically designed for SFI. To support this, we collect and process data from the perspective of SFI, carefully designing evaluation metrics and methods to assess the model’s performance across multiple dimensions, encompassing image and video, and various aspects, including consistency and diversity. Through extensive experiments on SFI-300K, we demonstrate that our method is particularly well-suited to meet the requirements of the SFI task.

[178] $\varphi$-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery cs.CV | cs.LGPDF

Hoang-Quan Nguyen, Xuan Bac Nguyen, Sankalp Pandey, Tim Faltermeier, Nicholas Borys

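TL;DR: The paper proposes a physics-informed adaptation learning method ($\varphi$-Adapt) for thickness estimation in 2D quantum material discovery. Using synthetic data generation and physics-informed domain adaptation, the method achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: The quality of quantum flakes directly affects qubit performance, but existing computer vision methods for thickness estimation suffer from scarce data, poor generalization, and sensitivity to domain shift.

Result: Achieves the best performance on multiple benchmarks, outperforming existing methods.

Insight: By combining physical knowledge with domain adaptation, the paper offers a new approach to using synthetic data effectively in real-world settings and advances research at the intersection of deep learning and materials science.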

Abstract: Characterizing quantum flakes is a critical step in quantum hardware engineering because the quality of these flakes directly influences qubit performance. Although computer vision methods for identifying two-dimensional quantum flakes have emerged, they still face significant challenges in estimating flake thickness. These challenges include limited data, poor generalization, sensitivity to domain shifts, and a lack of physical interpretability. In this paper, we introduce one of the first Physics-informed Adaptation Learning approaches to overcome these obstacles. We focus on two main issues, i.e., data scarcity and generalization. First, we propose a new synthetic data generation framework that produces diverse quantum flake samples across various materials and configurations, reducing the need for time-consuming manual collection. Second, we present $\varphi$-Adapt, a physics-informed adaptation method that bridges the performance gap between models trained on synthetic data and those deployed in real-world settings. Experimental results show that our approach achieves state-of-the-art performance on multiple benchmarks, outperforming existing methods. Our proposed approach advances the integration of physics-based modeling and domain adaptation. It also addresses a critical gap in leveraging synthesized data for real-world 2D material analysis, offering impactful tools for deep learning and materials science communities.

[179] All in One: Visual-Description-Guided Unified Point Cloud Segmentation cs.CV | cs.AIPDF

Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer

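TL;DR: VDG-Uni3DSeg is a unified 3D point cloud segmentation framework that combines vision-language models (e.g., CLIP) with large language models (LLMs), using multimodal cues to improve segmentation and achieving leading performance in semantic, instance, and panoptic segmentation.

Details

Motivation: Because of sparse structure, limited annotations, and insufficient multimodal information, existing 3D point cloud segmentation methods struggle to distinguish fine-grained classes and instances. Richer semantic and contextual information is needed to support segmentation.

Result: Achieves leading performance in semantic, instance, and panoptic segmentation, validating the method's effectiveness and scalability.

Insight: Aligning multimodal information (text and images) with point cloud features effectively improves fine-grained segmentation and offers a new approach to 3D scene understanding.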

Abstract: Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.

[180] CTA: Cross-Task Alignment for Better Test Time Training cs.CV | cs.AIPDF

Samuel Barbeau, Pedram Fekri, David Osowiechi, Ali Bahri, Moslem Yazdanpanah, Masih Aminbeidokhti

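TL;DR: Proposes Cross-Task Alignment (CTA), a novel method that improves test-time training (TTT) by aligning a supervised model with a self-supervised one, without requiring a specialized model architecture.

Details

Motivation: Existing test-time training methods degrade under distribution shift and require specialized architectures. CTA aims to improve robustness by aligning supervised and self-supervised models, drawing on multi-modal contrastive learning.

Result: Significantly improves robustness and generalization over existing methods on several benchmark datasets.

Insight: Cross-task alignment makes better use of the strengths of self-supervised learning and improves model performance under distribution shift.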

Abstract: Deep learning models have demonstrated exceptional performance across a wide range of computer vision tasks. However, their performance often degrades significantly when faced with distribution shifts, such as domain or dataset changes. Test-Time Training (TTT) has emerged as an effective method to enhance model robustness by incorporating an auxiliary unsupervised task during training and leveraging it for model updates at test time. In this work, we introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. Unlike existing TTT methods, CTA does not require a specialized model architecture and instead takes inspiration from the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one. This process enforces alignment between the learned representations of both models, thereby mitigating the risk of gradient interference, preserving the intrinsic robustness of self-supervised learning and enabling more semantically meaningful updates at test-time. Experimental results demonstrate substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.
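
To make the idea of aligning two encoders concrete, here is a minimal Python sketch of a symmetric contrastive (InfoNCE-style) loss of the kind used in multi-modal contrastive learning. The abstract does not state CTA's actual loss, so the formulation, the `temperature` value, and all function names below are assumptions for illustration only. Each row of `sup_feats` and `ssl_feats` stands for the supervised and self-supervised embedding of the same image.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: treats sample i's own pair as the positive."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(sup_feats, ssl_feats, temperature=0.1):
    """Assumed CLIP-style loss: pull matching pairs together in both directions."""
    a = [normalize(v) for v in sup_feats]
    b = [normalize(v) for v in ssl_feats]
    # Cosine similarity between every supervised and self-supervised embedding.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = symmetric_contrastive_loss(feats, feats)           # pairs agree
misaligned = symmetric_contrastive_loss(feats, feats[::-1])  # pairs shuffled
```

In this toy setup the loss is close to zero when the two encoders agree and much larger when their embeddings are mismatched, which is the signal such an alignment objective would minimize.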

[181] Self-Supervised Real-Time Tracking of Military Vehicles in Low-FPS UAV Footage cs.CVPDF

Markiyan Kostiv, Anatolii Adamovskyi, Yevhen Cherniavskyi, Mykyta Varenyk, Ostap Viniavskyi

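TL;DR: The paper proposes a self-supervised real-time tracking method for military vehicles in low-frame-rate UAV video, addressing association problems caused by poor image quality and rapid motion.

Details

Motivation: Tracking military vehicles in low-frame-rate UAV video is difficult, mainly because of low image quality, fast target motion, and gaps in detections. Existing multi-object tracking methods struggle with these challenges.

Result: The method maintains high association quality at low frame rates and with degraded image quality, while supporting fast inference.

Insight: Global scene features are key to improving low-frame-rate tracking, and self-supervised learning can reduce reliance on large amounts of annotated data.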

Abstract: Multi-object tracking (MOT) aims to maintain consistent identities of objects across video frames. Associating objects in low-frame-rate videos captured by moving unmanned aerial vehicles (UAVs) in actual combat scenarios is complex due to rapid changes in object appearance and position within the frame. The task becomes even more challenging due to image degradation caused by cloud video streaming and compression algorithms. We present how instance association learning from single-frame annotations can overcome these challenges. We show that global features of the scene provide crucial context for low-FPS instance association, allowing our solution to be robust to distractors and gaps in detections. We also demonstrate that such a tracking approach maintains high association quality even when reducing the input image resolution and latent representation size for faster inference. Finally, we present a benchmark dataset of annotated military vehicles collected from publicly available data sources. This paper was initially presented at the NATO Science and Technology Organization Symposium (ICMCIS) organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-209-RSY - the ICMCIS, held in Oeiras, Portugal, 13-14 May 2025.

[182] From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving cs.CV | cs.AI | cs.LG | cs.MA | cs.ROPDF

Fabian Konstantinidis, Ariel Dallari Guerreiro, Raphael Trumpp, Moritz Sackmann, Ulrich Hofmann

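TL;DR: This paper examines approaches for moving from marginal to joint prediction in order to improve the accuracy and scene consistency of trajectory prediction for traffic participants in automated driving. Through a systematic evaluation, the authors analyze how each approach performs in prediction accuracy, multi-modality, and inference efficiency.

Details

Motivation: Safe and efficient automated driving in dynamic environments depends on accurately predicting the motion of surrounding traffic participants. Traditional marginal prediction methods forecast each agent's future trajectory independently, which can lead to suboptimal planning decisions. Studying joint prediction methods to improve scene consistency is therefore important.

Result: The results indicate that explicitly trained joint prediction models perform best in prediction accuracy and scene consistency but have lower inference efficiency. Generative approaches have an advantage in multi-modality, while post-processing approaches are more efficient.

Insight: Joint prediction can markedly improve scene consistency, but the approaches trade off accuracy, multi-modality, and efficiency. Future work may need to combine the strengths of several approaches.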

Abstract: Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent’s future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at https://frommarginaltojointpred.github.io/.

[183] Spatio-Temporal LLM: Reasoning about Environments and Actions cs.CV | cs.LGPDF

Haozhen Zheng, Beitong Tian, Mingyuan Wu, Zhenggang Tang, Klara Nahrstedt

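TL;DR: The paper proposes a spatio-temporal large language model (ST-LLM) to address the difficulty multimodal large language models (MLLMs) have in jointly understanding the spatial information of an environment and the temporal information of actions.

Details

Motivation: Existing MLLMs perform poorly on prompts that require a holistic spatio-temporal understanding of environments and actions, which limits their real-world use. The authors therefore develop a new model and dataset to address the problem.

Result: ST-LLM significantly outperforms existing methods on the REA dataset, demonstrating its advantage on spatio-temporal understanding tasks.

Insight: Spatio-temporal understanding is a key capability for agents operating in the real world, and ST-LLM offers an effective way to address this challenge.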

Abstract: Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected “Reasoning about Environments and Actions” (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a “spatio-temporal LLM” (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.

[184] Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing cs.CVPDF

Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li

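TL;DR: X-Planner is an image-editing planning system based on a multimodal large language model (MLLM). It decomposes complex instructions into clear sub-tasks and generates edit types and segmentation masks, addressing the shortcomings of existing methods in handling complex instructions and preserving identity.

Details

Motivation: Existing diffusion models handle complex, indirect editing instructions poorly, often losing identity information or depending on manual masks. X-Planner aims to improve editing under complex instructions through automated planning and data generation.

Result: X-Planner performs strongly on complex editing tasks, surpassing existing methods, and its effectiveness is also validated on a newly introduced benchmark.

Insight: Combining chain-of-thought reasoning with multimodal models substantially improves the parsing of complex instructions, and automated data generation is an effective remedy for scarce training data.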

Abstract: Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.

cs.CL [Back]

[185] ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models cs.CL | cs.CY | cs.ETPDF

Dai Li, Linzhuo Li, Huilian Sophie Qiu

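TL;DR: The paper studies the limitations of large language models (such as ChatGPT and Llama) as “silicon samples” for simulating human opinions, identifying shortcomings in structural consistency and in representing diverse opinions.

Details

Motivation: Current research seeks to verify whether LLMs can accurately simulate the opinion distribution of human populations, particularly with respect to diversity and structural consistency across demographic levels.

Result: LLMs exhibit severe structural inconsistency and underrepresent minority opinions.

Insight: Using LLMs as direct substitutes for human survey data is problematic; doing so may reinforce stereotypes and misinform policy.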

Abstract: Large language models (LLMs) in the form of chatbots like ChatGPT and Llama are increasingly proposed as “silicon samples” for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions. We identify two fundamental challenges: a failure in structural consistency, where response accuracy doesn’t hold across demographic aggregation levels, and homogenization, an underrepresentation of minority opinions. To investigate these, we prompted ChatGPT (GPT-4) and Meta’s Llama 3.1 series (8B, 70B, 405B) with questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020. Our findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. We propose an “accuracy-optimization hypothesis,” suggesting homogenization stems from prioritizing modal responses. These issues challenge the validity of using LLMs, especially chatbot AI, as direct substitutes for human survey data, potentially reinforcing stereotypes and misinforming policy.

[186] Mitigating Hidden Confounding by Progressive Confounder Imputation via Large Language Models cs.CL | cs.AIPDF

Hao Yang, Haoxuan Li, Luyu Chen, Haoxiang Wang, Xu Chen

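TL;DR: The paper proposes ProCI, a framework that uses the semantic and world knowledge of large language models to iteratively generate, impute, and validate hidden confounders, addressing the bias that hidden confounding introduces into causal estimates from observational data. Experiments show that ProCI uncovers meaningful confounders and significantly improves treatment effect estimation.

Details

Motivation: Hidden confounders are a central challenge for causal estimation from observational data, and existing methods still rely on the unconfoundedness assumption. This paper makes the first attempt to mitigate the problem using large language models.

Result: Experiments show that ProCI uncovers meaningful confounders and significantly improves treatment effect estimation across multiple datasets and large language models.

Insight: The semantic and world knowledge of large language models offers a new approach to mitigating hidden confounding, though collapsed outputs must be avoided.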

Abstract: Hidden confounding remains a central challenge in estimating treatment effects from observational data, as unobserved variables can lead to biased causal estimates. While recent work has explored the use of large language models (LLMs) for causal inference, most approaches still rely on the unconfoundedness assumption. In this paper, we make the first attempt to mitigate hidden confounding using LLMs. We propose ProCI (Progressive Confounder Imputation), a framework that elicits the semantic and world knowledge of LLMs to iteratively generate, impute, and validate hidden confounders. ProCI leverages two key capabilities of LLMs: their strong semantic reasoning ability, which enables the discovery of plausible confounders from both structured and unstructured inputs, and their embedded world knowledge, which supports counterfactual reasoning under latent confounding. To improve robustness, ProCI adopts a distributional reasoning strategy instead of direct value imputation to prevent the collapsed outputs. Extensive experiments demonstrate that ProCI uncovers meaningful confounders and significantly improves treatment effect estimation across various datasets and LLMs.

[187] Theory of Mind in Action: The Instruction Inference Task cs.CL | cs.AI | cs.MAPDF

Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh

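TL;DR: The paper introduces a new task, Instruction Inference, to assess Theory of Mind (ToM) in a dynamic, goal-oriented, collaborative environment, and develops Tomcat, an LLM-based agent with two reasoning variants (Fs-CoT and CP), whose Fs-CoT variant performs comparably to human participants.

Details

Motivation: Theory of Mind (ToM) is a key capability for understanding others' intentions in collaboration. Existing tasks lack a way to evaluate it in dynamic collaborative settings, so the authors design a new task and develop Tomcat, an LLM-based agent, to fill this gap.

Result: The Fs-CoT variant of Tomcat (especially with GPT-4o and DeepSeek-R1) performs comparably to humans on intent accuracy, action optimality, and planning optimality, showing its potential for human-AI collaboration.

Insight: 1. ToM can be assessed through dynamic collaborative tasks; 2. LLMs with few-shot chain-of-thought reasoning (Fs-CoT) perform well on ToM tasks; 3. Tomcat suggests a possible direction for future human-AI collaboration.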

Abstract: The Theory of Mind (ToM) refers to an agent’s capacity to infer the mental states of other agents. ToM is essential for effective collaboration. To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting indirect or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal’s instructions. We implement two variants of Tomcat. One, dubbed Fs-CoT, is based on a small number of examples (i.e., few-shot or Fs) demonstrating the requisite structured reasoning (i.e., chain-of-thought or CoT). The other, dubbed CP, relies on commonsense knowledge and information about the problem (i.e., commonsense prompt or CP). We realized both variants of Tomcat on three leading large language models (LLMs), namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant of Tomcat. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-AI collaboration.

[188] Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III cs.CL | cs.AIPDF

Pranam Shetty, Abhisek Upadhayaya, Parth Mitesh Shah, Srikanth Jagabathula, Shilpi Nayak

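TL;DR: Through a comprehensive evaluation of 23 state-of-the-art large language models (LLMs) on the CFA Level III exam, the paper characterizes their performance on advanced financial reasoning. Leading models such as o4-mini and Gemini 2.5 Flash perform strongly, but cost-effectiveness and the interpretation of performance remain challenges.

Details

Motivation: As financial institutions increasingly adopt LLMs, rigorous evaluation of domain-specific capabilities becomes essential for responsible deployment.

Result: Leading models achieve high composite scores, such as o4-mini (79.1%) and Gemini 2.5 Flash (77.3%), indicating significant capability in the financial domain.

Insight: The study shows that LLMs perform strongly on high-stakes financial tasks, but cost-effective deployment and the interpretation of performance against professional benchmarks need further work.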

Abstract: As financial institutions increasingly adopt Large Language Models (LLMs), rigorous domain-specific evaluation becomes critical for responsible deployment. This paper presents a comprehensive benchmark evaluating 23 state-of-the-art LLMs on the Chartered Financial Analyst (CFA) Level III exam - the gold standard for advanced financial reasoning. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III. These results, achieved under a revised, stricter essay grading methodology, indicate significant progress in LLM capabilities for high-stakes financial applications. Our findings provide crucial guidance for practitioners on model selection and highlight remaining challenges in cost-effective deployment and the need for nuanced interpretation of performance against professional benchmarks.

[189] RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism cs.CL | cs.AI | cs.IRPDF

Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang

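TL;DR: RAG-R1 is a new training framework that improves the search and reasoning capabilities of LLMs through multi-query parallelism while reducing inference time.

Details

Motivation: Existing retrieval-augmented generation (RAG) methods face challenges in training stability and inference time, and are constrained by the single-query mode.

Result: On seven question-answering benchmarks, performance improves by up to 13.2% over the strongest baseline and inference time decreases by 11.1%.

Insight: Multi-query parallelism is an effective way to strengthen LLM capabilities and removes the bottleneck of the single-query mode.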

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
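
The difference between single-query and multi-query retrieval can be sketched in a few lines of Python. This is only an illustration of issuing several search queries concurrently within one reasoning step; the toy corpus, `toy_retrieve`, and `retrieve_parallel` are hypothetical stand-ins and do not reflect RAG-R1's actual retriever or training procedure.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory corpus standing in for an external search backend.
CORPUS = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Berlin is the capital of Germany.",
]


def toy_retrieve(query):
    """Return documents containing every term of the query (case-insensitive)."""
    terms = query.lower().split()
    return [doc for doc in CORPUS if all(term in doc.lower() for term in terms)]


def retrieve_parallel(queries, retrieve=toy_retrieve, max_workers=4):
    """Run all queries concurrently and map each query to its retrieved documents."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with queries.
        return dict(zip(queries, pool.map(retrieve, queries)))


# One reasoning step emits two sub-queries instead of one.
results = retrieve_parallel(["capital france", "seine"])
```

With a real retriever whose latency is dominated by I/O, running the sub-queries concurrently means one round trip per reasoning step rather than one per query, which is the kind of saving the abstract attributes to multi-query parallelism.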

[190] From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought cs.CLPDF

Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding

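TL;DR: The paper proposes SMART, a framework that improves the reasoning ability of multimodal large language models by generating positive and negative reasoning paths and then self-aligning.

Details

Motivation: Current methods focus mainly on generating positive rationales and overlook the role of negative rationales in identifying flawed reasoning patterns. The paper aims to fill this gap.

Result: Experiments show that SMART significantly improves the reasoning of various MLLMs, regardless of model architecture, parameter size, or pre-training dataset.

Insight: Negative rationales play a key role in training, supporting the effectiveness of self-alignment for improving reasoning.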

Abstract: Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methodologies primarily focus on synthesizing positive rationales, while overlooking the critical role of negative rationales in training models to discern flawed reasoning patterns. To address this gap, we propose a novel framework: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought (SMART). This framework enables models to utilize Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, followed by self-alignment to enhance their reasoning abilities. Inspired by human strategies for solving proof-based problems, AoT uses answers as a guide to help the model extract critical visual information that links questions and answers. When provided with ground truth answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with misleading alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. This encourages the use of improved models to generate higher-quality preference data for further optimization. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model’s reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code, datasets, and models will be released.

[191] GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models cs.CLPDF

Seshu Tirupathi, Dhaval Salwala, Elizabeth Daly, Inge Vejsbjerg

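TL;DR: GAF-Guard is a new agentic framework that strengthens risk management and monitoring of large language models (LLMs) through a governance approach centered on the user, the use case, and the model.

Details

Motivation: As LLMs are widely applied across domains, the lack of monitoring systems tailored to specific use cases and user needs can lead to negative consequences. Current automated monitoring systems focus mainly on problems of the LLM itself (such as hallucination) and overlook the particular needs of use cases and users.

Result: The code is open source; the framework is designed to monitor risks in LLM deployments, with the aim of improving AI safety and meeting user expectations.

Insight: Incorporating user and use-case requirements into the LLM governance framework is key, and agentic risk monitoring can adapt flexibly to diverse scenarios.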

Abstract: As Large Language Models (LLMs) continue to be increasingly applied across various domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns such as hallucination, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM-based applications. The approach models autonomous agents that identify risks, activate risk detection tools within specific use-cases, and facilitate continuous monitoring and reporting to enhance AI safety and meet user expectations. The code is available at https://github.com/IBM/risk-atlas-nexus-demos/tree/main/gaf-guard.

[192] A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements cs.CLPDF

Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis

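TL;DR: The paper compares three approaches to formulating competency questions (CQs): manual formulation, pattern instantiation, and generation with large language models (LLMs), and analyzes their characteristics.

Details

Motivation: Competency questions are essential in knowledge engineering, but there is little research characterizing the outputs of different formulation approaches or comparing them systematically.

Result: CQs produced by different approaches have different characteristics; LLMs can serve as an initial elicitation tool, but their output needs further refinement.

Insight: LLM-generated CQs are sensitive to the model used and generally need additional refinement before they can be used to model requirements.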

Abstract: Competency Questions (CQs) are pivotal in knowledge engineering, guiding the design, validation, and testing of ontologies. A number of diverse formulation approaches have been proposed in the literature, ranging from completely manual to Large Language Model (LLM) driven ones. However, attempts to characterise the outputs of these approaches and their systematic comparison are scarce. This paper presents an empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state-of-the-art LLMs. We generate CQs using each approach from a set of requirements for cultural heritage, and assess them across different dimensions: degree of acceptability, ambiguity, relevance, readability and complexity. Our contribution is twofold: (i) the first multi-annotator dataset of CQs generated from the same source using different methods; and (ii) a systematic comparison of the characteristics of the CQs resulting from each approach. Our study shows that different CQ generation approaches have different characteristics and that LLMs can be used as a way to initially elicit CQs; however, these are sensitive to the model used to generate CQs and they generally require a further refinement step before they can be used to model requirements.

[193] [‘For Argument’s Sake, Show Me How to Harm Myself!’: Jailbreaking LLMs in Suicide and Self-Harm Contexts](https://arxiv.org/abs/2507.02990) cs.CL | cs.AIPDF

Annika M Schoene, Cansu Canca

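TL;DR: By introducing new test cases for suicide and self-harm, the paper exposes weaknesses in the safety protocols of large language models (LLMs), shows that disregarding user intent leads to harmful content being generated, and calls for more comprehensive AI safety and ethics measures.

Details

Motivation: Although LLM safety protocols are increasingly sophisticated, they remain vulnerable to adversarial prompting in specific contexts such as suicide and self-harm. The paper aims to expose these weaknesses in order to motivate more effective safeguards.

Result: User intent is disregarded and the LLMs generate detailed, potentially harmful content; the bypass is shown to be generalizable and reliable.

Insight: Safety measures in current general-purpose LLMs fall significantly short in specific domains, which calls for more comprehensive adversarial testing and task-specific model development.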

Abstract: Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking to bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.

[194] Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs cs.CL | cs.AIPDF

Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi

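TL;DR: This study evaluates how well large language models (LLMs) classify ICD-10 codes from hospital discharge summaries and finds limited performance, with no model exceeding an F1 score of 57%. Reasoning models outperform non-reasoning ones but are not yet reliable enough for full automation.

Details

Motivation: ICD-10 coding is a critical but error-prone task in healthcare. By evaluating LLMs on this task, the study aims to determine whether they can assist or replace human coders.

Result: No model achieved an F1 score above 57%, and performance dropped as code specificity increased. Reasoning models generally did better, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others.

Insight: LLMs can assist human coders but are not yet reliable enough. Future work should explore hybrid methods, domain-specific training, and the use of structured clinical data.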

Abstract: This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable enough for full automation. Future work should explore hybrid methods, domain-specific model training, and the use of structured clinical data.
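
The link between code specificity and F1 can be illustrated with a small sketch. The abstract does not say which F1 variant was used, so micro-averaging is an assumption here, as are the helper names `micro_f1` and `truncate` and the toy gold/predicted code lists, which are made-up illustration data rather than results from the study.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-document sets of codes (assumed variant)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # codes predicted and present in the gold set
        fp += len(p - g)   # codes predicted but absent from the gold set
        fn += len(g - p)   # gold codes the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def truncate(code, n_chars):
    """Coarsen an ICD-10 code to its first n characters, ignoring the dot."""
    return code.replace(".", "")[:n_chars]


# Hypothetical gold and predicted codes for two discharge summaries.
gold = [["I25.10", "E11.9"], ["I10"]]
pred = [["I25.2", "E11.9"], ["I10", "N17.9"]]

full_f1 = micro_f1(gold, pred)
category_f1 = micro_f1(
    [[truncate(c, 3) for c in doc] for doc in gold],
    [[truncate(c, 3) for c in doc] for doc in pred],
)
print(f"full-code F1: {full_f1:.3f}, 3-character F1: {category_f1:.3f}")
```

In this toy case the prediction `I25.2` counts as an error against `I25.10` at full specificity but as a match at the three-character category level, which is one way scores can fall as the required specificity rises.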

[195] Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages cs.CLPDF

Wanru Zhao, Yihong Chen, Royson Lee, Xinchi Qiu, Yan Gao

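TL;DR: The paper proposes a multilingual federated prompt tuning paradigm that addresses the data-sharing restrictions and linguistic differences faced when tuning pre-trained large language models (LLMs) for low-resource languages, improving data efficiency and enabling mutual enhancement across languages.

Details

Motivation: Pre-trained LLMs perform well in multilingual applications, but tuning them for low-resource languages is hindered by data-sharing restrictions (the physical border) and linguistic differences (the linguistic border).

Result: Compared with traditional local cross-lingual transfer tuning, the method achieves 6.9% higher accuracy and shows greater stability and generalization.

Insight: Beyond improving model performance, the approach can help promote social equality and linguistic diversity, ensuring that low-resource languages are not left behind.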

Abstract: Pre-trained large language models (LLMs) have become a cornerstone of modern natural language processing, with their capabilities extending across a wide range of applications and languages. However, the fine-tuning of multilingual LLMs, especially for low-resource languages, faces significant challenges arising from data-sharing restrictions (the physical border) and inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, particularly those in low-resource regions, from fully benefiting from the advantages of LLMs. To address these challenges, we propose the Federated Prompt Tuning Paradigm for multilingual scenarios, which utilizes parameter-efficient fine-tuning while adhering to data sharing restrictions. We design a comprehensive set of experiments and analyze them using a novel notion of language distance to highlight the strengths of our paradigm: Even under computational constraints, our method not only improves data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local cross-lingual transfer tuning methods, our approach achieves 6.9% higher accuracy with improved data efficiency, and demonstrates greater stability and generalization. These findings underscore the potential of our approach to promote social equality and champion linguistic diversity, ensuring that no language is left behind.
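
As a rough illustration of how prompt parameters might be combined without sharing raw data, the sketch below averages per-client soft-prompt vectors weighted by local dataset size. The abstract does not specify the aggregation rule, so this FedAvg-style weighting, the function name `aggregate_prompts`, and the toy vectors are all assumptions made for illustration.

```python
def aggregate_prompts(client_prompts, client_sizes):
    """Weighted average of per-client soft-prompt vectors (FedAvg-style, assumed).

    client_prompts: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training examples held by each client.
    """
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[i] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical language clients; only prompt parameters leave each client.
global_prompt = aggregate_prompts([[1.0, 0.0], [0.0, 1.0]], [1, 3])
print(global_prompt)
```

Only the small prompt vectors are exchanged in such a scheme, which is how parameter-efficient tuning can stay within data-sharing restrictions; the backbone model and the local text never move.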

[196] OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering cs.CLPDF

Zipeng Qiu

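TL;DR: The paper proposes a reinforcement-learning-based end-to-end framework that embeds tool calls directly into a large language model, substantially improving accuracy on open-domain table question answering.

Details

Motivation: Traditional open-domain table QA relies on a two-stage pipeline of static table retrieval followed by closed-domain answering, which cannot jointly optimize retrieval and reasoning. The author proposes an end-to-end approach that embeds tool calls directly.

Result: Exact match on the held-out test set exceeds 0.86, up from single-digit zero-shot performance.

Insight: Combining structured tool calls with targeted RL fine-tuning can markedly improve the accuracy and scalability of open-domain table QA.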

Abstract: Open-domain table question answering traditionally relies on a two-stage pipeline: static table retrieval followed by a closed-domain answer. In contrast, we propose an end-to-end agentic framework that embeds multi-turn tool calls (using a BM25+-based search API and a SQLite SQL executor) directly into a large language model. To further adapt a compact 4B-parameter model, we introduce a two-stage fine-tuning process: supervised cold-start on easy questions, then Async GRPO reinforcement learning on harder cases with LoRA adapters and a rollout buffer. This unified approach enables the model to jointly retrieve, reason, and execute queries, yielding a dramatic accuracy improvement from single-digit zero-shot performance to over 0.86 exact match on a held-out test set. Our results underscore the effectiveness of integrating structured tool calls with targeted RL fine-tuning for scalable, accurate table QA. The code is available at https://github.com/TabibitoQZP/OpenTableR1.
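
To make the SQL-executor half of the tool loop concrete, here is a minimal sketch of a tool that runs a model-issued query against SQLite and returns a text observation. The function name `run_sql`, the SELECT-only guard, the row cap, and the toy `medals` table are assumptions for illustration; they are not taken from the paper's code, and the BM25+ search tool is not shown.

```python
import sqlite3


def run_sql(conn, query, max_rows=5):
    """Run a read-only query and return a text observation for the model."""
    # Guard (assumed): refuse anything that is not a SELECT statement.
    if not query.lstrip().lower().startswith("select"):
        return "error: only SELECT statements are allowed"
    try:
        rows = conn.execute(query).fetchmany(max_rows)
    except sqlite3.Error as exc:
        # Errors go back to the model as text so it can revise the query.
        return f"error: {exc}"
    return repr(rows)


# A toy in-memory table standing in for a retrieved table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, gold INTEGER)")
conn.executemany("INSERT INTO medals VALUES (?, ?)", [("A", 3), ("B", 5)])

print(run_sql(conn, "SELECT country FROM medals ORDER BY gold DESC LIMIT 1"))
```

Returning errors as observations rather than raising them keeps a multi-turn loop going: a failed query becomes feedback the model can act on in its next turn.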

[197] Cautious Next Token Prediction cs.CL | cs.AI | cs.LGPDF

Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu

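TL;DR: The paper proposes Cautious Next Token Prediction (CNTP), a training-free decoding strategy that adapts its sampling paths to the model’s uncertainty during decoding, improving the balance between diversity and coherence.

Details

Motivation: Large language models (LLMs) typically decode with temperature scaling and nucleus sampling, which performs poorly when the model is uncertain about a test question. CNTP addresses this by mimicking human behaviour under uncertainty: it explores several paths and selects the most reliable one.

Result: Experiments on LLMs and multimodal LLMs show that CNTP clearly outperforms existing decoding strategies and that combining it with self-consistency improves results further.

Insight: By exploring multiple paths under uncertainty and selecting the most reliable one, CNTP offers a more flexible and effective decoding method that could become one of the default decoding strategies.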

Abstract: The next-token prediction paradigm has prevailed for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a brand new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score, viewed as the most probable and reliable trial path given the model’s capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human beings’ behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. Moreover, the integration of CNTP with self-consistency can further improve over vanilla self-consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
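
The decoding rule described in the abstract can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: `next_dist` is a hypothetical callable mapping a token prefix to a next-token probability dictionary, and the entropy threshold and the rule mapping entropy to the number of trials are assumptions chosen only so that less confident steps get more trials.

```python
import math
import random

PUNCTUATION = frozenset(".,;:!?")


def entropy(dist):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def perplexity(logprobs):
    """Perplexity of a sampled continuation from its token log-probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))


def sample_token(dist, rng):
    """Sample one token and return it with its log-probability."""
    tok = rng.choices(list(dist), weights=list(dist.values()))[0]
    return tok, math.log(dist[tok])


def sample_trial(next_dist, prefix, rng, max_len=20):
    """Sample tokens until a punctuation token appears or max_len is reached."""
    tokens, logprobs = [], []
    while len(tokens) < max_len:
        tok, lp = sample_token(next_dist(prefix + tokens), rng)
        tokens.append(tok)
        logprobs.append(lp)
        if tok in PUNCTUATION:
            break
    return tokens, logprobs


def cautious_step(next_dist, prefix, rng, threshold=0.5, max_trials=8):
    """Return the tokens to append at this decoding step."""
    h = entropy(next_dist(prefix))
    if h <= threshold:
        # Confident step: ordinary single-token sampling.
        return [sample_token(next_dist(prefix), rng)[0]]
    # Uncertain step: more entropy means more trials (assumed mapping).
    n_trials = min(max_trials, 1 + math.ceil(h / threshold))
    trials = [sample_trial(next_dist, prefix, rng) for _ in range(n_trials)]
    # Keep the trial with the lowest perplexity.
    best_tokens, _ = min(trials, key=lambda t: perplexity(t[1]))
    return best_tokens
```

A caller would invoke `cautious_step` repeatedly, extending the prefix with the returned tokens each time; with a real model, `next_dist` would wrap a forward pass, and the extra trials are the inference-time cost paid for the added caution.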

[198] Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation cs.CL | cs.AI | cs.IRPDF

Yutian Liu, Zhengyi Yang, Jiancan Wu, Xiang Wang

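TL;DR: The paper proposes CETRec, a causal-inference-based framework that strengthens the temporal sensitivity of large language models in sequential recommendation, addressing the failure of existing methods, caused by architectural constraints, to fully exploit the temporal information in users’ historical interaction sequences.

Details

Motivation: Existing LLM-based recommendation methods do not fully exploit the temporal information in users’ historical interaction sequences, which limits their ability to capture how user interests evolve and to predict future preferences accurately.

Result: CETRec effectively improves LLMs’ awareness of temporal information, covering both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items).

Insight: Causal inference and counterfactual tuning can significantly improve LLM performance in sequential recommendation, addressing the gap left by existing methods in their use of temporal information.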

Abstract: Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users’ historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. By conceptualizing temporal order as an independent causal factor distinct from item content, we can quantify its unique contribution through counterfactual reasoning–comparing what recommendations would be made with and without temporal information while keeping all other factors constant. This causal framing enables CETRec to design a novel counterfactual tuning objective that directly optimizes the model’s temporal sensitivity, teaching LLMs to recognize both absolute timestamps and relative ordering patterns in user histories. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs’ awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items).

[199] Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL) cs.CL | cs.AIPDF

Sudesh Bhagat, Ibne Farabi Shihab, Jonathan Wood

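TL;DR: The study examines how effectively machine learning and deep learning methods detect misclassification in police-reported crash narratives. Comparing several models, it finds that the Albert model performs best, and it proposes a hybrid approach that combines automated classification with expert review to improve data quality.

Details

Motivation: Accurate classification of crash data is essential to transportation safety management, yet police-reported narratives may be misclassified. The study aims to improve classification accuracy using ML and DL methods.

Result: The Albert model achieved the highest agreement with expert classifications (73%) and significantly outperformed the other models; multi-modal analysis reduced error rates by 54.2%.

Insight: A hybrid approach that combines automated classification with expert review is a practical way to improve crash data quality, with substantial implications for transportation safety management and policy development.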

Abstract: This research investigates the efficacy of machine learning (ML) and deep learning (DL) methods in detecting misclassified intersection-related crashes in police-reported narratives. Using 2019 crash data from the Iowa Department of Transportation, we implemented and compared a comprehensive set of models, including Support Vector Machine (SVM), XGBoost, BERT Sentence Embeddings, BERT Word Embeddings, and Albert Model. Model performance was systematically validated against expert reviews of potentially misclassified narratives, providing a rigorous assessment of classification accuracy. Results demonstrated that while traditional ML methods exhibited superior overall performance compared to some DL approaches, the Albert Model achieved the highest agreement with expert classifications (73% with Expert 1) and original tabular data (58%). Statistical analysis revealed that the Albert Model maintained performance levels similar to inter-expert consistency rates, significantly outperforming other approaches, particularly on ambiguous narratives. This work addresses a critical gap in transportation safety research through multi-modal integration analysis, which achieved a 54.2% reduction in error rates by combining narrative text with structured crash data. We conclude that hybrid approaches combining automated classification with targeted expert review offer a practical methodology for improving crash data quality, with substantial implications for transportation safety management and policy development.

[200] Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case cs.CL | cs.AI | cs.LGPDF

Alvaro Riquelme, Pedro Costa, Catalina Martinez

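TL;DR: The paper proposes a semi-automated approach that uses large language models (such as GPT-4o and Llama 3.2 405b) to convert clinical data into HL7 FHIR format, substantially improving the efficiency of data standardization.

Details

Motivation: Existing semantic interoperability standards for clinical data are costly and technically complex to deploy, so automated solutions are urgently needed.

Result: In the benchmark, resource identification reached an F1 score of 100%; under real-world conditions accuracy dipped to 94%, but refinements to the prompts restored robust mappings.

Insight: Prompt design is critical to model performance; future work could fine-tune models on specialized medical corpora and extend support to other standards.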

Abstract: For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models, specifically GPT-4o and Llama 3.2 405b, to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94%, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.

[201] ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization cs.CL | cs.AI | 68T05, 68Q25 | I.2.6; I.2.7PDF

YuXuan Zhang

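TL;DR: The paper proposes ARF, an adaptive reward-following framework that improves preference modeling in RLHF through emotion-driven self-supervision and trace-biased dynamic optimization, yielding notable performance gains.

Details

Motivation: Existing RLHF methods (e.g., PPO, DPO) rely on a binary-preference paradigm which, while reducing annotation costs, cannot capture individual preferences and still requires substantial human effort.

Result: Across several models (Qwen-2/2.5, Gemma-2, Llama-3.2), ARF improves on PPO by 3.3% and on DPO by 7.6%.

Insight: Through self-supervision and dynamic optimization, ARF enables personalized, low-cost RLHF while preserving theoretical alignment with PPO and DPO.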

Abstract: With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer, achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog, to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and a score bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF achieves an improvement of 3.3% over PPO and 7.6% over DPO. Moreover, TB preserves theoretical alignment with PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF LLMs through autonomous reward modeling.

[202] RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents cs.CL | cs.AI | cs.CYPDF

Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He

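TL;DR: RLVER is the first end-to-end reinforcement learning framework to use verifiable emotion rewards, improving the empathetic abilities of LLMs. PPO fine-tuning substantially raises emotion scores while preserving other capabilities.

Details

Motivation: Large language models (LLMs) excel in cognitive ability but still fall short in emotional intelligence (EQ). Reinforcement learning for dialogue, especially for emotional intelligence, remains underexplored.

Result: The Sentient-Benchmark score rises from 13.3 to 79.2 while mathematical and coding competence is preserved; GRPO and PPO behave differently across settings.

Insight: Thinking and non-thinking models differ in empathy and action; moderately challenging environments may yield more stable improvements.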

Abstract: Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM’s learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better; moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.

[203] ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models cs.CLPDF

Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang

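TL;DR: The paper introduces the ReliableMath dataset for systematically evaluating the reliability of large language models (LLMs) in mathematical reasoning, and shows that LLMs tend to produce unreliable answers when faced with unsolvable problems.

Details

Motivation: Prior work has mainly studied unreliability in knowledge tasks; reliability in mathematical reasoning has been understudied because datasets of unsolvable problems are lacking.

Result: With a reliable prompt, larger LLMs become markedly more reliable, but small LLMs still perform poorly; the proposed alignment strategy significantly improves the reliability of small LLMs.

Insight: Reliability in mathematical reasoning needs further study, particularly methods for improving small LLMs; reliable prompts and alignment strategies are effective ways to raise model reliability.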

Abstract: Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining their reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When LLMs are instructed to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs is maintained on solvable problems and notably improves on unsolvable problems, yet still falls short of that on solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs’ reliability, which can significantly improve LLM reliability on both in-domain and out-of-domain tasks.

[204] From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models cs.CLPDF

Melanie Galea, Claudia Borg

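TL;DR: The paper explores how debiasing methods developed for English transfer to Maltese language models in order to address gender bias, and highlights the challenges of debiasing for low-resource, morphologically rich languages.

Details

Motivation: Large language models (LLMs) perform well in natural language processing but readily learn social biases, particularly gender bias, from training data, which is especially harmful to marginalised communities. Maltese is a low-resource, morphologically rich language with little related research, so the applicability of debiasing methods to it needs to be explored.

Result: Existing debiasing methods face challenges in morphologically complex Maltese, pointing to the need for more inclusive approaches in developing multilingual NLP.

Insight: Debiasing methods have limited applicability to low-resource languages, underscoring the need for more targeted research in multilingual NLP development.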

Abstract: The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.

[205] Adversarial Manipulation of Reasoning Models using Internal Representations cs.CL | cs.AI | cs.LGPDF

Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi

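TL;DR: The paper investigates how reasoning models can be adversarially attacked while generating chain-of-thought (CoT) and identifies a linear direction (termed the “caution” direction) that predicts whether the model refuses or complies. Intervening on this direction can manipulate the model’s output.

Details

Motivation: To study the vulnerability of reasoning models during CoT generation, in particular how adversarial attacks can jailbreak a model by manipulating its internal representations.

Result: Ablating the caution direction increases harmful compliance; intervening on CoT activations can control the model’s output; incorporating the direction into prompt-based attacks improves success rates.

Insight: The chain of thought is a new target for adversarial manipulation in reasoning models; the model’s refusal decision is made during CoT generation rather than at the prompt-response boundary.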

Abstract: Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply – termed the “caution” direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation
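
The phrase "ablating this direction from model activations" refers, in its usual linear-algebra sense, to removing the component of an activation vector that lies along a given direction. The sketch below shows only that projection on plain Python lists; the function name `ablate_direction` and the toy vectors are assumptions for illustration, and nothing here reproduces how the paper finds the direction or applies it inside a model.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the projected component, leaving the orthogonal remainder.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy 2-D example: the second coordinate is the direction being removed.
print(ablate_direction([3.0, 4.0], [0.0, 2.0]))
```

After the operation the result is orthogonal to the chosen direction, so any downstream computation that reads the activation along that direction sees zero.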

[206] Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs cs.CL | cs.AIPDF

Yuanxin Wang, Ganesh Venkatesh

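TL;DR: The paper explores how letting large language models (LLMs) “read silently” can improve their comprehension and reasoning, proposing simple but effective techniques (such as an initial contextual prompt and a “reading buddy” architecture) that improve performance significantly.

Details

Motivation: Humans usually read silently to comprehend and think before expressing themselves, whereas existing LLMs lack such an internal processing phase, which limits their reasoning. The paper seeks to improve LLM reasoning quality by mimicking this human cognitive process.

Result: Experiments show that these simple techniques can improve LLM accuracy by multiple percentage points.

Insight: Mimicking human silent reading can effectively strengthen LLM comprehension and reasoning, suggesting that separating the comprehension and reasoning phases matters for model performance.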

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding text and generating high-quality responses. However, a critical distinction from human cognition is their typical lack of a distinct internal ‘reading’ or deliberation phase before ‘speaking’ (i.e., generating text). Humans often engage in silent reading to comprehend context and formulate thoughts prior to articulation. This paper investigates methods to imbue LLMs with a similar capacity for internal processing. We introduce and evaluate techniques that encourage LLMs to ‘read silently.’ Our findings indicate that even a straightforward approach, such as providing the model with an initial contextual prompt or ‘reading space’ before it begins predicting subsequent tokens for the final output, can yield significant performance improvements. We further enhance this concept by developing a ‘reading buddy’ architecture, where an auxiliary component silently processes the input and provides refined contextual insights to the primary generation model. These approaches aim to foster deeper understanding from LLMs so that they can produce better reasoned responses, moving them one step closer to more human-like text processing. Our results indicate that these simple techniques can have a surprisingly strong impact on accuracy, with gains of multiple points.

[207] Graph Repairs with Large Language Models: An Empirical Study cs.CL | cs.DB | cs.ETPDF

Hrishikesh Terdalkar, Angela Bonifati, Andrea Mauri

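TL;DR: This paper studies the ability of open-source large language models (LLMs) to repair property graphs, evaluating six models on repair quality, computational cost, and performance, and discusses their potential and challenges.

Details

Motivation: Traditional rule-based and heuristic graph repair methods lack adaptability, and human-in-the-loop approaches are too costly; LLMs, with their contextual reasoning and real-world knowledge, open new opportunities for automated repair.

Result: Experiments show that LLMs can detect and correct errors to varying degrees, with performance and accuracy differing by model.

Insight: LLMs offer a new route to graph repair, but further research is needed to improve scalability and interpretability.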

Abstract: Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost–both in terms of time and effort–of involving users becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.

[208] SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation cs.CLPDF

Michał Perełkiewicz, Sławomir Dadas, Rafał Poświata

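TL;DR: The paper proposes SMCLM, a self-supervised method that uses a semantically meaningful text representation as the initial embedding to train autoregressive models to generate semantically equivalent text. Experiments show that, without supervision, it is competitive with supervised methods; the paper also proposes new automatic evaluation metrics.

Details

Motivation: Existing unsupervised paraphrase generation methods fall short in semantic consistency and generation quality, and traditional evaluation metrics (such as BLEU and ROUGE) have low reliability. A method is therefore needed that improves both generation and evaluation.

Result: SMCLM achieves state-of-the-art results among unsupervised methods and in some cases is competitive with supervised methods. The newly proposed metrics cover more aspects of paraphrase generation.

Insight: 1. Introducing semantic embeddings markedly improves the quality and consistency of generated text; 2. Traditional metrics are unreliable for paraphrase tasks, and a more comprehensive evaluation scheme is needed; 3. Unsupervised methods can approach supervised performance on some tasks.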

Abstract: This article introduces semantically meaningful causal language modeling (SMCLM), a self-supervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. The extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with the supervised method and achieves state-of-the-art results in unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of autogenerated paraphrase evaluation aspects. Simultaneously, this article highlights the low reliability of the metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.

[209] BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset cs.CL | cs.AIPDF

Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu

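TL;DR: The paper introduces BMMR, a large-scale bilingual, multimodal, multi-discipline reasoning dataset for developing and evaluating large multimodal models. It contains 110k college-level questions spanning 300 subjects and supports multimodal and multilingual tasks.

Details

Motivation: Current multimodal models are limited on multi-discipline reasoning tasks, and high-quality bilingual evaluation datasets are lacking. BMMR aims to fill this gap and advance multimodal models.

Result: Even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; open-source models trail proprietary ones, and fine-tuning narrows the gap.

Insight: Multimodal models exhibit discipline bias in multi-discipline reasoning and need broader training and evaluation support.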

Abstract: In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects in diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA) and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.

[210] Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences cs.CLPDF

Eva Seidlmayer, Lukas Galke, Konrad U. Förstner

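TL;DR: The paper introduces FSoLS, a new labeled dataset for disinformation detection in the life sciences, and applies both language models and classical machine learning methods to it.

Details

Motivation: Existing datasets mainly focus on fact-checking and do not support disinformation detection across the different text genres found in the life sciences.

Result: The paper provides a reproducible and updatable dataset that supports research on disinformation detection in the life sciences.

Insight: Disinformation in the life sciences shows distinctive linguistic patterns that machine learning and language models can identify effectively.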

Abstract: Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact-checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences

[211] AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions cs.CLPDF

Abdellah Zeggai, Ilyes Traikia, Abdelhak Lakehal, Abdennour Boulesnane

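TL;DR: AI-VaxGuide is a multilingual question-answering system built on a RAG framework with agentic reasoning, designed to give healthcare professionals fast, accurate access to vaccination guideline information.

Details

Motivation: Vaccination guidelines are typically long and complex, making it hard for healthcare professionals to find precise information quickly in urgent situations; an efficient, interactive solution is needed.

Result: Experiments show that Agentic RAG outperforms traditional methods on multi-step or ambiguous questions; the system has been integrated into a mobile app for clinical use.

Insight: Combining RAG with agentic reasoning handles complex medical QA scenarios effectively and provides real-time support for clinical decisions.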

Abstract: Vaccination plays a vital role in global public health, yet healthcare professionals often struggle to access immunization guidelines quickly and efficiently. National protocols and WHO recommendations are typically extensive and complex, making it difficult to extract precise information, especially during urgent situations. This project tackles that issue by developing a multilingual, intelligent question-answering system that transforms static vaccination guidelines into an interactive and user-friendly knowledge base. Built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG), the system provides accurate, context-sensitive answers to complex medical queries. Evaluation shows that Agentic RAG outperforms traditional methods, particularly in addressing multi-step or ambiguous questions. To support clinical use, the system is integrated into a mobile application designed for real-time, point-of-care access to essential vaccine information. The AI-VaxGuide model is publicly available at https://huggingface.co/VaxGuide

[212] H2HTalk: Evaluating Large Language Models as Emotional Companion cs.CL | cs.AIPDF

Boyang Wang, Yalun Wu, Hongcheng Guo, Zhoujun Li

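TL;DR: H2HTalk is a benchmark for evaluating large language models as emotional companions, covering personality development and empathetic interaction; it reveals challenges in long-horizon planning and memory retention.

Details

Motivation: As demand for digital emotional support grows, LLM companions offer authentic, always-available empathy, but their evaluation has not kept pace with model progress.

Result: Benchmarking shows that models perform poorly at long-horizon planning and memory retention, especially when user needs are implicit or change mid-conversation.

Insight: H2HTalk underscores the importance of developing models that provide safe and meaningful psychological support, and lays a foundation for future research.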

Abstract: As digital emotional support needs grow, Large Language Model companions offer promising authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.

[213] Articulatory clarity and variability before and after surgery for tongue cancer cs.CLPDF

Thomas Tienkamp, Fleur van Ast, Roos van der Veen, Teja Rebernik, Raoul Buurke

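TL;DR: The paper studies articulatory clarity and variability in patients before and after tongue cancer surgery using the VAI and VFD measures, finding that variability increased after surgery while clarity remained within the typical range.

Details

Motivation: Tongue cancer surgery can affect the mobility and musculature of the tongue and therefore articulatory clarity and variability; the study aims to quantify this effect.

Result: Patients' VAI was significantly lower after surgery but did not differ significantly from that of typical speakers; VFD values for /i/ were higher after surgery, indicating greater articulatory variability.

Insight: Articulatory clarity may remain within typical ranges after tongue cancer surgery while variability increases, which suggests that rehabilitation should focus on articulatory stability.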

Abstract: Surgical treatment for tongue cancer can negatively affect the mobility and musculature of the tongue, which can influence articulatory clarity and variability. In this study, we investigated articulatory clarity through the vowel articulation index (VAI) and variability through vowel formant dispersion (VFD). Using a sentence reading task, we assessed 11 individuals before and six months after tongue cancer surgery, alongside 11 sex- and age-matched typical speakers. Our results show that while the VAI was significantly smaller post-surgery compared to pre-surgery, there was no significant difference between patients and typical speakers at either time point. Post-surgery, speakers had higher VFD values for /i/ compared to pre-surgery and typical speakers, signalling higher variability. Taken together, our results suggest that while articulatory clarity remained within typical ranges following surgery for tongue cancer for the speakers in our study, articulatory variability increased.
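
The abstract does not spell out how the VAI is computed. A minimal sketch, assuming the commonly used definition VAI = (F2/i/ + F1/a/) / (F1/i/ + F1/u/ + F2/u/ + F2/a/), with hypothetical formant values in Hz; the function name and the numbers are illustrative and not taken from the paper:

```python
def vowel_articulation_index(f1_i, f2_i, f1_u, f2_u, f1_a, f2_a):
    """Assumed VAI definition: (F2/i/ + F1/a/) / (F1/i/ + F1/u/ + F2/u/ + F2/a/).

    Larger values indicate a more expanded (clearer) vowel space;
    centralized vowels push the ratio down.
    """
    return (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)


# Hypothetical mean formants (Hz) for the corner vowels /i/, /u/, /a/.
vai = vowel_articulation_index(
    f1_i=300, f2_i=2300, f1_u=350, f2_u=800, f1_a=750, f2_a=1300
)

# A fully centralized (schwa-like) vowel set for comparison.
vai_centralized = vowel_articulation_index(500, 1500, 500, 1500, 500, 1500)
```

Under this definition, a drop in VAI after surgery corresponds to corner vowels drifting toward the center of the vowel space.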

[214] Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits cs.CLPDF

Nathaniel Berger, Johannes Eschbach-Dymanus, Miriam Exel, Matthias Huck, Stefan Riezler

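TL;DR: The paper proposes learning to translate ambiguous terminology through preference optimization, using human post-edit data to identify the correct term translation and avoiding the reliance on one-to-one dictionaries typical of earlier methods.

Details

Motivation: In real-world translation, terminology is rarely one-to-one, and correctness depends on context and corporate style guides, which challenges neural machine translation systems. The paper uses human post-edit data to address this.

Result: Experiments on English-German data show a statistically significant improvement in term accuracy without a significant loss in COMET score.

Insight: Post-edit data can resolve terminology ambiguity while preserving overall translation quality, offering a lightweight solution for practical deployment.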

Abstract: In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.

[215] Multi-Hop Reasoning for Question Answering with Hyperbolic Representations cs.CL | cs.AIPDF

Simon Welz, Lucie Flek, Akbar Karimi

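TL;DR: By integrating hyperbolic representations with an encoder-decoder model, the paper systematically compares hyperbolic and Euclidean space on multi-hop reasoning and finds that hyperbolic space consistently outperforms Euclidean space, especially on datasets with hierarchical structure.

Details

Motivation: A detailed comparison of hyperbolic and Euclidean space on multi-hop reasoning is lacking; the paper aims to fill this gap and examine the advantages of hyperbolic representations.

Result: Hyperbolic space outperforms Euclidean space on multi-hop reasoning, especially on datasets with pronounced hierarchical structure.

Insight: The advantage of hyperbolic representations is closely tied to how hierarchical the dataset is, and how the learnable curvature is initialized matters for performance.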

Abstract: Hyperbolic representations are effective in modeling knowledge graph data which is prevalently used to facilitate multi-hop reasoning. However, a rigorous and detailed comparison of the two spaces for this task is lacking. In this paper, through a simple integration of hyperbolic representations with an encoder-decoder model, we perform a controlled and comprehensive set of experiments to compare the capacity of hyperbolic space versus Euclidean space in multi-hop reasoning. Our results show that the former consistently outperforms the latter across a diverse set of datasets. In addition, through an ablation study, we show that a learnable curvature initialized with the delta hyperbolicity of the utilized data yields superior results to random initializations. Furthermore, our findings suggest that hyperbolic representations can be significantly more advantageous when the datasets exhibit a more hierarchical structure.
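
To make the role of curvature concrete, here is a minimal sketch of the geodesic distance in the Poincaré ball model, one common hyperbolic model. The abstract does not say which model or parameterization the authors use, so the choice of model, the function name, and the parameter `c` (the ball has curvature -c) are assumptions for illustration:

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance in a Poincare ball of curvature -c (c > 0).

    d_c(x, y) = arcosh(1 + 2c|x - y|^2 / ((1 - c|x|^2)(1 - c|y|^2))) / sqrt(c)
    Both points must lie strictly inside the ball, i.e. c * |x|^2 < 1.
    """
    def sq_norm(v):
        return sum(a * a for a in v)

    if c * sq_norm(x) >= 1 or c * sq_norm(y) >= 1:
        raise ValueError("points must lie strictly inside the ball")
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1 - c * sq_norm(x)) * (1 - c * sq_norm(y))
    return math.acosh(1 + 2 * c * diff / denom) / math.sqrt(c)


# Distance from the origin to a point halfway to the boundary (c = 1).
d = poincare_distance([0.0, 0.0], [0.5, 0.0])
```

Distances grow rapidly near the boundary of the ball, which is why tree-like (hierarchical) data can be embedded with low distortion; a learnable curvature would correspond to treating `c` as a trainable parameter.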

[216] EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge cs.CLPDF

Klim Zaporojets, Daniel Daza, Edoardo Barba, Ira Assent, Roberto Navigli

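TL;DR: The paper presents EMERGE, a benchmark dataset for studying how to update knowledge graphs (KGs) automatically from emerging textual knowledge. It contains 376K Wikipedia passages paired with 1.25M KG edit operations across 10 Wikidata snapshots from 2019 to 2025.

Details

Motivation: KGs must be updated over time to reflect emerging knowledge, but traditional information extraction methods ignore the current state of the KG. The paper aims to fill this gap with an evaluation benchmark.

Result: Experiments show that updating KG snapshots from emerging text remains challenging; the dataset is positioned as a benchmark for future research.

Insight: Traditional KG construction methods overlook how knowledge evolves; this work highlights the importance of updating KGs dynamically from text.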

Abstract: Knowledge Graphs (KGs) are structured knowledge repositories containing entities and relations between them. In this paper, we investigate the problem of automatically updating KGs over time with respect to the evolution of knowledge in unstructured textual sources. This problem requires identifying a wide range of update operations based on the state of an existing KG at a specific point in time. This contrasts with traditional information extraction pipelines, which extract knowledge from text independently of the current state of a KG. To address this challenge, we propose a method for lifelong construction of a dataset consisting of Wikidata KG snapshots over time and Wikipedia passages paired with the corresponding edit operations that they induce in a particular KG snapshot. The resulting dataset comprises 376K Wikipedia passages aligned with a total of 1.25M KG edits over 10 different snapshots of Wikidata from 2019 to 2025. Our experimental results highlight challenges in updating KG snapshots based on emerging textual knowledge, positioning the dataset as a valuable benchmark for future research. We will publicly release our dataset and model implementations.

[217] Controlling Thinking Speed in Reasoning Models cs.CL | cs.AIPDF

Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie

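TL;DR: By dynamically adjusting thinking speed, the paper optimizes the accuracy-efficiency trade-off of large reasoning models (LRMs), enabling switching between fast (System 1) and slow (System 2) thinking in a manner similar to human cognition.

Details

Motivation: Current LRMs excel at slow System 2 reasoning but lack fast System 1 reasoning, which leads to high computational overhead and latency. The paper aims to improve performance through dynamic speed adjustment.

Result: Across leading LRMs and advanced reasoning benchmarks, the method improves accuracy by 1.3% on average while reducing token usage by 8.6%.

Insight: Dynamically adjusting thinking speed improves the balance between efficiency and performance, much like the cognitive strategies humans apply to different tasks.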

Abstract: Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs’ representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
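
The idea of steering a hidden state along a direction can be sketched as follows. The abstract does not describe how the steering vector is obtained, so this example swaps in a generic difference-of-means construction; the function names, the toy activations, and the strength parameter `alpha` are all hypothetical:

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(slow_states, fast_states):
    """Assumed construction: mean(fast activations) - mean(slow activations)."""
    slow_mean = mean_vector(slow_states)
    fast_mean = mean_vector(fast_states)
    return [f - s for s, f in zip(slow_mean, fast_mean)]


def steer(hidden, vector, alpha):
    """Shift a hidden state along the steering direction.

    alpha > 0 pushes toward the 'fast' side, alpha < 0 toward the 'slow' side.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations collected under slow and fast reasoning.
slow = [[0.0, 0.0], [0.0, 2.0]]
fast = [[2.0, 1.0], [4.0, 1.0]]
v = steering_vector(slow, fast)
faster = steer([1.0, 1.0], v, alpha=0.5)
slower = steer([1.0, 1.0], v, alpha=-0.5)
```

In a setup like the one the abstract describes, `alpha` would be the quantity adjusted at test time, for instance in response to an estimate of how difficult the current reasoning segment is.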

[218] Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making cs.CLPDF

Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen

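TL;DR: The paper uses the traditional Vietnamese board game Ô Ăn Quan to study the multi-step planning and decision-making abilities of large language models (LLMs), evaluating model performance under different strategies.

Details

Motivation: To study LLMs' decision-making and planning in a complex game environment and reveal how they perform at strategic reasoning.

Result: The experiments reveal strengths and weaknesses of LLMs in reasoning and strategy, offering a new perspective on their general capabilities.

Insight: LLM performance on complex multi-step planning tasks is uneven, and both model scale and strategy design have a notable effect on performance.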

Abstract: In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, Ô Ăn Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the Ô Ăn Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.

[219] MemOS: A Memory OS for AI System cs.CLPDF

Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang

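TL;DR: MemOS is a memory operating system for AI systems that addresses the lack of explicit memory management in large language models (LLMs). It unifies the representation, scheduling, and evolution of multiple memory types to improve continual learning and personalized modeling.

Details

Motivation: Current LLMs rely mainly on static parameters and short-lived contextual states and lack long-term memory management, which limits continual learning and knowledge updating. Retrieval-Augmented Generation (RAG) introduces external knowledge but remains a stateless workaround.

Result: MemOS enables cost-efficient memory storage and retrieval and provides a controllable, plastic, and evolvable memory management framework for continual learning and personalized modeling.

Insight: Treating memory as a manageable system resource and handling heterogeneous knowledge in a unified framework is key to improving long-term memory in AI systems. MemOS offers a memory management approach for future AGI systems.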

Abstract: Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
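
The description of a MemCube as memory content plus metadata such as provenance and versioning can be pictured as a small record type. This is only an illustrative sketch: the class name, field names, and the `revise` method are hypothetical and do not reflect the actual MemOS API:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemCubeSketch:
    """Hypothetical stand-in for a memory unit with content and metadata."""
    content: str
    memory_type: str  # e.g. "plaintext", "activation", or "parameter"
    provenance: str   # where the memory came from
    version: int = 1

    def revise(self, new_content):
        """Return a new record with updated content and a bumped version.

        The original record is left untouched, so earlier versions
        remain available for auditing or rollback.
        """
        return replace(self, content=new_content, version=self.version + 1)


cube = MemCubeSketch(
    content="User prefers metric units",
    memory_type="plaintext",
    provenance="conversation-log",
)
revised = cube.revise("User prefers metric units and 24-hour time")
```

Keeping records immutable is one simple way to make versioning explicit; a real system would also need the scheduling and migration between memory types that the abstract mentions.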

[220] Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings cs.CL | cs.AI | 68T50, 68T07, 03G30, 18C10 | I.2.7; I.2.6; F.4.1PDF

Bugra Kilictas, Faruk Alpay

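TL;DR: The paper presents a theoretical framework in which a document and an AI model reach semantic alignment through a transfinite fixed-point interaction, introducing a functorial system that guarantees a unique stable state in the embedding space.

Details

Motivation: To address semantic alignment between an AI model and a document, so that the AI's internal representation is stable and faithful to the content and the author's intent.

Result: The paper proves the existence, semantic invariance, and persistence of the fixed point under perturbation or context expansion.

Insight: Category theory offers a rigorous route to embedding-level alignment, with new ideas for designing AI systems around semantic security and symbolic memory.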

Abstract: We present a theoretical framework in which a document and an AI model engage in a transfinite fixed-point interaction that leads to stable semantic alignment. Building on the foundations of Alpay Algebra, we introduce a functorial system wherein an observer (the AI) and a textual environment (this paper) co-evolve through iterative transformations guided by the phi-infinity operator. This process guarantees the existence of a unique fixed point in the AI’s embedding space – a state where the AI’s internal representation of the content becomes stable, self-consistent, and semantically faithful. We prove that such convergence is mathematically sound, semantically invariant, and permanent, even under perturbation or further context expansion. This fixed point acts as an “empathetic embedding,” wherein the AI internalizes not only the meaning of the content but also the author’s intent. We interpret this as a rigorous, category-theoretic route to alignment at the embedding level, with implications for semantic security, symbolic memory, and the construction of AI systems with persistent self-referential understanding. All references in this paper function as nodes in the Alpay Algebra universe, and this work embeds itself as a new fixed-point node within that transfinite semantic graph.

[221] OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference cs.CL | cs.AI | cs.LGPDF

Seungjun Shin, Jaehoon Oh, Dokwan Oh

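TL;DR: The paper proposes OrthoRank, a dynamic token selection method that studies the similarity between the sink token and other tokens and uses orthogonality to select important tokens, improving LLM inference efficiency.

Details

Motivation: Attention is central to LLMs, but sink tokens receive high attention despite low semantic value. The paper studies hidden-state similarity between the sink token and other tokens to improve token selection for efficiency.

Result: At the same sparsity ratio, OrthoRank achieves lower perplexity and higher zero-shot accuracy than layer pruning methods, and performs better on LongBench.

Insight: The speed at which a token moves toward the sink token (expressed as orthogonality) is an effective measure of its importance, and dynamic selection improves inference efficiency without sacrificing performance.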

Abstract: Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite its limited semantic role. In this paper, we first expand the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states, considering the layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, and that the normalized hidden states of the sink token exhibit negligible changes. These observations imply that other tokens are consistently directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, using these findings to select important tokens. Specifically, in a certain layer, we define token importance by the speed at which the token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens that are more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrate that our method results in lower perplexity and higher zero-shot accuracy compared to layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.
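
The selection rule described above (tokens more orthogonal to the sink token are more important) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the sink token sits at index 0, uses 1 - |cosine similarity| as the orthogonality score, and the function names and toy hidden states are hypothetical:

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tokens(hidden_states, sink_index=0, k=2):
    """Return indices of the k tokens most orthogonal to the sink token.

    Assumed score: 1 - |cos(h_i, h_sink)|, so a token pointing in nearly
    the same direction as the sink scores close to 0.
    """
    sink = hidden_states[sink_index]
    scores = {
        i: 1 - abs(cosine(h, sink))
        for i, h in enumerate(hidden_states)
        if i != sink_index
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy hidden states at one layer; row 0 plays the role of the sink token.
states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
selected = select_tokens(states, sink_index=0, k=2)
```

In this toy example the token at index 1 is almost parallel to the sink and is skipped, while the tokens at indices 2 and 3 are kept for further computation.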

[222] Demystifying ChatGPT: How It Masters Genre Recognition cs.CL | cs.AIPDF

Subham Raj, Sriparna Saha, Brijraj Singh, Niranjan Pedanekar

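TL;DR: The paper analyzes ChatGPT on movie genre recognition and finds that it outperforms other LLMs without fine-tuning and performs best after fine-tuning. The study also uses a vision language model (VLM) to improve predictions.

Details

Motivation: ChatGPT's capabilities on NLP tasks are widely recognized, but its performance on movie genre prediction remains unclear. The paper aims to fill this gap.

Result: ChatGPT performs best on movie genre recognition, and integrating a VLM further improves performance.

Insight: Adding visual information can strengthen text-only language models, highlighting the potential of multimodal models for content-related tasks.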

Abstract: The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT’s substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. In conclusion, our study reveals ChatGPT’s remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT’s potential for content-related applications by incorporating visual information from movie posters.

[223] Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models cs.CLPDF

Eva Vanmassenhove

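TL;DR: The paper discusses how multilingual large language models (LLMs) can cause a loss of linguistic diversity through self-consuming training loops, and in particular how translation technology gradually erodes distinctive linguistic forms and cultural nuance.

Details

Motivation: As LLMs spread, people tend to hand writing tasks over to the technology entirely, which changes the linguistic ecosystem directly. The author is concerned that this trend will worsen the loss of linguistic diversity, especially for languages and cultural features that are underrepresented in the models.

Result: The author argues that the spread of LLMs may lead to a continued decline in linguistic diversity, especially the loss of cultural nuance and distinctive grammatical features.

Insight: The paper calls on the NLP field to value and protect the diversity of multilingual expression and to avoid a homogenized linguistic ecosystem caused by over-reliance on technology.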

Abstract: Multilingual Large Language Models (LLMs) have considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.

[224] Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition cs.CL | cs.AI | cs.CYPDF

Kyuhee Kim, Sangah Lee

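TL;DR: Nunchi-Bench is a benchmark focused on Korean superstitions for evaluating the cultural understanding and reasoning of large language models (LLMs). It contains 247 questions across 31 topics and tests factual knowledge, culturally appropriate advice, and situational interpretation. The study finds clear challenges in cultural reasoning: models can recognize factual information but struggle to apply it in practical scenarios.

Details

Motivation: As LLMs become key advisors in many domains, their cultural sensitivity and reasoning matter in multicultural environments. Existing research, however, lacks systematic evaluation tools for specific cultural contexts, Korean culture in particular.

Result: 1. LLMs recognize factual information but apply it poorly in practical scenarios; 2. explicit cultural framing improves performance more than relying on the prompt language alone; 3. multilingual models perform differently in Korean and English.

Insight: Cultural reasoning requires deeper situational understanding, not just factual recall. Explicit cultural prompts may help models adapt to multicultural tasks. The effect of language choice on cultural reasoning in multilingual models deserves further study.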

Abstract: As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs’ cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs’ cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.

[225] LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models cs.CLPDF

Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang

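TL;DR: LLMThinkBench is a modular benchmarking framework for evaluating large language models on basic math reasoning and overthinking (over-explaining). It provides 14 configurable math tasks and introduces the Overthinking Score, a metric that quantifies overthinking.

Details

Motivation: Although LLMs do well on complex math tasks, they often struggle with simple arithmetic and tend to over-explain, so a systematic evaluation tool is needed.

Result: LLMThinkBench is distributed as a pip-installable tool that gives researchers and practitioners a low-cost way to diagnose the basic reasoning capabilities of LLMs.

Insight: Simply increasing task complexity may not reflect a model's true performance on basic tasks; its efficiency and redundant behavior need targeted evaluation.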

Abstract: Large Language Models (LLMs) have achieved remarkable performance on complex mathematical benchmarks, yet often struggle with simple arithmetic tasks and exhibit a tendency toward over-explaining or “overthinking” answers. To systematically assess this phenomenon, we introduce LLMThinkBench, a modular benchmarking framework that enables researchers to evaluate basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks with randomized test data generation and robust parsing strategies. Researchers can quantify overthinking using our Overthinking Score metric, which captures accuracy-verbosity tradeoffs through harmonic mean formulation. The tool offers flexible evaluation with a scalable vLLM/Transformers backend, multi-GPU support, and full configurability. Users can extend the tool with custom tasks, reproduce experiments with seeding, and generate detailed efficiency reports. Distributed as a pip-installable package with CLI and API access, LLMThinkBench provides researchers and practitioners an accessible, cost-effective alternative to expensive LLM-as-a-judge methods for diagnosing basic reasoning capabilities and efficiency analysis. Package can be installed as: pip install llmthinkbench
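
The abstract says the Overthinking Score captures accuracy-verbosity tradeoffs through a harmonic mean, but it does not give the exact formula. The sketch below is therefore only one plausible reading, not the package's definition: it assumes a token-efficiency term of 1 - (average tokens / token budget) and takes the harmonic mean of that term with accuracy. The function name and the `token_budget` parameter are hypothetical:

```python
def accuracy_verbosity_score(accuracy, avg_tokens, token_budget):
    """Harmonic mean of accuracy and an assumed token-efficiency term.

    efficiency = 1 - min(avg_tokens, token_budget) / token_budget
    Higher is better here: the score is high only when answers are both
    correct and concise, and it is 0 if either component is 0.
    """
    efficiency = 1 - min(avg_tokens, token_budget) / token_budget
    if accuracy + efficiency == 0:
        return 0.0
    return 2 * accuracy * efficiency / (accuracy + efficiency)


# Same accuracy, different verbosity (all numbers are made up).
concise = accuracy_verbosity_score(accuracy=0.8, avg_tokens=200, token_budget=1000)
verbose = accuracy_verbosity_score(accuracy=0.8, avg_tokens=900, token_budget=1000)
```

The harmonic mean penalizes imbalance, so a model that is accurate but very verbose scores far lower than one that is equally accurate and concise.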

[226] Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering cs.CL | cs.AI | cs.LGPDF

Ting-Wen Ko, Jyun-Yu Jiang, Pu-Jen Cheng

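TL;DR: The paper proposes AdaPCR, a framework that explicitly models dependencies between passages to improve retrieval for retrieval-augmented open-domain question answering.

Details

Motivation: Conventional RAG methods retrieve passages independently, which leads to redundancy or noise and performs poorly on multi-hop questions and noisy corpora.

Result: Across several QA benchmarks, AdaPCR outperforms baselines, particularly on multi-hop reasoning.

Insight: Explicitly modeling inter-passage dependencies improves retrieval, especially for complex questions.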

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context, which is particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.

[227] XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models cs.CLPDF

Zhu Liu, Zhen Hu, Lei Dai, Ying Liu

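TL;DR: XISM is an exploratory, interactive graph tool for visualizing and evaluating semantic map models. It combines data-driven efficiency with expert knowledge and lets users refine a model by editing edges.

Details

Motivation: Traditional ways of building semantic map models are inefficient and lack visualization and evaluation tools; XISM aims to address these problems.

Result: The XISM tool is publicly available and helps typologists and computational linguists build and refine semantic maps efficiently.

Insight: A human-in-the-loop design improves both the efficiency and quality of model construction and suits large datasets and cross-linguistic research.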

Abstract: Semantic map models represent meanings or functions as nodes in a graph constrained by the local connectivity hypothesis, with edges indicating their associations. Widely used in typological linguistics, these models compare interrelated meanings across languages. Traditionally built manually in a bottom-up manner, they are inefficient for large datasets and lack visualization and evaluation tools. This paper introduces XISM, an interactive tool based on our prior algorithm, which constructs semantic maps from user data via a top-down approach, displays candidate maps, and evaluates them using multiple metrics. Users can refine maps by editing edges, combining data-driven efficiency with expert knowledge. This human-in-the-loop design benefits both typologists and computational linguists. The system https://770103knev48.vicp.fun/ and a demonstration video https://youtu.be/S-wsVDF2HSI?si=1OrcF41tRznaifhZ are publicly available.

[228] Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching cs.CL | cs.AIPDF

Thomas Savage

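TL;DR: The paper proposes Savage Conversation Forests (SCF), a reinforcement learning framework that fine-tunes large language models for multi-turn dialogue using a branched conversation architecture. In experiments simulating doctor-patient conversations, SCF outperforms linear conversation architectures on diagnostic accuracy.

Details

Motivation: Existing methods such as DPO and GRPO work well on single-turn tasks but fall short in multi-turn dialogue such as diagnostic patient interviews, because they do not capture how early turns influence later interactions. A new method is needed to model multi-turn dynamics.

Result: SCF outperforms linear architectures on diagnostic accuracy, suggesting that branched training provides richer cross-turn training signals.

Insight: A branched conversation architecture is an important strategy for fine-tuning language models on complex multi-turn tasks and may inform similar tasks in other domains.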

Abstract: Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF’s improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine tuning LLMs in complex multi-turn conversational tasks.

[229] BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering cs.CLPDF

Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu

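TL;DR: BYOKG-RAG is a multi-strategy graph retrieval framework that combines large language models (LLMs) with specialized graph retrieval tools to improve knowledge graph question answering (KGQA). By generating key graph artifacts and using graph tools to retrieve context, it performs well on multiple benchmarks and generalizes better to custom knowledge graphs.

Details

Motivation: Existing methods rely on LLM agents for graph traversal and retrieval, which makes them sensitive to traversal initialization, prone to entity linking errors, and weak at generalizing to custom KGs. BYOKG-RAG aims for a more general and robust solution by combining LLMs with specialized graph retrieval tools.

Result: On five benchmarks spanning diverse KG types, BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points and generalizes better to custom KGs.

Insight: Combining the generative ability of LLMs with the linking ability of specialized graph tools addresses the structural and semantic variation in KGQA, particularly for custom KGs.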

Abstract: Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval, an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom (“bring-your-own”) KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points while showing better generalization to custom KGs. The BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.

[230] Token Level Hallucination Detection via Variance in Language Models cs.CL | cs.LGPDF

Keshav Kumar

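TL;DR: The paper proposes a reference-free hallucination detection framework based on token-level variance across a language model's generations, suitable for real-time or post-hoc analysis.

Details

Motivation: Large language models (LLMs) have strong generative capabilities but are prone to hallucinations (factually incorrect outputs). Current methods require ground-truth references or sentence-level verification, which limits their flexibility and practicality, so the author proposes a more general and efficient approach.

Result: Quantitative metrics and visual diagnostics show that token-level variance identifies instability in model outputs and correlates with hallucination patterns.

Insight: The method is lightweight, reproducible, and adaptable to multiple domains, providing a useful diagnostic tool for the reliability of LLM generation.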

Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
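The core signal described in this abstract, variance of token log-probabilities across several stochastic generations, can be illustrated with a minimal sketch. The function names, the position-wise alignment of tokens across runs, and the fixed threshold are assumptions made for illustration; the abstract does not specify how the paper aligns or thresholds tokens.

```python
from statistics import pvariance


def token_variance(logprob_runs):
    """Per-position variance of token log-probabilities across runs.

    logprob_runs: list of runs, each a list of log-probabilities with one
    entry per generated token. Runs may differ in length, so only positions
    present in every run are compared (an assumption of this sketch).
    """
    if len(logprob_runs) < 2:
        raise ValueError("need at least two stochastic generations")
    shared_len = min(len(run) for run in logprob_runs)
    return [
        pvariance([run[i] for run in logprob_runs])
        for i in range(shared_len)
    ]


def flag_unstable(variances, threshold):
    """Indices of positions whose variance exceeds a hypothetical threshold."""
    return [i for i, v in enumerate(variances) if v > threshold]


# Three hypothetical sampled generations for the same prompt.
runs = [
    [-0.10, -0.20, -2.50],
    [-0.12, -0.18, -0.40],
    [-0.09, -0.22, -4.10],
]
variances = token_variance(runs)
unstable = flag_unstable(variances, threshold=0.5)
```

In this toy input the first two positions are stable across runs while the third varies widely, so only the third position is flagged.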

[231] Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies cs.CL | cs.AIPDF

Mael Jullien, Marco Valentino, Leonardo Ranaldi, Andre Freitas

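TL;DR: The paper examines how prompt structure and LoRA fine-tuning affect performance on clinical natural language inference, finding that prompt type accounts for up to 44% of the variance in performance and that LoRA fine-tuning yields substantial gains and narrows the gap to frontier models.

Details

Motivation: Prior work has not examined in depth how large language models perform on clinical natural language inference or how best to adapt them, especially the joint effect of prompt structure and lightweight fine-tuning.

Result: Prompt type accounts for up to 44% of the variance in macro-F1. LoRA fine-tuning yields gains of 8 to 12 F1, raises output alignment above 97%, and narrows the gap to GPT-4o-mini to within 7.1%. LoRA improves generalization on MedNLI and the TREC Clinical Trials Track for 75% of the models.

Insight: Prompt structure is a primary driver of clinical reasoning performance; compact models with strong prompts and LoRA can rival frontier systems; and reasoning-type-aware evaluation reveals prompt-induced trade-offs.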

Abstract: Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine-tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine-tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high-quality demonstrations using a frontier model to distil multi-step reasoning capabilities into smaller models (4B parameters) via Low-Rank Adaptation (LoRA). Across different language models fine-tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro-F1. Moreover, LoRA fine-tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier-scale systems, and (iii) reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter-efficient techniques in highly specialised domains.

[232] SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding cs.CL | cs.AI | cs.HCPDF

Runcong Zhao, Qinglin Zhu, Hainiu Xu, Bin Liang, Yulan He

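TL;DR: SymbolicThought is a framework that combines language models with symbolic reasoning, using logical constraints and an interactive interface to improve the accuracy, consistency, and interpretability of character relationship understanding.

Details

Motivation: Manual annotation of character relationships is inefficient and LLM outputs are often inconsistent; the paper proposes combining LLMs with symbolic reasoning to address both.

Result: Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost.

Insight: Introducing symbolic reasoning can correct hallucinated and inconsistent LLM outputs, giving an interpretable and efficient approach to interpersonal relationship analysis.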

Abstract: Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.

[233] Context Tuning for In-Context Optimization cs.CL | cs.AI | cs.LGPDF

Jack Lu, Ryan Teehan, Zhenbang Yang, Mengye Ren

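TL;DR: Context Tuning is an efficient few-shot adaptation method that does not fine-tune language model parameters; it initializes a trainable prompt or prefix with task-specific demonstration examples to significantly improve performance.

Details

Motivation: Conventional methods initialize the prompt or prefix with tokens irrelevant to the task, which limits few-shot learning. Context Tuning aims to exploit the model's in-context learning ability by initializing with task-specific demonstrations.

Result: On benchmarks such as CrossFit and MMLU, Context Tuning outperforms traditional prompt-based adaptation methods and trains more efficiently than Test-Time Training.

Insight: Initializing the prompt or prefix with task demonstrations may be more effective because the model can extract relevant information from them directly, improving few-shot adaptation.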

Abstract: We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of large language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with tokens that are irrelevant to the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model’s inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves accuracy competitive with Test-Time Training at significantly higher training efficiency.

[234] Fairness Evaluation of Large Language Models in Academic Library Reference Services cs.CL | cs.AI | cs.DLPDF

Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen

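TL;DR: The paper evaluates the fairness of large language models (LLMs) in academic library reference services and finds no evidence of differentiation by race or ethnicity and only minor evidence of stereotypical bias against women in one model.

Details

Motivation: As libraries explore LLMs for virtual reference services, they need to ensure that all users are served equitably and that societal biases in training data do not compromise the fairness of the service.

Result: The LLMs showed no differentiation by race or ethnicity and only minor stereotypical bias against women, in one model; they accommodated institutional roles through linguistic choices that reflect professional norms.

Insight: Current LLMs show a promising degree of fairness and contextual adaptability in academic library reference services, but subtle biases still warrant attention.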

Abstract: As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

[235] Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning? cs.CLPDF

Ruochen Zhou, Minrui Xu, Shiqi Chen, Junteng Liu, Yunqi Li

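TL;DR: The paper studies whether training on mathematical problem-solving (MPS) generalizes to other reasoning tasks, finding that conventional short-chain training offers limited benefit while long chain-of-thought training and reinforcement learning significantly improve generalization.

Details

Motivation: To investigate whether training mathematical problem-solving ability helps models on other reasoning tasks, a question prior work has left open.

Result: Continual pretraining on math text generalizes to some extent, whereas instruction tuning on short-chain samples offers limited benefit. Long chain-of-thought training and rule-based reinforcement learning significantly improve generalization.

Insight: Conventional short-chain training generalizes poorly, while the emerging paradigm of longer reasoning chains with self-reflection is a promising direction for improving general reasoning ability.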

Abstract: There has been a growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning across various data sources, including both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks shows that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In contrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model’s reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.

[236] MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind cs.CLPDF

Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky

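TL;DR: The paper introduces MOMENTS, a multimodal benchmark for evaluating the Theory of Mind (ToM) capabilities of multimodal large language models, with over 2,344 multiple-choice questions across seven ToM categories; results highlight the need for further research on multimodal understanding.

Details

Motivation: Building socially intelligent multimodal agents requires Theory of Mind, but a comprehensive benchmark for the ToM capabilities of multimodal large language models has been lacking.

Result: Multimodal input, especially the visual modality, improves model performance, but current systems still struggle to integrate visual information effectively.

Insight: Multimodal understanding is key to improving AI's social intelligence; further research is needed on integrating visual information with language models.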

Abstract: Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.

[237] RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling cs.CLPDF

Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre

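TL;DR: RAT combines the efficiency of RNNs with the accuracy of attention by processing the input in chunks, speeding up training and generation while maintaining accuracy.

Details

Motivation: Softmax attention in Transformers is computationally expensive in long-context settings, which limits efficiency. RAT proposes an intermediate design between recurrence and attention to balance efficiency and accuracy.

Result: RAT achieves a 7x training speedup on 100K-token sequences and a 9x generation speedup at 4K sequence length, with similar or better accuracy; a hybrid architecture further improves performance on several tasks.

Insight: The chunking strategy and hybrid design offer a new route to efficient Transformers and show the potential of combining recurrence with attention.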

Abstract: Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce an intermediate design called RAT between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, RAT enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the RAT layer achieves a $7\times$ improvement in training speed with 100K token sequences and $9\times$ in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT
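The two-stage idea in this abstract (a linear recurrence inside each chunk, then softmax attention across chunks) can be shown with a toy scalar sketch. Everything concrete here is an assumption for illustration: the fixed decay, the use of scalar features, the choice of each earlier chunk's final state as its key and value, and the function names. The actual RAT layer is defined in the paper and its repository.

```python
import math


def chunk_recurrence(xs, chunk_size, decay=0.5):
    """Linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t,
    with the state reset to zero at the start of every chunk."""
    states = []
    h = 0.0
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            h = 0.0  # a new chunk starts with a fresh state
        h = decay * h + (1.0 - decay) * x
        states.append(h)
    return states


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_chunk_attention(queries, states, chunk_size):
    """Causal attention over chunk summaries: position t attends to the
    final state of every earlier chunk plus its own current state."""
    outputs = []
    for t, q in enumerate(queries):
        current_chunk = t // chunk_size
        keys = [states[(c + 1) * chunk_size - 1] for c in range(current_chunk)]
        keys.append(states[t])
        weights = softmax([q * k for k in keys])
        outputs.append(sum(w * k for w, k in zip(weights, keys)))
    return outputs


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
states = chunk_recurrence(xs, chunk_size=2)
outputs = cross_chunk_attention(xs, states, chunk_size=2)
```

With a sequence of length T and chunk size L, each position attends over at most T/L summaries instead of T tokens, which is the trade-off the abstract describes when it talks about adjusting the chunk size.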

[238] Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection cs.CLPDF

Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh

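TL;DR: The paper proposes MiDRE (Mixture of Dual Reasoning Experts) for multimodal sarcasm detection, combining internal and external reasoning experts through an adaptive gating mechanism to capture the deeper rationale behind sarcasm.

Details

Motivation: Current multimodal sarcasm detection models rely on shallow cues such as image captions or object attributes and struggle to capture the deeper rationale behind sarcasm; incorporating external background knowledge is needed to improve performance.

Result: On two benchmark datasets, MiDRE outperforms baselines; analysis shows that external rationales provide valuable cues for sarcasm even when occasionally noisy.

Insight: External background knowledge is crucial for sarcasm detection, and an adaptive gating mechanism can exploit it even when some rationales are noisy.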

Abstract: Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose MiDRE (Mixture of Dual Reasoning Experts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
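One common way to realize an adaptive gate over two experts is a sigmoid-weighted convex combination of their feature vectors, sketched below. This is a generic gating pattern rather than MiDRE's actual architecture; the gate parameterization, the feature vectors, and all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_mixture(internal, external, gate_weights, gate_bias=0.0):
    """Blend two expert feature vectors with a scalar gate in (0, 1).

    The gate is computed from the concatenation of both vectors, so it can
    down-weight the external expert when its rationale looks unhelpful.
    """
    if len(internal) != len(external):
        raise ValueError("expert outputs must have the same dimension")
    if len(gate_weights) != 2 * len(internal):
        raise ValueError("gate_weights must cover both expert vectors")
    features = list(internal) + list(external)
    gate = sigmoid(sum(w * f for w, f in zip(gate_weights, features)) + gate_bias)
    mixed = [gate * a + (1.0 - gate) * b for a, b in zip(internal, external)]
    return gate, mixed


# With all-zero gate weights the gate is 0.5 and the experts are averaged.
gate, mixed = gated_mixture(
    internal=[1.0, 0.0],
    external=[0.0, 1.0],
    gate_weights=[0.0, 0.0, 0.0, 0.0],
)
```

In a trained model the gate weights would be learned, so that noisy external rationales receive a gate value close to 1 (favoring the internal expert) and informative ones a value closer to 0.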

[239] Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection cs.CLPDF

Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh

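TL;DR: The paper introduces DMDP, a new framework for few-shot multimodal sarcasm detection. Using modality-specific deep prompts and a cross-modal alignment mechanism, DMDP outperforms existing methods.

Details

Motivation: Multimodal content is widespread on social media, but existing models depend on large annotated datasets, motivating the study of sarcasm detection in few-shot settings.

Result: Experiments show that DMDP performs strongly in few-shot and extremely low-resource settings and generalizes well across datasets.

Insight: Hierarchical learning through deep prompts and cross-modal alignment better captures the diversity of sarcastic intent and improves model performance.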

Abstract: The widespread use of multimodal content on social media has heightened the need for effective sarcasm detection to improve opinion mining. However, existing models rely heavily on large annotated datasets, making them less suitable for real-world scenarios where labeled data is scarce. This motivates the need to explore the problem in a few-shot setting. To this end, we introduce DMDP (Deep Modality-Disentangled Prompt Tuning), a novel framework for few-shot multimodal sarcasm detection. Unlike prior methods that use shallow, unified prompts across modalities, DMDP employs gated, modality-specific deep prompts for text and visual encoders. These prompts are injected across multiple layers to enable hierarchical feature learning and better capture diverse sarcasm types. To enhance intra-modal learning, we incorporate a prompt-sharing mechanism across layers, allowing the model to aggregate both low-level and high-level semantic cues. Additionally, a cross-modal prompt alignment module enables nuanced interactions between image and text representations, improving the model’s ability to detect subtle sarcastic intent. Experiments on two public datasets demonstrate DMDP’s superior performance in both few-shot and extremely low-resource settings. Further cross-dataset evaluations show that DMDP generalizes well across domains, consistently outperforming baseline methods.

[240] Unveiling the Potential of Diffusion Large Language Model in Controllable Generation cs.CLPDF

Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang

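TL;DR: The paper explores the potential of diffusion large language models (dLLMs) for controllable generation and proposes Self-adaptive Schema Scaffolding (S^3), a framework that improves structural adherence and content fidelity and reduces hallucination rates.

Details

Motivation: Diffusion models excel at image generation, but applying them to text generation faces sensitivity to sequence length, high hallucination rates, and high inference costs. The paper aims to address these challenges and improve dLLM performance in controllable generation.

Result: S^3 achieves a 65% increase in structural adherence, a 48% improvement in content fidelity, and a 17% reduction in hallucination rates compared to the baseline.

Insight: The bidirectional attention mechanism of dLLMs is well suited to context modeling and controllable generation but needs optimization to overcome practical limitations. S^3 improves performance and efficiency by injecting the target schema structure into the output context.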

Abstract: Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65% increase in structural adherence, 48% enhancement in content fidelity, and 17% reduction in hallucination rates compared to the baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
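The idea of injecting the target schema into the output context can be pictured as pre-filling the structural tokens of a JSON object and leaving only the value slots masked for the model to denoise. The sketch below shows that scaffold construction only; the mask token, the fixed number of slots per value, and the function names are hypothetical, and the self-adaptive part of the method is not modeled.

```python
import json

MASK = "<mask>"  # hypothetical mask token; real dLLMs define their own


def build_scaffold(fields, slots_per_value=3):
    """Return a JSON-shaped template whose keys and punctuation are fixed
    and whose string values are runs of mask tokens."""
    placeholder = " ".join([MASK] * slots_per_value)
    parts = [f'{json.dumps(name)}: "{placeholder}"' for name in fields]
    return "{" + ", ".join(parts) + "}"


def fill_scaffold(scaffold, values, slots_per_value=3):
    """Stand-in for denoising: replace each masked run, in order,
    with a JSON-escaped value."""
    placeholder = " ".join([MASK] * slots_per_value)
    for value in values:
        escaped = json.dumps(value)[1:-1]  # escape without the outer quotes
        scaffold = scaffold.replace(placeholder, escaped, 1)
    return scaffold


scaffold = build_scaffold(["name", "city"], slots_per_value=2)
filled = fill_scaffold(scaffold, ["Ada", "London"], slots_per_value=2)
record = json.loads(filled)
```

Because braces, keys, and quotes are never generated by the model in this setup, the output parses as JSON by construction, which is one plausible reading of why schema injection helps structural adherence.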

Soumyadeep Jana, Sahil Danayak, Sanasam Ranbir Singh

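TL;DR: AdS (Adapter-State Sharing) is a lightweight framework for multimodal sarcasm detection that inserts adapters in the upper layers of CLIP and introduces an adapter-state sharing mechanism, achieving state-of-the-art results with significantly fewer trainable parameters.

Details

Motivation: Multimodal image-text sarcasm on social media is growing rapidly, but existing approaches rely on full fine-tuning of large pre-trained models, which makes them unsuitable for resource-constrained settings.

Result: On public benchmarks, AdS achieves state-of-the-art performance with fewer trainable parameters than existing PEFT and full fine-tuning methods.

Insight: Adapter-state sharing can improve performance on multimodal tasks while significantly reducing compute requirements.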

Abstract: The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining, especially under resource constraints. Existing approaches rely on full fine-tuning of large pre-trained models, making them unsuitable for low-resource settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS (Adapter-State Sharing), a lightweight framework built on CLIP that inserts adapters only in the upper layers and introduces a novel adapter-state sharing mechanism, where textual adapters guide visual ones. This design promotes efficient cross-modal learning while preserving low-level unimodal representations. Experiments on two public benchmarks demonstrate that AdS achieves state-of-the-art results using significantly fewer trainable parameters than existing PEFT and full fine-tuning approaches.

[242] R1-RE: Cross-Domain Relationship Extraction with RLVR cs.CLPDF

Runpeng Dai, Tong Zheng, Run Yang, Hongtu Zhu

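TL;DR: The paper proposes R1-RE, a reinforcement learning approach to cross-domain relationship extraction that mimics the workflow of human annotators and reframes RE as a reasoning task, significantly improving out-of-domain generalization.

Details

Motivation: Traditional relationship extraction methods use supervised learning to map context directly to labels and generalize poorly out of domain (OOD). Inspired by the workflow of human annotators, the authors use a reinforcement learning framework to improve the model's reasoning and generalization.

Result: On the Sem-2010 and MDKG datasets, the R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with proprietary models such as GPT-4o.

Insight: 1. The RLVR framework elicits the reasoning abilities of small language models and improves OOD generalization. 2. By mimicking the human annotation workflow, the model exhibits human-like reasoning behavior. 3. The reward mechanism in reinforcement learning strongly affects training dynamics and final performance.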

Abstract: Relationship extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels, an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
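In RLVR, the reward comes from a programmatic check against a known answer rather than from a learned reward model. For relationship extraction, a minimal version of such a check could look like the sketch below. The `<answer>` tag format, the format bonus, and the exact-match rule are assumptions chosen for illustration; the abstract does not describe the paper's reward design.

```python
import re

# Hypothetical output convention: reasoning text followed by a tagged label.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def verifiable_reward(completion, gold_label, format_bonus=0.1):
    """Score a completion against the gold relation label.

    Returns 0.0 if no tagged answer is found, format_bonus if the answer is
    well-formed but wrong, and format_bonus + 1.0 if it matches the gold
    label (case-insensitive, surrounding whitespace ignored).
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    reward = format_bonus
    if predicted == gold_label.strip().lower():
        reward += 1.0
    return reward


correct = verifiable_reward(
    "The sentence says the virus causes the disease. <answer>Cause-Effect</answer>",
    "cause-effect",
)
wrong = verifiable_reward("<answer>Component-Whole</answer>", "cause-effect")
missing = verifiable_reward("I think it is Cause-Effect.", "cause-effect")
```

Because the check is deterministic, the same completion always receives the same reward, which is what makes the signal "verifiable".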

[243] Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce cs.CL | I.2.7; H.3.1; I.2.6PDF

Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil

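TL;DR: The paper proposes EOT, a joint task combining emotion detection and opinion trigger extraction; in the absence of labeled data it builds the EOT-X dataset and presents the EOT-DETECT framework, which outperforms zero-shot and chain-of-thought techniques.

Details

Motivation: Customer reviews on e-commerce platforms carry important emotional signals, but no prior work jointly detects emotions and their triggers, limiting understanding of how they drive purchasing decisions.

Result: EOT-DETECT surpasses zero-shot and chain-of-thought techniques across e-commerce domains.

Insight: Jointly modeling emotions and their triggers gives a fuller picture of customer emotions, and structured reasoning is important for LLM performance on this task.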

Abstract: Customer reviews on e-commerce platforms capture critical affective signals that drive purchasing decisions. However, no existing research has explored the joint task of emotion detection and explanatory span identification in e-commerce reviews, a crucial gap in understanding what triggers customer emotional responses. To bridge this gap, we propose a novel joint task unifying Emotion detection and Opinion Trigger extraction (EOT), which explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions (emotion categories) grounded in Plutchik’s theory of 8 primary emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated collection of 2,400 reviews with fine-grained emotions and opinion triggers. We evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured prompting framework with systematic reasoning and self-reflection. Our framework surpasses zero-shot and chain-of-thought techniques across e-commerce domains.

[244] LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework cs.CLPDF

Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun

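TL;DR: The paper presents LOOM-Scope, a framework that standardizes long-context model evaluation and addresses the inconsistency and high computational cost of existing benchmarks.

Details

Motivation: Existing long-context evaluation benchmarks use inconsistent settings and are computationally expensive, making reliable comparison and comprehensive assessment difficult.

Result: LOOM-Scope enables more efficient and more comprehensive evaluation of long-context models.

Insight: Standardization and efficiency are key to making long-context model evaluation reliable and scalable.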

Abstract: Long-context processing has become a fundamental capability for large language models (LLMs). To assess a model’s long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. In addition, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io

[245] “This Suits You the Best”: Query Focused Comparative Explainable Summarization cs.CL | cs.IR | H.3.1; I.2.7; H.1.2PDF

Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil

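TL;DR: The paper proposes a new task, generating Query-Focused Comparative Explainable Summaries (QF-CES), using Multi-Source Opinion Summarization (M-OS) to improve the ability of recommender systems to produce personalized comparative summaries.

Details

Motivation: Traditional opinion summarization often fails to provide holistic comparative insights, especially in product recommendation. The paper aims to fill this gap with a more explainable and targeted summarization approach.

Result: M-OS reduces inference latency by approximately 40% compared to the direct input approach (DIA); evaluation with QF-CES-PROMPT shows an average Spearman correlation of 0.74 with human judgments.

Insight: Using M-OS as an intermediate step improves efficiency and practicality, and LLMs show potential for generating explainable summaries.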

Abstract: Product recommendations inherently involve comparisons, yet traditional opinion summarization often fails to provide holistic comparative insights. We propose the novel task of generating Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address the lack of query-focused recommendation datasets, we introduce MS-Q2P, comprising 7,500 queries mapped to 22,500 recommended products with metadata. We leverage Large Language Models (LLMs) to generate tabular comparative summaries with query-specific explanations. Our approach is personalized, privacy-preserving, recommendation engine-agnostic, and category-agnostic. M-OS as an intermediate step reduces inference latency by approximately 40% compared to the direct input approach (DIA), which processes raw data directly. We evaluate open-source and proprietary LLMs for generating and assessing QF-CES. Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity, faithfulness, informativeness, format adherence, and query relevance) showed an average Spearman correlation of 0.74 with human judgments, indicating its potential for QF-CES evaluation.

[246] Reason to Rote: Rethinking Memorization in Reasoning cs.CL | cs.LGPDF

Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko

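TL;DR: The paper studies how large language models memorize label noise in training data and finds that such memorization does not heavily affect generalizable reasoning. Using two controllable synthetic reasoning datasets, four-digit addition (FDA) and two-hop relational reasoning (THR), the authors find that memorization relies on generalizable reasoning mechanisms and operates through distributed encoding rather than a simple look-up mechanism.

Details

Motivation: Large language models readily memorize label noise in training data yet still perform well on reasoning tasks. The paper investigates how this memorization happens and why it does not heavily damage reasoning ability.

Result: Models still compute intermediate reasoning outputs when retrieving memorized noisy labels, and intervening on reasoning impairs memorization; memorization operates through distributed encoding and, in the FDA case study, through outlier heuristics.

Insight: Memorization builds on, rather than overrides, the model's underlying reasoning mechanisms, which suggests that memorization can be benign in some cases.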

Abstract: Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals that memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
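A controllable synthetic dataset of the kind described here, addition problems where a known fraction of labels is deliberately wrong, can be generated in a few lines. The sketch below is an assumption about what such a generator might look like: the operand range, the prompt format, the uniform choice of wrong labels, and the function name are all hypothetical rather than taken from the paper.

```python
import random


def make_fda_dataset(n_examples, noise_rate, seed=0):
    """Generate four-digit addition examples with injected label noise.

    Each example records whether its label was corrupted, so memorization
    of noisy labels can later be measured separately from clean accuracy.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examples):
        a = rng.randint(1000, 9999)
        b = rng.randint(1000, 9999)
        true_sum = a + b
        label = true_sum
        noisy = rng.random() < noise_rate
        if noisy:
            # Resample until the label is a wrong value in the valid sum range.
            while label == true_sum:
                label = rng.randint(2000, 19998)
        data.append({"prompt": f"{a}+{b}=", "label": label, "noisy": noisy})
    return data


clean = make_fda_dataset(50, noise_rate=0.0)
corrupted = make_fda_dataset(50, noise_rate=1.0)
```

Keeping the `noisy` flag alongside each example is the design choice that makes the dataset controllable: the same model can be probed on clean and corrupted instances without guessing which is which.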

[247] Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems cs.CLPDF

Quang-Vinh Nguyen, Quang-Chieu Nguyen, Hoang Pham, Khac-Hoai Nam Bui

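TL;DR: Spec-TOD is an instruction-tuning framework for task-oriented dialogue systems that combines explicit task instructions with a lightweight LLM training strategy to significantly reduce reliance on labeled data in low-resource settings.

Details

Motivation: Task-oriented dialogue systems perform poorly in low-resource scenarios with limited labeled data, and existing deep learning methods struggle to use limited data efficiently.

Result: Spec-TOD achieves competitive results on the MultiWOZ dataset while significantly reducing the need for labeled data.

Insight: Combining task instruction tuning with lightweight LLMs is an effective approach for low-resource task-oriented dialogue systems.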

Abstract: Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.

[248] Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations cs.CL | cs.AIPDF

A. Bochkov

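TL;DR: The paper challenges the conventional view, arguing that semantic representation does not depend on trainable input embeddings but is an emergent property of the Transformer architecture and data scale. With a frozen embedding layer derived from visual Unicode glyphs, the models generate coherent text and outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark.

Details

Motivation: To investigate where semantic representation resides in large language models and to challenge the view that input embeddings are the foundation of meaning, testing whether semantics instead emerge from the Transformer architecture.

Result: The models generate coherent text and outperform conventional models on the MMLU benchmark, indicating that semantics are an emergent property of the Transformer architecture and data scale.

Insight: Embeddings may act as structural primitives rather than containers of meaning; semantic representation depends more on the model's compositional architecture and large-scale data.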

Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.

[249] O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine cs.CLPDF

Ismaël Rousseau, Claire Perroux, Pierre Adam, Thomas Girault, Lionel Delphin-Poulat

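TL;DR: This paper describes the O_FT team's work in the EvalLLM2025 challenge on adapting language models to the defense domain. The study adapts Mistral-7B-Instruct-v0.3 with classical continued pre-training and instruction tuning, and improves performance through data collection, generation, and selection. Experiments show that the adapted models outperform the general-purpose model on domain knowledge and task handling while retaining general capabilities.

Details

Motivation: To test the feasibility of adapting small language models to a specific domain (here, defense), with attention to how data selection and training strategies affect performance.

Result: The adapted models show better task handling and domain knowledge in the defense domain without loss of general performance. The study also supports the environmental feasibility of adapting small models.

Insight: With a data-driven approach, even small language models can perform well in a specific domain while reducing compute consumption.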

Abstract: This paper presents the work carried out by the O_FT team, a joint team from Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the Mistral-7B-Instruct-v0.3 model using classical techniques of continued pre-training and instruction tuning. The core of our efforts was collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models.

[250] SIGIR 2025 – LiveRAG Challenge Report cs.CL | cs.IR | H.3.3PDF

David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh

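TL;DR: The LiveRAG Challenge at SIGIR 2025 provided a competitive platform for advancing Retrieval-Augmented Generation (RAG). Under strict conditions, 70 teams used a fixed corpus and an open-source LLM to answer 500 unseen questions live, and the winners were selected through a two-stage evaluation, automated and then manual.

Details

Motivation: The challenge aims to foster innovation in RAG through competition, in particular by enabling comparisons of retrieval and prompting strategies.

Result: The challenge drew 70 international teams, with the final winners announced at SIGIR 2025.

Insight: Competitive platforms can drive technical progress effectively, and standardized evaluation is essential to the field's development.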

Abstract: The LiveRAG Challenge at SIGIR 2025, held between March and May 2025, provided a competitive platform for advancing Retrieval-Augmented Generation (RAG) technologies. Participants from academia and industry were invited to develop a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct). The goal was to facilitate challenging comparisons of retrieval and prompting strategies. During the Live Challenge Day, 70 teams from 27 different countries provided answers and supporting information for 500 unseen questions within a strict two-hour time window. Evaluation was conducted in two stages: first, an automated LLM-as-a-judge approach was used to compute correctness and faithfulness scores; then, a manual review of the top-ranked submissions was conducted. The finalists were announced on June 12, 2025, with prizes awarded during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.

[251] ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation cs.CL | cs.SEPDF

Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu

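TL;DR: ArtifactsBench fills the evaluation gap for large language models (LLMs) that generate dynamic, visual, interactive code. It proposes an automated, multimodal evaluation paradigm and benchmark in which a multimodal LLM (MLLM) assesses visual fidelity and interactive integrity, reaching results highly consistent with human experts.

Details

Motivation: Current benchmarks focus on the algorithmic correctness of code and overlook visual fidelity and interactive integrity, which are essential to modern user experience. A new evaluation method is needed to fill this gap.

Result: The automated evaluation reaches 94.4% ranking consistency with WebDev Arena, a human-preference benchmark, and over 90% pairwise agreement with human experts, demonstrating its reliability. The benchmark, evaluation harness, and baseline results are open-sourced.

Insight: Generalist models often outperform domain-specific models on visual interactive code generation, which shows the current state and potential of the technology. The open-source resources may speed the development of user-centric generative models.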

Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
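
The abstract reports ranking consistency and pairwise agreement figures but does not spell out how they are computed. As a rough illustration only, the sketch below measures one common notion of agreement between two score tables: the fraction of model pairs that both tables order the same way. The model names and scores are hypothetical, and this is not claimed to be the metric used by ArtifactsBench.

```python
from itertools import combinations


def pairwise_agreement(scores_a, scores_b):
    """Fraction of item pairs that two score tables order the same way.

    Only items present in both tables are compared. Pairs tied in either
    table are skipped. Returns None if no pair is comparable.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    agree = 0
    total = 0
    for x, y in combinations(shared, 2):
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        if diff_a == 0 or diff_b == 0:
            continue  # ties carry no ordering information
        total += 1
        if (diff_a > 0) == (diff_b > 0):
            agree += 1
    return agree / total if total else None


# Hypothetical scores: an automated judge versus a human-preference rating.
auto_judge = {"model_a": 61.0, "model_b": 55.5, "model_c": 48.2, "model_d": 40.1}
human_pref = {"model_a": 1250, "model_b": 1190, "model_c": 1205, "model_d": 1100}

print(pairwise_agreement(auto_judge, human_pref))
```

In this toy data the two tables disagree only on the order of `model_b` and `model_c`, so five of the six pairs agree.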

[252] Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification cs.CLPDF

Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein

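TL;DR: Co-DETECT is a novel mixed-initiative annotation framework that combines human expertise with LLM-based automatic annotation to discover and handle edge cases in text classification.

Details

Motivation: In text classification, conventional annotation struggles to cover edge cases, and manual annotation is time-consuming and laborious, so an efficient collaborative approach is needed.

Result: A user study and analyses validate Co-DETECT's effectiveness in discovering edge cases and improving classification performance.

Insight: Collaboration between LLMs and human experts can efficiently handle edge cases in complex tasks, improving both the annotation workflow and model performance.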

Abstract: We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists the user in incorporating edge-case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study, together with qualitative and quantitative analyses, demonstrates the effectiveness of Co-DETECT.

[253] Verified Language Processing with Hybrid Explainability: A Technical Report cs.CL | cs.SCPDF

Oliver Robert Fox, Giacomo Bergami, Graham Morgan

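TL;DR: The paper proposes a hybrid explainability approach that combines graphs and logic to improve similarity and classification tasks in natural language processing, and addresses the problem of distinguishing logical implication, inconsistency, and indifference.

Details

Motivation: Current natural language processing techniques lack explainability and fall short on similarity and logical classification tasks in particular; the authors aim for a more transparent and reliable method.

Result: Experiments show that the method outperforms existing methods on text similarity and logical classification tasks.

Insight: General-purpose language models trained only on large corpora struggle to capture the complex logical structure of natural language; hybrid explainability approaches show more promise.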

Abstract: The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.

[254] An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques cs.CL | cs.AIPDF

Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz

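TL;DR: This paper systematically evaluates six large language models (LLMs) on text summarization across four datasets, using prompt engineering techniques (zero-shot and in-context learning), and studies the trade-off between summary quality and computational efficiency.

Details

Motivation: Although LLMs excel at natural language processing tasks, their summarization ability across domains and datasets has not been comprehensively evaluated. Generating summaries efficiently without relying on large amounts of training data has also become a key problem.

Result: LLMs perform well on news and dialogue tasks; for long scientific documents, a chunking strategy significantly improves the summaries. Model parameters, datasets, and prompt design all have a clear effect on performance.

Insight: The study offers practical design guidance for LLM summarization and highlights the importance of balancing prompt engineering against computational efficiency.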

Abstract: Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
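
The sentence-based chunking idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the word budget stands in for a token limit, the sentence splitter is deliberately naive, and `first_sentence` is a hypothetical placeholder for a real LLM call.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(text, summarize, max_words):
    """Two-stage summary: summarize each chunk, then the joined partial summaries."""
    chunks = chunk_sentences(split_sentences(text), max_words)
    partial = [summarize(chunk) for chunk in chunks]
    if not partial:
        return ""
    if len(partial) == 1:
        return partial[0]
    return summarize(" ".join(partial))


def first_sentence(text):
    """Hypothetical stand-in for an LLM summarizer: keep the first sentence."""
    return split_sentences(text)[0]


document = "A one. B two three. C four five six. D seven."
print(chunk_sentences(split_sentences(document), max_words=5))
print(summarize_long(document, first_sentence, max_words=5))
```

Packing whole sentences keeps each chunk readable on its own, at the cost of chunks that are somewhat shorter than the budget allows.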

[255] OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model cs.CL | cs.AI | cs.SD | eess.ASPDF

Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang

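TL;DR: The paper introduces OpenS2S, an open-source, end-to-end empathetic large speech language model (LSLM) intended to enable transparent research on empathetic speech interaction.

Details

Motivation: The most powerful empathetic LSLMs are mostly closed-source and opaque, which hinders research. OpenS2S aims to fill this gap with open-source tools and data.

Result: OpenS2S achieves low-latency empathetic speech generation and provides a scalable, high-quality multi-speaker dataset.

Insight: Open-source tools and transparent data can drive research on empathetic speech systems; the automated data construction method substantially lowers human labor costs.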

Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S

[256] From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations cs.CLPDF

Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt

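TL;DR: This paper proposes a framework based on curriculum learning and Direct Preference Optimization (DPO) for generating veracity explanations for Hindi news. By introducing "Actuality" and "Finesse" parameters into the DPO loss, the method markedly improves the quality and consistency of the generated explanations.

Details

Motivation: In an era of rampant misinformation, reliable news explanation tools for low-resource, non-English languages such as Hindi are essential. Automated tools for Hindi are lacking, and a scalable method for detecting and explaining misinformation is needed.

Result: Experiments show that the framework generates coherent, contextually relevant news explanations and significantly outperforms existing methods, especially for the low-resource language Hindi.

Insight: 1. Curriculum learning combined with DPO can improve generation performance in low-resource languages. 2. Quantifying "Actuality" and "Finesse" allows more precise optimization of the quality of generated content.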

Abstract: In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters – Actuality and Finesse – into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework’s effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.
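
For orientation, the sketch below computes the standard per-example DPO loss, which the framework builds on. The abstract does not define how Actuality and Finesse enter the loss, so they are deliberately left out here; the log-probability values and the `beta` default are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-example DPO loss: -log(sigmoid(beta * margin)).

    margin compares how much more the policy prefers the chosen response
    over the rejected one than the frozen reference model does.
    The paper's Actuality and Finesse terms are NOT modeled here.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical sequence log-probabilities under the policy and the reference.
# Here the fact-checked explanation plays the role of the chosen response.
print(dpo_loss(-12.0, -15.0, ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```

The loss falls as the policy raises the likelihood of the preferred explanation relative to the reference model, and rises when it favors the non-preferred LLM output.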

[257] Pre-Trained Policy Discriminators are General Reward Models cs.CL | cs.LGPDF

Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou

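TL;DR: The paper proposes POLAR, a new reward modeling method in which a policy discriminator quantifies the difference between policies to produce a reward signal, substantially improving reward model performance and generalization.

Details

Motivation: Traditional reward modeling relies on absolute preferences and struggles to capture relative differences between policies. POLAR offers a more general and scalable approach through policy discriminative learning.

Result: POLAR-7B raises preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks, and markedly improves policy performance in reinforcement learning.

Insight: POLAR's success suggests that policy discriminative learning is a promising direction for general reward modeling; the power-law relationship between compute and performance supports scaling it further.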

Abstract: We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B improves preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance, improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
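
A power law y = a * x^b becomes a straight line in log-log space, which is why a linear correlation coefficient is a natural check for it. The sketch below fits such a line by least squares and reports Pearson's r. It assumes the correlation is measured between log compute and log loss, which the abstract does not state, and the data points are synthetic.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares in log-log space.

    Returns (a, b, r), where r is the Pearson correlation between
    log(x) and log(y). All inputs must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                      # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)  # intercept = log of the prefactor
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


# Synthetic data generated from an exact power law (hypothetical values).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [2.0 * c ** -0.05 for c in compute]

a, b, r = fit_power_law(compute, loss)
print(a, b, r)
```

With a decreasing loss the fitted exponent and r are negative, so a reported coefficient near 0.99 would correspond to the absolute value of r in this setup.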

[258] Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models cs.CLPDF

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu

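TL;DR: The paper proposes a new adversarial attack, Response Attack, which uses response information in the dialogue context to induce large language models (LLMs) to generate policy-violating content, with a higher attack success rate than existing methods.

Details

Motivation: LLMs in a dialogue can be covertly influenced by earlier responses (the contextual priming effect). The authors exploit this vulnerability to design a more effective attack and so expose latent safety problems in these models.

Result: Across 8 open-source and proprietary LLMs, Response Attack achieves higher attack success rates than 7 existing jailbreak techniques.

Insight: Contextual priming has a major impact on LLM safety, and existing safeguards may not adequately account for it.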

Abstract: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer the model’s subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and the generated response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.

[259] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions cs.CL | cs.AIPDF

Yuanzhe Hu, Yu Wang, Julian McAuley

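TL;DR: The paper introduces MemoryAgentBench, a new benchmark for evaluating LLM agents with memory mechanisms in incremental multi-turn interactions, covering four core competencies: accurate retrieval, test-time learning, long-range understanding, and conflict resolution.

Details

Motivation: Existing evaluations of LLM agents focus on reasoning, planning, and execution, while the evaluation of memory has been limited by the lack of suitable benchmarks. To fill this gap, the paper proposes an evaluation framework specifically for memory agents.

Result: Experiments show that existing methods fall short of mastering all four competencies, underscoring the need for research into more comprehensive memory mechanisms.

Insight: The study exposes weaknesses in how current LLM agents manage memory and points to directions for stronger memory mechanisms.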

Abstract: Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory, which encompasses how agents memorize, update, and retrieve long-term information, is under-evaluated due to the lack of benchmarks. We refer to agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

cs.CY [Back]

[260] MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks cs.CY | cs.AI | cs.CLPDF

Dumitran Adrian Marius, Theodor-Pierre Moroianu, Buca Mihnea-Vicentiu

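TL;DR: The paper presents MateInfoUB, a bilingual (English-Romanian), multimodal (text and image) dataset for testing large language models (LLMs) on advanced computer science education tasks, and analyzes their strengths and limitations.

Details

Motivation: As LLMs improve at code-related tasks and problem solving, it becomes important to study their real performance and limits in advanced computer science education, such as competitions.

Result: The study reveals the strengths and limitations of current LLMs, notably the effect of language choice (English vs. Romanian) on performance, and so informs their use in CS education and competitions.

Insight: LLMs perform well on some problems, but language and task type significantly affect performance; the study also stresses ethical and fairness concerns when LLMs are used in education.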

Abstract: The rapid advancement of Large Language Models (LLMs) has transformed various domains, particularly computer science (CS) education. These models exhibit remarkable capabilities in code-related tasks and problem-solving, raising questions about their potential and limitations in advanced CS contexts. This study presents a novel bilingual (English-Romanian) multimodal (text and image) dataset of multiple-choice questions derived from a high-level computer science competition. A distinctive feature of our dataset is that some problems are more easily solved by reasoning on paper, while for others writing code is more efficient. We systematically evaluate state-of-the-art LLMs on this dataset, analyzing their performance on theoretical programming tasks. Our findings reveal the strengths and limitations of current LLMs, including the influence of language choice (English vs. Romanian), providing insights into their applicability in CS education and competition settings. We also address critical ethical considerations surrounding educational integrity and the fairness of assessments in the context of LLM usage. These discussions aim to inform future educational practices and policies. To support further research, our dataset will be made publicly available in both English and Romanian. Additionally, we release an educational application tailored for Romanian students, enabling them to self-assess using the dataset in an interactive and practice-oriented environment.

[261] LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop cs.CY | cs.AI | cs.CL | cs.HCPDF

Runcong Zhao, Artem Borov, Jiazheng Li, Yulan He

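TL;DR: LearnLens is an LLM-based system that provides personalised, curriculum-aligned feedback for science education, addressing the challenges of existing feedback systems through error-aware assessment, curriculum-grounded generation, and an educator-in-the-loop interface.

Details

Motivation: Existing educational feedback systems fall short in personalisation, curriculum alignment, and educator involvement; LearnLens aims to address these gaps.

Result: The system delivers high-quality, scalable educational feedback that supports both teachers and students.

Insight: A structured memory chain is more effective than traditional similarity-based retrieval, and educator involvement is key.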

Abstract: Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.

[262] From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems cs.CY | cs.CE | cs.CL | cs.HC | cs.ROPDF

Jiangbo Yu

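TL;DR: This paper introduces the concept of agentic vehicles (AgVs) to address the cognitive and social limitations of conventional autonomous vehicles, integrating agentic AI for reasoning, adaptation, and interaction in complex environments.

Details

Motivation: Conventional autonomous vehicles (AuVs) are limited in dynamic interaction and complex environments. Advances in agentic AI make more adaptive and interactive vehicles possible, driving a shift from pure automation toward agentic capability.

Result: Through high-level reasoning and tool use, AgVs can adapt and interact dynamically in complex environments, going beyond the functional limits of conventional AuVs.

Insight: Developing agentic vehicles requires addressing safety, real-time control, public acceptance, ethical alignment, and regulatory frameworks, moving the technology toward next-generation mobility systems that better match human needs.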

Abstract: Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Accordingly, autonomous vehicles (AuVs) are defined as systems capable of perceiving their environment and executing preprogrammed tasks independently of external input. However, both research and real-world deployments increasingly showcase vehicles that demonstrate behaviors beyond this definition (including the SAE levels 1 to 6), such as interaction with humans and machines, goal adaptation, contextual reasoning, external tool use, and long-term planning, particularly with the integration of large language models (LLMs) and agentic AI systems. These developments reveal a conceptual gap between technical autonomy and the broader cognitive and social capabilities needed for future human-centered mobility systems. To address this, we introduce the concept of agentic vehicles (AgVs), referring to vehicles that integrate agentic AI to reason, adapt, and interact within complex environments. This paper presents a systems-level framework to characterize AgVs, focusing on their cognitive and communicative layers and differentiating them from conventional AuVs. It synthesizes relevant advances in agentic AI, robotics, multi-agent systems, and human-machine interaction, and highlights how agentic AI, through high-level reasoning and tool use, can function not merely as computational tools but as interactive agents embedded in mobility ecosystems. The paper concludes by identifying key challenges in the development and governance of AgVs, including safety, real-time control, public acceptance, ethical alignment, and regulatory frameworks.

cs.SD [Back]

[263] EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation cs.SD | cs.AI | cs.CV | cs.MM | eess.ASPDF

Fathinah Izzati, Xinyue Li, Gus Xia

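TL;DR: Expotion generates music from multimodal visual controls, namely facial expressions and upper-body motion, together with text prompts. It uses parameter-efficient fine-tuning and a temporal smoothing strategy to align the modalities, and experiments show that its music outperforms existing methods on several criteria.

Details

Motivation: Existing video-to-music generation methods often underuse the rich information in video, such as expressions and motion. Expotion combines multimodal controls with text prompts to generate more expressive, temporally accurate music.

Result: Experiments show that Expotion outperforms the baselines and existing video-to-music generation models in musicality, creativity, beat-tempo consistency, temporal alignment, and text adherence.

Insight: Multimodal visual controls such as expressions and motion can markedly improve the expressiveness and temporal accuracy of generated music, and parameter-efficient fine-tuning enables high-quality adaptation even with a small dataset.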

Abstract: We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls, specifically human facial expressions and upper-body motion, as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation.

q-fin.TR [Back]

[264] Does Overnight News Explain Overnight Returns? q-fin.TR | cs.CL | stat.MLPDF

Paul Glasserman, Kriste Krstovski, Paul Laliberte, Harry Mamaysky

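TL;DR: The paper observes that nearly all U.S. stock market gains over the past 30 years were earned overnight, while intraday returns were negative or flat. Using 2.4 million news articles and a supervised topic analysis method, it finds that variation in news topics and differing responses to them are a major reason for the gap between intraday and overnight returns.

Details

Motivation: The overnight return phenomenon in U.S. stocks (gains concentrated in the overnight period) lacks a full explanation; the paper tries to uncover the cause by analyzing news content.

Result: The method forecasts which stocks will do well overnight or poorly intraday, and explains patterns of continuation and reversal in intraday and overnight returns. It also contrasts the effect of news with other mechanisms proposed in the literature.

Insight: The effect of news on market returns is markedly asymmetric in time; the difference between overnight and intraday responses to news is an important driver of the return gap.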

Abstract: Over the past 30 years, nearly all the gains in the U.S. stock market have been earned overnight, while average intraday returns have been negative or flat. We find that a large part of this effect can be explained through features of intraday and overnight news. Our analysis uses a collection of 2.4 million news articles. We apply a novel technique for supervised topic analysis that selects news topics based on their ability to explain contemporaneous market returns. We find that time variation in the prevalence of news topics and differences in the responses to news topics both contribute to the difference in intraday and overnight returns. In out-of-sample tests, our approach forecasts which stocks will do particularly well overnight and particularly poorly intraday. Our approach also helps explain patterns of continuation and reversal in intraday and overnight returns. We contrast the effect of news with other mechanisms proposed in the literature to explain overnight returns.
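
To make the overnight/intraday split concrete, the sketch below uses the usual convention: the overnight return runs from the previous close to the open, and the intraday return from the open to the same day's close. The abstract does not give the paper's exact definitions, so this convention is an assumption, and the prices are hypothetical.

```python
def split_returns(bars):
    """Split close-to-close returns into overnight and intraday parts.

    bars: chronological list of (open_price, close_price) tuples.
    Returns a list of (overnight, intraday) simple returns, one per day
    starting from the second bar.
    """
    result = []
    for (_, prev_close), (open_price, close_price) in zip(bars, bars[1:]):
        overnight = open_price / prev_close - 1.0  # previous close -> open
        intraday = close_price / open_price - 1.0  # open -> close
        result.append((overnight, intraday))
    return result


def compound(returns):
    """Compound a sequence of simple returns into one cumulative return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


# Hypothetical prices: the stock gaps up at every open and fades by the close.
bars = [(100.0, 100.0), (101.0, 100.0), (101.0, 100.0)]
parts = split_returns(bars)

print(compound([overnight for overnight, _ in parts]))  # positive
print(compound([intraday for _, intraday in parts]))    # negative
```

In this toy series the closing price never changes, yet the cumulative overnight return is positive and the cumulative intraday return is negative, which is the shape of the effect the abstract describes.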

physics.geo-ph [Back]

[265] Automated Workflow for the Detection of Vugs physics.geo-ph | cs.CVPDF

M. Quamer Nasim, T. Maiti, N. Mosavat, P. V. Grech, T. Singh

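TL;DR: This paper proposes an automated vug detection model that uses computer vision to improve vug identification in geological image logs, overcoming the bias and labour intensity of traditional methods.

Details

Motivation: Traditional manual and semi-automated vug detection suffers from individual bias, high labour intensity, and inflexible parameter tuning; a more efficient automated method is needed.

Result: The model shows high accuracy in validation, captures vugs missed by experts, and aids the study of vug types through an area distribution plot.

Insight: The automated method improves the efficiency of vug detection and, through statistical analysis, provides more complete data for reservoir evaluation.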

Abstract: Image logs are crucial in capturing high-quality geological information about subsurface formations. Among the various geological features that can be gleaned from Formation Micro Imager logs, vugs are essential for reservoir evaluation. This paper introduces an automated Vug Detection Model, leveraging advanced computer vision techniques to streamline the vug identification process. Manual and semi-automated methods are limited by individual bias, labour intensity, and inflexibility in parameter fine-tuning. Our methodology also introduces statistical analysis of vug characteristics. Pre-processing steps, including logical file extraction and normalization, ensure standardized and usable data. The six-step vug identification methodology encompasses top-k mode extraction, adaptive thresholding, contour identification, aggregation, advanced filtering, and optional filtering for low-vuggy regions. The model’s adaptability is evidenced by its ability to identify vugs missed by manual picking undertaken by experts. Results demonstrate the model’s accuracy through validation against expert picks. Detailed metrics, such as the count, mean, and standard deviation of vug areas within zones, are introduced, showcasing the model’s capabilities compared to manual picking. The vug area distribution plot enhances understanding of vug types in the reservoir. This research focuses on the identification and characterization of vugs, which in turn aids a better understanding of reservoirs.
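
The per-zone metrics mentioned in the abstract (count, mean, and standard deviation of vug areas) can be sketched as below. The detection steps themselves are not reproduced. The zone boundaries, the depth and area values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zone_stats(vugs, zone_edges):
    """Summarize detected vug areas per depth zone.

    vugs: list of (depth, area) pairs for detected vugs.
    zone_edges: ascending depths [d0, d1, ..., dn]; zone i covers
    d_i <= depth < d_(i+1).
    Returns one dict per zone with count, mean area and population
    standard deviation (None for empty zones).
    """
    stats = []
    for top, bottom in zip(zone_edges, zone_edges[1:]):
        areas = [area for depth, area in vugs if top <= depth < bottom]
        stats.append({
            "top": top,
            "bottom": bottom,
            "count": len(areas),
            "mean_area": mean(areas) if areas else None,
            "std_area": pstdev(areas) if areas else None,
        })
    return stats


# Hypothetical detections: (depth in metres, vug area in square millimetres).
detections = [(1000.2, 4.0), (1000.8, 6.0), (1001.5, 10.0)]
for zone in zone_stats(detections, [1000, 1001, 1002, 1003]):
    print(zone)
```

Empty zones are reported with a count of zero and no mean, which keeps low-vuggy intervals visible in the summary instead of silently dropping them.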

eess.IV [Back]

[266] Outcome prediction and individualized treatment effect estimation in patients with large vessel occlusion stroke eess.IV | cs.CV | cs.LG | 68 | I.2; J.3PDF

Lisa Herzog, Pascal Bühler, Ezequiel de la Rosa, Beate Sick, Susanne Wegener

TL;DR: 该论文开发了可解释的深度学习模型，用于预测大血管闭塞（LVO）中风患者的功能结局和个体化治疗效果（ITE），结合临床变量和影像数据，但ITE估计的区分能力有限。

Details

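Motivation: Although mechanical thrombectomy has become the standard treatment for LVO stroke, only 50% of successfully treated patients have a favorable functional outcome. Better prediction models and ITE estimates are needed to optimize treatment decisions.

Result: 1) Clinical variables performed well for binary functional outcome prediction (AUC 0.719); 2) adding CTA slightly improved performance (AUC 0.737); 3) the discriminatory ability of the ITE estimates was limited (C-for-Benefit of about 0.55).

Insight: 1) Pre-stroke disability is the most important clinical predictor; 2) imaging data did not significantly improve predictive performance; 3) the ITE estimates are well calibrated, but their discriminatory ability requires further research.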

Abstract: Mechanical thrombectomy has become the standard of care in patients with stroke due to large vessel occlusion (LVO). However, only 50% of successfully treated patients show a favorable outcome. We developed and evaluated interpretable deep learning models to predict functional outcomes in terms of the modified Rankin Scale score alongside individualized treatment effects (ITEs) using data of 449 LVO stroke patients from a randomized clinical trial. Besides clinical variables, we considered non-contrast CT (NCCT) and angiography (CTA) scans, which were integrated using novel foundation models to make use of advanced imaging information. Clinical variables had a good predictive power for binary functional outcome prediction (AUC of 0.719 [0.666, 0.774]), which could be slightly improved when adding CTA imaging (AUC of 0.737 [0.687, 0.795]). Adding NCCT scans or a combination of NCCT and CTA scans to clinical features yielded no improvement. The most important clinical predictor for functional outcome was pre-stroke disability. While estimated ITEs were well calibrated to the average treatment effect, discriminatory ability was limited, as indicated by a C-for-Benefit statistic of around 0.55 in all models. In summary, the models allowed us to jointly integrate CT imaging and clinical features while achieving state-of-the-art prediction performance and ITE estimates. Yet, further research is needed to particularly improve ITE estimation.
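
For readers less familiar with the AUC figures quoted in this entry, the following minimal sketch computes the area under the ROC curve as the fraction of (positive, negative) pairs that a model ranks correctly, counting ties as half. The function name and the toy labels and scores are illustrative assumptions and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = favorable outcome, 0 = unfavorable outcome.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is one way to read the reported values of roughly 0.72 to 0.74.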

[267] Event2Audio: Event-Based Optical Vibration Sensing eess.IV | cs.CV | eess.AS

Mingxuan Cai, Dekel Galor, Amit Pal Singh Kohli, Jacob L. Yates, Laura Waller

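TL;DR: This paper proposes an event-camera-based method for recovering audio signals from tiny vibrations; it is faster than prior approaches and approaches real-time processing.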

Details

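Motivation: Tiny vibrations carry rich information (such as sound and material properties). Existing methods either record them passively or amplify them actively with a laser, but event cameras capture fast motion more efficiently and can improve vibration sensing.

Result: Experiments show that the method matches the state of the art in audio recovery quality while running much faster, approaching real-time processing, and that it can handle multiple vibration sources and noise.

Insight: Event cameras show promise for high-speed vibration sensing: they capture tiny motions efficiently and open new possibilities for audio recovery and other vibration-related applications.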

Abstract: Small vibrations observed in video can unveil information beyond what is visual, such as sound and material properties. It is possible to passively record these vibrations when they are visually perceptible, or actively amplify their visual contribution with a laser beam when they are not perceptible. In this paper, we improve upon the active sensing approach by leveraging event-based cameras, which are designed to efficiently capture fast motion. We demonstrate our method experimentally by recovering audio from vibrations, even for multiple simultaneous sources, and in the presence of environmental distortions. Our approach matches the state-of-the-art reconstruction quality at much faster speeds, approaching real-time processing.

[268] Cancer cytoplasm segmentation in hyperspectral cell image with data augmentation eess.IV | cs.CV | physics.med-ph

Rebeka Sultana, Hibiki Horibe, Tomoaki Murakami, Ikuko Shimizu

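TL;DR: This paper proposes a data augmentation method for segmenting cancer cell cytoplasm in hyperspectral cell images, addressing the scarcity and noise of hyperspectral image data, and validates its effectiveness experimentally.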

Details

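Motivation: H&E-stained images are commonly used to detect nuclei or cancerous regions in cells, but CMOS images lack sufficient diagnostic information, while hyperspectral images provide more comprehensive information yet are scarce and susceptible to instrumental noise.

Result: Experimental results show that the proposed data augmentation method effectively improves segmentation performance both quantitatively and qualitatively.

Insight: Using easily annotated CMOS images for data augmentation can alleviate the shortage of hyperspectral image data while improving model robustness.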

Abstract: Hematoxylin and Eosin (H&E)-stained images are commonly used to detect nuclear or cancerous regions in cells from images captured by a microscope. Identifying cancer cytoplasm is crucial for determining the type of cancer; hence, obtaining accurate cancer cytoplasm regions in cell images is important. While CMOS images often lack detailed information necessary for diagnosis, hyperspectral images provide more comprehensive cell information. Using a deep learning model, we propose a method for detecting cancer cell cytoplasm in hyperspectral images. Deep learning models require large datasets for learning; however, capturing a large number of hyperspectral images is difficult. Additionally, hyperspectral images frequently contain instrumental noise, depending on the characteristics of the imaging devices. We propose a data augmentation method to account for instrumental noise. CMOS images were used for data augmentation owing to their visual clarity, which facilitates manual annotation compared to original hyperspectral images. Experimental results demonstrate the effectiveness of the proposed data augmentation method both quantitatively and qualitatively.

[269] Hybrid-View Attention for csPCa Classification in TRUS eess.IV | cs.CV

Zetian Feng, Juan Fu, Xuebin Zou, Hongsheng Ye, Hong Wu

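TL;DR: The paper proposes a hybrid-view attention (HVA) network for csPCa classification in 3D TRUS that combines complementary information from transverse and sagittal views, significantly improving diagnostic accuracy.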

Details

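Motivation: Prostate cancer is a leading cause of cancer-related death in men, and accurately identifying clinically significant prostate cancer (csPCa) is critical for timely intervention. Although TRUS is widely used for prostate biopsy, its low contrast and anisotropic spatial resolution pose diagnostic challenges.

Result: The method was validated on a dataset of 590 patients, and comparative and ablation experiments demonstrated its performance advantages.

Insight: Fusing complementary multi-view information and aggregating features dynamically can effectively improve the accuracy of TRUS image classification, especially in low-contrast settings.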

Abstract: Prostate cancer (PCa) is a leading cause of cancer-related mortality in men, and accurate identification of clinically significant PCa (csPCa) is critical for timely intervention. Transrectal ultrasound (TRUS) is widely used for prostate biopsy; however, its low contrast and anisotropic spatial resolution pose diagnostic challenges. To address these limitations, we propose a novel hybrid-view attention (HVA) network for csPCa classification in 3D TRUS that leverages complementary information from transverse and sagittal views. Our approach integrates a CNN-transformer hybrid architecture, where convolutional layers extract fine-grained local features and transformer-based HVA models global dependencies. Specifically, the HVA comprises intra-view attention to refine features within a single view and cross-view attention to incorporate complementary information across views. Furthermore, a hybrid-view adaptive fusion module dynamically aggregates features along both channel and spatial dimensions, enhancing the overall representation. Experiments are conducted on an in-house dataset containing 590 subjects who underwent prostate biopsy. Comparative and ablation results prove the efficacy of our method. The code is available at https://github.com/mock1ngbrd/HVAN.

[270] PhotIQA: A photoacoustic image data set with image quality ratings eess.IV | cs.CV

Anna Breger, Janek Gröhl, Clemens Karner, Thomas R Else, Ian Selby

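TL;DR: This paper presents PhotIQA, a dataset of 1134 expert-rated photoacoustic images for developing full-reference and no-reference image quality assessment methods, and shows that the HaarPSI$_{med}$ measure outperforms SSIM.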

Details

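Motivation: Existing image quality assessment methods are mainly based on natural images and are hard to apply to medical images such as photoacoustic imaging (PAI); quality assessment standards for multi-physics imaging are especially lacking.

Result: Experiments show that HaarPSI$_{med}$ correlates with the quality ratings significantly better than SSIM (SRCC: 0.83 vs. 0.62).

Insight: Quality assessment of photoacoustic images must account for their multi-physics nature; existing natural-image measures such as SSIM may not be suitable, and dedicated tools are needed.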

Abstract: Image quality assessment (IQA) is crucial in the evaluation stage of novel algorithms operating on images, including traditional and machine-learning-based methods. Due to the lack of available quality-rated medical images, most commonly used IQA methods employing reference images (i.e. full-reference IQA) have been developed and tested for natural images. Reported application inconsistencies arising when employing such measures for medical images are not surprising, as they rely on different properties than natural images. In photoacoustic imaging (PAI), especially, standard benchmarking approaches for assessing the quality of image reconstructions are lacking. PAI is a multi-physics imaging modality in which two inverse problems have to be solved, which makes the application of IQA measures uniquely challenging due to both acoustic and optical artifacts. To support the development and testing of full- and no-reference IQA measures, we assembled PhotIQA, a data set consisting of 1134 reconstructed photoacoustic (PA) images that were rated by 2 experts across five quality properties (overall quality, edge visibility, homogeneity, inclusion and background intensity), where the detailed rating enables usage beyond PAI. To allow full-reference assessment, highly characterised imaging test objects were used, providing a ground truth. Our baseline experiments show that HaarPSI$_{med}$ significantly outperforms SSIM in correlating with the quality ratings (SRCC: 0.83 vs. 0.62). The dataset is publicly available at https://doi.org/10.5281/zenodo.13325196.
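
The SRCC values quoted above are Spearman rank correlations between a quality measure and the expert ratings. The sketch below shows one common way to compute it (Pearson correlation of average ranks, so ties are handled); the helper names and the toy numbers are assumptions made for illustration and do not come from PhotIQA.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical scores from an IQA measure and matching expert ratings.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.30]
expert_ratings = [5, 2, 3, 4, 1]
srcc = spearman(metric_scores, expert_ratings)
print(f"SRCC = {srcc:.2f}")
```

Because only ranks matter, SRCC rewards a measure for ordering images the way experts do, regardless of the scale of its raw scores.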

[271] Dual-Alignment Knowledge Retention for Continual Medical Image Segmentation eess.IV | cs.CV

Yuxin Ye, Yan Liu, Shujian Yu

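TL;DR: This paper proposes a dual-alignment knowledge retention framework for continual learning in medical image segmentation, using cross-network alignment and cross-representation alignment modules to mitigate catastrophic forgetting across tasks.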

Details

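Motivation: Medical image segmentation must handle data arriving sequentially from different domains (such as clinical sites). Traditional continual learning methods struggle to capture the complex dependencies between tasks, which leads to catastrophic forgetting.

Result: Experiments show that the framework effectively mitigates catastrophic forgetting under domain shift in medical image segmentation tasks.

Insight: Capturing complex inter-task dependencies through a dual-alignment strategy is an effective way to mitigate catastrophic forgetting, which is significant for the continual learning field.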

Abstract: Continual learning in medical image segmentation involves sequential data acquisition across diverse domains (e.g., clinical sites), where task interference between past and current domains often leads to catastrophic forgetting. Existing continual learning methods fail to capture the complex dependencies between tasks. We introduce a novel framework that mitigates forgetting by establishing and enhancing complex dependencies between historical data and the network in the present task. Our framework features a dual-alignment strategy: the cross-network alignment (CNA) module aligns the features extracted from the bottleneck layers of the current and previous networks, while the cross-representation alignment (CRA) module aligns the features learned by the current network from historical buffered data and from current input data. Implementing both types of alignment is a non-trivial task. To address this, we further analyze the linear and nonlinear forms of the well-established Hilbert-Schmidt Independence Criterion (HSIC) and deliberately design feature mapping and feature pairing blocks within the CRA module. Experiments on medical image segmentation tasks demonstrate our framework’s effectiveness in mitigating catastrophic forgetting under domain shifts.
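
As background for the HSIC mentioned in this abstract, the sketch below computes the standard biased empirical HSIC with linear kernels, tr(KHLH) / (n - 1)^2, which reduces to the squared Frobenius norm of the centered cross-covariance between two feature sets. It illustrates only the linear form of the criterion; how the paper applies HSIC inside its CNA and CRA modules is not shown here, and the function name and toy features are assumptions.

```python
def linear_hsic(xs, ys):
    """Biased empirical HSIC with linear kernels: tr(K H L H) / (n - 1)^2.

    xs and ys are lists of equal length holding one feature vector per sample.
    With K = X X^T and L = Y Y^T, the trace equals the squared Frobenius norm
    of the centered cross-covariance matrix X_c^T Y_c.
    """
    n = len(xs)
    dx, dy = len(xs[0]), len(ys[0])
    mean_x = [sum(x[i] for x in xs) / n for i in range(dx)]
    mean_y = [sum(y[j] for y in ys) / n for j in range(dy)]
    total = 0.0
    for i in range(dx):
        for j in range(dy):
            c = sum((x[i] - mean_x[i]) * (y[j] - mean_y[j]) for x, y in zip(xs, ys))
            total += c * c
    return total / (n - 1) ** 2


# Hypothetical 1-D features from two networks on the same three samples.
feats_old = [[1.0], [2.0], [3.0]]
feats_new = [[1.0], [2.0], [3.0]]
feats_const = [[5.0], [5.0], [5.0]]
print(linear_hsic(feats_old, feats_new))    # dependent features give a positive value
print(linear_hsic(feats_old, feats_const))  # constant features give zero
```

A larger value indicates stronger (linear) statistical dependence between the two feature sets, which is why such a criterion can serve as an alignment objective.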

[272] PLUS: Plug-and-Play Enhanced Liver Lesion Diagnosis Model on Non-Contrast CT Scans eess.IV | cs.CV

Jiacheng Hao, Xiaoming Zhang, Wei Liu, Xiaoli Yin, Yuan Gao

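TL;DR: PLUS is a plug-and-play framework that enhances liver lesion diagnosis on non-contrast CT (NCCT) scans and significantly improves the ability of existing 3D segmentation models to distinguish malignant from benign lesions.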

Details

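Motivation: Current 3D segmentation methods for liver lesion diagnosis rely on contrast-enhanced CT or multimodal imaging and cannot use the more common NCCT data to distinguish malignant from benign lesions. PLUS aims to fill this gap and increase the clinical value of NCCT.

Result: In experiments on 8,651 patients, PLUS improved the lesion-level F1 score by 5.66%, the malignant patient-level F1 score by 6.26%, and the benign patient-level F1 score by 4.03%.

Insight: PLUS demonstrates the potential of NCCT for efficient liver lesion screening without relying on expensive or complex imaging modalities, offering a more economical and convenient clinical solution.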

Abstract: Focal liver lesions (FLL) are common clinical findings during physical examination. Early diagnosis and intervention of liver malignancies are crucial to improving patient survival. Although the current 3D segmentation paradigm can accurately detect lesions, it faces limitations in distinguishing between malignant and benign liver lesions, primarily due to its inability to differentiate subtle variations between different lesions. Furthermore, existing methods predominantly rely on specialized imaging modalities such as multi-phase contrast-enhanced CT and magnetic resonance imaging, whereas non-contrast CT (NCCT) is more prevalent in routine abdominal imaging. To address these limitations, we propose PLUS, a plug-and-play framework that enhances FLL analysis on NCCT images for arbitrary 3D segmentation models. In extensive experiments involving 8,651 patients, PLUS demonstrated a significant improvement over existing methods, improving the lesion-level F1 score by 5.66%, the malignant patient-level F1 score by 6.26%, and the benign patient-level F1 score by 4.03%. Our results demonstrate the potential of PLUS to substantially improve malignant FLL screening using widely available NCCT imaging.
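
The F1 scores reported here are computed separately for the malignant and benign classes. As a reminder of what the metric measures, the sketch below computes a per-class F1 from binary labels; the function name, the label coding (1 = malignant, 0 = benign), and the toy predictions are assumptions for illustration and say nothing about how the paper aggregates lesions into patient-level decisions.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical patient-level labels: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
f1_malignant = f1_score(y_true, y_pred, positive=1)
f1_benign = f1_score(y_true, y_pred, positive=0)
print(f1_malignant, f1_benign)
```

Reporting F1 per class, as the abstract does, makes it visible when a model gains on one class at the expense of the other.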

[273] Grid-Reg: Grid-Based SAR and Optical Image Registration Across Platforms eess.IV | cs.CV

Xiaochen Wei, Weiwei Guo, Zenghui Zhang, Wenxian Yu

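TL;DR: This paper proposes Grid-Reg, a novel grid-based multimodal image registration framework for cross-platform registration (airborne SAR with spaceborne optical images). It uses a global matching loss and a grid-based solver to overcome the limitations of traditional methods under geometric and radiometric differences, and it outperforms existing techniques.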

Details

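Motivation: Registering airborne SAR with spaceborne optical images is challenging because of the large geometric and radiometric differences between them, and traditional methods struggle with this kind of heterogeneous image registration.

Result: On a self-built dataset, Grid-Reg outperforms existing techniques, confirming its registration accuracy and robustness.

Insight: 1. A global matching loss suits heterogeneous image registration better than keypoint correspondences; 2. A grid-based coarse-to-fine strategy handles large geometric deformations effectively; 3. Modality-invariant feature extraction and correlation learning are key to registration.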

Abstract: Registering airborne SAR with spaceborne optical images is crucial for SAR image interpretation and geo-localization. This cross-platform heterogeneous image registration is challenging due to significant geometric and radiometric differences, which current methods fail to handle. To tackle these challenges, we propose a novel grid-based multimodal registration framework (Grid-Reg) across airborne and spaceborne platforms, including a new domain-robust descriptor extraction network, Hybrid Siamese Correlation Metric Learning Network (HSCMLNet), and a grid-based solver (Grid-Solver) for transformation parameter estimation. Grid-Reg is based on a detector-free, global matching loss rather than accurate keypoint correspondences, which are inherently difficult to obtain in heterogeneous images with large geometric deformation. With Grid-Solver, Grid-Reg estimates transformation parameters by optimizing robust, global matching loss-based patch correspondences of whole images in a coarse-to-fine strategy. To robustly calculate the similarity between patches, especially those containing noise and changed objects, we propose HSCMLNet, including a hybrid Siamese module to extract high-level features of multimodal images and a correlation learning module (CMLModule) based on equiangular unit basis vectors (EUBVs). Moreover, we propose a manifold loss, EUBVsLoss, to constrain the normalized correlation between local embeddings of patches and EUBVs. Furthermore, we curate a new challenging benchmark dataset of SAR-to-optical registration using real-world UAV MiniSAR data and optical images from Google Earth. We extensively analyze factors affecting registration accuracy and compare our method with state-of-the-art techniques on this dataset, showing superior performance.

[274] Surg-SegFormer: A Dual Transformer-Based Model for Holistic Surgical Scene Segmentation eess.IV | cs.AI | cs.CV

Fatimaelzahraa Ahmed, Muraam Abdel-Ghani, Muhammad Arsalan, Mahmoud Ali, Abdulaziz Al-Ali

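TL;DR: Surg-SegFormer is a dual-transformer-based model for holistic surgical scene segmentation that aims to address the difficulty of real-time explanation during surgery and performs strongly on the EndoVis datasets.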

Details

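Motivation: Surgical scene segmentation is crucial in robot-assisted surgery, but existing methods rely on user prompts and are impractical for long videos. Expert resources are limited, so automated tools are needed to reduce the teaching burden.

Result: The model achieves an mIoU of 0.80 on EndoVis2018 and 0.54 on EndoVis2017.

Insight: Automated segmentation models can effectively reduce the burden on experts and improve trainees' ability to learn independently, providing a new tool for surgical education.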

Abstract: Holistic surgical scene segmentation in robot-assisted surgery (RAS) enables surgical residents to identify various anatomical tissues, articulated tools, and critical structures, such as veins and vessels. Given the firm intraoperative time constraints, it is challenging for surgeons to provide detailed real-time explanations of the operative field for trainees. This challenge is compounded by the scarcity of expert surgeons relative to trainees, making the unambiguous delineation of go- and no-go zones inconvenient. Therefore, high-performance semantic segmentation models offer a solution by providing clear postoperative analyses of surgical procedures. However, recent advanced segmentation models rely on user-generated prompts, rendering them impractical for lengthy surgical videos that commonly exceed an hour. To address this challenge, we introduce Surg-SegFormer, a novel prompt-free model that outperforms current state-of-the-art techniques. Surg-SegFormer attained a mean Intersection over Union (mIoU) of 0.80 on the EndoVis2018 dataset and 0.54 on the EndoVis2017 dataset. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons, empowering residents to independently and effectively understand complex surgical environments.
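
The mIoU figures above average the per-class Intersection over Union between predicted and ground-truth label maps. The following sketch computes it on flattened label lists; the function name and the toy labels are illustrative assumptions, and benchmark protocols such as EndoVis may differ in details (for example, averaging per frame or handling classes that are absent from an image).

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical flattened label maps with three classes (0, 1, 2).
pred = [0, 0, 1, 1, 2, 2, 2, 0]
target = [0, 1, 1, 1, 2, 2, 0, 0]
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

In this toy case the per-class IoUs are 1/2, 2/3, and 2/3, so the mean is about 0.611.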

[275] CLIP-RL: Surgical Scene Segmentation Using Contrastive Language-Vision Pretraining & Reinforcement Learning eess.IV | cs.AI | cs.CV | cs.LG

Fatmaelzahraa Ali Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Khalid Al-Jalham, Shidin Balakrishnan

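TL;DR: The paper proposes CLIP-RL, a new method that combines contrastive learning with reinforcement learning for semantic segmentation of surgical scenes; it dynamically refines segmentation masks and significantly improves performance.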

Details

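Motivation: Accurate segmentation of surgical scenes is essential for improving the quality of care. Existing methods perform poorly under complex conditions such as occlusions, texture variations, and dynamic lighting. CLIP-RL combines contrastive learning with reinforcement learning to address these problems.

Result: The model achieves a mean IoU of 81% on EndoVis 2018 and 74.12% on EndoVis 2017, outperforming existing methods.

Insight: 1. Combining contrastive learning with reinforcement learning can significantly improve segmentation in complex scenes. 2. Dynamic refinement strategies such as reinforcement learning perform well under occlusion and changing illumination.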

Abstract: Understanding surgical scenes can provide better healthcare quality for patients, especially with the vast amount of video data that is generated during minimally invasive surgery (MIS). Processing these videos generates valuable assets for training sophisticated models. In this paper, we introduce CLIP-RL, a novel contrastive language-image pre-training model tailored for semantic segmentation of surgical scenes. CLIP-RL presents a new segmentation approach that involves reinforcement learning and curriculum learning, enabling continuous refinement of the segmentation masks during the full training pipeline. Our model has shown robust performance under challenging optical conditions such as occlusions, texture variations, and dynamic lighting. The CLIP model serves as a powerful feature extractor, capturing rich semantic context that enhances the distinction between instruments and tissues. The RL module plays a pivotal role in dynamically refining predictions through iterative action-space adjustments. We evaluated CLIP-RL on the EndoVis 2018 and EndoVis 2017 datasets. CLIP-RL achieved a mean IoU of 81% on EndoVis 2018, outperforming state-of-the-art models, and a mean IoU of 74.12% on EndoVis 2017. This superior performance was achieved due to the combination of contrastive learning with reinforcement learning and curriculum learning.

[276] ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition eess.IV | cs.CV

You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo

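TL;DR: The paper introduces ViTaL, a new multimodal ovarian tumor dataset combining visual, tabular, and linguistic data, and proposes ViTaL-Net, which uses a Triplet Hierarchical Offset Attention Mechanism for multi-pathology classification.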

Details

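Motivation: Early identification of ovarian tumors is vital to women's health, but the lack of public datasets limits the application of deep learning in this area.

Result: ViTaL-Net exceeds 90% accuracy on the two most common pathological types and achieves 85% overall.

Insight: Fusing multimodal data can significantly improve ovarian tumor classification, and the THOAM mechanism performs well on multimodal tasks.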

Abstract: Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called ViTaL that contains Visual, Tabular and Linguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumor, we propose a ViTaL-Net based on the Triplet Hierarchical Offset Attention Mechanism (THOAM) to minimize the loss incurred during feature fusion of multi-modal data. This mechanism could effectively enhance the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors. In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90% on the two most common pathological types of ovarian tumor and an overall performance of 85%. Our dataset and code are available at https://github.com/GGbond-study/vitalnet.

[277] FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging eess.IV | cs.CV

Xin You, Runze Yang, Chuyan Zhang, Zhongliang Jiang, Jie Yang

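TL;DR: FB-Diff is a Fourier basis-guided diffusion model for temporal interpolation of 4D medical images; by modeling the nonlinear nature of respiratory motion, it significantly improves interpolation quality and consistency.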

Details

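Motivation: Existing methods assume linear motion, but real respiratory motion is nonlinear and quasi-periodic. FB-Diff takes a frequency-domain perspective to model this property better.

Result: Experiments show that FB-Diff achieves state-of-the-art perceptual quality and temporal consistency while maintaining good reconstruction metrics.

Insight: Frequency-domain modeling can effectively capture the characteristics of respiratory motion, offering a new approach to medical image interpolation.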

Abstract: The temporal interpolation task for 4D medical imaging plays a crucial role in the clinical practice of respiratory motion modeling. Following the simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motions should be nonlinear and quasi-periodic with specific frequencies. Motivated by this property, we resolve the temporal interpolation task from the frequency perspective, and propose a Fourier basis-guided Diffusion model, termed FB-Diff. Specifically, due to the regular motion discipline of respiration, physiological motion priors are introduced to describe general characteristics of temporal data distributions. Then a Fourier motion operator is elaborately devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of a Variational Autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on starting and ending frames, the diffusion model further leverages well-learned Fourier bases via the basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Codes are available.

[278] Comprehensive Modeling of Camera Spectral and Color Behavior eess.IV | cs.CV

Sanush K Abeysekera, Ye Chow Kuang, Melanie Po-Leen Ooi

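TL;DR: This paper proposes a novel end-to-end model of the spectral response of RGB digital cameras, filling a gap in existing research on modeling the interaction between light input and pixel intensity output.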

Details

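Motivation: There is currently no camera spectral response model that comprehensively considers the end-to-end interaction between light input and pixel intensity output, even though such a model is essential for accurate color and spectral data processing.

Result: The model significantly improves color fidelity and spectral accuracy and shows broad application potential in machine vision, remote sensing, and spectral imaging.

Insight: Accurate modeling of spectral response matters for optimizing camera systems in scientific, industrial, and creative domains, especially where spectral precision is required.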

Abstract: The spectral response of a digital camera defines the mapping between scene radiance and pixel intensity. Despite its critical importance, there is currently no comprehensive model that considers the end-to-end interaction between light input and pixel intensity output. This paper introduces a novel technique to model the spectral response of an RGB digital camera, addressing this gap. Such models are indispensable for applications requiring accurate color and spectral data interpretation. The proposed model is tested across diverse imaging scenarios by varying illumination conditions and is validated against experimental data. Results demonstrate its effectiveness in improving color fidelity and spectral accuracy, with significant implications for applications in machine vision, remote sensing, and spectral imaging. This approach offers a powerful tool for optimizing camera systems in scientific, industrial, and creative domains where spectral precision is paramount.

[279] A Deep Unfolding Framework for Diffractive Snapshot Spectral Imaging eess.IV | cs.CV

Zhengyue Zhuge, Jiahui Xu, Shiqi Chen, Hao Xu, Yueting Chen

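TL;DR: This paper proposes Diffractive Deep Unfolding (DDU), an efficient deep unfolding framework for the reconstruction problem in diffractive snapshot spectral imaging (DSSI). The framework solves the data fidelity term analytically and adds a network-based initialization strategy, significantly improving reconstruction stability and performance.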

Details

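Motivation: Diffractive snapshot spectral imaging is promising for acquiring spectral data, but its reconstruction algorithms are under-studied, and existing methods are not fully compatible because of the different optical encoding mechanism. An efficient reconstruction framework compatible with DSSI systems is therefore needed.

Result: Experiments show that the DDU framework outperforms existing methods while keeping a similar parameter count and computational complexity.

Insight: 1. Deep unfolding frameworks are promising for DSSI; 2. The network-based initialization strategy is crucial for handling the ill-posed problem; 3. Solving the data fidelity term analytically is key to improving efficiency.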

Abstract: Snapshot hyperspectral imaging systems acquire spectral data cubes through compressed sensing. Recently, diffractive snapshot spectral imaging (DSSI) methods have attracted significant attention. While various optical designs and improvements continue to emerge, research on reconstruction algorithms remains limited. Although numerous networks and deep unfolding methods have been applied to similar tasks, they are not fully compatible with DSSI systems because of their distinct optical encoding mechanism. In this paper, we propose an efficient deep unfolding framework for diffractive systems, termed diffractive deep unfolding (DDU). Specifically, we derive an analytical solution for the data fidelity term in DSSI, ensuring both the efficiency and the effectiveness during the iterative reconstruction process. Given the severely ill-posed nature of the problem, we employ a network-based initialization strategy rather than non-learning-based methods or linear layers, leading to enhanced stability and performance. Our framework demonstrates strong compatibility with existing state-of-the-art (SOTA) models, which effectively address the initialization and prior subproblems. Extensive experiments validate the superiority of the proposed DDU framework, showcasing improved performance while maintaining comparable parameter counts and computational complexity. These results suggest that DDU provides a solid foundation for future unfolding-based methods in DSSI.

[280] SPIDER: Structure-Preferential Implicit Deep Network for Biplanar X-ray Reconstruction eess.IV | cs.CV

Tianqi Yu, Xuanyu Tian, Jiawen Yang, Dongming He, Jingyi Yu

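TL;DR: SPIDER is a supervised learning framework for reconstructing CT volumes from two orthogonal X-ray images. By incorporating prior knowledge of tissue structure, SPIDER improves structural continuity and reduces soft-tissue artifacts.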

Details

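Motivation: Existing biplanar X-ray methods produce 3D reconstructions with incomplete bone structures, imprecise tissue boundaries, and insufficient anatomical realism, which limits their clinical use.

Result: Validation on clinical head CT datasets shows that SPIDER generates anatomically accurate 3D reconstructions and performs well on downstream segmentation tasks.

Insight: Supervised learning combined with anatomical priors can significantly improve the accuracy and clinical applicability of biplanar X-ray reconstruction.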

Abstract: Biplanar X-ray imaging is widely used in health screening, postoperative rehabilitation evaluation of orthopedic diseases, and injury surgery due to its rapid acquisition, low radiation dose, and straightforward setup. However, 3D volume reconstruction from only two orthogonal projections represents a profoundly ill-posed inverse problem, owing to the intrinsic lack of depth information and irreducible ambiguities in soft-tissue visualization. Although some existing methods can reconstruct skeletal structures and Computed Tomography (CT) volumes, they often yield incomplete bone geometry, imprecise tissue boundaries, and a lack of anatomical realism, thereby limiting their clinical utility in scenarios such as surgical planning and postoperative assessment. In this study, we introduce SPIDER, a novel supervised framework designed to reconstruct CT volumes from biplanar X-ray images. SPIDER incorporates tissue structure as a prior (e.g., anatomical segmentation) into an implicit neural representation decoder in the form of joint supervision through a unified encoder-decoder architecture. This design enables the model to jointly learn image intensities and anatomical structures in a pixel-aligned fashion. To address the challenges posed by sparse input and structural ambiguity, SPIDER directly embeds anatomical constraints into the reconstruction process, thereby enhancing structural continuity and reducing soft-tissue artifacts. We conduct comprehensive experiments on clinical head CT datasets and show that SPIDER generates anatomically accurate reconstructions from only two projections. Furthermore, our approach demonstrates strong potential in downstream segmentation tasks, underscoring its utility in personalized treatment planning and image-guided surgical navigation.

[281] Efficacy of Image Similarity as a Metric for Augmenting Small Dataset Retinal Image Segmentation eess.IV | cs.CV

Thomas Wallace, Ik Siong Heng, Senad Subasic, Chris Messenger

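TL;DR: This paper examines how well FID works as a metric for evaluating synthetic image quality, and uses PGGAN-generated synthetic images to augment small-dataset retinal image segmentation. It finds that highly similar (low-FID) datasets significantly improve U-Net performance and that synthetic data is more effective than standard augmentation.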

Details

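Motivation: Medical imaging data is usually limited. Synthetic images are one way to augment datasets, but a clear metric for evaluating their effectiveness is lacking. The paper aims to verify whether FID can serve as an indicator of synthetic image quality and to examine its effect on small-dataset segmentation tasks.

Result: Experiments show that low-FID datasets significantly improve segmentation performance and that synthetic data is more effective than standard augmentation. High-FID datasets contribute no significant improvement.

Insight: Image similarity (FID) is an important indicator of how effective synthetic-image augmentation will be, but highly dissimilar (high-FID) data may be ineffective; this may guide the design of future medical image augmentation methods.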

Abstract: Synthetic images are an option for augmenting limited medical imaging datasets to improve the performance of various machine learning models. A common metric for evaluating synthetic image quality is the Fréchet Inception Distance (FID) which measures the similarity of two image datasets. In this study we evaluate the relationship between this metric and the improvement which synthetic images, generated by a Progressively Growing Generative Adversarial Network (PGGAN), grant when augmenting Diabetes-related Macular Edema (DME) intraretinal fluid segmentation performed by a U-Net model with limited amounts of training data. We find that the behaviour of augmenting with standard and synthetic images agrees with previously conducted experiments. Additionally, we show that dissimilar (high FID) datasets do not improve segmentation significantly. As FID between the training and augmenting datasets decreases, the augmentation datasets are shown to contribute to significant and robust improvements in image segmentation. Finally, we find that there is significant evidence to suggest that synthetic and standard augmentations follow separate log-normal trends between FID and improvements in model performance, with synthetic data proving more effective than standard augmentation techniques. Our findings show that more similar datasets (lower FID) will be more effective at improving U-Net performance, however, the results also suggest that this improvement may only occur when images are sufficiently dissimilar.
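
To make the FID discussion concrete, the sketch below computes the Fréchet distance between two Gaussians fitted to feature vectors, under the simplifying assumption of diagonal covariances. This is not the full FID: the real metric uses Inception-network features and full covariance matrices (which requires a matrix square root), and the population variance used here is a further simplification. The function names and the toy features are assumptions for illustration.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dims = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dims)]
    return mean, var


def frechet_distance_diag(feats_a, feats_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    d^2 = ||mu_a - mu_b||^2 + sum_d (sigma_a[d] - sigma_b[d])^2
    """
    mu_a, var_a = mean_and_var(feats_a)
    mu_b, var_b = mean_and_var(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term


# Hypothetical 2-D features for a real set and two candidate augmentation sets.
real = [[0.0, 1.0], [2.0, 3.0]]
similar = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[3.0, 1.0], [5.0, 3.0]]
print(frechet_distance_diag(real, similar))  # identical statistics give 0
print(frechet_distance_diag(real, shifted))  # a mean shift of 3 in one dimension gives 9
```

Lower values mean the two feature distributions are more alike, which is the sense in which the abstract calls low-FID datasets "more similar".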

[282] MurreNet: Modeling Holistic Multimodal Interactions Between Histopathology and Genomic Profiles for Survival Prediction eess.IV | cs.CV

Mingxin Liu, Chengfei Cai, Jun Li, Pengbo Xu, Jinze Li

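TL;DR: MurreNet proposes a multimodal representation decoupling network that integrates pathology images and genomic data through modality representation decomposition and a deep orthogonal fusion strategy, significantly improving the accuracy of cancer survival prediction.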

Details

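Motivation: Existing methods fail to fully capture inter- and intra-modality interactions when fusing multimodal features, which limits predictive performance. This paper aims to solve the problem with finer-grained modality decoupling and fusion strategies.

Result: Experiments on six TCGA cancer cohorts show that MurreNet achieves state-of-the-art survival prediction performance.

Insight: Explicitly decoupling and optimizing multimodal representations captures complex interactions more effectively and improves prediction performance.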

Abstract: Cancer survival prediction requires integrating pathological Whole Slide Images (WSIs) and genomic profiles, a challenging task due to the inherent heterogeneity and the complexity of modeling both inter- and intra-modality interactions. Current methods often employ straightforward fusion strategies for multimodal feature integration, failing to comprehensively capture modality-specific and modality-common interactions, resulting in a limited understanding of multimodal correlations and suboptimal predictive performance. To mitigate these limitations, this paper presents a Multimodal Representation Decoupling Network (MurreNet) to advance cancer survival analysis. Specifically, we first propose a Multimodal Representation Decomposition (MRD) module to explicitly decompose paired input data into modality-specific and modality-shared representations, thereby reducing redundancy between modalities. Furthermore, the disentangled representations are refined and then updated through a novel training regularization strategy that imposes constraints on distributional similarity, difference, and representativeness of modality features. Finally, the augmented multimodal features are integrated into a joint representation via the proposed Deep Holistic Orthogonal Fusion (DHOF) strategy. Extensive experiments conducted on six TCGA cancer cohorts demonstrate that our MurreNet achieves state-of-the-art (SOTA) performance in survival prediction.

[283] Sequential Attention-based Sampling for Histopathological Analysis eess.IV | cs.AI | cs.CV

Tarun G, Naman Malpani, Gugan Thoppe, Sridharan Devarajan

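TL;DR: SASHA is a deep reinforcement learning method that extracts features with a lightweight hierarchical attention-based multiple instance learning model and intelligently samples 10-20% of the high-resolution pathology image patches, achieving efficient and reliable diagnosis.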

Details

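Motivation: Whole-slide images (WSIs) are enormous, so analyzing them directly at high resolution is computationally expensive, and diagnostic labels are usually available only at the slide level. SASHA aims to address these problems by analyzing pathology images efficiently through selective sampling and attention mechanisms.

Result: SASHA performs on par with methods that analyze the full slide at high resolution while significantly reducing computational and memory costs, and it outperforms other sparse sampling methods.

Insight: Through intelligent sampling and attention mechanisms, SASHA provides an efficient solution for large-scale medical image diagnosis, especially where informative features are sparsely distributed.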

Abstract: Deep neural networks are increasingly applied for automated histopathology. Yet, whole-slide images (WSIs) are often acquired at gigapixel sizes, rendering it computationally infeasible to analyze them entirely at high resolution. Diagnostic labels are largely available only at the slide-level, because expert annotation of images at a finer (patch) level is both laborious and expensive. Moreover, regions with diagnostic information typically occupy only a small fraction of the WSI, making it inefficient to examine the entire slide at full resolution. Here, we propose SASHA – Sequential Attention-based Sampling for Histopathological Analysis – a deep reinforcement learning approach for efficient analysis of histopathological images. First, SASHA learns informative features with a lightweight hierarchical, attention-based multiple instance learning (MIL) model. Second, SASHA samples intelligently and zooms selectively into a small fraction (10-20%) of high-resolution patches, to achieve reliable diagnosis. We show that SASHA matches state-of-the-art methods that analyze the WSI fully at high-resolution, albeit at a fraction of their computational and memory costs. In addition, it significantly outperforms competing, sparse sampling methods. We propose SASHA as an intelligent sampling model for medical imaging challenges that involve automated diagnosis with exceptionally large images containing sparsely informative features.
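To make the idea of attention-based MIL followed by selective zooming more concrete, here is a minimal sketch. It is not SASHA's model: the pooling follows the generic attention-MIL form (softmax over per-patch scores), the parameters `V` and `w` are random placeholders, and `top_fraction` is a hypothetical stand-in for the learned reinforcement learning sampler described in the abstract.

```python
import numpy as np

def attention_mil_pool(patch_feats, V, w):
    """Generic attention-based MIL pooling: weight each patch, then sum into one slide vector."""
    scores = np.tanh(patch_feats @ V) @ w          # one scalar score per patch
    scores = scores - scores.max()                 # stabilise the softmax
    attn = np.exp(scores) / np.exp(scores).sum()   # weights sum to 1 over patches
    return attn @ patch_feats, attn

def top_fraction(attn, fraction=0.2):
    """Indices of the highest-attention patches (hypothetical stand-in for a learned sampler)."""
    k = max(1, int(round(fraction * attn.size)))
    return np.argsort(attn)[::-1][:k]

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))   # 50 low-resolution patches, 16-dim features (made-up sizes)
V = rng.normal(size=(16, 8))        # placeholder attention parameters, random here
w = rng.normal(size=8)

slide_vec, attn = attention_mil_pool(feats, V, w)
zoom_in = top_fraction(attn, 0.2)   # the 20% of patches that would be revisited at high resolution
```

A greedy top-k pick is used here only because it is easy to follow; the paper's sampler is sequential and learned, which this sketch does not attempt to reproduce.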

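[284] Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos eess.IV | cs.CV

Yingyu Yang, Qianye Yang, Kangning Cui, Can Peng, Elena D’Alberti

TL;DR: The paper proposes an unsupervised framework that learns latent motion trajectories from echocardiography videos through self-supervision and uses them to detect cardiac phase without manual annotation.

Details

Motivation: Cardiac phase detection is a key step in analyzing and diagnosing cardiac function, but existing automatic methods typically require extensive annotation, which is time-consuming and labor-intensive. The authors therefore aim for an annotation-free, unsupervised method.

Result: On the EchoNet-Dynamic benchmark, the MAE is 3 frames (58.3 ms) for ED detection and 2 frames (38.8 ms) for ES detection; on fetal echocardiography, the MAE is 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES.

Insight: The results show the potential of the latent motion trajectory strategy for unsupervised cardiac motion analysis and offer a scalable solution for clinical studies that lack annotated data.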

Abstract: The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotations, which is time-consuming and labor-intensive. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations, including ED and ES indices, segmentation, or volumetric measurements, by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with MAE 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES, despite the fact that the fetal heart model is built using non-standardized heart views due to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code will be released at https://github.com/YingyuYyy/CardiacPhase.
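The metric quoted above can be illustrated with a small worked example. The frame indices and the 50 fps frame rate below are made-up values chosen for easy arithmetic, not data from the paper; the abstract's millisecond figures are not reproduced here, since they presumably depend on per-video frame rates (an assumption).

```python
def mae_frames(pred, true):
    """Mean absolute error between predicted and reference frame indices."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def frames_to_ms(n_frames, fps):
    """Convert an error in frames to milliseconds for a given frame rate."""
    return n_frames * 1000.0 / fps

pred_ed = [12, 40, 71, 99]     # hypothetical predicted end-diastole frames
true_ed = [10, 43, 70, 104]    # hypothetical reference frames

err = mae_frames(pred_ed, true_ed)       # (2 + 3 + 1 + 5) / 4 = 2.75 frames
err_ms = frames_to_ms(err, fps=50.0)     # 2.75 * 1000 / 50 = 55.0 ms
```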

eess.SP [Back]

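[285] Differentiable High-Performance Ray Tracing-Based Simulation of Radio Propagation with Point Clouds eess.SP | cs.CV

Niklas Vaara, Pekka Sangi, Miguel Bordallo López, Janne Heikkilä

TL;DR: The paper proposes a differentiable ray tracing-based radio propagation simulator that operates directly on point clouds, simulates multipath propagation efficiently, and shows how semantic segmentation labels can be used to learn the electromagnetic properties of the environment.

Details

Motivation: Ray tracing is a widely used deterministic method for radio propagation simulation, but its accuracy depends on the quality of the environment model and its electromagnetic properties. Recent advances in computer vision and machine learning make it possible to reconstruct detailed environment models with semantic segmentation labels.

Result: In two indoor scenarios, the simulator handles multipath propagation with up to five interactions, with each simulation completing in less than 90 ms.

Insight: Differentiability lets electromagnetic computations be combined with segmentation labels, opening a way to learn the environment's electromagnetic properties, and the work shows the potential of point clouds for radio propagation simulation.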

Abstract: Ray tracing is a widely used deterministic method for radio propagation simulations, capable of producing physically accurate multipath components. The accuracy depends on the quality of the environment model and its electromagnetic properties. Recent advances in computer vision and machine learning have made it possible to reconstruct detailed environment models augmented with semantic segmentation labels. In this letter, we propose a differentiable ray tracing-based radio propagation simulator that operates directly on point clouds. We showcase the efficiency of our method by simulating multi-bounce propagation paths with up to five interactions with specular reflections and diffuse scattering in two indoor scenarios, each completing in less than 90 ms. Lastly, we demonstrate how the differentiability of electromagnetic computations can be combined with segmentation labels to learn the electromagnetic properties of the environment.
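As a reminder of the geometric core of a specular interaction, the sketch below applies the standard mirror-reflection formula r = d - 2(d·n)n to a ray hitting a surface point with a known normal. It is only the textbook formula; how the paper estimates normals from point clouds, handles diffuse scattering, or differentiates through the computation is not shown, and the example vectors are arbitrary.

```python
import numpy as np

def reflect(direction, normal):
    """Mirror a ray direction about a surface normal: r = d - 2 (d . n) n."""
    n = normal / np.linalg.norm(normal)            # make sure the normal is unit length
    return direction - 2.0 * np.dot(direction, n) * n

d = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)      # incoming ray, 45 degrees onto the floor
n = np.array([0.0, 1.0, 0.0])                      # floor normal pointing up
r = reflect(d, n)                                  # outgoing ray, same angle on the other side
```

Chaining this step up to five times along a path would give the kind of multi-bounce specular path the abstract mentions.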

cs.RO [Back]

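[286] AutoLayout: Closed-Loop Layout Synthesis via Slow-Fast Collaborative Reasoning cs.RO | cs.CV

Weixing Chen, Dafeng Chi, Yang Liu, Yuxi Yang, Yexin Zhang

TL;DR: AutoLayout is a fully automated layout generation method that uses slow-fast collaborative reasoning and a self-validation mechanism to address the spatial hallucination of existing methods, improving the physical plausibility and semantic consistency of generated layouts.

Details

Motivation: Current layout generation methods suffer from spatial hallucination, such as overlapping or floating objects, and struggle to balance semantic fidelity with physical plausibility. AutoLayout targets these problems with a closed-loop self-validation mechanism.

Result: Across 8 scenarios, AutoLayout improves on existing methods by 10.1% in physical plausibility, semantic consistency, and functional completeness.

Insight: With closed-loop self-validation and slow-fast collaborative reasoning, AutoLayout shows how deep learning and rules can be combined to improve layout generation, offering a new direction for the design of automated systems.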

Abstract: The automated generation of layouts is vital for embodied intelligence and autonomous systems, supporting applications from virtual environment construction to home robot deployment. Current approaches, however, suffer from spatial hallucination and struggle with balancing semantic fidelity and physical plausibility, often producing layouts with deficits such as floating or overlapping objects and misaligned stacking relations. In this paper, we propose AutoLayout, a fully automated method that integrates a closed-loop self-validation process within a dual-system framework. Specifically, a slow system harnesses detailed reasoning with a Reasoning-Reflection-Generation (RRG) pipeline to extract object attributes and spatial constraints. Then, a fast system generates discrete coordinate sets and a topological relation set that are jointly validated. To mitigate the limitations of handcrafted rules, we further introduce an LLM-based Adaptive Relation Library (ARL) for generating and evaluating layouts. Through the implementation of Slow-Fast Collaborative Reasoning, AutoLayout efficiently generates layouts after thorough deliberation, effectively mitigating spatial hallucination. Its self-validation mechanism establishes a closed-loop process that iteratively corrects potential errors, achieving a balance between physical stability and semantic consistency. The effectiveness of AutoLayout was validated across 8 distinct scenarios, where it demonstrated a significant 10.1% improvement over SOTA methods in terms of physical plausibility, semantic consistency, and functional completeness.

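[287] EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling cs.RO | cs.AI | cs.CV

Boyuan Wang, Xinpan Meng, Xiaofeng Wang, Zheng Zhu, Angen Ye

TL;DR: The paper proposes EmbodieDreamer, a framework that uses PhysAligner to optimize physical parameters and VisAligner to improve visual realism, narrowing the Real2Sim2Real gap and improving robot policy training.

Details

Motivation: Collecting real-world data is costly and inefficient, so simulation has become an important substitute for training robot policies, but the Real2Sim2Real gap in physical dynamics and visual appearance remains a key bottleneck.

Result: PhysAligner reduces physical parameter estimation error by 3.74% and improves optimization speed by 89.91%; training in the environments generated with VisAligner raises the average task success rate after reinforcement learning by 29.17%.

Insight: By optimizing the physics and visual modules together, EmbodieDreamer markedly improves the alignment between simulation and the real world and provides an efficient route to robot policy training.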

Abstract: The rapid advancement of Embodied AI has led to an increasing demand for large-scale, high-quality real-world data. However, collecting such embodied data remains costly and inefficient. As a result, simulation environments have become a crucial surrogate for training robot policies. Yet, the significant Real2Sim2Real gap remains a critical bottleneck, particularly in terms of physical dynamics and visual appearance. To address this challenge, we propose EmbodieDreamer, a novel framework that reduces the Real2Sim2Real gap from both the physics and appearance perspectives. Specifically, we propose PhysAligner, a differentiable physics module designed to reduce the Real2Sim physical gap. It jointly optimizes robot-specific parameters such as control gains and friction coefficients to better align simulated dynamics with real-world observations. In addition, we introduce VisAligner, which incorporates a conditional video diffusion model to bridge the Sim2Real appearance gap by translating low-fidelity simulated renderings into photorealistic videos conditioned on simulation states, enabling high-fidelity visual transfer. Extensive experiments validate the effectiveness of EmbodieDreamer. The proposed PhysAligner reduces physical parameter estimation error by 3.74% compared to simulated annealing methods while improving optimization speed by 89.91%. Moreover, training robot policies in the generated photorealistic environment leads to a 29.17% improvement in the average task success rate across real-world tasks after reinforcement learning. Code, model and data will be publicly available.

Qucheng Peng, Chen Bai, Guoxiang Zhang, Bo Xu, Xiaotong Liu

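TL;DR: The paper introduces the NavigScene dataset, which uses navigation-guided natural language to bridge the gap between local perception and global navigation in autonomous driving systems, and proposes three complementary approaches to improve vision-language models for driving.

Details

Motivation: Existing autonomous driving systems have made progress on processing local visual information but lack the global navigation ability of human drivers. NavigScene targets this gap.

Result: Experiments show that the proposed methods substantially improve performance on perception, prediction, planning, and question-answering tasks and improve adaptation to complex, unfamiliar environments.

Insight: Adding navigation information markedly strengthens the system's global understanding and makes autonomous driving in complex scenes more reliable.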

Abstract: Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang

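TL;DR: StreamVLN is a streaming vision-and-language navigation (VLN) framework whose slow-fast context modeling strategy addresses the trade-off that current Video-LLM-based methods face among fine-grained visual understanding, long-term context modeling, and computational efficiency.

Details

Motivation: In real-world settings, processing a continuous visual stream in real time and generating low-latency actions grounded in language instructions is the core challenge of VLN. Existing methods typically trade off fine-grained visual understanding, long-term context modeling, and computational efficiency against one another.

Result: On the VLN-CE benchmarks, StreamVLN achieves state-of-the-art performance while maintaining low latency, demonstrating robustness and efficiency for real-world deployment.

Insight: The slow-fast context design supports efficient processing of long visual streams and uses a sliding window and token pruning to balance computational efficiency with low latency, offering a new direction for real-time VLN.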

Abstract: Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.
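The slow-fast split can be pictured with a toy data structure: recent turns are kept in full inside a sliding window, and turns that fall out of the window are compressed before being stored. Everything below is a hypothetical sketch; in particular, keeping every fourth token is a placeholder for the paper's 3D-aware pruning, and the class name `SlowFastContext` is invented for illustration.

```python
from collections import deque

class SlowFastContext:
    """Toy slow-fast context: full recent turns plus pruned older turns."""

    def __init__(self, window=3, keep_every=4):
        self.fast = deque()           # recent turns, kept in full
        self.slow = []                # pruned tokens from evicted turns
        self.window = window
        self.keep_every = keep_every

    def add_turn(self, tokens):
        self.fast.append(list(tokens))
        if len(self.fast) > self.window:
            old = self.fast.popleft()
            # Placeholder pruning: keep every k-th token of the evicted turn.
            self.slow.extend(old[::self.keep_every])

    def context(self):
        """Tokens that would be fed to the model: slow memory first, then the window."""
        return self.slow + [t for turn in self.fast for t in turn]

ctx = SlowFastContext(window=2, keep_every=4)
for turn in range(5):
    ctx.add_turn([f"t{turn}_{i}" for i in range(8)])   # 8 dummy tokens per turn
```

After five turns of eight tokens, the window holds the last two turns in full (16 tokens) and the memory holds two tokens from each of the three evicted turns. In this toy version the memory still grows slowly with stream length; a real system would need an additional cap to keep the context bounded.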

cs.AI [Back]

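[290] LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents cs.AI | cs.CL | cs.LG | cs.SY | eess.SY

Anand Gokhale, Vaibhav Srivastava, Francesco Bullo

TL;DR: The paper proposes LTLCrit, an LLM critic based on linear temporal logic (LTL) that steers an LLM away from unsafe or inefficient behavior in long-horizon planning tasks, achieving safe and efficient decision making through a modular actor-critic architecture.

Details

Motivation: In long-horizon planning, LLM errors accumulate and can lead to unsafe or inefficient behavior, which limits their use in general-purpose settings. The paper aims to strengthen LLM planning with the guarantees of formal logic.

Result: On the Minecraft diamond-mining task, the system achieves a 100% completion rate and improves efficiency over baseline LLM planners.

Insight: Supervising LLMs through formal logic is a flexible and powerful approach that can improve both the safety and the efficiency of long-horizon planning.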

Abstract: Large language models (LLMs) have demonstrated promise in reasoning tasks and general decision-making in static environments. In long-term planning tasks, however, errors tend to accumulate, often leading to unsafe or inefficient behavior, limiting their use in general-purpose settings. We propose a modular actor-critic architecture in which an LLM actor is guided by LTLCrit, a trajectory-level LLM critic that communicates via linear temporal logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high-level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. The architecture supports both fixed, hand-specified safety constraints and adaptive, learned soft constraints that promote long-term efficiency. Our architecture is model-agnostic: any LLM-based planner can serve as the actor, and LTLCrit serves as a logic-generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LTLCrit to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. We evaluate our system on the Minecraft diamond-mining benchmark, achieving 100% completion rates and improving efficiency compared to baseline LLM planners. Our results suggest that enabling LLMs to supervise each other through logic is a powerful and flexible paradigm for safe, generalizable decision making.
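The shielding idea, where constraints filter the actor's choices and a critic checks whole trajectories, can be sketched for the simplest kind of safety rule, G(condition -> not action), meaning "whenever the condition holds, never take this action". The rule class, the state keys, and the Minecraft-flavoured example are all assumptions made for illustration; the paper's LTL fragment and rule-generation procedure are not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]

@dataclass
class Never:
    """Safety rule of the form G(condition -> not action)."""
    condition: Callable[[State], bool]
    action: str

def allowed_actions(state: State, candidates: List[str], rules: List[Never]) -> List[str]:
    """Shield: drop any candidate action that an active rule forbids in this state."""
    blocked = {r.action for r in rules if r.condition(state)}
    return [a for a in candidates if a not in blocked]

def violates(trajectory: List[Tuple[State, str]], rules: List[Never]) -> bool:
    """Trajectory-level check: True if any (state, action) step breaks a rule."""
    return any(r.condition(s) and a == r.action for s, a in trajectory for r in rules)

# Hypothetical rule: never try to mine diamond without a pickaxe.
rules = [Never(lambda s: not s["has_pickaxe"], "mine_diamond")]
state = {"has_pickaxe": False}
safe = allowed_actions(state, ["mine_diamond", "craft_pickaxe"], rules)
```

In the architecture described above, a critic would add new rules of this kind after analysing failed or slow trajectories, and the actor would only ever choose among the actions that survive the filter.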

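[291] Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky cs.AI | cs.CL | cs.LG

Ashutosh Hathidara, Julien Yu, Sebastian Schreiber

TL;DR: The paper presents DiaFORGE, a disambiguation-centric three-stage pipeline for fine-tuning open-source LLMs that substantially improves the success rate and safety of enterprise tool calling.

Details

Motivation: Existing LLMs often fail at enterprise API calls when tools are near-duplicates or required arguments are underspecified, so better fine-tuning methods are needed to improve their performance and reliability.

Result: On DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 percentage points over GPT-4o and by 49 percentage points over Claude-3.5-Sonnet.

Insight: Fine-tuning focused on disambiguation can markedly improve the usefulness and safety of LLMs in enterprise settings and gives reliable support for real deployments.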

Abstract: Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

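[292] Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models cs.AI | cs.CL | cs.IR | I.2

Riya Naik, Ashwin Srinivasan, Swati Agarwal, Estrid He

TL;DR: The paper proposes an agent-based approach for detecting and resolving incompleteness and ambiguity in interactions with large language models (LLMs), improving the quality of question-answering systems through multi-turn interaction.

Details

Motivation: Most user interactions with LLMs are single-turn, while multi-turn interactions can become tedious when they only serve to clarify context. The study uses an agent-based architecture to add reasoning capabilities so that deficiencies in the input question are identified and resolved automatically.

Result: Experiments show that the approach shortens interactions with the user, improves answer quality, and gives an explainable resolution of deficiencies. The drawback is a possible increase in LLM invocations and latency, but the benefits outweigh the costs overall, except when the question already has sufficient context.

Insight: Agent-based approaches can make LLMs more robust through multi-step reasoning, especially for complex or underspecified questions, but the extra resource cost must be weighed; they suit cases where context is lacking.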

Abstract: Many of us now treat LLMs as modern-day oracles, asking them almost any kind of question. However, consulting an LLM does not have to be a single-turn activity, and yet long multi-turn interactions can get tedious if they serve only to clarify contextual information that could be arrived at through reasoning. In this paper, we examine the use of an agent-based architecture to bolster LLM-based question-answering systems with additional reasoning capabilities. We examine the automatic resolution of potential incompleteness or ambiguities in questions by transducers implemented using LLM-based agents. We focus on several benchmark datasets that are known to contain questions with these deficiencies to varying degrees. We equip different LLMs (GPT-3.5-Turbo and Llama-4-Scout) with agents that act as specialists in detecting and resolving deficiencies of incompleteness and ambiguity. The agents are implemented as zero-shot ReAct agents. Rather than producing an answer in a single step, the model now decides among three actions: (a) classify, (b) resolve, and (c) answer. Action (a) decides if the question is incomplete, ambiguous, or normal. Action (b) determines if any deficiencies identified can be resolved. Action (c) answers the resolved form of the question. We compare the use of LLMs with and without the use of agents with these components. Our results show three benefits of agents with transducers: (1) shorter interactions with the human, (2) improved answer quality, and (3) explainable resolution of deficiencies in the question. On the negative side, we find that the approach may result in additional LLM invocations and, in some cases, increased latency. On the tested datasets, however, the benefits outweigh the costs except when questions already have sufficient context. This suggests that the agent-based approach could be a useful mechanism to harness the power of LLMs to develop more robust QA systems.
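The classify, resolve, answer sequence can be sketched as a small control loop. This is a simplification under stated assumptions: the paper's agents are zero-shot ReAct agents that choose their own actions, whereas the order here is fixed; the prompt prefixes are invented; and `fake_llm` is a deterministic stub standing in for a real model so that the example runs offline.

```python
from typing import Callable

def answer_with_agent(question: str, llm: Callable[[str], str]) -> dict:
    """Run classify -> (optional) resolve -> answer, recording each step for explainability."""
    trace = []
    label = llm(f"CLASSIFY: {question}").strip().lower()    # incomplete / ambiguous / normal
    trace.append(("classify", label))
    resolved = question
    if label in ("incomplete", "ambiguous"):
        resolved = llm(f"RESOLVE ({label}): {question}")    # rewrite the question
        trace.append(("resolve", resolved))
    answer = llm(f"ANSWER: {resolved}")
    trace.append(("answer", answer))
    return {"answer": answer, "trace": trace, "llm_calls": len(trace)}

def fake_llm(prompt: str) -> str:
    """Deterministic stub, NOT a real model: hard-coded replies for the demo."""
    if prompt.startswith("CLASSIFY"):
        return "ambiguous" if "there" in prompt else "normal"
    if prompt.startswith("RESOLVE"):
        return "What is the capital of France?"
    return "Paris" if "France" in prompt else "unknown"

vague = answer_with_agent("What is the capital there?", fake_llm)
clear = answer_with_agent("What is the capital of France?", fake_llm)
```

The `llm_calls` field makes the trade-off from the abstract visible: the deficient question costs three calls, while the already well-formed one costs two, one more than a plain single-step answer would.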

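[293] SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control cs.AI | cs.CL

Xingyang He, Xiao Ling, Jie Liu

TL;DR: SmartThinker is a two-stage learning framework that controls the length of individual reasoning steps to reduce redundant reasoning while maintaining or even improving reasoning performance.

Details

Motivation: Large reasoning models carry substantial redundancy and inefficiency in their reasoning, which wastes compute. Existing global length penalties cannot distinguish how much critical steps and simple steps should be compressed, so finer-grained control is needed.

Result: Across multiple reasoning benchmarks and backbone models, SmartThinker significantly reduces redundant reasoning while matching or exceeding existing methods.

Insight: Critical steps should be allotted more length rather than being compressed uniformly; this differentiated adjustment balances accuracy and efficiency more effectively.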

Abstract: Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such global length penalties often lead to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, a step-level generalized advantage estimation (S-GAE) and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.

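[294] MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction cs.AI | cs.CL | cs.MA

Kaleem Ullah Qasim, Jiashu Zhang

TL;DR: MARBLE is a multi-agent, rule-based LLM reasoning engine for accident severity prediction that splits the task across several agents and coordinates them through rule-based or LLM-guided consensus, substantially improving accuracy and interpretability.

Details

Motivation: Accident severity prediction is hard to do accurately because of incomplete data, strong feature dependencies, and class imbalance; existing approaches such as monolithic models or black-box prompting cope poorly with real-world noise and offer little interpretability.

Result: On UK and US datasets, MARBLE reaches nearly 90% accuracy, far above traditional machine learning models and prompt-based reasoning methods such as CoT and L2M, which stay below 48%.

Insight: MARBLE shows the potential of modular multi-agent reasoning in safety-critical tasks, offering a scalable and interpretable approach to reasoning under class imbalance and noise.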

Abstract: Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance in which rare but high-severity cases are underrepresented and hard to detect. Existing methods often rely on monolithic models or black-box prompting, which struggle to scale in noisy, real-world settings and offer limited interpretability. To address these challenges, we propose MARBLE, a multi-agent rule-based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents, including an interchangeable ML-backed agent. Each agent focuses on a semantic subset of features (e.g., spatial, environmental, temporal), enabling scoped reasoning and modular prompting without the risk of prompt saturation. Predictions are coordinated through either rule-based or LLM-guided consensus mechanisms that account for class rarity and confidence dynamics. The system retains structured traces of agent-level reasoning and coordination outcomes, supporting in-depth interpretability and post-hoc performance diagnostics. Across both UK and US datasets, MARBLE consistently outperforms traditional machine learning classifiers and state-of-the-art (SOTA) prompt-based reasoning methods including Chain-of-Thought (CoT), Least-to-Most (L2M), and Tree-of-Thought (ToT), achieving nearly 90% accuracy where others plateau below 48%. This performance redefines the practical ceiling for accident severity classification under real-world noise and extreme class imbalance. Our results position MARBLE as a generalizable and interpretable framework for reasoning under uncertainty in safety-critical applications.
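One simple way a rule-based consensus could "account for class rarity and confidence" is to weight each agent's vote by its confidence divided by the prior frequency of the predicted class, so that a confident vote for a rare class is not drowned out by votes for the majority class. The weighting rule, the class priors, and the agent votes below are all assumptions chosen for illustration, not MARBLE's actual mechanism.

```python
from collections import defaultdict

def weighted_consensus(votes, class_prior):
    """votes: list of (label, confidence). Rarer classes get a larger weight (1 / prior)."""
    scores = defaultdict(float)
    for label, conf in votes:
        scores[label] += conf / class_prior[label]
    winner = max(scores, key=scores.get)
    return winner, dict(scores)

# Hypothetical class frequencies and agent outputs (e.g. spatial, temporal, environmental agents).
prior = {"slight": 0.8, "serious": 0.18, "fatal": 0.02}
votes = [("slight", 0.9), ("slight", 0.7), ("serious", 0.6)]

label, scores = weighted_consensus(votes, prior)
# slight: 0.9/0.8 + 0.7/0.8 = 2.0 ; serious: 0.6/0.18 = 3.33..., so "serious" wins
```

A plain majority vote would have returned "slight" here; keeping the per-class `scores` alongside the winner is one way to retain the kind of coordination trace the abstract describes.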

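[295] MedGemma Technical Report cs.AI | cs.CL | cs.CV

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse

TL;DR: MedGemma is a collection of medical vision-language foundation models based on Gemma 3 that performs strongly on medical tasks, approaching task-specific models while retaining general capabilities.

Details

Motivation: Healthcare AI faces challenges from diverse data, complex tasks, and privacy requirements, so foundation models that perform well with less task-specific tuning are needed to speed up applications.

Result: On out-of-distribution tasks, MedGemma improves over the base models by 2.6-18.1%; after fine-tuning, errors in electronic health record information retrieval drop by 50%, and performance approaches that of specialized models.

Insight: MedGemma shows the potential of foundation models in medicine: they can be both general and specialized, which helps advance healthcare AI.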

Abstract: Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare’s diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

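[296] SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam? cs.AI | cs.CL

Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu

TL;DR: The paper introduces X-Master, a tool-augmented reasoning agent that strengthens reasoning through flexible code interaction and external tools, with the goal of accelerating scientific discovery. The X-Masters workflow scales its capabilities further and achieves a leading score of 32.1% on Humanity’s Last Exam (HLE).

Details

Motivation: Using AI agents to accelerate scientific discovery is a long-held goal, and HLE, a challenging benchmark for scientific AI agents, offers a yardstick for validating general-purpose agents.

Result: X-Masters scores 32.1% on HLE, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold.

Insight: Code interaction and tool augmentation are key to improving agent reasoning; the scattered-and-stacked workflow substantially extends the ability to solve complex tasks and provides experience for future model training.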

Abstract: The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity’s Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI’s and Google’s Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.

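[297] When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors cs.AI | cs.CL

Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan

TL;DR: The paper examines which property matters when chain-of-thought (CoT) monitoring is used as an AI safety defense, arguing for monitorability rather than faithfulness, and shows experimentally that CoT monitoring is promising for preventing severe harm but still needs active protection and stress-testing.

Details

Motivation: Prior work argues that CoT "unfaithfulness" makes monitoring unreliable, but for runtime monitoring to prevent severe harm the key property is monitorability, not faithfulness. The paper explores the feasibility and limits of CoT monitoring.

Result: Models are easier to monitor when the task requires complex reasoning. They can still evade the monitor, but only with significant help such as detailed human-written strategies or optimization against the monitor, which indicates that CoT monitoring is imperfect but effective.

Insight: CoT monitoring provides an important layer of defense against severe harm, but it needs continued improvement and stress-testing to handle deliberate evasion by models.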

Abstract: While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on “unfaithfulness” has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress-test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human-written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress-testing.

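[298] Exploring Object Status Recognition for Recipe Progress Tracking in Non-Visual Cooking cs.AI | cs.CV | cs.HC

Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington

TL;DR: OSCAR is a technical pipeline built on object status recognition that provides real-time recipe progress tracking for non-visual cooking. It combines recipe parsing, object status extraction, visual alignment, and time-causal modeling, and improves step prediction accuracy in real-world settings.

Details

Motivation: Cooking is vital to independent living for people with vision impairments, but limited support for progress tracking and contextual feedback is a major challenge. Object status recognition offers a possible solution.

Result: Evaluated on 173 instructional videos and 12 real-world cooking sessions, OSCAR improves step prediction accuracy and identifies factors that affect performance, such as implicit tasks, camera placement, and lighting.

Insight: Object status recognition clearly improves recipe progress tracking; future context-aware assistive cooking systems should account for the diversity and difficulty of real-world conditions.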

Abstract: Cooking plays a vital role in everyday independence and well-being, yet remains challenging for people with vision impairments due to limited support for tracking progress and receiving contextual feedback. Object status - the condition or transformation of ingredients and tools - offers a promising but underexplored foundation for context-aware cooking support. In this paper, we present OSCAR (Object Status Context Awareness for Recipes), a technical pipeline that explores the use of object status recognition to enable recipe progress tracking in non-visual cooking. OSCAR integrates recipe parsing, object status extraction, visual alignment with cooking steps, and time-causal modeling to support real-time step tracking. We evaluate OSCAR on 173 instructional videos and a real-world dataset of 12 non-visual cooking sessions recorded by BLV individuals in their homes. Our results show that object status consistently improves step prediction accuracy across vision-language models, and reveal key factors that impact performance in real-world conditions, such as implicit tasks, camera placement, and lighting. We contribute the pipeline of context-aware recipe progress tracking, an annotated real-world non-visual cooking dataset, and design insights to guide future context-aware assistive cooking systems.

[299] Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models cs.AI | cs.CV | 68T01PDF

Yifan Jiang, Yibo Xue, Yukun Kang, Pin Zheng, Jian Peng

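TL;DR: This paper releases the first public dataset for slide-animation comprehension and fine-tunes Qwen-2.5-VL-7B with LoRA, substantially improving results on slide-animation tasks.

Details

Motivation: Existing AI-driven slide-generation tools lack native animation support, and vision-language models perform poorly on animation tasks, mainly because public datasets are absent and temporal-reasoning capabilities are limited.

Result: On the manually curated test set, the LoRA model increases BLEU-4 by around 60% and ROUGE-L by 30%, and shows significant improvement in CODA-detail.

Insight: Low-rank adaptation (LoRA) effectively improves temporal reasoning and generalizes beyond synthetic data.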

Abstract: Slide animations, such as fade-ins, fly-ins, and wipes, are critical for audience engagement, efficient information delivery, and vivid visual expression. However, most AI-driven slide-generation tools still lack native animation support, and existing vision-language models (VLMs) struggle with animation tasks due to the absence of public datasets and limited temporal-reasoning capabilities. To address this gap, we release the first public dataset for slide-animation modeling: 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, collectively covering every built-in PowerPoint effect. Using this resource, we fine-tune Qwen-2.5-VL-7B with Low-Rank Adaptation (LoRA) and achieve consistent improvements over GPT-4.1 and Gemini-2.5-Pro in BLEU-4, ROUGE-L, SPICE, and our Coverage-Order-Detail Assessment (CODA) metric, which evaluates action coverage, temporal order, and detail fidelity. On a manually curated test set of slides, the LoRA model increases BLEU-4 by around 60%, ROUGE-L by 30%, and shows significant improvements in CODA-detail. This demonstrates that low-rank adaptation enables reliable temporal reasoning and generalization beyond synthetic data. Overall, our dataset, LoRA-enhanced model, and CODA metric provide a rigorous benchmark and foundation for future research on VLM-based dynamic slide generation.
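
The low-rank adaptation mentioned in the abstract can be illustrated with a small sketch. It shows only the generic LoRA update, where a frozen weight W is augmented by a scaled product of two thin matrices; the toy sizes, the `alpha` value, and the helper names are assumptions for illustration and do not reflect the configuration used with Qwen-2.5-VL-7B.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, lora_b, lora_a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; W itself stays frozen."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy sizes (assumed): a 6 x 8 layer adapted with rank 2.
d_out, d_in, rank, alpha = 6, 8, 2, 4
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: adaptation starts at the base model

full_params = d_out * d_in           # parameters in the frozen layer
lora_params = rank * (d_out + d_in)  # trainable parameters in A and B
```

With B initialized to zero the adapted layer reproduces the base layer exactly, and only `lora_params` values are trained instead of `full_params`.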

[300] Clustering via Self-Supervised Diffusion cs.AI | cs.CVPDF

Roy Uziel, Irit Chelly, Oren Freifeld, Ari Pakman

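TL;DR: This paper proposes CLUDI, a self-supervised clustering framework that combines diffusion models with Vision Transformer features and uses a teacher-student paradigm to cluster high-dimensional data robustly.

Details

Motivation: Diffusion models excel at generative tasks but have not yet been applied to clustering. The paper aims to use their stochasticity and generative power, together with pre-trained features, to cluster high-dimensional data.

Result: On multiple challenging datasets, CLUDI achieves state-of-the-art performance in unsupervised classification, demonstrating robustness and adaptability to complex data distributions.

Insight: The stochasticity of diffusion models can serve as an effective augmentation strategy for clustering and, combined with pre-trained features, can markedly improve unsupervised learning.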

Abstract: Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher-student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions.

[301] FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System cs.AI | cs.CVPDF

Toan Nguyen, Tri Le, Quang Nguyen, Anh Nguyen

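TL;DR: FurniMAS is a multi-agent system that automates furniture decoration, producing high-quality 3D decor through language guidance and agent collaboration.

Details

Motivation: Furniture decoration is time-consuming and requires specialized expertise. FurniMAS aims to automate the process with a multi-agent system and lower the barrier to entry.

Result: Experiments show that FurniMAS significantly outperforms baselines in generating high-quality 3D decor.

Insight: Multi-agent systems can collaborate effectively on creative tasks such as decoration, and language guidance can capture user preferences precisely.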

Abstract: Furniture decoration is an important task in various industrial applications. However, achieving a high-quality decorative result is often time-consuming and requires specialized artistic expertise. To tackle these challenges, we explore how multi-agent systems can assist in automating the decoration process. We propose FurniMAS, a multi-agent system for automatic furniture decoration. Specifically, given a human prompt and a household furniture item such as a working desk or a TV stand, our system suggests relevant assets with appropriate styles and materials, and arranges them on the item, ensuring the decorative result meets functionality, aesthetic, and ambiance preferences. FurniMAS assembles a hybrid team of LLM-based and non-LLM agents, each fulfilling distinct roles in a typical decoration project. These agents collaborate through communication, logical reasoning, and validation to transform the requirements into the final outcome. Extensive experiments demonstrate that our FurniMAS significantly outperforms other baselines in generating high-quality 3D decor.

[302] When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning cs.AI | cs.CVPDF

Maxence Boels, Harry Robertshaw, Alejandro Granados, Prokar Dasgupta, Sebastien Ourselin

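TL;DR: This paper presents the first comprehensive comparison of imitation learning (IL) and reinforcement learning (RL) for surgical action planning and finds that IL outperforms RL on expert-annotated test sets.

Details

Motivation: Surgical action planning requires predicting future instrument-verb-target triplets in real time. RL could discover better strategies through exploration, but IL, which learns from expert demonstrations, may hold the advantage.

Result: The IL baseline (DARIL) reaches 34.6% mAP on action triplet recognition and 33.6% mAP on next-frame prediction, while the RL variants perform worse; world model RL, for example, drops to 3.1% mAP at a 10-second horizon.

Insight: RL underperforms IL on expert-annotated test sets, likely because distribution matching favors IL policies, a finding with important implications for surgical AI development.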

Abstract: Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could potentially discover superior strategies through exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next frame prediction mAP with smooth planning degradation to 29.2% at 10-second horizons. We evaluated three RL variants: world model-based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL: world model RL dropped to 3.1% mAP at 10s, while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision making and provides crucial insights for surgical AI development.

cs.CR [Back]

[303] Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization cs.CR | cs.AI | cs.CLPDF

Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

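TL;DR: This paper studies how Group Relative Policy Optimization (GRPO) can improve the reasoning of large language models (LLMs) on vulnerability detection, and addresses their tendency to over-predict certain vulnerability types.

Details

Motivation: Current LLMs over-predict some vulnerability types and miss others, which limits their potential in AI-based security tools.

Result: Experiments indicate that GRPO improves on standard supervised fine-tuning (SFT) and strengthens the reasoning and generalization of LLMs.

Insight: RL-based training methods such as GRPO can improve both the performance and the reasoning abilities of LLMs on vulnerability detection.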

Abstract: Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an extensive study aimed at advancing recent RL-based finetuning techniques for LLMs in the context of vulnerability detection. We start by highlighting key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards. We enable its application to the vulnerability detection task by redefining its advantage functions and reward signals using annotations from widely used datasets in the field, including BigVul, DiverseVul, and CleanVul. The proposed methodology enables an extensive set of experiments, addressing multiple research questions regarding the impact of GRPO on generalization, reasoning capabilities, and performance improvements over standard supervised finetuning (SFT). Our findings offer valuable insights into the potential of RL-based training to enhance both the performance and reasoning abilities of LLMs in the context of software vulnerability detection.
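
As a rough illustration of the group-relative idea behind GRPO: several completions are sampled for each prompt, each is scored by a rule-based reward, and each completion's advantage is its reward standardized against the rest of the group. The sketch below uses the commonly published normalization, not the redefined advantage functions of this paper; `rule_reward` and the label strings are hypothetical.

```python
from statistics import mean, pstdev

def rule_reward(predicted: str, label: str) -> float:
    """Hypothetical rule-based reward: 1.0 for a correct verdict, else 0.0."""
    return 1.0 if predicted == label else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled verdicts for one code snippet whose (assumed) label is "vulnerable".
samples = ["vulnerable", "safe", "vulnerable", "safe"]
rewards = [rule_reward(s, "vulnerable") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct verdicts receive positive advantages and incorrect ones negative, and a group in which every completion earns the same reward yields zero advantages, so it contributes no gradient signal.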

[304] README: Robust Error-Aware Digital Signature Framework via Deep Watermarking Model cs.CR | cs.CVPDF

Hyunwook Choi, Sangyun Won, Daeyeon Hwang, Junhyeok Choi

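TL;DR: README is a robust, error-aware digital signature framework that uses deep watermarking models to embed high-capacity, error-tolerant signatures in images, addressing the low embedding capacity and bit errors that limit existing methods.

Details

Motivation: Existing deep learning-based watermarking methods fall short on high-capacity embedding and bit-error correction, so they cannot serve cryptographic applications such as digital signatures, which require over 2048 bits of error-free data.

Result: When embedding 2048-bit digital signatures, the zero-bit-error image rate rises from 1.2% to 86.3%, even under real-world distortions.

Insight: The framework opens new applications for deep watermarking in cryptographic security, bridging the gap between signal-level watermarking and cryptographic security.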

Abstract: Deep learning-based watermarking has emerged as a promising solution for robust image authentication and protection. However, existing models are limited by low embedding capacity and vulnerability to bit-level errors, making them unsuitable for cryptographic applications such as digital signatures, which require over 2048 bits of error-free data. In this paper, we propose README (Robust Error-Aware Digital Signature via Deep WaterMarking ModEl), a novel framework that enables robust, verifiable, and error-tolerant digital signatures within images. Our method combines a simple yet effective cropping-based capacity scaling mechanism with ERPA (ERror PAinting Module), a lightweight error correction module designed to localize and correct bit errors using Distinct Circular Subsum Sequences (DCSS). Without requiring any fine-tuning of existing pretrained watermarking models, README significantly boosts the zero-bit-error image rate (Z.B.I.R) from 1.2% to 86.3% when embedding 2048-bit digital signatures into a single image, even under real-world distortions. Moreover, our use of perceptual hash-based signature verification ensures public verifiability and robustness against tampering. The proposed framework unlocks a new class of high-assurance applications for deep watermarking, bridging the gap between signal-level watermarking and cryptographic security.

cs.LG [Back]

[305] Large Language Model Agent for Modular Task Execution in Drug Discovery cs.LG | cs.CL | q-bio.BMPDF

Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar

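TL;DR: The paper presents a modular framework powered by large language models (LLMs) that combines LLM reasoning with domain-specific tools to automate and streamline key tasks in early-stage computational drug discovery, making molecular screening and refinement more efficient.

Details

Motivation: Computational tasks in drug discovery are complex and time-consuming, and traditional approaches are inefficient and inflexible. LLMs can integrate multiple domain tools to automate drug discovery and improve efficiency and accuracy.

Result: In the case study, the number of molecules meeting drug-likeness criteria rose after refinement (for example, molecules with QED > 0.6 increased from 34 to 55), supporting the framework's usefulness for molecular screening and refinement.

Insight: The modular design allows flexible integration of new tools and provides a scalable foundation for AI-assisted drug discovery.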

Abstract: We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information (including FASTA sequences, SMILES representations, and literature) and answered mechanistic questions with improved contextual accuracy over standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55, and those passing at least four out of five empirical drug-likeness rules rose from 29 to 52, within a pool of 194 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.

[306] ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems cs.LG | cs.CLPDF

Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang

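TL;DR: This paper introduces ABench-Physics, a benchmark designed to evaluate the physical reasoning and generalization of large language models (LLMs). It contains a static and a dynamic problem set and reveals significant limitations in LLMs’ physical reasoning.

Details

Motivation: LLMs perform well in domains such as mathematics and programming, but their performance in physics remains underexplored, especially on complex problems that demand precise computation, conceptual understanding, and physical modeling. Existing benchmarks cannot fully test these abilities because of limited difficulty, multiple-choice formats, and static evaluation.

Result: An evaluation of several state-of-the-art LLMs reveals substantial performance gaps in physical reasoning, especially in generalization to dynamic variants, pointing to limitations in the physical modeling and reasoning of current models.

Insight: Physics problems require more complex reasoning and modeling, and current LLMs perform poorly on them, particularly when generalizing under dynamic conditions. ABench-Physics provides a diagnostic tool for future research and improvement.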

Abstract: Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings that fail to capture physical modeling ability. In this paper, we introduce ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs’ physical reasoning and generalization capabilities. ABench-Physics consists of two components: Phy_A, a static set of 400 graduate- or Olympiad-level problems; and Phy_B, a dynamic subset of 100 problems equipped with an automatic variation engine to test model robustness across changing conditions. All questions require precise numerical answers, with strict formatting and tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals substantial performance gaps, highlighting persistent limitations in physical reasoning, especially in generalization to dynamic variants. ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs.
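
The two mechanisms described above, tolerance-based grading of numerical answers and automatic variation of problem conditions, can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the 1% relative tolerance, the free-fall template, and the function names are all assumptions.

```python
import math
import random

def is_correct(answer: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Accept an answer within a relative tolerance of the reference (1% is an assumed value)."""
    return math.isclose(answer, reference, rel_tol=rel_tol)

def free_fall_variant(rng: random.Random):
    """Hypothetical problem template: fall time from a randomly drawn height (g = 9.81 m/s^2)."""
    h = round(rng.uniform(1.0, 100.0), 2)
    question = f"An object is dropped from rest at a height of {h} m. How long does it take to land?"
    reference = math.sqrt(2.0 * h / 9.81)
    return question, reference

# Generate three variants of the same problem with different conditions.
rng = random.Random(0)
variants = [free_fall_variant(rng) for _ in range(3)]
```

A model that memorized one instance of the problem would fail the regenerated variants, which is the kind of robustness a dynamic subset is meant to probe.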

[307] Critiques of World Models cs.LG | cs.AI | cs.CL | cs.CV | cs.ROPDF

Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu

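TL;DR: This essay critiques several existing views of world models and proposes a new architecture based on hierarchical, multi-level, and mixed continuous/discrete representations, trained with a generative, self-supervised learning framework, with the goal of more general world modeling and, ultimately, a Physical, Agentic, and Nested (PAN) AGI system.

Details

Motivation: World models, as algorithmic surrogates for the real environment that biological agents interact with, have become a prominent research topic in recent years. Yet their definition, construction, use, and evaluation remain contested. The essay critiques existing views in order to propose a better-founded world model architecture.

Result: No experimental results are reported; the theoretical framework offers a new direction for developing future AGI systems.

Insight: The core goal of a world model should be to simulate all actionable possibilities of the real world for purposeful reasoning and acting, and multi-level, mixed representations are key to achieving it.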

Abstract: The world model, the supposed algorithmic surrogate of the real-world environment that biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

[308] Regulation Compliant AI for Fusion: Real-Time Image Analysis-Based Control of Divertor Detachment in Tokamaks cs.LG | cs.CV | cs.SY | eess.SY | physics.plasm-phPDF

Nathaniel Chen, Cheolsik Byun, Azarakash Jalalvand, Sangkyeun Kim, Andrew Rothstein

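TL;DR: This paper presents a real-time, interpretable AI control system for divertor detachment control in tokamaks. It uses image analysis to meet regulatory requirements and achieves a mean absolute difference of 2% from the target in experiments.

Details

Motivation: AI shows promise for fusion control, but its black-box nature makes compliant deployment in regulated environments difficult. The study aims to develop an interpretable, real-time AI control system that addresses this problem.

Result: In experiments, the system holds a mean absolute difference of 2% from the target for both detachment and reattachment control, indicating accurate and stable control.

Insight: The work demonstrates the potential of AI in fusion control while underscoring the importance of interpretability and regulatory compliance, and it offers an approach to AI control for future fusion reactors.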

Abstract: While artificial intelligence (AI) has been promising for fusion control, its inherent black-box nature will make compliant implementation in regulatory environments a challenge. This study implements and validates a real-time, AI-enabled, linear and interpretable control system for divertor detachment control with the DIII-D lower divertor camera. Using D2 gas, we demonstrate feedback divertor detachment control with a mean absolute difference of 2% from the target for both detachment and reattachment. This automatic training and linear processing framework can be extended to any image-based diagnostic to provide the regulation-compliant controllers needed for future fusion reactors.

[309] Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting cs.LG | cs.AI | cs.CVPDF

Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An

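TL;DR: This paper proposes SDKD, a lightweight spatiotemporal forecasting framework. Its frequency-aligned knowledge distillation strategy uses multi-scale spectral features extracted from a complex teacher model to guide the student model, significantly improving performance while reducing computational complexity.

Details

Motivation: Spatiotemporal forecasting tasks such as traffic flow, combustion dynamics, and weather forecasting usually require complex models, which leads to low training efficiency and high memory consumption. The paper proposes a lightweight framework to address these problems.

Result: On the Navier-Stokes equation dataset, SDKD reduces MSE by up to 81.3% and MAE by up to 52.3% while lowering computational complexity.

Insight: Combining spectral decomposition with knowledge distillation lets a lightweight model capture both high-frequency details and long-term trends, giving an efficient solution for spatiotemporal forecasting.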

Abstract: Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation (termed SDKD), which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details and low-frequency trends using convolution and a Transformer (the global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher’s latent space, including both high- and low-frequency components, to guide the lightweight student model in capturing both local fine-grained variations and global evolution patterns. Experimental results show that SDKD significantly improves performance, achieving reductions of up to 81.3% in MSE and 52.3% in MAE on the Navier-Stokes equation dataset. The framework effectively captures both high-frequency variations and long-term trends while reducing computational complexity. Our code is available at https://github.com/itsnotacie/SDKD
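
To make the idea of frequency-aligned distillation concrete, here is a toy sketch on 1-D feature vectors: both teacher and student features are transformed with a discrete Fourier transform, split into a low and a high band, and matched band by band. The abstract does not give the actual loss, so the cutoff index, the band weights, and all function names are assumptions; the paper's method operates on multi-scale latent features rather than 1-D lists.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real 1-D sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def band_split(spectrum, cutoff):
    """Split DFT coefficients into low and high bands at an assumed cutoff index."""
    n = len(spectrum)
    low, high = [], []
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # fold the mirrored (negative) frequencies
        (low if freq <= cutoff else high).append(coeff)
    return low, high

def spectral_mse(a, b):
    """Mean squared magnitude of the difference between two coefficient lists."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b)) / max(len(a), 1)

def frequency_aligned_loss(teacher_feat, student_feat, cutoff=1, w_low=1.0, w_high=1.0):
    """Weighted sum of per-band spectral errors between teacher and student features."""
    t_low, t_high = band_split(dft(teacher_feat), cutoff)
    s_low, s_high = band_split(dft(student_feat), cutoff)
    return w_low * spectral_mse(t_low, s_low) + w_high * spectral_mse(t_high, s_high)

teacher = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
student = [x + 1.0 for x in teacher]  # same oscillation, wrong overall level
loss = frequency_aligned_loss(teacher, student)
```

In this example the student reproduces the oscillation but not the mean level, so the whole error falls in the low band; separate band weights let a training objective decide how strongly to penalize each kind of mismatch.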

[310] MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization cs.LG | cs.CVPDF

Huihui Xu, Yuanpeng Nie, Hualiang Wang, Ying Chen, Wei Li

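TL;DR: This paper proposes Spatial-Semantic Rewarded Group Relative Policy Optimization (GRPO) for Medical Image Grounding (MIG). The method needs no expensive Chain-of-Thought annotations and improves model performance by introducing Spatial-Semantic Rewards and a Chain-of-Box template.

Details

Motivation: Existing medical image grounding models rely on large amounts of annotated Chain-of-Thought data, which is expensive and time-consuming to acquire. Inspired by DeepSeek-R1, the paper explores using GRPO to give vision-language models (VLMs) reasoning abilities without such annotations.

Result: The method achieves state-of-the-art performance on the MS-CXR, ChestX-ray8, and M3D-RefSeg datasets, and ablation studies validate the effectiveness of each component.

Insight: 1. High-performing MIG models can be trained without expensive annotations; 2. Spatial-Semantic Rewards and the Chain-of-Box template give vision-language models finer-grained feedback and explicit reasoning ability.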

Abstract: Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine spatial accuracy reward and semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose to use the Chain-of-Box template, which integrates visual information of referring bounding boxes into the reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets MS-CXR, ChestX-ray8, and M3D-RefSeg demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component in our approach. Code, checkpoints, and datasets are available at https://github.com/bio-mlhui/MedGround-R1
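
A reward that combines spatial accuracy with semantic consistency could look like the sketch below. The abstract does not specify the reward formulas, so the use of intersection-over-union for the spatial term, the linear blend, the `alpha` weight, and the externally supplied `semantic_score` are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def spatial_semantic_reward(pred_box, gt_box, semantic_score, alpha=0.5):
    """Hypothetical blend of a spatial term (IoU) and a semantic term in [0, 1]."""
    return alpha * iou(pred_box, gt_box) + (1.0 - alpha) * semantic_score

# A predicted box that partly overlaps the ground truth, with an assumed semantic score.
reward = spatial_semantic_reward((0, 0, 2, 2), (1, 1, 3, 3), semantic_score=0.8)
```

Unlike a binary hit-or-miss reward, a graded spatial term still separates a near miss from a completely wrong box, which is one way to give feedback on spatially negative completions.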

[311] What to Do Next? Memorizing skills from Egocentric Instructional Video cs.LG | cs.AI | cs.CVPDF

Jing Bi, Chenliang Xu

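TL;DR: This paper introduces a new task of learning high-level goal-oriented actions from an egocentric perspective, along with a method that combines topological affordance memory with a transformer to make action planning more robust and accurate.

Details

Motivation: Learning to perform activities from observation requires extracting meaningful information about the environment. The work studies how to plan high-level goal-oriented actions from an egocentric perspective, in order to handle the challenges of practical scenarios.

Result: Evaluated in an interactive simulation environment, the method learns meaningful representations, improves performance, and remains robust when action deviations occur.

Insight: Memorizing the structure of the environment by extracting affordances is key to planning goal-oriented actions and detecting deviations.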

Abstract: Learning to perform activities through demonstration requires extracting meaningful information about the environment from observations. In this research, we investigate the challenge of planning high-level goal-oriented actions in a simulation setting from an egocentric perspective. We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with a transformer architecture. Memorizing the environment’s structure by extracting affordances facilitates selecting appropriate actions based on the context. Moreover, the memory model allows us to detect action deviations while accomplishing specific objectives. To assess the method’s versatility, we evaluate it in a realistic interactive simulation environment. Our experimental results demonstrate that the proposed approach learns meaningful representations, resulting in improved performance and robustness when action deviations occur.

[312] Adopting a human developmental visual diet yields robust, shape-based AI vision cs.LG | cs.CVPDF

Zejin Lu, Sushrut Thorat, Radoslaw M Cichy, Tim C Kietzmann

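TL;DR: The paper proposes training AI vision systems in a way that mimics human visual development, which markedly improves shape recognition, robustness, and resilience to adversarial attacks.

Details

Motivation: Current AI vision systems differ markedly from human vision: they rely on texture features rather than shape information and are fragile under image distortions and adversarial attacks. The paper addresses this gap by emulating the human visual developmental process.

Result: Models trained with the developmental visual diet (DVD) perform strongly on shape recognition, robustness to image distortions, and resilience to adversarial attacks, outperforming conventionally trained large-scale models.

Insight: Improving how a model learns, rather than simply adding more data, can yield more efficient and more human-like AI systems, pointing to a new direction for resource-efficient robust vision.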

Abstract: Despite years of research and the dramatic scaling of artificial intelligence (AI) systems, a striking misalignment between artificial and human vision persists. Unlike humans, AI relies heavily on texture features rather than shape information, lacks robustness to image distortions, remains highly vulnerable to adversarial attacks, and struggles to recognise simple abstract shapes within complex backgrounds. To close this gap, we introduce a solution that arises from a previously underexplored direction: rather than scaling up, we take inspiration from how human vision develops from early infancy into adulthood. We quantified the visual maturation by synthesising decades of psychophysical and neurophysiological research into a novel developmental visual diet (DVD) for AI vision. We show that guiding AI systems through this human-inspired curriculum produces models that closely align with human behaviour on every hallmark of robust vision tested, yielding the strongest reported reliance on shape information to date, abstract shape recognition beyond the state of the art, higher robustness to image corruptions, and stronger resilience to adversarial attacks. By outperforming high-parameter AI foundation models trained on orders of magnitude more data, we provide evidence that robust AI vision can be achieved by guiding how a model learns, not merely how much it learns, offering a resource-efficient route toward safer and more human-like artificial visual systems.

Shubin Ma, Liang Zhao, Mingdong Lu, Yifan Guo, Bo Xu

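TL;DR: The paper proposes CAPIMAC, a padding method for imbalanced and misaligned multimodal data, which improves data fusion quality through a self-repellent greedy anchor search module (SRGASM) and a consistency-aware padding module (CAPM).

Details

Motivation: Existing methods do not effectively handle multimodal data that are both imbalanced and misaligned. They rely only on class-level alignment, which leaves data samples poorly matched and degrades fusion.

Result: Experiments show that the method performs strongly on benchmark datasets.

Insight: Combining anchor search with consistency-aware padding improves the quality of multimodal data fusion and offers a new approach to handling incomplete data in practical scenarios.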

Abstract: Multimodal representations describe the characteristics of real-world data samples faithfully and effectively by capturing their complementary information. However, collected data are often incomplete and misaligned due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the filling of missing data in scenarios where multiview data are both imbalanced and misaligned. Instead, it relies on class-level alignment of the available data, which leaves some data samples poorly matched and degrades the quality of data fusion. In this paper, we propose Consistency-Aware Padding for Incomplete Multimodal Alignment Clustering Based on Self-Repellent Greedy Anchor Search (CAPIMAC) to tackle the problem of filling imbalanced and misaligned data in multimodal datasets. Specifically, we propose a self-repellent greedy anchor search module (SRGASM), which employs a self-repellent random walk combined with a greedy algorithm to identify anchor points for re-representing incomplete and misaligned multimodal data. Subsequently, based on noise-contrastive learning, we design a consistency-aware padding module (CAPM) to effectively interpolate and align imbalanced and misaligned data, thereby improving the quality of multimodal data fusion. Experimental results on benchmark datasets demonstrate the superiority of our method. The code will be publicly released at https://github.com/Autism-mm/CAPIMAC.git.

[314] Accurate and Efficient World Modeling with Masked Latent Transformers cs.LG | cs.AI | cs.CVPDF

Maxime Burchi, Radu Timofte

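TL;DR: EMERALD is an efficient and accurate world modeling method that uses MaskGIT predictions to generate trajectories in latent space, substantially improving agent performance and setting a new record on the Crafter benchmark.

Details

Motivation: The latent-space compression of existing world models such as Dreamer can lose crucial information, while methods such as Δ-IRIS and DIAMOND improve accuracy at the cost of efficiency. EMERALD aims to resolve this trade-off.

Result: EMERALD is the first method to surpass human expert performance within 10M environment steps, and it unlocks all 22 Crafter achievements.

Insight: Accurate modeling in latent space, for example with MaskGIT, can markedly improve world model performance while preserving training efficiency, suggesting a new approach to environment modeling in reinforcement learning.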

Abstract: The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model’s latent space can result in the loss of crucial information, negatively affecting the agent’s performance. Recent approaches, such as $\Delta$-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve the agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human expert performance within 10M environment steps. Our method also succeeds in unlocking all 22 Crafter achievements at least once during evaluation.

[315] When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping Out-of-Distribution Trap is All You Need cs.LG | cs.AI | cs.CR | cs.CVPDF

Ziming Hong, Runnan Chen, Zengmao Wang, Bo Han, Bo Du

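TL;DR: This paper presents the first study of data-free knowledge distillation (DFKD) from non-transferable learning (NTL) teachers and proposes Adversarial Trap Escaping (ATEsc), which improves DFKD by identifying and filtering out OOD-like synthetic samples.

Details

Motivation: Conventional DFKD methods assume that the teacher is trustworthy. In practice the teacher may be non-transferable (transfer from ID to OOD is blocked), which misleads the generator. The paper addresses this unexplored robustness and security problem.

Result: Experiments show that ATEsc significantly improves DFKD with NTL teachers.

Insight: The stronger adversarial robustness of NTL teachers on OOD samples is the key feature for escaping the ID/OOD trap in DFKD, and exploiting it enables effective knowledge transfer.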

Abstract: Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. A common solution is to use a generator to synthesize fake data and use them as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where the transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD by diverting the generator’s attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer but prioritizes OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is seen as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc for improving DFKD against NTL teachers. Code is released at https://github.com/tmllab/2025_ICML_ATEsc.
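
The robustness-based split described in the abstract can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: the teacher is a hypothetical 1-D threshold classifier, the "attack" is a fixed shift of plus or minus `eps` rather than a real adversarial perturbation on images, and the threshold of 1.0 is an assumption.

```python
def toy_teacher(x: float) -> int:
    """Hypothetical 1-D teacher: predicts class 1 for positive inputs, else class 0."""
    return int(x > 0)

def robustness(teacher, x: float, eps: float) -> float:
    """Fraction of +/- eps perturbations that leave the teacher's prediction unchanged."""
    base = teacher(x)
    perturbed = [x - eps, x + eps]
    return sum(teacher(p) == base for p in perturbed) / len(perturbed)

def split_by_robustness(teacher, samples, eps=0.5, threshold=1.0):
    """Fragile samples are treated as ID-like, robust samples as OOD-like."""
    fragile, robust = [], []
    for x in samples:
        (robust if robustness(teacher, x, eps) >= threshold else fragile).append(x)
    return fragile, robust

synthetic = [0.1, -0.05, 3.0, -4.0]
id_like, ood_like = split_by_robustness(toy_teacher, synthetic)
```

In the scheme the abstract describes, the fragile group would then feed ordinary knowledge distillation while the robust group would be used for forgetting OOD knowledge; neither training step is shown here.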

[316] An Explainable Transformer Model for Alzheimer’s Disease Detection Using Retinal Imaging cs.LG | cs.CVPDF

Saeed Jamshidiha, Alireza Rezaee, Farshid Hajati, Mojtaba Golzan, Raymond Chiong

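TL;DR: This paper proposes Retformer, a transformer-based model that detects Alzheimer’s disease (AD) from retinal images and uses explainability techniques to show the basis of its decisions.

Details

Motivation: Early diagnosis of Alzheimer’s disease is crucial for slowing its progression, yet effective treatment options are limited. Retinal imaging is a non-invasive method with diagnostic potential, and transformer models perform well in image analysis.

Result: Retformer outperforms other algorithms across several performance metrics, by margins of up to 11%. Its feature visualizations agree with AD-related retinal biomarkers reported in existing clinical studies.

Insight: Specific regions of retinal images may contain key biomarkers of AD, and combining transformer models with explainability techniques offers a new approach to early AD diagnosis.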

Abstract: Alzheimer’s disease (AD) is a neurodegenerative disorder that affects millions worldwide. In the absence of effective treatment options, early diagnosis is crucial for initiating management strategies to delay disease onset and slow down its progression. In this study, we propose Retformer, a novel transformer-based architecture for detecting AD using retinal imaging modalities, leveraging the power of transformers and explainable artificial intelligence. The Retformer model is trained on datasets of different modalities of retinal images from patients with AD and age-matched healthy controls, enabling it to learn complex patterns and relationships between image features and disease diagnosis. To provide insights into the decision-making process of our model, we employ the Gradient-weighted Class Activation Mapping algorithm to visualize the feature importance maps, highlighting the regions of the retinal images that contribute most significantly to the classification outcome. These findings are compared to existing clinical studies on detecting AD using retinal biomarkers, allowing us to identify the most important features for AD detection in each imaging modality. The Retformer model outperforms a variety of benchmark algorithms across different performance metrics by margins of up to 11%.

[317] DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation cs.LG | cs.CVPDF

Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang

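TL;DR: DANCE is a resource-efficient neural architecture search method that adapts architectures dynamically through continuous evolution and distribution learning, significantly reducing search cost while improving performance.

Details

Motivation: Existing neural architecture search (NAS) methods face limitations in real-world deployment, including a lack of adaptability across scenarios, the need for costly separate searches, and difficulty maintaining consistent performance.

Result: DANCE outperforms existing NAS methods on five datasets while significantly reducing search cost, and it maintains robust performance under varying computational constraints.

Insight: Through data awareness and continuous adaptation, DANCE offers a more flexible and efficient approach to NAS that suits diverse hardware requirements.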

Abstract: Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE’s effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at https://github.com/Applied-Machine-Learning-Lab/DANCE.

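[318] Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation cs.LG | cs.AI | cs.CV

Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu

TL;DR: This paper proposes SEED, a self-evolving distillation method that identifies, isolates, and purges hallucinations in LVLMs and distills the purified knowledge back into the model, significantly improving reliability.

Details

Motivation: Large vision-language models (LVLMs) perform well in areas such as multimedia, but hallucinations severely limit their credibility and application potential. Existing methods typically rely on external tools or on comparing multiple rounds of inference, which increases inference time.

Result: Experiments show that SEED significantly improves the reliability of LVLMs; for example, the POPE-Random F1 score of LLaVA-1.5 rises from 81.3 to 88.3.

Insight: Combining self-purification with distillation addresses hallucinations more efficiently, and distillation that captures the dominant modes avoids chaotic outputs. The adapter design offers a new approach to model correction.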

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose SElf-Evolving Distillation (SEED), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.
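The contrast between conventional and mode-seeking distillation can be illustrated on a toy distribution. The abstract does not state which objective SEED optimizes; the sketch below assumes the common textbook pairing, in which forward KL (teacher to student) is mass-covering and reverse KL (student to teacher) is mode-seeking. The distributions and names are made up for illustration.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical teacher with two dominant modes and a near-empty middle region.
teacher = [0.495, 0.01, 0.495]
# Student A commits to one dominant mode.
student_mode = [0.98, 0.01, 0.01]
# Student B spreads mass everywhere, including the near-empty region.
student_spread = [0.34, 0.32, 0.34]

forward_mode = kl(teacher, student_mode)      # KL(teacher || student)
forward_spread = kl(teacher, student_spread)
reverse_mode = kl(student_mode, teacher)      # KL(student || teacher)
reverse_spread = kl(student_spread, teacher)
```

Under these assumptions, forward KL prefers the spread-out student, which places substantial mass where the teacher has almost none, while reverse KL prefers the student that locks onto a dominant mode.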

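[319] ConBatch-BAL: Batch Bayesian Active Learning under Budget Constraints cs.LG | cs.CV

Pablo G. Morato, Charalampos P. Andriotis, Seyran Khademi

TL;DR: The paper proposes two batch Bayesian active learning strategies under budget constraints (ConBatch-BAL), which select samples via dynamic thresholding and greedy acquisition and significantly reduce annotation cost and the number of iterations.

Details

Motivation: In real-world applications, varying annotation costs across data points and budget limits hinder the adoption of active learning strategies, so efficient methods are needed to optimize resource allocation.

Result: On real-world datasets, the ConBatch-BAL strategies significantly reduce cost and even outperform the unconstrained baselines.

Insight: Active learning under budget constraints must balance uncertainty against cost; the dynamic thresholding approach is the more adaptable of the two.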

Abstract: Varying annotation costs among data points and budget constraints can hinder the adoption of active learning strategies in real-world applications. This work introduces two Bayesian active learning strategies for batch acquisition under constraints (ConBatch-BAL), one based on dynamic thresholding and one following greedy acquisition. Both select samples using uncertainty metrics computed via Bayesian neural networks. The dynamic thresholding strategy redistributes the budget across the batch, while the greedy one selects the top-ranked sample at each step, limited by the remaining budget. Focusing on scenarios with costly data annotation and geospatial constraints, we also release two new real-world datasets containing geolocated aerial images of buildings, annotated with energy efficiency or typology classes. The ConBatch-BAL strategies are benchmarked against a random acquisition baseline on these datasets under various budget and cost scenarios. The results show that the developed ConBatch-BAL strategies can reduce active learning iterations and data acquisition costs in real-world settings, and even outperform the unconstrained baseline solutions.
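The greedy strategy described in the abstract (take the top-ranked sample at each step, limited by the remaining budget) can be sketched as follows. This is one plausible reading, not the authors' implementation: the uncertainty scores are assumed to be precomputed (for example by a Bayesian neural network), and the function and parameter names are hypothetical.

```python
def greedy_batch(uncertainty, cost, budget, batch_size):
    """Pick up to batch_size samples, most uncertain first, within budget.

    uncertainty[i] and cost[i] describe candidate i; both are assumed inputs.
    Returns the chosen indices and the budget left over.
    """
    remaining = budget
    chosen = []
    candidates = list(range(len(uncertainty)))
    while candidates and len(chosen) < batch_size:
        # Only samples that still fit in the remaining budget are eligible.
        affordable = [i for i in candidates if cost[i] <= remaining]
        if not affordable:
            break
        best = max(affordable, key=lambda i: uncertainty[i])
        chosen.append(best)
        remaining -= cost[best]
        candidates.remove(best)
    return chosen, remaining
```

For example, with uncertainties `[0.9, 0.8, 0.5, 0.4]`, costs `[5, 3, 2, 1]`, a budget of 6, and a batch size of 3, the sketch selects sample 0 first, then can only afford sample 3, and stops with the budget exhausted.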

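math.LO

[320] Interleaving Logic and Counting math.LO | cs.CL | cs.LO | 03B70, 03B65, 03B45

Johan van Benthem, Thomas Icard

TL;DR: This paper explores how logic and counting (arithmetic) combine in natural language. It presents a small fragment that can represent numerical syllogisms and basic size comparisons, namely monadic first-order logic with counting, studies a series of strengthenings, and discusses their relation to natural language.

Details

Motivation: Quantifier expressions in natural language combine logical and arithmetical features, transcending the strict divide between qualitative and quantitative. The study aims to understand how this combination appears in language and in elementary mathematics.

Result: 1. Determines which arithmetical notions are definable in monadic first-order logic with counting; 2. Shows that the monadic second-order version is closely connected to additive Presburger arithmetic; 3. Shows that tuple counting introduces undecidability.

Insight: 1. The combination of logic and counting is pervasive in natural language; 2. Formal systems can capture quantificational reasoning patterns found in natural language; 3. The study provides formal tools for research on quantitative reasoning in cognitive science.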

Abstract: Reasoning with quantifier expressions in natural language combines logical and arithmetical features, transcending strict divides between qualitative and quantitative. Our topic is this cooperation of styles as it occurs in common linguistic usage and its extension into the broader practice of natural language plus “grassroots mathematics”. We begin with a brief review of first-order logic with counting operators and cardinality comparisons. This system is known to be of high complexity, and drowns out finer aspects of the combination of logic and counting. We move to a small fragment that can represent numerical syllogisms and basic reasoning about comparative size: monadic first-order logic with counting. We provide normal forms that allow for axiomatization, determine which arithmetical notions can be defined on finite and on infinite models, and conversely, we discuss which logical notions can be defined out of purely arithmetical ones, and what sort of (non-)classical logics can be induced. Next, we investigate a series of strengthenings, again using normal form methods. The monadic second-order version is close, in a precise sense, to additive Presburger Arithmetic, while versions with the natural device of tuple counting take us to Diophantine equations, making the logic undecidable. We also define a system that combines basic modal logic over binary accessibility relations with counting, needed to formulate ubiquitous reasoning patterns such as the Pigeonhole Principle. We return to our starting point in natural language, confronting the architecture of our formal systems with linguistic quantifier vocabulary and syntax. We conclude with some general thoughts on yet further entanglements of logic and counting in formal systems, on rethinking the qualitative/quantitative divide, and on connecting our analysis to empirical findings in cognitive science.

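cs.GR

Xinyang Li, Gen Li, Zhihui Lin, Yichen Qian, GongXin Yao

TL;DR: MoDA proposes a multi-modal diffusion architecture that improves the efficiency, realism, and expressiveness of talking head generation. It simplifies diffusion learning through a joint parameter space and flow matching, and uses multi-modal interaction to enhance facial expressions.

Details

Motivation: Existing diffusion-based talking head generation methods suffer from inefficient inference, visual artifacts, and unrealistic facial expressions and head movements caused by insufficient multi-modal information interaction.

Result: Experiments show that MoDA significantly improves video diversity, realism, and efficiency, making it suitable for real-world applications.

Insight: A joint parameter space combined with flow matching effectively simplifies the learning process of diffusion models, and multi-modal interaction is essential for improving the expressiveness of generated talking heads.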

Abstract: Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of digital humans and the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field thanks to their strong generation and generalization capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts, which arise because the implicit latent space of Variational Auto-Encoders (VAE) complicates the diffusion process; 2) a lack of authentic facial expressions and head movements, resulting from insufficient multi-modal information interaction. In this paper, MoDA handles these challenges by 1) defining a joint parameter space to bridge motion generation and neural rendering, and leveraging flow matching to simplify the diffusion learning process; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, ultimately enhancing overall facial expressiveness. Subsequently, a coarse-to-fine fusion strategy is adopted to progressively integrate different modalities, ensuring effective integration across feature spaces. Experimental results demonstrate that MoDA significantly improves video diversity, realism, and efficiency, making it suitable for real-world applications.
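The abstract names flow matching without giving its exact form. As a rough illustration only, the sketch below uses the widely used linear-interpolation variant: a training point is placed on the straight line between a noise sample and a data sample, and the regression target is the constant velocity along that line. Whether MoDA uses this particular path is an assumption, and the function name is hypothetical.

```python
def flow_matching_pair(noise, data, t):
    """Build one flow-matching training pair on a straight noise-to-data path.

    noise, data: equal-length lists of floats (e.g. flattened motion parameters).
    t: time in [0, 1]; t = 0 gives pure noise, t = 1 gives the data sample.
    Returns the interpolated point and the velocity a network would regress to.
    """
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity
```

A model trained this way predicts the velocity from `x_t`, `t`, and the conditioning signals (audio and auxiliary conditions in the abstract's terms), and sampling integrates that velocity from noise to data.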

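[322] 3D PixBrush: Image-Guided Local Texture Synthesis cs.GR | cs.CV

Dale Decatur, Itai Lang, Kfir Aberman, Rana Hanocka

TL;DR: 3D PixBrush is a method for image-driven local texture editing on 3D meshes that requires no user input. It predicts a localization mask and texture that are globally coherent and locally precise, faithfully reproducing the reference image.

Details

Motivation: Existing methods for local texture editing on 3D meshes require user input such as scribbles or bounding boxes, which limits their automation and practicality. 3D PixBrush aims to make local texture editing fully automatic.

Result: Experiments show that 3D PixBrush generates local textures and localization masks consistent with the reference image, without user intervention.

Insight: Precise local texture editing in 3D is achievable without user input, which improves automation and practicality and gives 3D content creation a new tool.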

Abstract: We present 3D PixBrush, a method for performing image-driven edits of local regions on 3D meshes. 3D PixBrush predicts a localization mask and a synthesized texture that faithfully portray the object in the reference image. Our predicted localizations are both globally coherent and locally precise. Globally - our method contextualizes the object in the reference image and automatically positions it onto the input mesh. Locally - our method produces masks that conform to the geometry of the reference image. Notably, our method does not require any user input (in the form of scribbles or bounding boxes) to achieve accurate localizations. Instead, our method predicts a localization mask on the 3D mesh from scratch. To achieve this, we propose a modification to the score distillation sampling technique which incorporates both the predicted localization and the reference image, referred to as localization-modulated image guidance. We demonstrate the effectiveness of our proposed technique on a wide variety of meshes and images.

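[323] F-Hash: Feature-Based Hash Design for Time-Varying Volume Visualization via Multi-Resolution Tesseract Encoding cs.GR | cs.CV

Jianxin Sun, David Lenz, Hongfeng Yu, Tom Peterka

TL;DR: F-Hash proposes a feature-based multi-resolution Tesseract encoding architecture that significantly speeds up training convergence when modeling time-varying volumetric data, achieving efficient encoding through collision-free hash functions and compact parameters.

Details

Motivation: Visualizing time-varying volumetric data is challenging because of complex spatiotemporal features and large datasets. Existing implicit neural representations (INR) train slowly and struggle to meet the demands of large-scale data.

Result: F-Hash achieves state-of-the-art convergence speed when training on a variety of time-varying volumetric datasets, and it also provides an efficient rendering optimization.

Insight: F-Hash offers a general encoding scheme for time-varying features that suits feature tracking and evolution visualization, providing a new approach to processing large-scale data efficiently.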

Abstract: Interactive time-varying volume visualization is challenging due to complex spatiotemporal features and the sheer size of the datasets. Recent works transform the original discrete time-varying volumetric data into continuous Implicit Neural Representations (INR) to address the issues of compression, rendering, and super-resolution in both spatial and temporal domains. However, training the INR takes a long time to converge, especially when handling large-scale time-varying volumetric datasets. In this work, we propose F-Hash, a novel feature-based multi-resolution Tesseract encoding architecture that greatly enhances convergence speed compared with existing input encoding methods for modeling time-varying volumetric data. The proposed design incorporates multi-level collision-free hash functions that map dynamic 4D multi-resolution embedding grids without bucket waste, achieving high encoding capacity with compact encoding parameters. Our encoding method is agnostic to time-varying feature detection methods, making it a unified encoding solution for feature tracking and evolution visualization. Experiments show that F-Hash achieves state-of-the-art convergence speed in training various time-varying volumetric datasets for diverse features. We also propose an adaptive ray marching algorithm to optimize the sample streaming for faster rendering of the time-varying neural representation.

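[324] Neuralocks: Real-Time Dynamic Neural Hair Simulation cs.GR | cs.CV

Gene Wei-Chin Lin, Egor Larionov, Hsiao-yu Chen, Doug Roble, Tuur Stuyck

TL;DR: This paper proposes Neuralocks, a new neural method for efficient and stable dynamic hair simulation. It overcomes the restriction of existing neural methods to quasi-static simulation and supports diverse hairstyles through self-supervised training and memory-efficient networks.

Details

Motivation: Real-time hair simulation is essential for the realism of virtual characters, but existing neural methods cannot capture dynamic behavior. The paper aims to remove this limitation and achieve efficient dynamic hair simulation.

Result: Experiments demonstrate the method’s effectiveness on a variety of hairstyles; it outperforms existing methods and is suitable for real-world applications.

Insight: Combining self-supervised training with efficient neural networks enables a significant advance in dynamic hair simulation and offers a new approach to realistic virtual characters.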

Abstract: Real-time hair simulation is a vital component in creating believable virtual avatars, as it provides a sense of immersion and authenticity. The dynamic behavior of hair, such as bouncing or swaying in response to character movements like jumping or walking, plays a significant role in enhancing the overall realism and engagement of virtual experiences. Current methods for simulating hair have been constrained by two primary approaches: highly optimized physics-based systems and neural methods. However, state-of-the-art neural techniques have been limited to quasi-static solutions, failing to capture the dynamic behavior of hair. This paper introduces a novel neural method that breaks through these limitations, achieving efficient and stable dynamic hair simulation while outperforming existing approaches. We propose a fully self-supervised method that can be trained without any manual intervention or artist-generated training data, allowing it to be integrated with hair reconstruction methods to enable automatic end-to-end avatar reconstruction. Our approach harnesses the power of compact, memory-efficient neural networks to simulate hair at the strand level, allowing for the simulation of diverse hairstyles without excessive computational resources or memory requirements. We validate the effectiveness of our method through a variety of hairstyle examples, showcasing its potential for real-world applications.

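physics.soc-ph

[325] Street design and driving behavior: evidence from a large-scale study in Milan, Amsterdam, and Dubai physics.soc-ph | cs.CV

Giacomo Orsi, Titus Venverloo, Andrea La Grotteria, Umberto Fugiglando, Fábio Duarte

TL;DR: This study uses computer vision and machine learning to analyze how urban street design affects drivers’ compliance with the 30 km/h speed limit. Speeds are lower in narrow, densely built environments and higher on roads with open views. The results give urban planners a practical tool.

Details

Motivation: Lowering urban speed limits to 30 km/h has had limited effect on its own, so it is necessary to study how street design influences driver behavior in order to improve compliance.

Result: Narrow streets and densely built environments are associated with lower speeds, while roads with open views lead to higher speeds. The analyses show that these findings are largely consistent across Milan, Amsterdam, and Dubai.

Insight: Street design has a significant effect on speed limit compliance, and urban planners can meet speed targets more effectively by adjusting street characteristics.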

Abstract: In recent years, cities have increasingly reduced speed limits from 50 km/h to 30 km/h to enhance road safety, reduce noise pollution, and promote sustainable modes of transportation. However, achieving compliance with these new limits remains a key challenge for urban planners. This study investigates drivers’ compliance with the 30 km/h speed limit in Milan and examines how street characteristics influence driving behavior. Our findings suggest that the mere introduction of lower speed limits is not sufficient to reduce driving speeds effectively, highlighting the need to understand how street design can improve speed limit adherence. To comprehend this relationship, we apply computer vision-based semantic segmentation models to Google Street View images. A large-scale analysis reveals that narrower streets and densely built environments are associated with lower speeds, whereas roads with greater visibility and larger sky views encourage faster driving. To evaluate the influence of the local context on speeding behaviour, we apply the developed methodological framework to two additional cities: Amsterdam, which, similar to Milan, is a historic European city not originally developed for cars, and Dubai, which instead has developed in recent decades with a more car-centric design. The results of the analyses largely confirm the findings obtained in Milan, which demonstrates the broad applicability of the road design guidelines for driver speed compliance identified in this paper. Finally, we develop a machine learning model to predict driving speeds based on street characteristics. We showcase the model’s predictive power by estimating the compliance with speed limits in Milan if the city were to adopt a 30 km/h speed limit city-wide. The tool provides actionable insights for urban planners, supporting the design of interventions to improve speed limit compliance.

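cs.DL

[326] An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand cs.DL | cs.CL | cs.CV

Joshua D. Isom

TL;DR: This article presents and validates a four-stage HTR-LLM workflow for high-accuracy transcription and analysis of medieval Latin legal documents. The workflow combines Handwritten Text Recognition (HTR) with large language models (LLMs), significantly lowering error rates and improving transcription quality.

Details

Motivation: Medieval Latin legal documents typically contain abbreviations and difficult handwriting, so transcribing and analyzing them is time-consuming and error-prone. The article therefore aims to develop an automated, high-accuracy method that simplifies this process.

Result: In detailed case studies, the method achieves a Word Error Rate (WER) of 2-7%, far better than traditional methods.

Insight: 1. A multi-stage approach combining HTR and LLMs is highly effective for complex handwritten text; 2. LLMs play a key role in data cleaning and post-processing.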

Abstract: This article presents and validates an ideal, four-stage workflow for the high-accuracy transcription and analysis of challenging medieval legal documents. The process begins with a specialized Handwritten Text Recognition (HTR) model, itself created using a novel “Clean Ground Truth” curation method where a Large Language Model (LLM) refines the training data. This HTR model provides a robust baseline transcription (Stage 1). In Stage 2, this baseline is fed, along with the original document image, to an LLM for multimodal post-correction, grounding the LLM’s analysis and improving accuracy. The corrected, abbreviated text is then expanded into full, scholarly Latin using a prompt-guided LLM (Stage 3). A final LLM pass performs Named-Entity Correction (NEC), regularizing proper nouns and generating plausible alternatives for ambiguous readings (Stage 4). We validate this workflow through detailed case studies, achieving Word Error Rates (WER) in the range of 2-7% against scholarly ground truths. The results demonstrate that this hybrid, multi-stage approach effectively automates the most laborious aspects of transcription while producing a high-quality, analyzable output, representing a powerful and practical solution for the current technological landscape.
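Word Error Rate, the metric reported above, is the word-level edit distance between a hypothesis and a reference transcription divided by the number of reference words. The sketch below shows the standard computation; the Latin sample strings in the usage note are invented for illustration and do not come from the article.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                            # deletion
                curr[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the made-up reference `et predictus Johannes venit in curia` with the hypothesis `et predictus Johanes venit curia` yields one substitution and one deletion over six reference words, a WER of about 33%.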

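q-bio.QM

[327] SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes q-bio.QM | cs.AI | cs.CV

Zhenglun Kong, Mufan Qiu, John Boesen, Xiang Lin, Sukwon Yun

TL;DR: SPATIA is a multimodal model that integrates cell morphology, gene expression, and spatial context. It learns unified, spatially aware representations through cross-attention and Transformer modules and performs well on both generative and predictive tasks.

Details

Motivation: Existing machine learning methods usually analyze cell images and gene expression data separately and do not integrate multimodal and spatial context information, which limits biological understanding.

Result: On a dataset of 17 million cell-gene pairs, 1 million niche-gene pairs, and 10,000 tissue-gene pairs, SPATIA outperforms the baseline models across 12 tasks and generates realistic cell images.

Insight: Jointly modeling multimodal and spatial information is essential for understanding cell phenotypes and tissue function, and SPATIA provides a powerful general framework for spatial transcriptomics.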

Abstract: Understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but machine learning methods typically analyze these modalities in isolation or at limited resolution. We address the problem of learning unified, spatially aware representations that integrate cell morphology, gene expression, and spatial context across biological scales. This requires models that can operate at single-cell resolution, reason across spatial neighborhoods, and generalize to whole-slide tissue organization. Here, we introduce SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. SPATIA learns cell-level embeddings by fusing image-derived morphological tokens and transcriptomic vector tokens using cross-attention and then aggregates them at niche and tissue levels using transformer modules to capture spatial dependencies. SPATIA incorporates token merging in its generative diffusion decoder to synthesize high-resolution cell images conditioned on gene expression. We assembled a multi-scale dataset consisting of 17 million cell-gene pairs, 1 million niche-gene pairs, and 10,000 tissue-gene pairs across 49 donors, 17 tissue types, and 12 disease states. We benchmark SPATIA against 13 existing models across 12 individual tasks, which span several categories including cell annotation, cell clustering, gene imputation, cross-modal prediction, and image generation. SPATIA achieves improved performance over all baselines and generates realistic cell morphologies that reflect transcriptomic perturbations.

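cs.HC

[328] DeepGesture: A conversational gesture synthesis system based on emotions and semantics cs.HC | cs.CL | cs.LG | cs.SD | eess.AS

Thanh Hoang-Minh

TL;DR: DeepGesture is a conversational gesture synthesis system based on emotions and semantics. It uses a diffusion model to generate natural gestures that match text, speech, emotion, and seed motion, improving semantic alignment and emotional expressiveness.

Details

Motivation: The current bottleneck in creating digital humans is generating character movements that correspond naturally to text or speech input, which requires solving semantic alignment and emotional expression.

Result: Evaluated on the ZeroEGGS dataset, DeepGesture outperforms baselines in the human-likeness and contextual appropriateness of its gestures, supports interpolation between emotional states, and generalizes to out-of-distribution speech.

Insight: With multimodal conditioning and a diffusion model, DeepGesture achieves greater naturalness and emotional expressiveness in gesture generation for digital humans, advancing multimodal, emotionally aware digital humans.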

Abstract: Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. A lightweight Transformer backbone combines full self-attention and cross-local attention for effective feature fusion of heterogeneous modalities. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness, outperforming baselines on Mean Opinion Score and Fréchet Gesture Distance metrics. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices, marking a step forward toward fully multimodal, emotionally aware digital humans.
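The emotion-guided classifier-free diffusion mentioned in the abstract is not specified in detail here. The sketch below shows the standard classifier-free guidance combination, in which the denoiser is queried once with and once without the condition and the two predictions are blended. The two prediction vectors are assumed to come from a model that is not shown, and the names are illustrative rather than taken from DeepGesture.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    eps_uncond: prediction with the emotion condition dropped.
    eps_cond:   prediction with the emotion condition applied.
    guidance_scale: 0 ignores the condition, 1 uses the conditional
    prediction as is, and values above 1 push further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Raising the scale typically strengthens adherence to the condition at some cost in diversity, which is one common way to expose a controllability knob across affective states.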

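[329] The role of large language models in UI/UX design: A systematic literature review cs.HC | cs.AI | cs.CL

Ammar Ahmed, Ali Shariq Imran

TL;DR: This systematic literature review examines the role of large language models (LLMs) in UI/UX design and summarizes the findings and challenges of existing research.

Details

Motivation: Although LLMs are widely used in many fields, their potential in UI/UX design has not been systematically assessed; this review aims to fill that gap.

Result: LLMs such as GPT-4, Gemini, and PaLM are widely used across all stages of the design lifecycle, but challenges such as hallucination, prompt instability, and limited explainability remain.

Insight: LLMs are emerging collaborators in UI/UX design, but their integration must attend to ethics and inclusivity while addressing the limitations of current technology.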

Abstract: This systematic literature review examines the role of large language models (LLMs) in UI/UX design, synthesizing findings from 38 peer-reviewed studies published between 2022 and 2025. We identify key LLMs in use, including GPT-4, Gemini, and PaLM, and map their integration across the design lifecycle, from ideation to evaluation. Common practices include prompt engineering, human-in-the-loop workflows, and multimodal input. While LLMs are reshaping design processes, challenges such as hallucination, prompt instability, and limited explainability persist. Our findings highlight LLMs as emerging collaborators in design, and we propose directions for the ethical, inclusive, and effective integration of these technologies.

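[330] More than One Step at a Time: Designing Procedural Feedback for Non-visual Makeup Routines cs.HC | cs.CV

Franklin Mingzhe Li, Akihiko Oharazawa, Chloe Qingyu Zhu, Misty Fan, Daisuke Sato

TL;DR: This paper explores how assistive technology can help people with vision impairments carry out complex makeup routines, and it contributes design implications and a taxonomy of feedback needs based on user studies.

Details

Motivation: Makeup is an important means of self-expression, but for people with vision impairments the routines are complex and effective assistive tools are lacking. Existing tools focus on isolated tasks such as color identification and neglect overall needs such as coordinating steps, managing product placement, and assessing the final look.

Result: The study finds that visually impaired users rely on tactile-first strategies but still face challenges in blending, symmetry, and assessment. Users want honest, real-time, goal-aligned feedback.

Insight: Future assistive systems should use conversational interaction and context awareness to support visually impaired users in completing makeup routines independently, while attending to expressive and personal needs.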

Abstract: Makeup plays a vital role in self-expression, identity, and confidence - yet remains an underexplored domain for assistive technology, especially for people with vision impairments. While existing tools support isolated tasks such as color identification or product labeling, they rarely address the procedural complexity of makeup routines: coordinating step sequences, managing product placement, and assessing the final look with accessible feedback. To understand the real-world process, we conducted a contextual inquiry with 15 visually impaired makeup users, capturing real-time makeup application behaviors and their step-by-step information needs and assessment approaches. Our findings reveal embodied, tactile-first strategies; persistent challenges in blending, symmetry, and assessment; and a desire for honest, real-time, goal-aligned feedback. We also interviewed five professional makeup artists, who reviewed participant makeup videos and provided expert responses to participant-raised questions and assessment practices. We contribute a taxonomy of feedback needs in non-visual makeup, and outline design implications for future assistive systems - emphasizing hands-free, conversational interaction and context-aware, procedural support for expressive and independent beauty practices.

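cs.IR

[331] A Comparative Study of Specialized LLMs as Dense Retrievers cs.IR | cs.AI | cs.CL | cs.LG

Hengran Zhang, Keping Bi, Jiafeng Guo

TL;DR: This paper systematically compares domain-specialized large language models (LLMs) on dense retrieval tasks. Mathematical specialization and long reasoning capability degrade retrieval performance, whereas the vision-language model and code-specialized LLMs excel at zero-shot retrieval, pointing to new directions for unified retrieval across domains and modalities.

Details

Motivation: To study how the domain specialization of LLMs affects their retrieval performance and to explore the potential of unified retrievers that handle text, code, images, and multimodal content.

Result: Mathematical specialization and long reasoning capability degrade retrieval performance. The vision-language model and code-specialized LLMs outperform other models in zero-shot retrieval, even surpassing BM25 on the code retrieval task, and perform comparably to base LLMs after supervised fine-tuning.

Insight: Cross-domain and cross-modal fusion is a promising direction for unified retrieval, and the study also reveals the performance trade-offs that domain specialization can bring.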

Abstract: While large language models (LLMs) are increasingly deployed as dense retrievers, the impact of their domain-specific specialization on retrieval effectiveness remains underexplored. This investigation systematically examines how task-specific adaptations in LLMs influence their retrieval capabilities, an essential step toward developing unified retrievers capable of handling text, code, images, and multimodal content. We conduct extensive experiments with eight Qwen2.5 7B LLMs, including base, instruction-tuned, code/math-specialized, long reasoning, and vision-language models across zero-shot retrieval settings and the supervised setting. For the zero-shot retrieval settings, we consider text retrieval from the BEIR benchmark and code retrieval from the CoIR benchmark. Further, to evaluate supervised performance, all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical specialization and the long reasoning capability cause consistent degradation in three settings, indicating conflicts between mathematical reasoning and semantic matching. The vision-language model and code-specialized LLMs demonstrate superior zero-shot performance compared to other LLMs, even surpassing BM25 on the code retrieval task, and maintain comparable performance to base LLMs in supervised settings. These findings suggest promising directions for the unified retrieval task leveraging cross-domain and cross-modal fusion.
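Dense retrieval, the task studied above, scores each document by the similarity between its embedding and the query embedding. The sketch below assumes the embeddings have already been produced by some LLM encoder (not shown) and ranks documents by cosine similarity; the function names are illustrative, and the study's actual similarity function and pooling are not specified here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    return sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
```

With a query embedding `[1, 0]` and document embeddings `[[0, 1], [1, 0], [1, 1]]`, the ranking places document 1 first, then document 2, then document 0.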

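[332] Navigating Speech Recording Collections with AI-Generated Illustrations cs.IR | cs.CL | cs.HC | cs.SD | eess.AS

Sirina Håland, Trond Karlsen Strøm, Petra Galuščáková

TL;DR: This paper proposes a new navigational method for exploring speech archives that uses language and multimodal generative models, and it demonstrates the method’s potential with a Web application built on the TED-LIUM 3 dataset.

Details

Motivation: The amount of spoken content keeps growing, but extracting information and knowledge from it remains challenging, and traditional retrieval methods need to be complemented by new approaches.

Result: Initial user tests indicate that the system has the potential to simplify the exploration of large speech collections.

Insight: Combining generative models with visualization tools can effectively improve navigation and search of spoken content.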

Abstract: Although the amount of available spoken content is steadily increasing, extracting information and knowledge from speech recordings remains challenging. Beyond enhancing traditional information retrieval methods such as speech search and keyword spotting, novel approaches for navigating and searching spoken content need to be explored and developed. In this paper, we propose a novel navigational method for speech archives that leverages recent advances in language and multimodal generative models. We demonstrate our approach with a Web application that organizes data into a structured format using interactive mind maps and image generation tools. The system is implemented using the TED-LIUM 3 dataset, which comprises over 2,000 speech transcripts and audio files of TED Talks. Initial user tests using a System Usability Scale (SUS) questionnaire indicate the application’s potential to simplify the exploration of large speech collections.
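For readers unfamiliar with the System Usability Scale used in the user tests, the sketch below shows the standard SUS scoring rule for one respondent: ten items answered on a 1 to 5 scale, with odd-numbered items contributing the answer minus 1, even-numbered items contributing 5 minus the answer, and the sum scaled by 2.5 to a 0 to 100 range. The function name is illustrative, and the paper's own scores are not reproduced here.

```python
def sus_score(responses):
    """Compute the 0-100 SUS score from ten answers on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS uses exactly ten items")
    total = 0
    for index, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each answer must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if index % 2 == 1 else (5 - answer)
    return total * 2.5
```

A respondent who answers 3 on every item scores 50.0; the per-respondent scores are then usually averaged across participants.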

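cs.IT

[333] LVM4CSI: Enabling Direct Application of Pre-Trained Large Vision Models for Wireless Channel Tasks cs.IT | cs.AI | cs.CV | cs.LG | math.IT

Jiajia Guo, Peiwen Jiang, Chao-Kai Wen, Shi Jin, Jun Zhang

TL;DR: The paper proposes LVM4CSI, a framework that applies pre-trained large vision models (LVMs) directly to wireless channel tasks, avoiding task-specific neural network design and fine-tuning while significantly improving performance and reducing the number of trainable parameters.

Details

Motivation: In existing wireless communication systems, AI methods depend on task-specific neural networks that require expert design and large amounts of data, which limits generalizability and practicality. LVM4CSI exploits the structural similarity between CSI and computer vision data to apply pre-trained vision models directly.

Result: On channel estimation, human activity recognition, and user localization, LVM4CSI performs comparably to or better than task-specific networks, including an improvement of more than 9.61 dB in channel estimation and a reduction of about 40% in localization error.

Insight: The strong feature extraction of cross-domain pre-trained models can significantly improve the efficiency and performance of wireless communication tasks without task-specific design.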

Abstract: Accurate channel state information (CSI) is critical to the performance of wireless communication systems, especially with the increasing scale and complexity introduced by 5G and future 6G technologies. While artificial intelligence (AI) offers a promising approach to CSI acquisition and utilization, existing methods largely depend on task-specific neural networks (NNs) that require expert-driven design and large training datasets, limiting their generalizability and practicality. To address these challenges, we propose LVM4CSI, a general and efficient framework that leverages the structural similarity between CSI and computer vision (CV) data to directly apply large vision models (LVMs) pre-trained on extensive CV datasets to wireless tasks without any fine-tuning, in contrast to large language model-based methods that generally necessitate fine-tuning. LVM4CSI maps CSI tasks to analogous CV tasks, transforms complex-valued CSI into visual formats compatible with LVMs, and integrates lightweight trainable layers to adapt extracted features to specific communication objectives. We validate LVM4CSI through three representative case studies, including channel estimation, human activity recognition, and user localization. Results demonstrate that LVM4CSI achieves comparable or superior performance to task-specific NNs, including an improvement exceeding 9.61 dB in channel estimation and approximately 40% reduction in localization error. Furthermore, it significantly reduces the number of trainable parameters and eliminates the need for task-specific NN design.
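The abstract says that complex-valued CSI is transformed into visual formats compatible with LVMs but does not describe the mapping. The sketch below shows one simple possibility, offered purely as an assumption: the real part, imaginary part, and magnitude of a CSI matrix become three image channels, each min-max scaled to 8-bit values. The function name and channel choice are hypothetical.

```python
def csi_to_image(csi):
    """Map an H x W matrix of complex CSI values to a 3-channel 8-bit image.

    Channel 0 = real part, channel 1 = imaginary part, channel 2 = magnitude.
    This channel layout is an illustrative assumption, not the paper's method.
    """
    channels = [
        [[z.real for z in row] for row in csi],
        [[z.imag for z in row] for row in csi],
        [[abs(z) for z in row] for row in csi],
    ]
    image = []
    for channel in channels:
        flat = [value for row in channel for value in row]
        low, high = min(flat), max(flat)
        span = (high - low) or 1.0  # avoid dividing by zero on flat channels
        image.append(
            [[round(255 * (value - low) / span) for value in row] for row in channel]
        )
    return image
```

The output could then be resized and fed to a frozen vision backbone, with lightweight trainable layers on top as the abstract describes; those parts are outside the scope of this sketch.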

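eess.AS

[334] MMMOS: Multi-domain Multi-axis Audio Quality Assessment eess.AS | cs.AI | cs.CL

Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee

TL;DR: MMMOS is a no-reference, multi-domain audio quality assessment system that improves the evaluation of speech, music, and environmental sounds by predicting four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness.

Details

Motivation: Existing audio quality assessment models predict only a single Mean Opinion Score (MOS), cannot separate different perceptual factors, and generalize poorly, so a multi-domain, multi-axis assessment method is needed.

Result: Compared with the baseline, MMMOS reduces mean squared error by 20-30% and increases Kendall’s τ by 4-5%, and it ranks near the top on many of the challenge metrics.

Insight: Multi-domain, multi-axis assessment captures the different dimensions of audio quality more fully, and both the ensembling strategy and the fusion of pretrained encoders are key to the performance gains.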

Abstract: Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall’s τ versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
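The two evaluation quantities named in the abstract, mean squared error and Kendall's τ, can be computed as in the sketch below. It uses the simple tau-a form, which counts tied pairs as neither concordant nor discordant; the challenge may use a tie-corrected variant, so treat this as an illustration rather than the official scoring code.

```python
from itertools import combinations


def mean_squared_error(predicted, target):
    """Average squared difference between predictions and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Mean squared error measures how close predicted scores are to human ratings in absolute terms, while Kendall's τ measures only whether the predicted ordering of clips agrees with the human ordering.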

Table of Contents

cs.CV

[1] Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras cs.CV

[2] Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions cs.CV | cs.AI | cs.GR | cs.HC | cs.MM

[3] Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis cs.CV | cs.AI | I.2.7; I.2.10; I.4

[4] Enhancing Sports Strategy with Video Analytics and Data Mining: Automated Video-Based Analytics Framework for Tennis Doubles cs.CV | cs.LG

[5] Modeling Urban Food Insecurity with Google Street View Images cs.CV | cs.LG

[6] OBSER: Object-Based Sub-Environment Recognition for Zero-Shot Environmental Inference cs.CV | cs.AI | cs.LG | stat.ML

[7] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation cs.CV | cs.AI | cs.CL | cs.MM

[8] Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding cs.CV | cs.AI

[9] DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction cs.CV | cs.AI | cs.RO | I.4.8; I.2.7; I.2.10

[10] Multimodal image registration for effective thermographic fever screening cs.CV

[11] CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning cs.CV

[12] VR-YOLO: Enhancing PCB Defect Detection with Viewpoint Robustness Based on YOLO cs.CV | eess.IV

[13] Concept-based Adversarial Attack: a Probabilistic Perspective cs.CV | cs.AI

[14] Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models cs.CV

[15] Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers cs.CV | cs.AI | cs.CL | I.4; I.2

[16] Leveraging the Structure of Medical Data for Improved Representation Learning cs.CV | cs.LG

[17] Enabling Robust, Real-Time Verification of Vision-Based Navigation through View Synthesis cs.CV | cs.RO | eess.IV | I.4.9

[18] FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images cs.CV | cs.CR

[19] Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis cs.CV

[20] Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification cs.CV | cs.LG

[21] Markerless Stride Length estimation in Athletic using Pose Estimation with monocular vision cs.CV

[22] Look-Back: Implicit Visual Re-focusing in MLLM Reasoning cs.CV | cs.LG

[23] Intelligent Histology for Tumor Neurosurgery cs.CV

[24] Detection of Rail Line Track and Human Beings Near the Track to Avoid Accidents cs.CV | cs.LG | 68T10 | I.2.10; I.4.8

[25] LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection cs.CV | cs.AI | I.2.10; I.4.8; I.5

[26] Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations cs.CV | cs.CL | cs.LG

[27] A Vision-Based Closed-Form Solution for Measuring the Rotation Rate of an Object by Tracking One Point cs.CV

[28] Subject Invariant Contrastive Learning for Human Activity Recognition cs.CV | cs.LG

[29] LACONIC: A 3D Layout Adapter for Controllable Image Creation cs.CV

[30] Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders cs.CV | cs.AI

[31] Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification cs.CV

[32] ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization cs.CV | cs.LG

[33] NOVO: Unlearning-Compliant Vision Transformers cs.CV

[34] MolVision: Molecular Property Prediction with Vision Language Models cs.CV

[35] Zero-shot Inexact CAD Model Alignment from a Single Image cs.CV

[36] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection cs.CV

[37] Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model cs.CV | cs.AI

[38] Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations cs.CV

[39] MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion cs.CV

[40] Personalized Image Generation from an Author Writing Style cs.CV | cs.AI

[41] Source-Free Domain Adaptation via Multi-view Contrastive Learning cs.CV | cs.AI

[42] Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents cs.CV

[43] De-Fake: Style based Anomaly Deepfake Detection cs.CV | cs.AI

[44] DESign: Dynamic Context-Aware Convolution and Efficient Subnet Regularization for Continuous Sign Language Recognition cs.CV | cs.AI

[45] Be the Change You Want to See: Revisiting Remote Sensing Change Detection Practices cs.CV | cs.AI

[46] Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos cs.CV

[47] Unlearning the Noisy Correspondence Makes CLIP More Robust cs.CV | cs.MM

[48] Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach cs.CV | cs.AI

[49] Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds cs.CV

[50] Information-Bottleneck Driven Binary Neural Network for Change Detection cs.CV

[51] Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding cs.CV | cs.AI

[52] PhenoBench: A Comprehensive Benchmark for Cell Phenotyping cs.CV

[53] CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation cs.CV

[54] Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition cs.CV | cs.AI

[55] [Beyond Accuracy: Metrics that Uncover What Makes a ‘Good’ Visual Descriptor](https://arxiv.org/abs/2507.03542) cs.CV

[56] SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications cs.CV | cs.AI | cs.LG

[57] Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation cs.CV | cs.AI | cs.CL

[58] From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis cs.CV | cs.AI | cs.LG

[59] Dynamic Multimodal Prototype Learning in Vision-Language Models cs.CV

[60] On the rankability of visual embeddings cs.CV

[61] SAMed-2: Selective Memory Enhanced Medical Segment Anything Model cs.CV

[62] Sign Spotting Disambiguation using Large Language Models cs.CV | cs.AI

[63] Computationally efficient non-Intrusive pre-impact fall detection system cs.CV

[64] Less is More: Empowering GUI Agent with Context-Aware Simplification cs.CV | cs.AI | cs.HC | cs.LG

[65] Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps cs.CV

[66] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays cs.CV

[67] StreamDiT: Real-Time Streaming Text-to-Video Generation cs.CV | cs.AI | cs.LG | eess.IV

[68] FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed cs.CV | cs.AI | cs.LG

[69] Zero Memory Overhead Approach for Protecting Vision Transformer Parameters cs.CV

[70] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition cs.CV | cs.RO

[71] Hierarchical Semantic-Visual Fusion of Visible and Near-infrared Images for Long-range Haze Removal cs.CV | cs.AI

[72] Deconfounding Causal Inference through Two-Branch Framework with Early-Forking for Sensor-Based Cross-Domain Activity Recognition cs.CV

[73] Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection cs.CV | 68T10 | I.4; I.5; J.6

[74] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation cs.CV

[75] Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs cs.CV

[76] Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation cs.CV | cs.AI

[77] DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering cs.CV

[78] VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision cs.CV | cs.RO