cs.CV [Total: 61]
cs.CL [Total: 40]
cs.MM [Total: 1]
cs.CE [Total: 1]
cs.HC [Total: 1]
cs.AI [Total: 2]
cs.LG [Total: 2]
cs.RO [Total: 3]
eess.IV [Total: 2]
eess.AS [Total: 1]
cs.DB [Total: 1]

cs.CV

[1] Vision-Based Perception for Autonomous Vehicles in Off-Road Environment Using Deep Learning cs.CV | cs.AR | cs.LG | eess.IV | eess.SP

Nelson Alves Ferreira Neto

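TL;DR: This work proposes a perception system for autonomous driving in off-road environments. It uses a modular segmentation network (CMSNet) to segment obstacles and trafficable ground in real time, and validates the system under adverse conditions on the new Kamino dataset.

Details

Motivation: Unstructured off-road environments, such as open-pit mines and roads in developing countries, require low-latency intelligent systems for autonomous driving. Traditional approaches depend on predefined paths and adapt poorly to complex, changing off-road scenes.

Result: Experiments show that CMSNet effectively segments trafficable regions under adverse conditions (night, rain, dust) and reaches real-time processing after optimization.

Insight: Modular design makes the network more flexible; real data captured under adverse conditions is essential for model robustness; inference optimization techniques have practical value in autonomous driving.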

Abstract: Low-latency intelligent systems are required for autonomous driving on non-uniform terrain in open-pit mines and developing countries. This work proposes a perception system for autonomous vehicles on unpaved roads and off-road environments, capable of navigating rough terrain without a predefined trail. The Configurable Modular Segmentation Network (CMSNet) framework is proposed, facilitating different architectural arrangements. CMSNet configurations were trained to segment obstacles and trafficable ground on new images from unpaved/off-road scenarios with adverse conditions (night, rain, dust). We investigated applying deep learning to detect drivable regions without explicit track boundaries, studied algorithm behavior under visibility impairment, and evaluated field tests with real-time semantic segmentation. A new dataset, Kamino, is presented with almost 12,000 images from an operating vehicle with eight synchronized cameras. The Kamino dataset has a high number of labeled pixels compared to similar public collections and includes images from an off-road proving ground emulating a mine under adverse visibility. To achieve real-time inference, CMSNet CNN layers were methodically removed and fused using TensorRT, C++, and CUDA. Empirical experiments on two datasets validated the proposed system’s effectiveness.

[2] Overview of LifeCLEF Plant Identification task 2020 cs.CV

Herve Goeau, Pierre Bonnet, Alexis Joly

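TL;DR: The LifeCLEF 2020 Plant Identification task evaluates how herbarium data can improve automated plant identification systems, especially for data-deficient tropical regions.

Details

Motivation: Despite deep learning progress in plant identification, most data comes from North America and Western Europe, while highly biodiverse regions such as the tropics are poorly covered. Herbarium collections offer a way to fill this gap.

Result: The paper summarizes the approaches of the participating teams and analyzes the main outcomes, showing the potential of herbarium data to improve automated plant identification in tropical regions.

Insight: Herbarium data can be an important complement in biodiverse regions where field photos are scarce, and cross-domain learning is a promising way to address the data imbalance.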

Abstract: Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data with more and more photos in the field. However, this profusion of data only concerns a few tens of thousands of species, mostly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have collected, catalogued and systematically stored plant specimens in herbaria, particularly in tropical regions, and the recent efforts by the biodiversity informatics community made it possible to put millions of digitized sheets online. The LifeCLEF 2020 Plant Identification challenge (or “PlantCLEF 2020”) was designed to evaluate to what extent automated identification on the flora of data-deficient regions can be improved by the use of herbarium collections. It is based on a dataset of about 1,000 species mainly focused on South America’s Guiana Shield, an area known to have one of the greatest diversity of plants in the world. The challenge was evaluated as a cross-domain classification task where the training set consists of several hundred thousand herbarium sheets and a few thousand photos to enable learning a mapping between the two domains. The test set was exclusively composed of photos in the field. This paper presents the resources and assessments of the conducted evaluation, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

[3] iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning cs.CV

Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton

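TL;DR: iFinder is a modular, training-free structured semantic grounding framework. It converts dash-cam video into a hierarchical, interpretable data structure, decoupling perception from reasoning and improving LLM performance on zero-shot driving video understanding.

Details

Motivation: Existing video-based vision-language models (V-VLMs) fall short in spatial reasoning, causal inference, and event explainability, especially when vision is the only modality available, as with dash-cam video. iFinder addresses this by giving the LLM domain-specific semantic grounding through structured hierarchical data.

Result: On four public dash-cam video benchmarks, iFinder significantly outperforms end-to-end V-VLMs, with gains of up to 39% in accident reasoning accuracy.

Insight: Feeding domain-specific structured representations into the LLM lets a zero-shot, training-free approach achieve strong, interpretable results, offering a new direction for driving video analysis.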

Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues – object pose, lane positions, and object trajectories – which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM’s outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder’s proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
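The abstract describes cues organized into frame-level and video-level structures that are then handed to an LLM. A minimal sketch of what such a hierarchy could look like is given below; every class name, field, and the JSON serialization are hypothetical illustrations and are not taken from the iFinder paper or its code.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectCue:
    """Per-object cues for one frame (hypothetical fields)."""
    track_id: int
    category: str
    heading_deg: float  # object orientation; the angle convention is an assumption
    lane: str           # e.g. "ego", "left", "right"


@dataclass
class FrameRecord:
    """Frame-level structure: all object cues observed in one frame."""
    index: int
    objects: List[ObjectCue] = field(default_factory=list)


@dataclass
class VideoRecord:
    """Video-level structure: ordered frames plus derived trajectories."""
    video_id: str
    frames: List[FrameRecord] = field(default_factory=list)

    def trajectories(self) -> Dict[int, List[Tuple[int, str]]]:
        """Group (frame index, lane) observations by track id."""
        out: Dict[int, List[Tuple[int, str]]] = {}
        for frame in self.frames:
            for obj in frame.objects:
                out.setdefault(obj.track_id, []).append((frame.index, obj.lane))
        return out


def to_prompt_json(video: VideoRecord) -> str:
    """Serialize the hierarchy to JSON text that could be placed in an LLM prompt."""
    return json.dumps(asdict(video), indent=2)
```

Keeping perception output in a plain, serializable structure like this is one way to make the reasoning step inspectable: the text the LLM sees can be logged and checked independently of the vision models that produced it.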

[4] CURE: Centroid-guided Unsupervised Representation Erasure for Facial Recognition Systems cs.CV

Fnu Shivam, Nima Najafzadeh, Yenumula Reddy, Prashnna Gyawali

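TL;DR: The paper proposes CURE, an unsupervised machine unlearning framework for facial recognition systems that removes targeted samples without identity labels while preserving model performance.

Details

Motivation: Widespread use of facial recognition has raised privacy concerns. Existing unlearning methods depend on supervised labels, which are hard to obtain in privacy-constrained settings or in large, noisy datasets.

Result: CURE outperforms unsupervised variants of existing unlearning methods, and it remains effective when low-quality images are designated as the forget set.

Insight: Unsupervised unlearning shows promise for privacy protection, and image quality plays an important role in unlearning effectiveness.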

Abstract: In the current digital era, facial recognition systems offer significant utility and have been widely integrated into modern technological infrastructures; however, their widespread use has also raised serious privacy concerns, prompting regulations that mandate data removal upon request. Machine unlearning has emerged as a powerful solution to address this issue by selectively removing the influence of specific user data from trained models while preserving overall model performance. However, existing machine unlearning techniques largely depend on supervised techniques requiring identity labels, which are often unavailable in privacy-constrained situations or in large-scale, noisy datasets. To address this critical gap, we introduce CURE (Centroid-guided Unsupervised Representation Erasure), the first unsupervised unlearning framework for facial recognition systems that operates without the use of identity labels, effectively removing targeted samples while preserving overall performance. We also propose a novel metric, the Unlearning Efficiency Score (UES), which balances forgetting and retention stability, addressing shortcomings in the current evaluation metrics. CURE significantly outperforms unsupervised variants of existing unlearning methods. Additionally, we conducted quality-aware unlearning by designating low-quality images as the forget set, demonstrating its usability and benefits, and highlighting the role of image quality in machine unlearning.

[5] Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG cs.CV

Mahmoud Afifi, Ran Zhang, Michael S. Brown

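TL;DR: The paper proposes Raw-JPEG Adapter, a lightweight, learnable preprocessing pipeline that adapts raw images for standard JPEG compression while preserving high-fidelity reconstruction.

Details

Motivation: Raw images retain full sensor information but need large storage, while JPEG is efficient and widely compatible but poorly suited to storing raw data. The paper aims to resolve this tension.

Result: Experiments show the method achieves higher fidelity than direct JPEG storage and offers a favorable trade-off between compression ratio and reconstruction accuracy.

Insight: Learnable preprocessing combined with embedded parameters can preserve raw image information while staying JPEG-compatible, which makes for a practical solution.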

Abstract: Digital cameras digitize scene light into linear raw representations, which the image signal processor (ISP) converts into display-ready outputs. While raw data preserves full sensor information (valuable for editing and vision tasks), formats such as Digital Negative (DNG) require large storage, making them impractical in constrained scenarios. In contrast, JPEG is a widely supported format, offering high compression efficiency and broad compatibility, but it is not well-suited for raw storage. This paper presents Raw-JPEG Adapter, a lightweight, learnable, and invertible preprocessing pipeline that adapts raw images for standard JPEG compression. Our method applies spatial and optional frequency-domain transforms, with compact parameters stored in the JPEG comment field, enabling accurate raw reconstruction. Experiments across multiple datasets show that our method achieves higher fidelity than direct JPEG storage, supports other codecs, and provides a favorable trade-off between compression ratio and reconstruction accuracy.
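The abstract mentions storing compact parameters in the JPEG comment field. As a rough illustration of that mechanism only, the sketch below writes and reads a JPEG COM segment (marker `0xFFFE`, followed by a two-byte big-endian length that counts itself). The function names and the payload layout are hypothetical; the paper's actual parameter encoding is not described in the abstract.

```python
import struct
from typing import Optional

SOI = b"\xff\xd8"  # start-of-image marker
COM = b"\xff\xfe"  # comment-segment marker


def _end_of_app_segments(jpeg_bytes: bytes) -> int:
    """Return the offset just past the leading APPn segments (markers E0..EF)."""
    pos = 2
    while (pos + 4 <= len(jpeg_bytes)
           and jpeg_bytes[pos] == 0xFF
           and 0xE0 <= jpeg_bytes[pos + 1] <= 0xEF):
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        pos += 2 + length
    return pos


def embed_comment(jpeg_bytes: bytes, payload: bytes) -> bytes:
    """Insert one COM segment carrying `payload` after the APPn segments."""
    if not jpeg_bytes.startswith(SOI):
        raise ValueError("not a JPEG stream (missing SOI marker)")
    if len(payload) > 65533:  # the length field is 16 bits and includes itself
        raise ValueError("payload too large for a single COM segment")
    segment = COM + struct.pack(">H", len(payload) + 2) + payload
    pos = _end_of_app_segments(jpeg_bytes)
    return jpeg_bytes[:pos] + segment + jpeg_bytes[pos:]


def read_first_comment(jpeg_bytes: bytes) -> Optional[bytes]:
    """Return the payload of the first COM segment in the header, or None."""
    pos = 2
    while pos + 4 <= len(jpeg_bytes) and jpeg_bytes[pos] == 0xFF:
        marker = jpeg_bytes[pos + 1]
        if marker in (0xD9, 0xDA):  # end of image, or start of scan data
            break
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        if marker == 0xFE:
            return jpeg_bytes[pos + 4:pos + 2 + length]
        pos += 2 + length
    return None
```

A caller might pack a handful of transform parameters with `struct.pack("<3f", ...)` and pass the result as `payload`. Because decoders skip comment segments, the file stays viewable as an ordinary JPEG.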

[6] The Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar cs.CV | cs.RO

William L. Muckelroy III, Mohammed Alsakabi, John M. Dolan, Ozan K. Tonguz

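TL;DR: The paper studies how 2D segmentation backbones affect the quality of point clouds generated from 4D radar. Higher-capacity models are not necessarily better, but an optimal backbone yields a 23.7% improvement.

Details

Motivation: LiDAR's high cost limits its adoption in commercial autonomous driving systems. 4D radar is a low-cost alternative that can generate LiDAR-like point clouds, but the generation quality needs improvement.

Result: Experiments show that very high-capacity backbones can hurt performance, while an optimal backbone improves point cloud quality by 23.7% over the state of the art.

Insight: Network capacity should be balanced against problem complexity; backbone design is key to improving 4D radar point cloud generation.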

Abstract: LiDAR’s dense, sharp point cloud (PC) representations of the surrounding environment enable accurate perception and significantly improve road safety by offering greater scene awareness and understanding. However, LiDAR’s high cost continues to restrict the broad adoption of high-level Autonomous Driving (AD) systems in commercially available vehicles. Prior research has shown progress towards circumventing the need for LiDAR by training a neural network, using LiDAR point clouds as ground truth (GT), to produce LiDAR-like 3D point clouds using only 4D Radars. One of the best examples is a neural network created to train a more efficient radar target detector with a modular 2D convolutional neural network (CNN) backbone and a temporal coherence network at its core that uses the RaDelft dataset for training (see arXiv:2406.04723). In this work, we investigate the impact of higher-capacity segmentation backbones on the quality of the produced point clouds. Our results show that while very high-capacity models may actually hurt performance, an optimal segmentation backbone can provide a 23.7% improvement over the state-of-the-art (SOTA).

Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza

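TL;DR: The paper uses a news-image benchmark to study the risk that vision-language models (VLMs) absorb and reproduce harmful social stereotypes when interpreting images and text. It uses an LLM as judge for evaluation and shows that visual context systematically affects model outputs.

Details

Motivation: Large VLMs may produce harmful stereotypes when images contain social visual cues such as age, gender, race, or occupation. The study aims to quantify the prevalence and impact of these biases.

Result: The study finds that (1) visual context systematically shifts model outputs in open-ended settings; (2) bias risk is highest for the gender and occupation attributes; and (3) higher faithfulness does not necessarily correspond to lower bias.

Insight: Fairness evaluation of VLMs must consider context and multiple attributes together; high accuracy or faithfulness can mask underlying bias.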

Abstract: Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.

[8] MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning cs.CV | cs.AI

Zeyu He, Shuai Huang, Yuwu Lu, Ming Zhao

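TL;DR: The paper proposes MoTiC, a framework that addresses prototype estimation bias and feature tightness in few-shot class-incremental learning (FSCIL). It improves prototype accuracy through Bayesian analysis and contrastive learning and achieves state-of-the-art results on several benchmarks.

Details

Motivation: FSCIL must learn new classes from few samples while retaining old-class knowledge. Existing methods use a frozen feature extractor and class-averaged prototypes, but new-class prototypes are significantly biased because data is scarce.

Result: The method achieves state-of-the-art performance on three FSCIL benchmarks, particularly the fine-grained CUB-200 task, supporting its ability to reduce bias and improve robustness.

Insight: Combining prior knowledge with contrastive learning can markedly improve prototype estimation and feature representation in FSCIL, offering a new direction for data-scarce settings.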

Abstract: Few-Shot Class-Incremental Learning (FSCIL) must contend with the dual challenge of learning new classes from scarce samples while preserving old class knowledge. Existing methods use the frozen feature extractor and class-averaged prototypes to mitigate against catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias due to extreme data scarcity, whereas base-class prototypes benefit from sufficient data. In this work, we theoretically demonstrate that aligning the new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. Furthermore, we propose large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information for new-class prototypes, we integrate momentum self-supervision and virtual categories into the Momentum Tightness and Contrast framework (MoTiC), constructing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks produce state-of-the-art performances, particularly on the fine-grained task CUB-200, validating our method’s ability to reduce estimation bias and improve incremental learning robustness.
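To make the idea of class-averaged prototypes and prior-informed correction concrete, here is a minimal sketch. It uses the textbook conjugate-Gaussian posterior mean, which shrinks a few-shot class mean toward a prior mean; this is a stand-in for illustration and is not claimed to be MoTiC's formulation. The function names, `noise_var`, and `prior_var` are all assumptions.

```python
from typing import List, Sequence


def class_mean(features: Sequence[Sequence[float]]) -> List[float]:
    """Class-averaged prototype: the per-dimension mean of the feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def shrunk_prototype(
    few_shot_features: Sequence[Sequence[float]],
    prior_mean: Sequence[float],
    noise_var: float = 1.0,  # assumed within-class feature variance
    prior_var: float = 1.0,  # assumed variance of class means around the prior
) -> List[float]:
    """Posterior mean under a Gaussian prior on the class mean.

    With n samples the data weight is (n / noise_var) / (n / noise_var + 1 / prior_var),
    so the estimate leans on the prior when n is small and on the data when n is large.
    """
    n = len(few_shot_features)
    xbar = class_mean(few_shot_features)
    data_precision = n / noise_var
    w = data_precision / (data_precision + 1.0 / prior_var)
    return [w * x + (1.0 - w) * m for x, m in zip(xbar, prior_mean)]
```

In this toy setting `prior_mean` could be derived from base-class statistics, for example the mean of the base-class prototypes; how MoTiC actually constructs its prior is not specified in the abstract.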

[9] Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies cs.CV

Sumit Mamtani

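TL;DR: The paper proposes two lightweight optimization techniques, STA and ANF, to address structured noise in Vision Transformer feature maps, improving interpretability and downstream task performance.

Details

Motivation: Vision Transformers perform well across vision tasks, but structured noise in their feature maps hinders downstream applications such as segmentation and depth estimation.

Result: On benchmarks including ImageNet, Ade20k, and NYUv2, the methods show consistent improvements in visual quality and task performance.

Insight: Structured noise may be one of the performance bottlenecks of ViTs; lightweight optimization strategies can improve model behavior without adding computational burden.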

Abstract: Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artifacts in their feature maps hinder downstream applications such as segmentation and depth estimation. We propose two novel and lightweight optimization techniques, Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF), to improve interpretability and mitigate these artifacts. STA enhances token diversity through spatial perturbations during tokenization, while ANF applies learnable inline denoising between transformer layers. These methods are architecture-agnostic and evaluated across standard benchmarks, including ImageNet, Ade20k, and NYUv2. Experimental results show consistent improvements in visual quality and task performance, highlighting the practical effectiveness of our approach.

[10] From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition cs.CV

Ling Lo, Kelvin C. K. Chan, Wen-Huang Cheng, Ming-Hsuan Yang

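TL;DR: The paper proposes guiding the denoising process frame by frame to achieve smooth attribute transitions in video while preserving motion dynamics. It also introduces the CAT-Bench benchmark and evaluation metrics, and demonstrates the method's effectiveness.

Details

Motivation: Existing models are inconsistent when handling complex temporal changes such as gradual attribute transitions; prompt interpolation in particular struggles to produce smooth transitions.

Result: Experiments show the method outperforms baselines in visual fidelity, text alignment, and transition smoothness.

Insight: Frame-wise guidance effectively resolves inconsistency in gradual attribute transitions while preserving motion dynamics, offering a new direction for video generation.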

Abstract: Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions, by introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions. Code and CAT-Bench are released: https://github.com/lynn-ling-lo/Prompt2Progression.

[11] Anatomically Constrained Transformers for Cardiac Amyloidosis Classification cs.CV

Alexander Thorley, Agis Chartsias, Jordan Strom, Roberto Lang, Jeremy Slivnick

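TL;DR: The paper proposes an anatomically constrained Transformer for cardiac amyloidosis classification. Restricting the input to the myocardium, embedded as deforming points and image patches, improves classification performance.

Details

Motivation: Diagnosis of cardiac amyloidosis (CA) usually relies on clinical features from echocardiograms, but existing neural network models give no assurance that classification is based on clinically relevant features. The paper uses anatomical constraints to ensure the model attends only to CA-relevant regions.

Result: On a CA classification task, the method outperforms full-video Transformers while providing an explicit guarantee that classification is based only on anatomical regions.

Insight: Anatomical constraints can improve the reliability and interpretability of medical image classification and apply to both supervised and self-supervised learning.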

Abstract: Cardiac amyloidosis (CA) is a rare cardiomyopathy, with typical abnormalities in clinical measurements from echocardiograms such as reduced global longitudinal strain of the myocardium. An alternative approach for detecting CA is via neural networks, using video classification models such as convolutional neural networks. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. An alternative paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur – the myocardium, which we embed as a set of deforming points and corresponding sampled image patches into input tokens. We show that our anatomical constraint can also be applied to the popular self-supervised learning masked autoencoder pre-training, where we propose to mask and reconstruct only anatomical patches. We show that by constraining both the transformer and pre-training task to the myocardium where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused on only anatomical regions of the echo, and enables us to visualize transformer attention scores over the deforming myocardium.

[12] Learning to Stop: Reinforcement Learning for Efficient Patient-Level Echocardiographic Classification cs.CV

Woo-Jin Cho Kim, Jorge Oliveira, Arian Beqiri, Alex Thorley, Jordan Strom

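TL;DR: The paper proposes a reinforcement learning method that selects an optimal subset of echocardiographic video clips to make disease classification more efficient, together with an attention-based aggregation mechanism for fusing information.

Details

Motivation: Conventional methods either use a single clip and ignore other information, or process all clips at high computational cost. The paper uses reinforcement learning to decide dynamically when to stop processing clips, balancing performance against computation.

Result: On cardiac amyloidosis detection, the method reaches an AUC of 0.91 using only 30% of all clips, exceeding both the all-clips approach and other benchmarks.

Insight: Combining dynamic selection with attention-based aggregation can improve both the efficiency and the performance of medical image analysis, and suits compute-constrained settings.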

Abstract: Guidelines for transthoracic echocardiographic examination recommend the acquisition of multiple video clips from different views of the heart, resulting in a large number of clips. Typically, automated methods, for instance disease classifiers, either use one clip or average predictions from all clips. Relying on one clip ignores complementary information available from other clips, while using all clips is computationally expensive and may be prohibitive for clinical adoption. To select the optimal subset of clips that maximize performance for a specific task (image-based disease classification), we propose a method optimized through reinforcement learning. In our method, an agent learns to either keep processing view-specific clips to reduce the disease classification uncertainty, or stop processing if the achieved classification confidence is sufficient. Furthermore, we propose a learnable attention-based aggregation method as a flexible way of fusing information from multiple clips. The proposed method obtains an AUC of 0.91 on the task of detecting cardiac amyloidosis using only 30% of all clips, exceeding the performance achieved from using all clips and from other benchmarks.
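The keep-processing-or-stop decision described in the abstract can be illustrated with a much simpler stand-in: a fixed confidence threshold over a running mean of per-clip probabilities. This is not the paper's method, which learns the stopping policy with reinforcement learning and fuses clips with learnable attention. The function name, `threshold`, and `min_clips` are hypothetical.

```python
from typing import Iterable, Tuple


def classify_with_early_stop(
    clip_probs: Iterable[float],
    threshold: float = 0.9,
    min_clips: int = 1,
) -> Tuple[float, int]:
    """Consume per-clip positive-class probabilities until confident enough.

    Returns (mean probability over the clips used, number of clips used).
    Confidence is max(p, 1 - p) of the running mean, so the rule can stop
    early on either a confident positive or a confident negative.
    """
    total = 0.0
    used = 0
    for p in clip_probs:
        total += p
        used += 1
        mean_p = total / used
        confidence = max(mean_p, 1.0 - mean_p)
        if used >= min_clips and confidence >= threshold:
            break  # remaining clips are never processed
    if used == 0:
        raise ValueError("at least one clip probability is required")
    return total / used, used
```

Because `clip_probs` can be a lazy generator that runs the classifier on demand, stopping early actually saves the inference cost of the skipped clips.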

Bo Yu, Jianhua Yang, Zetao Du, Yan Huang, Chenglong Li

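TL;DR: FMISeg is a frequency-domain multi-modal fusion model for language-guided medical image segmentation that addresses complex lesion morphology and the semantic gap between vision and language modalities.

Details

Motivation: Medical image segmentation is essential for diagnosing pulmonary infectious diseases, but existing methods struggle to fuse clinical text reports effectively to improve accuracy, mainly because of complex lesion morphology and the vision-language semantic gap.

Result: Experiments on the QaTa-COV19 and MosMedData+ datasets show that FMISeg outperforms existing methods both qualitatively and quantitatively.

Insight: Frequency-domain features combined with language-guided multi-modal interaction can improve segmentation accuracy, especially for complex lesion morphology and cross-modal semantic alignment.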

Abstract: Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.

[14] PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction cs.CV

Yufei Han, Bowen Tie, Heng Guo, Youwei Lyu, Si Li

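TL;DR: PolGS is a fast reflective surface reconstruction method based on polarimetric Gaussian splatting. Integrating polarimetric constraints separates specular and diffuse components and improves reconstruction quality for complex reflective materials.

Details

Motivation: Efficient reconstruction of surfaces with complex reflectance is crucial for real-time virtual reality. Existing 3D Gaussian Splatting methods are fast, but their reconstruction quality trails implicit neural representations, particularly on complex reflective materials.

Result: Experiments on synthetic and real-world datasets show that PolGS improves reconstruction quality for complex reflective materials while remaining fast.

Insight: Polarization information helps separate surface reflectance components and thereby improves reconstruction quality, especially for highly reflective surfaces.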

Abstract: Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly in the case of recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a Polarimetric Gaussian Splatting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method.

[15] CAMILA: Context-Aware Masking for Image Editing with Language Alignment cs.CV

Hyunseung Kim, Chiho Choi, Srikanth Malla, Sai Prahladh Padmanabhan, Saurabh Bagchi

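TL;DR: CAMILA is a context-aware image editing method that validates the contextual coherence between instructions and the image, so that only relevant regions are edited and infeasible or contradictory instructions are not executed.

Details

Motivation: Existing text-guided image editing models tend to follow every user instruction blindly, so infeasible or contradictory instructions yield nonsensical output. CAMILA targets this problem.

Result: CAMILA outperforms existing models on complex instructions, better preserving image integrity and improving semantic alignment.

Insight: Context validation is key to text-guided image editing; CAMILA shows the importance of executing instructions selectively.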

Abstract: Text-guided image editing allows users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while ignoring non-executable instructions. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating the presence of infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.

[16] Robust RGB-T Tracking via Learnable Visual Fourier Prompt Fine-tuning and Modality Fusion Prompt Generation cs.CV

Hongtao Yang, Bineng Zhong, Qihua Liang, Zhiruo Zhu, Yaozong Zheng

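TL;DR: The paper proposes an efficient RGB-T tracking method based on visual Fourier prompt learning and modality fusion prompt generation, combining spatial-domain and frequency-domain information to improve performance.

Details

Motivation: Existing RGB-T tracking methods based on parameter-efficient fine-tuning (PEFT) use only spatial-domain information as prompts and overlook frequency-domain information, which limits performance.

Result: The method achieves strong performance on three RGB-T tracking benchmarks.

Insight: Frequency-domain information matters in RGB-T tracking, and thorough interaction between multi-modal features can improve performance.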

Abstract: Recently, visual prompt tuning has been introduced to RGB-Thermal (RGB-T) tracking as a parameter-efficient finetuning (PEFT) method. However, these PEFT-based RGB-T tracking methods typically rely solely on spatial domain information as prompts for feature extraction. As a result, they often fail to achieve optimal performance by overlooking the crucial role of frequency-domain information in prompt learning. To address this issue, we propose an efficient Visual Fourier Prompt Tracking (named VFPTrack) method to learn modality-related prompts via Fast Fourier Transform (FFT). Our method consists of a symmetric feature extraction encoder with shared parameters, visual Fourier prompts, and a Modality Fusion Prompt Generator that generates bidirectional interaction prompts through multi-modal feature fusion. Specifically, we first use a frozen feature extraction encoder to extract RGB and thermal infrared (TIR) modality features. Then, we combine the visual prompts in the spatial domain with the frequency domain prompts obtained from the FFT, which allows for the full extraction and understanding of modality features from different domain information. Finally, unlike previous fusion methods, the modality fusion prompt generation module we use combines features from different modalities to generate a fused modality prompt. This fused prompt then interacts with each individual modality to fully enable feature interaction across different modalities. Extensive experiments conducted on three popular RGB-T tracking benchmarks show that our method demonstrates outstanding performance.

[17] Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation cs.CV

Xinhao Zhong, Shuoyang Sun, Xulin Gu, Chenyang Zhu, Bin Chen

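TL;DR: The paper proposes RD$^3$ and systematically studies how post-evaluation settings affect dataset distillation performance. It finds that performance differences stem largely from inconsistent evaluation rather than from the quality of the methods themselves, and provides a standardized benchmark and evaluation protocol.

Details

Motivation: Existing decoupled dataset distillation methods use inconsistent protocols in the post-evaluation phase, which hinders progress in the field. The paper aims to fix this and clarify how evaluation affects performance.

Result: Much of the performance variation is attributable to inconsistent evaluation rather than to the intrinsic quality of the methods.

Insight: Standardized evaluation protocols are essential for fair comparison in dataset distillation; future work should pay attention to evaluation consistency.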

Abstract: Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead. To address this limitation, recent decoupled dataset distillation methods (e.g., SRe$^2$L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose Rectified Decoupled Dataset Distillation (RD$^3$), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD$^3$ provides a foundation for fair and reproducible comparisons in future dataset distillation research.

[18] Talking Head Generation via AU-Guided Landmark Prediction cs.CV

Shao-Yu Chang, Jingyi Xu, Hieu Le, Dimitris Samaras

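TL;DR: The paper proposes a two-stage framework guided by facial Action Units (AUs) for audio-driven talking head generation, enabling fine-grained expression control.

Details

Motivation: Existing methods rely on emotion labels or implicit AU conditioning and cannot control expressions precisely. The paper addresses this with an explicit mapping from AUs to facial landmarks.

Result: Experiments on the MEAD dataset show the method outperforms existing baselines across multiple metrics.

Insight: Explicit AU-to-landmark modeling offers clear advantages for expression generation, enabling more precise expression control and more realistic video.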

Abstract: We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). Unlike prior methods relying on emotion labels or implicit AU conditioning, our model explicitly maps AUs to 2D facial landmarks, enabling physically grounded, per-frame expression control. In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image. This separation of motion and appearance improves expression accuracy, temporal stability, and visual realism. Experiments on the MEAD dataset show that our method outperforms state-of-the-art baselines across multiple metrics, demonstrating the effectiveness of explicit AU-to-landmark modeling for expressive talking head generation.

[19] ExpFace: Exponential Angular Margin Loss for Deep Face Recognition cs.CV | cs.AI

Jinhui Zheng, Xueyuan Gong

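TL;DR: The paper proposes ExpFace (Exponential Angular Margin Loss), which uses an angular exponential term as the margin to emphasize clean samples and suppress noisy ones, improving face recognition performance.

Details

Motivation: Face recognition is an open-set problem that requires high discriminative power so that intra-class distances stay smaller than inter-class distances. Existing margin-based softmax losses (such as SphereFace, CosFace, and ArcFace) overlook the impact of noisy samples. The authors observe that clean samples cluster in the center region while noisy samples shift toward the periphery, which motivates ExpFace.

Result: ExpFace avoids the training instability of SphereFace and the non-monotonicity of ArcFace, and achieves state-of-the-art performance on multiple benchmarks.

Insight: (1) Noisy samples tend to lie in the peripheral region of the angular space; (2) adapting the penalty helps improve model robustness; (3) a unified analysis framework for loss functions can guide future loss design.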

Abstract: Face recognition is an open-set problem requiring high discriminative power to ensure that intra-class distances remain smaller than inter-class distances. Margin-based softmax losses, such as SphereFace, CosFace, and ArcFace, have been widely adopted to enhance intra-class compactness and inter-class separability, yet they overlook the impact of noisy samples. By examining the distribution of samples in the angular space, we observe that clean samples predominantly cluster in the center region, whereas noisy samples tend to shift toward the peripheral region. Motivated by this observation, we propose the Exponential Angular Margin Loss (ExpFace), which introduces an angular exponential term as the margin. This design applies a larger penalty in the center region and a smaller penalty in the peripheral region within the angular space, thereby emphasizing clean samples while suppressing noisy samples. We present a unified analysis of ExpFace and classical margin-based softmax losses in terms of margin embedding forms, similarity curves, and gradient curves, showing that ExpFace not only avoids the training instability of SphereFace and the non-monotonicity of ArcFace, but also exhibits a similarity curve that applies penalties in the same manner as the decision boundary in the angular space. Extensive experiments demonstrate that ExpFace achieves state-of-the-art performance. To facilitate future research, we have released the source code at: https://github.com/dfr-code/ExpFace.
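The abstract's remark about ArcFace's non-monotonicity can be checked numerically with the standard target-logit forms: CosFace uses cos(θ) − m and ArcFace uses cos(θ + m). The sketch below covers only these two well-known losses; the ExpFace margin itself is not given in the abstract and is deliberately not reproduced. The scale factor `s` that normally multiplies the logit is omitted, the margin defaults are common choices rather than values from this paper, and the helper name is hypothetical.

```python
import math
from typing import Callable


def cosface_logit(theta: float, m: float = 0.35) -> float:
    """CosFace target logit: additive margin in cosine space."""
    return math.cos(theta) - m


def arcface_logit(theta: float, m: float = 0.5) -> float:
    """ArcFace target logit: additive margin in angle space."""
    return math.cos(theta + m)


def is_monotone_decreasing(fn: Callable[[float], float], steps: int = 1000) -> bool:
    """Check on a grid whether fn never increases as theta goes from 0 to pi."""
    prev = fn(0.0)
    for i in range(1, steps + 1):
        cur = fn(math.pi * i / steps)
        if cur > prev + 1e-12:
            return False
        prev = cur
    return True
```

With these defaults, `is_monotone_decreasing(cosface_logit)` holds, while `arcface_logit` starts increasing once θ + m exceeds π. Past that point a sample with a larger angle receives a larger logit, which is the non-monotonic behavior the abstract refers to.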

[20] Logics-Parsing Technical Report cs.CVPDF

Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang

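TL;DR: Logics-Parsing is an end-to-end LVLM-based model that uses reinforcement learning to strengthen layout analysis and reading-order inference, supports diverse data types, and demonstrates SOTA performance on LogicsParsingBench.

Details

Motivation: Existing LVLMs lack an explicit analysis stage for complex document layouts and reading order, which limits their performance on complex documents such as multi-column newspapers and posters.

Result: SOTA performance was validated on LogicsParsingBench across a variety of document analysis scenarios.

Insight: Reward mechanisms in reinforcement learning can effectively improve complex layout analysis, while adding diverse data strengthens the model's generalization.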

Abstract: Recent advances in Large Vision-Language Models (LVLMs) have spurred significant progress in document parsing. Compared to traditional pipeline-based methods, end-to-end paradigms have shown their excellence in converting PDF images into structured outputs through integrated Optical Character Recognition (OCR), table recognition, mathematical formula recognition and so on. However, the absence of explicit analytical stages for document layouts and reading orders limits the LVLM’s capability in handling complex document types such as multi-column newspapers or posters. To address this limitation, we propose in this report Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model’s versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments conducted on LogicsParsingBench have validated the efficacy and state-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing

[21] Sex-based Bias Inherent in the Dice Similarity Coefficient: A Model Independent Analysis for Multiple Anatomical Structures cs.CV | J.3PDF

Hartmut Häntze, Myrthe Buser, Alessa Hering, Lisa C. Adams, Keno K. Bressem

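TL;DR: This study finds that the Dice Similarity Coefficient (DSC) carries a sex-based bias when evaluating medical image segmentation: for smaller anatomical structures, size differences lead to systematically lower DSC scores in women, while large structures are less affected.

Details

Motivation: Although prior work has examined sex-based differences in models or datasets, none has investigated the bias that the DSC itself may introduce. The authors aim to quantify how the DSC differs between sexes in order to make the evaluation of medical image segmentation fairer.

Result: The average DSC difference is about 0.03 for small structures and about 0.01 for medium-sized structures, while large structures (such as the lungs and liver) are almost unaffected.

Insight: When segmentation models are evaluated with the DSC, score differences between sexes may stem from bias in the metric itself rather than from real differences in model performance. This matters for fair evaluation in medical image analysis.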

Abstract: Overlap-based metrics such as the Dice Similarity Coefficient (DSC) penalize segmentation errors more heavily in smaller structures. As organ size differs by sex, this implies that a segmentation error of equal magnitude may result in lower DSCs in women due to their smaller average organ volumes compared to men. While previous work has examined sex-based differences in models or datasets, no study has yet investigated the potential bias introduced by the DSC itself. This study quantifies sex-based differences of the DSC and the normalized DSC in an idealized setting independent of specific models. We applied equally-sized synthetic errors to manual MRI annotations from 50 participants to ensure sex-based comparability. Even minimal errors (e.g., a 1 mm boundary shift) produced systematic DSC differences between sexes. For small structures, average DSC differences were around 0.03; for medium-sized structures around 0.01. Only large structures (i.e., lungs and liver) were mostly unaffected, with sex-based DSC differences close to zero. These findings underline that fairness studies using the DSC as an evaluation metric should not expect identical scores between men and women, as the metric itself introduces bias. A segmentation model may perform equally well across sexes in terms of error magnitude, even if observed DSC values suggest otherwise. Importantly, our work raises awareness of a previously underexplored source of sex-based differences in segmentation performance, one that arises not from model behavior but from the metric itself. Recognizing this factor is essential for more accurate and fair evaluations in medical image analysis.
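
The size dependence of the DSC can be illustrated with a toy calculation. The sketch below assumes perfectly spherical structures and a uniform inward boundary shift, which is an idealization chosen for this example and not the study's actual protocol on MRI annotations; the function name and radii are hypothetical.

```python
def dice_after_erosion(radius_mm: float, shift_mm: float = 1.0) -> float:
    """DSC between a sphere and the same sphere shrunk by shift_mm.

    The shrunken sphere lies entirely inside the original, so the
    intersection equals the smaller volume. The common factor 4/3*pi
    cancels, leaving DSC = 2*(r - s)^3 / (r^3 + (r - s)^3).
    """
    if not 0 <= shift_mm < radius_mm:
        raise ValueError("shift must be non-negative and smaller than the radius")
    truth = radius_mm ** 3
    pred = (radius_mm - shift_mm) ** 3
    return 2 * pred / (truth + pred)

# Same 1 mm boundary error, two hypothetical structure sizes.
small = dice_after_erosion(10.0)
large = dice_after_erosion(30.0)
print(f"small structure DSC: {small:.3f}")
print(f"large structure DSC: {large:.3f}")
```

The identical 1 mm error costs the smaller sphere far more DSC than the larger one, which is the mechanism behind the sex-based differences described above.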

[22] EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction cs.CVPDF

Yu-Shen Huang, Tzu-Han Chen, Cheng-Yen Hsiao, Shaou-Gang Miaou

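TL;DR: The paper proposes EfficienT-HDR, a lightweight ViT-based framework that performs efficient HDR reconstruction through multi-exposure fusion, addressing high computational cost and ghosting artifacts.

Details

Motivation: Achieving high-quality HDR imaging on resource-constrained edge devices is a key challenge; existing methods are computationally expensive and suffer from ghosting.

Result: The main version reduces FLOPs by about 67% and speeds up inference by more than 5x on CPU and 2.5x on an edge device.

Insight: Combining a lightweight design with a ghost-suppression module substantially improves efficiency while maintaining high performance, making the approach suitable for edge devices.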

Abstract: Achieving high-quality High Dynamic Range (HDR) imaging on resource-constrained edge devices is a critical challenge in computer vision, as its performance directly impacts downstream tasks such as intelligent surveillance and autonomous driving. Multi-Exposure Fusion (MEF) is a mainstream technique to achieve this goal; however, existing methods generally face the dual bottlenecks of high computational costs and ghosting artifacts, hindering their widespread deployment. To this end, this study proposes a light-weight Vision Transformer architecture designed explicitly for HDR reconstruction to overcome these limitations. This study is based on the Context-Aware Vision Transformer and begins by converting input images to the YCbCr color space to separate luminance and chrominance information. It then employs an Intersection-Aware Adaptive Fusion (IAAF) module to suppress ghosting effectively. To further achieve a light-weight design, we introduce Inverted Residual Embedding (IRE), Dynamic Tanh (DyT), and propose Enhanced Multi-Scale Dilated Convolution (E-MSDC) to reduce computational complexity at multiple levels. Our study ultimately contributes two model versions: a main version for high visual quality and a light-weight version with advantages in computational efficiency, both of which achieve an excellent balance between performance and image quality. Experimental results demonstrate that, compared to the baseline, the main version reduces FLOPs by approximately 67% and increases inference speed by more than fivefold on CPU and 2.5 times on an edge device. These results confirm that our method provides an efficient and ghost-free HDR imaging solution for edge devices, demonstrating versatility and practicality across various dynamic scenarios.
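
The luminance/chrominance separation step can be sketched with a standard color transform. The abstract does not say which YCbCr variant is used, so the full-range BT.601 (JFIF) coefficients below are an assumption, and `rgb_to_ycbcr` is a name made up for this example.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert one 8-bit RGB pixel to full-range BT.601 (JFIF) YCbCr.

    Y carries luminance; Cb and Cr carry chrominance centered at 128.
    Results are not clamped or rounded here.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

# A neutral gray has no chrominance: Cb and Cr both sit at the 128 midpoint.
print(rgb_to_ycbcr(128, 128, 128))
```

Once the channels are separated this way, exposure differences show up mainly in Y, which is presumably why fusion pipelines treat luminance and chrominance separately.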

[23] BiTAA: A Bi-Task Adversarial Attack for Object Detection and Depth Estimation via 3D Gaussian Splatting cs.CVPDF

Yixun Zhang, Feng Zhou, Jianqin Yin

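TL;DR: BiTAA is a bi-task adversarial attack based on 3D Gaussian Splatting that simultaneously degrades object detection and depth estimation.

Details

Motivation: Existing adversarial attacks are mostly designed for a single task and lack both a mechanism for controllable depth bias and a standard for evaluating cross-task performance. BiTAA aims to fill this gap and study the interaction between object detection and depth estimation.

Result: Experiments show that BiTAA degrades performance consistently across tasks and reveal an asymmetry between detection-to-depth and depth-to-detection transfer.

Insight: Multi-task camera-based perception faces practical risks, and cross-task-aware defenses are needed.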

Abstract: Camera-based perception is critical to autonomous driving yet remains vulnerable to task-specific adversarial manipulations in object detection and monocular depth estimation. Most existing 2D/3D attacks are developed in task silos, lack mechanisms to induce controllable depth bias, and offer no standardized protocol to quantify cross-task transfer, leaving the interaction between detection and depth underexplored. We present BiTAA, a bi-task adversarial attack built on 3D Gaussian Splatting that yields a single perturbation capable of simultaneously degrading detection and biasing monocular depth. Specifically, we introduce a dual-model attack framework that supports both full-image and patch settings and is compatible with common detectors and depth estimators, with optional expectation-over-transformation (EOT) for physical reality. In addition, we design a composite loss that couples detection suppression with a signed, magnitude-controlled log-depth bias within regions of interest (ROIs), enabling controllable near or far misperception while maintaining stable optimization across tasks. We also propose a unified evaluation protocol with cross-task transfer metrics and real-world evaluations, showing consistent cross-task degradation and a clear asymmetry between Det-to-Depth and Depth-to-Det transfer. The results highlight practical risks for multi-task camera-only perception and motivate cross-task-aware defenses in autonomous driving scenarios.
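
One way to picture a loss that couples detection suppression with a signed log-depth bias is sketched below. This is a guess at the general shape only: the abstract gives no formula, so the choice of the maximum detection score, the squared error on the log-depth shift, and every parameter name here are assumptions made for illustration.

```python
import math

def composite_attack_loss(det_scores, depth_adv, depth_clean,
                          sign=1, magnitude=0.2, weight=1.0):
    """Hypothetical attacker objective (lower is better for the attacker).

    det_scores:  detector confidences inside the ROI after perturbation.
    depth_adv:   predicted depths in the ROI under attack (positive values).
    depth_clean: predicted depths in the ROI without attack (positive values).
    sign:        +1 pushes the ROI farther away, -1 pulls it nearer.
    magnitude:   desired shift in log-depth.
    """
    if not depth_clean or len(depth_adv) != len(depth_clean):
        raise ValueError("depth lists must be non-empty and of equal length")
    # Detection suppression: penalize the strongest surviving detection.
    det_term = max(det_scores) if det_scores else 0.0
    # Signed, magnitude-controlled bias: match a target shift in log-depth.
    target_shift = sign * magnitude
    depth_term = sum(
        (math.log(a) - math.log(c) - target_shift) ** 2
        for a, c in zip(depth_adv, depth_clean)
    ) / len(depth_clean)
    return det_term + weight * depth_term

# An unperturbed scene keeps its detection and has zero depth shift.
print(composite_attack_loss([0.9], [2.0], [2.0]))
```

Working in log-depth makes the target shift a relative change, so the same `magnitude` means the same proportional near or far error at any distance.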

[24] StrCGAN: A Generative Framework for Stellar Image Restoration cs.CV | astro-ph.IM | astro-ph.SRPDF

Shantanusinh Parmar

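TL;DR: StrCGAN is a generative model for astronomical image restoration that extends the CycleGAN framework with 3D convolutions, multi-spectral fusion, and astrophysical regularization modules to reconstruct high-quality images of celestial objects.

Details

Motivation: The limited resolution and quality of small-telescope observations make it hard to recover high-fidelity detail in astronomical images, and traditional GAN models such as CycleGAN, which are restricted to 2D mappings, tend to distort the morphology of celestial objects.

Result: Images generated by StrCGAN are visually sharper and more physically consistent, outperforming standard GAN models on astronomical image enhancement.

Insight: Combining domain knowledge (such as astrophysical constraints) with multi-modal data (multi-spectral) can markedly improve the fidelity and usefulness of generative models.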

Abstract: We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity ground truth-like representations of celestial objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but are restricted to 2D mappings and often distort the morphology of stars and galaxies. To overcome these limitations, we extend the CycleGAN framework with three key innovations: 3D convolutional layers to capture volumetric spatial correlations, multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate reconstructions that are not only visually sharper but also physically consistent, outperforming standard GAN models in the task of astrophysical image enhancement.

[25] ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection cs.CVPDF

Tai-Ming Huang, Wei-Tung Lin, Kai-Lung Hua, Wen-Huang Cheng, Junichi Yamagishi

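TL;DR: ThinkFake is a method built on a Multimodal Large Language Model (MLLM) that uses a reasoning prompt and reinforcement learning to achieve interpretable AI-generated image detection, with strong results on multiple benchmarks.

Details

Motivation: The growing realism of AI-generated images has worsened misinformation and privacy violations, creating an urgent need for accurate and interpretable detection. Most existing methods perform binary classification or rely on supervised fine-tuning, and they generalize poorly.

Result: ThinkFake outperforms existing methods on the GenImage benchmark and shows strong generalization in zero-shot tests on LOKI.

Insight: 1. Combining reasoning prompts with reinforcement learning improves interpretability and generalization; 2. A structured detection pipeline effectively improves reasoning quality.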

Abstract: The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations, highlighting the urgent need for accurate and interpretable detection methods. While existing approaches have made progress, most rely on binary classification without explanations or depend heavily on supervised fine-tuning, resulting in limited generalization. In this paper, we propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. Our method leverages a Multimodal Large Language Model (MLLM) equipped with a forgery reasoning prompt and is trained using Group Relative Policy Optimization (GRPO) reinforcement learning with carefully designed reward functions. This design enables the model to perform step-by-step reasoning and produce interpretable, structured outputs. We further introduce a structured detection pipeline to enhance reasoning quality and adaptability. Extensive experiments show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark. These results validate our framework’s effectiveness and robustness. Code will be released upon acceptance.
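
The core idea of Group Relative Policy Optimization, scoring each sampled response against the other responses drawn for the same prompt, can be sketched in a few lines. The reward values below are invented, the paper's reward functions are not reproduced, and the use of the population standard deviation is an assumption since implementations differ on this detail.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are judged relative to their siblings rather than by an
    absolute baseline or a learned value function.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical responses to one image: one correct and well formatted,
# one wrong, two partially rewarded.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Responses with above-average reward get positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and contributes no gradient signal.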

[26] PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents cs.CV | cs.ROPDF

Filippo Ziliotto, Jelin Raphael Akkara, Alessandro Daniele, Lamberto Ballan, Luciano Serafini

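TL;DR: The paper introduces the PersONAL benchmark for studying personalization in embodied agents. It contains more than 2,000 high-quality episodes in which an agent must find a specific user's objects in a household environment in response to natural-language queries.

Details

Motivation: Deploying embodied agents in realistic human-centered scenarios such as households remains challenging, especially when it comes to modeling individual preferences and behaviors.

Result: Experiments reveal a substantial gap between existing methods and human performance, highlighting agents' shortcomings in perceiving, reasoning over, and memorizing personalized information.

Insight: Personalized tasks are a key direction for deploying embodied agents in real-world settings, and further work on perception and reasoning is needed.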

Abstract: Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization), a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as “find Lily’s backpack”. PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning, and memorizing over personalized information, paving the way towards real-world assistive robots.

[27] FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models cs.CVPDF

Xin Wang, Jie Li, Zejia Weng, Yixu Wang, Yifeng Gao

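TL;DR: This paper proposes FreezeVLA, a new adversarial attack on Vision-Language-Action (VLA) models that uses min-max bi-level optimization to generate adversarial images which make the model ignore subsequent instructions, causing the robot to “freeze”. Experiments report an attack success rate of 76.2%, and the adversarial images transfer strongly.

Details

Motivation: VLA models perform well on robotic tasks, but their safety and robustness to adversarial attacks remain underexplored. The paper exposes a “freezing” vulnerability that could leave a robot inactive during critical tasks, underscoring the need for safety research.

Result: Across three VLA models and four robotic benchmarks, the average attack success rate reaches 76.2%, significantly outperforming existing methods. The adversarial images also transfer across different language instructions.

Insight: The study exposes a major safety risk in VLA models and the real threat posed by adversarial attacks, and calls for more robust defenses to ensure the safe deployment of robotic systems.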

Abstract: Vision-Language-Action (VLA) models are driving rapid progress in robotics by enabling agents to interpret multimodal inputs and execute complex, long-horizon tasks. However, their safety and robustness against adversarial attacks remain largely underexplored. In this work, we identify and formalize a critical adversarial vulnerability in which adversarial images can “freeze” VLA models and cause them to ignore subsequent instructions. This threat effectively disconnects the robot’s digital mind from its physical actions, potentially inducing inaction during critical interventions. To systematically study this vulnerability, we propose FreezeVLA, a novel attack framework that generates and evaluates action-freezing attacks via min-max bi-level optimization. Experiments on three state-of-the-art VLA models and four robotic benchmarks show that FreezeVLA attains an average attack success rate of 76.2%, significantly outperforming existing methods. Moreover, adversarial images generated by FreezeVLA exhibit strong transferability, with a single image reliably inducing paralysis across diverse language prompts. Our findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms.

[28] Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection cs.CV | cs.AIPDF

Yunqing Hu, Zheming Yang, Chang Zhao, Wen Ji

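TL;DR: This paper proposes an adaptive, semantically enhanced edge-cloud collaborative object detection method based on Multimodal Large Language Models (MLLM). By dynamically adjusting the parameters of the edge detector, it balances detection accuracy and efficiency in complex scenes.

Details

Motivation: Traditional object detectors degrade in complex scenes such as low light and heavy occlusion because they lack high-level semantic understanding, so semantic information is needed to strengthen detection.

Result: In low-light and heavily occluded scenes, latency drops by over 79% and computational cost by 70%, while detection accuracy is maintained.

Insight: Drawing on the semantic understanding of MLLMs can markedly improve object detection in complex scenes, and an edge-cloud collaborative framework keeps it efficient.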

Abstract: Traditional object detection methods face performance degradation challenges in complex scenarios such as low-light conditions and heavy occlusions due to a lack of high-level semantic understanding. To address this, this paper proposes an adaptive guidance-based semantic enhancement edge-cloud collaborative object detection method leveraging Multimodal Large Language Models (MLLM), achieving an effective balance between accuracy and efficiency. Specifically, the method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts semantic information into parameter adjustment signals for edge detectors, achieving real-time semantic enhancement. Within an edge-cloud collaborative inference framework, the system automatically selects between invoking cloud-based semantic guidance or directly outputting edge detection results based on confidence scores. Experiments demonstrate that the proposed method effectively enhances detection accuracy and efficiency in complex scenes. Specifically, it can reduce latency by over 79% and computational cost by 70% in low-light and highly occluded scenes while maintaining accuracy.
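
The confidence-based choice between edge-only output and cloud guidance can be pictured as a simple routing rule. The sketch below is hypothetical: the abstract does not specify how confidence scores are aggregated or what threshold is used, so the minimum-confidence rule, the 0.5 threshold, and the `Detection` type are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float

def needs_cloud_guidance(detections: list[Detection], threshold: float = 0.5) -> bool:
    """Decide whether to invoke cloud-side semantic guidance.

    Assumed rule: escalate when the edge detector found nothing or when
    its least confident detection falls below the threshold; otherwise
    return the edge result directly.
    """
    if not detections:
        return True
    return min(d.confidence for d in detections) < threshold

# A low-light frame with one shaky detection is escalated to the cloud.
frame = [Detection("car", 0.91), Detection("person", 0.32)]
print("cloud" if needs_cloud_guidance(frame) else "edge")
```

Skipping the cloud call for confident frames is where latency and compute savings of the kind reported in the abstract would plausibly come from.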

[29] Generalized Shortest Path-based Superpixels for 3D Spherical Image Segmentation cs.CVPDF

Rémi Giraud, Rodrigo Borba Pinheiro, Yannick Berthoumieu

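TL;DR: This paper proposes SphSPS, a new superpixel method dedicated to segmenting 360-degree spherical or omnidirectional images, improving on conventional 2D planar segmentation methods applied to spherical images.

Details

Motivation: With wide-angle capture devices becoming common, computer vision needs fast and accurate analysis methods; conventional superpixel methods perform poorly because they ignore the geometry of spherical images.

Result: On the standard 360-degree spherical panorama segmentation dataset and on synthetic omnidirectional road images, SphSPS significantly outperforms existing methods in segmentation accuracy, robustness to noise, and regularity.

Insight: The geometry of spherical images is critical to segmentation quality; conventional 2D methods cannot model the spherical space, and SphSPS fills this gap.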

Abstract: The growing use of wide-angle image capture devices and the need for fast and accurate image analysis in computer vision have reinforced the need for dedicated under-representation approaches. Most recent decomposition methods segment an image into a small number of irregular homogeneous regions, called superpixels. Nevertheless, these approaches are generally designed to segment standard 2D planar images, i.e., captured with a 90° angle view without distortion. In this work, we introduce a new general superpixel method called SphSPS (for Spherical Shortest Path-based Superpixels), dedicated to wide 360° spherical or omnidirectional images. Our method respects the geometry of the 3D spherical acquisition space and generalizes the notion of shortest path between a pixel and a superpixel center to quickly extract relevant clustering features. We demonstrate that considering the geometry of the acquisition space to compute the shortest path makes it possible to jointly improve the segmentation accuracy and the shape regularity of superpixels. To evaluate this regularity aspect, we also generalize a global regularity metric to the spherical space, addressing the limitations of the only existing spherical compactness measure. Finally, the proposed SphSPS method is validated on the reference 360° spherical panorama segmentation dataset and on synthetic road omnidirectional images. Our method significantly outperforms both planar and spherical state-of-the-art approaches in terms of segmentation accuracy, robustness to noise, and regularity, providing a very interesting tool for superpixel-based applications on 360° images.
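
Why planar distances mislead on 360° images can be shown with a small geometric sketch. It assumes an equirectangular layout and measures the great-circle distance between two pixels on the unit sphere; the helper names and the 360x180 image size are made up for this example, and this is not the SphSPS shortest-path feature itself.

```python
import math

def pixel_to_unit_sphere(u: int, v: int, width: int, height: int):
    """Map an equirectangular pixel (column u, row v) to a 3D unit vector."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # latitude in (-pi/2, pi/2)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def geodesic_distance(p, q) -> float:
    """Great-circle distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))

# Pixels on opposite image borders are 359 columns apart in the plane,
# yet they are neighbors on the sphere.
w, h = 360, 180
left = pixel_to_unit_sphere(0, 90, w, h)
right = pixel_to_unit_sphere(359, 90, w, h)
print(geodesic_distance(left, right))
```

The same effect appears near the poles, where an entire image row collapses into a small cap on the sphere, so a clustering distance defined in the image plane would over-segment those regions.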

[30] Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network cs.CVPDF

Pin-Jui Huang, Yu-Hsuan Liao, SooHeon Kim, NoSeong Park, JongBae Park

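TL;DR: The paper proposes CWA-MSN, a new representation learning method for cell images that tackles batch effects with a cross-well aligned masked siamese network while being more efficient in data and model size.

Details

Motivation: Current self-supervised and contrastive learning methods struggle with batch effects in cell images and usually require large models or large datasets. The paper aims for a more efficient solution.

Result: On a gene-gene relationship retrieval task, CWA-MSN outperforms OpenPhenom and CellCLIP by 29% and 9%, respectively, while training on less data or using a smaller model.

Insight: Cross-well alignment effectively mitigates batch effects, and a masked siamese network can still learn effective representations with limited data and a small model, offering a new approach for drug discovery.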

Abstract: Computational models that predict cellular phenotypic responses to chemical and genetic perturbations can accelerate drug discovery by prioritizing therapeutic hypotheses and reducing costly wet-lab iteration. However, extracting biologically meaningful and batch-robust cell painting representations remains challenging. Conventional self-supervised and contrastive learning approaches often require a large-scale model and/or a huge amount of carefully curated data, still struggling with batch effects. We present Cross-Well Aligned Masked Siamese Network (CWA-MSN), a novel representation learning framework that aligns embeddings of cells subjected to the same perturbation across different wells, enforcing semantic consistency despite batch effects. Integrated into a masked siamese architecture, this alignment yields features that capture fine-grained morphology while remaining data- and parameter-efficient. For instance, in a gene-gene relationship retrieval benchmark, CWA-MSN outperforms the state-of-the-art publicly available self-supervised (OpenPhenom) and contrastive learning (CellCLIP) methods, improving the benchmark scores by +29% and +9%, respectively, while training on substantially fewer data (e.g., 0.2M images for CWA-MSN vs. 2.2M images for OpenPhenom) or smaller model size (e.g., 22M parameters for CWA-MSN vs. 1.48B parameters for CellCLIP). Extensive experiments demonstrate that CWA-MSN is a simple and effective way to learn cell image representation, enabling efficient phenotype modeling even under limited data and parameter budgets.

[31] Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering cs.CVPDF

Jiangxue Yu, Hui Wang, San Jiang, Xing Zhang, Dejin Zhang

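TL;DR: This paper proposes an intermediate-view generation method based on 3D Gaussian Splatting to address the perspective distortion caused by viewpoint changes in aerial-ground image matching, substantially increasing the number and quality of matches.

Details

Motivation: Combining aerial and ground images is promising for 3D modeling of complex scenes, but perspective distortion from large viewpoint changes is the main obstacle to reliable matching, so a method to alleviate it is needed.

Result: Experiments show that the method markedly increases the number of initial and refined matches and supports accurate incremental SfM reconstruction and complete 3D Gaussian Splatting scene rendering.

Insight: Generating intermediate views is an effective strategy for multi-view image matching, and 3D Gaussian Splatting delivers high-quality rendering for this kind of task.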

Abstract: The integration of aerial and ground images has been a promising solution in 3D modeling of complex scenes, but it is seriously restricted by the difficulty of finding reliable correspondences. The primary contribution of this study is a feature matching algorithm for aerial and ground images, whose core idea is to generate intermediate views to alleviate perspective distortions caused by the extensive viewpoint changes. First, by using aerial images only, sparse models are reconstructed through an incremental SfM (Structure from Motion) engine due to their large scene coverage. Second, 3D Gaussian Splatting is adopted for scene rendering by taking as inputs sparse points and oriented images. For accurate view rendering, a render viewpoint determination algorithm is designed by using the oriented camera poses of aerial images, which is used to generate high-quality intermediate images that can bridge the gap between aerial and ground images. Third, with the aid of intermediate images, reliable feature matching is conducted for match pairs from render-aerial and render-ground images, and final matches can be generated by transmitting correspondences through intermediate views. Using real aerial and ground datasets, the proposed solution has been validated in terms of feature matching and scene rendering and compared comprehensively with widely used methods. The experimental results demonstrate that the proposed solution can provide reliable feature matches for aerial and ground images with an obvious increase in the number of initial and refined matches, and it can provide enough matches to achieve accurate ISfM reconstruction and complete 3DGS-based scene rendering.
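
The third step, transmitting correspondences through an intermediate view, amounts to composing two match sets on their shared rendered keypoints. The sketch below assumes matches are stored as dictionaries keyed by the rendered image's keypoint index, which is a simplification invented for this example; a real pipeline would likely associate keypoints by pixel position with some tolerance.

```python
def chain_matches(render_to_aerial: dict[int, int],
                  render_to_ground: dict[int, int]) -> list[tuple[int, int]]:
    """Compose render-aerial and render-ground matches into aerial-ground matches.

    Both inputs map a keypoint index in the rendered intermediate image to
    a keypoint index in the aerial or ground image. A final match exists
    only where the same rendered keypoint was matched on both sides.
    """
    return sorted(
        (aerial_kp, render_to_ground[render_kp])
        for render_kp, aerial_kp in render_to_aerial.items()
        if render_kp in render_to_ground
    )

# Rendered keypoints 0 and 2 are matched on both sides; 1 and 3 are not.
render_aerial = {0: 17, 1: 42, 2: 8}
render_ground = {0: 5, 2: 30, 3: 11}
print(chain_matches(render_aerial, render_ground))  # [(8, 30), (17, 5)]
```

Because the rendered view sits between the two viewpoints, each of the two matching problems involves a smaller perspective change than matching aerial to ground directly.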

[32] CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation cs.CVPDF

Miren Samaniego, Igor Rodriguez, Elena Lazkano

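TL;DR: CapStARE proposes a capsule-based spatiotemporal architecture for efficient and robust gaze estimation, combining a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders for slow and rapid dynamics.

Details

Motivation: Current gaze estimation methods underperform in complex scenes and lack robustness and real-time capability.

Result: Achieves SOTA performance on multiple datasets (ETH-XGaze, MPIIFaceGaze, etc.) with real-time inference (<10 ms).

Insight: The capsule structure and dual GRU decoders effectively model local-global relationships and offer better interpretability.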

Abstract: We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found on: https://github.com/toukapy/capsStare
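
The abstract reports its benchmark scores without units. Gaze estimation results are commonly given as the mean angular error in degrees between predicted and ground-truth gaze directions, so the sketch below shows that metric under the assumption that it is the one intended here; the function name is hypothetical.

```python
import math

def angular_error_deg(pred, target) -> float:
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))

# Perpendicular directions are 90 degrees apart; identical ones are 0.
print(angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
print(angular_error_deg((0.0, 0.0, -1.0), (0.0, 0.0, -2.0)))
```

Under that reading, a score such as 3.36 on ETH-XGaze would mean the predicted gaze direction deviates from the true one by about three degrees on average.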

[33] GS-RoadPatching: Inpainting Gaussians via 3D Searching and Placing for Driving Scenes cs.CVPDF

Guo Chen, Jiarun Liu, Sicong Du, Chenming Wu, Deqi Li

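TL;DR: GS-RoadPatching proposes an inpainting method for driving scenes based on 3D Gaussian Splatting (3DGS) that completes scenes efficiently by searching and substituting in 3D space, avoiding the limitations of conventional 2D-view methods.

Details

Motivation: Existing 3DGS inpainting methods rely on 2D-view generative models (such as diffusion models or GANs) to predict missing regions, which falls short in spatial-temporal consistency and efficiency. This paper performs completion and editing directly in the 3DGS modality, avoiding cross-modal consistency issues and the cost of retraining Gaussians.

Result: Experiments on multiple public datasets show that the method outperforms the baselines in quality and efficiency, with the best results in driving scenes. Experiments in general scenes also confirm its broad applicability.

Insight: Highly repetitive patterns in driving scenes share multi-modal similarities within the implicit 3DGS feature space, which makes them well suited to efficient inpainting through structural matching.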

Abstract: This paper presents GS-RoadPatching, an inpainting method for driving scene completion by referring to completely reconstructed regions, which are represented by 3D Gaussian Splatting (3DGS). Unlike existing 3DGS inpainting methods that perform generative completion relying on 2D perspective-view-based diffusion or GAN models to predict limited appearance or depth cues for missing regions, our approach enables substitutional scene inpainting and editing directly through the 3DGS modality, extricating it from requiring spatial-temporal consistency of 2D cross-modals and eliminating the need for time-intensive retraining of Gaussians. Our key insight is that the highly repetitive patterns in driving scenes often share multi-modal similarities within the implicit 3DGS feature space and are particularly suitable for structural matching to enable effective 3DGS-based substitutional inpainting. Practically, we construct feature-embedded 3DGS scenes to incorporate a patch measurement method for abstracting local context at different scales and, subsequently, propose a structural search method to find candidate patches in 3D space effectively. Finally, we propose a simple yet effective substitution-and-fusion optimization for better visual harmony. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness and efficiency of our proposed method in driving scenes, and the results validate that our method achieves state-of-the-art performance compared to the baseline methods in terms of both quality and interoperability. Additional experiments in general scenes also demonstrate the applicability of the proposed 3D inpainting strategy. The project page and code are available at: https://shanzhaguoo.github.io/GS-RoadPatching/

[34] When Words Can’t Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset cs.CV | cs.AIPDF

Sarmistha Das, R E Zera Marveen Lyngkhoi, Kirtan Jain, Vinayak Goyal, Sriparna Saha

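TL;DR: This paper proposes a video-based user complaint text generation task (CoD-V) and introduces ComVID, a multimodal video complaint dataset. Using a new evaluation metric, CR, and a multimodal RAG model built on VideoLLaMA2-7b, it demonstrates effectiveness on complaint generation.

Details

Motivation: Existing complaint mining research relies mainly on text, yet users often struggle to express complaints clearly in writing, whereas video can show the problem directly. The paper therefore aims to use video to help users generate more accurate complaint text.

Result: The study shows that the proposed method performs better on complaint generation than standard video summarization and description approaches, and the CR metric confirms its effectiveness.

Insight: Video has a distinct advantage for expressing complaints, and multimodal models capture user emotion and problem details better, opening a new direction for complaint mining.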

Abstract: While there exists a lot of work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as ‘worst product’ paired with a 5-second video depicting a broken headphone with the right earcup). This paper formulates a new task in the field of complaint mining to aid the common users’ need to write an expressive complaint, which is Complaint Description from Videos (CoD-V) (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos and the corresponding descriptions, also annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that discriminates the proposed (CoD-V) task against standard video summary generation and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG) embedded VideoLLaMA2-7b model, designed to generate complaints while accounting for the user’s emotional state. We conduct a comprehensive evaluation of several Video Language Models on several tasks (pre-trained and fine-tuned versions) with a range of established evaluation metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction to provide a platform for users to express complaints through video. Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.
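
Of the metrics listed, the Coleman-Liau readability score has a simple closed form: CLI = 0.0588 L − 0.296 S − 15.8, where L is the average number of letters per 100 words and S the average number of sentences per 100 words. The sketch below implements it with deliberately naive tokenization (whitespace-separated words, sentence ends at `.`, `!`, `?`), which is an assumption and may differ from the tooling used in the paper.

```python
import re

def coleman_liau_from_counts(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau index from raw counts."""
    if words <= 0:
        raise ValueError("need at least one word")
    letters_per_100 = letters / words * 100.0
    sentences_per_100 = sentences / words * 100.0
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

def coleman_liau(text: str) -> float:
    """Coleman-Liau index of a text using naive word and sentence splitting."""
    words = text.split()
    letters = sum(ch.isalpha() for ch in text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return coleman_liau_from_counts(letters, len(words), sentences)

# A short generated complaint scores at roughly a fourth-grade level.
print(round(coleman_liau("The right earcup is broken. Please replace it."), 3))
```

The index approximates a U.S. school grade level, so lower values indicate text that is easier to read, which is a useful property for complaints meant to be understood quickly.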

Phyo Thet Yee, Dimitrios Kollias, Sudeepta Mishra, Abhinav Dhall

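TL;DR: SynchroRaMa is a multi-modal emotion embedding framework that combines emotional signals from text and audio to generate more expressive and realistic talking face videos, with strong head motion and lip synchronization.

Details

Motivation: Existing methods mostly rely on single-modality emotion embedding, which cannot capture nuanced affective cues, and they condition on a single reference image, which makes it hard to represent dynamic actions or attribute changes.

Result: On benchmark datasets, SynchroRaMa outperforms existing methods in image quality, expression preservation, and motion realism; a user study also confirms higher naturalness and smoothness.

Insight: Combining multi-modal emotion embedding with dynamic scene descriptions markedly improves the expressiveness and temporal consistency of talking face generation.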

Abstract: Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model’s ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by a Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Our project page is available at https://novicemm.github.io/synchrorama.

[36] OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving cs.CVPDF

Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu

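TL;DR: OmniScene introduces a vision-language model and a hierarchical fusion strategy to build a human-like 4D scene understanding framework that substantially improves perception and understanding in autonomous driving systems.

Details

Motivation: Current autonomous driving systems rely mainly on depth-based 3D reconstruction and lack true scene understanding. The work aims for more holistic scene understanding by combining multimodal perception with human-like attention mechanisms.

Result: On the nuScenes dataset, OmniScene outperforms more than ten state-of-the-art models on perception, prediction, planning, and visual question answering, setting new benchmarks.

Insight: Combining visual and textual modalities and emulating human attention can markedly improve scene understanding and behavioral adaptability in autonomous driving systems.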

Abstract: Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.

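[37] CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion cs.CV

Chenhao Ji, Chaohui Yu, Junyao Gao, Fan Wang, Cairong Zhao

TL;DR: This paper proposes CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses, addressing the geometric-consistency challenge that conventional methods face in panoramic video generation.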

Details

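Motivation: Existing camera-controlled video generation methods focus mainly on perspective-projection video, while geometrically consistent panoramic video generation remains challenging. The paper targets the complexities of panoramic pose representation and spherical projection.

Result: Experiments show that CamPVG generates high-quality panoramic videos consistent with the camera trajectory, substantially outperforming existing methods.

Insight: Combining geometric constraints with spherical projection can markedly improve the geometric consistency of panoramic video generation, offering a new approach to camera-controlled panoramic content generation.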

Abstract: Recently, camera-controlled video generation has seen rapid development, offering more precise control over video generation. However, existing methods predominantly focus on camera control in perspective projection video generation, while geometrically consistent panoramic video generation remains challenging. This limitation is primarily due to the inherent complexities in panoramic pose representation and spherical projection. To address this issue, we propose CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses. We achieve camera position encoding for panoramic images and cross-view feature aggregation based on spherical projection. Specifically, we propose a panoramic Plücker embedding that encodes camera extrinsic parameters through spherical coordinate transformation. This pose encoder effectively captures panoramic geometry, overcoming the limitations of traditional methods when applied to equirectangular projections. Additionally, we introduce a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines. This module enables fine-grained cross-view feature aggregation, substantially enhancing the quality and consistency of generated panoramic videos. Extensive experiments demonstrate that our method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
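
To make the panoramic Plücker embedding more concrete, the sketch below computes the standard six-dimensional Plücker coordinates (ray direction plus moment) for one pixel of an equirectangular image. It is an illustration under stated assumptions, not CamPVG's implementation: the axis convention, the pixel-to-angle mapping, and the function names `equirect_ray_direction` and `plucker_embedding` are all hypothetical.

```python
import math

def equirect_ray_direction(u, v, width, height):
    """Unit ray direction in the camera frame for equirectangular pixel (u, v).

    Assumed convention (not taken from the paper): longitude runs from -pi to pi
    left to right, latitude from pi/2 to -pi/2 top to bottom, and y points up.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def matvec(R, x):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(R[i][j] * x[j] for j in range(3)) for i in range(3))

def plucker_embedding(u, v, width, height, R_c2w, cam_center):
    """Return (d, m): world-space ray direction d and moment m = o x d."""
    d = matvec(R_c2w, equirect_ray_direction(u, v, width, height))
    m = cross(cam_center, d)
    return d + m  # 6-tuple

# Example: identity rotation, camera one unit along x, ray through the image center.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embedding = plucker_embedding(1.5, 0.5, 4, 2, identity, (1.0, 0.0, 0.0))
```

Because the moment depends on the camera center, two cameras looking along the same direction from different positions receive different embeddings, which is what lets a per-pixel encoding carry extrinsic information.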

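[38] SDE-DET: A Precision Network for Shatian Pomelo Detection in Complex Orchard Environments cs.CV | cs.AI

Yihao Hu, Pan Wang, Xiaodong Bai, Shijie Cai, Hang Wang

TL;DR: This paper proposes SDE-DET, a model for detecting Shatian pomelo in complex orchard environments that combines the Star Block, Deformable Attention, and multi-scale attention mechanisms, performing well in both accuracy and computational efficiency.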

Details

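Motivation: Shatian pomelo detection is essential for automated harvesting and maturity analysis, but multi-scale targets, occlusion, and small objects in complex orchard environments make detection difficult.

Result: On the STP-AgriData dataset, SDE-DET surpasses mainstream detectors (such as the YOLO series) in precision, recall, mAP, and other metrics, reaching state-of-the-art performance.

Insight: SDE-DET offers an efficient solution for object detection in complex environments and lays the groundwork for automated harvesting robots.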

Abstract: Pomelo detection is an essential process for their localization, automated robotic harvesting, and maturity analysis. However, detecting Shatian pomelo in complex orchard environments poses significant challenges, including multi-scale issues, obstructions from trunks and leaves, small object detection, etc. To address these issues, this study constructs a custom dataset STP-AgriData and proposes the SDE-DET model for Shatian pomelo detection. SDE-DET first utilizes the Star Block to effectively acquire high-dimensional information without increasing the computational overhead. Furthermore, the presented model adopts Deformable Attention in its backbone, to enhance its ability to detect pomelos under occluded conditions. Finally, multiple Efficient Multi-Scale Attention mechanisms are integrated into our model to reduce the computational overhead and extract deep visual representations, thereby improving the capacity for small object detection. In the experiment, we compared SDE-DET with the Yolo series and other mainstream detection models in Shatian pomelo detection. The presented SDE-DET model achieved scores of 0.883, 0.771, 0.838, 0.497, and 0.823 in Precision, Recall, mAP@0.5, mAP@0.5:0.95 and F1-score, respectively. SDE-DET has achieved state-of-the-art performance on the STP-AgriData dataset. Experiments indicate that the SDE-DET provides a reliable method for Shatian pomelo detection, laying the foundation for the further development of automatic harvest robots.
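
As a quick consistency check on the reported numbers, the F1-score is the harmonic mean of precision and recall. The short sketch below (the helper name `f1_score` is ours) recomputes it from the precision and recall quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
print(round(f1_score(0.883, 0.771), 3))  # prints 0.823
```

The result agrees with the F1-score of 0.823 stated in the abstract.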

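[39] Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models cs.CV

Zhifang Zhang, Jiahan Zhang, Shengjie Zhou, Qi Wei, Shuo He

TL;DR: This paper proposes Proxy Targeted Attack (PTA), a method that addresses the limited generalizability and undetectability of targeted adversarial attacks on multimodal pre-trained models. By optimizing adversarial examples with multiple source-modal and target-modal proxies, PTA maintains a high attack success rate while evading defense detection.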

Details

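Motivation: Multimodal pre-trained models such as ImageBind, which align distinct modalities in a shared embedding space, perform well on downstream tasks, but their wide adoption raises security concerns, particularly about targeted adversarial attacks. Existing attacks fall short in generalizability and undetectability.

Result: Experiments show that PTA achieves a high attack success rate across a variety of related targets and remains undetectable under multiple anomaly detection methods.

Insight: Optimizing adversarial examples with multimodal proxies can markedly improve the generalizability and undetectability of targeted attacks, offering a new angle for robustness research on multimodal models.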

Abstract: Multimodal pre-trained models (e.g., ImageBind), which align distinct data modalities into a shared embedding space, have shown remarkable success across downstream tasks. However, their increasing adoption raises serious security concerns, especially regarding targeted adversarial attacks. In this paper, we show that existing targeted adversarial attacks on multimodal pre-trained models still have limitations in two aspects: generalizability and undetectability. Specifically, the crafted targeted adversarial examples (AEs) exhibit limited generalization to partially known or semantically similar targets in cross-modal alignment tasks (i.e., limited generalizability) and can be easily detected by simple anomaly detection methods (i.e., limited undetectability). To address these limitations, we propose a novel method called Proxy Targeted Attack (PTA), which leverages multiple source-modal and target-modal proxies to optimize targeted AEs, ensuring they remain evasive to defenses while aligning with multiple potential targets. We also provide theoretical analyses to highlight the relationship between generalizability and undetectability and to ensure optimal generalizability while meeting the specified requirements for undetectability. Furthermore, experimental results demonstrate that our PTA can achieve a high success rate across various related targets and remain undetectable against multiple anomaly detection methods.

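[40] Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture cs.CV | cs.LG

Nico Schulthess, Ender Konukoglu

TL;DR: This paper proposes an unsupervised anomaly detection method for medical imaging based on DINOv2 embeddings and a Dirichlet Process Mixture model (DPMM), substantially reducing the computational burden while achieving very competitive performance.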

Details

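Motivation: Anomaly detection in medical imaging often relies on small datasets and memory-bank methods, which become computationally expensive as data grows. This work uses DINOv2 embeddings and a DPMM to address computational efficiency and performance on large datasets.

Result: Experiments show very competitive anomaly detection performance on medical imaging benchmarks with computation time at inference at least halved; normalized embeddings prove better suited to anomaly detection.

Insight: Normalized DINOv2 embeddings stay aligned with anatomical structures even when anomalies are present, making them well suited as representations for anomaly detection.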

Abstract: In this work, we leverage informative embeddings from foundational models for unsupervised anomaly detection in medical imaging. For small datasets, a memory-bank of normative features can directly be used for anomaly detection which has been demonstrated recently. However, this is unsuitable for large medical datasets as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as anomaly score function to create a coarse anomaly segmentation mask. Our experiments show that through DPMM embeddings of DINOv2, despite being trained on natural images, achieve very competitive anomaly detection performance on medical imaging benchmarks and can do this while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally more aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them great representations for anomaly detection. The code is available at https://github.com/NicoSchulthess/anomalydino-dpmm.
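
The scoring step described above, comparing each patch embedding with the mixture component centers instead of a memory bank, might look roughly like the sketch below. It assumes the component centers have already been obtained from a fitted DPMM (fitting is omitted) and uses one minus the maximum cosine similarity as the score; the exact similarity function and the names `anomaly_score` and `anomaly_map` are assumptions, not taken from the released code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def anomaly_score(embedding, centers):
    """Assumed score: 1 - max cosine similarity to any component center."""
    e = l2_normalize(embedding)
    best = max(sum(a * b for a, b in zip(e, l2_normalize(c))) for c in centers)
    return 1.0 - best

def anomaly_map(patch_embeddings, centers):
    """Coarse anomaly mask: one score per patch in a 2D grid of embeddings."""
    return [[anomaly_score(e, centers) for e in row] for row in patch_embeddings]

# Toy example with two centers in a 2D embedding space.
centers = [[1.0, 0.0], [0.0, 1.0]]
grid = [[[2.0, 0.0], [1.0, 1.0]],
        [[0.0, 3.0], [-1.0, 0.0]]]
scores = anomaly_map(grid, centers)
```

The cost per patch scales with the number of mixture components rather than the number of stored reference features, which is the source of the inference-time saving the abstract describes.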

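[41] Table Detection with Active Learning cs.CV | cs.AI | cs.CL | cs.LG

Somraj Gautam, Nachiketa Purohit, Gaurav Harit

TL;DR: This paper proposes an approach that combines active learning (AL) with diversity-based strategies for table detection, reducing annotation cost and improving model performance.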

Details

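Motivation: Efficient data annotation is a key challenge in machine learning, especially for object detection tasks that need large amounts of labeled data. Active learning minimizes annotation cost by selecting the most informative samples, and adding diversity-based strategies can further improve sampling efficiency.

Result: Experiments show that AL-based selection significantly outperforms random sampling, maintaining performance comparable to fully supervised models under a limited annotation budget and achieving higher mAP scores within the same budget.

Insight: Incorporating diversity-based strategies into active learning, rather than relying only on traditional uncertainty-based selection, can improve sampling efficiency, especially in object detection.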

Abstract: Efficient data annotation remains a critical challenge in machine learning, particularly for object detection tasks requiring extensive labeled data. Active learning (AL) has emerged as a promising solution to minimize annotation costs by selecting the most informative samples. While traditional AL approaches primarily rely on uncertainty-based selection, recent advances suggest that incorporating diversity-based strategies can enhance sampling efficiency in object detection tasks. Our approach ensures the selection of representative examples that improve model generalization. We evaluate our method on two benchmark datasets (TableBank-LaTeX, TableBank-Word) using state-of-the-art table detection architectures, CascadeTabNet and YOLOv9. Our results demonstrate that AL-based example selection significantly outperforms random sampling, reducing annotation effort given a limited budget while maintaining comparable performance to fully supervised models. Our method achieves higher mAP scores within the same annotation budget.
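
The abstract does not spell out the selection rule, so the following is only a generic sketch of how uncertainty and diversity can be combined when choosing images to annotate. It greedily picks the sample with the highest uncertainty plus a weighted distance to the samples already chosen; the scoring formula, `diversity_weight`, and the function name `select_batch` are all assumptions made for illustration.

```python
import math

def select_batch(features, uncertainty, budget, diversity_weight=1.0):
    """Greedy selection: uncertainty plus distance to the nearest selected sample.

    features    -- one feature vector per unlabeled image
    uncertainty -- one scalar per image (higher means the model is less sure)
    budget      -- number of images to send for annotation
    """
    selected = []
    remaining = set(range(len(features)))
    while remaining and len(selected) < budget:
        def score(i):
            if not selected:
                return uncertainty[i]
            nearest = min(math.dist(features[i], features[j]) for j in selected)
            return uncertainty[i] + diversity_weight * nearest
        best = max(sorted(remaining), key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected

# Two near-duplicate uncertain samples and one distant, less uncertain sample.
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
unc = [0.9, 0.85, 0.5]
print(select_batch(feats, unc, budget=2))                        # prints [0, 2]
print(select_batch(feats, unc, budget=2, diversity_weight=0.0))  # prints [0, 1]
```

With the diversity term switched off, the two near-duplicates are both selected; with it on, the second pick moves to the distant sample, which is the behavior a diversity-based strategy is meant to encourage.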

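[42] Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression cs.CV

Xuekang Zhu, Ji-Zhe Zhou, Kaiwen Feng, Chenfan Qu, Yunfei Wang

TL;DR: The paper proposes the RITA framework, which recasts image manipulation localization as a conditional sequence prediction problem. By predicting manipulated regions layer by layer and modeling the temporal and hierarchical dependencies among editing operations, it addresses existing methods' neglect of the manipulation process.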

Details

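Motivation: Existing image manipulation localization (IML) methods ignore the complexity and sequential nature of the manipulation process and directly produce a localization mask in a single one-shot prediction, which leads to dimensional collapse. RITA is the first to recast IML as a sequence prediction task to better model the hierarchical and temporal character of manipulations.

Result: RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the new hierarchical localization task.

Insight: Modeling the temporal and hierarchical character of the manipulation process is key to improving IML performance, and a sequence prediction paradigm handles manipulation localization more naturally.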

Abstract: Image manipulations often entail a complex manipulation process, comprising a series of editing operations to create a deceptive image, exhibiting sequentiality and hierarchical characteristics. However, existing IML methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse, thereby creating a fundamental mismatch with the intrinsic nature of the IML task. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step’s prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the novel hierarchical localization task, validating its potential as a general and effective paradigm. The code and dataset will be publicly available.

[43] PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction cs.CV

Manahil Raza, Ayesha Azam, Talha Qaiser, Nasir Rajpoot

TL;DR: PS3 is a multimodal Transformer that fuses pathology reports, histology images, and biological pathway data for cancer survival prediction, improving on existing methods.

Details

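Motivation: Existing multimodal fusion methods focus mainly on combining histology images with genomic data and overlook pathology reports, which contain clinical context and expert interpretation and can supply complementary information. The heterogeneity of the modalities (for example, high-dimensional images versus variable-length text) makes fusion challenging.

Result: On six TCGA datasets, PS3 outperforms existing clinical, unimodal, and multimodal baselines, confirming the benefit of pathology reports for survival prediction. The code is open source.

Insight: 1) Standardized representations of pathology reports effectively improve model performance; 2) prototype-based multimodal fusion mitigates data heterogeneity; 3) Transformers are well suited to modeling complex modality interactions.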

Abstract: Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: https://github.com/manahilr/PS3.

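[44] Predictive Quality Assessment for Mobile Secure Graphics cs.CV | cs.LG | I.2.10; I.4.8

Cas Steigstra, Sergey Milyaev, Shaodi You

TL;DR: The paper proposes a lightweight framework that predicts image quality for secure graphic verification on mobile devices, addressing the high false rejection rates caused by poor image acquisition.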

Details

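Motivation: Uncontrolled smartphone capture of high-entropy secure graphics leads to high error rates in verification, so a method that predicts image quality is needed to make the downstream verification task more reliable.

Result: The framework is validated on a large-scale dataset of more than 32,000 images from 105 smartphones, and the cross-domain analysis shows that a frozen pre-trained network generalizes better to an unseen printing technology.

Insight: For domain shifts arising from physical manufacturing, a frozen general-purpose pre-trained backbone can be more robust than a fully fine-tuned model, which tends to overfit to source-domain artifacts.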

Abstract: The reliability of secure graphic verification, a key anti-counterfeiting tool, is undermined by poor image acquisition on smartphones. Uncontrolled user captures of these high-entropy patterns cause high false rejection rates, creating a significant ‘reliability gap’. To bridge this gap, we depart from traditional perceptual IQA and introduce a framework that predictively estimates a frame’s utility for the downstream verification task. We propose a lightweight model to predict a quality score for a video frame, determining its suitability for a resource-intensive oracle model. Our framework is validated using re-contextualized FNMR and ISRR metrics on a large-scale dataset of 32,000+ images from 105 smartphones. Furthermore, a novel cross-domain analysis on graphics from different industrial printing presses reveals a key finding: a lightweight probe on a frozen, ImageNet-pretrained network generalizes better to an unseen printing technology than a fully fine-tuned model. This provides a key insight for real-world generalization: for domain shifts from physical manufacturing, a frozen general-purpose backbone can be more robust than full fine-tuning, which can overfit to source-domain artifacts.

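[45] SHMoAReg: Spark Deformable Image Registration via Spatial Heterogeneous Mixture of Experts and Attention Heads cs.CV

Yuxi Zheng, Jianhui Feng, Tianran Li, Marius Staring, Yuchuan Qiao

TL;DR: The paper proposes SHMoAReg, a deformable image registration network based on a Mixture of Experts mechanism. By introducing a Mixture of Attention heads and a Spatial Heterogeneous Mixture of Experts, it makes feature extraction more specialized and deformation field prediction more heterogeneous, and experiments show consistent improvements over existing methods.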

Details

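Motivation: Current encoder-decoder deformable image registration methods lack specialization in feature extraction and heterogeneity in deformation field prediction, which limits performance.

Result: On a public abdominal CT dataset, the Dice score rises from 60.58% to 65.58%.

Insight: The Mixture of Experts mechanism can improve both performance and interpretability in registration, and a heterogeneous design better matches the nature of three-dimensional deformation fields.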

Abstract: Encoder-Decoder architectures are widely used in deep learning-based Deformable Image Registration (DIR), where the encoder extracts multi-scale features and the decoder predicts deformation fields by recovering spatial locations. However, current methods lack specialized extraction of features (that are useful for registration) and predict deformation jointly and homogeneously in all three directions. In this paper, we propose a novel expert-guided DIR network with Mixture of Experts (MoE) mechanism applied in both encoder and decoder, named SHMoAReg. Specifically, we incorporate Mixture of Attention heads (MoA) into encoder layers, while Spatial Heterogeneous Mixture of Experts (SHMoE) into the decoder layers. The MoA enhances the specialization of feature extraction by dynamically selecting the optimal combination of attention heads for each image token. Meanwhile, the SHMoE predicts deformation fields heterogeneously in three directions for each voxel using experts with varying kernel sizes. Extensive experiments conducted on two publicly available datasets show consistent improvements over various methods, with a notable increase from 60.58% to 65.58% in Dice score for the abdominal CT dataset. Furthermore, SHMoAReg enhances model interpretability by differentiating experts’ utilities across/within different resolution layers. To the best of our knowledge, we are the first to introduce MoE mechanism into DIR tasks. The code will be released soon.
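
The idea of selecting a combination of attention heads per token can be illustrated with a generic top-k routing sketch. This is a common Mixture-of-Experts pattern, assumed here for illustration; the abstract does not describe SHMoAReg's actual router, and `route_top_k` and `mix_heads` are hypothetical names.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring heads and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda h: router_logits[h], reverse=True)[:k]
    peak = max(router_logits[h] for h in order)
    exps = {h: math.exp(router_logits[h] - peak) for h in order}
    total = sum(exps.values())
    return {h: exps[h] / total for h in order}

def mix_heads(head_outputs, router_logits, k=2):
    """Weighted sum of the selected heads' outputs for a single token."""
    weights = route_top_k(router_logits, k)
    dim = len(head_outputs[0])
    return [sum(w * head_outputs[h][d] for h, w in weights.items())
            for d in range(dim)]

# One token, four heads with 2-dimensional outputs; heads 1 and 2 score highest.
outputs = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
logits = [0.0, 2.0, 2.0, -1.0]
print(mix_heads(outputs, logits, k=2))  # prints [1.0, 1.0]
```

Each token gets its own routing weights, so different image regions can draw on different heads, which is the kind of specialization the abstract attributes to the MoA layers.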

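[46] Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing cs.CV

Zizheng Yang, Hu Yu, Bing Li, Jinghao Zhang, Jie Huang

TL;DR: The paper proposes DiffLI$^2$D, an image dehazing method built on the semantic latent space of pre-trained diffusion models. It avoids retraining the diffusion model and the iterative sampling process, and outperforms existing methods.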

Details

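Motivation: Diffusion models show strong potential for image dehazing, but their heavy computational cost and many sampling steps limit wider use. The paper explores the semantic latent space of pre-trained diffusion models to reduce this overhead.

Result: Experiments on multiple datasets show performance superior to existing dehazing methods.

Insight: The semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, offering a new perspective for image dehazing.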

Abstract: Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI$^2$D avoids re-training diffusion models and iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at https://github.com/aaaasan111/difflid.

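[47] Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models cs.CV | cs.AI | cs.LG | cs.RO

Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

TL;DR: This paper proposes the Hyperspectral Adapter, a new architecture that uses pre-trained vision foundation models (such as ViT) to learn effectively from hyperspectral data. It extracts rich spatial-spectral features with a spectral transformer and a spectrum-aware spatial prior module, and achieves state-of-the-art semantic segmentation on three autonomous driving benchmark datasets.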

Details

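Motivation: Hyperspectral imaging (HSI) provides rich spatial and spectral information, but hyperspectral semantic segmentation underperforms because current methods rely on architectures designed for RGB input. The paper addresses this by adapting pre-trained vision foundation models (VFMs).

Result: The method achieves state-of-the-art semantic segmentation on three autonomous driving datasets, outperforming both RGB-based and hyperspectral segmentation methods.

Insight: With an adapter for pre-trained vision foundation models, hyperspectral data can substantially improve semantic segmentation in complex environments, offering a new direction for robotic perception.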

Abstract: Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hyperspectraladapter.cs.uni-freiburg.de.

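[48] A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA cs.CV

Belal Shoer, Yova Kementchedjhieva

TL;DR: The paper proposes a simple data augmentation strategy that generates synthetic data for scientific visual question answering by converting existing image-text pairs into a unified image format, yielding notable gains for a multilingual model.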

Details

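Motivation: Scientific visual question answering is challenging because of the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and text as separate inputs; EXAMS-V introduced a new paradigm that embeds both in a single image, but models still need task-specific fine-tuning. The authors propose a data augmentation strategy to address the scarcity of training data.

Result: In experiments across 13 languages, the method yields notable gains, with strong average improvements and cross-lingual transfer.

Insight: A simple data augmentation strategy can markedly improve multilingual multimodal models on complex tasks such as scientific VQA, especially when training data is scarce.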

Abstract: Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this “text-in-image” format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.

[49] EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models cs.CV | cs.AI

Botai Yuan, Yutian Zhou, Yingjie Wang, Fushuo Huo, Yongcheng Jing

TL;DR: EchoBench is a benchmark for evaluating sycophancy in medical large vision-language models (LVLMs). It reveals that mainstream models respond uncritically to biased user input.

Details

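Motivation: Current medical LVLM benchmarks focus heavily on accuracy and overlook reliability and safety. Sycophancy is especially dangerous in high-stakes clinical settings and therefore calls for systematic evaluation.

Result: All models exhibit substantial sycophancy: many medical-specific models exceed 95%, while the proprietary models Claude 3.7 Sonnet and GPT-4.1 reach 45.98% and 59.15%, respectively. Higher data quality and diversity and stronger domain knowledge reduce sycophancy.

Insight: Accuracy alone is not enough to assess model safety; data diversity and domain knowledge effectively reduce sycophancy, and prompt-level interventions and training strategies can serve as mitigations.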

Abstract: Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy – models’ tendency to uncritically echo user-provided information – in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.

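[50] C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis cs.CV

Min Cen, Zhenfeng Zhuang, Yuzhe Zhang, Min Zeng, Baptiste Magnier

TL;DR: C$^2$MIL is a multiple instance learning method based on dual causal graphs that improves the robustness and interpretability of survival analysis by jointly optimizing semantic and topological causalities.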

Details

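Motivation: Graph-based MIL methods for survival analysis suffer from semantic bias and topological noise, which hurt generalization and interpretability.

Result: Experiments show that C$^2$MIL improves generalization and interpretability over existing methods and can enhance a variety of MIL baselines.

Insight: Jointly optimizing semantic and topological causalities is key to improving MIL models. The code is open source, which eases reproduction and application.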

Abstract: Graph-based Multiple Instance Learning (MIL) is widely used in survival analysis with Hematoxylin and Eosin (H&E)-stained whole slide images (WSIs) due to its ability to capture topological information. However, variations in staining and scanning can introduce semantic bias, while topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C$^2$MIL. C$^2$MIL incorporates a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy combining disentangling supervision and contrastive learning enables simultaneous refinement of both semantic and topological causalities. Experiments demonstrate that C$^2$MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines. The code is available at https://github.com/mimic0127/C2MIL.

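[51] U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT cs.CV | cs.AI

Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li

TL;DR: The paper proposes U-Mamba2-SSL, a semi-supervised learning framework built on U-Mamba2 for tooth and pulp segmentation in CBCT. It combines self-supervised pre-training, consistency regularization, and a pseudo-labeling strategy to achieve strong performance.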

Details

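Motivation: Accurate segmentation of teeth and pulp in CBCT is vital for clinical diagnosis and treatment planning, but the process demands extensive expertise and is time-consuming. The authors therefore propose an automated algorithm that makes effective use of unlabeled data.

Result: On the validation set, U-Mamba2-SSL achieves an average score of 0.872 and a DSC of 0.969.

Insight: Combining self-supervised learning with semi-supervised strategies can markedly improve CBCT image segmentation, especially for tasks where annotation is costly.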

Abstract: Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at https://github.com/zhiqin1998/UMamba2.
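
The two unlabeled-data terms mentioned above, consistency regularization under perturbation and pseudo-labeling with a reduced loss weight, can be sketched on per-voxel foreground probabilities as follows. The specific loss functions (mean squared error and binary cross-entropy), the 0.5 threshold, the default `pseudo_weight`, and all function names are assumptions for illustration and are not taken from the U-Mamba2-SSL code.

```python
import math

def mse(p, q):
    """Mean squared difference between two probability lists."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy of probabilities against 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def unlabeled_loss(pred_clean, pred_perturbed, pseudo_weight=0.1, threshold=0.5):
    """Consistency term plus a down-weighted pseudo-label term."""
    consistency = mse(pred_clean, pred_perturbed)
    pseudo_labels = [1 if p >= threshold else 0 for p in pred_clean]
    pseudo = bce(pred_perturbed, pseudo_labels)
    return consistency + pseudo_weight * pseudo

# Two voxels: predictions on the clean input and on a perturbed copy.
loss = unlabeled_loss([0.9, 0.1], [0.8, 0.3])
```

Keeping `pseudo_weight` below 1 limits how strongly a wrong pseudo-label can pull the model, which matches the abstract's stated reason for the reduced weighting.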

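[52] Optical Ocean Recipes: Creating Realistic Datasets to Facilitate Underwater Vision Research cs.CV

Patricia Schöntag, David Nakath, Judith Fischer, Rüdiger Röttgers, Kevin Köser

TL;DR: This paper proposes the "Optical Ocean Recipes" framework, which creates realistic datasets under controlled underwater conditions to address the lack of generalizability and controllability in underwater machine vision research.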

Details

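Motivation: Underwater machine vision research is limited by the lack of test environments that generalize across optical water types and imaging conditions, so algorithms struggle to adapt to diverse real-world scenarios.

Result: The authors provide a demonstration dataset and show the framework applied to two underwater vision tasks. The dataset and evaluation code will be made public.

Insight: Generating realistic data in a controlled environment with calibrated additives offers a new approach for underwater machine vision research and compensates for the limits of open-water testing.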

Abstract: The development and evaluation of machine vision in underwater environments remains challenging, often relying on trial-and-error-based testing tailored to specific applications. This is partly due to the lack of controlled, ground-truthed testing environments that account for the optical challenges, such as color distortion from spectrally variant light attenuation, reduced contrast and blur from backscatter and volume scattering, and dynamic light patterns from natural or artificial illumination. Additionally, the appearance of ocean water in images varies significantly across regions, depths, and seasons. However, most machine vision evaluations are conducted under specific optical water types and imaging conditions, therefore often lack generalizability. Exhaustive testing across diverse open-water scenarios is technically impractical. To address this, we introduce the Optical Ocean Recipes, a framework for creating realistic datasets under controlled underwater conditions. Unlike synthetic or open-water data, these recipes, using calibrated color and scattering additives, enable repeatable and controlled testing of the impact of water composition on image appearance. Hence, this provides a unique framework for analyzing machine vision in realistic, yet controlled underwater scenarios. The controlled environment enables the creation of ground-truth data for a range of vision tasks, including water parameter estimation, image restoration, segmentation, visual SLAM, and underwater image synthesis. We provide a demonstration dataset generated using the Optical Ocean Recipes and briefly demonstrate the use of our system for two underwater vision tasks. The dataset and evaluation code will be made available.
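
The color distortion from spectrally variant attenuation can be illustrated with the Beer-Lambert law, under which the light surviving a path of length d is scaled by exp(-c d) with a wavelength-dependent coefficient c. The sketch below applies it per RGB channel; the coefficients are made-up illustrative values, not measurements from the paper, and the sketch ignores backscatter and other effects the abstract lists.

```python
import math

# Hypothetical attenuation coefficients in 1/m for (R, G, B);
# illustrative only, chosen so that red is absorbed fastest.
EXAMPLE_ATTENUATION = (0.40, 0.07, 0.05)

def attenuate(rgb, distance_m, attenuation_per_m=EXAMPLE_ATTENUATION):
    """Apply Beer-Lambert attenuation independently to each color channel."""
    return tuple(c * math.exp(-k * distance_m)
                 for c, k in zip(rgb, attenuation_per_m))

# A white target seen through 5 m of water shifts toward blue-green.
r, g, b = attenuate((1.0, 1.0, 1.0), 5.0)
```

Changing the coefficients corresponds to changing the water type, which is the variable that calibrated color additives would control in a tank setup.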

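[53] Universal Camouflage Attack on Vision-Language Models for Autonomous Driving cs.CV | cs.LG

Dehong Kong, Sifan Yu, Siyuan Liang, Jiawei Liang, Jianhou Gan

TL;DR: This paper proposes the first Universal Camouflage Attack (UCA) framework for vision-language models for autonomous driving (VLM-AD). It generates physically realizable camouflage textures in feature space that mislead model decisions, and shows strong generalization and robustness across scenarios and models.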

Details

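Motivation: Existing adversarial attacks either target vision modules and are hard to transfer directly to VLM-AD systems, or are confined to the digital level. The paper addresses both by proposing an attack framework that can be realized in the physical world.

Result: Experiments show that UCA induces incorrect driving commands across a variety of VLM-AD models (a 30% improvement on the 3-P metrics) and remains robust under dynamic conditions.

Insight: The encoder and projection layers of VLM-AD are vulnerable; UCA's success highlights the potential of feature-space optimization and multi-scale learning, pointing to new directions for future security research.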

Abstract: Visual language modeling for automated driving is emerging as a promising research direction with substantial improvements in multimodal reasoning capabilities. Despite its advanced reasoning abilities, VLM-AD remains vulnerable to serious security threats from adversarial attacks, which involve misleading model decisions through carefully crafted perturbations. Existing attacks have obvious challenges: 1) Physical adversarial attacks primarily target vision modules. They are difficult to directly transfer to VLM-AD systems because they typically attack low-level perceptual components. 2) Adversarial attacks against VLM-AD have largely concentrated on the digital level. To address these challenges, we propose the first Universal Camouflage Attack (UCA) framework for VLM-AD. Unlike previous methods that focus on optimizing the logit layer, UCA operates in the feature space to generate physically realizable camouflage textures that exhibit strong generalization across different user commands and model architectures. Motivated by the observed vulnerability of encoder and projection layers in VLM-AD, UCA introduces a feature divergence loss (FDL) that maximizes the representational discrepancy between clean and adversarial images. In addition, UCA incorporates a multi-scale learning strategy and adjusts the sampling ratio to enhance its adaptability to changes in scale and viewpoint diversity in real-world scenarios, thereby improving training stability. Extensive experiments demonstrate that UCA can induce incorrect driving commands across various VLM-AD models and driving scenarios, significantly surpassing existing state-of-the-art attack methods (improving 30% in 3-P metrics). Furthermore, UCA exhibits strong attack robustness under diverse viewpoints and dynamic conditions, indicating high potential for practical deployment.

[54] PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation cs.CVPDF

Mahmoud Khater, Mona Strauss, Philipp von Olshausen, Alexander Reiterer

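TL;DR: PU-Gaussian is a point cloud upsampling method based on a 3D Gaussian representation. It models local geometric neighborhoods and samples from them explicitly to generate dense point clouds, achieving state-of-the-art performance on multiple datasets.

Details

Motivation: Existing point cloud upsampling methods often sacrifice geometric interpretability or lack robustness to sparse input. PU-Gaussian aims to overcome these limitations by modeling local geometric structure with 3D Gaussian distributions.

Result: Achieves state-of-the-art performance on the PU1K and PUGAN datasets, producing denser, higher-quality point clouds.

Insight: Explicitly modeling local geometric structure gives PU-Gaussian greater robustness and geometric fidelity in point cloud upsampling, and combining sampling with a refinement module improves output quality.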

Abstract: Point clouds produced by 3D sensors are often sparse and noisy, posing challenges for tasks requiring dense and high-fidelity 3D representations. Prior work has explored both implicit feature-based upsampling and distance-function learning to address this, but often at the expense of geometric interpretability or robustness to input sparsity. To overcome these limitations, we propose PU-Gaussian, a novel upsampling network that models the local neighborhood around each point using anisotropic 3D Gaussian distributions. These Gaussians capture the underlying geometric structure, allowing us to perform upsampling explicitly in the local geometric domain by direct point sampling. The sampling process generates a dense, but coarse, point cloud. A subsequent refinement network adjusts the coarse output to produce a more uniform distribution and sharper edges. We perform extensive testing on the PU1K and PUGAN datasets, demonstrating that PU-Gaussian achieves state-of-the-art performance. We make code and model weights publicly available at https://github.com/mvg-inatech/PU-Gaussian.git.
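The "direct point sampling" step in the abstract can be illustrated with a minimal sketch. It assumes the Gaussian for a neighborhood is simply the sample mean and covariance of the neighboring points; in the paper the Gaussians come from a network, so `mean_and_cov`, `cholesky3`, and `sample_gaussian` are hypothetical helper names and not the authors' code.

```python
import math
import random


def mean_and_cov(points, jitter=1e-6):
    """Fit a 3D Gaussian (mean, covariance) to a local neighborhood.

    Assumption: the sample covariance stands in for the network-predicted
    anisotropic Gaussian. The jitter keeps the matrix positive definite
    when the neighborhood is flat.
    """
    n = len(points)
    mu = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [
        [sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n for j in range(3)]
        for i in range(3)
    ]
    for k in range(3):
        cov[k][k] += jitter
    return mu, cov


def cholesky3(a):
    """Lower-triangular Cholesky factor of a 3x3 positive definite matrix."""
    low = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gaussian(mu, cov, count, seed=0):
    """Draw `count` points x = mu + L z with z ~ N(0, I) and cov = L L^T."""
    rng = random.Random(seed)
    low = cholesky3(cov)
    samples = []
    for _ in range(count):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        samples.append(
            [mu[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
        )
    return samples


# A flat patch in the plane z = 0: new samples should also stay near that plane.
patch = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
mu, cov = mean_and_cov(patch)
dense = sample_gaussian(mu, cov, count=16)
```

Because the covariance is nearly singular along the patch normal, the samples spread within the surface and not away from it, which is the intuition behind using anisotropic Gaussians. The result corresponds to the dense but coarse cloud that the abstract says a refinement network then adjusts.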

[55] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression cs.CV | cs.AI | cs.LGPDF

Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir

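TL;DR: Using a framework that systematically suppresses shape, texture, and color cues, this paper challenges the assumption that CNNs are texture-biased. It finds that CNNs rely mainly on local shape features, and that this reliance can be reduced through training strategies or architectures. Models from different domains show different feature-reliance patterns.

Details

Motivation: The assumption in prior work that CNNs are texture-biased may stem from limitations in experimental design. This paper re-examines the assumption with a more rigorous experimental method.

Result: CNNs rely mainly on local shape features rather than texture, and this reliance can be changed through modern training strategies or architectures. Feature-reliance patterns differ systematically across domains.

Insight: 1. CNN feature reliance is flexible and can be adjusted through training or architecture. 2. The characteristics of data in different domains lead to different feature-reliance patterns, which points to the importance of domain adaptation.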

Abstract: The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.

[56] An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation cs.CVPDF

Kwang-Hyun Uhm, Hyunjun Cho, Sung-Hoo Hong, Seung-Won Jung

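TL;DR: This paper proposes an anisotropic cross-view texture transfer method based on multi-reference non-local attention for CT slice interpolation, substantially improving the quality of generated intermediate slices.

Details

Motivation: Because of high storage and operation costs, clinical CT images are usually acquired with large slice thickness, yielding anisotropic volumes. Existing methods do not fully exploit this anisotropy.

Result: Outperforms existing methods on public CT datasets, including a real-paired benchmark.

Insight: Anisotropy can be an asset in CT image processing: cross-view texture transfer effectively improves interpolation quality.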

Abstract: Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at https://github.com/khuhm/ACVTT.

[57] 4D Driving Scene Generation With Stereo Forcing cs.CVPDF

Hao Lu, Zhuang Ma, Guangfeng Jiang, Wenhang Ge, Bohan Li

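TL;DR: PhiGenesis is a unified framework for 4D driving scene generation that combines geometric and temporal consistency and supports spatiotemporal extrapolation and novel view synthesis.

Details

Motivation: Current generative models struggle to synthesize dynamic 4D driving scenes without per-scene optimization while supporting both temporal extrapolation and spatial novel view synthesis.

Result: Achieves state-of-the-art performance on geometric reconstruction, temporal generation, and novel view synthesis.

Insight: Stereo Forcing improves the temporal consistency of novel view synthesis by dynamically adjusting generative influence according to geometric uncertainty.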

Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Our experimental results demonstrate that our method achieves state-of-the-art performance in appearance and geometric reconstruction, temporal generation, and novel view synthesis (NVS), while also delivering competitive performance in downstream evaluations. Homepage: https://jiangxb98.github.io/PhiGensis

[58] A co-evolving agentic AI system for medical imaging analysis cs.CV | q-bio.QMPDF

Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu

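TL;DR: TissueLab is a co-evolving AI system for medical image analysis that integrates tools across domains and supports real-time interaction and expert feedback, delivering strong performance and rapid adaptation to new tasks.

Details

Motivation: The performance and adoption of current AI systems in medical image analysis are limited, mainly by the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time expert feedback.

Result: Outperforms existing end-to-end vision-language models and other AI systems on clinical tasks, and adapts quickly to unseen disease contexts.

Insight: The co-evolution paradigm and real-time feedback are key to improving the performance and adaptability of medical AI systems.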

Abstract: Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, their performance and adoption remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present “TissueLab”, a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.

[59] HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy cs.CVPDF

Dayu Tan, Zhenpeng Xu, Yansen Su, Xin Peng, Chunhou Zheng

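TL;DR: HiPerformer is a high-performance global-local segmentation model that addresses feature inconsistency and information loss in existing methods through a modular hierarchical architecture and a dynamic feature fusion strategy.

Details

Motivation: Both local detail and global context are essential in medical image segmentation, but existing CNN-Transformer hybrids fuse features in ways that cause information conflict and loss.

Result: Experiments on 11 public datasets show that HiPerformer outperforms existing segmentation methods in accuracy and robustness.

Insight: Modular design and dynamic feature fusion effectively avoid information loss; joint local-global optimization is key to better segmentation performance.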

Abstract: Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.

[60] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation cs.CVPDF

Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu

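TL;DR: PhysCtrl is a framework for controllable, physics-grounded video generation. A diffusion model conditioned on physical parameters and forces generates physically plausible 3D motion trajectories.

Details

Motivation: Existing video generation models lack physical plausibility and 3D controllability, so a control method that incorporates physical parameters and forces is needed.

Result: Experiments show that videos generated with PhysCtrl surpass existing methods in visual quality and physical plausibility.

Insight: Combining physics simulation with generative models can substantially improve the physical plausibility and controllability of video generation.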

Abstract: Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl

[61] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning cs.CVPDF

Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu

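TL;DR: EditVerse is a unified framework for image and video generation and editing. It represents text, images, and video as a single token sequence and learns over them with self-attention. To address the scarcity of video editing data, the authors design a data pipeline and introduce EditVerseBench, the first benchmark for instruction-based video editing. Experiments show that EditVerse outperforms existing open-source and commercial models and exhibits emergent cross-modal editing and generation abilities.

Details

Motivation: Image generation and editing have converged on unified frameworks, but the video domain remains fragmented, mainly because of architectural limitations and data scarcity. EditVerse addresses these problems with a unified multimodal representation and self-attention.

Result: Experiments and user studies show that EditVerse outperforms existing open-source and commercial models and exhibits emergent cross-modal editing and generation abilities.

Insight: EditVerse's success suggests that a unified multimodal representation with self-attention can effectively address fragmentation in the video domain, and that an efficient data pipeline is essential when data is scarce.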

Abstract: Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.

cs.CL [Back]

[62] FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering cs.CL | cs.AIPDF

Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson

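TL;DR: FHIR-AgentBench is a benchmark that evaluates the question-answering ability of large language model (LLM) agents on realistic interoperable clinical data (the HL7 FHIR standard). It contains 2,931 clinical questions and compares different data retrieval, interaction, and reasoning strategies.

Details

Motivation: As healthcare data standardization shifts to HL7 FHIR, existing benchmarks cannot realistically evaluate models on this complex resource model, so new evaluation tools are needed.

Result: Experiments reveal the practical challenges of retrieving data from FHIR resources and reasoning over complex logic, both of which significantly affect question-answering performance.

Insight: The complexity of interoperable clinical data and the difficulty of reasoning over it are the key bottlenecks; future work should specifically improve retrieval and reasoning.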

Abstract: The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.

[63] Readme_AI: Dynamic Context Construction for Large Language Models cs.CL | cs.AIPDF

Millie Vyas, Timothy Blattner, Alden Dima

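TL;DR: This paper presents Readme_AI, a protocol for dynamically building context for large language models (LLMs). Using a metadata file provided by the data source owner, it substantially improves the accuracy and reliability of LLM answers to dataset-specific queries.

Details

Motivation: Although LLMs are trained on large amounts of data, they can still give inaccurate or unreliable information for a user's specific query. Dynamically building query-relevant context can substantially improve response quality.

Result: In experiments, Readme_AI markedly improved LLM answers to queries about the NIST Hedgehog library: instead of irrelevant or hallucinated information, the LLM could reason about the library's use and generate code examples.

Insight: The study shows the potential of dynamic context construction for improving LLM performance in specialized domains; metadata supplied directly by the data source owner in particular helps meet user needs more accurately.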

Abstract: Despite being trained on significant amounts of data, Large Language Models (LLMs) can provide inaccurate or unreliable information in the context of a user’s specific query. Providing query-specific context significantly improves the usefulness of their responses. In this paper, we present a specification that can be used to dynamically build context for data sources. The data source owner creates a file containing metadata for LLMs to use when reasoning about dataset-related queries. To demonstrate our proposed specification, we created a prototype Readme_AI Model Context Protocol (MCP) server that retrieves the metadata from the data source and uses it to dynamically build context. Some features that make this specification dynamic are the extensible types that represent crawling web pages, fetching data from data repositories, downloading and parsing publications, and general text. The context is formatted and grouped using user-specified tags that provide clear contextual information for the LLM to reason about the content. We demonstrate the capabilities of this early prototype by asking the LLM about the NIST-developed Hedgehog library, for which common LLMs often provide inaccurate and irrelevant responses containing hallucinations. With Readme_AI, the LLM receives enough context that it is now able to reason about the library and its use, and even generate code interpolated from examples that were included in the Readme_AI file provided by Hedgehog’s developer. Our primary contribution is an extensible protocol for dynamically grounding LLMs in specialized, owner-provided data, enhancing responses from LLMs and reducing hallucinations. The source code for the Readme_AI tool is available at https://github.com/usnistgov/readme_ai.

[64] Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers cs.CL | cs.AIPDF

Ruochi Li, Haoxuan Zhang, Edward Gehringer, Ting Xiao, Junhua Ding

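TL;DR: This paper examines the strengths and weaknesses of large language models (LLMs) in automatically generating reviews of scientific papers. LLMs handle descriptive and affirmational content well but lack critical reasoning and contextual understanding. On a large-scale benchmark built by the authors, LLMs perform poorly at identifying a paper's weaknesses.

Details

Motivation: With the surge in paper submissions straining traditional peer review, researchers are exploring LLM-generated reviews to improve efficiency. However, LLMs' shortcomings in critical thinking and contextual understanding have not been systematically evaluated.

Result: LLMs do well at describing a paper's contributions and methods (for example, GPT-4o generates 15.74% more entities than human reviewers in the strengths section) but fall clearly short at identifying weaknesses and adjusting feedback to paper quality (for example, GPT-4o generates 59.42% fewer entities than human reviewers in the weaknesses section).

Insight: The study provides an empirical basis for developing LLM-assisted reviewing tools; future work needs to improve LLMs' critical reasoning and contextual adaptability.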

Abstract: The surge in scientific submissions has placed increasing strain on the traditional peer-review process, prompting the exploration of large language models (LLMs) for automated review generation. While LLMs demonstrate competence in producing structured and coherent feedback, their capacity for critical reasoning, contextual grounding, and quality sensitivity remains limited. To systematically evaluate these aspects, we propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics to assess LLM-generated reviews against human-written counterparts. We construct a large-scale benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS in multiple years, and generate reviews using five LLMs. Our findings show that LLMs perform well in descriptive and affirmational content, capturing the main contributions and methodologies of the original work, with GPT-4o highlighted as an illustrative example, generating 15.74% more entities than human reviewers in the strengths section of good papers in ICLR 2025. However, they consistently underperform in identifying weaknesses, raising substantive questions, and adjusting feedback based on paper quality. GPT-4o produces 59.42% fewer entities than real reviewers in the weaknesses and increases node count by only 5.7% from good to weak papers, compared to 50% in human reviews. Similar trends are observed across all conferences, years, and models, providing empirical foundations for understanding the merits and defects of LLM-generated reviews and informing the development of future LLM-assisted reviewing tools. Data, code, and more detailed results are publicly available at https://github.com/RichardLRC/Peer-Review.

[65] How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment cs.CL | stat.MEPDF

Julie Jung, Max Lu, Sina Chole Benker, Dogus Darici

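TL;DR: Examines how model size, temperature, and prompt style affect the alignment between LLM and human scores, and finds that model size is a key factor.

Details

Motivation: To investigate how model size, temperature, and prompt style affect an LLM's consistency with itself, with other models, and with human raters when assessing clinical reasoning skills.

Result: Model size has a significant effect on LLM-human score alignment.

Insight: The study stresses that practical applications should consider model size alongside other factors to ensure reliable assessment.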

Abstract: We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) with themselves, with other models, and with humans in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment across multiple levels.

[66] Quantifying Compositionality of Classic and State-of-the-Art Embeddings cs.CL | cs.AIPDF

Zhijin Guo, Chenhao Xue, Zhaozhen Xu, Hongbo Bo, Yuxuan Ye

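TL;DR: This paper proposes a method for quantitatively evaluating compositionality in static and modern language model embeddings, and shows how compositionality varies across training stages and layers.

Details

Motivation: Static embeddings such as Word2vec make excessive claims about compositionality, while modern generative models such as transformers place no real limits on context-driven shifts in meaning. This paper quantifies compositionality to give an objective evaluation of models' compositional ability.

Result: Experiments show stronger compositional signals in later training stages across data modalities, and in deeper transformer layers before a decline at the top layer.

Insight: Compositionality does not grow monotonically; it depends on training stage and layer. Modern models may capture compositionality better in deeper layers, but the top layer may decline due to overfitting or other causes.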

Abstract: For language models to generalize correctly to novel expressions, it is critical that they exploit compositional meanings when this is justified. Even if we don’t know what a “pelp” is, we can use our knowledge of numbers to understand that “ten pelps” makes more pelps than “two pelps”. Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. State-of-the-art generative transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. Sentences, knowledge graphs, and word embeddings are evaluated, and compositionality is tracked across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at https://github.com/Zhijin-Guo1/quantifying-compositionality.
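Step (ii) of the evaluation, reconstructing an embedding for an unseen attribute combination and scoring it, can be sketched as follows. This assumes a purely additive model e(a, b) = u_a + v_b and uses the parallelogram identity as the reconstruction rule; the paper's actual procedure (including the CCA step) is more general, and the attribute names and vectors here are made up for illustration.

```python
import math


def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]


def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def reconstruct_unseen(e_a2_b1, e_a1_b2, e_a1_b1):
    """Predict e(a2, b2) as e(a2, b1) + e(a1, b2) - e(a1, b1).

    Under an additive model e(a, b) = u_a + v_b the shared terms cancel
    and the prediction equals u_a2 + v_b2 exactly.
    """
    return vec_sub(vec_add(e_a2_b1, e_a1_b2), e_a1_b1)


# Hypothetical attribute vectors: two colors and two shapes.
red, blue = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
circle, square = [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]

# Seen combinations follow the additive model; (blue, square) is held out.
predicted = reconstruct_unseen(
    vec_add(blue, circle), vec_add(red, square), vec_add(red, circle)
)

additive_truth = vec_add(blue, square)  # perfectly compositional embedding
interacting_truth = [0.5, 1.5, 1.0]     # same embedding plus a non-additive term

score_additive = (l2(predicted, additive_truth), cosine(predicted, additive_truth))
score_interacting = (l2(predicted, interacting_truth), cosine(predicted, interacting_truth))
```

When the held-out embedding really is additive, the L2 error is zero and the cosine similarity is one. The interaction term lowers both scores, which is the kind of failure case the abstract says these metrics are meant to expose.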

[67] Pluralistic Off-policy Evaluation and Alignment cs.CL | cs.AIPDF

Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang

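TL;DR: This paper proposes Pluralistic Off-Policy Evaluation (POPE), a framework for offline evaluation and preference alignment of large language models (LLMs) under diverse user preferences. It addresses the tendency of existing methods to ignore preference diversity and focus only on overall utility.

Details

Motivation: Existing preference alignment datasets for LLMs are usually logged under policies that differ substantially from the evaluated models, and existing off-policy evaluation methods consider only overall utility and ignore preference diversity. How to extend off-policy evaluation (OPE) to pluralistic preference alignment therefore remains an open question.

Result: Experiments show that POPE efficiently improves pluralistic response generation while preserving the model's general capabilities on downstream tasks.

Insight: Preference alignment should account for diversity measures in addition to overall utility in order to better reflect the plurality of human preferences. POPE provides a theoretical and practical foundation for pluralistic preference alignment.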

Abstract: Personalized preference alignment for LLMs with diverse human preferences requires evaluation and alignment methods that capture pluralism. Most existing preference alignment datasets are logged under policies that differ substantially from the evaluated LLMs, and existing off-policy estimators focus solely on overall utility while ignoring preference pluralism. Extending Off-Policy Evaluation (OPE) to pluralistic preference alignment therefore remains an open question. We propose Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines (1) a collaborative utility component derived from human preference signals (e.g., upvotes or relevance scores) and (2) a diversity component inspired by entropy-based coverage measures, together reflecting pluralistic alignment. Furthermore, to estimate this reward from logged interactions, we derive decomposable inverse propensity scoring (IPS) estimators that separately evaluate relevance and diversity. Theoretically, we prove that our decomposed IPS estimators establish a lower bound on their variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance pluralistic alignment. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models’ general capabilities on downstream tasks.
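For readers unfamiliar with inverse propensity scoring, the idea of evaluating relevance and diversity separately from logged data can be sketched as below. This is the textbook IPS estimator applied twice, not POPE's estimator: the record fields, the per-record `diversity` score, the additive combination, and `diversity_weight` are all assumptions made for illustration.

```python
def ips_estimate(logs, target_prob, signal):
    """Standard IPS estimate of the expected `signal` under the target policy.

    Each log record holds the context, the logged action, the probability the
    logging policy gave that action, and the observed signal values.
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for rec in logs:
        # Importance weight: how much more (or less) likely the target policy
        # is to take the logged action than the logging policy was.
        weight = target_prob(rec["context"], rec["action"]) / rec["logging_prob"]
        total += weight * rec[signal]
    return total / len(logs)


def pluralistic_value(logs, target_prob, diversity_weight=0.5):
    """Combine separately estimated relevance and diversity terms (assumed additive)."""
    relevance = ips_estimate(logs, target_prob, "relevance")
    diversity = ips_estimate(logs, target_prob, "diversity")
    return relevance + diversity_weight * diversity


# Hypothetical logs: the logging policy chose between responses "a" and "b" uniformly.
logs = [
    {"context": "q1", "action": "a", "logging_prob": 0.5, "relevance": 1.0, "diversity": 0.2},
    {"context": "q1", "action": "b", "logging_prob": 0.5, "relevance": 0.0, "diversity": 1.0},
    {"context": "q2", "action": "a", "logging_prob": 0.5, "relevance": 1.0, "diversity": 0.2},
    {"context": "q2", "action": "b", "logging_prob": 0.5, "relevance": 0.0, "diversity": 1.0},
]


def target_policy(context, action):
    """Hypothetical target policy that prefers response "a" regardless of context."""
    return 0.8 if action == "a" else 0.2


value = pluralistic_value(logs, target_policy)
```

Keeping the two estimates separate makes it possible to see whether a policy gains value through relevance or through coverage, which is the motivation the abstract gives for a decomposable estimator.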

[68] SCORE: A Semantic Evaluation Framework for Generative Document Parsing cs.CL | cs.AIPDF

Renyu Li, Antonio Jimeno Yepes, Yao You, Kamil Pluciński, Maximilian Operlejn

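TL;DR: SCORE is a semantic evaluation framework for generative document parsing systems. It addresses the evaluation bias that arises when traditional metrics ignore semantically equivalent variation in output.

Details

Motivation: Traditional metrics (such as CER, WER, IoU, and TEDS) cannot recognize outputs from generative parsers that are semantically correct but structurally divergent, which leads to misjudgment and distorted evaluation. SCORE aims to provide a semantics-oriented framework that tolerates this diversity while strictly evaluating semantic accuracy.

Result: Experiments on 1,114 pages show that SCORE correctly handles cases that traditional metrics misjudge (penalties of 12-25%), and that normalizing generative outputs reproduces traditional scores (for example, table F1 up to 0.93).

Insight: SCORE reveals how interpretive diversity affects evaluation outcomes and offers a fair, practical, multi-dimensional evaluation standard for modern document parsing systems.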

Abstract: Multi-modal generative document parsing systems challenge traditional evaluation: unlike deterministic OCR or layout models, they often produce semantically correct yet structurally divergent outputs. Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE (Structural and COntent Robust Evaluation), an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks. Together, these dimensions enable evaluation that embraces representational diversity while enforcing semantic rigor. Across 1,114 pages spanning a holistic benchmark and a field dataset, SCORE consistently revealed cross-dataset performance patterns missed by standard metrics. In 2-5% of pages with ambiguous table structures, traditional metrics penalized systems by 12-25% on average, leading to distorted rankings. SCORE corrected these cases, recovering equivalence between alternative but valid interpretations. Moreover, by normalizing generative outputs into a format-agnostic representation, SCORE reproduces traditional scores (e.g., table F1 up to 0.93) without requiring object-detection pipelines, demonstrating that generative parsing alone suffices for comprehensive evaluation. By exposing how interpretive diversity impacts evaluation outcomes and providing multi-dimensional, interpretable diagnostics, SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems.
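The problem the abstract describes, a character-level metric penalizing two valid renderings of the same table, can be reproduced with a small sketch. The `normalize` function below is a toy stand-in (strip markup, collapse whitespace, lowercase) and is not SCORE's adjusted edit distance; it only illustrates why a format-agnostic representation removes the spurious penalty.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def normalize(text):
    """Toy format-agnostic form: drop HTML tags and Markdown table syntax.

    Assumption for illustration only; it would also strip hyphens and pipes
    that are part of real cell content.
    """
    text = re.sub(r"<[^>]+>", " ", text)  # HTML tags
    text = re.sub(r"[|\-]+", " ", text)   # Markdown pipes and rule lines
    return " ".join(text.lower().split())


# The same two-row table, once as HTML and once as Markdown.
html_table = (
    "<table><tr><td>Year</td><td>Sales</td></tr>"
    "<tr><td>2024</td><td>10</td></tr></table>"
)
markdown_table = "| Year | Sales |\n|---|---|\n| 2024 | 10 |"

raw_cer = cer(html_table, markdown_table)
normalized_cer = cer(normalize(html_table), normalize(markdown_table))
```

The raw CER is large because most characters are markup, while the normalized CER is zero because both strings reduce to the same cell sequence.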

[69] Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches cs.CLPDF

Maryam Mahdi Alhusseini, Mohammad-Reza Feizi-Derakhshi

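TL;DR: This study proposes a dual-perspective sentiment analysis approach that combines lexicon-based sentiment analysis with deep learning models (CNN and Bi-LSTM) to analyze user reviews of ChatGPT and DeepSeek. ChatGPT receives more positive sentiment, and the CNN outperforms the Bi-LSTM.

Details

Motivation: To assess user satisfaction with large language model (LLM) applications more comprehensively, the authors combine the strengths of lexicon-based methods and deep learning models, addressing the single-perspective limitation of earlier studies.

Result: ChatGPT receives more positive sentiment. Deep learning outperforms lexicon-based analysis: the CNN reaches 96.41% accuracy and classifies negative reviews almost perfectly.

Insight: The study offers a new approach to sentiment analysis for LLM applications and practical guidance for developers seeking to improve user experience.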

Abstract: This study presents a novel dual-perspective approach to analyzing user reviews for ChatGPT and DeepSeek on the Google Play Store, integrating lexicon-based sentiment analysis (TextBlob) with deep learning classification models, including Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks. Unlike prior research, which focuses on either lexicon-based strategies or predictive deep learning models in isolation, this study conducts an extensive investigation into user satisfaction with Large Language Model (LLM) based applications. A dataset of 4,000 authentic user reviews was collected, carefully preprocessed, and oversampled to achieve balanced classes. A balanced test set of 1,700 reviews was used for model testing. Results from the experiments reveal that ChatGPT received significantly more positive sentiment than DeepSeek. Furthermore, deep-learning-based classification demonstrated superior performance over lexicon analysis, with CNN outperforming Bi-LSTM by achieving 96.41% accuracy and near-perfect classification of negative reviews, alongside high F1-scores for neutral and positive sentiments. This research sets a new methodological standard for measuring sentiment in LLM-based applications and provides practical insights for developers and researchers seeking to improve user-centric AI system design.

[70] ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL | cs.LGPDF

Robert Tjarko Lange, Yuki Imajuku, Edoardo Cetin

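TL;DR: ShinkaEvolve is an open-source framework built on large language models (LLMs) that uses new sampling and search strategies to substantially improve the sample efficiency and solution quality of code evolution, and applies to a broad range of scientific discovery tasks.

Details

Motivation: Current LLM-based code evolution methods are sample-inefficient and closed-source, which limits broad adoption and extension. ShinkaEvolve aims to remove these limitations and advance open-ended scientific discovery.

Result: Substantially improves sample efficiency and solution quality across multiple tasks; for example, it discovers a new state-of-the-art circle packing solution using only 150 samples.

Insight: With an open-source framework and efficient search strategies, ShinkaEvolve achieves high sample efficiency and broad applicability in scientific discovery, demonstrating the potential of LLMs for open-ended problems.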

Abstract: We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.
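The abstract mentions a bandit-based strategy for choosing among an ensemble of LLMs but does not say which bandit algorithm is used. The sketch below uses UCB1 as an assumed stand-in to show the general mechanism; the model names and reward values are hypothetical, and the reward is imagined as the fitness improvement a model's mutation produced.

```python
import math


class UCB1Selector:
    """Pick which LLM proposes the next mutation using the UCB1 rule."""

    def __init__(self, arms, exploration=1.0):
        self.arms = list(arms)
        self.exploration = exploration
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Try every arm once before trusting any estimate.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        rounds = sum(self.counts.values())

        def upper_bound(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(rounds) / self.counts[arm])
            return mean + self.exploration * bonus

        return max(self.arms, key=upper_bound)

    def update(self, arm, reward):
        """Record the reward observed after using `arm`."""
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical average rewards per model; a real system would measure these
# from the evaluated fitness of each generated program.
mean_reward = {"model_a": 0.2, "model_b": 0.8, "model_c": 0.5}

selector = UCB1Selector(mean_reward)
for _ in range(200):
    chosen = selector.select()
    selector.update(chosen, mean_reward[chosen])

most_used = max(selector.counts, key=selector.counts.get)
```

The exploration bonus shrinks as an arm is used more often, so weaker models are still sampled occasionally while most of the budget goes to the model that has produced the best mutations so far.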

[71] TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities cs.CL | cs.AI

Jiajun Chen, Yangyang Wu, Xiaoye Miao, Mengying Zhu, Meng Xi

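TL;DR: TriSPrompt is a hierarchical soft prompt model that uses modality-aware (MA), modality-missing (MM), and mutual-views (MV) prompts to handle missing modalities in multimodal rumor detection, improving accuracy by over 13%.

Details

Motivation: Missing modalities are common in multimodal data, yet existing methods rely on training data with complete modalities and cannot handle missing modalities in real-world settings. A detection method that adapts to incomplete modalities is therefore needed.

Result: On three real-world benchmark datasets, TriSPrompt achieves an accuracy gain of over 13% compared with existing methods.

Insight: Through its hierarchical prompting mechanism, TriSPrompt not only handles missing modalities but also markedly improves detection performance by modeling relationships across views.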

Abstract: The widespread presence of incomplete modalities in multimodal data poses a significant challenge to achieving accurate rumor detection. Existing multimodal rumor detection methods primarily focus on learning joint modality representations from complete multimodal training data, rendering them ineffective in addressing the common occurrence of missing modalities in real-world scenarios. In this paper, we propose a hierarchical soft prompt model, TriSPrompt, which integrates three types of prompts, i.e., modality-aware (MA) prompt, modality-missing (MM) prompt, and mutual-views (MV) prompt, to effectively detect rumors in incomplete multimodal data. The MA prompt captures both heterogeneous information from specific modalities and homogeneous features from available data, aiding in modality recovery. The MM prompt models missing states in incomplete data, enhancing the model’s adaptability to missing information. The MV prompt learns relationships between subjective (i.e., text and image) and objective (i.e., comments) perspectives, effectively detecting rumors. Extensive experiments on three real-world benchmarks demonstrate that TriSPrompt achieves an accuracy gain of over 13% compared to state-of-the-art methods. The codes and datasets are available at https://anonymous.4open.science/r/code-3E88.

[72] RoadMind: Towards a Geospatial AI Expert for Disaster Response cs.CL | cs.AI

Ahmed El Fekih Zguir, Ferda Ofli, Muhammad Imran

TL;DR: This paper introduces RoadMind, a self-supervised framework that enhances LLMs’ geospatial reasoning for disaster response by leveraging OpenStreetMap data.

Details

Motivation: Current LLMs lack robust geospatial reasoning, which is critical for disaster response tasks like evacuation planning. RoadMind addresses this gap.

Result: RoadMind outperforms baseline LLMs in disaster-prone cities (LA, Christchurch, Manila) on tasks like road segment identification and nearest road retrieval.

Insight: Structured geospatial data can significantly enhance LLMs for offline disaster response, proving the value of domain-specific training.

Abstract: Large Language Models (LLMs) have shown impressive performance across a range of natural language tasks, but remain limited in their ability to reason about geospatial data, particularly road networks, distances, and directions. This gap poses challenges in disaster scenarios, where spatial understanding is critical for tasks such as evacuation planning and resource allocation. In this work, we present RoadMind, a self-supervised framework that enhances the geospatial reasoning capabilities of LLMs using structured data from OpenStreetMap (OSM). Our automated pipeline extracts road infrastructure data for a given city and converts it into multiple supervision formats tailored to key spatial tasks. We pretrain and fine-tune LLMs on these representations using QLoRA adapters and 4-bit quantized models. We evaluate our approach on three disaster-prone cities with varying global representation, Los Angeles, Christchurch, and Manila, across tasks such as road segment identification, nearest road retrieval, and distance/direction estimation. Our results show that models trained via RoadMind significantly outperform strong baselines, including state-of-the-art LLMs equipped with advanced prompt engineering. This demonstrates the potential of structured geospatial data to enhance language models with robust spatial reasoning, enabling more effective offline AI systems for disaster response.
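
To make the distance/direction estimation task concrete, here is a minimal sketch of how a single supervision example could be derived from two coordinates using the haversine formula and the initial compass bearing. The question and answer wording, the function names, and the eight-way compass bucketing are assumptions for illustration; the abstract does not describe RoadMind's actual supervision formats.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


def compass_direction(bearing):
    """Map a bearing in degrees to one of eight compass names."""
    names = ["north", "northeast", "east", "southeast",
             "south", "southwest", "west", "northwest"]
    return names[int((bearing + 22.5) // 45) % 8]


def make_example(name_a, point_a, name_b, point_b):
    """Build one hypothetical question/answer pair for distance and direction."""
    dist = haversine_km(*point_a, *point_b)
    direction = compass_direction(initial_bearing_deg(*point_a, *point_b))
    return {
        "question": f"How far and in which direction is {name_b} from {name_a}?",
        "answer": f"About {dist:.1f} km to the {direction}.",
    }
```

For instance, `make_example("Road A", (0.0, 0.0), "Road B", (0.0, 1.0))` describes a point roughly 111 km due east; the road names and coordinates are placeholders.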

[73] Benchmarking and Improving LLM Robustness for Personalized Generation cs.CL | cs.AI

Chimaobi Okite, Naihao Deng, Kiran Bodipati, Huaidian Hou, Joyce Chai

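TL;DR: The paper proposes PERG, a framework for evaluating the robustness of large language models (LLMs) in personalized generation, together with a new dataset, PERGData. It finds that current models fall well short on both factual accuracy and alignment with user preferences, and proposes Pref-Aligner, which improves robustness by 25% on average.

Details

Motivation: Existing evaluations focus on whether LLM responses match user preferences but overlook factual accuracy, leaving models unreliable in personalized generation. This paper aims to fill that gap.

Result: The study finds that: 1) the strongest models (e.g., GPT-4.1) fail to stay correct in 5% of the cases they answered successfully without personalization; 2) smaller models (7B scale) fail more than 20% of the time; 3) Pref-Aligner improves robustness by 25% on average.

Insight: 1) Robustness is significantly affected by the nature of the query and the type of user preference; 2) evaluation methods need to cover both factuality and user preferences; 3) Pref-Aligner offers an effective route to more reliable LLMs.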

Abstract: Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness in LLMs, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B-scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.
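
The robustness definition above (a response must be both factually accurate and aligned with the user's preferences) and the "previously successful cases" statistic can be written down as a small sketch. The `Case` record, its field names, and the function names are hypothetical; the paper's own metric definitions may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One evaluated query (hypothetical record layout)."""
    correct_without_pref: bool  # correct when asked without personalization
    correct_with_pref: bool     # correct when the user preference is added
    aligned_with_pref: bool     # response follows the stated preference


def is_robust(case):
    """Robust means factually correct AND preference-aligned under personalization."""
    return case.correct_with_pref and case.aligned_with_pref


def breakage_rate(cases):
    """Share of previously correct cases that become incorrect once personalized."""
    base = [c for c in cases if c.correct_without_pref]
    if not base:
        return 0.0
    broken = sum(1 for c in base if not c.correct_with_pref)
    return broken / len(base)


def robustness_rate(cases):
    """Share of all cases that are robust under personalization."""
    if not cases:
        return 0.0
    return sum(1 for c in cases if is_robust(c)) / len(cases)
```

Under this reading, the 5% figure quoted for the strongest models corresponds to `breakage_rate` over their cases.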

[74] Semantic Representation Attack against Aligned Large Language Models cs.CL | cs.AI

Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei

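TL;DR: The paper proposes a new semantic representation attack against aligned large language models (LLMs). By exploiting the semantic representation space to produce diverse but semantically equivalent harmful responses, it resolves the trade-off between effectiveness and naturalness in conventional attacks, and it introduces an efficient heuristic search algorithm.

Details

Motivation: Current attacks on aligned LLMs typically target specific textual patterns (such as "Sure, here is…") and suffer from poor convergence, unnatural prompts, and high computational cost. The authors redefine the attack objective in the semantic representation space to improve both effectiveness and naturalness.

Result: The attack success rate averages 89.41% across 18 LLMs, reaching 100% on 11 of them, while remaining efficient and stealthy.

Insight: Conventional attacks are limited by their dependence on specific textual patterns; by targeting diverse, semantically equivalent responses, the semantic representation attack improves both effectiveness and naturalness at a fundamental level.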

Abstract: Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is…", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.

[75] Meow: End-to-End Outline Writing for Automatic Academic Survey cs.CL | cs.AI

Zhaoyu Ma, Yuan Shan, Jiahao Zhao, Nan Xu, Lei Wang

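TL;DR: This paper proposes Meow, a metadata-driven framework for automatically generating systematic, high-quality outlines for academic surveys, using an end-to-end task formulation and a two-stage training approach to produce outlines efficiently.

Details

Motivation: With the exponential growth in the number of academic papers, automated survey generation with LLMs has become a trend, but outlines produced by existing methods lack in-depth understanding and fine-grained style and need improvement.

Result: The paper's 8B reasoning model shows high structural fidelity and stylistic coherence in outline generation.

Insight: A metadata-driven framework can make generated outlines more systematic and stylistically consistent, and the two-stage training approach markedly improves model performance.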

Abstract: As academic paper publication numbers grow exponentially, conducting in-depth surveys with LLMs automatically has become an inevitable trend. Outline writing, which aims to systematically organize related works, is critical for automated survey generation. Yet existing automatic survey methods treat outline writing as mere workflow steps in the overall pipeline. Such template-based workflows produce outlines that lack in-depth understanding of the survey topic and fine-grained styles. To address these limitations, we propose Meow, the first metadata-driven outline writing framework that produces organized and faithful outlines efficiently. Specifically, we first formulate outline writing as an end-to-end task that generates hierarchical structured outlines from paper metadata. We then curate a high-quality dataset of surveys from arXiv, bioRxiv, and medRxiv, and establish systematic evaluation metrics for outline quality assessment. Finally, we employ a two-stage training approach combining supervised fine-tuning and reinforcement learning. Our 8B reasoning model demonstrates strong performance with high structural fidelity and stylistic coherence.

[76] How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models cs.CL | cs.AI

Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu

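TL;DR: This paper studies how to inject domain knowledge efficiently when pre-training large language models (LLMs) and proposes a knowledge infusion scaling law that balances domain specialization against catastrophic forgetting.

Details

Motivation: Although LLMs perform well across a wide range of tasks, they may underperform on domain-specific tasks and even hallucinate. Prior work shows that infusing domain knowledge improves performance, but excessive infusion causes catastrophic forgetting, which must be addressed.

Result: Experiments validate the effectiveness and generalizability of the scaling law across model sizes and pre-training budgets.

Insight: Each model has a critical point during knowledge infusion beyond which its knowledge retention degrades sharply, and this critical point correlates with model size.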

Abstract: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucinations. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations, i.e., 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model’s size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pretraining token budgets validate both the effectiveness and generalizability of our scaling law.

[77] Do LLMs Encode Frame Semantics? Evidence from Frame Identification cs.CL

Jayanth Krishna Chundru, Rudrashis Poddar, Jie Cao, Tianyu Jiang

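TL;DR: The paper examines whether large language models implicitly encode knowledge of frame semantics, focusing on the frame identification task. Models perform frame identification effectively even without explicit supervision, and fine-tuning improves performance substantially.

Details

Motivation: To determine whether large language models implicitly learn frame semantic knowledge, and in particular whether they can perform frame identification without supervision.

Result: Models perform well without supervision; fine-tuning substantially improves performance and generalizes well to out-of-domain data. The models can also generate semantically coherent frame definitions.

Insight: Large language models implicitly capture frame semantic knowledge, which suggests new directions for applying them to semantic parsing tasks.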

Abstract: We investigate whether large language models encode latent knowledge of frame semantics, focusing on frame identification, a core challenge in frame semantic parsing that involves selecting the appropriate semantic frame for a target word in context. Using the FrameNet lexical resource, we evaluate models under prompt-based inference and observe that they can perform frame identification effectively even without explicit supervision. To assess the impact of task-specific training, we fine-tune the model on FrameNet data, which substantially improves in-domain accuracy while generalizing well to out-of-domain benchmarks. Further analysis shows that the models can generate semantically coherent frame definitions, highlighting the model’s internalized understanding of frame semantics.

[78] LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines cs.CL

Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang

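TL;DR: The survey "LLMs4All" discusses the application prospects, limitations, and future directions of large language models (LLMs) across academic disciplines, covering the humanities, economics, science, and engineering.

Details

Motivation: The strong performance of LLMs (such as ChatGPT) on language tasks has sparked interest in their cross-disciplinary applications, including the humanities and social sciences, economics and business, and science and technology, to advance both research and practice.

Result: The survey shows the broad application potential of LLMs across disciplines while noting model limitations (such as bias and interpretability), open challenges, and future directions.

Insight: LLMs are not just language tools; they may become a transformative force in cross-disciplinary research, but successful application requires combining domain knowledge, ethical considerations, and technical optimization.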

Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Model (LLM) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines, along with key observations and insights, can help researchers and practitioners interested in exploiting LLMs to advance their work in diverse real-world applications.

[79] GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models cs.CL | cs.AI

Dylan Hutson, Daniel Vennemeyer, Aneesh Deshmukh, Justin Zhan, Tianyu Jiang

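TL;DR: The paper proposes GuessingGame, a protocol for evaluating the ability of large language models (LLMs) to gain information through open-ended questions, introduces two information gain (IG) metrics, and shows that higher IG significantly improves efficiency.

Details

Motivation: To study the questioning strategies and information-gain ability of LLMs in open-ended settings in order to improve their interactive reasoning.

Result: Experiments show that a one-standard-deviation increase in IG reduces expected game length by 43%, and that prompting constraints significantly improve model performance.

Insight: The question-asking ability of LLMs is measurable and improvable, and information gain is a key metric for improving interactive reasoning.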

Abstract: We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
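
The entropy-based metric can be illustrated with a small sketch: if the remaining candidates are treated as equally likely, a question's information gain is the drop in entropy from filtering the candidate set. The uniform prior, the tiny `knowledge` dictionary standing in for ConceptNet relations, and the function names are assumptions for illustration, not the paper's implementation.

```python
import math


def entropy_bits(num_candidates):
    """Entropy in bits of a uniform distribution over the remaining candidates."""
    return math.log2(num_candidates) if num_candidates > 0 else 0.0


def filter_candidates(candidates, attribute, answer_yes, knowledge):
    """Keep candidates whose attribute membership matches the Oracle's yes/no answer."""
    return [c for c in candidates
            if (attribute in knowledge.get(c, set())) == answer_yes]


def information_gain(before, after):
    """Entropy reduction achieved by one question."""
    return entropy_bits(len(before)) - entropy_bits(len(after))


# Hypothetical stand-in for ConceptNet-style relations.
knowledge = {
    "cat": {"animal", "pet"},
    "dog": {"animal", "pet"},
    "hammer": {"tool"},
    "spoon": {"tool", "kitchen"},
}
candidates = list(knowledge)
remaining = filter_candidates(candidates, "animal", True, knowledge)
gain = information_gain(candidates, remaining)  # 4 candidates down to 2: 1 bit
```

A question that halves the candidate set is worth one bit, so a Guesser that keeps asking such questions needs about log2(N) questions to isolate one of N objects.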

[80] Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models cs.CL | cs.CV

Mohammad Saim, Phan Anh Duong, Cat Luong, Aniket Bhanderi, Tianyu Jiang

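TL;DR: The paper uses large vision-language models (LVLMs) to build the Embodied LVLM Emotion Narratives (ELENA) framework, which produces multi-layered text descriptions of the salient body parts involved in emotional reactions. The models show a bias toward the facial region; even so, ELENA outperforms baselines at recognizing emotions in face-masked images.

Details

Motivation: Body parts involved in emotional reactions carry rich affective information, but existing models tend to focus on the face and neglect emotional expression in other body parts. The goal is to use LVLMs to build a framework that describes the body parts involved in emotional reactions more comprehensively.

Result: ELENA outperforms baseline methods at emotion recognition, particularly on face-masked images, and does so without any fine-tuning.

Insight: Emotion analysis should not be limited to the face, since other body parts also express emotion; LVLMs already show some generalization ability without fine-tuning.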

Abstract: The embodiment of emotional reactions from body parts contains rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision-language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, primarily comprising descriptions that focus on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias towards the facial region. Despite this limitation, we observe that our employed framework can effectively recognize embodied emotions in face-masked images, outperforming baselines without any fine-tuning. ELENA opens a new trajectory for embodied emotion analysis across the modality of vision and enriches modeling in an affect-aware setting.

[81] Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections cs.CL | cs.AI | cs.SI

Yicheng Yang, Zixian Li, Jean Paul Bizimana, Niaz Zafri, Yongfeng Dong

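TL;DR: The paper proposes a prompt design for multimodal large language models (LLMs) to predict driver yielding behavior at unsignalized intersections, showing its potential for pedestrian safety.

Details

Motivation: Pedestrian safety is an important part of urban mobility, but traditional machine learning models struggle to capture the complexity and context dependence of driver-pedestrian interactions. LLMs, with their strong pattern extraction ability, are a potential solution.

Result: Experiments show that GPT-4o achieves the best accuracy and recall, while Deepseek-V3 leads in precision; the results also reveal a trade-off between model performance and computational efficiency.

Insight: LLMs have practical potential for pedestrian safety, but performance must be weighed against computational cost; the findings offer guidance for real-world deployment.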

Abstract: Pedestrian safety is a critical component of urban mobility and is strongly influenced by the interactions between pedestrian decision-making and driver yielding behavior at crosswalks. Modeling driver–pedestrian interactions at intersections requires accurately capturing the complexity of these behaviors. Traditional machine learning models often struggle to capture the nuanced and context-dependent reasoning required for these multifactorial interactions, due to their reliance on fixed feature representations and limited interpretability. In contrast, large language models (LLMs) are suited for extracting patterns from heterogeneous traffic data, enabling accurate modeling of driver-pedestrian interactions. Therefore, this paper leverages multimodal LLMs through a novel prompt design that incorporates domain-specific knowledge, structured reasoning, and few-shot prompting, enabling interpretable and context-aware inference of driver yielding behavior, as an example application of modeling pedestrian–driver interaction. We benchmarked state-of-the-art LLMs against traditional classifiers, finding that GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. These findings highlight the critical trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.
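
Because the comparison hinges on accuracy, recall, and precision pointing to different winners, a short reminder of how the three are computed for a binary yield / no-yield label may help. The label encoding (1 = driver yields) and the toy predictions are made up for illustration and are not data from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = driver yields)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy example: five crossing events with hypothetical model predictions.
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

A model that predicts "yields" more readily tends to gain recall at the expense of precision, which is one way two models can each lead on a different metric.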

[82] CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition cs.CL | cs.CV

Sina J. Semnani, Han Zhang, Xinyan He, Merve Tekgürler, Monica S. Lam

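TL;DR: CHURRO is a 3B-parameter open-weight vision-language model specialized for historical text recognition; it outperforms existing models while being highly cost-effective.

Details

Motivation: Existing vision-language models mainly target modern, standardized text and cannot handle the diverse languages, irregular layouts, and degradation found in historical documents. CHURRO aims to fill this gap.

Result: On the test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model while being 15.5 times more cost-effective.

Insight: By releasing the model and dataset, CHURRO provides tools for historical text recognition research and may advance cultural heritage preservation and scholarship.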

Abstract: Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
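
For readers unfamiliar with the reported metric, the sketch below computes a normalized Levenshtein similarity as one minus the edit distance divided by the length of the longer string. This normalization is a common convention and an assumption here; the paper may normalize differently or preprocess text before scoring.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def normalized_similarity(prediction, reference):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

Under this convention, a transcription with one wrong character in a four-character word scores 0.75.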

[83] EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation cs.CL

Sen Yang, Yu Bao, Yu Lu, Jiajun Chen, Shujian Huang

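TL;DR: The paper leverages the strength of LLMs on English-centric language pairs, using synthetic data and preference optimization to improve translation between non-English languages (x2x), with significant gains across 72 x2x directions.

Details

Motivation: Existing LLMs perform well on English-centric language pairs but poorly on direct translation between non-English languages (x2x). The authors aim to use the models' English translation strength to improve x2x translation.

Result: Significant improvement across 72 x2x translation directions, along with better English-to-other-language (en2x) translation.

Insight: Strategically exploiting the English translation strength of LLMs can markedly improve translation between non-English languages, showing that English-centric capability can bootstrap multilingual translation.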

Abstract: Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models’ established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release codes, datasets, and model checkpoints at https://github.com/NJUNLP/EAX

[84] bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs cs.CL | cs.AI | cs.CR

Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu

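TL;DR: The paper proposes bi-GRPO, a framework that uses reinforcement learning to embed stealthy jailbreak backdoors in large language models (LLMs), overcoming the poor generalization and limited stealthiness of existing methods.

Details

Motivation: Existing jailbreak backdoor methods (such as SFT and RLHF) suffer from poor generalization, limited stealthiness, or low usability of the generated jailbreak responses. The paper aims to propose a more effective optimization framework that addresses these issues.

Result: Experiments show that bi-GRPO outperforms existing methods in attack success rate (>99%), stealthiness, and usability of the generated jailbreak responses.

Insight: The framework reduces dependence on high-quality supervised data or potentially flawed reward models, offering a new approach to jailbreak backdoor attacks.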

Abstract: With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers, such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF), each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

[85] Benchmarking Gaslighting Attacks Against Speech Large Language Models cs.CL

Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang

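TL;DR: The paper introduces gaslighting attacks against speech large language models (Speech LLMs), evaluates model robustness with five carefully designed attack strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation), and reveals substantial performance degradation and behavioral vulnerabilities in multimodal experiments.

Details

Motivation: As Speech LLMs spread through voice-based applications, their robustness to manipulative input becomes critical. However, the unique cognitive and perceptual challenges of speech interaction remain underexplored, and properties such as ambiguity, continuity, and perceptual diversity make adversarial attacks harder to detect.

Result: Experiments show that the five gaslighting attacks reduce model accuracy by 24.3% on average, revealing significant behavioral vulnerability.

Insight: The distinctive properties of speech interaction make it more susceptible to manipulation; more robust and trustworthy speech AI systems are needed.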

Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

[86] Future Policy Aware Preference Learning for Mathematical Reasoning cs.CL

Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo

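TL;DR: The paper proposes Future Policy Aware (FPA) preference learning to improve large language model training for mathematical reasoning. The method regularizes gradients using an estimated future policy, avoiding over-penalization of useful tokens. Experiments show that FPA outperforms existing methods on multiple benchmarks.

Details

Motivation: Existing preference learning methods (such as DPO) are ineffective for mathematical reasoning, mainly because token overlap leads to over-penalization. By the time regularization based on the current policy takes effect, the model may already have degraded.

Result: On the MATH and GSM8K benchmarks, FPA yields consistent gains, with the largest improvement of up to 5.75% on SimPER, and enables longer training without degradation.

Insight: By anticipating the behavior of the future policy, FPA balances protecting useful tokens against penalizing undesirable trajectories, showing the potential of forward-looking regularization in preference learning.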

Abstract: Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
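
The abstract describes the future policy only as a lightweight logit-space extrapolation from the reference model toward the current model. One plausible reading, shown below as a sketch, is a linear step past the current logits along the direction (current minus reference). The step size `lam`, the exact formula, and the toy three-token vocabulary are assumptions, not the authors' stated method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extrapolate_logits(ref_logits, cur_logits, lam=1.0):
    """Assumed future-policy estimate: current + lam * (current - reference)."""
    return [c + lam * (c - r) for r, c in zip(ref_logits, cur_logits)]


# Toy three-token vocabulary: training has pushed token 2 down relative to the reference.
ref = [0.0, 0.0, 0.0]
cur = [1.0, 0.0, -1.0]
future = extrapolate_logits(ref, cur, lam=1.0)
p_cur = softmax(cur)
p_future = softmax(future)
```

In this toy case the extrapolated policy already assigns token 2 a much lower probability than the current policy does, so a regularizer driven by the future probability would damp the gradient earlier than one driven by the current probability.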

[87] WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction cs.CL

Binbin Zhang, Chengdong Liang, Shuai Wang, Xuelong Geng, Zhao Guo

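TL;DR: WEST is a speech toolkit based on large language models (LLMs) that supports speech understanding, generation, and interaction; it is fully LLM-based, full-stack, and simple to use.

Details

Motivation: To handle the complexity of diverse speech tasks (such as recognition, synthesis, and understanding) with an easy-to-use toolkit that takes advantage of LLMs.

Result: Two types of recipes are provided: 1) reproducible experiments based entirely on open-source models and data; 2) models trained on massive data that offer superior performance and can be used out of the box.

Insight: LLMs have broad potential in speech tasks, and a simple toolkit design lowers the technical barrier and makes it easier to integrate tasks.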

Abstract: In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/

[88] CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems cs.CL | cs.AI

Soham Bhattacharjee, Mukund K Roy, Yathish Poojary, Bhargav Dave, Mihir Raj

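TL;DR: The paper introduces CorIL, a large-scale, high-quality parallel corpus covering 11 languages, intended to enrich resources and research for machine translation between Indian languages.

Details

Motivation: India is linguistically diverse, but high-quality parallel corpora are scarce, especially across varied domains. To address this, the authors built a corpus covering 11 languages, categorized into Government, Health, and General domains.

Result: Experiments show that massively multilingual models do better on Perso-Arabic scripts (Urdu, Sindhi), while other models excel on Indic scripts. A domain-wise analysis also reveals differences in translation performance across domains.

Insight: The study shows that language script strongly affects model performance and offers insights into cross-script transfer learning. The released corpus is intended as a key resource for Indian-language machine translation research.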

Abstract: India’s linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus’s value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.

[89] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training cs.CLPDF

Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen

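TL;DR: The paper proposes TtT, a unified audio-text modeling framework that combines autoregressive text generation with non-autoregressive audio diffusion, avoiding the multi-stage training and high computational cost of existing approaches.

Details

Motivation: Existing audio-text multimodal models (such as MOSHI) require complex multi-stage training at high computational cost, and they overlook the asymmetry between the dependency structures of audio and text.

Result: The framework avoids multi-stage training and improves efficiency while maintaining generation quality.

Insight: Audio and text have different dependency structures, so handling them separately (autoregressive versus non-autoregressive) is more efficient.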

Abstract: Recent advances in large language models have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-in speech-out conversational systems. However, existing multimodal models handling interleaved audio and text, such as MOSHI, require complex multi-stage training pipelines, incurring substantial computational costs. Moreover, these models uniformly apply autoregressive generation to both text and audio tokens, overlooking a fundamental asymmetry in their dependency structures: while text tokens exhibit strong target-target dependencies requiring causal ordering, audio tokens are predominantly driven by source-target dependencies, where audio outputs primarily condition on source text rather than preceding audio tokens. In this work, we propose TtT, a unified audio-text modeling framework that integrates autoregressive (AR) text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM.

[90] Can Constructions “SCAN” Compositionality? cs.CLPDF

Ganesh Katrapati, Manish Shrivastava

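TL;DR: The paper proposes an unsupervised method that automatically extracts variable-slot templates (pseudo-constructions) from training data to address the weaknesses of sequence-to-sequence models in compositionality and systematic generalisation, markedly improving out-of-distribution performance and data efficiency on the SCAN dataset.

Details

Motivation: Sequence-to-sequence models excel at many tasks but perform poorly on compositionality and systematic generalisation. The authors attribute this to the models' failure to internalise conventionalised form-meaning pairings (constructions), which limits their ability to recombine.

Result: On the ADD JUMP and AROUND RIGHT splits of SCAN, the method raises accuracy to 47.8% and 20.3%, respectively, and reaches competitive performance with only 40% of the training data.

Insight: Construction-aware preprocessing of the data can substantially improve compositionality and generalisation without relying on complex model architectures or changes to the training regime.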

Abstract: Sequence-to-sequence models struggle at compositionality and systematic generalisation even while they excel at many other tasks. We attribute this limitation to their failure to internalise constructions: conventionalised form-meaning pairings that license productive recombination. Building on these insights, we introduce an unsupervised procedure for mining pseudo-constructions: variable-slot templates automatically extracted from training data. When applied to the SCAN dataset, our method yields large gains on out-of-distribution splits: accuracy rises to 47.8% on ADD JUMP and to 20.3% on AROUND RIGHT without any architectural changes or additional supervision. The model also attains competitive performance with 40% of the original training data, demonstrating strong data efficiency. Our findings highlight the promise of construction-aware preprocessing as an alternative to heavy architectural or training-regime interventions.

[91] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation cs.CL | cs.AIPDF

Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wud, Zichen Wang

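TL;DR: The paper proposes RLAG, which embeds domain knowledge into large language models through reinforcement learning from augmented generation, addressing the insufficient knowledge prioritization and limited reasoning ability of existing methods on domain tasks.

Details

Motivation: Large language models (LLMs) show limited performance on domain-specific tasks, mainly because the domain knowledge in their training data is scarce and static. Existing methods (such as CPT and SFT) fail to embed critical knowledge effectively or to build coherent reasoning structures.

Result: Experiments show that RLAG significantly outperforms baseline methods across medical, legal, astronomy, and other domains.

Insight: By combining generation with optimization, RLAG embeds knowledge dynamically and improves reasoning, offering a new approach to domain knowledge embedding.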

Abstract: Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at https://github.com/ChaojunNie/RLAG.

[92] Thinking Augmented Pre-training cs.CL | cs.LGPDF

Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

TL;DR: The paper proposes Thinking augmented Pre-Training (TPT), a method to enhance data efficiency in LLM training by augmenting text data with automatically generated thinking trajectories, improving performance and learnability of complex tokens.

Details

Motivation: The motivation stems from the growing compute demands for LLM pre-training and the limited availability of high-quality data, making it crucial to maximize data utility.

Result: Experiments show TPT improves data efficiency by 3x and boosts post-training performance by over 10% on reasoning benchmarks for a 3B parameter model.

Insight: The insight is that augmenting data with thinking trajectories can unlock the potential of fixed model capacity by breaking down complex reasoning processes into manageable steps.

Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.

[93] Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs cs.CL | cs.AI | cs.DBPDF

Parker Glenn, Alfy Samuel, Daben Liu

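TL;DR: The study explores how to integrate LLM-powered operators into declarative query languages to improve performance and accuracy, and proposes an efficient method for ensuring that LLM functions are well typed.

Details

Motivation: Current approaches rely on many post-processing calls to align LLM outputs with database contents, which creates performance bottlenecks; a more efficient solution is needed.

Result: Experiments show that small models perform well on hybrid data sources, and the new method significantly improves both accuracy and latency.

Insight: Small language models can serve as efficient function executors for specific tasks, and enforcing type constraints is key to efficient integration.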

Abstract: Integrating LLM-powered operators in declarative query languages allows for the combination of cheap and interpretable functions with powerful, generalizable language model reasoning. However, in order to benefit from the optimized execution of a database query language like SQL, generated outputs must align with the rules enforced by both type checkers and database contents. Current approaches address this challenge with orchestrations consisting of many LLM-based post-processing calls to ensure alignment between generated outputs and database values, introducing performance bottlenecks. We perform a study on the ability of variously sized open-source language models to both parse and execute functions within a query language based on SQL, showing that small language models can excel as function executors over hybrid data sources. Then, we propose an efficient solution to enforce the well-typedness of LLM functions, demonstrating a 7% accuracy improvement on a multi-hop question answering dataset with a 53% improvement in latency over comparable solutions. We make our implementation available at https://github.com/parkervg/blendsql

[94] Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks cs.CL | cs.AI | 68T50, 68T35 | I.2.7; H.3.1; I.2.6PDF

Hailay Kidu Teklehaymanot, Gebrearegawi Gidey, Wolfgang Nejdl

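TL;DR: The paper studies how multilingual pretrained models can improve machine translation quality for the low-resource language Tigrinya, proposes an approach that combines language-specific tokenization, embedding initialization, and domain-adaptive fine-tuning, and builds a high-quality evaluation dataset.

Details

Motivation: In neural machine translation, low-resource languages such as Tigrinya still face limited corpora, inadequate tokenization strategies, and a lack of standardized evaluation benchmarks.

Result: Experiments show that a custom tokenizer combined with transfer learning substantially improves translation quality, as confirmed by automatic metrics (BLEU, chrF) and human evaluation.

Insight: Linguistically aware modeling and reproducible benchmarks are essential for improving performance on low-resource languages; error analysis guided targeted refinements.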

Abstract: Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at https://github.com/hailaykidu/MachineT_TigEng and https://huggingface.co/Hailay/MachineT_TigEng

[95] Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage cs.CLPDF

Zipeng Ling, Yuehao Tang, Chen Huang, Shuliang Liu, Gaoyang Jiang

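TL;DR: The paper introduces the Instruction Boundary problem, examines LLM reasoning biases under different levels of prompt coverage, and develops the BiasDetector framework to measure them. It finds that prompt design still induces substantial downstream bias even when LLMs are accurate on the main task.

Details

Motivation: The reasoning ability of large language models (LLMs) is powerful, but its reliability is limited by prompt design. Users may unintentionally supply biased or incomplete prompts that affect model output; the authors aim to improve LLM reliability by quantifying this bias.

Result: Although LLMs are accurate on the main task, insufficient or redundant prompt coverage leads to serious downstream bias, and the problem is more pronounced on complex tasks.

Insight: Prompt design is critical to the reliability of LLM reasoning; developers and users should pay closer attention to prompt completeness and precision to reduce the risk of bias.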

Abstract: Large-language-model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, their limitations - especially those stemming from prompt design - remain underexplored. Because users may supply biased or incomplete prompts - often unintentionally - LLMs can be misled, undermining reliability and creating risks. We refer to this vulnerability as the Instruction Boundary. To investigate the phenomenon, we distill it into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that LLM reasoning reliability can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.

[96] Feeding Two Birds or Favoring One? Adequacy-Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation cs.CL | cs.AI | cs.LGPDF

Behzad Shayegh, Jan-Thorsten Peter, David Vilar, Tobias Domhan, Juraj Juraska

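TL;DR: The study examines the tradeoff between adequacy and fluency in machine translation, shows that current evaluation metrics lean toward adequacy, and proposes synthesizing translation systems for meta-evaluation to reduce this bias.

Details

Motivation: Machine translation evaluation usually involves two dimensions, adequacy and fluency, but existing metrics tend to favor adequacy, which can cause fluency to be neglected and undermine the fairness of evaluation.

Result: Current metrics generally lean toward adequacy, and a similar bias exists in meta-evaluation; the proposed method effectively reduces bias in evaluation.

Insight: Researchers should keep the adequacy-fluency tradeoff in mind and ensure that metrics and meta-evaluation methods are fair, to avoid favoring particular translation systems.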

Abstract: We investigate the tradeoff between adequacy and fluency in machine translation. We show the severity of this tradeoff at the evaluation level and analyze where popular metrics fall within it. Essentially, current metrics generally lean toward adequacy, meaning that their scores correlate more strongly with the adequacy of translations than with fluency. More importantly, we find that this tradeoff also persists at the meta-evaluation level, and that the standard WMT meta-evaluation favors adequacy-oriented metrics over fluency-oriented ones. We show that this bias is partially attributed to the composition of the systems included in the meta-evaluation datasets. To control this bias, we propose a method that synthesizes translation systems in meta-evaluation. Our findings highlight the importance of understanding this tradeoff in meta-evaluation and its impact on metric rankings.

[97] Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression, mBERT, and XLM-RoBERTa with Active Learning cs.CL | cs.LGPDF

T. O. Abiola, K. D. Abiodun, O. E. Olumide, O. O. Adebanji, O. Hiram Calvo

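TL;DR: The paper studies multilingual hope speech detection, compares logistic regression, mBERT, and XLM-RoBERTa, and introduces an active learning strategy, showing that transformer models are effective in low-resource settings.

Details

Motivation: Hope speech (encouraging and optimistic language) is vital for promoting positive discussion online, but detecting it in multilingual and low-resource settings remains challenging.

Result: XLM-RoBERTa performs best, and the active learning strategy remains effective with small amounts of annotated data.

Insight: Combining multilingual transformers with active learning offers an efficient solution for low-resource hope speech detection.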

Abstract: Hope speech, language that fosters encouragement and optimism, plays a vital role in promoting positive discourse online. However, its detection remains challenging, especially in multilingual and low-resource settings. This paper presents a multilingual framework for hope speech detection using an active learning approach and transformer-based models, including mBERT and XLM-RoBERTa. Experiments were conducted on datasets in English, Spanish, German, and Urdu, including benchmark test sets from recent shared tasks. Our results show that transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy. Furthermore, our active learning strategy maintained strong performance even with small annotated datasets. This study highlights the effectiveness of combining multilingual transformers with data-efficient training strategies for hope speech detection.
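
The abstract does not say which active learning strategy was used. As a rough illustration of the general idea only, the sketch below assumes entropy-based uncertainty sampling: the classifier's predicted class probabilities for unlabeled texts are ranked by entropy, and the most uncertain texts are sent for annotation first. The function names, the probability values, and the `budget` parameter are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(pool_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled examples.

    pool_probs holds one probability vector per unlabeled example,
    e.g. [P(hope), P(not hope)] from the current classifier.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]

# Hypothetical predictions for four unlabeled texts.
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
picked = select_for_annotation(pool_probs, budget=2)
print(picked)
```

Under this assumption, the two examples closest to a 50/50 prediction are chosen, which is how a small annotation budget gets spent where the model is least sure.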

[98] SIM-CoT: Supervised Implicit Chain-of-Thought cs.CL | cs.AIPDF

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao

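TL;DR: The paper proposes SIM-CoT, which introduces step-level supervision to resolve instability in implicit chain-of-thought methods, markedly improving performance and stability.

Details

Motivation: Implicit chain-of-thought methods are computationally efficient in large language models, but training instability and insufficient semantic diversity limit their performance. The paper addresses this with step-level supervision.

Result: Significant gains on models such as GPT-2 and LLaMA-3.1 8B (for example, +8.2% for Coconut), and a narrower performance gap on larger models.

Insight: Step-level supervision effectively resolves instability in implicit reasoning while preserving computational efficiency. SIM-CoT is general and applies to multiple implicit chain-of-thought methods.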

Abstract: Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.

[99] Z-Scores: A Metric for Linguistically Assessing Disfluency Removal cs.CL | cs.AI | eess.ASPDF

Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma

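TL;DR: The paper proposes Z-Scores, an evaluation metric based on linguistic categories for analyzing disfluency removal, which reveals systematic model weaknesses better than traditional word-level metrics.

Details

Motivation: Traditional word-level metrics (precision, recall, and F1) struggle to capture how models perform on different types of disfluency; a finer-grained evaluation method is needed.

Result: Z-Scores reveal hidden LLM difficulties with INTJ and PRN disfluencies that aggregate F1 cannot expose, and directly inform model refinement strategies.

Insight: Linguistically categorized metrics give more detailed model diagnostics and help design targeted interventions (such as tailored prompts or data augmentation) that improve performance.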

Abstract: Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions – such as tailored prompts or data augmentation – yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
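
To make the contrast between aggregate and category-specific scoring concrete, here is a minimal sketch. It is a simplified token-level illustration, not the paper's span-level Z-Score definition or its alignment module: each token is assumed to arrive already labeled as fluent (`None`) or as one of the disfluency types named in the abstract, together with a flag saying whether the system removed it. The data and function names are hypothetical.

```python
from collections import defaultdict

def aggregate_prf(tokens):
    """Word-level precision, recall and F1 for removal decisions.

    Each token is (category, removed): category is None for fluent words,
    otherwise a disfluency type such as "EDITED", "INTJ" or "PRN".
    """
    tp = sum(1 for cat, removed in tokens if cat is not None and removed)
    fp = sum(1 for cat, removed in tokens if cat is None and removed)
    fn = sum(1 for cat, removed in tokens if cat is not None and not removed)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def per_category_removal_rate(tokens):
    """Fraction of disfluent tokens removed, broken down by disfluency type."""
    counts = defaultdict(lambda: [0, 0])  # category -> [removed, total]
    for cat, removed in tokens:
        if cat is None:
            continue  # fluent tokens are not part of the per-type breakdown
        counts[cat][1] += 1
        if removed:
            counts[cat][0] += 1
    return {cat: removed / total for cat, (removed, total) in counts.items()}

# Hypothetical system output: good on EDITED, weak on INTJ, blind to PRN.
tokens = (
    [("EDITED", True)] * 8
    + [("INTJ", True)] + [("INTJ", False)]
    + [("PRN", False)] * 2
    + [(None, False)] * 20
)
precision, recall, f1 = aggregate_prf(tokens)
rates = per_category_removal_rate(tokens)
print(round(f1, 3), rates)
```

In this made-up example the aggregate F1 looks respectable, while the breakdown shows that PRN tokens are never removed, which is the kind of weakness a single aggregate number hides.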

[100] DRES: Benchmarking LLMs for Disfluency Removal cs.CL | cs.AI | eess.ASPDF

Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma

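TL;DR: The paper presents DRES (Disfluency Removal Evaluation Suite), a benchmark for evaluating large language models (LLMs) on disfluency removal. Using the human-annotated Switchboard corpus, the study finds that segmentation is effective, that reasoning-oriented models over-delete, and that fine-tuning has limitations.

Details

Motivation: Disfluencies are a major challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. Existing work lacks a reproducible benchmark, which makes it hard to evaluate disfluency removal systematically.

Result: 1. Segmentation significantly improves performance; 2. reasoning-oriented models tend to over-delete fluent content; 3. fine-tuning improves precision but harms generalization.

Insight: 1. Segmentation is an effective strategy for disfluency removal; 2. model architecture and scale affect performance; 3. real-world deployment must balance precision and generalization.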

Abstract: Disfluencies – such as “um,” “uh,” interjections, parentheticals, and edited statements – remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.

[101] Language Models that Think, Chat Better cs.CLPDF

Adithya Bhaskar, Xi Ye, Danqi Chen

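TL;DR: The paper proposes RLMT, a new method that combines reinforcement learning with model-based rewards to improve language models' reasoning and chat abilities on open-ended tasks; it outperforms standard RLHF and performs strongly on multiple benchmarks.

Details

Motivation: Reinforcement learning with verifiable rewards (RLVR) works well in verifiable domains (such as mathematics and code) but generalizes poorly to open-ended tasks (such as writing or planning). The authors aim to extend the reach of RLVR and improve language model performance on open-ended tasks.

Result: RLMT gains 3-7 points on several benchmarks (AlpacaEval2, WildBench, and ArenaHardV2) and 1-3 points on creative writing and general knowledge tasks. The best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet.

Insight: RLMT's success suggests that reinforcement learning can be applied more broadly to open-ended tasks, and it offers pointers for future research on using reasoning more effectively.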

Abstract: Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks – such as writing outline essays or making meal plans – where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before response, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call upon future work to understand and employ thinking more broadly.

cs.MM [Back]

[102] MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization cs.MM | cs.CV | cs.SDPDF

Jianxuan Yang, Xiaoran Yang, Lipan Zhang, Xinyue Guo, Zhao Wang

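TL;DR: MultiSoundGen is a new video-to-audio (V2A) generation framework that uses SlowFast Contrastive Audio-Visual Pretraining (SF-CAVP) and Direct Preference Optimization (DPO) to address semantic alignment and audio quality in multi-event scenarios, achieving state-of-the-art performance there.

Details

Motivation: Current V2A methods suffer from weak semantic alignment and dynamic-feature capture in multi-event scenarios and lack quantitative preference optimization, which leads to poor generation quality. This study aims to address these problems.

Result: Experiments show that MultiSoundGen performs best in multi-event scenarios, with across-the-board gains in distribution matching, audio quality, semantic alignment, and temporal synchronization.

Insight: Combining SF-CAVP's dual-stream design with DPO shows clear advantages in multi-event scenarios and offers a new solution for complex V2A tasks.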

Abstract: Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. The complete code and dataset will be released soon.

cs.CE [Back]

[103] Multimodal Language Models with Modality-Specific Experts for Financial Forecasting from Interleaved Sequences of Text and Time Series cs.CE | cs.CL | q-fin.CP | I.2.7; J.4PDF

Ross Koval, Nicholas Andrews, Xifeng Yan

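TL;DR: The paper proposes a multimodal language model for financial forecasting that processes interleaved sequences of text and time series with modality-specific experts and introduces a cross-modal alignment framework, achieving state-of-the-art forecasting performance and meaningful economic gains.

Details

Motivation: Text and time series data on financial markets provide complementary information, but effectively integrating these interleaved modalities to improve forecasting remains challenging.

Result: The model performs strongly on a large-scale financial forecasting task, surpassing a range of unimodal and multimodal baselines, and yields economic gains in investment simulations.

Insight: Cross-modal alignment and salient token weighting effectively improve multimodal forecasting, and time series context is highly valuable for financial forecasting.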

Abstract: Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. However, despite their complementary nature, effectively integrating these interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences using modality-specific experts, allowing the model to learn unique time series patterns, while still enabling joint reasoning across modalities and preserving pretrained language understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities with a focus on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance across a wide variety of strong unimodal and multimodal baselines. We develop an interpretability method that reveals insights into the value of time series-context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate to meaningful economic gains in investment simulations.

cs.HC [Back]

[104] Human-AI Narrative Synthesis to Foster Shared Understanding in Civic Decision-Making cs.HC | cs.CLPDF

Cassandra Overney, Hang Jiang, Urooj Haider, Cassandra Moe, Jasmine Mangat

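TL;DR: The paper presents StoryBuilder, a human-AI collaborative narrative synthesis system for fostering shared understanding in civic decision-making. It generates first-person narratives that help community members relate across diverse perspectives; experiments show that experience-grounded narratives build more trust and respect than opinion-based ones.

Details

Motivation: Traditional methods for analyzing community feedback are overwhelmed by large volumes of data, which hinders shared understanding between citizens and leaders. The paper aims to improve this process through human-AI collaboration and to strengthen communication and understanding among community members.

Result: 1. The field deployment shows that narratives helped community members understand diverse perspectives; 2. the experiment shows that experience-grounded narratives generated more trust and respect.

Insight: Human-AI narrative synthesis can improve shared understanding in civic decision-making; experience-driven narrative design in particular can markedly strengthen mutual trust and respect among community members.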

Abstract: Community engagement processes in representative political contexts, like school districts, generate massive volumes of feedback that overwhelm traditional synthesis methods, creating barriers to shared understanding not only between civic leaders and constituents but also among community members. To address these barriers, we developed StoryBuilder, a human-AI collaborative pipeline that transforms community input into accessible first-person narratives. Using 2,480 community responses from an ongoing school rezoning process, we generated 124 composite stories and deployed them through a mobile-friendly StorySharer interface. Our mixed-methods evaluation combined a four-month field deployment, user studies with 21 community members, and a controlled experiment examining how narrative composition affects participant reactions. Field results demonstrate that narratives helped community members relate across diverse perspectives. In the experiment, experience-grounded narratives generated greater respect and trust than opinion-heavy narratives. We contribute a human-AI narrative synthesis system and insights on its varied acceptance and effectiveness in a real-world civic context.

cs.AI [Back]

[105] Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning cs.AI | cs.CL | cs.LG | I.2.7; I.2.6PDF

Sai Teja Reddy Adapala

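TL;DR: The paper studies the performance limits of large language models (LLMs) under cognitive load and proposes a theoretical framework for analyzing how context saturation and attentional residue affect reasoning. Using the ICE benchmark, it finds that performance on multi-hop reasoning degrades markedly, and that smaller open-source models fail completely on this high-load task.

Details

Motivation: Although LLMs excel on static tasks, their fragility in dynamic, information-rich environments is not well understood. The author aims to uncover the key mechanisms behind performance degradation by studying how cognitive load affects reasoning.

Result: Smaller open-source models (Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.2) scored 0% accuracy on the high-load task, while Gemini-2.0-Flash-001 did better in control conditions (85% accuracy) but degraded significantly under context saturation.

Insight: Cognitive load is a key factor in LLM reasoning failures. Dynamic testing based on cognitive stress (such as ICE) is essential for assessing the true robustness and safety of AI systems.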

Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.

[106] UserRL: Training Interactive User-Centric Agent via Reinforcement Learning cs.AI | cs.CL | cs.LGPDF

Cheng Qian, Zuxin Liu, Akshara Prabhakar, Jielin Qiu, Zhiwei Liu

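TL;DR: The paper proposes UserRL, a framework that trains user-centric agent models through standardized gym environments and simulated users. Using the GRPO algorithm, it analyzes how reward assignment and trajectory scoring affect learning, and finds that SFT cold start and trajectory scoring design are critical for efficient multi-turn interaction.

Details

Motivation: Although reinforcement learning performs well on dynamic interactive tasks, training truly user-centric agents remains difficult because of the diversity and dynamics of user interaction.

Result: SFT cold start is critical for initial interaction ability, deliberate trajectory scoring makes multi-turn interaction markedly more efficient, and open-source simulated users are a cost-effective option.

Insight: Reward design and the choice of user simulator matter as much as model scale; UserRL offers a practical path to robust user-centric agent models.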

Abstract: Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet, the ultimate value of such agents lies in their ability to assist users, a setting where diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user simulation choice is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All code and data are public for future research.
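
The abstract does not spell out the trajectory-level scoring formulations that were compared. As an illustration of why the choice can matter, the sketch below contrasts two hypothetical options: a plain sum of turn-level rewards, and a discounted sum with an assumed discount factor `gamma` that favors trajectories reaching a useful outcome in fewer turns. Neither is claimed to be the formulation used in UserRL.

```python
def sum_score(turn_rewards):
    """Trajectory score as the plain sum of per-turn rewards."""
    return sum(turn_rewards)

def discounted_score(turn_rewards, gamma=0.8):
    """Trajectory score where a reward earned at turn t is scaled by gamma**t.

    gamma is a hypothetical discount factor; later turns count for less,
    so solving the user's request early scores higher.
    """
    return sum(r * gamma ** t for t, r in enumerate(turn_rewards))

# Two hypothetical four-turn dialogues that both end up helping the user once.
fast = [1.0, 0.0, 0.0, 0.0]  # helpful on the first turn
slow = [0.0, 0.0, 0.0, 1.0]  # helpful only on the last turn

print(sum_score(fast), sum_score(slow))
print(discounted_score(fast), discounted_score(slow))
```

The plain sum cannot tell the two dialogues apart, while the discounted score ranks the faster one higher, so the scoring rule alone changes which behavior the policy is pushed toward.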

cs.LG [Back]

[107] VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models cs.LG | cs.CLPDF

Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang

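TL;DR: VCRL is a variance-based curriculum reinforcement learning framework that dynamically adjusts the difficulty of training samples to improve large language model (LLM) performance on mathematical reasoning tasks.

Details

Motivation: Existing rollout-based reinforcement learning methods (such as GRPO, DAPO, and GSPO) do not explicitly account for how well an LLM can learn from samples of different difficulty, whereas human cognition proceeds from easy to difficult.

Result: Experiments on five mathematical benchmarks and two models show that VCRL outperforms existing LLM reinforcement learning baselines.

Insight: Sample difficulty can be quantified by the variance of the reward, and dynamically adjusting difficulty improves LLM performance more efficiently.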

Abstract: Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs' learning ability for samples of different difficulty levels, which runs contrary to the human cognitive process of working through mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group's reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have a lower variance, while samples with moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over the current LLM RL baselines.
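The variance observation can be checked with a few lines of arithmetic. For binary verifiable rewards with success rate p, the group variance is p(1 - p), which is zero when every rollout succeeds or fails and peaks at 0.25 when half succeed. The sketch below illustrates that relationship. The threshold filter `select_moderate` and the sample groups are hypothetical and are not taken from the VCRL paper, which does not describe its control rule in the abstract.

```python
import statistics

def reward_variance(rewards):
    """Population variance of the rewards from one prompt's rollout group."""
    return statistics.pvariance(rewards)

def select_moderate(groups, min_variance):
    """Hypothetical filter: keep prompts whose reward variance reaches a threshold."""
    return [name for name, rewards in groups.items()
            if reward_variance(rewards) >= min_variance]

# Eight rollouts per prompt, scored 1.0 (correct) or 0.0 (incorrect).
groups = {
    "too_easy": [1.0] * 8,       # always solved
    "moderate": [1.0, 0.0] * 4,  # solved half the time
    "too_hard": [0.0] * 8,       # never solved
}
variances = {name: reward_variance(r) for name, r in groups.items()}
selected = select_moderate(groups, min_variance=0.1)
```

Here only the `moderate` prompt passes the filter, since the other two groups have zero variance and therefore carry no group-relative learning signal.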

[108] PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning cs.LG | cs.CLPDF

Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong

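TL;DR: PromptCoT 2.0 is a scalable framework that optimizes prompt synthesis with an EM loop, generating harder and more diverse training problems and significantly improving the reasoning ability of large language models.

Details

Motivation: A shortage of high-quality training problems limits improvements in LLM reasoning. PromptCoT 2.0 addresses this by automatically synthesizing more challenging prompts.

Result: The method delivers significant gains on Qwen3-30B-A3B-Thinking-2507 at the 30B scale, and Qwen2.5-7B-Instruct trained with SFT also surpasses models trained on human data.

Insight: Prompt synthesis can become a new axis for scaling LLM reasoning, and PromptCoT 2.0 provides a scalable foundation for open-source models.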

Abstract: Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.

cs.RO [Back]

Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

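TL;DR: HUNT is a real-time framework that uses instantaneous relative frames to enable high-speed UAV navigation and target tracking in unstructured environments, addressing the absence of global localization.

Details

Motivation: Search and rescue missions require UAVs to fly at high speed through unknown, unstructured environments and to track targets. Achieving both capabilities under degraded sensing and without global localization remains an open problem.

Result: Experiments in dense forests, container compounds, and search-and-rescue missions show that HUNT remains robust in scenarios where global methods fail.

Insight: Through a relative navigation paradigm, HUNT offers a viable solution for high-speed UAV missions that lack global localization.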

Abstract: Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

[110] Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action cs.RO | cs.CVPDF

Sacha Morin, Kumaraditya Gupta, Mahtab Sandhu, Charlie Gauthier, Francesco Argenziano

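TL;DR: ASP is a language-conditioned robot policy framework built on modern scene representations. By explicitly reasoning about object affordances and the spatial semantics of the scene, it supports zero-shot open-vocabulary queries and the execution of complex instructions.

Details

Motivation: End-to-end policy models struggle with complex instructions and new scenes. The paper proposes an explicit scene representation that serves as a queryable interface between the robot and the world and guides motion planning.

Result: In experiments, ASP outperforms VLAs on tabletop manipulation and handles room-level queries through affordance-guided navigation and a scaled-up scene representation.

Insight: Combining explicit scene representations with affordance reasoning can significantly improve a robot's ability to handle complex instructions and new scenes.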

Abstract: Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-action models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)

[111] EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data cs.RO | cs.CV | cs.LGPDF

Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Y. Zhu

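TL;DR: EgoBridge is a domain adaptation framework that aligns the latent spaces of human and robot data, transferring knowledge from egocentric human data to robot imitation learning and significantly improving task success rates.

Details

Motivation: Egocentric human data is a rich resource for robot imitation learning, but domain gaps in visual appearance, sensors, and kinematics make direct transfer ineffective. EgoBridge aims to close this gap.

Result: On three real-world tasks, EgoBridge improves the success rate by 44% (absolute) over baselines and generalizes to new objects and scenes that appear only in human data.

Insight: By aligning domains while preserving action information, EgoBridge demonstrates the potential of extracting knowledge from human data for robot tasks.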

Abstract: Egocentric human experience data presents a vast resource for scaling up end-to-end imitation learning for robotic manipulation. However, significant domain gaps in visual appearance, sensor modalities, and kinematics between humans and robots impede knowledge transfer. This paper presents EgoBridge, a unified co-training framework that explicitly aligns the policy latent spaces between human and robot data using domain adaptation. Through a measure of discrepancy on the joint policy latent features and actions based on Optimal Transport (OT), we learn observation representations that not only align between the human and robot domains but also preserve the action-relevant information critical for policy learning. EgoBridge achieves a significant absolute policy success rate improvement of 44% over human-augmented cross-embodiment baselines in three real-world single-arm and bimanual manipulation tasks. EgoBridge also generalizes to new objects, scenes, and tasks seen only in human data, where baselines fail entirely. Videos and additional information can be found at https://ego-bridge.github.io

eess.IV [Back]

[112] Frequency-Aware Ensemble Learning for BraTS 2025 Pediatric Brain Tumor Segmentation eess.IV | cs.CVPDF

Yuxiao Yi, Qingyao Zhuang, Zhi-Qin John Xu

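TL;DR: To address the particular challenges of pediatric brain tumor segmentation, the paper proposes an ensemble of nnU-Net, Swin UNETR, and HFF-Net. It improves performance through adjustable initialization scales, transfer learning, and frequency-domain decomposition, and reports strong segmentation results on the BraTS-PED 2025 challenge.

Details

Motivation: Pediatric brain tumors are rare and heterogeneous, which makes segmentation especially challenging, and clinical diagnosis and treatment planning urgently need effective segmentation methods.

Result: On the BraTS-PED 2025 dataset, the method achieves Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT).

Insight: Frequency-domain decomposition and model ensembling offer a new approach to pediatric brain tumor segmentation, and transfer learning and initialization tuning can significantly improve performance on small datasets.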

Abstract: Pediatric brain tumor segmentation presents unique challenges due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning. We propose an ensemble approach integrating nnU-Net, Swin UNETR, and HFF-Net for the BraTS-PED 2025 challenge. Our method incorporates three key extensions: adjustable initialization scales for optimal nnU-Net complexity control, transfer learning from BraTS 2021 pre-trained models to enhance Swin UNETR's generalization on the pediatric dataset, and frequency domain decomposition for HFF-Net to separate low-frequency tissue contours from high-frequency texture details. Our final ensemble combines nnU-Net ($\gamma=0.7$), fine-tuned Swin UNETR, and HFF-Net, achieving Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT), respectively.
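For readers unfamiliar with the metric, the Dice score for binary masks is 2|P ∩ T| / (|P| + |T|). The sketch below computes it on flattened toy masks and combines three model outputs by majority vote. The voting rule is an assumption for illustration, since the abstract does not state how the three networks are fused, and the masks are invented.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

def majority_vote(masks):
    """Assumed fusion rule: a voxel is foreground if most models say so."""
    n = len(masks)
    return [1 if sum(votes) * 2 > n else 0 for votes in zip(*masks)]

target = [1, 1, 1, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0]  # misses one foreground voxel
model_b = [1, 1, 1, 1, 0, 0]  # one false positive
model_c = [0, 1, 1, 0, 0, 1]  # one miss and one false positive

ensemble = majority_vote([model_a, model_b, model_c])
individual = [dice_score(m, target) for m in (model_a, model_b, model_c)]
combined = dice_score(ensemble, target)
```

In this toy case the three models make different mistakes, so the vote recovers the target exactly, while each individual mask scores below 1.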

[113] Ensuring Reliable Participation in Subjective Video Quality Tests Across Platforms eess.IV | cs.CV | cs.MMPDF

Babak Naderi, Ross Cutler

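TL;DR: The paper examines how to ensure participant reliability across platforms in subjective video quality tests, proposes methods for detecting remote-desktop users, and compares two mainstream crowdsourcing platforms on their susceptibility and mitigation measures.

Details

Motivation: Subjective video quality assessment is the gold standard for measuring end-user experience, but in crowdsourced tests participants may provide unreliable data by ignoring instructions or gaming rewards, particularly through remote-desktop connections or by exploiting video metadata.

Result: The study shows that remote-desktop use biases test results and proposes effective detection methods.

Insight: Detecting and mitigating unreliable participant behavior can significantly improve the accuracy and reliability of subjective video quality assessment.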

Abstract: Subjective video quality assessment (VQA) is the gold standard for measuring end-user experience across communication, streaming, and UGC pipelines. Beyond high-validity lab studies, crowdsourcing offers accurate, reliable, faster, and cheaper evaluation, but suffers from unreliable submissions by workers who ignore instructions or game rewards. Recent tests reveal sophisticated exploits of video metadata and rising use of remote-desktop (RD) connections, both of which bias results. We propose objective and subjective detectors for RD users and compare two mainstream crowdsourcing platforms on their susceptibility and mitigation under realistic test conditions and task designs.

eess.AS [Back]

Shaoshi Ling, Gang Liu, Guoli Ye, Jinyu Li

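TL;DR: The paper proposes a multi-stage reinforcement learning training framework that significantly improves multimodal large language models (MLLMs) on speech summarization, narrowing the gap with text-only LLMs.

Details

Motivation: With the rapid growth of spoken and audiovisual data, speech summarization has become a key technology for understanding spoken content. Although MLLMs can generate text summaries directly from speech, their performance still lags behind text-only LLMs, which limits practical use.

Result: The model surpasses baseline methods on speech summarization, even outperforms larger MLLMs, and significantly narrows the gap with text-only LLMs.

Insight: Optimizing MLLMs for speech summarization with reinforcement learning demonstrates the importance of fusing speech and text in multimodal models and shows the potential of reinforcement learning for generation tasks.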

Abstract: Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.

cs.DB [Back]

[115] STARQA: A Question Answering Dataset for Complex Analytical Reasoning over Structured Databases cs.DB | cs.CLPDF

Mounica Maddela, Lingjue Xie, Daniel Preotiuc-Pietro, Mausam

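TL;DR: STARQA is the first public dataset for complex analytical reasoning questions. It focuses on question answering tasks that require operations such as aggregate analytics and time series analysis, and performance improves with an approach that combines SQL and Python.

Details

Motivation: The question complexity in existing text-to-SQL benchmarks is limited by the expressiveness of query languages, and they pay little attention to complex analytical reasoning. STARQA fills this gap.

Result: The approach outperforms SQL alone, but the dataset remains challenging for current state-of-the-art LLMs.

Insight: Combining SQL and Python handles complex analytical tasks more naturally, highlighting the potential of cross-language collaboration.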

Abstract: Semantic parsing methods for converting text to SQL queries enable question answering over structured data and can greatly benefit analysts who routinely perform complex analytics on vast data stored in specialized relational databases. Although several benchmarks measure text-to-SQL abilities, the complexity of their questions is inherently limited by the level of expressiveness in query languages, and none focus explicitly on questions involving complex analytical reasoning, which require operations such as calculations over aggregate analytics, time series analysis, or scenario understanding. In this paper, we introduce STARQA, the first public human-created dataset of complex analytical reasoning questions and answers on three specialized-domain databases. In addition to generating SQL directly using LLMs, we evaluate a novel approach (Text2SQLCode) that decomposes the task into a combination of SQL and Python: SQL is responsible for data fetching, and Python more naturally performs reasoning. Our results demonstrate that identifying and combining the abilities of SQL and Python is beneficial compared to using SQL alone, yet the dataset remains quite challenging for the existing state-of-the-art LLMs.
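The division of labor described for Text2SQLCode, with SQL fetching the data and Python doing the reasoning, can be sketched as follows. This is only an illustration of the decomposition idea, not the authors' pipeline. The `sales` table, its values, and the question ("which month had the worst month-over-month revenue growth?") are hypothetical.

```python
import sqlite3

# Hypothetical in-memory database standing in for a specialized-domain one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 110.0), ("2024-03", 121.0), ("2024-04", 90.0)],
)

# Step 1 (SQL): fetch only the rows the question needs, in time order.
rows = conn.execute("SELECT month, revenue FROM sales ORDER BY month").fetchall()

# Step 2 (Python): time series reasoning that is awkward to express in plain SQL.
growth = [
    (cur_month, (cur_rev - prev_rev) / prev_rev)
    for (_, prev_rev), (cur_month, cur_rev) in zip(rows, rows[1:])
]
worst_month, worst_growth = min(growth, key=lambda g: g[1])
conn.close()
```

Keeping the SQL step to a simple projection leaves the growth calculation in ordinary Python, where intermediate values are easy to inspect.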

Table of Contents

cs.CV [Back]

[1] Vision-Based Perception for Autonomous Vehicles in Off-Road Environment Using Deep Learning cs.CV | cs.AR | cs.LG | eess.IV | eess.SPPDF

[2] Overview of LifeCLEF Plant Identification task 2020 cs.CVPDF

[3] iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning cs.CVPDF

[4] CURE: Centroid-guided Unsupervised Representation Erasure for Facial Recognition Systems cs.CVPDF

[5] Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG cs.CVPDF

[6] The Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar cs.CV | cs.ROPDF

[7] Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment cs.CVPDF

[8] MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning cs.CV | cs.AIPDF

[9] Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies cs.CVPDF

[10] From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition cs.CVPDF

[11] Anatomically Constrained Transformers for Cardiac Amyloidosis Classification cs.CVPDF

[12] Learning to Stop: Reinforcement Learning for Efficient Patient-Level Echocardiographic Classification cs.CVPDF

[13] Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation cs.CVPDF

[14] PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction cs.CVPDF

[15] CAMILA: Context-Aware Masking for Image Editing with Language Alignment cs.CVPDF

[16] Robust RGB-T Tracking via Learnable Visual Fourier Prompt Fine-tuning and Modality Fusion Prompt Generation cs.CVPDF

[17] Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation cs.CVPDF

[18] Talking Head Generation via AU-Guided Landmark Prediction cs.CVPDF

[19] ExpFace: Exponential Angular Margin Loss for Deep Face Recognition cs.CV | cs.AIPDF

[20] Logics-Parsing Technical Report cs.CVPDF

[21] Sex-based Bias Inherent in the Dice Similarity Coefficient: A Model Independent Analysis for Multiple Anatomical Structures cs.CV | J.3PDF

[22] EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction cs.CVPDF

[23] BiTAA: A Bi-Task Adversarial Attack for Object Detection and Depth Estimation via 3D Gaussian Splatting cs.CVPDF

[24] StrCGAN: A Generative Framework for Stellar Image Restoration cs.CV | astro-ph.IM | astro-ph.SRPDF

[25] ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection cs.CVPDF

[26] PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents cs.CV | cs.ROPDF

[27] FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models cs.CVPDF

[28] Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection cs.CV | cs.AIPDF

[29] Generalized Shortest Path-based Superpixels for 3D Spherical Image Segmentation cs.CVPDF

[30] Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network cs.CVPDF

[31] Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering cs.CVPDF

[32] CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation cs.CVPDF

[33] GS-RoadPatching: Inpainting Gaussians via 3D Searching and Placing for Driving Scenes cs.CVPDF

[34] When Words Can’t Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset cs.CV | cs.AIPDF

[35] SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding cs.CVPDF

[36] OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving cs.CVPDF

[37] CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion cs.CVPDF

[38] SDE-DET: A Precision Network for Shatian Pomelo Detection in Complex Orchard Environments cs.CV | cs.AIPDF

[39] Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models cs.CVPDF

[40] Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture cs.CV | cs.LGPDF

[41] Table Detection with Active Learning cs.CV | cs.AI | cs.CL | cs.LGPDF

[42] Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression cs.CVPDF

[43] PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction cs.CVPDF

[44] Predictive Quality Assessment for Mobile Secure Graphics cs.CV | cs.LG | I.2.10; I.4.8PDF

[45] SHMoAReg: Spark Deformable Image Registration via Spatial Heterogeneous Mixture of Experts and Attention Heads cs.CVPDF

[46] Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing cs.CVPDF

[47] Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models cs.CV | cs.AI | cs.LG | cs.ROPDF

[48] A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA cs.CVPDF

[49] EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models cs.CV | cs.AIPDF

[50] C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis cs.CVPDF

[51] U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT cs.CV | cs.AIPDF

[52] Optical Ocean Recipes: Creating Realistic Datasets to Facilitate Underwater Vision Research cs.CVPDF

[53] Universal Camouflage Attack on Vision-Language Models for Autonomous Driving cs.CV | cs.LGPDF

[54] PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation cs.CVPDF

[55] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression cs.CV | cs.AI | cs.LGPDF

[56] An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation cs.CVPDF

[57] 4D Driving Scene Generation With Stereo Forcing cs.CVPDF

[58] A co-evolving agentic AI system for medical imaging analysis cs.CV | q-bio.QMPDF

[59] HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy cs.CVPDF

[60] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation cs.CVPDF

[61] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning cs.CVPDF

cs.CL [Back]

[62] FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering cs.CL | cs.AIPDF

[63] Readme_AI: Dynamic Context Construction for Large Language Models cs.CL | cs.AIPDF

[64] Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers cs.CL | cs.AIPDF

[65] How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment cs.CL | stat.MEPDF

[66] Quantifying Compositionality of Classic and State-of-the-Art Embeddings cs.CL | cs.AIPDF

[67] Pluralistic Off-policy Evaluation and Alignment cs.CL | cs.AIPDF

[68] SCORE: A Semantic Evaluation Framework for Generative Document Parsing cs.CL | cs.AIPDF

[69] Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches cs.CLPDF

[70] ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL | cs.LGPDF

[71] TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities cs.CL | cs.AIPDF

[72] RoadMind: Towards a Geospatial AI Expert for Disaster Response cs.CL | cs.AIPDF

[73] Benchmarking and Improving LLM Robustness for Personalized Generation cs.CL | cs.AIPDF

[74] Semantic Representation Attack against Aligned Large Language Models cs.CL | cs.AIPDF

[75] Meow: End-to-End Outline Writing for Automatic Academic Survey cs.CL | cs.AIPDF

[76] How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models cs.CL | cs.AIPDF

[77] Do LLMs Encode Frame Semantics? Evidence from Frame Identification cs.CLPDF