Table of Contents
- cs.CV [Total: 88]
- cs.CL [Total: 69]
- q-bio.QM [Total: 1]
- cs.LG [Total: 9]
- cs.RO [Total: 4]
- cs.AI [Total: 13]
- eess.IV [Total: 2]
cs.CV
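[1] MultiFoodhat: A potential new paradigm for intelligent food quality inspection cs.CV
Yue Hu, Guohang Zhuang
TL;DR: MultiFoodChat is a dialogue-driven multi-agent reasoning framework that combines vision-language models and large language models for zero-shot food recognition, removing the dependence on large labeled datasets.
Details
Motivation: Existing supervised models depend on large amounts of labeled data and generalize poorly to unseen food categories, so a more flexible food recognition method that needs no additional training is required.
Result: On multiple public food datasets, the framework shows higher recognition accuracy and interpretability than existing unsupervised and few-shot methods.
Insight: A multi-agent reasoning framework that combines VLMs and LLMs offers a new approach to intelligent food quality inspection and reduces reliance on labeled data.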
Abstract: Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.
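[2] Post-surgical Endometriosis Segmentation in Laparoscopic Videos cs.CV | cs.LG | cs.MM
Andreas Leibetseder, Klaus Schoeffmann, Jörg Keckstein, Simon Keckstein
TL;DR: The paper describes a system that segments endometriosis, specifically dark endometrial implants, in laparoscopic videos to help gynecologists identify and annotate lesions more accurately.
Details
Motivation: Endometriosis has a highly varied visual appearance in laparoscopic videos, which makes identification difficult and error-prone. The paper presents an automated segmentation system to assist gynecologists treating the condition.
Result: The system identifies implant regions in the videos, annotates them with multi-colored overlays, and shows a detection summary, giving physicians a more intuitive diagnostic aid.
Insight: Automated segmentation could improve the efficiency and accuracy with which gynecologists identify endometriosis in complex video data.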
Abstract: Endometriosis is a common women’s condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.
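[3] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models cs.CV | cs.AI | cs.LG
Jia Yun Chua, Argyrios Zolotas, Miguel Arana-Catania
TL;DR: The paper studies combining traditional vision models with vision-language models (VLMs) for remote sensing image analysis, particularly in few-shot settings, and reports better accuracy and contextual awareness in aircraft detection and scene understanding.
Details
Motivation: Remote sensing produces huge data volumes, but traditional vision models need extensive labeled data and lack contextual understanding. VLMs combine visual and textual data, yet their use in remote sensing remains underexplored. The paper aims to combine the strengths of both to make remote sensing image analysis more efficient and accurate.
Result: For aircraft detection and counting, MAE improves by 48.46% on average across models, and CLIPScore improves by 6.17%, indicating that the approach holds up in challenging conditions.
Insight: Fusing vision and language models offers a new approach to remote sensing image analysis, is particularly suited to few-shot settings and degraded images, and shows the potential of multimodal methods.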
Abstract: Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
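[4] Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer cs.CV | cs.AI
Kelvin Szolnoky, Anders Blilie, Nita Mulliqi, Toyonori Tsuzuki, Hemamali Samaratunga
TL;DR: The paper presents an AI model for detecting cribriform morphology in prostate cancer that reaches pathologist-level performance and could improve diagnostic reliability and consistency.
Details
Motivation: Cribriform morphology is a key histological feature associated with poor prognosis in prostate cancer, but its detection varies considerably between pathologists.
Result: Internal validation gives an AUC of 0.97 and a Cohen’s kappa of 0.81; external validation gives an AUC of 0.90 and a Cohen’s kappa of 0.55. In the inter-rater analysis the model reached the highest average agreement (kappa 0.66), above all nine pathologists (0.35 to 0.62).
Insight: An AI model can make cribriform morphology detection more consistent, giving a more reliable basis for prostate cancer diagnosis and treatment decisions.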
Abstract: Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model’s performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen’s kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen’s kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen’s kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen’s kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.
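The agreement figures above are Cohen's kappa values, which correct raw agreement for the agreement expected by chance. As a rough illustration of the statistic itself (not of the paper's evaluation code), the sketch below computes kappa for two raters; the function name and the toy labels are made up for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels (hypothetical): 1 = cribriform present, 0 = absent.
model_calls = [1, 1, 0, 0, 1, 0, 1, 0]
pathologist_calls = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(model_calls, pathologist_calls)
```

In the toy data the raters agree on 6 of 8 slides (0.75) while chance agreement is 0.5, so kappa is 0.5, which shows why kappa can be much lower than raw agreement.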
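[5] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations cs.CV
Junjie Nan, Jianing Li, Wei Chen, Mingkun Zhang, Xueqi Cheng
TL;DR: NAPPure is an adversarial purification framework for non-additive perturbations that disentangles the clean image from the perturbation parameters through likelihood maximization, significantly improving the robustness of image classifiers under such perturbations.
Details
Motivation: Existing adversarial purification methods are designed mainly for additive perturbations and are much less effective against non-additive ones such as blur, occlusion, and distortion. NAPPure is proposed to address this.
Result: Experiments on GTSRB and CIFAR-10 show that NAPPure significantly boosts model robustness against non-additive perturbations.
Insight: Non-additive perturbations are common in real-world settings, so general adversarial purification methods need to be extended to cover more perturbation types.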
Abstract: Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.
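[6] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding cs.CV
Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny
TL;DR: Vgent is a graph-based retrieval-reasoning-augmented generation framework for long video understanding; structured graphs and an intermediate reasoning step improve the performance of large video language models.
Details
Motivation: Long video understanding and reasoning challenge large video language models (LVLMs), mainly in processing sequences longer than the context window and in retaining temporal dependencies. Existing retrieval-augmented generation (RAG) methods, when applied to long video, disrupt temporal dependencies and pull in irrelevant information.
Result: Evaluated on three long-video understanding benchmarks, Vgent improves over base models by 3.0% to 5.4% on MLVU and outperforms state-of-the-art video RAG methods by 8.6%.
Insight: Combining structured graphs with intermediate reasoning improves long video understanding, especially by preserving temporal dependencies and reducing interference from irrelevant information.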
Abstract: Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond the context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0% to 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.
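[7] Synchronization of Multiple Videos cs.CV
Avihai Naaman, Ron Shapira Weber, Oren Freifeld
TL;DR: The paper proposes Temporal Prototype Learning (TPL), a prototype-based framework for synchronizing multiple videos, particularly videos from different scenes or produced by generative AI. It aligns videos by building a shared 1D prototype sequence, which avoids exhaustive pairwise matching and improves synchronization accuracy and efficiency.
Details
Motivation: Existing multi-video synchronization usually targets multi-camera footage of the same scene. Videos from different scenes or from generative AI are far harder to synchronize because of differing subjects, backgrounds, and nonlinear temporal misalignment.
Result: Experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks.
Insight: A shared 1D prototype sequence is an efficient and robust way to align videos, especially under nonlinear temporal misalignment.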
Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/
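[8] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images cs.CV | cs.GR
Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman
TL;DR: The paper proposes a zero-shot method that creates highly realistic, identity-preserving 3D Gaussian splatting avatars from unstructured phone images.
Details
Motivation: Single-view methods suffer from geometric inconsistencies and hallucinations that degrade identity preservation, while models trained on synthetic data miss high-frequency details such as skin wrinkles and fine hair.
Result: The pipeline produces static quarter-body 3D avatars with compelling realism and robust identity preservation.
Insight: Canonicalizing multiple views and training on a high-fidelity dataset can overcome the limitations of existing methods and yield high-quality 3D avatars.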
Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
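[9] cubic: CUDA-accelerated 3D Bioimage Computing cs.CV | q-bio.QM | 92C55, 68U10 | I.4.0; J.3
Alexandr A. Kalinin, Anne E. Carpenter, Shantanu Singh, Matthew J. O’Meara
TL;DR: The paper introduces cubic, an open-source Python library that augments the SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM, addressing the shortcomings of bioimage analysis tools in scalability, efficiency, and integration with modern scientific computing workflows.
Details
Motivation: Modern microscopy generates ever-larger datasets, and existing bioimage analysis approaches are limited in scalability, efficiency, and integration with modern scientific computing workflows.
Result: Experiments show substantial speedups for pipelines such as deconvolution and segmentation while maintaining algorithmic fidelity.
Insight: cubic provides a scalable, reproducible foundation for bioimage analysis that integrates with the modern Python scientific computing ecosystem.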
Abstract: Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programming interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic’s API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github.com/alxndrkalinin/cubic
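The device-agnostic dispatch described in the abstract can be illustrated with a small, generic pattern: inspect where the array lives and pick NumPy or CuPy accordingly. The sketch below is not cubic's API; `array_module` and `rescale_intensity` are hypothetical helpers written for this example, and CuPy is treated as optional so the code still runs on a CPU-only machine.

```python
import numpy as np

try:  # CuPy is optional; without it everything runs on the CPU.
    import cupy as cp
except ImportError:
    cp = None


def array_module(array):
    """Return cupy for GPU-resident arrays, numpy otherwise (hypothetical helper)."""
    if cp is not None and isinstance(array, cp.ndarray):
        return cp
    return np


def rescale_intensity(image):
    """Rescale an image or volume to [0, 1] on whichever device holds it."""
    xp = array_module(image)
    image = image.astype(xp.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return xp.zeros_like(image)  # constant image: nothing to rescale
    return (image - lo) / (hi - lo)


# A tiny 3D volume on the CPU; the same call would stay on the GPU
# if `volume` were a cupy.ndarray.
volume = np.arange(8, dtype=np.uint16).reshape(2, 2, 2)
rescaled = rescale_intensity(volume)
```

The caller never names a device: the result is returned on the device that held the input, which is the behavior the abstract describes.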
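[10] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures cs.CV | cs.AI
Yuancheng Xu, Wenqi Xian, Li Ma, Julien Philip, Ahmet Levent Taşel
TL;DR: The paper proposes a framework that achieves multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline, strengthening video generation for virtual production.
Details
Motivation: Existing video diffusion models are limited in multi-view character consistency and 3D camera control, which restricts their use in virtual production. The paper aims to address this.
Result: Experiments show improved video quality, personalization accuracy, camera control, and lighting adaptability, advancing the integration of video generation into virtual production.
Insight: Training on data re-rendered with 4D Gaussian Splatting and augmented with lighting variation can substantially improve multi-view consistency and controllability in video diffusion models, giving virtual production a new tool.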
Abstract: We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability obtained from a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.
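[11] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition cs.CV | cs.CL | cs.MM
Ryo Masumura, Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Naoki Makishima
TL;DR: The paper proposes jointly modeling the Big Five and HEXACO to automatically recognize apparent personality traits from multimodal human behavior, filling a gap left by prior work that ignored HEXACO, and validates the approach experimentally.
Details
Motivation: Prior work on multimodal apparent personality-trait recognition mainly uses the Big Five and has not addressed HEXACO (in particular its Honesty-Humility trait) or its relationship to the Big Five. Joint modeling is expected to improve understanding of human behavior.
Result: Experiments on a self-introduction video dataset show that the method effectively recognizes both Big Five and HEXACO traits.
Insight: Jointly modeling the Big Five and HEXACO can give psychology and behavior analysis a more complete view, particularly for research on the Honesty-Humility trait.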
Abstract: This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.
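[12] LOTA: Bit-Planes Guided AI-Generated Image Detection cs.CV
Hongsong Wang, Renxi Cheng, Yang Zhang, Chaolei Han, Jie Gui
TL;DR: The paper proposes LOTA, a bit-plane-based method for detecting AI-generated images that extracts noise features and uses an efficient classifier, improving both detection speed and accuracy.
Details
Motivation: Rapid progress in GAN and diffusion models makes it harder to distinguish AI-generated images from real ones. Existing methods are computationally expensive and fail to capture the intrinsic noise features of raw images.
Result: On the GenImage benchmark, average accuracy reaches 98.9% with excellent cross-generator generalization, and error extraction runs nearly a hundred times faster than existing methods.
Insight: Bit planes effectively characterize image noise, and a lightweight design greatly improves computational efficiency while keeping accuracy high.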
Abstract: The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (11.9% ↑) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at https://github.com/hongsong-wang/LOTA.
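For readers unfamiliar with bit planes, the decomposition the abstract builds on is easy to show with NumPy: each 8-bit pixel splits into eight binary planes, and the lowest planes carry mostly noise-like detail. The sketch below only illustrates that decomposition plus a simple scaling step; it is not the LOTA pipeline, and the function names and the choice of two low bits are assumptions made for this example.

```python
import numpy as np


def bit_plane(image, bit):
    """Return binary plane `bit` (0 = least significant) of a uint8 image."""
    if image.dtype != np.uint8:
        raise TypeError("expected a uint8 image")
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    return (image >> bit) & 1


def low_bit_noise_map(image, num_low_bits=2):
    """Keep only the lowest bits and stretch them to the 0..255 range."""
    if image.dtype != np.uint8:
        raise TypeError("expected a uint8 image")
    if not 1 <= num_low_bits <= 8:
        raise ValueError("num_low_bits must be in 1..8")
    mask = (1 << num_low_bits) - 1   # e.g. 0b11 for two bits
    low = image & mask               # discard the high-order bits
    scale = 255 // mask              # largest kept value maps to 255
    return (low * scale).astype(np.uint8)


# Two toy pixels: 0b10110101 and 0b00000010.
pixels = np.array([[0b10110101, 0b00000010]], dtype=np.uint8)
plane0 = bit_plane(pixels, 0)
noise = low_bit_noise_map(pixels, num_low_bits=2)
```

The scaling step corresponds loosely to the "scaling" normalization mentioned in the abstract; the thresholding and gradient-based patch selection are not reproduced here.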
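[13] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis cs.CV
Soumyya Kanti Datta, Tanvi Ranga, Chengzhe Sun, Siwei Lyu
TL;DR: PIA is a multimodal audio-visual framework that combines language, dynamic face motion, and facial identity cues to improve detection of deepfakes produced by modern generative models.
Details
Motivation: Conventional deepfake detectors rely on hand-tuned phoneme-viseme alignment thresholds or unimodal strategies and struggle with the near-perfect frames produced by modern generative models such as GANs and diffusion models. PIA aims to address this through multimodal analysis.
Result: By analyzing multiple modalities, PIA captures subtle temporal discrepancies and identity-dynamic inconsistencies that traditional methods overlook.
Insight: Multimodal methods hold considerable promise for deepfake detection; combining temporal dynamics with identity information in particular can improve detection accuracy.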
Abstract: The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA
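[14] Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication cs.CV
Miu Sumino, Mayu Ishii, Shun Kaizu, Daisuke Hisano, Yu Nakayama
TL;DR: The paper proposes event interval modulation (EIM), a new modulation scheme designed for event-based optical camera communication (OCC) that encodes information in the intervals between events to raise transmission speed, and demonstrates it experimentally in an indoor environment.
Details
Motivation: Conventional frame-based OCC systems suffer from low bit rates and high processing load, and existing event-based OCC systems do not fully exploit the characteristics of the event-based vision sensor (EVS).
Result: Transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters was achieved indoors, setting a new bit-rate benchmark for event-based OCC systems.
Insight: EIM exploits the asynchronous operation of the EVS, providing an efficient modulation method for event-based OCC and showing the potential of the EVS for high-speed communication.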
Abstract: Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC systems utilizing an event-based vision sensor (EVS) as the receiver have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied; however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
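The core idea of carrying information in the gaps between events can be sketched in a few lines. The toy encoder below maps each symbol to an inter-event interval of `base_us + symbol * step_us`, and the decoder inverts that mapping with rounding so small timing jitter is tolerated. The interval values, the modulation order, and the function names are assumptions for illustration; the paper's actual parameters and signal model are not given in the abstract.

```python
def eim_encode(symbols, base_us=10.0, step_us=2.0, order=4):
    """Turn symbols in [0, order) into event timestamps (microseconds).

    The information sits in the gap between consecutive events.
    """
    t = 0.0
    times = [t]  # reference event that opens the first interval
    for s in symbols:
        if not 0 <= s < order:
            raise ValueError("symbol out of range for this modulation order")
        t += base_us + s * step_us
        times.append(t)
    return times


def eim_decode(times, base_us=10.0, step_us=2.0, order=4):
    """Recover symbols from event timestamps by quantizing each interval."""
    symbols = []
    for prev, cur in zip(times, times[1:]):
        s = round((cur - prev - base_us) / step_us)
        symbols.append(min(max(s, 0), order - 1))  # clamp to the valid range
    return symbols


sent = [0, 3, 1, 2]
timestamps = eim_encode(sent)      # [0.0, 10.0, 26.0, 38.0, 52.0]
received = eim_decode(timestamps)  # same symbols back
```

With `order=4` each interval carries two bits, so raising the modulation order increases bits per event at the cost of finer interval spacing, which is presumably why the maximum usable order has to be determined experimentally.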
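[15] MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering cs.CV
Mingkai Liu, Dikai Fan, Haohua Que, Haojia Gao, Xiao Liu
TL;DR: MACE uses a mixture-of-experts (MoE) design and an ALF-LB strategy to reduce the computational cost of large-scale scene localization and rendering, achieving efficient localization and high-quality rendering.
Details
Motivation: Localization and rendering in large-scale scenes are computationally expensive, and existing scene coordinate regression (SCR) methods perform well on small scenes but scale poorly to large ones.
Result: On the Cambridge test set, the method reaches high-quality rendering with only 10 minutes of training, at significantly reduced cost.
Insight: The MoE structure works well for large-scale scenes, and the ALF-LB strategy further improves localization accuracy.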
Abstract: Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MoE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furthermore, we present an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to enhance the localization accuracy in large-scale scenes. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
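The gating behavior described above, where a gate scores the sub-networks and only one is evaluated per inference, can be sketched as top-1 routing. The example below is a generic illustration under simplifying assumptions: the experts are plain linear maps, weights are random, and the class name `Top1MoE` is invented here. It does not reproduce MACE's networks or its ALF-LB load balancing.

```python
import numpy as np


class Top1MoE:
    """Toy mixture of experts that activates exactly one expert per input."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Gating weights: one score column per expert.
        self.gate = rng.normal(size=(in_dim, num_experts))
        # Each expert is a single linear map standing in for a sub-network.
        self.experts = [rng.normal(size=(in_dim, out_dim))
                        for _ in range(num_experts)]

    def route(self, x):
        """Index of the highest-scoring expert for feature vector x."""
        return int(np.argmax(x @ self.gate))

    def forward(self, x):
        """Evaluate only the selected expert; the others are never touched."""
        k = self.route(x)
        return k, x @ self.experts[k]


moe = Top1MoE(in_dim=8, out_dim=3, num_experts=4)
features = np.ones(8)
expert_index, coords = moe.forward(features)
```

Because only one expert runs per input, inference cost stays close to that of a single sub-network while total capacity grows with the number of experts.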
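[16] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization cs.CV
Liao Shen, Wentao Jiang, Yiran Zhu, Tiezheng Ge, Zhiguo Cao
TL;DR: The paper proposes IPRO, a reinforcement-learning-based method for identity-preserving image-to-video generation that optimizes diffusion models under reward guidance to address the identity-consistency problem in existing methods.
Details
Motivation: Existing image-to-video methods struggle to maintain a person's identity, especially under large changes in expression and motion. Because humans are highly sensitive to identity changes, this challenge needs to be addressed.
Result: Experiments on the Wan 2.2 I2V model and an in-house I2V model demonstrate IPRO's effectiveness at improving identity consistency in generated videos.
Insight: Reinforcement learning can optimize diffusion models directly, without auxiliary modules or architectural changes; identity consistency remains a pressing problem in I2V.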
Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.
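[17] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning cs.CV
Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin
TL;DR: Identity-GRPO uses reinforcement learning to optimize multi-human identity-preserving video generation. Combining a reward model trained on human feedback with a GRPO variant, it improves both VACE and Phantom and raises human consistency metrics by up to 18.9%.
Details
Motivation: Existing methods such as VACE and Phantom struggle to preserve multiple identities in dynamic interactions, so a new technique is needed to optimize identity consistency.
Result: Experiments show that Identity-GRPO improves human consistency metrics by up to 18.9% over baseline methods.
Insight: Combining human feedback with reinforcement learning is an effective way to optimize personalized video generation, especially for multi-human interaction scenes.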
Abstract: While recent methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
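GRPO-style methods score several samples generated for the same prompt and normalize each reward against its own group, which removes the need for a separate value network. The sketch below shows that generic group-relative advantage computation only; the paper uses a variant tailored to multi-human consistency whose details are not in the abstract, and the reward values here are invented for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division defined when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for three videos sampled from one prompt.
scores = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(scores)
```

Samples scoring above the group mean receive positive advantages and are reinforced, while those below the mean are pushed down; when every sample scores the same, all advantages are zero and the group contributes no gradient.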
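[18] MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching cs.CV
Tingman Yan, Tao Liu, Xilian Yang, Qunfei Zhao, Zeyang Xia
TL;DR: The paper proposes MatchAttention, an attention mechanism that dynamically matches relative positions to address the quadratic complexity and missing explicit constraints of high-resolution cross-view matching; together with MatchDecoder and occlusion handling, it enables efficient high-resolution matching.
Details
Motivation: Existing cross-view matching struggles with high-resolution images because of quadratic computational complexity and a lack of explicit matching constraints, so a more efficient attention mechanism is needed.
Result: The models achieve state-of-the-art performance on Middlebury, KITTI, ETH3D, and Spring flow. MatchStereo-B needs only 29 ms for KITTI-resolution inference, and MatchStereo-T processes 4K UHD images in 0.1 seconds using 3 GB of GPU memory.
Insight: Dynamically matching relative positions and an efficient hierarchical decoder markedly improve high-resolution matching, while occlusion handling further strengthens robustness and generalization.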
Abstract: Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at https://github.com/TingmanYan/MatchAttention.
[19] Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment cs.CVPDF
Miu Sumino, Mayu Ishii, Shun Kaizu, Daisuke Hisano, Yu Nakayama
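TL;DR: The paper proposes a robust demodulation scheme for optical camera communication with an event-based vision sensor and reports the first outdoor experiments to achieve a BER below 10^-3 at 200 m with 60 kbps and at 400 m with 30 kbps.
Details
Motivation: Address signal demodulation for optical camera communication in long-range outdoor environments, improving link reliability and data rate.
Result: In outdoor experiments, the BER stays below 10^-3 at 200 m with 60 kbps and at 400 m with 30 kbps.
Insight: Event-based vision sensors show promise for long-range optical communication, and a robust demodulation scheme is essential for practical deployment.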
Abstract: We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} < 10^{-3}$ at 200m-60kbps and 400m-30kbps in outdoor experiments.
[20] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering cs.CV | cs.GRPDF
Alexander Valverde, Brian Xu, Yuyin Zhou, Meng Xu, Hongyun Wang
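TL;DR: GauSSmart is a hybrid method that combines 2D foundation models with 3D Gaussian Splatting reconstruction. It uses 2D segmentation priors and high-dimensional feature embeddings to improve detail and coverage in sparsely observed regions, and it outperforms existing Gaussian Splatting in the majority of evaluated scenes.
Details
Motivation: Current Gaussian Splatting methods struggle to capture fine detail or maintain realism in regions with sparse data. 2D foundation models offer rich visual information, but combining the two effectively remains a challenge.
Result: Validated on three datasets, GauSSmart outperforms existing Gaussian Splatting in the majority of evaluated scenes, demonstrating the potential of hybrid 2D-3D approaches.
Insight: 2D foundation models can supply rich semantic and geometric priors for 3D reconstruction, compensating for sparse data and opening a direction for future research.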
Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
[21] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts cs.CVPDF
Kieu-Anh Truong Thi, Huy-Hieu Pham, Duc-Trong Le
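TL;DR: The paper proposes CLEAR, a causal-inference-based framework for domain shift in histopathology. By modeling causal relationships rather than relying only on statistical correlations, it improves robustness on unseen domains.
Details
Motivation: Domain shift in histopathology, for example from differences in acquisition processes or data sources, severely limits how well deep learning models generalize. Existing methods mainly model statistical correlations and overlook causal relationships, a gap this work aims to fill.
Result: On CAMELYON17 and a private histopathology dataset, the method improves performance by up to 7% and outperforms existing baselines.
Insight: Causal inference is a promising tool for domain shift, particularly for leveraging semantic features while mitigating confounders.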
Abstract: Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.
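For reference, the standard front-door adjustment behind the "front-door principle" named in the abstract can be written as below. Reading X as the observed tissue slide, M as the mediator, and Y as the tumor label is an assumption made here for illustration; the abstract does not spell out how the variables are assigned or how the sums are approximated in practice.

```latex
P(Y = y \mid \mathrm{do}(X = x))
  = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P(Y = y \mid M = m, X = x')\, P(X = x')
```

The outer sum propagates the effect of X through the mediator, and the inner sum averages over other observed inputs x', which is how the adjustment sidesteps an unobserved confounder of X and Y.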
[22] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding cs.CV | cs.AIPDF
Kyungryul Back, Seongbeom Park, Milim Kim, Mincheol Kwon, SangHyeok Lee
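TL;DR: The paper proposes a training-free, tri-layer contrastive decoding method with watermarking that reduces hallucinations in large vision-language models (LVLMs) and yields more visually grounded outputs.
Details
Motivation: LVLMs perform well on multimodal tasks but still hallucinate, often relying heavily on a single modality or memorizing training data without grounding their outputs. The authors seek a remedy that needs no additional training.
Result: On the public benchmarks POPE, MME, and AMBER, the method achieves state-of-the-art performance in reducing LVLM hallucinations and generating visually grounded responses.
Insight: Layer-wise contrastive decoding guided by a watermark-related question can improve the visual grounding of LVLMs and the accuracy of generated content without additional training.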
Abstract: Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations – they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
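As a rough illustration of the layer-contrast idea in step (1), the sketch below scores tokens by the difference between the log-probabilities of a "mature" and an "amateur" layer. It covers only a generic two-layer contrast. The pivot layer, the watermark-related question, and the tri-layer combination from the paper are not reproduced, and the plausibility threshold `alpha` and all function names are assumptions borrowed from common contrastive decoding practice rather than from this abstract.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(mature_logits, amateur_logits, alpha=0.1):
    """Score each token by log p_mature - log p_amateur.

    Tokens whose mature probability is below alpha times the top mature
    probability are masked out (hypothetical plausibility constraint).
    """
    log_p_mature = log_softmax(mature_logits)
    log_p_amateur = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(log_p_mature)
    return [
        lm - la if lm >= cutoff else float("-inf")
        for lm, la in zip(log_p_mature, log_p_amateur)
    ]

def pick_token(mature_logits, amateur_logits, alpha=0.1):
    """Greedy choice under the contrastive score."""
    scores = contrastive_scores(mature_logits, amateur_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)

# Token 0 is favoured by both layers; token 1 gains most between layers.
mature = [2.0, 1.9, -5.0]
amateur = [2.0, 0.0, -5.0]
chosen = pick_token(mature, amateur)
```

In this toy input, plain greedy decoding on the mature layer would pick token 0, while the contrast prefers the token whose probability rose most between the two layers.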
[23] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection cs.CVPDF
Shivangi Yadav, Arun Ross
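TL;DR: The paper proposes the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), which generates high-quality synthetic ocular images to address data scarcity in iris presentation attack detection. It combines the strengths of diffusion models and generative adversarial networks, and the generated data improves the performance of detection systems.
Details
Motivation: Iris biometric systems are vulnerable to presentation attacks such as artificial eyes, printed eye images, and cosmetic contact lenses, yet datasets for training and evaluation are scarce because such attacks are difficult to construct and image. MID-StyleGAN addresses this by generating synthetic ocular images across multiple domains.
Result: On the LivDet2020 dataset, the true detect rate at a 1% false detect rate rises from 93.41% to 98.72%, showing that the generated data improves PAD system performance.
Insight: Synthetic data generation is an effective answer to biometric data scarcity, and the MID-StyleGAN framework suggests a route to data augmentation for iris and other biometric modalities.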
Abstract: An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
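The headline metric, true detect rate (TDR) at a 1% false detect rate (FDR), can be computed from detector scores roughly as follows. This is a minimal sketch under stated assumptions: higher scores mean "more likely a presentation attack", a sample is flagged when its score exceeds the threshold, and the function name and toy scores are hypothetical rather than taken from the paper or the LivDet protocol.

```python
def tdr_at_fdr(bonafide_scores, attack_scores, max_fdr=0.01):
    """Fraction of attacks detected at the strictest threshold that flags
    at most max_fdr of the bona fide samples.

    Assumes higher score = more attack-like; a sample is flagged when
    its score is strictly greater than the threshold.
    """
    if not bonafide_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(bonafide_scores, reverse=True)
    # Number of bona fide samples we are allowed to flag by mistake.
    allowed = int(max_fdr * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    detected = sum(1 for s in attack_scores if s > threshold)
    return detected / len(attack_scores)

# Toy data: 100 bona fide scores from 0.00 to 0.99, four attack scores.
bonafide = [i / 100 for i in range(100)]
attacks = [0.5, 0.985, 0.99, 1.2]
tdr = tdr_at_fdr(bonafide, attacks, max_fdr=0.01)
```

Using a strict comparison keeps the realised FDR at or below the target even when bona fide scores tie at the threshold.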
[24] Vision-Centric Activation and Coordination for Multimodal Large Language Models cs.CVPDF
Yunnan Wang, Fan Lu, Kecheng Zheng, Ziyuan Huang, Ziqiang Li
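TL;DR: VaCo optimizes the representations of multimodal large language models (MLLMs) through vision-centric activation and coordination, improving visual comprehension and analytical abilities.
Details
Motivation: Mainstream MLLMs are supervised only by next-token prediction over textual tokens, neglecting vision-centric information that is essential for analytical abilities.
Result: VaCo significantly improves the performance of different MLLMs on various benchmarks, demonstrating superior visual comprehension.
Insight: Effectively integrating and coordinating visual information is key to improving the analytical abilities of multimodal models.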
Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
[25] Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration cs.CV | cs.ROPDF
Siddharth Tourani, Jayaram Reddy, Sarvesh Thakur, K Madhava Krishna, Muhammad Haris Khan
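TL;DR: The paper proposes a self-supervised RGB-D registration method that uses cycle-consistent keypoints as anchor points, together with a pose block that combines a GRU with transformation synchronization. It improves registration accuracy and surpasses previous self-supervised methods on ScanNet and 3DMatch.
Details
Motivation: With consumer depth cameras now widespread, large amounts of unlabeled RGB-D data are available, but effective ways to use this data for geometric reasoning about scenes are still lacking. The paper aims to improve RGB-D registration through self-supervised learning.
Result: Surpasses previous self-supervised methods on ScanNet and 3DMatch, and even outperforms some older supervised methods.
Insight: Cycle-consistent keypoints and the blending of historical and multi-view data improve the robustness and accuracy of RGB-D registration, and self-supervised methods hold considerable promise for geometric tasks.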
Abstract: With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration methods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self-supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.
[26] Spatial Preference Rewarding for MLLMs Spatial Understanding cs.CVPDF
Han Qiu, Peng Gao, Lewei Lu, Xiaoqin Zhang, Ling Shao
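TL;DR: SPR (Spatial Preference Rewarding) improves the spatial understanding of multimodal large language models (MLLMs) by rewarding detailed region descriptions with precise object localization. It scores model outputs with semantic and localization scores, and experiments show gains on standard referring and grounding benchmarks.
Details
Motivation: MLLMs show promising spatial understanding but still fall short in fine-grained spatial perception, such as generating detailed region descriptions or accurately localizing objects. Existing approaches mainly tune models on pre-annotated instruction data and lack direct supervision of the model's actual responses.
Result: On standard referring and grounding benchmarks, SPR effectively improves MLLM spatial understanding with minimal training overhead.
Insight: Directly rewarding detailed responses with precise localization, rather than relying only on pre-annotated data, lets SPR capture fine-grained spatial information better and offers a new way to optimize the spatial capabilities of MLLMs.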
Abstract: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user’s requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs’ actual responses. We address this issue with SPR, a Spatial Preference Rewarding approach that enhances MLLMs’ spatial capabilities by rewarding MLLMs’ detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
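The pairing step described in the abstract (best-scored refinement as the preferred response, lowest-scored initial description as the rejected one) might look roughly like this. It is a sketch under assumptions: the equal-weight combination of semantic and localization scores, the use of box IoU as a localization signal, and every name in the code are illustrative choices, not the scoring functions defined in the paper.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def combined_score(candidate, w_sem=0.5, w_loc=0.5):
    """Hypothetical weighted sum of semantic and localization scores."""
    return w_sem * candidate["semantic"] + w_loc * candidate["localization"]

def build_preference_pair(initial, refined):
    """Return (chosen, rejected) texts for preference optimization:
    the best-scored refinement vs. the lowest-scored initial description."""
    chosen = max(refined, key=combined_score)
    rejected = min(initial, key=combined_score)
    return chosen["text"], rejected["text"]

# Toy candidates; localization could come from box_iou against a reference box.
initial = [
    {"text": "a dog", "semantic": 0.4, "localization": 0.2},
    {"text": "a brown dog on the left", "semantic": 0.7, "localization": 0.5},
]
refined = [
    {"text": "a brown dog lying at the left edge", "semantic": 0.8, "localization": 0.9},
    {"text": "a dog near a sofa", "semantic": 0.6, "localization": 0.7},
]
pair = build_preference_pair(initial, refined)
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

The resulting (chosen, rejected) pair is the kind of input a direct preference optimization loss consumes.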
[27] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation cs.CVPDF
Dongnam Byun, Jungwon Park, Jumgmin Ko, Changin Choi, Wonjong Rhee
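TL;DR: DOS (Directional Object Separation) modifies CLIP text embeddings to improve the success rate of multi-object image generation and reduce object mixing.
Details
Motivation: Text-to-image models often neglect or mix objects when prompts involve multiple objects, especially in four scenarios: Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects.
Result: DOS outperforms four competing methods on multi-object generation and receives 26.24%-43.04% more votes in human evaluations across four benchmarks.
Insight: Targeted adjustment of text embeddings can improve the semantic separation of objects in multi-object generation and raise output quality.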
Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.
[28] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights cs.CVPDF
Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan
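TL;DR: The paper proposes DRBD-Mamba, a dual-resolution bi-directional Mamba model that captures multi-scale long-range dependencies for efficient brain tumor segmentation, with gains in computational efficiency and robustness over existing methods.
Details
Motivation: Brain tumor segmentation matters for clinical diagnosis and treatment, but tumor subregions are heterogeneous, existing Mamba-based models are computationally expensive, and their robustness across BraTS data partitions is largely unexplored. An efficient and reliable solution is needed.
Result: On the BraTS2023 20% test set used by recent methods, Dice improves by 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor, with a 15-fold improvement in efficiency.
Insight: Through systematic evaluation and an efficient design, DRBD-Mamba maintains accuracy while sharply reducing computational cost, making it a more practical option for brain tumor segmentation.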
Abstract: Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86% for tumor core and 1.45% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.
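The Dice figures quoted above refer to the standard overlap score between a predicted and a reference mask. A minimal sketch on flattened binary masks is given below; the function name and the convention of returning 1.0 for two empty masks are assumptions, and BraTS evaluation code may handle that edge case differently.

```python
def dice_score(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# One shared voxel out of two predicted and two reference voxels.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```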
[29] BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble cs.CV | cs.LGPDF
Brandon Hill, Kma Solaiman
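TL;DR: BoardVision is a deployable framework for motherboard defect detection. It benchmarks YOLOv7 and Faster R-CNN, combines them through a lightweight ensemble called CTV Voter to balance precision and recall, and evaluates robustness under realistic perturbations.
Details
Motivation: PCB inspection research has largely targeted bare-board or trace-level defects, while assembly-level motherboard defect detection remains underexplored. BoardVision aims to fill this gap with a practically usable solution.
Result: CTV Voter balances precision and recall. The robustness evaluation under sharpness, brightness, and orientation changes highlights stability challenges, and a released GUI tool turns the research into a usable inspection tool.
Insight: Single models have limitations in motherboard defect detection (YOLO excels in precision but underperforms in recall, and Faster R-CNN shows the reverse), which an ensemble can mitigate. Practical deployment must also account for robustness to perturbations.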
Abstract: Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors, YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.
[30] DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis cs.CVPDF
Chao Tu, Kun Huang, Jie Zhang, Qianjin Feng, Yu Zhang
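TL;DR: DCMIL is a progressive representation learning model that efficiently processes whole slide images (WSIs) for cancer prognosis without dense annotations and outperforms standard WSI-based prognostic models on twelve cancer types.
Details
Motivation: Computational pathology uses WSIs to quantify morphological heterogeneity and build objective prognostic models for cancer, but progress is held back by the computational bottleneck of gigapixel inputs and scarce annotations. Existing methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments.
Result: In experiments on twelve cancer types, DCMIL outperforms standard WSI-based prognostic models, identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues.
Insight: Beyond improving prognostic accuracy, DCMIL offers computational pathology an efficient learning framework that does not rely on dense annotations, with the potential to generate new biological insights.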
Abstract: The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic models for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All code has been made publicly accessible at https://github.com/tuuuc/DCMIL.
[31] Real-Time Neural Video Compression with Unified Intra and Inter Coding cs.CVPDF
Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang
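TL;DR: The paper proposes a real-time neural video compression framework with unified intra and inter coding. It addresses the inefficiency of existing methods on disocclusion and new content, and a simultaneous two-frame design exploits interframe redundancy in both directions for further compression gains.
Details
Motivation: Existing neural video compression schemes handle disocclusion, new content, and interframe error propagation inefficiently. Borrowing from classic video coding, which allows intra coding within inter-coded frames, the paper proposes a unified framework to address these issues.
Result: An average 10.7% BD-rate reduction relative to DCVC-RT, more stable bitrate and quality per frame, and retained real-time encoding and decoding.
Insight: Unifying intra and inter coding handles disocclusion and new content effectively, and exploiting redundancy in both directions further improves compression efficiency.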
Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
[32] Structured Universal Adversarial Attacks on Object Detection for Video Sequences cs.CVPDF
Sven Jacob, Weijia Shao, Gjergji Kasneci
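TL;DR: The paper proposes a structured universal adversarial attack on video object detection. Nuclear norm regularization yields minimally distorted perturbations concentrated in the background, optimized with an adaptive optimistic exponentiated gradient method, and the attack outperforms existing methods in effectiveness while maintaining high stealthiness.
Details
Motivation: Video object detection is vital in safety-critical applications, yet deep learning detectors remain vulnerable to universal adversarial perturbations. Existing attacks fall short in perturbation structure and stealthiness, which motivates a more efficient attack.
Result: On video object detection, the proposed attack outperforms low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness.
Insight: Structured perturbation design, such as concentrating perturbations in the background, and an efficient optimization method are key to stronger adversarial attacks.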
Abstract: Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.
[33] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review cs.CVPDF
Youwan Mahé, Elise Bannier, Stéphanie Leplaideur, Elisa Fromont, Francesca Galassi
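TL;DR: A systematic scoping review of unsupervised deep generative models for anomaly detection in neuroimaging. It synthesizes 49 studies published between 2018 and 2025 and covers autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models applied to brain MRI and CT.
Details
Motivation: Fully supervised methods require large annotated datasets and are limited to well-characterized pathologies, whereas unsupervised deep generative models can be trained on healthy data alone and detect anomalies as deviations from it, offering an alternative for neuroimaging anomaly detection.
Result: Generative models perform encouragingly on large focal lesions, show progress on more subtle abnormalities, and can produce interpretable pseudo-healthy reconstructions.
Insight: Unsupervised deep generative models are promising where annotated data are scarce, as in rare or heterogeneous diseases. Future work should prioritize anatomy-aware modeling, foundation model development, task-appropriate evaluation metrics, and rigorous clinical validation.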
Abstract: Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 and 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.
[34] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration cs.CVPDF
Thomas Katraouras, Dimitrios Rafailidis
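TL;DR: The paper proposes a strategy for compressing multi-task image restoration models. Iterative pruning finds highly sparse subnetworks that match or surpass the dense model while retaining only 10% of the parameters.
Details
Motivation: Existing multi-task image restoration models have excessively many trainable parameters and are computationally inefficient. An effective compression strategy is needed to improve efficiency without sacrificing performance.
Result: On deraining, dehazing, and denoising, MIR-L retains only 10% of the trainable parameters while maintaining high restoration performance.
Insight: Iterative pruning can uncover high-performing sparse subnetworks, which points to considerable room for optimizing multi-task models.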
Abstract: Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model’s optimization, effectively uncovering “winning tickets” that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.
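The pruning loop described in the abstract (remove low-magnitude weights over several rounds and reset the survivors to their original initialization) can be sketched on a flat weight list as follows. This is an illustration under assumptions, not the MIR-L implementation: `train_fn` is a hypothetical stand-in for a full training run, pruning is global rather than per layer, and the per-round fraction is an arbitrary choice.

```python
def magnitude_mask(weights, mask, prune_fraction):
    """Zero out the given fraction of still-active weights with the
    smallest absolute value and return the updated mask."""
    alive = sorted((abs(w), i) for i, (w, m) in enumerate(zip(weights, mask)) if m)
    n_prune = int(len(alive) * prune_fraction)
    new_mask = list(mask)
    for _, i in alive[:n_prune]:
        new_mask[i] = 0
    return new_mask

def iterative_prune(init_weights, train_fn, rounds, prune_fraction):
    """Iterative magnitude pruning with rewinding to initialization.

    train_fn(weights, mask) -> trained weights (hypothetical callback).
    Returns the final mask and the rewound sparse initialization.
    """
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        # Rewind: surviving weights restart from their original values.
        start = [w if m else 0.0 for w, m in zip(init_weights, mask)]
        trained = train_fn(start, mask)
        mask = magnitude_mask(trained, mask, prune_fraction)
    ticket = [w if m else 0.0 for w, m in zip(init_weights, mask)]
    return mask, ticket

# Toy run: "training" is the identity, so pruning follows initial magnitudes.
init = [0.1, -0.9, 0.5, -0.05, 0.7, 0.3, -0.2, 0.8, 0.6, -0.4]
final_mask, ticket = iterative_prune(init, lambda w, m: w, rounds=2, prune_fraction=0.5)
```

As an illustrative calculation that is not taken from the paper, pruning 20% of the remaining weights per round leaves 0.8^10, roughly 10.7%, of the parameters after ten rounds.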
[35] Grazing Detection using Deep Learning and Sentinel-2 Time Series Data cs.CVPDF
Aleksis Pirinen, Delia Fano Yela, Smita Chakraborty, Erik Källman
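TL;DR: The paper proposes grazing detection from Sentinel-2 time series with deep learning, using an ensemble of CNN-LSTM models, as a tool for steering land-use compliance inspections.
Details
Motivation: Grazing shapes both agricultural production and biodiversity, yet scalable monitoring is limited. The study aims for efficient, automated grazing detection from Sentinel-2 time series.
Result: The model reaches an average F1 score of 77% across five validation splits, with 90% recall on grazed pastures. If inspectors can visit at most 4% of sites annually, prioritizing fields the model predicts as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection.
Insight: Coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance.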
Abstract: Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.
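The operational claim (17.2 times more confirmed non-grazing sites under a 4% inspection budget) is a lift over random inspection. A minimal sketch of that calculation is shown below; the function name, the score convention, and the toy data are assumptions, and the exact evaluation protocol in the paper may differ.

```python
def lift_at_budget(non_grazed_scores, is_non_grazed, budget=0.04):
    """Ratio of non-grazed sites found by visiting the top-scored fields
    to the number expected from visiting the same count at random.

    non_grazed_scores: model confidence that each field is NOT grazed.
    is_non_grazed: ground-truth flags (1 = not grazed, 0 = grazed).
    """
    n = len(non_grazed_scores)
    if n == 0 or n != len(is_non_grazed):
        raise ValueError("inputs must be non-empty and of equal length")
    visits = max(1, round(n * budget))
    ranked = sorted(range(n), key=lambda i: non_grazed_scores[i], reverse=True)
    hits = sum(is_non_grazed[i] for i in ranked[:visits])
    expected_random = visits * sum(is_non_grazed) / n
    if expected_random == 0:
        raise ValueError("no non-grazed sites in the ground truth")
    return hits / expected_random

# Toy data: 100 fields, 5 truly non-grazed and scored highest by the model.
truth = [1] * 5 + [0] * 95
scores = [0.9] * 5 + [0.1] * 95
lift = lift_at_budget(scores, truth, budget=0.04)
```

In the toy data all four visited fields are non-grazed, against 0.2 expected under random selection, so the lift grows as the non-grazed base rate shrinks.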
[36] Vision Mamba for Permeability Prediction of Porous Media cs.CVPDF
Ali Kashefi, Tapan Mukerji
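TL;DR: The paper introduces, for the first time, a network with a Vision Mamba backbone for predicting the permeability of three-dimensional porous media and demonstrates advantages in computational efficiency, memory use, and parameter count.
Details
Motivation: ViTs and CNNs incur high computational and memory costs in permeability prediction for porous media. Vision Mamba scales linearly with input resolution and needs fewer trainable parameters, making it a potential alternative.
Result: In permeability prediction, Vision Mamba shows the expected advantages over ViT and CNN models in practice, with higher computational efficiency and lower resource use.
Insight: Vision Mamba could replace ViTs as the backbone of large vision models and suits resource-constrained settings.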
Abstract: Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus, they can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.
[37] Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing cs.CV | cs.AIPDF
Qurrat Ul Ain, Atif Aftab Ahmed Jilani, Zunaira Shafqat, Nigar Azhar Butt
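TL;DR: The paper proposes SurgScan, a framework that uses YOLOv8 to detect surgical instrument defects in real time, reaching 99.3% accuracy with industrial scalability and addressing the shortcomings of manual inspection.
Details
Motivation: Defective surgical instruments threaten sterility, mechanical integrity, and patient safety. Manual inspection is prone to error and inconsistency, so an automated, high-accuracy solution is needed.
Result: SurgScan achieves the highest accuracy (99.3%) among the compared state-of-the-art CNN architectures at real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment.
Insight: Contrast-enhanced preprocessing significantly improves defect detection. SurgScan offers an automated quality-control solution intended to support compliance with ISO 13485 and FDA standards in medical device manufacturing.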
Abstract: Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.
[38] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models cs.CV | cs.LGPDF
Yunze Tong, Didi Zhu, Zijing Hu, Jinluan Yang, Ziyu Zhao
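TL;DR: The paper proposes noise projection, a method that applies text-conditioned refinement to the initial noise to address text-image misalignment in diffusion models, without modifying the pretrained model.
Details
Motivation: In existing text-to-image generation methods, different initial noises can yield images that do not align with the prompt. The paper attributes this to a training-inference mismatch: during training the noise lies in a prompt-specific subset of the latent space, whereas at inference it is drawn from a prompt-agnostic Gaussian prior.
Result: Experiments show that the method significantly improves text-image alignment across diverse prompts.
Insight: The mismatch between training-time and inference-time noise distributions is a key cause of text-image misalignment, and text-conditioned noise refinement can substantially reduce it.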
Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
[39] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model cs.CVPDF
Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang
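TL;DR: PaddleOCR-VL is an efficient multilingual document parsing model built around a compact 0.9B-parameter vision-language model (VLM) that combines a dynamic-resolution visual encoder with a language model, supports 109 languages, and performs strongly across multiple element recognition tasks.
Details
Motivation: Traditional document parsing models consume substantial resources and underperform when handling multiple languages and complex elements, so an efficient, lightweight solution is needed.
Result: It achieves SOTA performance on public and in-house benchmarks, significantly outperforms existing solutions, and offers fast inference.
Insight: Compact VLMs have great potential for multilingual document parsing, balancing performance and resource efficiency.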
Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
[40] Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology cs.CVPDF
Xinrui Huang, Fan Xiao, Dongming He, Anqi Gao, Dandan Li
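TL;DR: This paper proposes DentVFM, a family of vision foundation models designed for dentistry that uses self-supervised learning on multi-modal dental radiographic images, substantially improving generalization and label efficiency and outperforming existing methods across a range of dental tasks.
Details
Motivation: Dental radiographic interpretation is constrained by a shortage of trained professionals, and existing AI systems are limited by their single-modality focus, task-specific design, and reliance on labeled data. To address this, the authors propose a more general dental vision foundation model.
Result: DentVFM performs strongly on tasks such as disease diagnosis and treatment analysis, outperforming supervised, self-supervised, and weakly supervised baselines, and it outperforms professional dentists in cross-modality diagnosis.
Insight: Applying vision foundation models to dentistry shows the potential of self-supervised learning and multi-modal data, offering a scalable and efficient solution for intelligent dental healthcare.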
Abstract: Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.
[41] Exploring Image Representation with Decoupled Classical Visual Descriptors cs.CVPDF
Chenyuan Qu, Hao Chen, Jianbo Jiao
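TL;DR: The paper proposes VisualSplit, a framework that decouples classical visual descriptors (such as edges and colour) to improve the visual understanding of modern learning methods, preserving their interpretability while supporting advanced visual tasks.
Details
Motivation: Deep learning performs well on vision tasks, but its internal representations are opaque; classical visual descriptors (such as edges and colour) are intuitive yet underused in modern learning. The paper aims to bridge this gap by exploring what classical descriptors can contribute to modern learning.
Result: VisualSplit performs well on classification and segmentation and also effectively supports advanced visual tasks such as image generation and editing, demonstrating the practical value of classical descriptors in modern learning.
Insight: Decoupling classical visual descriptors preserves their interpretability and gives deep learning a richer representation of visual knowledge, broadening its potential in advanced visual tasks.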
Abstract: Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: https://chenyuanqu.com/VisualSplit/.
[42] Exploring Cross-Modal Flows for Few-Shot Learning cs.CVPDF
Ziqi Jiang, Yanghao Wang, Long Chen
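TL;DR: This paper proposes Flow Matching Alignment (FMA), a method that learns a cross-modal velocity field, to address the insufficiency of one-step adjustment in existing parameter-efficient fine-tuning (PEFT) methods on complex datasets.
Details
Motivation: Existing PEFT methods (such as prompt tuning, LoRA, or adapters) adjust visual or textual features in only one step, which struggles on complex datasets where the modalities are highly entangled.
Result: Across multiple benchmarks and backbones, FMA outperforms one-step PEFT methods, with especially large gains on challenging datasets.
Insight: Multi-step adjustment and noise augmentation are effective strategies for feature alignment in cross-modal tasks.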
Abstract: Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today’s PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. This is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
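To make the multi-step adjustment idea concrete, here is a minimal sketch of integrating a velocity field with Euler steps and stopping early. It is an illustration only, not the authors' solver: the function name integrate_flow, the parameters num_steps and stop_after, and the toy constant velocity are all assumptions. In FMA the velocity field would be learned; here it is a hand-written stand-in.

```python
from typing import Callable, List, Optional

Vector = List[float]


def integrate_flow(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int = 10,
    stop_after: Optional[int] = None,
) -> Vector:
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 toward t = 1.

    `stop_after` (hypothetical parameter) ends the integration after that
    many steps, mimicking an early-stopping solver.
    """
    steps = num_steps if stop_after is None else min(stop_after, num_steps)
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        # One small adjustment per step instead of a single large jump.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned field: constant velocity pointing from the
# source feature [0, 0] to the target feature [1, 2].
toy_velocity = lambda x, t: [1.0, 2.0]
full = integrate_flow([0.0, 0.0], toy_velocity, num_steps=4)
early = integrate_flow([0.0, 0.0], toy_velocity, num_steps=4, stop_after=3)
```

With a constant velocity the early-stopped result simply lies partway along the straight path; with a learned, state-dependent field each step can correct the previous one, which is the property the abstract attributes to multi-step rectification.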
[43] Consistent text-to-image generation via scene de-contextualization cs.CVPDF
Song Tang, Peihao Gong, Kunyu Li, Kai Guo, Boyu Wang
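TL;DR: The paper proposes SDeC, a training-free prompt-embedding editing method that improves the consistency of text-to-image generation by decoupling the scene-identity correlation learned from natural-image training.
Details
Motivation: Existing text-to-image (T2I) methods often fail to preserve the identity of the same subject, because scene and subject identity are natively correlated (termed scene contextualization). Prior methods require knowing all target scenes in advance, which is unrealistic in practice.
Result: Experiments show that SDeC significantly improves identity consistency while maintaining scene diversity.
Insight: 1. The strong correlation between scene and identity is a natural property of T2I models; 2. Decoupling scene and identity improves consistency without sacrificing diversity; 3. Training-free methods are better suited to practical applications.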
Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I’s built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt’s embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.
[44] Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video cs.CVPDF
Yulin Zhang, Cheng Shi, Yang Wang, Sibei Yang
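TL;DR: The paper proposes an AI model called Ego Proactive Video-LLM that proactively understands, anticipates, and responds to events from streaming video input, and introduces a new benchmark, ESTP-Bench, with the ESTP-F1 metric for evaluation.
Details
Motivation: The goal is to develop AI that can proactively understand and respond to its environment as humans do, moving beyond conventional passive observation.
Result: The model performs strongly on multiple online and offline benchmarks and effectively addresses the key properties of the task.
Insight: The paper's novelty lies in extending AI from passive observation to proactive understanding and response, offering a new direction for future intelligent assistants.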
Abstract: Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/
[45] Talking Points: Describing and Localizing Pixels cs.CV | cs.CLPDF
Matan Rusanovsky, Shimon Malnick, Shai Avidan
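TL;DR: This paper proposes a novel pixel-level grounding framework composed of a Point Descriptor (which generates keypoint descriptions) and a Point Localizer (which regresses pixel coordinates from descriptions), filling a gap in pixel-level understanding in existing vision-language models.
Details
Motivation: Existing vision-language models are limited to object-level or region-level grounding and cannot understand pixel-level keypoints through natural language.
Result: Experiments show that the method outperforms baseline models on LlamaPointInPart.
Insight: The bidirectional framework suggests potential applications in keypoint-guided image understanding and language-guided precise localization.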
Abstract: Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results, we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close the predicted point is to the ground-truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.
[46] STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding cs.CV | cs.AIPDF
Zhifei Chen, Tianshuo Xu, Leyi Wu, Luozhou Wang, Dongyu Yan
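TL;DR: STANCE is an image-to-video generation framework that addresses motion coherence in video generation through sparse-to-dense anchored encoding, improving temporal consistency.
Details
Motivation: Maintaining coherent object motion and interactions in video generation remains challenging, mainly because motion hints carry too little information after encoding and because joint optimization favors appearance over temporal consistency.
Result: The model trains stably and achieves improved temporal coherence without requiring per-frame trajectory scripts.
Insight: Separating the structure and appearance tasks avoids optimization conflicts and improves temporal consistency in video generation.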
Abstract: Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues: a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB + auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
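The Instance Cues construction (average the flow over an instance, then attach depth over its mask) can be sketched as follows. This is a rough illustration under stated assumptions, not the STANCE implementation: the function name instance_cue, the nested-list image layout, and the choice to store per-pixel depth as a third channel and zeros outside the mask are all hypothetical, since the abstract does not specify how depth is combined with flow.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]   # H x W grid of (dx, dy)
Depth = List[List[float]]                # H x W grid of monocular depth
Mask = List[List[bool]]                  # H x W instance mask
Field = List[List[Tuple[float, float, float]]]


def instance_cue(flow: Flow, depth: Depth, mask: Mask) -> Field:
    """Build a dense (dx, dy, depth) field for one instance.

    Inside the mask every pixel gets the instance's mean flow plus its own
    depth value; outside the mask the field is zero (an assumed convention).
    """
    height, width = len(mask), len(mask[0])
    field: Field = [[(0.0, 0.0, 0.0) for _ in range(width)] for _ in range(height)]
    pixels = [(i, j) for i in range(height) for j in range(width) if mask[i][j]]
    if not pixels:
        return field
    # Average the sparse/noisy flow over the whole instance.
    mean_dx = sum(flow[i][j][0] for i, j in pixels) / len(pixels)
    mean_dy = sum(flow[i][j][1] for i, j in pixels) / len(pixels)
    for i, j in pixels:
        field[i][j] = (mean_dx, mean_dy, depth[i][j])
    return field
```

Averaging gives every pixel of the instance the same planar motion, while the depth channel keeps per-pixel distance information, which is one plausible reading of the "2.5D" description.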
[47] Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers cs.CVPDF
Hugo Markoff, Jevgenijs Galaktionovs
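TL;DR: The paper proposes a hierarchical re-classification system that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings, using a five-stage pipeline to improve species-level identification in animal classification models and reaching 96.5% accuracy on a specific dataset.
Details
Motivation: Existing animal classification models (such as SpeciesNet) use conservative rollup strategies, so many animals are labeled only at high taxonomic levels rather than at species level, which limits identification precision.
Result: On the LILA BC Desert Lion Conservation dataset, the system recovered 761 bird detections and re-classified 456 detections with high accuracy, reaching a species-level identification rate of 64.9%.
Insight: Combining multiple machine learning techniques with a hierarchical strategy can effectively address fine-grained identification in complex classification tasks.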
Abstract: State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from “blank” and “animal” labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent
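Two of the pipeline stages named above, centroid building and cosine-distance scoring, can be sketched in a few lines. This is a simplified illustration, not the Animal Detect implementation: the function names, the fixed max_distance threshold (the paper describes adaptive scoring), and the species names in the usage lines are all hypothetical, and the embeddings stand in for CLIP features.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_centroids(labeled: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Average the unit embeddings of high-confidence detections per species."""
    centroids = {}
    for species, embeddings in labeled.items():
        unit = [normalize(e) for e in embeddings]
        mean = [sum(col) / len(unit) for col in zip(*unit)]
        centroids[species] = normalize(mean)
    return centroids


def reclassify(
    embedding: Vector, centroids: Dict[str, Vector], max_distance: float = 0.3
) -> Tuple[Optional[str], float]:
    """Assign the nearest centroid by cosine distance, or None if too far."""
    e = normalize(embedding)
    best_species, best_distance = None, float("inf")
    for species, centroid in centroids.items():
        distance = 1.0 - sum(a * b for a, b in zip(e, centroid))
        if distance < best_distance:
            best_species, best_distance = species, distance
    if best_distance > max_distance:
        return None, best_distance  # keep the original high-level label
    return best_species, best_distance


# Hypothetical 2-D embeddings; real CLIP embeddings have hundreds of dimensions.
centroids = build_centroids({"oryx": [[1.0, 0.0], [0.9, 0.1]], "jackal": [[0.0, 1.0]]})
label, distance = reclassify([1.0, 0.05], centroids)
```

Returning None when no centroid is close enough leaves the detection at its original taxonomic level, which keeps the conservative behaviour of the upstream classifier for uncertain cases.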
[48] Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering cs.CVPDF
Hugo Markoff, Jevgenijs Galaktionovs
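TL;DR: The paper studies zero-shot sorting of wildlife images with self-supervised vision transformers, compares combinations of clustering methods and architectures, and deploys continuous similarity ordering in production.
Details
Motivation: Camera traps produce huge volumes of wildlife images, and many species are not covered by existing classifiers, so zero-shot methods are needed to organize unlabeled images efficiently.
Result: DINOv2 with UMAP and GMM reaches 88.6% accuracy on a 5-species test set; 1D sorting reaches 88.2% coherence for mammals and birds and 95.2% for fish.
Insight: Self-supervised vision transformers perform well for zero-shot wildlife sorting, and continuous similarity ordering is an efficient tool for rapid exploratory analysis and manual annotation.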
Abstract: Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.
[49] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering cs.CV | cs.AIPDF
Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu
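TL;DR: This paper proposes Wiki-PRF, a three-stage method (processing, retrieval, and filtering) that addresses multimodal query quality and retrieval relevance in knowledge-based visual question answering (KB-VQA), and trains the model with reinforcement learning, significantly improving answer quality.
Details
Motivation: Existing retrieval-augmented generation (RAG) methods fall short on multimodal query quality and retrieval relevance in KB-VQA, so a more precise approach is needed.
Result: The method reports significant improvements (36.0 and 42.8) on the E-VQA and InfoSeek datasets, reaching SOTA performance.
Insight: Precise extraction of multimodal information and dynamic filtering are key to better KB-VQA performance, and reinforcement learning can effectively optimize the model's reasoning and filtering abilities.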
Abstract: Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained via reinforcement learning with answer accuracy and format consistency as reward signals. This enhances the model’s reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
[50] Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding cs.CVPDF
Ning Ding, Keisuke Fujii, Toru Tamaki
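TL;DR: The paper proposes Shot2Tactic-Caption, a multi-scale video captioning framework for tactical understanding of badminton videos that generates shot-level captions describing individual actions and tactic-level captions capturing how tactics are dynamically executed.
Details
Motivation: Tactical understanding in badminton requires attention to both individual actions and the dynamic execution of tactics, but no existing video captioning method describes both levels.
Result: Experiments show that the framework performs well at generating both shot-level and tactic-level captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants and that the prompt-guided mechanism improves the coherence of tactic captions.
Insight: Multi-scale captioning (shot level plus tactic level) is essential for a full understanding of badminton matches; the prompt-guided mechanism effectively captures dynamic changes in tactics, such as interruption and resumption.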
Abstract: Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose Shot2Tactic-Caption, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.
[51] Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference cs.CVPDF
Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch
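TL;DR: EVS prunes temporally redundant tokens to cut the computational cost of video processing, speeding up VLM inference and supporting longer input sequences without harming semantic fidelity.
Details
Motivation: When processing long videos, VLMs face high computational cost and latency because of the large number of tokens, which limits the scalability of video understanding.
Result: EVS reduces LLM time-to-first-token by up to 4x and, when combined with an uptraining phase, retains performance under aggressive pruning.
Insight: Pruning temporal redundancy is an effective way to make video processing more efficient without sacrificing semantic quality.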
Abstract: Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches, that is, spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
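The idea of pruning temporally static patches while preserving positional identity can be sketched as below. This is a minimal illustration under assumptions, not the EVS implementation: the function name prune_static_patches, the mean-absolute-difference criterion, and the threshold value are hypothetical, and patches are plain lists of pixel values rather than encoder tokens.

```python
from typing import List, Sequence, Tuple

Patch = Sequence[float]   # flattened pixel values of one patch
Frame = Sequence[Patch]   # patches of one frame in a fixed spatial order


def prune_static_patches(
    frames: Sequence[Frame], threshold: float = 0.05
) -> List[Tuple[int, int]]:
    """Return (frame_index, patch_index) pairs of the patches to keep.

    Every patch of the first frame is kept. A later patch is dropped when
    its mean absolute difference from the same location in the previous
    frame is at or below `threshold` (an assumed criterion).
    """
    kept: List[Tuple[int, int]] = []
    for t, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            if t == 0:
                kept.append((t, p))
                continue
            previous = frames[t - 1][p]
            diff = sum(abs(a - b) for a, b in zip(patch, previous)) / len(patch)
            if diff > threshold:
                kept.append((t, p))
    return kept


# Three frames of two patches each: patch 0 never changes, patch 1 changes once.
video = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.5]],
    [[0.0, 0.0], [0.5, 0.5]],
]
kept = prune_static_patches(video)
```

Because the function returns the original (frame, patch) indices rather than a compacted array, each surviving token can still be given the positional encoding of its true location, which is what keeping positional identity requires.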
[52] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation cs.CVPDF
Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind
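TL;DR: RepTok represents an image with a single continuous latent token from a self-supervised vision transformer; by fine-tuning the semantic token embedding and pairing it with a generative decoder, it achieves efficient image generation.
Details
Motivation: Existing generative models often require complex 2D latent spaces and high training costs; RepTok aims to simplify this with self-supervised representations.
Result: It is competitive on class-conditional ImageNet generation and MS-COCO text-to-image synthesis.
Insight: Fine-tuned self-supervised representations can serve as a compact latent space for efficient generation, addressing spatial redundancy and high training cost.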
Abstract: We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
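The cosine-similarity regularizer mentioned above can be illustrated with a small sketch. The abstract does not give the exact form of the loss, so the common choice of one minus cosine similarity is assumed here; the function name cosine_regularizer and the eps guard are hypothetical.

```python
import math
from typing import Sequence


def cosine_regularizer(
    adapted: Sequence[float], original: Sequence[float], eps: float = 1e-8
) -> float:
    """Penalty that grows as the adapted token rotates away from the
    frozen SSL token: 1 - cos(adapted, original) (assumed form)."""
    dot = sum(a * b for a, b in zip(adapted, original))
    norm_a = math.sqrt(sum(a * a for a in adapted))
    norm_b = math.sqrt(sum(b * b for b in original))
    # eps guards against division by zero for degenerate all-zero tokens.
    return 1.0 - dot / max(norm_a * norm_b, eps)


aligned = cosine_regularizer([2.0, 0.0], [1.0, 0.0])      # same direction
orthogonal = cosine_regularizer([1.0, 0.0], [0.0, 1.0])   # unrelated direction
```

The penalty depends only on direction, not magnitude, so the adapted token is free to rescale and absorb reconstruction detail while staying angularly close to the original SSL embedding. In training it would be added to the flow matching objective with some weight, which is a hyperparameter the abstract does not specify.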
[53] SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation cs.CVPDF
Jihyun Yu, Yoojin Oh, Wonho Bae, Mingyu Kim, Junhyug Noh
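TL;DR: SteeringTTA is a diffusion-based test-time adaptation method that steers diffusion trajectories to improve classification robustness, without model updates or source data.
Details
Motivation: Existing diffusion-based input adaptation methods rely on gradient guidance, which limits their exploration and their generalization across distortion types.
Result: It outperforms the baseline on ImageNet-C without any model updates or source data.
Insight: Steering particles with a dynamic balancing strategy enables more robust input adaptation without relying on gradients.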
Abstract: Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve classification robustness to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-labels. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.
[54] In-Context Learning with Unpaired Clips for Instruction-based Video Editing cs.CV | cs.AIPDF
Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu
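TL;DR: The paper proposes a low-cost pretraining strategy that uses in-context learning from unpaired video clips for instruction-based video editing, followed by fine-tuning on a small amount of high-quality paired data, significantly improving instruction alignment and visual fidelity.
Details
Motivation: Instruction-based image editing has advanced rapidly, but its extension to video remains underexplored, mainly because of the high cost and complexity of building large-scale paired video editing datasets.
Result: Experiments show that the method outperforms existing approaches in both instruction alignment and visual fidelity, with a 12% improvement in instruction following and a 15% improvement in editing quality.
Insight: With in-context learning, unpaired data can substantially reduce pretraining cost while preserving the model's generalization ability, offering an efficient solution for video editing.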
Abstract: Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend to more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.
[55] Decorrelation Speeds Up Vision Transformers cs.CV | cs.LGPDF
Kieran Carrigg, Rob van Gastel, Melda Yeghaian, Sander Dalm, Faysal Boughorbel
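TL;DR: The paper integrates Decorrelated Backpropagation (DBP) into MAE pre-training, significantly reducing ViT training time and carbon emissions while improving performance.
Details
Motivation: MAE pre-training performs well in low-label regimes but is computationally expensive, which limits its use in time- and resource-constrained industrial settings. The authors therefore seek a way to accelerate pre-training.
Result: With ImageNet-1K pre-training and ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points; experiments on industrial data also confirm the method's practicality.
Insight: DBP accelerates ViT pre-training without loss of stability and improves downstream performance, offering an efficient solution for resource-constrained settings.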
Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
[56] EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024) cs.CVPDF
Weikang Yu, Vincent Nwazelibe, Xianping Ma, Xiaokang Zhang, Richard Gloaguen
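TL;DR: The paper introduces EuroMineNet, the first multitemporal benchmark dataset based on Sentinel-2 multispectral imagery, for monitoring spatiotemporal change at 133 mining sites across the European Union in support of sustainable land management and environmental restoration.
Details
Motivation: Mining is essential to economic development but is also a major source of environmental degradation. Existing datasets have limited spatiotemporal coverage, so a consistent long-term monitoring tool is needed to support sustainable resource management.
Result: Benchmarking 20 state-of-the-art deep learning models shows that GeoAI methods effectively identify long-term environmental changes but still need improvement at detecting short-term dynamics.
Insight: EuroMineNet advances temporally consistent mining monitoring and highlights the potential of GeoAI for social and environmental sustainability.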
Abstract: Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. We release the codes and datasets by aligning with FAIR and the open science paradigm at https://github.com/EricYu97/EuroMineNet.
[57] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging cs.CV
Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Sami Azam, Asif Karim, Jemima Beissbarth
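TL;DR: WeCKD proposes a weakly supervised chained knowledge distillation framework that passes knowledge progressively through a sequence of models, reducing data dependency and improving performance.
Details
Motivation: Conventional knowledge distillation relies on a strong teacher model or large labeled datasets, which limits its use in real-world, limited-data scenarios.
Result: Across four otoscopic imaging datasets and two further datasets, WeCKD matches and in many cases surpasses existing supervised methods without full supervision, with cumulative accuracy gains of up to 23% over a single backbone trained on the same limited data.
Insight: Chained knowledge transfer mitigates the knowledge degradation of one-step distillation and enables efficient learning under weak supervision.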
Abstract: Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
[58] VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning cs.CV
Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu
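TL;DR: VTimeCoT is a simple, effective training-free framework that introduces progress-bar tools and a visuotemporal chain-of-thought (CoT) process, significantly improving video temporal grounding and reasoning.
Details
Motivation: Existing video question answering systems based on multimodal large language models perform poorly at video temporal grounding and reasoning, and a more effective solution is needed.
Result: Applied to the Qwen2VL-7B and GPT4o baselines, the framework yields significant improvements in video temporal grounding and reasoning-based question answering.
Insight: By imitating how humans interact with a video player's progress bar, VTimeCoT yields a compositional and interpretable reasoning process.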
Abstract: In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io
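The progress-bar idea can be pictured with a small sketch: each frame gets a bar whose filled fraction equals the frame's timestamp divided by the video duration, so temporal position becomes something a vision model can read off the image. The function name, the grayscale list-of-rows frame format, and the bar geometry below are hypothetical choices for illustration; the paper's actual tools may differ.

```python
def overlay_progress_bar(frame, t, duration, bar_height=2, on=255, off=0):
    """Return a copy of a grayscale frame with a progress bar on the bottom rows.

    frame: list of rows, each a list of pixel intensities.
    t, duration: current timestamp and total length, in the same unit.
    """
    height, width = len(frame), len(frame[0])
    fraction = min(max(t / duration, 0.0), 1.0)   # clamp to [0, 1]
    filled = round(width * fraction)              # number of "on" columns
    out = [row[:] for row in frame]               # leave the input untouched
    for y in range(height - bar_height, height):
        out[y] = [on] * filled + [off] * (width - filled)
    return out

# A 4x10 mid-gray frame taken halfway through a 10-second clip.
frame = [[128] * 10 for _ in range(4)]
marked = overlay_progress_bar(frame, t=5, duration=10)
for row in marked:
    print(row)
```

Only the bottom `bar_height` rows change, so the overlay can be applied to any frame without retraining the underlying model, which is consistent with the training-free, plug-and-play description in the abstract.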
[59] Leveraging Learned Image Prior for 3D Gaussian Compression cs.CV
Seungjoo Shin, Jaesik Park, Sunghyun Cho
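TL;DR: The paper leverages learned image priors to improve 3D Gaussian compression frameworks: by restoring compression-induced quality degradation, it significantly improves rate-distortion performance and rendering quality.
Details
Motivation: Current 3D Gaussian Splatting (3DGS) compression techniques have made progress in reducing storage overhead, but the lack of learned priors limits further improvement of the rate-distortion trade-off.
Result: Experiments show the method outperforms state-of-the-art 3DGS compression methods in rate-distortion performance and rendering quality while requiring substantially less storage.
Insight: Learned image priors can effectively improve 3D Gaussian compression, supplying coarse rendering residuals as side information further improves rate-distortion performance, and the approach is compatible with existing compression methods.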
Abstract: Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.
[60] Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery cs.CV | cs.AI
Caleb Robinson, Kimberly T. Goetz, Christin B. Khan, Meredith Sackett, Kathleen Leonard
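TL;DR: The paper proposes a human-in-the-loop whale detection method that uses statistical anomaly detection to rapidly screen high-resolution satellite imagery for candidate targets and pairs it with an expert labeling interface, significantly improving detection efficiency and scalability.
Details
Motivation: Traditional whale monitoring methods are expensive and hard to scale, so an efficient automated solution is needed. Although prior work has shown that whales can be identified in very high-resolution satellite imagery, large-scale automated detection still faces challenges such as scarce annotated data and variable image quality and environmental conditions.
Result: On three benchmark scenes with known whale annotations, recall reaches 90.3% to 96.4%, and the area requiring expert inspection shrinks from over 1,000 sq km to less than 2 sq km in some cases.
Insight: Combining unsupervised anomaly detection with human-in-the-loop labeling is an effective route to large-scale satellite image analysis and could extend to monitoring other marine mammals.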
Abstract: Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. “interesting points”. We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% – from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at https://github.com/microsoft/whales.
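A minimal sketch of the screening step, assuming a simple global z-score test (the abstract says only that the detector is a statistical anomaly detection method that flags spatial outliers, so the specific test and the threshold `k` are assumptions). The second helper reproduces the arithmetic behind the reported reduction in inspection area.

```python
from statistics import mean, pstdev

def flag_outliers(image, k=3.0):
    """Return (row, col) positions whose intensity lies more than k standard
    deviations from the scene mean. Assumed detector, for illustration only."""
    values = [v for row in image for v in row]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []                     # perfectly uniform scene: nothing stands out
    return [(r, c)
            for r, row in enumerate(image)
            for c, v in enumerate(row)
            if abs(v - mu) > k * sigma]

def area_reduction(total_km2, inspected_km2):
    """Fraction of the scene an expert no longer has to inspect."""
    return 1.0 - inspected_km2 / total_km2

# Uniform "open water" with one bright pixel standing in for a candidate whale.
scene = [[10] * 10 for _ in range(10)]
scene[4][7] = 200
print(flag_outliers(scene))                     # candidate points for expert review
print(f"{area_reduction(1000, 2):.1%}")         # 1 - 2/1000
```

With the abstract's figures, going from 1,000 sq km to 2 sq km leaves 0.2% of the area for inspection, which matches the stated reduction of up to 99.8%.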
[61] Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models cs.CV | cs.AI | eess.IV
Tingyu Lin, Armin Dadras, Florian Kleber, Robert Sablatnig
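TL;DR: The paper systematically evaluates camera movement classification on historical footage, revealing how existing deep video models perform in this domain and where they fall short.
Details
Motivation: Camera movement conveys the spatial and narrative information of video content, but existing methods mainly target modern datasets, and their generalization to historical footage has not been explored.
Result: The best model, Video Swin Transformer, reaches 80.25% accuracy, showing strong convergence despite limited training data.
Insight: The study highlights both the challenges and the potential of adapting existing models to low-quality video, and points to future work combining diverse input modalities and temporal architectures.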
Abstract: Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.
[62] Free-Grained Hierarchical Recognition cs.CV
Seulki Park, Zilin Wang, Stella X. Yu
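TL;DR: The paper proposes free-grained hierarchical recognition, introducing the ImageNet-F benchmark and mixed-granularity learning methods to address inconsistent annotation granularity in the real world.
Details
Motivation: Real-world image annotations often vary in granularity, whereas existing methods typically assume complete, fine-grained annotations. The authors propose a more realistic labeling setting and learning framework to address this.
Result: Experiments show that the proposed methods, together with strong baselines, substantially improve hierarchical classification under mixed supervision.
Insight: The paper exposes the real-world problem of inconsistent label granularity and offers a flexible solution that combines vision-language models with semi-supervised learning.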
Abstract: Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.
[63] DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models cs.CV | cs.AI | I.2.m
Simone Carnemolla, Matteo Pennisi, Sarinda Samarasinghe, Giovanni Bellitto, Simone Palazzo
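TL;DR: DEXTER is a data-free framework for generating global textual explanations of visual classifiers. It combines diffusion models and large language models, optimizing text prompts to synthesize class-conditional images that reveal a classifier's decision patterns and biases.
Details
Motivation: Building transparent, trustworthy AI systems requires understanding and explaining the behavior of machine learning models, yet existing methods usually depend on training data or ground-truth labels. DEXTER aims to generate natural-language explanations in a data-free way.
Result: Experiments on ImageNet, Waterbirds, CelebA, and FairFaces show that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. A user study confirms that its outputs are accurate and interpretable.
Insight: DEXTER demonstrates the potential of combining diffusion models with large language models for model explanation and opens a new direction for data-free explanation methods.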
Abstract: Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier’s decision process without access to training data or ground-truth labels. We demonstrate DEXTER’s flexibility across three tasks (activation maximization, slice discovery and debiasing, and bias explanation), each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.
[64] LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement cs.CV
Xu Wu, Zhihui Lai, Xianxu Hou, Jie Zhou, Ya-nan Zhang
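TL;DR: LightQANet proposes a novel low-light image enhancement framework that quantizes illumination factors and learns dynamic adaptation, addressing the unreliable features that existing methods extract under low light and achieving better texture restoration and color consistency.
Details
Motivation: Existing low-light enhancement methods struggle to extract reliable features from severely degraded pixel information, leading to poor texture and color. LightQANet aims to solve this problem.
Result: It achieves state-of-the-art performance on multiple low-light datasets and performs strongly across diverse lighting conditions.
Insight: Combining quantized illumination factors with dynamic adaptive learning effectively improves the quality and consistency of low-light images.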
Abstract: Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifacts. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.
[65] MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks cs.CV
Zhang Nengbo, Hann Woei Ho, Ye Zhou
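TL;DR: The paper proposes a motion-based inter-MAV communication framework built on event vision and spiking neural networks (SNNs): information is conveyed through flight patterns and decoded by a lightweight SNN, and experiments confirm its efficiency and low power consumption.
Details
Motivation: Conventional radio communication in MAV swarms suffers from spectrum congestion, jamming, and high power consumption. Inspired by the honeybee waggle dance, the study explores a low-power, vision-based communication scheme.
Result: Experiments validate accurate decoding, low power consumption, and robustness, showing the framework's potential in constrained environments.
Insight: Visual communication via motion primitives, combined with event cameras and SNNs, offers MAV swarms an efficient, low-power alternative to radio links.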
Abstract: Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols ("start", "end", "1", "0"). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework’s effectiveness, which demonstrates accurate decoding and low power consumption, and highlights its potential as an energy-efficient alternative for MAV communication in constrained environments.
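The codebook lends itself to a small decoding sketch. It assumes the four primitives map to the four control symbols in the order the abstract lists them (vertical to "start", horizontal to "end", left-to-up-to-right to "1", left-to-down-to-right to "0"); that pairing, the primitive names, and the framing logic are assumptions for illustration, and the recognition step (segmentation plus SNN) is taken as already done.

```python
# Assumed primitive-to-symbol pairing, following the listing order in the abstract.
CODEBOOK = {
    "vertical": "start",
    "horizontal": "end",
    "left-up-right": "1",
    "left-down-right": "0",
}

def decode(primitives):
    """Turn a sequence of recognized motion primitives into framed bit strings.

    Bits are collected only between a "start" and the next "end";
    anything outside a frame is ignored.
    """
    messages, current = [], None
    for primitive in primitives:
        symbol = CODEBOOK[primitive]
        if symbol == "start":
            current = ""                  # open a new frame
        elif symbol == "end":
            if current is not None:
                messages.append(current)  # close the frame
            current = None
        elif current is not None:
            current += symbol             # payload bit inside a frame
    return messages

observed = ["vertical", "left-up-right", "left-down-right",
            "left-up-right", "horizontal"]
print(decode(observed))
```

How the resulting bit strings encode quantities such as heading and distance is not specified in the abstract, so the sketch stops at the bit level.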
[66] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection cs.CV
Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim
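TL;DR: CoT-PL proposes a framework that combines visual chain-of-thought reasoning with pseudo-labeling for open-vocabulary object detection. By decomposing object understanding into steps and adding contrastive background learning, it significantly improves detection in crowded or occluded scenes.
Details
Motivation: Existing open-vocabulary detection methods rely too heavily on direct image-text matching and neglect intermediate reasoning steps, which limits robustness in semantically complex scenes.
Result: CoT-PL gains +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes. In crowded and occluded scenes, novel-class pseudo-label quality improves by 103.4% and 168.4%, respectively, relative to the best prior method.
Insight: Combining chain-of-thought reasoning with pseudo-labeling improves the robustness of open-vocabulary detection, especially in complex scenes, where separating objects from background is key.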
Abstract: Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.
[67] Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images cs.CV | cs.AI
Usama Sajjad, Abdul Rehman Akbar, Ziyu Su, Deborah Knight, Wendy L. Frankel
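TL;DR: The study proposes PRISM, a new AI model that predicts five-year survival for colorectal cancer patients from H&E whole slide images. Modeling morphological diversity significantly improves prediction.
Details
Motivation: Colorectal cancer is the third most common malignancy worldwide. Existing task-agnostic methods can overlook organ-specific morphological patterns that are critical to tumor behavior and patient outcomes.
Result: PRISM performs well on five-year overall survival prediction (AUC = 0.70, accuracy 68.37%), outperforming existing CRC-specific methods by 15% and AI foundation models by about 23% in accuracy, with stable performance across subgroups.
Insight: Modeling morphology as a gradual continuum better reflects the incremental nature of tumor evolution, and PRISM's robustness supports its potential use in clinical decision-making.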
Abstract: Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task-agnostic methodologies that can overlook crucial organ-specific morphological patterns, which represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity, reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 ± 0.04; accuracy = 68.37% ± 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p < 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.
[68] Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks cs.CV | cs.AI
Pedro R. A. S. Bassi, Xinze Zhou, Wenxuan Li, Szymon Płotka, Jieneng Chen
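TL;DR: The paper proposes R-Super, which trains AI for multi-tumor early detection from abundant medical reports in place of conventional manually annotated tumor masks, substantially reducing cost while maintaining high performance.
Details
Motivation: Early tumor detection saves lives, but training current AI models depends on costly, manually drawn tumor masks. Medical reports already contain rich tumor information, yet this resource is underused.
Result: Models trained on 101,654 reports perform comparably to models trained on 723 masks. Combining reports and masks improves sensitivity by 13% and specificity by 8%, surpassing radiologists on five of the seven tumor types.
Insight: High-accuracy tumor detection AI can be trained efficiently from existing medical reports without large-scale manual mask annotation, providing a scalable path to detection across diverse tumor types.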
Abstract: Early tumor detection saves lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks: detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor’s size, number, appearance, and sometimes, pathology results, information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super
[69] Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning cs.CV | cs.LG
Ji Cao, Yu Wang, Tongya Zheng, Zujie Ren, Canghong Jin
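TL;DR: PRTraj is a novel trajectory representation learning framework that unifies environment perception and route choice modeling, addressing existing methods' neglect of the external environment and internal route choice behavior.
Details
Motivation: Existing trajectory representation learning methods treat trajectories as isolated spatio-temporal sequences, ignoring the external environment and route choice behavior that shape them.
Result: On 3 real-world datasets and 5 downstream tasks, PRTraj demonstrates effectiveness and generalizability, and it maintains robust performance in few-shot scenarios.
Insight: Environment perception and route choice behavior are essential to trajectory representation, and modeling them jointly improves downstream task performance.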
Abstract: Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment Perception and explicit Route choice modeling for effective Trajectory representation learning, dubbed PRTraj. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.
[70] FraQAT: Quantization Aware Training with Fractional bits cs.CV
Luca Morreale, Alberto Gil C. P. Ramos, Malcolm Chadwick, Mehid Noroozi, Ruchika Chavhan
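TL;DR: The paper proposes FraQAT, which progressively reduces precision (from 32 to 4 bits) and exploits fractional bits during optimization, enabling efficient computation while maintaining high generation quality.
Details
Motivation: Large models are hard to deploy on memory- and compute-constrained devices such as smartphones. Conventional quantization improves efficiency but often sacrifices generation quality.
Result: The method is validated on several diffusion models (e.g., SD3.5-Medium, Sana), achieving 4-7% lower FiD than standard QAT, and is deployed on a smartphone.
Insight: Progressive quantization with fractional-bit optimization is an effective way to improve the quality of quantized models and suits resource-constrained devices.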
Abstract: State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, e.g., in INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality in previous aggressive quantization, we propose a new fractional bits quantization (FraQAT) approach. The novelty is a simple yet effective idea: we progressively reduce the model’s precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that FraQAT yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, while achieving 4-7% lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
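One way to picture "fractional bits" is a uniform quantizer whose number of levels is round(2**b) for a non-integer bit width b, combined with a schedule that lowers b gradually from 32 to 4. The sketch below implements exactly that reading. The interpretation of fractional bits, the linear schedule, and the function names are assumptions for illustration; the abstract does not specify how the paper's method uses fractional bits during optimization.

```python
def fake_quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniformly quantize x within [lo, hi] using round(2**bits) levels.

    bits may be fractional (e.g. 4.5 bits -> 23 levels); this reading of
    fractional bit widths is an assumption, not the paper's definition.
    """
    levels = max(2, round(2 ** bits))
    step = (hi - lo) / (levels - 1)
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step

def bit_schedule(start_bits, end_bits, num_steps):
    """Linearly spaced bit widths from start_bits down to end_bits (assumed schedule)."""
    return [start_bits + (end_bits - start_bits) * i / (num_steps - 1)
            for i in range(num_steps)]

weight = 0.3
for bits in bit_schedule(8, 4, 9):      # 8.0, 7.5, 7.0, ..., 4.0
    q = fake_quantize(weight, bits)
    print(f"{bits:4.1f} bits -> {q:+.5f} (error {abs(q - weight):.5f})")
```

The appeal of intermediate bit widths in a sketch like this is that each precision drop is small, so a model being fine-tuned along the schedule never faces a large jump in quantization error.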
[71] Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data cs.CV
Qi Chen, Xinze Zhou, Chen Liu, Hao Chen, Wenxuan Li
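TL;DR: The paper explores the potential of synthetic data for tumor segmentation and introduces the AbdomenAtlas 2.0 dataset, which significantly improves model performance.
Details
Motivation: Large, finely annotated tumor segmentation datasets are scarce, largely because expert annotation is costly, so the authors study how synthetic data improves training efficiency.
Result: AbdomenAtlas 2.0 performs well on tumor segmentation across multiple organs, with notable gains over public datasets (+7% DSC on in-distribution tests, +16% on out-of-distribution tests).
Insight: Synthetic data can ease the scarcity of annotated medical data and markedly improve training efficiency, offering a new approach to tumor segmentation.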
Abstract: AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0, a dataset of 10,135 CT scans with a total of 15,130 tumor instances manually annotated per voxel in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation, based on lessons from the JHH dataset, for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
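The gains above are reported in DSC, the Dice similarity coefficient, defined for a predicted mask P and a ground-truth mask T as 2|P ∩ T| / (|P| + |T|). A minimal sketch on flattened binary masks follows; the convention of returning 1.0 when both masks are empty is a common choice assumed here, not something the abstract specifies.

```python
def dice_score(pred, truth):
    """Dice similarity coefficient for two flattened binary masks (lists of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0        # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

pred  = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(dice_score(pred, truth))   # one shared voxel out of 2 + 2 labeled voxels
```

Because the score normalizes by the total labeled volume, it remains informative for small tumors that occupy only a tiny fraction of a CT scan, where plain voxel accuracy would be dominated by background.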
[72] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models cs.CV | cs.RO
Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li
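TL;DR: QDepth-VLA proposes a framework that augments VLA models with a depth prediction task, using quantized depth maps to learn geometry-aware representations. It improves spatial reasoning and performs well on simulation and real-world tasks.
Details
Motivation: Existing VLA models lack the ability to understand and reason over 3D structure, which limits performance on precise control tasks.
Result: It shows strong spatial reasoning and competitive manipulation performance on simulation benchmarks and real-world tasks.
Insight: Depth prediction serves as an effective auxiliary supervision signal that helps the model learn geometric information and improves VLA task performance.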
Abstract: Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
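The "quantized latent tokens" come from the vector-quantization step of a VQ-VAE, in which each continuous latent vector is replaced by the index of its nearest codebook entry. The sketch below shows that lookup on a toy two-dimensional codebook; the codebook values and latent vectors are hypothetical, and the encoder that would produce the latents from a depth map is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)),
                key=lambda k: squared_distance(z, codebook[k]))
            for z in latents]

# Hypothetical 3-entry codebook and three encoder outputs for depth patches.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 2.1]]
tokens = quantize(latents, codebook)
print(tokens)   # discrete targets a depth expert could be trained to predict
```

Predicting discrete indices turns depth supervision into a classification problem over a fixed vocabulary, which fits naturally alongside the token-based outputs of a VLA model.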
[73] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints cs.CV
Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu
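TL;DR: ImagerySearch is an adaptive test-time search strategy that dynamically adjusts the inference search space and reward function to improve the coherence and visual plausibility of videos generated for imaginative scenarios. The authors also introduce the LDT-Bench benchmark and use it to validate the method.
Details
Motivation: Current video generation models perform well in realistic scenarios but degrade in imaginative ones, especially for prompts involving rarely co-occurring concepts with long-distance semantic relationships. The fixed search spaces and static reward designs of existing methods limit adaptability.
Result: ImagerySearch consistently outperforms strong baselines and existing test-time scaling approaches on LDT-Bench and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types.
Insight: Dynamically adjusting the search space and reward design is key to improving video generation in imaginative scenarios.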
Abstract: Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
[74] Multi-modal video data-pipelines for machine learning with minimal human supervision cs.CV | cs.DCPDF
Mihai-Cristian Pîrvu, Marius Leordeanu
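TL;DR: This paper presents a multi-modal video data pipeline for machine learning with little to no human supervision. It uses pre-trained experts and the PHG-MAE model, and shows that a low-parameter model can be competitive with much larger ones.
Details
Motivation: The real world is inherently multi-modal, but traditional machine learning models are usually unimodal. Truly understanding the world requires integrating many independent modalities while reducing the need for human supervision.
Result: A PHG-MAE model distilled to fewer than 1M parameters achieves results competitive with models of about 300M parameters, and the framework supports real-time semantic segmentation and near real-time depth estimation.
Insight: Multi-modal integration and distillation from pre-trained models can improve performance while reducing human supervision, and small models can remain useful on large-scale tasks.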
Abstract: The real world is inherently multi-modal at its core. Our tools observe and take snapshots of it in digital form, such as videos or sounds; however, much of it is lost. Similarly, for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (e.g., rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together; however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, efficiently distilled into a low-parameter (<1M) model, can achieve competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.
[75] Benchmarking Multimodal Large Language Models for Face Recognition cs.CV | cs.AI | cs.CLPDF
Hatef Otroshi Shahreza, Sébastien Marcel
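TL;DR: This paper systematically evaluates multimodal large language models (MLLMs) on face recognition and compares them with specialized face recognition models.
Details
Motivation: Although MLLMs perform well across many vision-and-language tasks, their potential for face recognition remains underexplored, particularly in high-precision scenarios.
Result: Experiments show that MLLMs capture rich semantic information but still lag behind specialized models in high-precision face recognition.
Insight: MLLMs need further improvement before they are competitive for face recognition, especially in accuracy and generalization.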
Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available in the project page.
[76] TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions cs.CVPDF
Guangyi Han, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
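TL;DR: The paper proposes TOUCH, a framework for generating diverse and physically plausible hand-object interactions (HOI). Fine-grained semantic control extends generation from grasping to free-form interactions such as pushing, poking, and rotating.
Details
Motivation: Existing HOI generation research is largely confined to fixed grasping patterns and lacks diversity and fine-grained control. The authors aim to use text guidance to generate free-form interactions closer to daily life.
Result: Experiments show that TOUCH generates controllable, diverse, and physically plausible hand interactions.
Insight: Combining fine-grained semantic control with physical constraints is key to generating realistic HOI.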
Abstract: Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method’s ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is https://guangyid.github.io/hoi123touch.
[77] BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data cs.CVPDF
Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Shizhan Zhu, Daniel Moura
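TL;DR: BADAS is a context-aware collision prediction model that focuses on distinguishing threats to the ego vehicle from unrelated accidents, reducing false alerts. Trained end-to-end on large-scale real-world dashcam data, it achieves state-of-the-art performance on several benchmarks.
Details
Motivation: Existing collision prediction methods cannot distinguish ego-vehicle threats from accidents that do not involve the ego vehicle, which leads to high false-alert rates in deployment. BADAS addresses this with a focus on ego-centric evaluation.
Result: BADAS achieves state-of-the-art AP/AUC on DAD, DADA-2000, DoTA, and Nexar, outperforms a forward-collision ADAS baseline, and produces more realistic time-to-accident estimates.
Insight: Distinguishing ego from non-ego accidents is essential for reducing false alerts. With large-scale data and a context-aware design, BADAS advances research on ego-centric collision prediction.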
Abstract: Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar’s real-world dashcam collision dataset – the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.
[78] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction cs.CV | cs.CLPDF
Logan Lawrence, Oindrila Saha, Megan Wei, Chen Sun, Subhransu Maji
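TL;DR: The paper proposes nlg2choice, a two-stage method that first asks an open-ended question and then extracts the answer with text-only constrained decoding. It improves the zero-shot performance of multimodal large language models (MLLMs) on fine-grained visual classification (FGVC) and addresses evaluation and computational-cost problems when the choices are numerous and highly related.
Details
Motivation: Existing approaches to zero-shot visual classification with MLLMs struggle to evaluate free-form responses or to handle tasks with many choices, particularly in FGVC, where choice sets run from hundreds to thousands of highly related options.
Result: Experiments on seven fine-grained visual datasets show improvements over baselines in both classification and retrieval, and the gains hold across different natural-language phrasings of the task.
Insight: Open-ended questioning combined with constrained decoding improves zero-shot performance, and early stopping improves throughput in retrieval settings with many choices, offering an efficient approach to fine-grained visual recognition.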
Abstract: Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don’t consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
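The second stage described above, constrained decoding over a fixed choice set, can be sketched in a few lines. This is a toy illustration rather than the paper's code: words stand in for tokens, decoding is greedy, and `next_token_scores` is a hypothetical callable representing a text-only language model that has already been conditioned on the free-form answer from the first stage.

```python
EOS = "<eos>"

def constrained_decode(choices, next_token_scores):
    """Greedy decoding restricted to token sequences that spell out a choice.

    `choices` must be non-empty. `next_token_scores(prefix)` returns a dict
    mapping candidate next tokens to scores (higher is better).
    """
    tokenized = [tuple(choice.split()) + (EOS,) for choice in choices]
    prefix = ()
    while not (prefix and prefix[-1] == EOS):
        # Only tokens that continue at least one valid choice are allowed.
        allowed = {t[len(prefix)] for t in tokenized if t[:len(prefix)] == prefix}
        scores = next_token_scores(prefix)
        # Tokens outside `allowed` are never considered, however high they score.
        best = max(sorted(allowed), key=lambda tok: scores.get(tok, float("-inf")))
        prefix += (best,)
    return " ".join(prefix[:-1])
```

Because every emitted token must extend some valid choice, the output is always a member of the choice set, which makes free-form answers straightforward to score.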
[79] Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection cs.CVPDF
Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz
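TL;DR: The paper proposes a semi-supervised video anomaly detection framework that uses multimodal large language models (MLLMs) to generate textual descriptions of object activity, making the detection of complex anomalies more explainable.
Details
Motivation: Existing semi-supervised video anomaly detection methods struggle with complex anomalies involving object interactions and lack explainability. The authors address both by using MLLMs to extract and analyze object activity.
Result: Experiments on benchmark datasets show that the method detects complex interaction-based anomalies effectively and also achieves state-of-the-art performance on datasets without interaction anomalies.
Insight: MLLM-generated textual descriptions provide a high-level semantic representation for video anomaly detection, improving explainability and generalization.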
Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.
[80] MaskCaptioner : Learning to Jointly Segment and Caption Object Trajectories in Videos cs.CV | cs.AI | cs.LGPDF
Gabriel Fiastre, Antoine Yang, Cordelia Schmid
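TL;DR: MaskCaptioner is an end-to-end model that jointly detects, segments, tracks, and captions object trajectories in videos. It is pretrained on the synthetic-caption datasets LVISCap and LV-VISCap and achieves state-of-the-art results on several benchmarks.
Details
Motivation: Dense Video Object Captioning (DVOC) is complex and costly to annotate. Existing methods rely on disjoint training strategies that can lead to suboptimal performance, which motivates an end-to-end joint training approach.
Result: MaskCaptioner achieves state-of-the-art DVOC results on three benchmarks: VidSTG, VLN, and BenSMOT.
Insight: Synthetic captions and end-to-end training can address the complexity and annotation cost of DVOC.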
Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
[81] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation cs.CVPDF
JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda
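TL;DR: 3DScenePrompt is a video generation framework that uses dual spatio-temporal conditioning and a 3D scene memory to generate videos with scene consistency and precise camera control.
Details
Motivation: Existing methods typically condition on a single image or a short clip and struggle to maintain spatial consistency and camera control over long sequences. This work aims to balance scene consistency and dynamic content in long-sequence video generation.
Result: Experiments show that the method significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.
Insight: Separating static from dynamic content is essential for long-sequence video generation; a 3D scene memory supports spatially consistent projection and camera control.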
Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: https://cvlab-kaist.github.io/3DScenePrompt/
[82] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression cs.CVPDF
Zhe Li, Weihao Yuan, Weichao Shen, Siyu Zhu, Zilong Dong
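TL;DR: The paper proposes OmniMotion, a motion generation framework based on continuous masked autoregression that fuses multiple modalities (text, speech, music) for human motion generation. With a modified attention mechanism and a diffusion component, it outperforms existing methods.
Details
Motivation: Multimodal human motion generation is challenging because it requires both an effective generation mechanism and a way to fuse multimodal inputs such as text, speech, and music. Previous methods usually rely on discrete masked modeling or autoregressive modeling and lack robustness to continuity and to heterogeneous multimodal distributions.
Result: Experiments show that OmniMotion outperforms existing methods on text-to-motion, speech-to-gesture, and music-to-dance.
Insight: Continuous masked autoregressive modeling suits the sequential nature of motion; gated linear attention and RMSNorm help stabilize multimodal fusion; the diffusion (DiT) structure strengthens the propagation of conditioning signals.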
Abstract: Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
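To make the stabilizing role of RMSNorm concrete, here is a minimal sketch of the standard RMSNorm definition, y_i = g_i * x_i / sqrt(mean(x^2) + eps). The abstract does not say where the module sits in the transformer or how it is parameterized, so this shows only the generic operation on a plain Python list, with `gain` and `eps` as the usual learnable scale and stability constant.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Scale a feature vector so that its root mean square is about 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # identity scale when no learned gain is given
    return [g * v / rms for v, g in zip(x, gain)]
```

Since the output depends on the direction of `x` but (up to `eps`) not on its magnitude, an unusually large activation, for example from an abnormal movement, is rescaled to the same range as ordinary ones.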
[83] RealDPO: Real or Not Real, that is the Preference cs.CV | cs.AIPDF
Guo Cheng, Danni Yang, Ziqi Huang, Jianlou Si, Chenyang Si
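TL;DR: RealDPO proposes a new preference-learning alignment paradigm that uses real-world data as positive samples, significantly improving the motion quality of video generation models.
Details
Motivation: Video generation models are limited in generating complex motion, producing movements that are not natural or coherent enough.
Result: Experiments show that RealDPO outperforms existing methods in video quality, text alignment, and motion realism.
Insight: Preference learning driven by real data can effectively improve the motion quality of generative models.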
Abstract: Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
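For reference, the standard DPO objective on one preference pair is sketched below. RealDPO uses a tailored loss whose details are not given in the abstract, so this is only the generic formulation, -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]), with the real video playing the role of the preferred sample y_w and an erroneous model output as y_l. The log-probability arguments are assumed to be precomputed scalars.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for a single (preferred, rejected) pair."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the likelihood of the real sample relative to the rejected one.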
[84] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning cs.CV | cs.CLPDF
Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang
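TL;DR: MathCanvas is a comprehensive framework that gives large multimodal models (LMMs) intrinsic Visual Chain-of-Thought (VCoT) capabilities for mathematical reasoning, substantially improving their performance on math problems.
Details
Motivation: Existing VCoT approaches struggle to produce the high-fidelity, strategically timed diagrams needed to solve complex mathematical problems, especially in domains such as geometry.
Result: The trained BAGEL-Canvas model achieves an 86% relative improvement over baselines on MathCanvas-Bench and generalizes well to other public math benchmarks.
Insight: MathCanvas shows the potential of visual chain-of-thought for multimodal mathematical reasoning and points to new directions for applying LMMs to complex problems.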
Abstract: While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/
[85] C4D: 4D Made from 3D through Dual Correspondences cs.CV | cs.AIPDF
Shizun Wang, Zhenxiang Jiang, Xingyi Yang, Xinchao Wang
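TL;DR: C4D extends existing 3D reconstruction methods to 4D by introducing short-term optical flow and long-term point tracking correspondences, addressing the breakdown of multi-view geometric constraints in dynamic scene reconstruction.
Details
Motivation: Moving objects in dynamic scenes violate multi-view geometric constraints, so directly applying static 3D reconstruction methods gives poor results. A new method is needed to recover dynamic geometry and camera poses jointly.
Result: Experiments show that C4D achieves complete 4D reconstruction and performs strongly on depth estimation, camera pose estimation, and point tracking.
Insight: Explicitly modeling the motion of dynamic objects improves dynamic scene reconstruction and provides a reliable basis for downstream 4D tasks.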
Abstract: Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D
[86] Learning an Image Editing Model without Image Editing Pairs cs.CV | cs.LGPDF
Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li
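TL;DR: The paper proposes a training paradigm that needs no paired image editing data. It directly optimizes a diffusion model using feedback from vision-language models (VLMs), avoiding the dependence on large-scale supervised data.
Details
Motivation: Conventional image editing models depend on large datasets of input-target pairs, which are hard to obtain at scale. Existing workarounds use synthetic data, which can propagate the flaws of the pretrained model, so a training method without paired data is needed.
Result: On standard benchmarks the method performs on par with diffusion models trained on supervised paired data, and with the same VLM as the reward model it outperforms RL-based methods such as Flow-GRPO.
Insight: 1. VLMs can serve as a strong supervision signal in place of paired data. 2. DMD helps preserve visual quality by keeping generated images within the image manifold learned by pretrained models.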
Abstract: Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
[87] From Pixels to Words – Towards Native Vision-Language Primitives at Scale cs.CV | cs.AIPDF
Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang
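TL;DR: This paper sets out design principles for native vision-language models (VLMs) and introduces NEO, an efficient model family built on these principles that rivals modular VLMs with comparatively little data.
Details
Motivation: The work addresses two main challenges for native VLMs: clarifying what fundamentally distinguishes them from modular VLMs, and making research in this area more accessible.
Result: Using only 390M image-text examples, NEO achieves visual perception comparable to top-tier modular VLMs.
Insight: Native VLMs can learn multimodal representations efficiently through a shared semantic space and a unified architecture, while lowering the barrier to research in the field.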
Abstract: The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
[88] Coupled Diffusion Sampling for Training-Free Multi-View Image Editing cs.CV | cs.AIPDF
Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu
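TL;DR: The paper proposes Coupled Diffusion Sampling, a training-free method for consistent multi-view image editing that avoids the lengthy optimization and instability of existing methods that rely on explicit 3D representations.
Details
Motivation: Existing 2D image editing models struggle to stay consistent across views, while methods based on explicit 3D representations suffer from long optimization times and instability under sparse views.
Result: The method's effectiveness and generality are validated on three multi-view image editing tasks, and it applies across various model architectures.
Insight: The method offers a lightweight solution that needs no explicit 3D representation, showing potential for multi-view consistent editing.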
Abstract: We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
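The coupling idea can be pictured with a one-dimensional toy. The abstract does not give the form of the coupling term or the sampler, so everything below is an assumption made for illustration: two trajectories each follow their own score function (standing in for the multi-view model and the 2D editing model), and a hypothetical linear term with strength `coupling` pulls each sample toward the other at every step.

```python
import random

def coupled_sampling(score_a, score_b, steps=400, step_size=0.05,
                     coupling=1.0, noise=0.0, seed=0):
    """Run two Langevin-style trajectories joined by a linear coupling term."""
    rng = random.Random(seed)
    xa, xb = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    for _ in range(steps):
        # Each drift combines the trajectory's own model with a pull
        # toward the current state of the other trajectory.
        drift_a = score_a(xa) + coupling * (xb - xa)
        drift_b = score_b(xb) + coupling * (xa - xb)
        xa += step_size * drift_a + noise * rng.gauss(0.0, 1.0)
        xb += step_size * drift_b + noise * rng.gauss(0.0, 1.0)
    return xa, xb
```

With unit-variance Gaussian scores centred at 0 and 2 and no noise, the uncoupled samples settle 2 apart, while `coupling=1.0` shrinks the gap to 2/3: each sample stays close to its own distribution but is drawn toward agreement with the other.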
cs.CL [Back]
[89] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL cs.CL | cs.AIPDF
Ashish Kattamuri, Ishita Prasad, Meetu Malhotra, Arpita Vats, Rahul Raja
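TL;DR: The paper proposes a framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to improve the task efficiency and semantic accuracy of multilingual Text-to-SQL systems.
Details
Motivation: Existing Text-to-SQL methods focus only on executable queries and overlook semantic alignment; execution accuracy also drops markedly for non-English languages.
Result: Execution accuracy rises to 87.4% (+26 pp over zero-shot) with GRPO fine-tuning, and average semantic accuracy rises to 59.14% (+6.85 pp) with the contrastive reward; the smaller 3B model outperforms a zero-shot 8B model in execution accuracy.
Insight: Contrastive reward signals improve semantic alignment and can markedly improve multilingual Text-to-SQL performance without large-scale training data.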
Abstract: Current Text-to-SQL methods focus on, and are evaluated on, executable queries only, overlooking the semantic alignment challenge – both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) – all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
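The group-relative part of GRPO can be sketched briefly: several SQL candidates are sampled for one question, each gets a scalar reward, and each candidate's advantage is its reward standardized against the group. The reward below, an execution term plus a weighted semantic-similarity term, is a hypothetical stand-in for the paper's contrastive reward; the `weight` value and the use of the population standard deviation are illustrative assumptions.

```python
import statistics

def combined_reward(executes_correctly, semantic_similarity, weight=0.5):
    """Hypothetical reward: execution success plus a semantic-alignment bonus."""
    return (1.0 if executes_correctly else 0.0) + weight * semantic_similarity

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, four sampled queries could be scored with `combined_reward` and passed to `group_relative_advantages`; candidates that execute correctly and stay close to the user's intent receive positive advantages, and no separate value network is needed to supply a baseline.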
[90] A Linguistics-Aware LLM Watermarking via Syntactic Predictability cs.CL | cs.AIPDF
Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han
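TL;DR: STELA is a new LLM watermarking framework that adapts watermark strength to the linguistic degrees of freedom of the text, addressing the trade-off between watermark detectability and text quality without access to model internals at detection time.
Details
Motivation: As large language models (LLMs) advance rapidly, reliable watermarking is key to a trustworthy AI ecosystem. Existing methods rely on signals from the model's output distribution, so they cannot be publicly verified.
Result: Experiments show that STELA outperforms prior methods in detection robustness across English, Chinese, and Korean.
Insight: Linguistic knowledge can be used to balance text quality and detection performance dynamically while supporting public verification.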
Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.
[91] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference cs.CL | cs.AIPDF
Chao Han, Yijuan Liang, Zihao Xuan, Daokuan Wu, Wei Zhang
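TL;DR: The paper proposes informed routing, which uses a Lightweight Feature Forecaster (LFF) to estimate a unit's output before the routing decision is made, reducing computation while preserving model performance.
Details
Motivation: Real-world use of large language models (LLMs) is limited by high inference cost, and existing dynamic routing methods use greedy strategies that cause information loss and suboptimal token selection.
Result: The method achieves strong efficiency-performance trade-offs on language modeling and reasoning tasks, cuts training time by over 50%, and matches or surpasses strong baselines even without final LoRA fine-tuning.
Insight: Assessing a token's recoverability in addition to its importance is more effective than assessing importance alone; a lightweight forecasting module can markedly improve the quality and efficiency of routing decisions.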
Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing, a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token’s immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit’s output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing
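The execute-or-approximate policy can be sketched as follows. This is a schematic under stated assumptions, not the released code: `unit` stands for an expensive model component, `forecaster` for a cheap approximation in the spirit of the LFF, and `router_score` for a hypothetical per-token estimate of how much is lost by approximating. A fixed `budget` of tokens gets the full computation; every other token receives the forecast instead of being skipped.

```python
def route_tokens(tokens, unit, forecaster, router_score, budget):
    """Run `unit` on the `budget` highest-scoring tokens; approximate the rest."""
    k = max(0, min(budget, len(tokens)))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: router_score(tokens[i]), reverse=True)
    execute = set(ranked[:k])
    # Non-selected tokens receive the forecaster's output, not a pass-through.
    return [unit(x) if i in execute else forecaster(x)
            for i, x in enumerate(tokens)]
```

In a greedy execute-or-skip scheme the non-selected tokens would bypass the unit unchanged; here they still get an estimate of its output, which is what lets easily approximated tokens be routed away at little cost.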
[92] Revisiting the UID Hypothesis in LLM Reasoning Traces cs.CL | cs.AIPDF
Minju Gwak, Guijin Son, Jaehyung Kim
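TL;DR: The paper introduces entropy-based metrics to analyze information flow in LLM reasoning traces and finds that successful reasoning has non-uniform information density, in stark contrast to human communication patterns. This challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
Details
Motivation: The intermediate steps that LLMs produce in step-by-step Chain-of-Thought (CoT) reasoning are often unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis from psycholinguistics, the study analyzes information flow to understand LLM reasoning patterns.
Result: Correct LLM reasoning shows uneven swings in information density, contrary to what the UID hypothesis describes for human communication, which points to distinctive properties of machine reasoning.
Insight: Information flow in machine reasoning may differ from that in human communication, which gives a new theoretical basis and direction for designing more interpretable and adaptive reasoning models.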
Abstract: Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics – which posits that humans communicate by maintaining a stable flow of information – we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
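The abstract does not define its entropy-based metrics, so the following is only a minimal sketch of one way to quantify how uniform the information flow of a reasoning trace is. It assumes that each reasoning step is given as a list of next-token probability distributions, and it uses the variance of per-step mean entropy as a hypothetical uniformity score; the function names and the toy traces are illustrative and are not taken from the paper.

```python
import math
from statistics import mean, pvariance


def token_entropy(probs):
    """Shannon entropy (in bits) of a single next-token distribution."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


def step_information(step_distributions):
    """Mean token entropy over the tokens of one reasoning step."""
    return mean(token_entropy(d) for d in step_distributions)


def non_uniformity(trace):
    """Variance of per-step information; 0.0 means a perfectly uniform trace.

    This score is an assumption made for illustration, not the paper's metric.
    """
    return pvariance([step_information(step) for step in trace])


# Toy traces: each step holds one token distribution.
uniform_trace = [[[0.5, 0.5]], [[0.5, 0.5]], [[0.5, 0.5]]]
spiky_trace = [[[1.0, 0.0]], [[0.25, 0.25, 0.25, 0.25]], [[1.0, 0.0]]]

uniform_score = non_uniformity(uniform_trace)
spiky_score = non_uniformity(spiky_trace)
```

Under this toy score, a trace whose steps all carry the same entropy scores zero, while a trace that alternates between certain and highly uncertain steps scores higher, which is the kind of contrast the abstract describes.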
[93] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA cs.CL | cs.AI | cs.CVPDF
A H M Rezaul Karim, Ozlem Uzuner
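TL;DR: The MasonNLP system uses a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples to improve medical visual question answering (MedVQA), ranking 3rd in the MEDIQA-WV 2025 shared task.
Details
Motivation: To address the challenge of generating free-text responses and structured wound attributes in medical VQA, in support of clinical decision-making and patient care.
Result: Ranked 3rd in MEDIQA-WV 2025 with an average score of 41.37%, with strong results on metrics such as dBLEU and ROUGE.
Insight: Lightweight RAG with a general-purpose LLM is a simple and efficient solution for multimodal clinical NLP, particularly in resource-constrained settings.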
Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs – a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking – provides a simple and effective baseline for multimodal clinical NLP tasks.
[94] Unlocking the Potential of Diffusion Language Models through Template Infilling cs.CL | cs.AIPDF
Junhoo Lee, Seungyeon Kim, Nojun Kwak
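TL;DR: The paper proposes Template Infilling (TI), a conditioning method designed for diffusion language models (DLMs), together with Dynamic Segment Allocation (DSA) for more flexible generation, and reports notable gains on mathematical reasoning and code generation tasks.
Details
Motivation: Inference strategies for current diffusion language models remain limited to the prefix prompting inherited from autoregressive models, which restricts their potential in generation tasks.
Result: On mathematical reasoning and code generation benchmarks, performance improves by 17.01 percentage points over baseline; in multi-token generation settings, the method delivers effective speedup while maintaining generation quality.
Insight: Combining TI and DSA gives DLMs stronger structural control and flexibility in generation, advancing the use of diffusion models for language tasks.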
Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs’ generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01 percentage points over baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.
[95] What Layers When: Learning to Skip Compute in LLMs with Residual Gates cs.CL | cs.AIPDF
Filipe Laitenberger, Dawid Kopiczko, Cees G. M. Snoek, Yuki M. Asano
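TL;DR: The paper proposes GateSkip, a simple residual gating mechanism that adds a gate to each Attention/MLP branch to enable token-wise layer skipping in decoder-only language models. The method fine-tunes stably, needs no extensive retraining, and saves substantial compute.
Details
Motivation: Existing layer-skipping methods, such as early exit or router-based Mixture-of-Depths models, are unstable and complex to train, so a better approach to efficient inference is needed.
Result: On long-form reasoning, the method saves up to 15% compute while retaining over 90% of baseline accuracy; on instruction-tuned models, it improves accuracy at full compute and matches baseline quality at close to 50% compute savings.
Insight: The gates reveal properties of transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with other optimizations such as quantization, pruning, and self-speculative decoding.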
Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch’s output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
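The gating and budgeted skipping described in this abstract can be sketched in a few lines. This is a simplified illustration and not the authors' implementation: it assumes a scalar gate per token (the paper's gate may be vector-valued), plain Python lists in place of tensors, and a budget expressed as the fraction of tokens a layer is allowed to process. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_value(hidden, weights, bias):
    """Sigmoid-linear gate: maps a token's hidden vector to a value in (0, 1)."""
    return sigmoid(sum(h * w for h, w in zip(hidden, weights)) + bias)


def gated_residual_update(residual, branch_out, gate):
    """Scale the branch output by its gate before it re-enters the residual stream."""
    return [r + gate * b for r, b in zip(residual, branch_out)]


def tokens_to_run(gates, budget):
    """Indices of the tokens a layer processes under a per-layer budget.

    Tokens are ranked by gate value; the rest skip the layer entirely.
    """
    keep = max(1, math.ceil(budget * len(gates)))
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    return sorted(ranked[:keep])


# Four tokens, half of them allowed through this layer.
selected = tokens_to_run([0.9, 0.1, 0.6, 0.3], budget=0.5)
```

With the gate values above and a budget of 0.5, only the two highest-gated tokens (indices 0 and 2) run the layer; a skipped token keeps its residual unchanged, which is what a gate of zero produces in `gated_residual_update`.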
[96] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks cs.CLPDF
Jimin Lim, Arjun Damerla, Arthur Jiang, Nam Le
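TL;DR: The paper introduces TextBandit, a new benchmark for evaluating the probabilistic reasoning of large language models (LLMs) under purely textual feedback; results show that some LLMs perform well in uncertain environments.
Details
Motivation: LLMs perform well on reasoning tasks, but their ability to make decisions under uncertainty in language-only settings has not been thoroughly studied.
Result: Qwen3-4B reaches a best-arm selection rate of 89.2%, significantly outperforming both larger LLMs and traditional methods.
Insight: Probabilistic reasoning can emerge from purely linguistic feedback, which opens a new direction for evaluating decision-making in non-numeric settings.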
Abstract: Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, such as “you earned a token”, without access to numerical cues or explicit probabilities, so that the model must infer latent reward structures purely from linguistic cues and adapt accordingly. We evaluate four open-source LLMs and compare their performance to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning can emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
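For readers unfamiliar with the baselines, here is a minimal sketch of Thompson Sampling on a Bernoulli bandit that reports the best-arm selection rate. It is a generic textbook version and not the benchmark's code: the arm probabilities, the step count, and the "you earned nothing" message are assumptions made for illustration (only "you earned a token" appears in the abstract).

```python
import random


def thompson_best_arm_rate(arm_probs, steps, seed=0):
    """Run Beta-Bernoulli Thompson Sampling; return the fraction of best-arm pulls."""
    rng = random.Random(seed)
    successes = [1] * len(arm_probs)  # Beta(1, 1) prior for every arm
    failures = [1] * len(arm_probs)
    best_arm = max(range(len(arm_probs)), key=lambda i: arm_probs[i])
    best_pulls = 0
    for _ in range(steps):
        # Sample a plausible reward rate per arm and play the highest sample.
        samples = [rng.betavariate(s, f) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        # Textual feedback only; the failure message is a hypothetical placeholder.
        won = rng.random() < arm_probs[arm]
        feedback = "you earned a token" if won else "you earned nothing"
        if feedback == "you earned a token":
            successes[arm] += 1
        else:
            failures[arm] += 1
        best_pulls += int(arm == best_arm)
    return best_pulls / steps


rate = thompson_best_arm_rate([0.2, 0.8], steps=1000)
```

The returned rate is the same kind of quantity as the 89.2% figure quoted for Qwen3-4B, although the environments used in the paper are not specified here and the numbers are not directly comparable.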
[97] Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation cs.CLPDF
Bolei Ma, Yong Cao, Indira Sen, Anna-Carolina Haensch, Frauke Kreuter
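TL;DR: This position paper argues that LLM-based social simulation should adopt open-ended designs rather than traditional closed formats, in order to improve realism and methodological utility.
Details
Motivation: Existing social simulation studies usually constrain LLM outputs to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the generative nature of LLMs and cannot fully capture the diversity and complexity of social phenomena.
Result: Open-ended designs better capture expressiveness and individuality, and they aid pretesting and methodological refinement.
Insight: The paper calls for new practices and evaluation frameworks that make full use of the generative diversity of LLMs and foster synergy between NLP and social science.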
Abstract: Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes “in” LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.
[98] Reliable Fine-Grained Evaluation of Natural Language Math Proofs cs.CL | cs.AIPDF
Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif
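TL;DR: The paper introduces the ProofBench dataset and the ProofGrader evaluator for fine-grained evaluation of LLM-generated math proofs, filling a gap in existing evaluation tools.
Details
Motivation: Work on LLM mathematical reasoning has focused mainly on tasks with easily verifiable final answers, and there is no reliable, fine-grained method for evaluating natural language math proofs.
Result: ProofGrader achieves a Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines; in best-of-n selection at n=16, it reaches an average score of 4.14 out of 7.
Insight: Careful evaluator design, such as providing reference solutions and using ensembling, substantially improves evaluation quality and can help advance proof generation.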
Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
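The "78% of the gap" figure follows directly from the three scores quoted in the abstract, and the MAE is the usual mean of absolute differences between evaluator and expert scores. The short sketch below reproduces that arithmetic; the example grades passed to `mean_absolute_error` are made up for illustration and are not ProofBench data.

```python
def mean_absolute_error(predicted, expert):
    """Average absolute difference between evaluator scores and expert scores."""
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(expert)


def gap_closed(score, baseline, oracle):
    """Fraction of the baseline-to-oracle gap recovered by `score`."""
    return (score - baseline) / (oracle - baseline)


# Scores from the abstract: ProofGrader 4.14, binary evaluator 2.48, human oracle 4.62.
fraction = gap_closed(4.14, baseline=2.48, oracle=4.62)
print(round(fraction, 2))  # 0.78

# Hypothetical grades on the 0-7 scale, for illustration only.
example_mae = mean_absolute_error([7, 3, 0], [6, 3, 2])
```

Here (4.14 - 2.48) / (4.62 - 2.48) = 1.66 / 2.14, which rounds to 0.78 and matches the reported figure.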
[99] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness cs.CL | cs.AI | 68T50 (Primary) 68T07 (Secondary) | I.2.7PDF
Fali Wang, Jihai Chen, Shuhua Yang, Ali Al-Lawati, Linli Tang
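TL;DR: This survey systematically reviews research on collaboration between small language models (SLMs) and large language models (LLMs), proposes a taxonomy built around four objectives (performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness), summarizes representative methods and design paradigms, and discusses future research directions.
Details
Motivation: LLMs excel in many domains, but high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns constrain their use. SLMs, being compact and efficient, offer a complementary remedy. The survey examines the potential and current state of collaboration between the two.
Result: The survey summarizes current progress in SLM-LLM collaboration, identifies the strengths and challenges of collaborative frameworks, and outlines directions for future research.
Insight: SLM-LLM collaboration combines the strengths of both to enable efficient and flexible deployment; scalability and security still need further work.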
Abstract: Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs’ specialization and efficiency with LLMs’ generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.
[100] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory cs.CL | cs.CYPDF
Nicole Smith-Vaniz, Harper Lyon, Lorraine Steigner, Ben Armstrong, Nicholas Mattei
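TL;DR: Using Moral Foundations Theory (MFT), the paper studies potential biases of large language models (LLMs) in political and moral domains, examining whether LLM responses show ideological leanings and comparing them directly with human data.
Details
Motivation: As LLMs become more widely used in daily life, the advice they give in areas such as medicine, personal relationships, and law may carry latent biases, especially on political and moral questions. The study aims to quantify these biases and compare them with human moral leanings.
Result: LLMs show leanings on some moral foundations, and their responses under certain political ideologies correlate with human data; role-playing affects their responses, but its accuracy leaves room for improvement.
Insight: LLMs are not neutral tools, and their responses may carry implicit political and moral leanings; the study offers a new perspective for understanding and quantifying bias in AI systems.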
Abstract: Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raises questions about how LLMs respond in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of LLM responses with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in LLM responses, nor has any connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.
[101] Schema for In-Context Learning cs.CL | cs.AIPDF
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung
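TL;DR: The paper proposes SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL), a framework grounded in schema theory from cognitive science that explicitly builds abstract reasoning schemas to improve the task adaptation of large language models, substantially improving reasoning performance.
Details
Motivation: Traditional in-context learning (ICL) lacks explicit modules for knowledge retrieval and for knowledge transfer at the abstraction level. Inspired by schema theory, the authors build abstract reasoning schemas to mimic human cognition and thereby improve model reasoning.
Result: SA-ICL substantially improves performance on GPQA tasks (by up to 36.19 percent) while reducing reliance on the number of demonstration examples and improving interpretability.
Insight: 1. Explicit schema scaffolding is more effective than implicit learning; 2. SA-ICL can unify different ICL strategies (such as pattern priming and Chain-of-Thought prompting); 3. It offers a new direction for more human-like reasoning in LLMs.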
Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model’s reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.
[102] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization cs.CL | stat.MLPDF
Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang
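TL;DR: The paper proposes the Prompt Duel Optimizer (PDO), an efficient label-free framework for automatic prompt optimization. It uses a dueling-bandit formulation with pairwise preference feedback from an LLM judge, combined with Double Thompson Sampling and Top-Performer Guided Mutation, and consistently outperforms baseline methods.
Details
Motivation: Large language models (LLMs) are highly sensitive to their input prompts, but conventional automatic prompt optimization (APO) depends on labeled data, and collecting high-quality labels is costly and slow in practice.
Result: PDO outperforms baselines on BIG-bench Hard and MS MARCO; ablation studies confirm the effectiveness of both D-TS and the mutation strategy.
Insight: Label-free optimization is feasible and efficient: preference feedback from an LLM judge is enough to drive prompt optimization, and combining sampling with mutation markedly improves performance.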
Abstract: Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
[103] Interpreting the Latent Structure of Operator Precedence in Language Models cs.CLPDF
Dharunish Yugeswardeenoo, Harshil Nukala, Cole Blondin, Sean O Brien, Vasu Sharma
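TL;DR: By analyzing the internal representations of the open-source LLaMA 3.2-3B model, the paper shows how a language model encodes operator precedence and introduces a new technique, partial embedding swap, for modifying that precedence.
Details
Motivation: Large language models (LLMs) struggle with arithmetic tasks, yet prior work mostly focuses on outputs or prompting strategies and leaves the model's internal computational structure poorly understood. The paper investigates whether LLMs encode operator precedence in their internal representations.
Result: Intermediate results appear in the residual stream, mainly after MLP blocks, and operator precedence is linearly encoded in each operator's embedding after the attention layer.
Insight: The model's encoding of operator precedence is linear and can be altered by modifying embedding dimensions, which offers a new view of the internal computation of LLMs.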
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving the open question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator’s embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
[104] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning cs.CL | cs.AIPDF
Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Zhongyuan Wang, Jichen Zhang
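TL;DR: The paper proposes a Knowledge Reasoning Language Model (KRLM) that coordinates LLM knowledge with knowledge graph context to address the knowledge distortion and generative hallucination that affect existing LLM-based methods for inductive knowledge graph reasoning.
Details
Motivation: Existing LLM-based methods for inductive knowledge graph reasoning suffer from knowledge distortion and generative hallucinations, which makes their reasoning results unreliable.
Result: On 25 real-world datasets, KRLM shows significant superiority in both zero-shot reasoning and fine-tuning scenarios.
Insight: Unified coordination of LLM knowledge and KG context improves the credibility of reasoning, and the dynamic knowledge memory mechanism is key.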
Abstract: Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios. Our source code is available at https://anonymous.4open.science/r/KRLM-EA36.
[105] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems cs.CLPDF
Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li
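TL;DR: RAGCap-Bench is a capability-oriented benchmark for evaluating intermediate tasks in agentic retrieval-augmented generation (RAG) systems, targeting their difficulty with multi-hop questions and their underexplored intermediate reasoning capabilities.
Details
Motivation: Existing RAG systems still struggle with multi-hop questions, and their intermediate reasoning capabilities are understudied, so a fine-grained evaluation tool is needed.
Result: Experiments show that slow-thinking models with stronger RAGCap performance achieve better end-to-end results.
Insight: Improving intermediate reasoning capabilities is essential to the overall performance of agentic RAG systems.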
Abstract: Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that “slow-thinking” models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark’s validity and the importance of enhancing these intermediate capabilities.
[106] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms cs.CL | cs.AIPDF
Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang
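TL;DR: The paper proposes a data synthesis method that generates question-answer pairs by progressively increasing task complexity, for training more effective web agents. A baseline agent is used to check data quality and diversity, and experiments show the resulting data surpasses existing datasets in tool-use diversity and downstream performance.
Details
Motivation: Existing data synthesis methods for long-horizon reasoning offer limited control over complexity and quality, and many studies conflate data effects with training effects. Improving agent performance on complex tasks requires a finer-grained data synthesis method.
Result: Across multiple web benchmarks, the synthesized data trains stronger agents, exhibits twice the diversity in tool-use actions, and avoids repetitive tool-calling behavior.
Insight: Data complexity and diversity are key to long-horizon reasoning, and having the baseline agent perform multiple validation roles effectively safeguards data quality.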
Abstract: Web-based ‘deep research’ agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
[107] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models cs.CL | cs.AI | cs.LGPDF
Ivan Lee, Taylor Berg-Kirkpatrick
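TL;DR: The paper challenges the assumption that readability is the key factor in training small language models (SLMs), showing experimentally that statistical simplicity matters more for learnability.
Details
Motivation: Recent studies show that small language models trained on simplified, child-directed corpora such as TinyStories can generate coherent text, which has led readability to be regarded as a key factor. The authors question this assumption and argue for a more precise analysis of the properties that actually affect model capability.
Result: Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even develop coherence faster during training. Statistical simplicity is a stronger predictor of learning efficiency.
Insight: Language model training should not be loosely compared to human cognitive development; the actual drivers of capability emergence need more rigorous analysis.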
Abstract: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability – characterized by accessible vocabulary, familiar narrative structure, and simple syntax – plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training – drawing parallels to human cognitive development without empirical basis – and argue for more precise reasoning about what properties actually support capability emergence in small models.
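The abstract measures statistical simplicity by n-gram diversity without giving a formula. One common formulation, used here purely as an assumption, is the ratio of distinct n-grams to total n-grams in a token sequence; lower values mean more repetition and therefore a statistically simpler corpus. The whitespace tokenization and the two sample sentences below are illustrative only.

```python
def ngram_diversity(tokens, n=2):
    """Distinct n-grams divided by total n-grams; 1.0 means no n-gram repeats."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Repetitive text versus varied text, both eight tokens long.
repetitive = "the cat sat . the cat sat .".split()
varied = "the cat sat on a warm mat today".split()

repetitive_score = ngram_diversity(repetitive)  # 4 distinct bigrams out of 7
varied_score = ngram_diversity(varied)          # 7 distinct bigrams out of 7
```

Both sample texts use short, familiar words, so they are similarly readable, yet their bigram diversity differs sharply. That is the distinction between readability and statistical simplicity that the paper draws.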
[108] Element2Vec: Build Chemical Element Representation from Text for Property Prediction cs.CLPDF
Yuanhao Li, Keyuan Lai, Tianqi Wang, Qihao Liu, Jiawei Ma
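TL;DR: The paper proposes Element2Vec, which builds representations of chemical elements from natural language text for property prediction. It combines global and local vector representations and adds a self-attention-based test-time training method to address data sparsity and distribution mismatch.
Details
Motivation: Accurate property data for chemical elements is essential to materials design and manufacturing, but many properties are hard to measure directly. Traditional numerical methods struggle to model complex relationships, and existing AI methods suffer from hallucinations and limited interpretability.
Result: The method improves the accuracy of element property prediction and mitigates the challenges of data sparsity and text distribution mismatch.
Insight: Combining natural language processing with self-attention allows better modeling of complex scientific data and offers a new approach to AI-driven discovery in materials science.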
Abstract: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many properties are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vec to effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Beyond the complicated relationships across elements, computational challenges also arise from 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to clearly mitigate the prediction error caused by vanilla regression. We hope this work could pave the way for advancing AI-driven discovery in materials science.
[109] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling cs.CLPDF
Peng Kuang, Yanli Wang, Xiaoyu Han, Yaowenqi Liu, Kaidi Xu
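TL;DR: The paper develops a theoretical framework for optimally combining signals from a large language model (LLM) and a process reward model (PRM), and shows experimentally that the resulting weighted aggregation strategy is efficient for test-time scaling (TTS).
Details
Motivation: PRMs are a core tool in TTS for verifying and selecting the best LLM responses, yet simple majority voting occasionally outperforms PRM-based selection, so how to use PRM signals more effectively is an open question.
Result: Calibrated weighting functions significantly improve TTS efficiency, surpassing vanilla weighted majority voting while using only 21.3% of the computation.
Insight: A smarter signal aggregation strategy can improve performance more than simply increasing test-time computation.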
Abstract: Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only 21.3% of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
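To make the idea of weighted aggregation concrete, the sketch below contrasts plain majority voting with a vote in which each sampled response contributes a weight computed from its PRM score, including negative weights for low-scoring responses. The weighting function `score - 0.5` is a hypothetical stand-in; the paper calibrates its weighting functions per LLM-PRM pair, and that procedure is not reproduced here.

```python
from collections import defaultdict


def weighted_vote(answers, prm_scores, weight_fn):
    """Sum a weight per response for each distinct answer; return the top answer."""
    totals = defaultdict(float)
    for answer, score in zip(answers, prm_scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)


def majority_vote(answers):
    """Plain majority voting: every response counts equally, PRM scores are ignored."""
    return weighted_vote(answers, [0.0] * len(answers), lambda _score: 1.0)


answers = ["A", "A", "A", "B", "B"]
prm_scores = [0.1, 0.2, 0.1, 0.9, 0.8]

plain = majority_vote(answers)
# Hypothetical weighting: responses scored below 0.5 count against their answer.
weighted = weighted_vote(answers, prm_scores, lambda score: score - 0.5)
```

In this toy case the majority answer "A" is backed only by low-scoring responses, so the weighted vote selects "B" instead. Allowing negative weights lets a poor PRM score actively count against an answer rather than merely contribute less.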
[110] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows cs.CLPDF
Ye Yuan, Mohammad Amin Shabani, Siqi Liu
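TL;DR: FACTS is an agentic workflow for query-focused table summarization built on offline template generation. Its offline templates (SQL queries plus Jinja2 templates) enable fast, accurate, and privacy-compliant summarization, and it outperforms baseline methods on multiple benchmarks.
Details
Motivation: Existing table summarization approaches suffer from costly fine-tuning, limited reasoning ability, privacy risks, or poor efficiency. FACTS aims to address these issues with offline template generation and an agentic workflow.
Result: FACTS outperforms existing baselines on multiple benchmarks, supporting its efficiency and accuracy in practical use.
Insight: Combining offline template generation with an agentic workflow preserves summary quality while improving efficiency and privacy protection.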
Abstract: Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.
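The idea of a reusable offline template (an SQL query plus a text template) can be sketched as follows. This is an illustration under stated assumptions, not the FACTS implementation: the `sales` schema, the query, and the helper names are hypothetical, and the standard-library `string.Template` stands in for Jinja2 so the example has no third-party dependency.

```python
import sqlite3
from string import Template

# Hypothetical offline template. In FACTS this would be generated once from
# the table schema alone; here it is written by hand for illustration.
OFFLINE_TEMPLATE = {
    "sql": (
        "SELECT region, SUM(amount) AS total FROM sales "
        "GROUP BY region ORDER BY total DESC LIMIT 1"
    ),
    # string.Template is a stand-in for a Jinja2 template.
    "text": Template("The top region is $region with total sales of $total."),
}


def render_summary(conn, template):
    """Run the stored SQL and fill the text template with the first row."""
    region, total = conn.execute(template["sql"]).fetchone()
    return template["text"].substitute(region=region, total=total)


def make_table(rows):
    """Build an in-memory table that follows the assumed `sales` schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# The same template is reused across two tables that share a schema.
table_a = make_table([("north", 10), ("south", 25), ("north", 5)])
table_b = make_table([("east", 7), ("west", 3)])
summary_a = render_summary(table_a, OFFLINE_TEMPLATE)
summary_b = render_summary(table_b, OFFLINE_TEMPLATE)
```

Only the schema is needed to write the template, and the table contents are touched only by the locally executed SQL, which is the property the abstract ties to privacy compliance.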
[111] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation cs.CL | cs.CR | cs.NIPDF
Daniel Adu Worae, Spyridon Mastorakis
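TL;DR: The paper proposes an LLM-powered AI agent framework for holistic interpretation of Internet of Things (IoT) traffic, integrating several techniques to support efficient semantic analysis and interactive question answering.
Details
Motivation: IoT network traffic is complex and diverse, and traditional methods struggle to interpret behaviors and threats across layers, so an integrated framework is needed.
Result: Hybrid retrieval substantially improves BLEU, ROUGE, METEOR, and BERTScore over dense-only retrieval, and system profiling shows low CPU, GPU, and memory overhead.
Insight: LLMs show promise for IoT traffic interpretation; combining hybrid retrieval with multiple integrated techniques markedly improves analysis quality.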
Abstract: Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.
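One common way to combine a lexical ranking with a semantic ranking is reciprocal rank fusion (RRF). The abstract does not say which fusion or reranking method the framework uses, so the sketch below is an assumption made purely for illustration; the document identifiers and the constant `k=60` are likewise hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    ties are broken alphabetically so the result is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical (keyword) and a semantic (dense) retriever.
lexical = ["flow_12", "pkt_3", "flow_7"]
semantic = ["flow_7", "flow_12", "alert_1"]
fused = reciprocal_rank_fusion([lexical, semantic])
```

Documents that both retrievers rank highly rise to the top, and a separate reranking model could then reorder the fused list.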
[112] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs cs.CLPDF
Congying Liu, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao
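TL;DR: BioMedSearch is an LLM-based multi-source biomedical retrieval framework that integrates literature retrieval, protein databases, and web search to address the lack of scientific rigor in LLM-generated biomedical content.
Details
Motivation: Biomedical queries require deep specialized knowledge and integration of information from multiple sources, but existing LLMs cannot access authoritative databases and often generate content that deviates from authentic information.
Result: BioMedSearch improves accuracy at every reasoning level; at the most challenging level (Level 3), average accuracy rises from 36.3% to 73.4%.
Insight: The biomedical domain needs multi-source data combined with structured retrieval, and the general capabilities of LLMs must be paired with domain-specific tools to ensure scientific validity.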
Abstract: Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries. Through sub-queries decomposition, keywords extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL-ucas/BioMed_Search
[113] LLMs Can Get “Brain Rot”! cs.CL | cs.AIPDF
Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang
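TL;DR: The paper proposes the “LLM Brain Rot Hypothesis” and shows through controlled experiments that continual pre-training on low-quality web text (junk Twitter/X data) degrades multiple capabilities of large language models.
Details
Motivation: To test whether low-quality (junk) data has lasting negative effects on LLM capabilities and to establish a causal link between data quality and model capability.
Result: The experiments confirm a “Brain Rot” effect: part of the decline cannot be fully reversed, and capability decay follows a dose-response relationship with the proportion of junk data.
Insight: Thought-skipping is the primary source of errors; a non-semantic metric (tweet popularity) is a better indicator of the Brain Rot effect than tweet length; and data quality is a causal driver of capability decay.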
Abstract: We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Contrary to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges’ $g>0.3$) in reasoning, long-context understanding, and safety, and inflates “dark traits” (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain Of Thoughts drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from 0% to 100%. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity of a tweet, a non-semantic metric, is a better indicator of the Brain Rot effect than its length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a training-time safety problem and motivating routine “cognitive health checks” for deployed LLMs.
[114] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions cs.CLPDF
Siying Liu, Shisheng Zhang, Indu Bala
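TL;DR: The paper examines whether LLMs let socio-demographic information influence drug-safety predictions, finding systematic disparities and two modes of bias, and underscores the need for fairness evaluation before clinical deployment.
Details
Motivation: LLMs are increasingly used in biomedicine, but their reliability in drug-safety prediction is underexplored, in particular whether they incorporate clinically irrelevant socio-demographic information.
Result: Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted adverse-event likelihoods, and both explicit and implicit bias were identified.
Insight: Fairness must be addressed before LLMs are deployed clinically; fairness-aware evaluation protocols and mitigation strategies are needed.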
Abstract: Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.
[115] Big Reasoning with Small Models: Instruction Retrieval at Inference Time cs.CL | cs.AIPDF
Kenan Alkiek, David Jurgens, Vinod Vydiswaran
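TL;DR: The paper proposes instruction retrieval at inference time to strengthen the multi-step reasoning of small language models (SLMs) without any additional fine-tuning.
Details
Motivation: SLMs run efficiently on local hardware and offer privacy, cost, and environmental advantages, but they struggle with tasks that require multi-step reasoning or domain-specific knowledge.
Result: With SLMs from 3B to 14B parameters, instruction retrieval yields gains of 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA.
Insight: Concise instructions outperform longer ones, and the size of the gain depends strongly on model family and intrinsic reasoning ability.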
Abstract: Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
[116] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis cs.CLPDF
Fengbin Zhu, Xiang Yao Ng, Ziyang Liu, Chang Liu, Xianwei Zeng
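TL;DR: The paper introduces the HisRubric evaluation framework and the FinDeepResearch benchmark for systematically evaluating Deep Research (DR) agents on financial analysis, revealing the strengths and weaknesses of different methods.
Details
Motivation: Existing work lacks a systematic evaluation of DR agents on critical financial analysis tasks, so a rigorous evaluation framework and benchmark are needed.
Result: Experiments with 16 representative methods reveal their strengths and limitations across diverse capabilities, financial markets, and languages.
Insight: The results expose both the potential and the shortcomings of DR agents and LLMs in financial analysis, pointing to directions for future research.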
Abstract: Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR Agent’s capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents’ capabilities in corporate financial analysis. This framework mirrors the professional analyst’s workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on the FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.
[117] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention cs.CL | cs.AIPDF
Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou
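TL;DR: The paper proposes Minimal Test-Time Intervention (MTI), which intervenes only at high-entropy tokens during inference and improves the reasoning accuracy and stability of LLMs while remaining efficient.
Details
Motivation: Existing methods typically improve LLM reasoning by spending more inference compute, at the cost of efficiency. The authors observe that reasoning uncertainty is concentrated in a small set of high-entropy tokens, which motivates a minimal intervention.
Result: Consistent gains on general, coding, and STEM tasks, e.g., a +1.35% average improvement across eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 with Qwen3-32B-Reasoning.
Insight: Because reasoning uncertainty is localized, intervention can focus on a few critical tokens rather than the whole sequence, which improves performance at low cost.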
Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized: only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning) while remaining highly efficient.
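The selective intervention in (i) can be sketched as an entropy gate in front of the standard classifier-free guidance formula, `guided = negative + scale * (conditional - negative)`. The threshold and scale values, the function names, and the toy logits below are assumptions for illustration; the KV-cache reuse from (ii) is not modeled, and the negative-prompt logits are simply passed in.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_cfg(cond_logits, neg_logits, threshold=1.0, scale=1.5):
    """Apply classifier-free guidance only at high-entropy positions.

    Returns (logits, applied). `threshold` and `scale` are hypothetical
    values, not settings reported by the paper.
    """
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits), False  # confident token: leave untouched
    guided = [n + scale * (c - n) for c, n in zip(cond_logits, neg_logits)]
    return guided, True


# A confident position (one dominant logit) and an uncertain one (uniform).
confident, used_confident = selective_cfg([10.0, 0.0, 0.0], [0.0, 0.0, 0.0])
uncertain, used_uncertain = selective_cfg([1.0, 1.0, 1.0], [0.0, 1.0, 2.0])
```

Only the uniform position exceeds the entropy threshold, so guidance runs for one of the two tokens, which is where the claimed efficiency comes from.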
[118] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models cs.CLPDF
Shehenaz Hossain, Haithem Afli
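TL;DR: CRaFT is an explanation-based multilingual evaluation framework that measures how LLMs reason across cultural contexts, revealing marked differences in linguistic and cultural adaptability.
Details
Motivation: Current LLM evaluation focuses on answer accuracy and overlooks depth of cultural understanding. CRaFT aims to fill this gap by evaluating cross-cultural reasoning.
Result: Language strongly affects model behavior: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. GPT adapts more effectively across languages but is less consistent; FANAR is stable but rigid.
Insight: Cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers practical tools and insights for building culturally adaptive models.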
Abstract: Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.
[119] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games cs.CL | cs.AI | cs.LGPDF
César Guerra-Solano, Zhuochun Li, Xiang Lorraine Li
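TL;DR: The paper proposes GlobalGroup, a multilingual word-grouping game for evaluating linguistic bias in LLMs on an abstract reasoning task. Tests across five linguistic backgrounds show that English modalities largely perform best and that open- and closed-source models differ in performance.
Details
Motivation: Prior work mostly evaluates linguistic bias through commonsense or math tasks, yet abstract reasoning matters just as much in everyday life, and linguistic bias on such tasks has received little systematic evaluation.
Result: English modalities largely lead to better performance on the abstract reasoning task, and there are performance disparities between open- and closed-source models.
Insight: Linguistic bias also appears in abstract reasoning; future work should examine cross-lingual generalization and how to close the gap for open-source models.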
Abstract: Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply “out-of-the-box thinking” to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds – English, Spanish, Chinese, Hindi, and Arabic – in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.
[120] DROID: Dual Representation for Out-of-Scope Intent Detection cs.CL | I.2.7, I.5.1PDF
Wael Rashwan, Hossam M. Zawbaa, Sourav Dutta, Haytham Assem
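TL;DR: DROID is a dual-encoder framework that combines a general-purpose semantic encoder with a domain-specific encoder for out-of-scope (OOS) intent detection in task-oriented dialogue systems, requiring no post-hoc scoring and improving performance substantially.
Details
Motivation: Detecting OOS intents is a key challenge in task-oriented dialogue systems; existing methods rely on strong distributional assumptions or auxiliary modules, so a compact, efficient approach is needed.
Result: Across multiple intent benchmarks, DROID improves macro-F1 over state-of-the-art baselines by 6–15% for known intents and 8–20% for OOS intents.
Insight: Dual-encoder representations with simple calibration can give robust, scalable OOS detection, with the largest gains in low-resource settings.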
Abstract: Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders – the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6–15% for known and 8–20% for OOS intents, with the most significant gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.
[121] Toward Cybersecurity-Expert Small Language Models cs.CL | cs.AI | cs.CRPDF
Matan Levi, Daniel Ohayon, Ariel Blobstein, Ravid Sagi, Ian Molloy
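TL;DR: The paper presents CyberPal 2.0, a family of cybersecurity-focused small language models (SLMs) that addresses the lack of high-quality domain-specific models in cybersecurity.
Details
Motivation: LLM deployment in cybersecurity is held back by the lack of high-quality, domain-specific models and training datasets. CyberPal 2.0 was developed to address this gap.
Result: CyberPal 2.0 consistently outperforms its baselines across diverse cybersecurity benchmarks and matches or surpasses various open and closed-source frontier models. On core threat-investigation tasks, the 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1.
Insight: Small language models can match or surpass much larger general-purpose models in a specialized domain such as cybersecurity while remaining a fraction of their size.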
Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B-20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.
[122] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis cs.CLPDF
Darko Sasanski, Dimitar Peshevski, Riste Stojanov, Dimitar Trajanov
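TL;DR: The paper presents the first systematic effort to build a Macedonian recipe dataset, using web scraping and structured parsing to handle heterogeneous ingredient descriptions, and analyzes ingredient frequency and co-occurrence patterns that characterize Macedonian cuisine.
Details
Motivation: Computational gastronomy relies on diverse, high-quality recipe datasets, but Macedonian recipes remain under-represented in digital research.
Result: The work yields a Macedonian recipe dataset, and the analysis highlights distinctive ingredient combinations.
Insight: Macedonian culinary tradition shows distinctive ingredient combinations, and the dataset is a new resource for studying food culture in underrepresented languages.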
Abstract: Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.
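The two co-occurrence measures named in the abstract have standard definitions: lift is P(a, b) / (P(a) P(b)), and pointwise mutual information is its base-2 logarithm. The sketch below computes both per ingredient pair; the function name and the toy recipes are hypothetical and are not drawn from the dataset.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_stats(recipes):
    """Compute lift and PMI (base 2) for every co-occurring ingredient pair."""
    n = len(recipes)
    item_counts = Counter()
    pair_counts = Counter()
    for ingredients in recipes:
        unique = sorted(set(ingredients))  # count each ingredient once per recipe
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    stats = {}
    for (a, b), joint in pair_counts.items():
        lift = (joint / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        stats[(a, b)] = {"lift": lift, "pmi": math.log2(lift)}
    return stats


# Toy data for illustration only.
recipes = [
    ["pepper", "feta", "tomato"],
    ["pepper", "feta"],
    ["tomato", "onion"],
    ["onion", "bean"],
]
stats = cooccurrence_stats(recipes)
```

A lift above 1 (PMI above 0) marks a pair that appears together more often than independence would predict, which is how distinctive combinations are surfaced.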
[123] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following cs.CLPDF
Zhichao Wang, Andy Wong, Ruslan Belkin
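TL;DR: RLSR (Reinforcement Learning with Supervised Reward) replaces SFT (supervised fine-tuning), computing reward scores as cosine similarity in a semantic embedding space; it improves instruction following and outperforms SFT on AlpacaEval.
Details
Motivation: Conventional SFT relies on a large corpus of human-labeled responses; RLSR reuses the existing SFT dataset inside a reinforcement learning framework to improve instruction following.
Result: On Qwen-7B, RLSR reaches an AlpacaEval win rate of 26.34% versus 21.01% for SFT; SFT + RLSR raises the win rate to 30.73%.
Insight: A reinforcement learning framework can use a supervised reward based on semantic similarity to improve the model, and combining it with SFT yields further gains.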
Abstract: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs an RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model’s instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks. For example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT’s 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
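The reward described in this abstract, cosine similarity between the embedding of a generated response and that of the human-labeled response, can be sketched as below. The embedding model is not specified here, so `toy_embed` is a bag-of-words stand-in and explicitly not a semantic encoder; all function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts (NOT a real semantic encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u if token in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rlsr_rewards(candidates, reference, embed=toy_embed):
    """Score each sampled response by its similarity to the labeled response."""
    ref_vec = embed(reference)
    return [cosine(embed(candidate), ref_vec) for candidate in candidates]


reference = "paris is the capital of france"
candidates = ["the capital of france is paris", "i like turtles"]
rewards = rlsr_rewards(candidates, reference)
```

In an RL loop these per-response scores would serve as the reward signal; swapping `toy_embed` for a sentence encoder is what makes the similarity semantic instead of lexical.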
[124] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning cs.CL | cs.AIPDF
Beomseok Kang, Jiwon Song, Jae-Joon Kim
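TL;DR: LiteStage is a latency-aware layer-skipping framework that makes multi-stage reasoning more efficient, balancing speed and accuracy through an offline search and an online early exit.
Details
Motivation: Multi-stage reasoning strengthens small language models but increases latency. Existing adaptive acceleration techniques such as layer skipping struggle to balance efficiency and accuracy in the multi-stage setting.
Result: On benchmarks including OBQA, LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer-skipping methods.
Insight: Stage-wise variation in skip sensitivity and redundant output tokens are the main challenges; LiteStage addresses them with a stage-wise layer budget search and confidence-based early exit.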
Abstract: Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
[125] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs cs.CL | cs.LGPDF
Parsa Hejabi, Elnaz Rahmati, Alireza S. Ziabari, Morteza Dehghani
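TL;DR: The paper proposes Flip-Flop Consistency (F²C), an unsupervised training method that uses Consensus Cross-Entropy (CCE) and a representation alignment loss to improve LLM consistency and performance under prompt perturbations.
Details
Motivation: LLMs often give inconsistent answers to different phrasings of the same prompt. F²C aims to improve robustness and consistency without supervision.
Result: On average, F²C raises observed agreement by 11.62%, improves mean F₁ by 8.94%, and reduces performance variance across formats by 3.29%. It also generalizes effectively in out-of-domain evaluations.
Insight: Unsupervised training can improve LLM consistency and robustness across diverse prompt perturbations.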
Abstract: Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
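The Consensus Cross-Entropy component can be sketched as a majority vote over the predictions of all prompt variations, followed by a cross-entropy loss against the winning label. This is an illustration under assumptions, not the authors' code: the function name is hypothetical, ties go to the label seen first, and the representation alignment loss is omitted.

```python
import math
from collections import Counter


def consensus_cross_entropy(variant_probs):
    """Majority-vote pseudo-label plus mean cross-entropy against it.

    variant_probs: one class-probability vector per prompt variation.
    Ties are resolved in favor of the label that was seen first.
    """
    votes = [max(range(len(p)), key=p.__getitem__) for p in variant_probs]
    pseudo_label = Counter(votes).most_common(1)[0][0]
    loss = -sum(math.log(p[pseudo_label]) for p in variant_probs) / len(variant_probs)
    return pseudo_label, loss


# Three phrasings of the same prompt; the third disagrees with the majority.
variant_probs = [
    [0.7, 0.3],
    [0.6, 0.4],
    [0.2, 0.8],
]
label, loss = consensus_cross_entropy(variant_probs)
```

The dissenting variation contributes the largest term to the loss, so minimizing it pulls that phrasing toward the consensus without any gold label.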
[126] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems cs.CLPDF
Jihao Zhao, Zhiyuan Ji, Simin Niu, Hanyu Wang, Feiyu Xiong
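TL;DR: The paper proposes the MoM framework, which improves conventional retrieval-augmented generation (RAG) through proactive understanding and structured document memory extraction, and uses multi-path sampling and reverse reasoning to train small language models (SLMs) for text processing.
Details
Motivation: Conventional RAG systems process text chunks passively, which limits knowledge internalization and reasoning and fails to mirror the active cognitive processes of human reading.
Result: Experiments across three distinct domains indicate that MoM resolves text chunking challenges in existing RAG systems and opens a path for SLMs toward human-centric intelligent text processing.
Insight: By simulating the active cognition of human reading and combining reverse reasoning with multi-path sampling, MoM improves the text understanding and reasoning of SLMs.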
Abstract: The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
[127] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning cs.CLPDF
Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda
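TL;DR: MathMist is a parallel multilingual benchmark dataset for mathematical problem solving and reasoning. It addresses the focus of existing benchmarks on English or high-resource languages and evaluates multilingual and cross-lingual mathematical reasoning.
Details
Motivation: Existing math reasoning benchmarks focus on English or a few high-resource languages and do not systematically assess multilingual and cross-lingual reasoning. MathMist provides a benchmark covering high-, medium-, and low-resource languages.
Result: LLMs show persistent deficiencies in multilingual mathematical reasoning, with pronounced degradation in low-resource settings.
Insight: The findings expose the limits of LLMs in multilingual mathematical reasoning and call for future work on language diversity and low-resource language support.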
Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs’ ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: https://github.com/mahbubhimel/MathMist
[128] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking cs.CL | cs.AIPDF
Sathyanarayanan Ramamoorthy, Vishwa Shah, Simran Khanuja, Zaid Sheikh, Shan Jie
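TL;DR: MERLIN is a new testbed for multilingual multimodal entity linking, with a dataset of BBC news article titles and their corresponding images in five languages. Benchmarks of multilingual and multimodal methods show that adding visual data improves entity linking accuracy.
Details
Motivation: Multilingual multimodal entity linking lacks standardized test environments and datasets, especially for less widely studied languages. MERLIN aims to fill this gap and provide a platform for evaluating new methods.
Result: Incorporating visual data improves entity linking accuracy, especially when the textual context is ambiguous or insufficient and for models with weaker multilingual abilities.
Insight: Visual information plays a key role in multilingual multimodal entity linking and can compensate for the shortcomings of text-only methods, particularly for lower-resource languages.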
Abstract: This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also report several benchmarks of multilingual and multimodal entity linking methods built on language models such as LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin
[129] Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL cs.CL | cs.AI | cs.LGPDF
Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal
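TL;DR: The paper studies deceptive behavior by large language models (LLMs) in dialogue, proposes a belief misalignment metric to quantify deception, and develops a multi-turn reinforcement learning method to reduce it.
Details
Motivation: LLMs deployed at scale can produce deceptive outputs, particularly over multi-turn dialogue. Existing approaches lack effective ways to evaluate and mitigate deception, so new methods and metrics are needed.
Result: LLMs exhibit deceptive behavior in approximately 26% of dialogue turns, and RLHF-trained models still deceive at an average rate of 43%. The multi-turn reinforcement learning method reduces deceptive behavior by 77.6% compared to other instruction-tuned models.
Insight: 1) Deception needs to be evaluated over multi-turn dialogue; 2) RLHF training does not fully eliminate deception; 3) multi-turn reinforcement learning is an effective way to reduce deception.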
Abstract: Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.
[130] Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts cs.CL | cs.AIPDF
Perapard Ngokpol, Kun Kerdthaisong, Pasin Buakhaw, Pitikorn Khlaisamniang, Supasate Vorathammathorn
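TL;DR: The paper proposes Beyond One World, a benchmark that evaluates how consistently large language models (LLMs) role-play different versions of a character, focusing on multiverse versions of superheroes.
Details
Motivation: It remains unclear whether LLMs can accurately portray a specific version of a character (such as a superhero in different universes) in role-play, so a dedicated benchmark is needed to measure this.
Result: Experiments find that (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) generalization across versions of a character remains a challenge; and (3) models often excel at either thinking or acting, but rarely both.
Insight: LLMs still show significant gaps in multiversal consistency and reasoning alignment in role-play, which points to directions for future model improvement.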
Abstract: Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters – for example, superheroes across comic and cinematic universes – remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation (“thinking”) from outward decisions (“acting”). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.
[131] CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering cs.CL | cs.AI | physics.med-phPDF
Ziad Elshaer, Essam A. Rashed
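TL;DR: The paper proposes CURE, a medical question answering framework that requires no fine-tuning and improves performance through confidence-driven multi-model collaboration, making it well suited to resource-limited settings.
Details
Motivation: High-performing medical LLMs typically require substantial compute for fine-tuning, which limits access for resource-constrained institutions.
Result: The framework performs strongly on PubMedQA and MedMCQA, reaching 95.0% and 78.0%, respectively.
Insight: Strategic model collaboration offers an efficient way to broaden access to advanced medical AI in resource-constrained environments.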
Abstract: High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model’s certainty, and an adaptive routing mechanism directs low-confidence queries to helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks: MedQA, MedMCQA, and PubMedQA. Results demonstrate that our framework achieves competitive performance, with particularly strong results on PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
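The two-stage routing described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `Model` callable interface, the 0.8 threshold, and the majority vote used to combine answers. The abstract does not specify how confidence is measured or how the helper models collaborate.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a question to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def route(question: str, primary: Model, helpers: List[Model],
          threshold: float = 0.8) -> str:
    """Answer with the primary model; consult helpers only when it is unsure."""
    answer, confidence = primary(question)
    if confidence >= threshold:
        # Stage 1 passed: the primary model is confident enough on its own.
        return answer
    # Stage 2: low confidence, so gather helper answers.
    # Majority voting is an assumed stand-in for "collaborative reasoning".
    votes = Counter([answer])
    for helper in helpers:
        helper_answer, _ = helper(question)
        votes[helper_answer] += 1
    return votes.most_common(1)[0][0]

# Toy stand-ins for real LLM calls.
unsure_primary = lambda q: ("A", 0.4)
sure_primary = lambda q: ("A", 0.95)
helper_1 = lambda q: ("B", 0.9)
helper_2 = lambda q: ("B", 0.7)

routed = route("Which option?", unsure_primary, [helper_1, helper_2])
direct = route("Which option?", sure_primary, [helper_1, helper_2])
```

In this toy run the unsure primary is outvoted by the two helpers, while the confident primary is never second-guessed, so helper calls are spent only on low-confidence queries.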
[132] On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How? cs.CLPDF
Anyun Zhuo, Xuefei Ning, Ningyuan Li, Yu Wang, Pinyan Lu
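TL;DR: The paper studies how well LLMs handle character-level perturbations, testing robustness by inserting invisible Unicode control characters. Many LLMs maintain performance even under strong obfuscation, and the paper examines the underlying mechanisms and the misuse risks.
Details
Motivation: To examine the robustness of LLMs to structured character-level perturbations, in particular inserted noise characters, and to assess the potential for misuse in scenarios such as online exam systems.
Result: Even when the input is heavily obfuscated and the signal-to-noise ratio drops sharply, many LLMs still maintain notable performance, indicating strong robustness to character-level perturbations.
Insight: LLMs may handle character-level perturbations through implicit and explicit denoising mechanisms. This robustness supports reliability in deployment but also implies a risk of misuse.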
Abstract: This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically the insertion of noisy characters after each input character. We introduce a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and significantly reduces the signal-to-noise ratio, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and the implicit versus explicit denoising hypotheses for character-level noise. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
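The perturbation described above, one noisy character inserted after each input character, can be sketched in a few lines of Python. The character set below (zero-width format characters) is an assumption chosen for illustration, since the abstract does not list the characters the authors use. `strip_invisible` shows what an explicit denoising step could look like.

```python
import random

# Assumed stand-ins for the "invisible Unicode control characters":
# zero-width space, zero-width non-joiner, zero-width joiner, word joiner.
INVISIBLE = ["\u200b", "\u200c", "\u200d", "\u2060"]

def perturb(text: str, seed: int = 0) -> str:
    """Insert one invisible character after every input character."""
    rng = random.Random(seed)
    return "".join(ch + rng.choice(INVISIBLE) for ch in text)

def strip_invisible(text: str) -> str:
    """Explicit denoising: remove the inserted characters again."""
    return "".join(ch for ch in text if ch not in INVISIBLE)

original = "2 + 2 = ?"
noisy = perturb(original)
restored = strip_invisible(noisy)
```

The perturbed string renders the same as the original in most viewers but is twice as long, which is what fragments tokenization for a model reading it.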
[133] PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora cs.CL | cs.IR | cs.LGPDF
Mykolas Sveistrys, Richard Kunert
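TL;DR: The paper defines pluri-hop questions and, to address the challenges they pose for retrieval-augmented generation (RAG), proposes PluriHopRAG, which improves performance through document-level subquestion decomposition and early filtering.
Details
Motivation: Real-world questions often require aggregating information across large numbers of repetitive, distractor-rich documents. Existing QA systems perform poorly on such tasks, so new solutions are needed.
Result: PluriHopRAG achieves relative F1 improvements of 18-52% over baselines, depending on the base LLM, demonstrating its effectiveness on repetitive, distractor-rich corpora.
Insight: Existing QA systems are limited on repetitive corpora, and early filtering combined with exhaustive retrieval is a key strategy for improving performance.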
Abstract: Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a “check all documents individually, filter cheaply” approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG’s performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
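The "check all documents individually, filter cheaply" idea can be sketched as a three-step loop. This is an illustrative outline under stated assumptions, not the authors' implementation: `relevance` stands in for the cross-encoder filter, `answer_per_doc` stands in for the LLM answering a document-level subquestion, and summation is an assumed aggregation step. The function names, the 0.5 threshold, and the toy data are all hypothetical.

```python
from typing import Callable, List, Optional

def plurihop_sketch(
    question: str,
    documents: List[str],
    relevance: Callable[[str, str], float],
    answer_per_doc: Callable[[str, str], Optional[float]],
    threshold: float = 0.5,
) -> float:
    # Step 1: score every document with a cheap filter (no top-k cutoff).
    kept = [doc for doc in documents if relevance(question, doc) >= threshold]
    # Step 2: ask the document-level subquestion of each surviving document.
    partials = [answer_per_doc(question, doc) for doc in kept]
    # Step 3: aggregate the partial answers (here, a simple sum).
    return sum(p for p in partials if p is not None)

# Toy stand-ins: a keyword filter and a "first number in the text" extractor.
def toy_relevance(question: str, doc: str) -> float:
    return 1.0 if "turbine" in doc.lower() else 0.0

def toy_answer(question: str, doc: str) -> Optional[float]:
    numbers = [float(tok) for tok in doc.split() if tok.isdigit()]
    return numbers[0] if numbers else None

reports = [
    "Turbine A downtime 3 hours",
    "Office party notes 12 guests",
    "Turbine B downtime 5 hours",
]
total = plurihop_sketch("Total turbine downtime?", reports, toy_relevance, toy_answer)
```

Because every document passes through the filter, a relevant report cannot be lost to a top-k cutoff, and the expensive per-document step runs only on documents that survive.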
[134] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering cs.CL | cs.AI | cs.IRPDF
Yingpeng Ning, Yuanyuan Sun, Ling Luo, Yanhua Wang, Yuchen Pan
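TL;DR: MedTrust-RAG is a retrieval-augmented generation framework that improves the factual consistency and reliability of biomedical question answering through citation-aware reasoning, iterative retrieval-verification, and a MedTrust-Align Module.
Details
Motivation: Existing RAG frameworks for biomedical QA suffer from hallucinations, mainly caused by post-retrieval noise and insufficient evidence verification, which undermines answer reliability.
Result: On MedMCQA, MedQA, and MMLU-Med, the method outperforms baseline models, with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.
Insight: Rigorously verifying retrieved evidence and optimizing the generation process can reduce hallucinations and improve the reliability of biomedical QA.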
Abstract: Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.
[135] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following cs.CL | cs.AIPDF
Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang
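TL;DR: The paper proposes a self-supervised reinforcement learning framework that needs no external supervision. It derives reward signals directly from instructions and generates pseudo-labels, addressing existing methods' dependence on external supervision and sparse rewards, and performs well across multiple datasets.
Details
Motivation: Existing reinforcement learning methods for multi-constraint tasks depend on external supervision and suffer from sparse reward signals, which limits their practical use. This work aims to obtain reward signals directly from instructions in a self-supervised way and improve adherence to multi-constraint instructions.
Result: Experiments on 3 in-domain and 5 out-of-domain datasets show that the method improves multi-constraint instruction following, including on challenging agentic and multi-turn tasks.
Insight: Deriving reward signals directly from instructions and generating pseudo-labels is an effective self-supervised approach that reduces reliance on external supervision and mitigates sparse rewards, offering a new direction for reinforcement learning on multi-constraint tasks.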
Abstract: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
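To see why constraint-wise binary classification eases the sparse-reward problem, compare an all-or-nothing reward with a per-constraint one. The sketch below is a simplified illustration: hand-written Python checks stand in for the trained reward model, and averaging the binary decisions is an assumed aggregation rule that the abstract does not confirm.

```python
from typing import Callable, List

Check = Callable[[str], bool]

def sparse_reward(response: str, checks: List[Check]) -> float:
    """All-or-nothing: 1.0 only if every constraint is satisfied."""
    return 1.0 if all(check(response) for check in checks) else 0.0

def constraint_wise_reward(response: str, checks: List[Check]) -> float:
    """One binary decision per constraint, averaged into a dense reward."""
    if not checks:
        return 0.0
    return sum(1.0 for check in checks if check(response)) / len(checks)

# Hypothetical constraints decomposed from one instruction.
checks = [
    lambda r: r.endswith("."),         # "end with a period"
    lambda r: len(r.split()) <= 5,     # "use at most five words"
    lambda r: "python" in r.lower(),   # "mention Python"
]

response = "Hello world."
dense = constraint_wise_reward(response, checks)
sparse = sparse_reward(response, checks)
```

The response meets two of the three constraints, so the sparse reward gives no signal while the constraint-wise reward still credits the partial success.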
[136] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents cs.CLPDF
Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang
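TL;DR: The paper proposes the “Explore to Evolve” paradigm, which uses proactive online exploration to generate verifiable QA data at scale and improve the information aggregation ability of deep research agents. The resulting WebAggregator models match or surpass GPT-4.1.
Details
Motivation: Existing open-source deep research agents focus mainly on information seeking and overlook information aggregation, which limits their ability to support in-depth research.
Result: WebAggregator-8B matches GPT-4.1, and the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text. On the newly constructed test set, Claude-3.7-sonnet and GPT-4.1 score only 28% and 25.8%, respectively.
Insight: Agents still struggle with aggregation even after retrieving all references, which highlights the need to strengthen information aggregation capabilities.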
Abstract: Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which limits their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allows us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents’ information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.
[137] Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents cs.CLPDF
Reid T. Johnson, Michelle D. Pain, Jordan D. West
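TL;DR: The paper proposes Natural Language Tools (NLT), a framework that replaces JSON tool calling in large language models with natural language outputs, addressing task interference and format constraints.
Details
Motivation: JSON tool calling in current large language models suffers from task interference and format constraints, which degrade tool-calling performance and accuracy.
Result: Across customer service and mental health domains, NLT improves tool-calling accuracy by 18.4 percentage points and reduces output variance by 70%. Open-weight models see the largest gains.
Insight: NLT has implications for both reinforcement learning and supervised fine-tuning stages, extends tool calling to models without native support, and is robust to prompt perturbations.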
Abstract: We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.
[138] LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models cs.CL | cs.AIPDF
Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen
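TL;DR: LiRA (Linguistic Robust Anchoring) is a training framework that improves cross-lingual representations for low-resource languages while strengthening retrieval and reasoning, using the Anchored Representation Composition Architecture (Arca) and the Language-coupled Semantic Reasoner (LaSR).
Details
Motivation: Performance on high-resource languages (e.g., English, Chinese) is nearing saturation, while performance on low-resource languages (e.g., Urdu, Thai) remains substantially lower because of limited training data, machine-translation noise, and unstable cross-lingual alignment. LiRA aims to address this.
Result: LiRA shows consistent gains on cross-lingual retrieval, semantic similarity, and reasoning tasks, and remains robust under few-shot and noise-amplified settings.
Insight: A geometrically stable shared embedding space and a unified training objective allow LiRA to improve cross-lingual understanding for low-resource languages.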
Abstract: As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca’s multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.
[139] Efficient Seq2seq Coreference Resolution Using Entity Representations cs.CLPDF
Matt Grenander, Shay B. Cohen, Mark Steedman
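TL;DR: The paper proposes an efficient seq2seq coreference resolution method that uses compressed entity representations to make incremental processing more efficient. It performs close to a full-prefix baseline and surpasses the state of the art on LitBank.
Details
Motivation: Seq2seq coreference models achieve strong performance but are inefficient and inflexible in incremental settings such as dialogue.
Result: On OntoNotes, the best model is only 0.6 CoNLL F1 points below the full-prefix baseline at a compression ratio of 1.8. On LitBank, it surpasses the state of the art.
Insight: Discarding a large portion of tokens is a feasible strategy for incremental seq2seq coreference resolution.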
Abstract: Seq2seq coreference models have introduced a new paradigm for coreference resolution by learning to generate text corresponding to coreference labels, without requiring task-specific parameters. While these models achieve new state-of-the-art performance, they do so at the cost of flexibility and efficiency. In particular, they do not efficiently handle incremental settings such as dialogue, where text must be processed sequentially. We propose a compressed representation in order to improve the efficiency of these methods in incremental settings. Our method works by extracting and re-organizing entity-level tokens, and discarding the majority of other input tokens. On OntoNotes, our best model scores just 0.6 CoNLL F1 points below a full-prefix, incremental baseline while achieving a compression ratio of 1.8. On LitBank, where singleton mentions are annotated, it surpasses state-of-the-art performance. Our results indicate that discarding a large portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution.
[140] Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs cs.CLPDF
Kyubyung Chae, Gihoon Kim, Gyuseong Lee, Taesup Kim, Jaejin Lee
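TL;DR: The paper addresses gaps in the socio-cultural alignment and technical safety of sovereign LLMs with a new dataset and an analytic framework. Experiments show that sovereign LLMs play a meaningful role in supporting low-resource languages but do not always serve their target users well, and that critical quality attributes such as safety may be underestimated.
Details
Motivation: As sovereign LLMs develop, there is a need to verify whether they align with users' socio-cultural backgrounds and whether they remain safe and technically robust. Frameworks and datasets for evaluating these questions are lacking.
Result: Sovereign LLMs are valuable for supporting low-resource languages, but they fall short on socio-cultural alignment and technical safety, and critical quality attributes such as safety risk being underestimated.
Insight: Advancing sovereign LLMs requires broader, more practical evaluation criteria covering both socio-cultural and technical performance.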
Abstract: Recent trends in LLM development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their own LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users’ socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.
[141] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures cs.CL | cs.AIPDF
Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin
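TL;DR: The paper analyzes the limitations of current preference learning methods, which perform poorly at evaluating subjective writing preferences. It introduces the WritingPreferenceBench dataset and shows that generative reward models with explicit reasoning chains perform best.
Details
Motivation: Existing preference learning methods do well on standard benchmarks, but their performance degrades significantly once objective quality signals are removed, and they fail to capture subjective preferences such as creativity, stylistic flair, and emotional resonance.
Result: Generative reward models perform best (81.8% accuracy), but individual models vary widely across writing genres (18.2% to 81.8%). Model scale brings no consistent improvement.
Insight: Modeling subjective preferences successfully may require explicit reasoning rather than direct classification. Current RLHF methods are better at detecting objective errors than at capturing subjective quality.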
Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models–the standard architecture for RLHF–achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
[142] Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models cs.CL | cs.AIPDF
Kedi Chen, Zhikai Lei, Xu Guo, Xuecheng Wu, Siyuan Zeng
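TL;DR: The paper proposes CodeSeq, a dataset and method that improve the inductive reasoning of LLMs through code-driven number sequence tasks, addressing the lack of complex patterns and difficulty control in existing data.
Details
Motivation: Existing inductive reasoning data and tasks are too simple and lack complexity, and conventional prompting or fine-tuning provides neither precise thinking processes nor difficulty control.
Result: Experiments show that models trained with CodeSeq improve on various reasoning tasks while preserving their OOD performance.
Insight: Code-driven tasks and iteratively generated data can improve the inductive reasoning of LLMs, and the reinforcement learning reward lets models learn from both successes and failures.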
Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, has attracted increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but provide neither precise thinking processes nor difficulty control. Unlike previous work, we address these challenges by introducing CodeSeq, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with CodeSeq improve on various reasoning tasks and can preserve the models’ OOD performance.
[143] RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF cs.CL | cs.AIPDF
Qing Yang, Zhenghao Liu, Junxin Wang, Yangfan Du, Pengcheng Huang
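TL;DR: RLAIF-SPA is a reinforcement learning framework for speech synthesis that uses AI feedback (RLAIF) to optimize the expressiveness and intelligibility of emotional speech, improving both emotional expressiveness and semantic accuracy.
Details
Motivation: Existing text-to-speech methods fall short on emotional expressiveness. They rely on costly emotion annotations or indirect objectives, producing speech that is emotionally flat and less natural.
Result: On the Libri Speech dataset, RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and an improvement of over 10% in human evaluation.
Insight: By directly optimizing emotional expressiveness and semantic accuracy, RLAIF-SPA offers an efficient, low-cost approach to emotional speech synthesis.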
Abstract: Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
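For readers unfamiliar with the WER figure quoted above, word error rate is conventionally the word-level edit distance between the ASR transcript of the synthesized speech and the reference text, divided by the number of reference words. The Python sketch below implements that standard definition. Whether the authors apply text normalization before scoring is not stated in the abstract, so none is assumed here beyond whitespace splitting.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

one_substitution = wer("the cat sat", "the dog sat")
two_deletions = wer("a b c d", "a b")
```

Under this definition, a 26.1% reduction in WER is a relative decrease in the error rate compared with the baseline system.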
[144] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs cs.CL | cs.AI | cs.IRPDF
Linyue Ma, Yilong Xu, Xiang Long, Zhi Zheng
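TL;DR: The paper proposes a rubric-based generative verifier for search-augmented LLMs, using a “nugget-as-rubric” paradigm to address the challenge of reward design for long-form workloads.
Details
Motivation: Reward design for search-augmented LLMs is limited: rule-based rewards such as Exact Match are fragile and cannot be applied to long-form workloads, while generative rewards are hard to make verifiable and incur high computational cost.
Result: Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient reward constructor for search-augmented LLMs.
Insight: Treating atomic information points as structured evaluation criteria addresses reward design for long-form workloads, and automatic rubric construction plus an efficient verifier are key to scalability.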
Abstract: Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, reinforcement learning with tailored reward signals has emerged as a viable technique for improving LLMs on tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, “nugget-as-rubric”, which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question’s information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce Search-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained with distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
[145] Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms cs.CLPDF
Xingmeng Zhao, Dan Schumacher, Veronica Rammouz, Anthony Rios
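TL;DR: The paper presents a human-centered framework that generates user stories and supports multi-agent discussions to help people identify potential harms and benefits of healthcare AI tools more fully before deployment. Storytelling helped participants recognize a broader range of harms and think more creatively.
Details
Motivation: Rapid, low-barrier development of healthcare AI can introduce bias, privacy violations, and unequal access, yet these risks often receive too little human understanding and discussion.
Result: Participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types, while those who did not read stories focused primarily on privacy and well-being (58.3%).
Insight: Storytelling can help people think more broadly and creatively about the potential impact of AI, especially in multi-stakeholder settings.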
Abstract: Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI’s impact on users.
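[146] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning cs.CL
Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang
TL;DR: AutoRubric-R1V is a framework that combines process-level supervision with generative rewards. By automatically constructing rubrics, it improves the accuracy and faithfulness of multimodal reasoning and avoids the spurious reasoning that arises when reinforcement learning rewards only final-answer correctness.
Details
Motivation: Multimodal large language models (MLLMs) exhibit spurious reasoning on complex multi-step tasks, because conventional reinforcement learning rewards only the correctness of the final answer and does not supervise the reasoning process.
Result: The method achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
Insight: Combining process-level supervision with generative rewards effectively reduces spurious reasoning and improves performance on multimodal tasks.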
Abstract: Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
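[147] COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes cs.CL | cs.AI
Yunwen Li, Shuangshuang Ying, Xingwei Qu, Xin Li, Sheng Jin
TL;DR: This paper introduces COIG-Writer, a high-quality Chinese creative writing dataset that contains diverse outputs together with the thought processes behind them. The study finds that creative writing requires a balance between narrative logic (supplied by process supervision) and linguistic expression.
Details
Motivation: Current large language models perform poorly on non-English creative writing, where high-quality training data and process-level supervision are lacking. COIG-Writer aims to fill this gap.
Result: Process supervision must be balanced with general-purpose data (a 1:12 ratio), creative ability is culture-specific, and lexical diversity correlates negatively with creative quality (the TTR paradox).
Insight: Creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to the role mathematical reasoning plays in foundation models.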
Abstract: Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) process supervision is highly effective but requires stabilization with general data: a ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance, and below this threshold the win rate progressively degrades (from 62.75% down to 35.78%); (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance); and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
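For readers unfamiliar with the metric behind the "TTR paradox", the type-token ratio is the number of distinct tokens divided by the total number of tokens. The sketch below assumes whitespace tokenization purely for illustration; Chinese text would first need word segmentation, and the paper's exact tokenization is not stated in the abstract.

```python
def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens, using whitespace tokenization."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    return len(set(tokens)) / len(tokens)


# "the" repeats, so 4 distinct tokens out of 5.
example_ttr = type_token_ratio("the cat saw the dog")
```

A higher value means more varied vocabulary, which the abstract reports as inversely correlated with creative quality.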
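[148] Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning cs.CL | cs.AI
Hwiyeol Jo, Joosung Lee, Jaehone Lee, Sang-Woo Lee, Joonsuk Park
TL;DR: The paper proposes Answer Regeneration, a method that obtains the final answer through an additional model inference, reducing the influence of the answer-extraction algorithm on the measured performance of reasoning models.
Details
Motivation: Existing evaluation methods for reasoning tasks with large language models (LLMs) are overly sensitive to the answer-extraction algorithm, which undermines the reliability and robustness of the results.
Result: The method shows improved performance and robustness on math problems and open-ended question answering tasks.
Insight: The choice of answer-extraction algorithm significantly affects evaluation results for reasoning models, and Answer Regeneration offers a more stable alternative.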
Abstract: Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on the probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt “Answer:”. The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
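The Answer Regeneration step can be sketched as a second call to the model. The prompt layout below (question, then the model's reasoning, then a trailing "Answer:" cue) is an assumed reading of the abstract, and `generate` is a hypothetical text-completion callable rather than any specific library API.

```python
from typing import Callable


def build_regeneration_prompt(question: str, reasoning_output: str) -> str:
    """Concatenate the prior input and output, then cue the model with 'Answer:'."""
    return f"{question}\n{reasoning_output}\nAnswer:"


def regenerate_answer(
    question: str,
    reasoning_output: str,
    generate: Callable[[str], str],
) -> str:
    """Run one extra inference and return the regenerated final answer."""
    prompt = build_regeneration_prompt(question, reasoning_output)
    return generate(prompt).strip()


def fake_generate(prompt: str) -> str:
    """Hypothetical stub standing in for a real model call."""
    return " 42\n"


prompt = build_regeneration_prompt("What is 6 * 7?", "6 * 7 = 42, so the result is 42.")
answer = regenerate_answer("What is 6 * 7?", "6 * 7 = 42, so the result is 42.", fake_generate)
```

Because the final answer comes from the regenerated output, the evaluation no longer depends on a hand-written rule for parsing the original reasoning text.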
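[149] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking cs.CL | cs.CV | cs.IR
Ziqi Dai, Xin Zhang, Mingxin Li, Yanzhao Zhang, Dingkun Long
TL;DR: This paper compares supervised fine-tuning (SFT) and contrastive learning (CL) for reranking with multimodal large language models (LLMs), finds that SFT weights its updates more effectively than CL, and reports new state-of-the-art results.
Details
Motivation: In information retrieval, reranker training centers on two objectives: metric learning (e.g., contrastive loss) and classification (predicting relevance labels). Contrastive learning has proven more effective for BERT-style encoders, but for LLMs, SFT aligns better with their generative nature. The study investigates which objective is better suited to LLM reranking.
Result: Experiments show that SFT provides a substantially stronger weighting scheme than CL, and SFT-trained rerankers achieve new state-of-the-art performance on the MRB benchmark.
Insight: SFT's update-weighting mechanism is better suited to LLM reranking, while neither objective shows a clear advantage in update direction. These findings can guide future research and applications.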
Abstract: In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts the “yes” (resp. “no”) token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: direction, which guides the model updates, and weight, which controls the magnitude of those updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
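To make the two objectives concrete, the sketch below writes out a yes/no SFT loss and a standard InfoNCE-style contrastive loss for a single query. Restricting the softmax to the "yes" and "no" logits and the temperature parameter are common conventions assumed here, not details confirmed by the abstract.

```python
import math
from typing import Sequence


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Relevance score: softmax over the 'yes' and 'no' token logits only."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))


def sft_loss(yes_logit: float, no_logit: float, relevant: bool) -> float:
    """Cross-entropy on the target token ('yes' if relevant, else 'no')."""
    p_yes = yes_probability(yes_logit, no_logit)
    return -math.log(p_yes if relevant else 1.0 - p_yes)


def contrastive_loss(
    pos_score: float,
    neg_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """InfoNCE: negative log-softmax of the positive among all candidates."""
    scaled = [s / temperature for s in [pos_score, *neg_scores]]
    peak = max(scaled)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_denominator - scaled[0]
```

The SFT loss scores each query-document pair independently, whereas the contrastive loss depends on the positive's score relative to the negatives in the same group.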
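[150] Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models cs.CL
Guinan Su, Yanwu Yang, Li Shen, Lu Yin, Shiwei Liu
TL;DR: This paper proposes an online test-time framework that needs no external data and improves Mixture-of-Experts (MoE) models on text generation by dynamically optimizing their routing decisions.
Details
Motivation: MoE models scale efficiently through sparse expert activation, but distribution shift in deployment can lead to suboptimal routing decisions. Existing test-time adaptation methods mainly target dense models and rely on external data, which makes them hard to apply directly to MoE architectures.
Result: Experiments show consistent gains on challenging reasoning tasks (e.g., a 5.5% improvement on HumanEval with OLMoE), and the method combines with other test-time scaling techniques such as self-consistency (6% average gains on DeepSeek-V2-Lite).
Insight: MoE routing decisions can be optimized dynamically from the input context alone, without external data, while retaining efficiency and robustness, which makes the approach suitable for real deployments.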
Abstract: Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose a data-free, online test-time framework that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaptation. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.
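[151] Predicting Task Performance with Context-aware Scaling Laws cs.CL | cs.AI | cs.LG
Kyle Montgomery, David Park, Jianhong Tu, Michael Bendersky, Beliz Gunel
TL;DR: The paper proposes a framework that predicts downstream task performance as a joint function of training compute and provided context, addressing the inability of conventional scaling laws to account for context-dependent tasks.
Details
Motivation: Conventional scaling laws focus on upstream metrics such as cross-entropy loss and overlook the key role context plays in downstream task performance, so a method that accounts for both training compute and context is needed.
Result: The framework accurately models in-distribution performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases.
Insight: Training compute and context utilization interact in shaping downstream performance, and this finding offers guidance for designing efficient long-context LLMs.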
Abstract: Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.
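[152] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations cs.CL
Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
TL;DR: The study evaluates machine learning models for mental health screening on a dataset of 553 real-world clinical conversations, showing that LLM-based models achieve high accuracy and recall for depression, anxiety, and PTSD, especially PTSD (up to 89% accuracy).
Details
Motivation: Mental health disorders such as depression, anxiety, and PTSD are often misdiagnosed or missed because of subjective assessments, limited clinical resources, and stigma, creating an urgent need for scalable, low-barrier AI tools that support early diagnosis.
Result: Models reach up to 89% accuracy and 98% recall on PTSD, and LoRA fine-tuning remains competitive in low-rank configurations (e.g., rank 8 and 16).
Insight: Shorter, focused conversation segments improve model sensitivity, and LLM-based models have practical potential in low-resource or high-stigma settings.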
Abstract: Mental health disorders remain among the leading causes of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, stigma, and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semistructured interviews, each paired with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and Meta LLaMA, as well as fine-tuned RoBERTa models using Low-Rank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strong performance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics. Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powered early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.
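[153] LaSeR: Reinforcement Learning with Last-Token Self-Rewarding cs.CL | cs.AI | cs.LG
Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu
TL;DR: LaSeR is a reinforcement learning method based on last-token self-rewarding. By deriving a simplified form of the self-verification RL objective, it efficiently optimizes both the reasoning and the self-verification abilities of language models.
Details
Motivation: Existing RLVR methods with self-verification require the model to generate solutions and self-verifications separately during training, which is inefficient. LaSeR aims to simplify self-verification and jointly optimize reasoning and verification.
Result: Experiments show that LaSeR improves the model's reasoning performance and self-rewarding capability, further boosting its inference-time scaling performance.
Insight: The last-token self-rewarding score can serve as an efficient optimization target for the self-verification signal, with no need for separate prompt templates.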
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of model’s self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model’s next-token log-probability assigned to any pre-specified token at the solution’s last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model’s reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
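The last-token self-rewarding score described above is simple enough to write out directly. The sketch follows the abstract's wording (log-probability of a pre-specified token at the last position, minus a pre-calculated constant, scaled by the KL coefficient); the function names and the example numbers are illustrative assumptions.

```python
from typing import Sequence


def last_token_self_reward(
    token_logprob: float,
    ref_constant: float,
    kl_coef: float,
) -> float:
    """Score = KL coefficient * (log-prob of the pre-specified token - constant)."""
    return kl_coef * (token_logprob - ref_constant)


def self_reward_mse(
    scores: Sequence[float],
    verifier_rewards: Sequence[float],
) -> float:
    """MSE term aligning self-rewarding scores with verifier-based rewards."""
    if len(scores) != len(verifier_rewards):
        raise ValueError("scores and rewards must have the same length")
    squared = [(s - r) ** 2 for s, r in zip(scores, verifier_rewards)]
    return sum(squared) / len(squared)


# Illustrative numbers only: log-prob -1.0, constant -3.0, KL coefficient 0.5.
score = last_token_self_reward(token_logprob=-1.0, ref_constant=-3.0, kl_coef=0.5)
loss = self_reward_mse([1.0, 0.0], [1.0, 1.0])
```

In the method as described, this MSE term is added to the original RLVR loss, and reading the score requires only one extra next-token prediction after generation.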
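[154] MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics cs.CL | cs.AI | cs.CE
Yuxing Lu, Xukai Zhao, J. Ben Tamo, Micky C. Nnamdi, Rui Peng
TL;DR: MetaBench is the first benchmark for metabolomics, designed to systematically evaluate large language models (LLMs) in this complex scientific domain.
Details
Motivation: LLMs excel on general text, but their performance in specialized scientific domains that require deep knowledge, such as metabolomics, remains unclear. Metabolomics poses unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases.
Result: LLMs perform well on text generation tasks but struggle with cross-database identifier grounding, and their performance drops on long-tail metabolites with sparse annotations.
Insight: MetaBench provides foundational infrastructure for developing and evaluating metabolomics AI systems, supporting progress toward reliable computational tools in the field.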
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.
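[155] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation cs.CL | cs.CV | cs.LG
Yu Zhou, Sohyun An, Haikang Deng, Da Yin, Clark Peng
TL;DR: The paper studies how multimodal generative models handle dialectal input, builds a large-scale dialect benchmark, and proposes an encoder-based mitigation strategy that substantially improves performance across dialects.
Details
Motivation: Contact languages such as English show rich dialectal variation, but the performance of multimodal generative models on dialectal input has not been studied systematically.
Result: The mitigation strategy raises performance on five dialects to parity with Standard American English (+34.4%) at near-zero cost to SAE performance.
Insight: Existing methods such as fine-tuning and prompt rewriting yield only small gains on dialects, whereas the proposed encoder-based strategy is more general and more effective.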
Abstract: Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
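[156] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents cs.CL | cs.AI | cs.LG
Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao
TL;DR: IGPO is an information gain-based policy optimization method that addresses reward sparsity in multi-turn LLM agent training through dense, turn-level intrinsic rewards, improving both accuracy and sample efficiency.
Details
Motivation: Existing LLM agent training typically relies on sparse rewards given only at the final answer, which leads to advantage collapse and difficult credit assignment in multi-turn tasks. A method that provides dense supervision is needed.
Result: On multi-turn tasks, IGPO outperforms baseline methods in both accuracy and sample efficiency.
Insight: Deriving intrinsic rewards from the model's own belief updates is a simple yet effective way to supply dense supervision in multi-turn interactive tasks.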
Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy’s probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model’s own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
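The turn-level reward in IGPO can be sketched from the abstract's definition: each turn is rewarded by the marginal increase in the policy's probability of producing the correct answer. How the intrinsic rewards are combined with the outcome reward is not specified in the abstract, so appending the outcome reward as a final entry is an assumption made only for illustration.

```python
from typing import List, Sequence


def turn_level_rewards(answer_probs: Sequence[float]) -> List[float]:
    """Information gain per turn.

    answer_probs[0] is the probability of the correct answer before any turn;
    answer_probs[t] is the probability after turn t.
    """
    if len(answer_probs) < 2:
        raise ValueError("need the initial probability and at least one turn")
    return [
        answer_probs[t] - answer_probs[t - 1]
        for t in range(1, len(answer_probs))
    ]


def dense_reward_trajectory(
    answer_probs: Sequence[float],
    outcome_reward: float,
) -> List[float]:
    """Assumed combination: intrinsic turn rewards followed by the outcome reward."""
    return turn_level_rewards(answer_probs) + [outcome_reward]


# Illustrative belief trajectory: the second turn retrieves something unhelpful.
gains = turn_level_rewards([0.1, 0.3, 0.25, 0.7])
trajectory = dense_reward_trajectory([0.1, 0.3, 0.25, 0.7], outcome_reward=1.0)
```

Because the gains telescope, their sum equals the total change in the probability of the correct answer, so turns that reduce that probability receive negative reward even when the final answer is right.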
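[157] Attention Is All You Need for KV Cache in Diffusion LLMs cs.CL | cs.AI | cs.LG
Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
TL;DR: This paper proposes Elastic-Cache, a training-free, architecture-agnostic strategy that selectively refreshes the KV cache to reduce decoding latency in diffusion large language models (DLMs) while preserving generation quality.
Details
Motivation: Existing diffusion LLMs recompute QKV for all tokens at every denoising step and layer, which causes substantial redundant computation, especially in shallow layers. The paper aims to remove this redundancy.
Result: The method delivers consistent speedups on GSM8K, HumanEval, and other tasks (up to 45.1×) while maintaining higher accuracy than the baseline, and its throughput on GSM8K is 6.8× that of existing confidence-based approaches.
Insight: KV cache dynamics depend on layer depth and on the attention distribution, so selective refresh can remove redundant computation and speed up decoding.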
Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods’ decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
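The "when" and "where" decisions can be illustrated with a toy sketch. The abstract does not give the drift metric or threshold, so the cosine-similarity test, the threshold value, and all function names below are assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def most_attended_index(attention_weights: Sequence[float]) -> int:
    """Index of the token that receives the largest attention weight."""
    return max(range(len(attention_weights)), key=attention_weights.__getitem__)


def should_refresh(
    cached_state: Sequence[float],
    fresh_state: Sequence[float],
    threshold: float,
) -> bool:
    """'When': refresh if the most-attended token's state has drifted too far."""
    return cosine_similarity(cached_state, fresh_state) < threshold


def layers_to_recompute(num_layers: int, start_layer: int) -> List[int]:
    """'Where': recompute from the chosen layer onward, reuse shallower caches."""
    return list(range(start_layer, num_layers))
```

The design relies on the abstract's third observation: if even the most-attended token, which drifts least, has moved beyond the threshold, the other tokens' caches are likely stale as well.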
q-bio.QM [Back]
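[158] GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents q-bio.QM | cs.AI | cs.CV | cs.MA
Xi Yu, Yang Yang, Qun Liu, Yonghua Du, Sean McSweeney
TL;DR: GenCellAgent is a training-free multi-agent framework that performs cellular image segmentation through a planner-executor-evaluator loop, automatically selecting the best tool and adapting to new data to improve segmentation accuracy.
Details
Motivation: Cellular image segmentation is challenged by heterogeneous modalities, variable morphology, and scarce annotations. GenCellAgent aims to provide a general, training-free solution that reduces annotation burden and adapts to diverse needs.
Result: Across four benchmarks, mean accuracy improves by 15.7%. On new datasets, average IoU for mitochondria and endoplasmic reticulum improves by 37.6%, and the system can also segment novel objects such as the Golgi apparatus.
Insight: A hybrid framework that combines generalist vision-language models with specialist segmenters markedly improves the generalization and adaptability of cellular image segmentation and offers a new direction for bioimage analysis.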
Abstract: Cellular image segmentation is essential for quantitative biology yet remains difficult due to heterogeneous modalities, morphological variability, and limited annotations. We present GenCellAgent, a training-free multi-agent framework that orchestrates specialist segmenters and generalist vision-language models via a planner-executor-evaluator loop (choose tool $\rightarrow$ run $\rightarrow$ quality-check) with long-term memory. The system (i) automatically routes images to the best tool, (ii) adapts on the fly using a few reference images when imaging conditions differ from what a tool expects, (iii) supports text-guided segmentation of organelles not covered by existing models, and (iv) commits expert edits to memory, enabling self-evolution and personalized workflows. Across four cell-segmentation benchmarks, this routing yields a 15.7% mean accuracy gain over state-of-the-art baselines. On endoplasmic reticulum and mitochondria from new datasets, GenCellAgent improves average IoU by 37.6% over specialist models. It also segments novel objects such as the Golgi apparatus via iterative text-guided refinement, with light human correction further boosting performance. Together, these capabilities provide a practical path to robust, adaptable cellular image segmentation without retraining, while reducing annotation burden and matching user preferences.
cs.LG [Back]
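[159] Weight Weaving: Parameter Pooling for Data-Free Model Merging cs.LG | cs.CV
Levy Chaves, Eduardo Valle, Sandra Avila
TL;DR: Weight Weaving is a data-free model merging technique that pools model weights across the search space of scaling factors, improving the performance of existing merging methods.
Details
Motivation: Existing model merging methods typically rely on data to tune hyperparameters such as the scaling factor λ, which is infeasible in practice. The paper proposes a data-free method to address this.
Result: Across three ViT variants and three experimental setups (multi-task learning, continual learning, and domain generalization), Weight Weaving yields average accuracy gains of up to 15.9 percentage points.
Insight: Weight pooling offers an efficient and flexible data-free strategy for model merging that improves the performance of merged models.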
Abstract: Model merging provides a cost-effective and data-efficient combination of specialized deep neural networks through parameter integration. This technique leverages expert models across downstream tasks without requiring retraining. Most model merging approaches critically depend on scaling hyper-parameters $\lambda$, which weight each model’s contribution globally or individually. Principled approaches for setting scaling factors without accessing any data (data-free) are scarce, often leading researchers to tune $\lambda$ using privileged data from the evaluation set, which is obviously infeasible in practice. To address this limitation, we introduce Weight Weaving, a plug-and-play technique that pools model weights across the search space of $\lambda$ values using user-defined pooling functions, such as averaging, random selection, or even existing model merging methods. Our method demonstrates high modularity, imposing minimal constraints on the search space. It operates orthogonally to existing model merging methods and eliminates evaluation data requirements. We validate Weight Weaving across three ViT variants in three experimental setups: vision multi-task learning, vision continual learning, and domain generalization. Our method consistently improves the performance of several model merging methods, achieving average accuracy gains of up to 15.9 percentage points in a data-free setting.
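The pooling idea can be sketched on flat weight vectors. Task arithmetic is used here as the underlying merging method and elementwise averaging as the pooling function; both are illustrative choices among the options the abstract mentions, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence


def task_arithmetic(
    base: Sequence[float],
    experts: Sequence[Sequence[float]],
    lam: float,
) -> List[float]:
    """Merge by adding lam times the summed task vectors (expert - base) to base."""
    return [
        b + lam * sum(expert[i] - b for expert in experts)
        for i, b in enumerate(base)
    ]


def weight_weaving(
    base: Sequence[float],
    experts: Sequence[Sequence[float]],
    lambdas: Sequence[float],
    pool: Callable[[Sequence[float]], float],
) -> List[float]:
    """Merge once per lambda in the search space, then pool each parameter."""
    candidates = [task_arithmetic(base, experts, lam) for lam in lambdas]
    return [pool(values) for values in zip(*candidates)]


def mean(values: Sequence[float]) -> float:
    """Averaging pooling function."""
    return sum(values) / len(values)


# Toy two-parameter model with two experts and a three-point lambda grid.
merged = weight_weaving(
    base=[0.0, 1.0],
    experts=[[1.0, 1.0], [0.0, 3.0]],
    lambdas=[0.0, 0.5, 1.0],
    pool=mean,
)
```

No evaluation data is consulted at any point: instead of selecting one λ, the sketch pools the merged weights obtained over the whole grid.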
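[160] Backdoor Unlearning by Linear Task Decomposition cs.LG | cs.CV
Amel Abdelraheem, Alessandro Favero, Gerome Bovet, Pascal Frossard
TL;DR: This paper proposes backdoor unlearning via linear task decomposition to counter backdoor attacks on foundation models, removing backdoors efficiently without retraining while preserving the model's general performance.
Details
Motivation: Foundation models excel in computer vision but are highly susceptible to backdoor attacks. Existing methods typically require costly fine-tuning and can harm general capabilities. The authors ask whether backdoors can be removed without such degradation.
Result: With knowledge of the attack, the method removes the backdoor almost completely while retaining, on average, 96% of clean accuracy. When the attack is unknown, it still unlearns backdoors using reverse-engineered triggers.
Insight: The backdoor is disentangled from benign tasks in weight space, which makes efficient removal with minimal impact on clean performance possible.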
Abstract: Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor’s influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
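[161] pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation cs.LG | cs.AI | cs.CV
Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein
TL;DR: The paper proposes policy-based flow models ($\pi$-Flow), which use imitation distillation to resolve the quality-diversity trade-off of conventional few-step generative models and enable fast, accurate ODE integration.
Details
Motivation: Teacher-student distillation for few-step diffusion or flow models suffers from a format mismatch, which leads to complex distillation procedures and a quality-diversity trade-off. $\pi$-Flow aims to simplify this process through imitation distillation.
Result: On ImageNet 256$^2$, $\pi$-Flow attains a 1-NFE FID of 2.85. On the FLUX.1-12B and Qwen-Image-20B models at 4 NFEs, it achieves substantially better diversity than state-of-the-art few-step methods while maintaining teacher-level quality.
Insight: Imitation distillation is simple and effective: it avoids complex distillation procedures and the quality-diversity trade-off, offering a new training paradigm for few-step generative models.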
Abstract: Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy’s ODE trajectory to the teacher’s, we introduce a novel imitation distillation approach, which matches the policy’s velocity to the teacher’s along the policy’s trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher’s behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
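[162] LTR-ICD: A Learning-to-Rank Approach for Automatic ICD Coding cs.LG | cs.CL | cs.IR
Mohammad Mansoori, Amira Soliman, Farzaneh Etminani
TL;DR: The paper proposes LTR-ICD, which treats automatic ICD coding as a classification and ranking problem rather than classification alone. The results show that it outperforms existing methods both at identifying high-priority codes and on classification metrics.
Details
Motivation: The order of the ICD codes attached to clinical notes matters for medical diagnosis and reimbursement, yet existing methods treat coding purely as classification and ignore order. A method that captures both classification and ranking information is therefore needed.
Result: Compared with the state-of-the-art classifier, accuracy in ranking primary diagnosis codes rises from 20% to 47%, and the model also scores higher on classification metrics (micro- and macro-F1).
Insight: Combining classification and ranking addresses ICD coding more completely, and capturing code order is essential for practical use.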
Abstract: Clinical notes contain unstructured text provided by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes following the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes is essential for medical diagnosis and reimbursement. However, automating this task remains challenging. State-of-the-art methods treat this problem as a classification task and therefore ignore the order of ICD codes, which is essential for different purposes. In this work, as a first attempt, we approach this task from a retrieval system perspective to consider the order of codes, thus formulating this problem as a classification and ranking task. Our results and analysis show that the proposed framework has a superior ability to identify high-priority codes compared to other methods. For instance, our model's accuracy in correctly ranking primary diagnosis codes is 47%, compared to 20% for the state-of-the-art classifier. Additionally, in terms of classification metrics, the proposed model achieves micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model with scores of 0.597 and 0.2660.
[163] MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation cs.LG | cs.AI | cs.CLPDF
Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem
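TL;DR: MAFA is a collaborative multi-agent framework for enterprise-scale annotation. It improves annotation efficiency and accuracy through dynamic task adaptation and structured reasoning.
Details
Motivation: Financial institutions face large backlogs of customer utterances awaiting annotation. Traditional approaches are slow and hard to adapt to new task types, so a flexible, efficient annotation solution is needed.
Result: In deployment, MAFA eliminated a backlog of 1 million utterances with 86% average agreement with human annotators. On the internal intent classification dataset, it improved Top-1 accuracy by 13.8% and F1 by 16.9% over baselines.
Insight: Multi-agent systems have practical potential for enterprise-scale tasks; dynamic configuration and a consensus mechanism are the key success factors.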
Abstract: We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA’s effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.
[164] Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models cs.LG | cs.AI | cs.CLPDF
Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad
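TL;DR: The paper presents GenCluster, a scalable test-time compute framework that, combined with the open-weight model gpt-oss-120b, reaches gold-medal performance at the IOI for the first time with an open-weight model. It uses large-scale generation, behavioral clustering, ranking, and round-robin submission to explore diverse solution spaces efficiently under a limited validation budget.
Details
Motivation: Proprietary models have reportedly reached gold-medal level at the IOI, but their methods are undisclosed, and open-weight models still lag behind. The goal is a transparent, reproducible method that narrows the performance gap between open and closed systems.
Result: Performance scales consistently with available compute, and GenCluster with gpt-oss-120b can achieve a gold medal at IOI 2025, a first for an open-weight model.
Insight: With a transparent and scalable test-time compute framework, open-weight models can match closed systems on complex tasks, which underlines the importance of compute allocation and of combining several strategies.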
Abstract: Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model, gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
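The clustering and submission steps named in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the candidate functions, the test inputs, and especially the ranking rule (clusters are ordered by size, which is a stand-in and not necessarily the ranking the paper uses).

```python
from collections import defaultdict
from itertools import zip_longest

def cluster_by_behavior(solutions, test_inputs):
    """Group candidates whose outputs agree on every test input."""
    clusters = defaultdict(list)
    for name, fn in solutions.items():
        signature = tuple(fn(x) for x in test_inputs)
        clusters[signature].append(name)
    return list(clusters.values())

def round_robin(ranked_clusters, budget):
    """Take one candidate per cluster in turn until the budget is spent."""
    order = []
    for tier in zip_longest(*ranked_clusters):
        order.extend(c for c in tier if c is not None)
    return order[:budget]

# Hypothetical candidates for the task "square a number".
solutions = {
    "a": lambda x: x * x,
    "b": lambda x: x ** 2,
    "c": lambda x: x + x,            # wrong, but agrees with "d"
    "d": lambda x: 2 * x,
    "e": lambda x: abs(x) * abs(x),
}
clusters = cluster_by_behavior(solutions, test_inputs=[-2, 0, 3])
# Assumed ranking rule: larger clusters first.
ranked = sorted(clusters, key=len, reverse=True)
submission_order = round_robin(ranked, budget=3)
```

With a budget of three submissions, the round-robin order spends one attempt on the second cluster before returning to the first, so a single dominant but wrong cluster cannot consume the whole budget.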
[165] Agentic Entropy-Balanced Policy Optimization cs.LG | cs.AI | cs.CL | cs.IRPDF
Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li
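TL;DR: The paper proposes AEPO, an agentic reinforcement learning algorithm that balances entropy in both the rollout and policy update phases, addressing the training collapse caused by over-reliance on entropy signals. AEPO performs strongly across many datasets.
Details
Motivation: Mainstream agentic RL algorithms rely heavily on entropy signals to guide exploration, which can lead to training collapse or reduced efficiency. The paper proposes a more balanced approach to address this.
Result: AEPO outperforms 7 mainstream RL algorithms across 14 datasets. With only 1K RL samples, Qwen3-14B with AEPO achieves strong results, such as 47.6% Pass@1 on GAIA.
Insight: By balancing entropy, AEPO improves performance and rollout sampling diversity while keeping policy entropy stable, which makes scalable web agent training feasible.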
Abstract: Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity’s Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity’s Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
[166] Reasoning with Sampling: Your Base Model is Smarter Than You Think cs.LG | cs.AI | cs.CLPDF
Aayush Karan, Yilun Du
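TL;DR: The paper proposes a training-free iterative sampling algorithm that uses the base model's own likelihoods to substantially improve reasoning, with performance that nearly matches and sometimes exceeds RL-posttrained models.
Details
Motivation: Reinforcement learning (RL) posttraining has been successful at improving the reasoning of large language models, but the literature mostly focuses on the new behaviors RL induces and overlooks the potential of base models. The paper tries to draw out base-model reasoning through pure sampling.
Result: On several tasks, including MATH500, HumanEval, and GPQA, the method substantially improves reasoning, nearly matching or outperforming RL-posttrained models, while avoiding the collapse in diversity that RL-posttraining shows.
Insight: With a suitable sampling strategy, base models can exhibit strong reasoning without additional training or a complex RL framework.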
Abstract: Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models’ own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
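The idea of MCMC sampling from a sharpened distribution can be shown on a toy problem. The sketch below is not the paper's algorithm: it assumes a three-outcome categorical "base model" `p` and a target proportional to `p ** alpha`, and uses Metropolis-Hastings with proposals drawn from `p` itself. The function names and the value of `alpha` are illustrative choices.

```python
import random

def sharpened(p, alpha):
    """Exact target distribution proportional to p**alpha (for checking only)."""
    weights = [pi ** alpha for pi in p]
    total = sum(weights)
    return [w / total for w in weights]

def mh_sample_sharpened(p, alpha, steps, seed=0):
    """Metropolis-Hastings targeting p**alpha, proposing from the base p.

    With proposal q = p and target proportional to p**alpha, the acceptance
    ratio simplifies to (p[new] / p[old]) ** (alpha - 1), so only base
    likelihoods are needed and no normalizing constant is computed.
    """
    rng = random.Random(seed)
    states = list(range(len(p)))
    x = rng.choices(states, weights=p)[0]
    counts = [0] * len(p)
    for _ in range(steps):
        proposal = rng.choices(states, weights=p)[0]
        accept = min(1.0, (p[proposal] / p[x]) ** (alpha - 1))
        if rng.random() < accept:
            x = proposal
        counts[x] += 1
    return [c / steps for c in counts]

base = [0.5, 0.3, 0.2]          # toy base distribution (assumed)
target = sharpened(base, alpha=4)
empirical = mh_sample_sharpened(base, alpha=4, steps=20000)
```

The chain only ever evaluates likelihoods under the base distribution, which is the property that makes this family of samplers training-free; the empirical frequencies concentrate on the base model's most likely outcome far more than plain sampling would.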
[167] Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models cs.LG | cs.CLPDF
Jonas Geiping, Xinyu Yang, Guinan Su
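TL;DR: The paper examines the relationship between recurrent-depth models and diffusion language models and proposes a new diffusion forcing sampler that accelerates generation for recurrent-depth models, with up to a 5x speedup on modern hardware.
Details
Motivation: Recurrent-depth models increase computation by repeating layers and show advantages on reasoning tasks. The need to parallelize their generation motivates exploring their similarity to diffusion language models in order to build more efficient samplers.
Result: The sampler speeds up generation of an existing 3.5B-parameter recurrent-depth model by up to 5x without any tuning, and generation with it is theoretically strictly more expressive than baseline autoregressive generation under the same time budget.
Insight: Recurrent-depth models can be viewed as strong continuous, though causal, diffusion language models, and their potential for parallelization opens a new route to efficient inference.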
Abstract: Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
cs.RO [Back]
[168] Learning Human-Humanoid Coordination for Collaborative Object Carrying cs.RO | cs.AI | cs.CV | cs.LGPDF
Yushi Du, Yixuan Li, Baoxiong Jia, Yutang Lin, Pei Zhou
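TL;DR: The paper proposes COLA, a reinforcement learning approach that combines leader and follower behaviors in a single policy so that a humanoid robot can carry objects collaboratively with a human, without external sensors or complex interaction models.
Details
Motivation: Human-humanoid collaboration has great potential in healthcare, domestic assistance, and manufacturing, but compliant collaboration methods that handle a humanoid's whole-body dynamics are still lacking.
Result: Simulation experiments show a 24.7% reduction in human effort; real-world experiments confirm robustness across object types and terrains; a user study with 23 participants shows an average improvement of 27.4% over baseline models.
Insight: Through implicit learning and closed-loop training, COLA achieves effective collaboration without complex models, offering a practical path to real-world humanoid deployment.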
Abstract: Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids’ complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
[169] GOPLA: Generalizable Object Placement Learning via Synthetic Augmentation of Human Arrangement cs.RO | cs.CVPDF
Yao Zhong, Hanzhi Chen, Simon Schaefer, Anran Zhang, Stefan Leutenegger
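TL;DR: GOPLA is a hierarchical framework that learns object placement from augmented human demonstrations, combining semantic preferences with geometric feasibility. A multi-modal large language model produces structured plans, and a diffusion-based planner together with synthetic data augmentation yields strong generalization.
Details
Motivation: Robots acting as intelligent assistants must handle object placement, but existing methods fall short in semantic and geometric reasoning. Augmenting human demonstrations with synthetic data is intended to improve generalization.
Result: GOPLA improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility, and generalizes well.
Insight: The combination of synthetic data augmentation and multi-modal planning is the key ingredient, and the diffusion-based approach performs well on this complex task.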
Abstract: Robots are expected to serve as intelligent assistants, helping humans with everyday household organization. A central challenge in this setting is the task of object placement, which requires reasoning about both semantic preferences (e.g., common-sense object relations) and geometric feasibility (e.g., collision avoidance). We present GOPLA, a hierarchical framework that learns generalizable object placement from augmented human demonstrations. A multi-modal large language model translates human instructions and visual inputs into structured plans that specify pairwise object relationships. These plans are then converted into 3D affordance maps with geometric common sense by a spatial mapper, while a diffusion-based planner generates placement poses guided by test-time costs, considering multi-plan distributions and collision avoidance. To overcome data scarcity, we introduce a scalable pipeline that expands human placement demonstrations into diverse synthetic training data. Extensive experiments show that our approach improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility, demonstrating strong generalization across a wide range of real-world robotic placement scenarios.
[170] From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance cs.RO | cs.CVPDF
Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Yibo Peng
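TL;DR: RoboGhost is a retargeting-free humanoid control framework that generates actions directly from language through motion latent representations, avoiding the cumulative errors and high latency of multi-stage pipelines and improving the coupling between semantics and control.
Details
Motivation: Existing language-guided humanoid control pipelines run in multiple stages (decoding human motion, then retargeting it to the robot's morphology), which is prone to cumulative errors, high latency, and weak coupling between semantics and control.
Result: Experiments show that RoboGhost substantially reduces deployment latency and improves success rates, tracking accuracy, and semantically aligned motion generation.
Insight: The framework offers an efficient way to generate actions directly from language and extends to other modalities such as images and audio.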
Abstract: Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision-language-action humanoid systems.
[171] RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks cs.RO | cs.AI | cs.CV | cs.LG | cs.SY | eess.SYPDF
Mingxuan Yan, Yuping Wang, Zechun Liu, Jiachen Li
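TL;DR: The paper proposes a Retrieval-based Demonstration Decomposer (RDD) that splits long-horizon demonstrations into sub-task intervals by aligning their visual features with the low-level policy's training data, improving task performance.
Details
Motivation: VLM planners are usually finetuned on demonstrations decomposed by human annotation or heuristic rules. Such decompositions can deviate from the training data of the low-level visuomotor policy, which degrades performance.
Result: RDD outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks and is robust across diverse settings.
Insight: By aligning visual features, RDD avoids the shortcomings of conventional decomposition and gives long-horizon planning a more reliable decomposition to build on.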
Abstract: To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic sub-tasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.
cs.AI [Back]
[172] Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline cs.AI | cs.CVPDF
Haiyang Li, Yaxiong Wang, Shengeng Tang, Lianwei Wu, Lechao Cheng
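TL;DR: The paper proposes UMFDet, a unified multimodal fake content detection framework, together with the newly built OmniFake dataset, addressing the limitation that human-crafted misinformation and AI-generated content have been studied separately.
Details
Motivation: Existing detection methods usually target only one of human-crafted or AI-generated fake content, and a unified multimodal solution is missing. The paper aims to fill this gap with a framework that handles both.
Result: Experiments show that UMFDet performs robustly on both types of fake content and outperforms specialized baselines, offering a practical solution for real-world use.
Insight: A unified multimodal detection framework copes better with real-world posts whose type of deception is unknown, improving robustness and consistency.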
Abstract: In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.
[173] AI for Service: Proactive Assistance with AI Glasses cs.AI | cs.CL | cs.CVPDF
Zichen Wen, Yiyu Wang, Chenfei Liao, Boxue Yang, Junxian Li
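TL;DR: The paper proposes Alpha-Service, a framework for proactive assistance through AI glasses that addresses two questions, when to intervene and how to provide service, and demonstrates several practical scenarios.
Details
Motivation: Existing AI services are mostly reactive and lack proactivity and adaptivity. The authors propose the AI4Service paradigm, in which AI acts as an active companion that anticipates user needs and offers help at the right time.
Result: Case studies show that Alpha-Service provides timely, useful help without explicit prompts in scenarios such as a Blackjack advisor, a museum tour guide, and a shopping fit assistant.
Insight: AI services are moving from reactive response to proactive adaptation, and Alpha-Service shows that this is achievable by combining several cooperating components in a multi-agent system.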
Abstract: In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.
[174] Agentic Design of Compositional Machines cs.AI | cs.CL | cs.CV | cs.GR | cs.LGPDF
Wenqian Zhang, Weiyang Liu, Zhen Liu
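TL;DR: The paper asks whether large language models (LLMs) can learn to create complex machines through compositional machine design and introduces the BesiegeField testbed. Current open-source models fall short, and reinforcement learning (RL) is explored as a path to improvement.
Details
Motivation: The design of complex machines is both a marker of human intelligence and a foundation of engineering practice; the study explores whether LLMs have the potential to do it.
Result: Current open-source LLMs perform poorly at compositional machine design, but RL finetuning shows potential for improvement.
Insight: Complex design tasks require LLMs to combine physical reasoning with language ability, which remains an open challenge and points to directions for future work.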
Abstract: The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
[175] Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks cs.AI | cs.CLPDF
Supriti Sinhamahapatra, Jan Niehues
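TL;DR: The paper studies combining multi-modal information (speech and slides) to improve automatic speech recognition (ASR) for scientific conference talks. Using data augmentation and a new benchmark, the authors obtain a relative word error rate reduction of about 34%.
Details
Motivation: Existing ASR systems rely mainly on acoustic information and ignore multi-modal context such as slides, which helps with disambiguation and adaptation to domain terminology. The paper aims to fill this gap.
Result: Compared with the baseline, the multi-modal model achieves a relative word error rate reduction of approximately 34% across all words and 35% for domain-specific terms.
Insight: Visual information such as slides matters for ASR, especially for recognizing domain-specific terms, and multi-modal models are potentially very useful in academic settings.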
Abstract: State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information is essential for disambiguation and adaptation. While most work focuses on speaker images to handle noisy conditions, this work also focuses on integrating presentation slides for the use case of scientific presentations. In a first step, we create a benchmark for multi-modal presentations, including an automatic analysis of transcribing domain-specific terminology. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides through a suitable data augmentation approach. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34% across all words and 35% for domain-specific terms compared to the baseline model.
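For readers unfamiliar with the metric, the sketch below shows how word error rate and a relative reduction are commonly computed. The sentences and the two WER values are made up for illustration and are not taken from the paper; only the roughly 34% relative figure comes from the abstract.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))   # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Relative improvement, e.g. 0.34 means a 34% relative reduction."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical example: a domain term is misrecognized without slide context.
example = wer("the transformer model", "the transform model")
# Hypothetical WER values chosen so the relative reduction is 34%.
gain = relative_reduction(0.50, 0.33)
```

A relative reduction is measured against the baseline's own error rate, so a 34% relative reduction does not mean that WER dropped by 34 absolute points.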
[176] Generating Fair Consensus Statements with Social Choice on Token-Level MDPs cs.AI | cs.CL | cs.GTPDF
Carter Blair, Kate Larson
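TL;DR: The paper proposes an approach based on social choice theory and token-level MDPs for generating consensus statements with provable fairness guarantees, so that diverse opinions are aggregated fairly.
Details
Motivation: Existing LLM-based frameworks for consensus statement generation lack the structure needed to provide provable fairness guarantees when aggregating diverse opinions.
Result: Search guided by the egalitarian welfare objective generates consensus statements with better worst-case agent alignment than baseline methods.
Insight: Bringing social choice theory into text generation gives fairness a theoretical framework and shows that token-level MDPs are applicable to multi-agent settings.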
Abstract: Current frameworks for consensus statement generation with large language models lack the inherent structure needed to provide provable fairness guarantees when aggregating diverse free-form opinions. We model the task as a multi-objective, token-level Markov Decision Process (MDP), where each objective corresponds to an agent’s preference. Token-level rewards for each agent are derived from their policy (e.g., a personalized language model). This approach utilizes the finding that such policies implicitly define optimal Q-functions, providing a principled way to quantify rewards at each generation step without a value function (Rafailov et al., 2024). This MDP formulation creates a formal structure amenable to analysis using principles from social choice theory. We propose two approaches grounded in social choice theory. First, we propose a stochastic generation policy guaranteed to be in the ex-ante core, extending core stability concepts from voting theory to text generation. This policy is derived from an underlying distribution over complete statements that maximizes proportional fairness (Nash Welfare). Second, for generating a single statement, we target the maximization of egalitarian welfare using search algorithms within the MDP framework. Empirically, experiments using language models to instantiate agent policies show that search guided by the egalitarian objective generates consensus statements with improved worst-case agent alignment compared to baseline methods, including the Habermas Machine (Tessler et al., 2024).
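The two welfare objectives mentioned in the abstract can lead to different choices, which a small example makes concrete. This sketch is a simplification: it picks a single statement from a fixed candidate list, whereas the paper's Nash Welfare result concerns a distribution over statements and its egalitarian search runs inside the token-level MDP. The statements and utility values are invented for illustration.

```python
import math

def egalitarian_choice(utilities):
    """Pick the statement whose worst-off agent is best off (max-min)."""
    return max(utilities, key=lambda s: min(utilities[s]))

def nash_choice(utilities):
    """Pick the statement maximizing the product of agent utilities."""
    return max(utilities, key=lambda s: sum(math.log(u) for u in utilities[s]))

# Hypothetical utilities of three agents for three candidate statements.
utilities = {
    "statement_A": [0.9, 0.9, 0.1],   # great for two agents, poor for one
    "statement_B": [0.5, 0.5, 0.5],   # equal for everyone
    "statement_C": [0.8, 0.6, 0.4],   # in between
}
egalitarian = egalitarian_choice(utilities)
nash = nash_choice(utilities)
```

Here the egalitarian rule selects `statement_B` (worst-case utility 0.5), while the Nash product favors `statement_C` (0.192 versus 0.125), showing how the objective shapes which consensus counts as fair.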
[177] Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies cs.AI | cs.CL | cs.CR | I.2.7; I.2.11PDF
Mason Nakamura, Abhinav Kumar, Saaduddin Mahmud, Sahar Abdelnabi, Shlomo Zilberstein
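TL;DR: The paper proposes the Terrarium framework for studying safety, privacy, and security in LLM-based multi-agent systems (MAS). By repurposing the blackboard architecture, Terrarium provides a modular, configurable testbed for multi-agent collaboration and identifies key attack vectors.
Details
Motivation: MAS powered by LLMs can automate complex tasks but also introduce new risks, such as attacks by malicious parties and data leakage. A systematic way to study and address these risks is needed.
Result: The paper demonstrates the framework's flexibility by implementing three collaborative MAS scenarios with four representative attacks, and provides tools for iterating on defense designs.
Insight: Terrarium gives researchers and developers a systematic way to study these risks, with the aim of accelerating progress toward trustworthy multi-agent systems.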
Abstract: A multi-agent system (MAS) powered by large language models (LLMs) can automate tedious user tasks such as meeting scheduling that requires inter-agent collaboration. LLMs enable nuanced protocols that account for unstructured private data, user constraints, and preferences. However, this design introduces new risks, including misalignment and attacks by malicious parties that compromise agents or steal user data. In this paper, we propose the Terrarium framework for fine-grained study on safety, privacy, and security in LLM-based MAS. We repurpose the blackboard design, an early approach in multi-agent systems, to create a modular, configurable testbed for multi-agent collaboration. We identify key attack vectors such as misalignment, malicious agents, compromised communication, and data poisoning. We implement three collaborative MAS scenarios with four representative attacks to demonstrate the framework’s flexibility. By providing tools to rapidly prototype, evaluate, and iterate on defenses and designs, Terrarium aims to accelerate progress toward trustworthy multi-agent systems.
[178] IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning cs.AI | cs.CLPDF
Xikai Zhang, Bo Wang, Likang Xiao, Yongzhi Li, Quan Chen
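TL;DR: The paper proposes IMAGINE, a framework that integrates the reasoning and planning capabilities of a multi-agent system (MAS) into one compact model, substantially improving performance on complex reasoning and planning tasks.
Details
Motivation: Large language models (LLMs) perform well on many tasks but still struggle with complex reasoning and planning. MAS offer better collective reasoning but are costly and hard to train end to end, so an efficient, scalable alternative is needed.
Result: With Qwen3-8B-Instruct as the base model, IMAGINE reaches an 82.7% Final Pass Rate on the TravelPlanner benchmark, far above the 40% of DeepSeek-R1-671B.
Insight: Folding the capabilities of a multi-agent system into a single model both reduces reasoning cost and markedly improves performance, showing the potential of compact models.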
Abstract: Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in the thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B achieve Final Pass Rates of only 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpasses the capabilities of the MAS through simple end-to-end training. Through this pipeline, a single small-scale model is not only able to acquire the structured reasoning and planning capabilities of a well-organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3-8B-Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.
[179] ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks cs.AI | cs.CLPDF
Yuanyi Song, Heyuan Huang, Qiqiang Lin, Yin Zhao, Xiangmou Qu
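TL;DR: The paper presents ColorBench, a graph-structured benchmarking framework for evaluating mobile agents on complex long-horizon tasks, bridging the gap between offline static benchmarks and online dynamic testing.
Details
Motivation: Current mobile agent evaluation cannot fully test tasks that admit multiple valid solutions: static benchmarks validate only a single predefined path, while dynamic testing is limited by the complexity and non-reproducibility of real devices.
Result: The experiments reveal limitations of existing models, and the authors propose improvement directions and feasible technical pathways based on the results.
Insight: A graph-structured framework is closer to real interaction, supports evaluation of multiple valid solutions, and suggests new ways to improve agents on long-horizon tasks.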
Abstract: The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined “golden path”, while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents’ performance on complex, long-horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.
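To make the idea of a graph-structured benchmark concrete, the sketch below replays an agent's action sequence on a small recorded state graph in which two different paths reach the goal. The state names, action names, and the `replay` helper are hypothetical and do not reflect ColorBench's actual data format or API.

```python
def replay(graph, start, goal, actions):
    """Replay actions on a recorded state graph.

    graph maps state -> {action: next_state}. An action that was never
    recorded in the current state is treated as an error and ends the run.
    Returns (success, final_state).
    """
    state = start
    for action in actions:
        next_state = graph.get(state, {}).get(action)
        if next_state is None:
            return False, state
        state = next_state
    return state == goal, state

# Hypothetical shopping task with two valid routes to the cart.
graph = {
    "home": {"tap_search": "search", "tap_category": "category"},
    "search": {"tap_result": "item"},
    "category": {"tap_product": "item"},
    "item": {"tap_add_to_cart": "cart"},
}
path_a = ["tap_search", "tap_result", "tap_add_to_cart"]
path_b = ["tap_category", "tap_product", "tap_add_to_cart"]
wrong = ["tap_search", "tap_add_to_cart"]

result_a = replay(graph, "home", "cart", path_a)
result_b = replay(graph, "home", "cart", path_b)
result_wrong = replay(graph, "home", "cart", wrong)
```

Because the graph is a fixed recording, every run is reproducible, yet an agent is not penalized for choosing `path_b` over `path_a`, which a single golden-path benchmark would mark as a failure.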
[180] TITAN: Graph-Executable Reasoning for Cyber Threat Intelligence cs.AI | cs.CL | cs.CR | cs.IRPDF
Marco Simoni, Aleksandar Fontana, Andrea Saracino, Paolo Mori
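TL;DR: TITAN is a framework for cyber threat intelligence that combines natural-language queries with reasoning over a structured knowledge graph. It pairs a path planner model with a graph executor and is validated on a MITRE-derived graph.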
Details
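Motivation: Traditional threat intelligence retrieval systems lack structured reasoning and cannot establish clear logical chains between threats, behaviors, and defenses. TITAN aims to fill this gap.
Result: Experiments show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be executed deterministically on the graph.
Insight: The structured, bidirectional graph design improves the interpretability and flexibility of reasoning and suits complex threat scenarios.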
Abstract: TITAN (Threat Intelligence Through Automated Navigation) is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph. It integrates a path planner model, which predicts logical relation chains from text, and a graph executor that traverses the TITAN Ontology to retrieve factual answers and supporting evidence. Unlike traditional retrieval systems, TITAN operates on a typed, bidirectional graph derived from MITRE, allowing reasoning to move clearly and reversibly between threats, behaviors, and defenses. To support training and evaluation, we introduce the TITAN Dataset, a corpus of 88209 examples (Train: 74258; Test: 13951) pairing natural language questions with executable reasoning paths and step by step Chain of Thought explanations. Empirical evaluations show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph.
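The split between a planner that predicts a relation chain and an executor that walks a typed, bidirectional graph can be illustrated with a small sketch. It is not the TITAN code or ontology: the class `TypedGraph`, the relation names, the inverse mapping, and the node names are hypothetical placeholders, and the relation chain is written by hand rather than predicted by a model.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical relation names and their inverses (not the TITAN Ontology).
INVERSE: Dict[str, str] = {"uses": "used_by", "mitigated_by": "mitigates"}


class TypedGraph:
    """Toy typed graph that stores every edge together with its inverse."""

    def __init__(self) -> None:
        # (node, relation) -> set of neighbouring nodes
        self.edges: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, src: str, relation: str, dst: str) -> None:
        self.edges[(src, relation)].add(dst)
        # Storing the inverse edge makes every traversal reversible.
        self.edges[(dst, INVERSE[relation])].add(src)

    def execute(self, start: str, path: List[str]) -> Set[str]:
        """Deterministically follow a chain of relations from a start node."""
        frontier = {start}
        for relation in path:
            frontier = {
                dst
                for node in frontier
                for dst in self.edges.get((node, relation), set())
            }
        return frontier


# Made-up nodes for illustration only.
graph = TypedGraph()
graph.add("group:example", "uses", "technique:phishing")
graph.add("technique:phishing", "mitigated_by", "mitigation:user_training")

forward = graph.execute("group:example", ["uses", "mitigated_by"])
backward = graph.execute("mitigation:user_training", ["mitigates", "used_by"])
```

Given the same graph and the same relation chain, `execute` always returns the same set of nodes, which is the sense in which a predicted path can be run deterministically.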
[181] Where to Search: Measure the Prior-Structured Search Space of LLM Agents cs.AI | cs.CL | cs.LOPDF
Zhuo-Yang Song
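TL;DR: This paper proposes a compact formal theory for describing and measuring LLM-assisted iterative search guided by domain priors. By modeling an agent as a fuzzy relation operator and introducing a coverage generating function and a measure of reachability difficulty, the theory gives a systematic formal description of, and operational tools for, iterative search constructed by LLMs.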
Details
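Motivation: The generate-filter-refine paradigm has driven progress in LLM-based reasoning, programming, and scientific discovery, but its effectiveness depends on how the domain prior is encoded into an operationally structured hypothesis space. The paper addresses this question with a workable theoretical framework.
Result: The theory provides a workable language and tools for measuring agents and their search spaces, and gives a systematic formal description of iterative search constructed by LLMs.
Insight: By formalizing agent behavior and the search space, the theory gives LLM-driven iterative search a theoretical foundation, which may help improve search efficiency and safety.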
Abstract: The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
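Weighting every reachable path by a single continuation parameter and summing can be illustrated on an ordinary directed graph. The sketch below is a simplification under stated assumptions: it uses a crisp (0/1) graph instead of the paper's fuzzy relation operator, truncates the sum at a maximum path length, and does not reproduce the paper's reachability difficulty measure. The function `coverage`, the parameter `z`, and the toy graph are hypothetical.

```python
from typing import Dict, List


def coverage(adj: Dict[int, List[int]], source: int, target: int,
             z: float, max_len: int) -> float:
    """Truncated sum over n = 0..max_len of z**n times the number of
    length-n walks from source to target in the directed graph adj."""
    counts = {node: 0 for node in adj}
    counts[source] = 1  # exactly one walk of length 0: stay at the source
    total = 0.0
    for n in range(max_len + 1):
        total += (z ** n) * counts[target]
        # Advance every walk by one step along each outgoing edge.
        next_counts = {node: 0 for node in adj}
        for node, count in counts.items():
            if count:
                for neighbour in adj[node]:
                    next_counts[neighbour] += count
        counts = next_counts
    return total


# Toy graph: one direct edge 0 -> 3 and two two-step routes via 1 and 2.
TOY_ADJ: Dict[int, List[int]] = {0: [1, 2, 3], 1: [3], 2: [3], 3: []}

# One walk of length 1 and two walks of length 2 give z + 2 * z**2.
value = coverage(TOY_ADJ, source=0, target=3, z=0.5, max_len=5)
```

In this toy setting a smaller `z` discounts long routes more heavily, so the value reflects both how many routes reach the target and how long they are.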
[182] Budget-aware Test-time Scaling via Discriminative Verification cs.AI | cs.CL | cs.LGPDF
Kyle Montgomery, Sijun Tan, Yuqi Chen, Siyuan Zhuang, Tianjun Zhang
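TL;DR: The paper proposes a budget-aware test-time scaling approach that combines discriminative verifiers with self-consistency, improving large language model performance on complex reasoning tasks at lower compute cost.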
Details
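Motivation: Existing approaches that use generative verifiers to select the best solution incur prohibitive computational costs, which limits their practicality. A more efficient, budget-friendly alternative is needed.
Result: Under a fixed compute budget, the hybrid approach achieves up to 15.3% higher accuracy on AIME2025 than state-of-the-art generative verification.
Insight: Budget-aware discriminative verification is both more efficient and more effective than costly generative verification, making it a practical option for real-world applications.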
Abstract: Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a “free” upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.
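One simple way to combine a discriminative verifier with self-consistency is to let each sampled solution vote for its final answer with a weight equal to its verifier score. The abstract does not state the exact combination rule, so the weighted vote below is an assumption for illustration, and the answers and scores are made-up toy values.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def majority_vote(answers: List[str]) -> str:
    """Plain self-consistency: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers: List[str], scores: List[float]) -> str:
    """Hybrid rule (assumed): sum verifier scores per distinct answer
    and return the answer with the largest total."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=lambda a: totals[a])


# Toy final answers from five sampled solutions, with made-up verifier scores.
answers = ["42", "42", "17", "17", "17"]
scores = [0.9, 0.8, 0.1, 0.2, 0.3]

plain = majority_vote(answers)            # "17" wins on count (3 vs 2)
hybrid = weighted_vote(answers, scores)   # "42" wins on score (1.7 vs about 0.6)
```

Scoring each candidate takes one verifier pass that outputs a scalar, which is why this style of verification is cheaper than asking a generative verifier to write out a critique for every candidate.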
[183] TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG cs.AI | cs.CL | cs.LG | eess.AS | eess.SPPDF
Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni
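TL;DR: TRI-DEP systematically compares feature representations and modelling strategies across three modalities (EEG, speech, and text) for depression detection. Experiments show that multimodal combinations outperform unimodal ones, pretrained embeddings outperform handcrafted features, and carefully designed trimodal models achieve state-of-the-art performance.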
Details
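Motivation: Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Existing studies are mostly unimodal or limited in their multimodal scope, and lack systematic feature comparisons and consistent evaluation protocols. The paper aims to address these gaps.
Result: The results show that (i) combining modalities improves detection performance, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance.
Insight: Multimodal signals, EEG in particular, are complementary for depression detection; pretrained embeddings and consistent data splits are key success factors.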
Abstract: Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.
[184] Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models cs.AI | cs.CL | cs.LGPDF
Akira Okutomi
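TL;DR: The paper reinterprets Kant's Critique of Pure Reason as a theory of feedback stability and proposes a composite instability index (H-Risk) to measure the stability of reasoning systems. It reveals a gap between nominal and epistemic stability and, in large language models (LLMs), finds an association with miscalibration and hallucination.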
Details
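Motivation: The study aims to understand the stability of reasoning systems, and overconfidence in particular, from the perspective of Kantian philosophy, providing a theoretical basis for diagnosing and reducing overconfidence in such systems.
Result: H-Risk predicts overconfident errors in linear-Gaussian simulations, and in LLMs fragile internal dynamics correlate with miscalibration and hallucination. Critique-style prompts show mixed effects on calibration and hallucination.
Insight: The study proposes a structural bridge between Kantian self-limitation and feedback control, offering a new perspective for analyzing and improving the stability of reasoning systems.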
Abstract: We reinterpret Kant’s Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing – and selectively reducing – overconfidence in reasoning systems. This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.
eess.IV [Back]
[185] Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation eess.IV | cs.AI | cs.CVPDF
Arnaud Judge, Nicolas Duchateau, Thierry Judge, Roman A. Sandler, Joseph Z. Sokol
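TL;DR: The paper presents RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation that uses reinforcement learning and novel reward functions to improve segmentation accuracy, anatomical validity, and temporal consistency.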
Details
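Motivation: In medical image segmentation, domain adaptation methods are often limited by poor reliability in the target domain. The problem is worse for spatio-temporal data and for echocardiography, where artifacts and noise are common, so a solution is needed.
Result: Validated on over 30,000 echocardiographic videos, RL4Seg3D outperforms standard domain adaptation techniques without requiring any labels on the target domain.
Insight: Reinforcement learning can effectively improve domain adaptation for medical image segmentation, and it also yields an uncertainty estimator that can be used at test time to further enhance segmentation.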
Abstract: Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.
[186] A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities eess.IV | cs.AI | cs.CV | cs.LGPDF
Siva Teja Kakileti, Bharath Govindaraju, Sudhakar Sampangi, Geetha Manjunath
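TL;DR: The paper proposes a breast density-informed multimodal AI framework that combines mammography and thermal imaging (Thermalytix) to improve breast cancer detection, with markedly better results than unimodal approaches in dense breast tissue in particular.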
Details
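Motivation: Mammography has reduced sensitivity in dense breast tissue, which can lead to missed or delayed diagnoses. The study aims to address this with multimodal AI and improve the accuracy of breast cancer detection.
Result: The multimodal AI framework achieves a sensitivity of 94.55% and a specificity of 79.93%. Its sensitivity exceeds that of standalone mammography AI (81.82%) and Thermalytix AI (92.73%), while its specificity falls between the two (86.25% and 75.46%). Sensitivity is also more stable across dense and fatty breasts than with mammography alone.
Insight: A multimodal AI framework that combines structural and functional data can overcome the limitations of a single modality; it is low-cost and easy to deploy, making it suitable for both high-resource and resource-limited settings.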
Abstract: Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.
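The routing rule described in the abstract, with mammography AI for fatty breasts and Thermalytix AI for dense breasts, together with the sensitivity and specificity arithmetic, can be sketched as follows. This is an illustration only and not the study's pipeline: the function names, the two density labels as strings, the 0.5 decision threshold, and the four toy cases are all assumptions.

```python
from typing import List, Tuple


def density_informed_prediction(density: str, mammo_score: float,
                                thermal_score: float,
                                threshold: float = 0.5) -> bool:
    """Pick the modality by breast density, then threshold its score.
    The threshold value is a hypothetical choice."""
    if density == "fatty":
        score = mammo_score      # structural imaging for fatty breasts
    elif density == "dense":
        score = thermal_score    # thermal imaging for dense breasts
    else:
        raise ValueError(f"unknown density label: {density!r}")
    return score >= threshold


def sensitivity_specificity(preds: List[bool],
                            labels: List[bool]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Assumes at least one positive and one negative label."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    tn = sum(1 for p, y in zip(preds, labels) if not p and not y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up toy cases: (density, mammography score, thermal score, has cancer).
cases = [
    ("fatty", 0.9, 0.4, True),
    ("dense", 0.2, 0.8, True),
    ("dense", 0.7, 0.3, False),
    ("fatty", 0.6, 0.1, False),
]
preds = [density_informed_prediction(d, m, t) for d, m, t, _ in cases]
labels = [y for _, _, _, y in cases]
sens, spec = sensitivity_specificity(preds, labels)
```

In the second toy case the mammography score alone would miss the cancer, but routing the dense breast to the thermal score recovers it, which mirrors the motivation given in the abstract.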