Table of Contents
- cs.CV [Total: 72]
- cs.CL [Total: 71]
- eess.IV [Total: 4]
- cs.ET [Total: 1]
- cs.CR [Total: 1]
- eess.AS [Total: 2]
- cs.SD [Total: 1]
- cs.AI [Total: 8]
- cs.LG [Total: 8]
- cs.DB [Total: 1]
cs.CV
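[1] Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes cs.CV | cs.AI
Nirmal Elamon, Rouzbeh Davoudi
TL;DR: This paper studies efficient fine-tuning of multi-modal large language models (LLMs) for object detection in low-data regimes and finds that fine-tuned LLMs perform well with limited data, matching or surpassing traditional CNN models.
Details
Motivation: Traditional CNN models such as ResNet and YOLO perform well on image tasks, but multi-modal LLMs bring new capabilities such as dynamic context reasoning and language-guided prompting. Without fine-tuning, however, LLMs perform poorly on specialized visual tasks, so the paper focuses on how to fine-tune them efficiently for low-data settings.
Result: With fewer than 1,000 training images, fine-tuned LLMs improve accuracy by up to 36% over their zero-shot counterparts and match or surpass CNN baselines, confirming their data efficiency and adaptability.
Insight: Multi-modal LLMs show promise for cross-modal (vision and language) learning, especially in resource-constrained settings; the paper offers practical guidance and open-source code for future work.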
Abstract: The field of object detection and understanding is rapidly evolving, driven by advances in both traditional CNN-based models and emerging multi-modal large language models (LLMs). While CNNs like ResNet and YOLO remain highly effective for image-based tasks, recent transformer-based LLMs introduce new capabilities such as dynamic context reasoning, language-guided prompts, and holistic scene understanding. However, when used out-of-the-box, the full potential of LLMs remains underexploited, often resulting in suboptimal performance on specialized visual tasks. In this work, we conduct a comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images. A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data (fewer than 1,000 images) to achieve up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data. By exploring how language-guided models can be adapted for precise visual understanding with minimal supervision, our work contributes to the broader effort of bridging vision and language, offering novel insights into efficient cross-modal learning strategies. These findings highlight the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and provide actionable guidance for applying multi-modal transformers in low-resource visual environments. To support continued progress in this area, we have made the code used to fine-tune the models available in our GitHub, enabling future improvements and reuse in related applications.
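[2] Reproducible Evaluation of Data Augmentation and Loss Functions for Brain Tumor Segmentation cs.CV | cs.LG
Saumya B
TL;DR: This paper presents a reproducible evaluation of data augmentation and loss functions for brain tumor MRI segmentation, using a U-Net trained with focal loss, and reports the resulting performance.
Details
Motivation: Brain tumor segmentation is crucial for diagnosis and treatment planning, but challenges such as class imbalance and limited model generalization hinder progress. To provide a transparent, reproducible baseline, the author evaluates how data augmentation and loss functions affect segmentation performance.
Result: U-Net with focal loss reaches a precision of 90%, comparable to state-of-the-art results. Data augmentation (horizontal flip, rotation, scaling) further improved the model's robustness and generalization.
Insight: The paper highlights the importance of focal loss and data augmentation in brain tumor segmentation and stresses the value of reproducibility and transparency in research.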
Abstract: Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
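The focal loss mentioned in this abstract down-weights easy, well-classified pixels so that the rare tumor class contributes more to the gradient. The following is a minimal pure-Python sketch of the binary form, not the paper's implementation; the function name is hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are the values commonly quoted from the original focal loss paper, not the tuned values from this study.

```python
import math

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over per-pixel foreground probabilities.

    probs  : predicted foreground probabilities in [0, 1]
    labels : ground-truth labels, 1 for foreground and 0 for background
    gamma  : focusing parameter (assumed default, not the paper's setting)
    alpha  : foreground class weight (assumed default, not the paper's setting)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)         # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy; larger `gamma` shrinks the contribution of confidently correct pixels.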
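[3] Adjusting Initial Noise to Mitigate Memorization in Text-to-Image Diffusion Models cs.CV
Hyeonggeun Han, Sehwan Kim, Hyungjun Joo, Sangwoo Hong, Jungwoo Lee
TL;DR: This paper studies memorization of training data in text-to-image diffusion models and proposes adjusting the initial noise to reduce memorization while preserving image-text alignment.
Details
Motivation: Text-to-image diffusion models generate well but tend to memorize and replicate training data, raising privacy and copyright concerns. The authors find that memorization is closely tied to the choice of initial noise and explore adjusting it to mitigate the problem.
Result: Experiments show that the two proposed strategies significantly reduce memorization while keeping generated images well aligned with the input prompts.
Insight: The choice of initial noise is an important factor in the memorization behavior of diffusion models, and optimizing it can effectively mitigate memorization.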
Abstract: Despite their impressive generative capabilities, text-to-image diffusion models often memorize and replicate training data, prompting serious concerns over privacy and copyright. Recent work has attributed this memorization to an attraction basin, a region where applying classifier-free guidance (CFG) steers the denoising trajectory toward memorized outputs, and has proposed deferring CFG application until the denoising trajectory escapes this basin. However, such delays often result in non-memorized images that are poorly aligned with the input prompts, highlighting the need to promote earlier escape so that CFG can be applied sooner in the denoising process. In this work, we show that the initial noise sample plays a crucial role in determining when this escape occurs. We empirically observe that different initial samples lead to varying escape times. Building on this insight, we propose two mitigation strategies that adjust the initial noise, either collectively or individually, to find and utilize initial samples that encourage earlier basin escape. These approaches significantly reduce memorization while preserving image-text alignment.
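For readers unfamiliar with classifier-free guidance, the sketch below shows the standard CFG combination of unconditional and conditional noise predictions, plus a deferred start step of the kind the abstract attributes to prior work. The function name, the `cfg_start_step` parameter, and the choice to fall back to the plain conditional prediction before guidance starts are all assumptions made for illustration; the abstract does not specify them.

```python
def guided_noise(eps_uncond, eps_cond, step, guidance_scale=7.5, cfg_start_step=0):
    """Combine noise predictions for one denoising step.

    eps_uncond, eps_cond : noise predictions without and with the text prompt
    step                 : index of the current denoising step (0 = first step)
    cfg_start_step       : hypothetical step at which guidance is switched on
    """
    if step < cfg_start_step:
        # Assumption: while CFG is deferred, use the conditional prediction
        # unchanged (equivalent to a guidance scale of 1).
        return list(eps_cond)
    # Standard CFG: move from the unconditional prediction toward the
    # conditional one, extrapolating by the guidance scale.
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Under this framing, an initial noise sample that escapes the basin earlier permits a smaller `cfg_start_step`, so more of the trajectory is guided by the prompt.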
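[4] The Digital Mirror: Gender Bias and Occupational Stereotypes in AI-Generated Images cs.CV
Siiri Leppälampi, Sonja M. Hyrynsalmi, Erno Vanhala
TL;DR: The study finds gender stereotypes in occupational images generated by DALL-E 3 and Ideogram and calls for measures to improve diversity.
Details
Motivation: Generative AI is widely used for visual content creation, but existing research has overlooked representational bias, particularly occupational gender stereotypes.
Result: Both tools reinforce traditional gender stereotypes, though to varying degrees, highlighting the risk that AI image tools produce narrow representations.
Insight: Diversity needs to be considered in both the design and the use of AI image tools to avoid reinforcing social bias.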
Abstract: Generative AI offers vast opportunities for creating visualisations, such as graphics, videos, and images. However, recent studies around AI-generated visualisations have primarily focused on the creation process and image quality, overlooking representational biases. This study addresses this gap by testing representation biases in AI-generated pictures in an occupational setting and evaluating how two AI image generator tools, DALL-E 3 and Ideogram, compare. Additionally, the study discusses topics such as ageing and emotions in AI-generated images. As AI image tools are becoming more widely used, addressing and mitigating harmful gender biases becomes essential to ensure diverse representation in media and professional settings. In this study, over 750 AI-generated images of occupations were prompted. The thematic analysis results revealed that both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in AI-generated images, although to varying degrees. These findings emphasise that AI visualisation tools risk reinforcing narrow representations. In our discussion section, we propose suggestions for practitioners, individuals and researchers to increase representation when generating images with visible genders.
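[5] Dynamic Mixture-of-Experts for Visual Autoregressive Model cs.CV
Jort Vincenti, Metod Jazbec, Guoxuan Xia
TL;DR: The paper proposes a dynamic Mixture-of-Experts routing mechanism that reduces computational redundancy in visual autoregressive models (VAR), cutting compute and inference time without degrading generation quality.
Details
Motivation: VAR models call the Transformer repeatedly at increasing resolutions, which causes computational redundancy. The authors aim to balance compute cost and generation quality through dynamic routing.
Result: The method uses 20% fewer FLOPs and runs inference 11% faster, with image quality matching the dense baseline.
Insight: Dynamic routing can improve the compute efficiency of autoregressive models while preserving quality, offering a lightweight route to high-quality image generation.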
Abstract: Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture allows trading compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs and 11% faster inference while matching the image quality achieved by the dense baseline.
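[6] Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs cs.CV | cs.LG
Hanieh Shojaei Miandashti, Claus Brenner
TL;DR: The paper proposes an unsupervised OOD detection method based on hierarchical Bayesian Gaussian Mixture Models (GMMs) that separates model uncertainty from data uncertainty, substantially improving OOD detection in LiDAR semantic segmentation.
Details
Motivation: Existing OOD detection methods often conflate model and data uncertainty, misclassifying ambiguous in-distribution regions as OOD, or depend on auxiliary data or extra training stages. The paper aims to address both issues.
Result: On the SemanticKITTI dataset, AUROC improves by 18% and AUPRC by 22%, and FPR95 drops from 76% to 40%.
Insight: Hierarchical GMMs can effectively separate epistemic and aleatoric uncertainty, showing the potential of unsupervised methods for OOD detection.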
Abstract: In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out-of-distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in-distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty-based methods on the SemanticKITTI dataset, achieving an 18% improvement in AUROC, a 22% increase in AUPRC, and a 36-percentage-point reduction in FPR95 (from 76% to 40%), compared to the predictive entropy approach used in prior works.
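The AUROC and FPR95 figures quoted above can be read more easily with the metric definitions in hand. The sketch below computes both from per-sample uncertainty scores in pure Python. It assumes that higher scores mean "more likely OOD" and that OOD samples are the positive class; the function names are illustrative and do not come from the paper.

```python
import math

def auroc(ood_scores, id_scores):
    """Probability that a random OOD sample scores higher than a random
    in-distribution (ID) sample; ties count as half a win."""
    wins = 0.0
    for s_ood in ood_scores:
        for s_id in id_scores:
            if s_ood > s_id:
                wins += 1.0
            elif s_ood == s_id:
                wins += 0.5
    return wins / (len(ood_scores) * len(id_scores))

def fpr_at_95_tpr(ood_scores, id_scores):
    """Fraction of ID samples flagged as OOD at the threshold that still
    detects at least 95% of the OOD samples."""
    ranked = sorted(ood_scores, reverse=True)
    k = math.ceil(0.95 * len(ranked))  # number of OOD samples that must be detected
    threshold = ranked[k - 1]          # lowest score among those k samples
    false_pos = sum(1 for s in id_scores if s >= threshold)
    return false_pos / len(id_scores)
```

A lower FPR95 means fewer known-class points are wrongly flagged when the detector is tuned to catch nearly all unknown objects, which is why the drop from 76% to 40% matters in practice.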
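[7] Hi-OSCAR: Hierarchical Open-set Classifier for Human Activity Recognition cs.CV | cs.AI | I.2
Conor McCarthy, Loes Quirijnen, Jan Peter van Zandwijk, Zeno Geradts, Marcel Worring
TL;DR: Hi-OSCAR is a hierarchical open-set classifier for human activity recognition (HAR) that recognizes known activities while rejecting unknown ones, and uses the class hierarchy to localize unknown activities at a finer granularity.
Details
Motivation: Conventional HAR classifiers cannot handle activities absent from the training data (the open-set problem), and existing methods do not fully exploit the hierarchical relationships among activity classes.
Result: Hi-OSCAR reaches state-of-the-art accuracy on known activities while effectively rejecting unknown ones.
Insight: A hierarchical structure helps open-set HAR and provides more context about unknown activities.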
Abstract: Within Human Activity Recognition (HAR), there is an insurmountable gap between the range of activities performed in life and those that can be captured in an annotated sensor dataset used in training. Failure to properly handle unseen activities seriously undermines any HAR classifier’s reliability. Additionally, within HAR, not all classes are equally dissimilar: some significantly overlap or encompass other sub-activities. Based on these observations, we arrange activity classes into a structured hierarchy. From there, we propose Hi-OSCAR: a Hierarchical Open-set Classifier for Activity Recognition that can identify known activities at state-of-the-art accuracy while simultaneously rejecting unknown activities. This not only enables open-set classification, but also allows for unknown classes to be localized to the nearest internal node, providing insight beyond a binary “known/unknown” classification. To facilitate this and future open-set HAR research, we collected a new dataset: NFI_FARED. NFI_FARED contains data from multiple subjects performing nineteen activities from a range of contexts, including daily living, commuting, and rapid movements, and is fully public and available for download.
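[8] Detection of high-frequency oscillations using time-frequency analysis cs.CV | physics.med-ph | 94A12, 62H30, 68T10 | I.5.4; I.4.7; J.3
Mostafa Mohammadpour, Mehdi Zekriyapanah Gashti, Yusif S. Gasimov
TL;DR: The paper proposes an unsupervised clustering method based on time-frequency analysis for automatically detecting high-frequency oscillations (HFOs) and validates it on clinical data from epilepsy patients.
Details
Motivation: HFOs are a new biomarker for identifying the epileptogenic zone, but current HFO detection is time-consuming, labor-intensive, and subjective, so automated methods are needed for research and clinical use.
Result: 1. On the controlled dataset, the method reaches a sensitivity of 97.67%, a precision of 98.57%, and an F-score of 97.78%. 2. In epilepsy patients, the ratio between HFO rates in resected and non-resected contacts is 0.73, and the results correlate more strongly with surgical outcomes. 3. The study confirms the association between removing HFOs, especially fast ripples, and seizure freedom.
Insight: 1. HFOs, especially fast ripples, are effective biomarkers of epileptogenicity. 2. Automated HFO detection gives a more precise basis for surgery and may improve surgical outcomes.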
Abstract: High-frequency oscillations (HFOs) are a new biomarker for identifying the epileptogenic zone. Mapping HFO-generating regions can improve the precision of resection sites in patients with refractory epilepsy. However, detecting HFOs remains challenging, and their clinical features are not yet fully defined. Visual identification of HFOs is time-consuming, labor-intensive, and subjective. As a result, developing automated methods to detect HFOs is critical for research and clinical use. In this study, we developed a novel method for detecting HFOs in the ripple and fast ripple frequency bands (80-500 Hz). We validated it using both controlled datasets and data from epilepsy patients. Our method employs an unsupervised clustering technique to categorize events extracted from the time-frequency domain using the S-transform. The proposed detector differentiates HFO events from spikes, background activity, and artifacts. Compared to existing detectors, our method achieved a sensitivity of 97.67%, a precision of 98.57%, and an F-score of 97.78% on the controlled dataset. In epilepsy patients, our results showed a stronger correlation with surgical outcomes, with a ratio of 0.73 between HFO rates in resected versus non-resected contacts. The study confirmed previous findings that HFOs are promising biomarkers of epileptogenicity in epileptic patients. Removing HFOs, especially fast ripples, leads to seizure freedom, while remaining HFOs lead to seizure recurrence.
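[9] Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry cs.CV | cs.AI
Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee
TL;DR: By analyzing task-relevant concepts in DINOv2, the paper reveals functional specialization, proposes a new geometric hypothesis (the Minkowski Representation Hypothesis, MRH), and discusses its implications for interpreting vision-transformer representations.
Details
Motivation: DINOv2 is widely used to recognize objects, scenes, and actions, yet what it perceives remains unclear. The paper analyzes its representations using the Linear Representation Hypothesis (LRH), operationalized with sparse autoencoders (SAEs).
Result: Classification uses "Elsewhere" concepts to implement negation; segmentation relies on boundary detectors; depth estimation draws on monocular depth cues. The SAE representations are partly dense, and the dictionary tends toward coherence rather than strict orthogonality.
Insight: Vision-transformer representations may be built from convex mixtures of archetypes rather than simple linear sparse codes, which offers a new perspective on how they work.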
Abstract: DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits “Elsewhere” concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors’ conceptual spaces and in the model’s mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.
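[10] Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding cs.CV
Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou
TL;DR: Hulu-Med is a transparent generalist medical vision-language model that combines a unified vision encoder with an LLM decoder to handle multiple medical modalities (text, 2D/3D images, video). Trained on 16.7M samples, it performs strongly across 30 benchmarks.
Details
Motivation: Clinical decision-making requires integrating multiple medical data modalities, but generalist vision-language models face opaque pipelines, data scarcity, and architectural inflexibility in the medical domain.
Result: Across 30 benchmarks, Hulu-Med surpasses leading open-source models and competes with proprietary systems, including on complex tasks in multilingual and rare-disease scenarios.
Insight: Transparent, efficient medical VLM design provides a tool for clinical AI, and open-sourcing the pipeline helps the field progress.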
Abstract: Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released on https://github.com/ZJUI-AI4H/Hulu-Med.
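[11] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation cs.CV
Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang
TL;DR: The paper presents Puffin, a unified camera-centric multimodal model that treats the camera as language in order to understand and generate scenes from arbitrary viewpoints.
Details
Motivation: Camera-centric understanding and generation are central to spatial intelligence, but they are typically studied separately, without a unified approach.
Result: Puffin outperforms specialized models on camera-centric tasks and, with instruction tuning, generalizes to diverse tasks.
Insight: Expressing camera parameters as language is an important direction for multimodal spatial intelligence and provides a unified framework for scene understanding and generation.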
Abstract: Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
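[12] BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities cs.CV | cs.RO
Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen
TL;DR: BEAR is a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on atomic embodied capabilities; the paper also proposes BEAR-Agent to strengthen MLLM perception and planning.
Details
Motivation: Systematic evaluation of MLLMs' embodied capabilities is lacking, and existing benchmarks mostly focus on specific domains such as planning or spatial understanding. BEAR fills this gap.
Result: BEAR-Agent substantially improves MLLM performance, yielding a 9.12% absolute gain and a 17.5% relative improvement on GPT-5.
Insight: Improving MLLMs' embodied capabilities benefits embodied tasks in simulated environments.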
Abstract: Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
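The abstract reports both an absolute and a relative gain for GPT-5, and the two together imply an approximate baseline score. The short calculation below makes that relationship explicit. The baseline and improved scores it produces are derived here from the two quoted figures, not reported in the abstract, and are only as precise as the rounding of those figures allows.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by: relative_gain = absolute_gain / baseline."""
    return absolute_gain / relative_gain

# Figures quoted in the abstract: 9.12 points absolute, 17.5% relative.
baseline = implied_baseline(9.12, 0.175)  # derived value, not a reported one
improved = baseline + 9.12                # derived value, not a reported one

print(f"implied baseline: {baseline:.1f}%, implied score with BEAR-Agent: {improved:.1f}%")
```

Read this way, the two numbers describe the same improvement: about nine points added to a baseline of roughly 52%.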
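[13] SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense cs.CV | cs.AI
Jiayang Liu, Daniel Tso, Yiming Bu, Qinru Qiu
TL;DR: SAFER-AiD is an adversarial defense framework inspired by the human visual system. It combines foveal-peripheral processing, saccadic eye movements, and cortical filling-in, using reinforcement-learning-guided selective sampling to reconstruct images, and it defends against adversarial attacks without retraining downstream classifiers.
Details
Motivation: Defenses against adversarial attacks usually rely on computationally intensive optimization such as adversarial training, whereas the human visual system is inherently robust to adversarial perturbations through biological mechanisms. The paper proposes a defense framework that mimics these mechanisms.
Result: On ImageNet, SAFER-AiD improves robustness across diverse classifiers and attack types, with significantly lower training overhead than conventional defenses.
Insight: Biological mechanisms of human vision, such as selective foveal sampling, suggest a new approach to adversarial defense: they can filter adversarial noise while simplifying system design.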
Abstract: Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention-guided non-homogeneous sparse sampling and predictive coding play a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.
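[14] Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform cs.CV | cs.LG | cs.RO
Gregory Yeghiyan, Jurius Azar, Devson Butani, Chan-Jin Chung
TL;DR: The paper presents a real-time spill detection system based on pretrained deep learning models and thermal imaging. It performs well across varied environments, with thermal models being faster and more robust.
Details
Motivation: The goal is a system that detects spills in real time and efficiently in complex environments, meeting the needs of safety-critical settings.
Result: Models such as VGG19 reach up to 100% accuracy, with inference times as low as 44 ms and model sizes under 350 MB; a VGG19 model trained on thermal images performs best.
Insight: Thermal imaging is more robust under varied lighting conditions and suits real-time safety applications.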
Abstract: This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.
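[15] LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution cs.CV
Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu
TL;DR: LinearSR is an efficient image super-resolution framework built on linear attention. It addresses training instability with the ESGF strategy, eases the perception-distortion trade-off with an SNR-based MoE, and introduces TAG, a lightweight guidance paradigm, to achieve efficient, high-quality super-resolution.
Details
Motivation: Self-attention-based super-resolution methods face a compute bottleneck from quadratic complexity (O(N^2)); linear attention (O(N)) is promising but remains largely untapped. LinearSR sets out to systematically solve the obstacles to using linear attention for super-resolution.
Result: LinearSR delivers state-of-the-art perceptual quality with high efficiency: its diffusion forward pass (1-NFE) reaches SOTA-level speed, and its overall multi-step inference time remains highly competitive.
Insight: Linear attention can play an important role in super-resolution; the key is to systematically resolve long-standing problems such as training instability and the perception-distortion trade-off.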
Abstract: Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention’s quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental training instability that causes catastrophic model divergence using our novel “knee point”-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our “precision-over-volume” principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
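To make the O(N^2) versus O(N) contrast concrete, the sketch below implements a generic kernelized linear attention in pure Python, alongside a pairwise reference that produces the same output at quadratic cost. It is an illustration of the general technique, not LinearSR's architecture: the `elu(x) + 1` feature map is an assumption borrowed from earlier linear-attention work, and all function names are hypothetical.

```python
import math

def feature_map(x):
    """elu(x) + 1, a positive feature map (assumed; not taken from LinearSR)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Non-causal linear attention: one pass over keys/values, one over queries."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum over tokens of phi(k) v^T
    k_sum = [0.0] * d_k                     # sum over tokens of phi(k)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for a in range(d_k):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * k_sum[a] for a in range(d_k))
        outputs.append([sum(phi_q[a] * kv[a][b] for a in range(d_k)) / norm
                        for b in range(d_v)])
    return outputs

def quadratic_reference(queries, keys, values):
    """Same output computed from all query-key pairs, at O(N^2) cost."""
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append([sum(w * v[b] for w, v in zip(weights, values)) / total
                        for b in range(len(values[0]))])
    return outputs
```

The saving comes from associativity: because the similarity is a plain dot product of feature maps, the key-value summary `kv` is built once and reused for every query, instead of forming an N-by-N attention matrix.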
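[16] Re-Identifying Kākā with AI-Automated Video Key Frame Extraction cs.CV | cs.AI
Paula Maddigan, Andrew Lensen, Rachael C. Shaw
TL;DR: The paper presents an AI-driven key frame extraction pipeline for re-identifying kākā, a threatened New Zealand parrot. It combines object detection, optical flow blur detection, and image encoding, providing an efficient, non-invasive alternative for wildlife monitoring.
Details
Motivation: Traditional wildlife monitoring methods, such as leg banding of birds, are time-consuming and invasive. Advances in AI and computer vision offer new options for smart conservation and automated monitoring, especially for re-identifying individual animals.
Result: The key frame sets extracted by the method achieve high accuracy in kākā re-identification, providing a foundation for future work in more diverse and challenging environments.
Insight: AI and computer vision can substantially improve wildlife monitoring and point toward non-invasive population monitoring.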
Abstract: Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of kākā (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification; however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in kākā re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising kākā individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.
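[17] Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization cs.CV
Shuo Xing, Soumik Dey, Mingyang Wu, Ashirbad Mishra, Hansi Wu
TL;DR: Q-Router is an agentic video quality assessment framework that uses dynamic routing and multi-tier integration of expert models to address the limited generalization, interpretability, and extensibility of existing methods.
Details
Motivation: Existing video quality assessment (VQA) models generalize poorly across diverse content and tasks and lack interpretability and extensibility. Q-Router aims to address these problems with an agentic, multi-tier routing system.
Result: Q-Router matches or surpasses existing VQA models on a variety of benchmarks while substantially improving generalization and interpretability. It also excels on the Q-Bench-Video benchmark and localizes spatiotemporal artifacts effectively.
Insight: Through dynamic routing and expert model integration, Q-Router shows the potential of agentic designs for VQA and suggests a way to build reward functions for video generation models.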
Abstract: Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision–language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.
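[18] Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering cs.CV
Yuanhao Zou, Zhaozheng Yin
TL;DR: The paper proposes a framework that improves medical visual question answering (Med-VQA) through multi-level modality alignment, hard negative mining, and selective knowledge fusion.
Details
Motivation: Med-VQA currently lacks a unified solution for modality alignment, hard negatives remain under-explored, and common knowledge fusion techniques may introduce irrelevant information.
Result: The framework outperforms the previous state of the art on the RAD-VQA, SLAKE, PathVQA, and VQA-2019 datasets.
Insight: Hard negatives and multi-level modality alignment are key to improving Med-VQA performance; selective knowledge fusion reduces noise.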
Abstract: Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods like contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces the hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019.
[19] SkipSR: Faster Super Resolution with Token Skipping cs.CV | cs.AI | cs.LGPDF
Rohan Choudhury, Shanchuan Lin, Jianyi Wang, Hao Chen, Qi Zhao
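TL;DR: SkipSR substantially accelerates video super-resolution by skipping SR computation on low-detail regions, preserving perceptual quality while significantly reducing computation.
Details
Motivation: Current diffusion-based super-resolution methods process all pixels uniformly, which is computationally expensive and limits scalability to higher resolutions and longer videos.
Result: On standard SR benchmarks, SkipSR achieves up to 60% faster end-to-end latency than prior models on 720p videos, with no perceptible loss in quality.
Insight: Many regions in video are low-detail; skipping SR computation on them does not affect overall quality but significantly improves efficiency.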
Abstract: Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
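The region-skipping idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how low-detail regions are identified, so patch variance is used here as a stand-in criterion, and `patch` and `threshold` are hypothetical parameters.

```python
from statistics import pvariance


def low_detail_mask(frame, patch=4, threshold=1e-3):
    """Return a grid of booleans; True marks a patch that could be skipped.

    frame: 2D list of floats in [0, 1] (a low-resolution grayscale frame).
    Patch variance is an assumed proxy for "detail", not SkipSR's criterion.
    """
    height, width = len(frame), len(frame[0])
    mask = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            pixels = [
                frame[y][x]
                for y in range(top, min(top + patch, height))
                for x in range(left, min(left + patch, width))
            ]
            row.append(pvariance(pixels) < threshold)
        mask.append(row)
    return mask


def skipped_fraction(mask):
    """Fraction of patches that would bypass the expensive SR model."""
    flags = [flag for row in mask for flag in row]
    return sum(flags) / len(flags)


# Toy 8x8 frame: flat left half, checkerboard texture on the right half.
frame = [[0.5 if x < 4 else float((x + y) % 2) for x in range(8)] for y in range(8)]
mask = low_detail_mask(frame)
```

In this toy frame the two flat patches are marked as skippable and the two textured patches are not, so only half of the patches would be sent to the SR model.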
[20] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition cs.CV | cs.AIPDF
Yiyang Huang, Yizhou Wang, Yun Fu
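TL;DR: D-CoDe is a training-free adaptation framework that extends image-pretrained vision-language models to the video domain through dynamic compression and question decomposition, addressing the perception bottleneck and token overload.
Details
Motivation: Dense, temporally extended visual inputs in video exceed the capacity of image-based models, causing a perception bottleneck and token overload that limit the use of image-pretrained VLMs in video tasks.
Result: Experiments show that D-CoDe improves video understanding across various benchmarks, with strong performance on a challenging long-video benchmark.
Insight: D-CoDe extends image-pretrained models to video without additional training, offering an efficient route to adapting them for video-language tasks.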
Abstract: Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
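The adaptive frame selection step of dynamic compression might look something like the sketch below. It is an assumption-laden illustration rather than D-CoDe's actual procedure: frames are represented by hypothetical, non-zero feature vectors, and a frame is kept only when its cosine similarity to the last kept frame falls below a hypothetical `max_similarity` threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def select_representative_frames(features, max_similarity=0.9):
    """Greedily keep a frame only if it differs enough from the last kept one.

    features: list of per-frame feature vectors (assumed to come from some
    image encoder; the encoder itself is outside this sketch).
    Returns the indices of the kept frames.
    """
    kept = [0]  # always keep the first frame
    for index in range(1, len(features)):
        if cosine_similarity(features[index], features[kept[-1]]) < max_similarity:
            kept.append(index)
    return kept


# Toy clip: frames 0-1 look alike, frames 2-3 look alike, frame 4 returns to the first scene.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 0.0]]
kept = select_representative_frames(features)
```

Here the near-duplicate frames 1 and 3 are dropped, which reduces the number of visual tokens passed to the model while keeping one frame per distinct scene.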
[21] FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation cs.CVPDF
Hongrui Wu, Zhicheng Gao, Jin Cao, Kelu Yao, Wen Shen
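TL;DR: FOLK is a fast open-vocabulary 3D instance segmentation method based on label-guided knowledge distillation; by classifying instances directly from the 3D point cloud, it avoids the noise and computational overhead of 2D mapping.
Details
Motivation: Existing methods rely on 2D mapping, which introduces occlusion noise and high computational cost.
Result: FOLK reaches an AP50 of 35.7 on ScanNet200 while running approximately 6.0x to 152.2x faster than previous methods.
Insight: Learning directly from 3D point clouds avoids the noise of 2D mapping and greatly improves inference efficiency.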
Abstract: Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK conducted experiments on the ScanNet200 and Replica datasets, achieving state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All codes will be released after the paper is accepted.
[22] Modeling Time-Lapse Trajectories to Characterize Cranberry Growth cs.CV | I.4.7PDF
Ronan John, Anis Chihoub, Ryan Meegan, Gina Sidelli, Jeffery Neyhart
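TL;DR: This paper proposes a self-supervised ViT fine-tuning method for modeling time-lapse trajectories of cranberry growth; it avoids tedious image annotation and visualizes crop growth as 2D temporal tracks.
Details
Motivation: Traditional cranberry growth monitoring is manual, time-consuming, and inefficient. Deep learning is promising, but hard-to-interpret high-dimensional features and annotation requirements limit its practicality.
Result: The method can predict growth over time and distinguish temporal differences between varieties; the accompanying dataset covers eight varieties, each observed 52 times over the growing season.
Insight: Self-supervised learning can substantially reduce annotation requirements, and the resulting 2D tracks provide an interpretable time-series model of crop growth.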
Abstract: Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github.com/ronan-39/tlt/).
[23] PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning cs.CV | cs.LGPDF
Daiki Yoshikawa, Takashi Matsubara
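TL;DR: PHyCLIP is a new vision-language representation learning method that unifies hierarchy and compositionality through an ℓ₁-product metric on a product of hyperbolic factors, outperforming existing single-space approaches.
Details
Motivation: Existing vision-language models struggle to represent both the hierarchy of concepts (e.g., animal taxonomy) and their compositional structure (e.g., "a dog in a car"), and in particular to capture both semantic structures in a single model.
Result: On zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks, PHyCLIP outperforms existing single-space approaches and yields more interpretable embedding structures.
Insight: A product of hyperbolic spaces offers a new paradigm for multimodal representation learning that naturally unifies semantic hierarchy and compositionality.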
Abstract: Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., “a dog in a car” $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
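An ℓ₁-product metric over hyperbolic factors can be written as a sum of per-factor hyperbolic distances. The sketch below is illustrative only and rests on assumptions not stated in the abstract: each factor is modeled as a Poincaré ball of curvature -1 (the paper may use a different model of hyperbolic space), and the toy embeddings are hand-picked rather than learned.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


def l1_product_distance(x_factors, y_factors):
    """l1-product metric: sum of hyperbolic distances over the factors."""
    return sum(poincare_distance(u, v) for u, v in zip(x_factors, y_factors))


# Toy embeddings with two 2D factors (hypothetical "animal" and "vehicle" families).
dog = [[0.6, 0.0], [0.0, 0.0]]
car = [[0.0, 0.0], [0.0, 0.6]]
dog_in_car = [[0.6, 0.0], [0.0, 0.6]]

d_dog_car = l1_product_distance(dog, car)
d_via_composite = l1_product_distance(dog_in_car, dog) + l1_product_distance(dog_in_car, car)
```

Because the metric adds up over factors, the composite point differs from `dog` only in the second factor and from `car` only in the first, so the two partial distances sum to the distance between `dog` and `car` in this toy setup.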
[24] SegTrans: Transferable Adversarial Examples for Segmentation Models cs.CVPDF
Yufei Song, Ziqi Zhou, Qi Lu, Hangtao Zhang, Yifan Hu
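TL;DR: SegTrans is a new framework for generating transferable adversarial examples; it improves the transferability of adversarial attacks on segmentation models through local-region partitioning and semantic remapping.
Details
Motivation: Existing adversarial attacks transfer poorly across segmentation models, mainly because of complex contextual dependencies and feature distribution gaps between surrogate and target models. SegTrans aims to improve transfer attack effectiveness.
Result: Evaluated on two datasets, four segmentation models, and three backbone networks, SegTrans raises the transfer attack success rate by 8.55% on average and improves computational efficiency by more than 100%.
Insight: Optimizing over local semantic information outperforms global approaches and is better suited to the transferability problem across segmentation models.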
Abstract: Segmentation models exhibit significant vulnerability to adversarial examples in white-box settings, but existing adversarial attack methods often show poor transferability across different segmentation models. While some researchers have explored transfer-based adversarial attack (i.e., transfer attack) methods for segmentation models, the complex contextual dependencies within these models and the feature distribution gaps between surrogate and target models result in unsatisfactory transfer success rates. To address these issues, we propose SegTrans, a novel transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples. These enhanced samples replace the original ones for perturbation optimization, thereby improving the transferability of adversarial examples across different segmentation models. Unlike existing methods, SegTrans only retains local semantic information from the original input, rather than using global semantic information to optimize perturbations. Extensive experiments on two benchmark datasets, PASCAL VOC and Cityscapes, four different segmentation models, and three backbone networks show that SegTrans significantly improves adversarial transfer success rates without introducing additional computational overhead. Compared to the current state-of-the-art methods, SegTrans achieves an average increase of 8.55% in transfer attack success rate and improves computational efficiency by more than 100%.
[25] Defense against Unauthorized Distillation in Image Restoration via Feature Space Perturbation cs.CVPDF
Han Hu, Zhuoran Zheng, Chen Lyu
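TL;DR: This paper proposes ASVP, a defense for image restoration models that perturbs singular values in feature space to prevent unauthorized knowledge distillation and protect model intellectual property.
Details
Motivation: Knowledge distillation attacks threaten the intellectual property of deep learning models; existing defenses mainly target classification and are hard to apply directly to generative tasks such as image restoration.
Result: Across five image restoration tasks, ASVP substantially degrades student network performance (PSNR reduced by up to 4 dB, SSIM by 60-75%), with negligible impact on the teacher model.
Insight: Defenses for generative tasks need structured perturbations in feature space rather than output perturbations alone; ASVP offers an effective solution.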
Abstract: Knowledge distillation (KD) attacks pose a significant threat to deep model intellectual property by enabling adversaries to train student networks using a teacher model’s outputs. While recent defenses in image classification have successfully disrupted KD by perturbing output probabilities, extending these methods to image restoration is difficult. Unlike classification, restoration is a generative task with continuous, high-dimensional outputs that depend on spatial coherence and fine details. Minor perturbations are often insufficient, as students can still learn the underlying mapping. To address this, we propose Adaptive Singular Value Perturbation (ASVP), a runtime defense tailored for image restoration models. ASVP operates on internal feature maps of the teacher using singular value decomposition (SVD). It amplifies the top-k singular values to inject structured, high-frequency perturbations, disrupting the alignment needed for distillation. This hinders student learning while preserving the teacher’s output quality. We evaluate ASVP across five image restoration tasks: super-resolution, low-light enhancement, underwater enhancement, dehazing, and deraining. Experiments show ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75%, with negligible impact on the teacher’s performance. Compared to prior methods, ASVP offers a stronger and more consistent defense. Our approach provides a practical solution to protect open-source restoration models from unauthorized knowledge distillation.
[26] RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos cs.CV | cs.AIPDF
Zixi Yang, Jiapeng Li, Muxi Diao, Yinuo Jing, Kongming Liang
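TL;DR: RO-Bench is the first benchmark for evaluating the robustness of multimodal large language models (MLLMs) on dynamic out-of-distribution (OOD) counterfactual video test sets. The data is generated by editing video style, objects, background, and their compositions; the study finds that existing models degrade substantially, while fine-tuning improves robustness.
Details
Motivation: Although MLLMs perform well on video understanding tasks, their robustness to manipulated video content remains unclear, so a systematic evaluation is needed.
Result: Existing models degrade substantially on counterfactual videos; after fine-tuning with counterfactual data, performance improves by 21.73% on Ro-Bench and by 12.78% across 20 tasks in the MVBench dataset.
Insight: Counterfactual data can effectively strengthen the video understanding ability of MLLMs, especially on dynamic OOD videos.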
Abstract: Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
[27] Denoised Diffusion for Object-Focused Image Augmentation cs.CV | cs.LGPDF
Nisha Pillai, Aditi Virupakshaiah, Harrison W. Smith, Amanda J. Ashworth, Prasanna Gowda
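TL;DR: This paper proposes an animal-focused data augmentation framework that uses segmentation and diffusion-based synthesis to generate realistic, diverse scenes, improving animal detection and health monitoring.
Details
Motivation: Modern agricultural monitoring systems rely on multiple data sources, but drone-based animal health monitoring suffers from data scarcity and scene-specific issues. Existing transfer learning approaches are limited by the lack of datasets that reflect specific farm conditions.
Result: Initial experiments show that the augmented dataset yields better performance than the baseline models on the animal detection task.
Insight: The method generates domain-specific data even in data-scarce settings, bridging the gap between limited data and practical applicability.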
Abstract: Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
[28] Unleashing Perception-Time Scaling to Multimodal Reasoning Models cs.CV | cs.CLPDF
Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo
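TL;DR: The paper proposes Perception-Time Scaling (PTS), a paradigm that introduces token-rich perception and sub-problem decomposition at the visual perception stage, significantly improving the perception accuracy of multimodal reasoning models and yielding consistent gains under inference-time scaling.
Details
Motivation: Current large vision-language models (LVLMs) have limited precision on visual perception tasks, and conventional inference-time scaling offers only marginal gains. The authors attribute this to models treating visual understanding as a one-shot output without modeling the perceptual process.
Result: PTS raises high-precision performance on DisTANCE from 8.0% to 64.7% and performs well on out-of-domain and math reasoning tasks.
Insight: PTS introduces more perception-related tokens and increases attention to image tokens, highlighting the importance of modeling the perceptual process for multimodal reasoning.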
Abstract: Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model’s attention to image tokens. Our code and data will be publicly released.
[29] Hierarchical Scheduling for Multi-Vector Image Retrieval cs.CV | cs.DC | cs.IRPDF
Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng
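TL;DR: This paper proposes HiMIR, a hierarchical scheduling framework that improves the accuracy and efficiency of multi-vector image retrieval through multi-granularity alignment and redundancy minimization.
Details
Motivation: Conventional multi-vector retrieval (MVR) methods are limited in aligning queries with image objects and by redundant image segments, which hurts retrieval accuracy and efficiency.
Result: HiMIR substantially improves retrieval accuracy and reduces computation by up to 3.5 times.
Insight: Hierarchical design and redundancy minimization are key to improving multi-vector image retrieval.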
Abstract: To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present HiMIR, an efficient scheduling framework for image retrieval. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to minimize unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
[30] HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images cs.CVPDF
Zichuan Wang, Bo Peng, Songlin Yang, Zhenchen Tang, Jing Dong
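TL;DR: The paper proposes the first quality assessment task targeting hand regions in generated images and demonstrates applications in downstream tasks such as generation quality optimization and AIGC detection. With the HandPair dataset and the HandEval model, training is efficient and requires no manual annotation, and experiments show it outperforms existing SOTA methods.
Details
Motivation: Although text-to-image (T2I) models have greatly improved overall image quality, they still struggle with details in complex local regions such as hands. The lack of hand quality assessment limits downstream tasks such as human-centric generation quality optimization and AIGC detection.
Result: HandEval aligns better with human judgments than existing SOTA methods when assessing hand quality in generated images. Integrated into image generation and AIGC detection pipelines, it markedly improves hand realism and detection accuracy, respectively.
Insight: Hand quality assessment is a neglected but important task; combining an MLLM with prior knowledge of hand keypoints improves assessment accuracy without costly manual annotation.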
Abstract: Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle in the generation of accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting downstream task performance like human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, prominently enhancing generated hand realism and detection accuracy, respectively, confirming its universal effectiveness in downstream applications. Code and dataset will be available.
[31] On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models cs.CV | cs.AI | cs.CLPDF
Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun
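TL;DR: This paper studies object hallucination in large vision-language models (LVLMs), argues that uncertain visual tokens in the vision encoder (VE) are a key factor behind hallucination, and proposes mitigating it by identifying these tokens with adversarial perturbations and masking them.
Details
Motivation: To investigate the root causes of object hallucination in LVLMs, in particular the effect of visual token uncertainty.
Result: Experiments show that the method significantly reduces object hallucination in LVLMs and works synergistically with existing methods.
Insight: Visual token uncertainty is a key contributor to object hallucination in LVLMs; identifying uncertain tokens via adversarial perturbations and masking them effectively mitigates the problem.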
Abstract: Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, there are still crucial challenges in LVLMs such as object hallucination, generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor that contributes to object hallucination. Our statistical analysis found that there are positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can synergistically work with other prior arts.
[32] Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array cs.CV | cs.ROPDF
Yitong Chen, Xinyao Xu, Ping Zhu, Xinyong Han, Fangbo Qin
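TL;DR: This paper proposes a vision-based anomaly detection framework for monitoring the reliability of flexible microelectrode (FME) implantation into the brain cortex. Using microscopic cameras, the framework checks for anomalies at four checkpoints, combining a pretrained vision transformer (ViT) with an improved feature sampling method.
Details
Motivation: FME implantation into the brain cortex is challenging because of the probe's deformable, fiber-like structure and its interaction with critical bio-tissue. Ensuring reliability and safety requires careful monitoring of the implantation process.
Result: The proposed methods are validated on image datasets collected from the implantation system.
Insight: Combining ViT features with progressive-granularity patch feature sampling supports anomaly detection in a complex biomedical setting and provides a basis for reliable robot-assisted implantation.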
Abstract: Flexible microelectrode (FME) implantation into brain cortex is challenging due to the deformable fiber-like structure of FME probe and the interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.
[33] MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation cs.CV | cs.AIPDF
Dominik Winter, Mai Bui, Monica Azqueta Gavaldon, Nicolas Triltsch, Marco Rosati
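TL;DR: This paper proposes a Multimodal Semantic Diffusion Model (MSDM) that generates realistic image-mask pairs for cell and nuclei segmentation, conditioning on morphology, color characteristics, and metadata to alleviate data scarcity.
Details
Motivation: Annotated data in computational pathology is scarce, especially for rare or atypical morphologies, which hinders cell and nuclei segmentation. Synthetic data offers a cost-effective alternative.
Result: Quantitative analysis shows that synthetic images closely match real data (low Wasserstein distance), and adding synthetic samples (e.g., columnar cells) significantly improves segmentation accuracy on those cells.
Insight: Multimodal diffusion-based augmentation can systematically enrich datasets and directly target model deficiencies, improving the robustness and generalizability of cell and nuclei segmentation models.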
Abstract: Scarcity of annotated data, particularly for rare or atypical morphologies, presents significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasets with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on columnar cells. This strategy systematically enriches data sets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models. Thereby, we pave the way for broader application of generative models in computational pathology.
[34] Polar Separable Transform for Efficient Orthogonal Rotation-Invariant Image Representation cs.CVPDF
Satya P. Singh, Rashmi Chaudhry, Anand Srivastava, Jagath C. Rajapakse
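TL;DR: PSepT is a polar separable orthogonal transform built as a tensor product of DCT radial bases and Fourier harmonic angular bases; it addresses the computational complexity and numerical instability of classical methods.
Details
Motivation: Classical orthogonal moment methods (such as Zernike and pseudo-Zernike moments) cannot process radial and angular components separately in polar coordinates, leading to high computational complexity (O(n^3N^2) to O(n^6N^2)) and numerical instability (condition number O(N^4)). PSepT aims to overcome these limits and provide an efficient, stable image representation.
Result: Experiments show better numerical stability and computational efficiency and competitive classification performance, while preserving exact reconstruction. PSepT also enables high-order moment analysis that was infeasible with classical methods.
Insight: PSepT's separability opens new directions for image analysis, particularly where efficient and stable computation is needed (e.g., classification or reconstruction), and demonstrates the potential of polar-coordinate transforms.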
Abstract: Orthogonal moment-based image representations are fundamental in computer vision, but classical methods suffer from high computational complexity and numerical instability at large orders. Zernike and pseudo-Zernike moments, for instance, require coupled radial-angular processing that precludes efficient factorization, resulting in $\mathcal{O}(n^3N^2)$ to $\mathcal{O}(n^6N^2)$ complexity and $\mathcal{O}(N^4)$ condition number scaling for the $n$th-order moments on an $N\times N$ image. We introduce PSepT (Polar Separable Transform), a separable orthogonal transform that overcomes the non-separability barrier in polar coordinates. PSepT achieves complete kernel factorization via tensor-product construction of Discrete Cosine Transform (DCT) radial bases and Fourier harmonic angular bases, enabling independent radial and angular processing. This separable design reduces computational complexity to $\mathcal{O}(N^2 \log N)$, memory requirements to $\mathcal{O}(N^2)$, and condition number scaling to $\mathcal{O}(\sqrt{N})$, representing exponential improvements over polynomial approaches. PSepT exhibits orthogonality, completeness, energy conservation, and rotation-covariance properties. Experimental results demonstrate better numerical stability, computational efficiency, and competitive classification performance on structured datasets, while preserving exact reconstruction. The separable framework enables high-order moment analysis previously infeasible with classical methods, opening new possibilities for robust image analysis applications.
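A separable DCT-radial, Fourier-angular transform of the kind described above can be sketched in a few lines. This is a rough illustration under several assumptions, not the PSepT implementation: the image is assumed to be already resampled onto a polar grid, the transforms are unnormalized and computed by direct summation (so this sketch does not reach the fast complexity quoted in the abstract), and all function names are hypothetical.

```python
import cmath
import math


def polar_separable_coefficients(polar_image):
    """Apply a DFT along the angle, then a DCT-II along the radius.

    polar_image[r][t]: samples on a polar grid (rings x angular steps).
    Returns coefficients indexed as [radial order k][angular order m].
    """
    num_radii = len(polar_image)
    num_angles = len(polar_image[0])
    # Angular pass: discrete Fourier transform of each ring.
    angular = [
        [
            sum(ring[t] * cmath.exp(-2j * math.pi * m * t / num_angles)
                for t in range(num_angles))
            for m in range(num_angles)
        ]
        for ring in polar_image
    ]
    # Radial pass: DCT-II across rings, independently for each angular order.
    return [
        [
            sum(angular[r][m] * math.cos(math.pi * (r + 0.5) * k / num_radii)
                for r in range(num_radii))
            for m in range(num_angles)
        ]
        for k in range(num_radii)
    ]


def rotate_polar(polar_image, shift):
    """Rotate by a whole number of angular steps (a cyclic shift of each ring)."""
    return [ring[shift:] + ring[:shift] for ring in polar_image]


def magnitudes(coefficients):
    return [[abs(value) for value in row] for row in coefficients]


# Toy polar image with 4 rings and 8 angular samples.
image = [[float((3 * r + t * t) % 5) for t in range(8)] for r in range(4)]
original = magnitudes(polar_separable_coefficients(image))
rotated = magnitudes(polar_separable_coefficients(rotate_polar(image, 3)))
```

Because the two passes are independent, a rotation by whole angular steps only changes the phase of each coefficient, so `original` and `rotated` agree up to floating-point error; this is the usual way rotation-invariant descriptors are read off a rotation-covariant transform.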
[35] Training Feature Attribution for Vision Models cs.CV | cs.LGPDF
Aziz Bacha, Thomas George
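TL;DR: This paper proposes "training feature attribution", which links test predictions to specific regions of specific training images, providing finer-grained explanations and revealing harmful examples and spurious correlations that conventional attribution methods miss.
Details
Motivation: Deep neural networks are regarded as opaque systems, and explainability methods are needed to improve trust and accountability. Existing methods typically attribute predictions either to input features or to influential training examples; this paper argues the two should be studied jointly.
Result: Experiments show the method identifies harmful training examples that drive misclassifications and reveals spurious correlations (such as patch-based shortcuts) that conventional methods fail to expose.
Insight: Training feature attribution gives a higher-resolution explanation of the internal behavior of deep models, helping to understand their decisions and improve trustworthiness.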
Abstract: Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores training feature attribution, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
[36] Online Topological Localization for Navigation Assistance in Bronchoscopy cs.CVPDF
Clara Tomasini, Luis Riazuelo, Ana C. Murillo
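TL;DR: This paper proposes an image-based topological localization pipeline for bronchoscopy that provides navigation assistance without a patient CT scan. The method is trained only on phantom data, avoiding the high cost of labeling real data, and generalizes well.
Details
Motivation: Bronchoscopy is a fundamental procedure in respiratory medicine in which clinicians navigate the complex bronchial tree to reach a target area. Existing navigation techniques rely on patient CT scans and additional sensors, adding complexity and cost. The paper proposes CT-free topological localization to simplify navigation assistance.
Result: The method surpasses existing methods, particularly on real-data test sequences.
Insight: For some medical navigation tasks, topological localization can replace accurate metric localization, meeting the need while substantially reducing cost and technical complexity.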
Abstract: Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need of patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.
[37] Instance-Level Generation for Representation Learning cs.CVPDF
Yankun Wu, Zakaria Laskar, Giorgos Kordopatis-Zilos, Noa Garcia, Giorgos Tolias
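TL;DR: The paper proposes synthetically generating diverse object instances to train representation learning models. This addresses the difficulty of annotating data for instance-level recognition (ILR) and significantly improves retrieval performance without relying on real images.
Details
Motivation: ILR needs fine-grained annotated data, but large-scale annotated datasets are hard to obtain, which limits real-world use. The paper aims to solve this with synthetic data as an efficient source of ILR training data.
Result: On seven ILR benchmarks spanning multiple domains, fine-tuning on the generated synthetic data significantly improves model performance, demonstrating the method's effectiveness and generalization.
Insight: Synthetic data generation can be an effective answer to data scarcity in fine-grained tasks, with broad potential in areas such as ILR.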
Abstract: Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR’s real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach offers a new, efficient, and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where the only input is the names of the target domains, unlocking a wide range of real-world applications.
[38] TARO: Toward Semantically Rich Open-World Object Detection cs.CVPDF
Yuchen Zhang, Yao Lu, Johannes Betz
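TL;DR: TARO is an open-world object detection framework that not only detects unknown objects but also assigns them to coarse parent categories in a semantic hierarchy, improving decision-making in safety-critical scenarios.
Details
Motivation: Conventional object detectors are confined to a closed-world assumption and cannot handle unknown objects in real scenes. Existing open-set detectors label all unknowns as a single class, which carries no semantic information and is of little help for decision-making.
Result: Experiments show that TARO assigns up to 29.9% of unknown objects to meaningful coarse categories, reduces confusion between unknown and known classes, and is competitive in both unknown recall and known mAP.
Insight: A semantic hierarchy lets the model distinguish unknown objects explicitly and improves generalization in the open world, which suits safety-critical settings such as autonomous driving.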
Abstract: Modern object detectors are largely confined to a “closed-world” assumption, limiting them to a predefined set of classes and posing risks when encountering novel objects in real-world scenarios. While open-set detection methods aim to address this by identifying such instances as ‘Unknown’, this is often insufficient. Rather than treating all unknowns as a single class, assigning them more descriptive subcategories can enhance decision-making in safety-critical contexts. For example, identifying an object as an ‘Unknown Animal’ (requiring an urgent stop) versus ‘Unknown Debris’ (requiring a safe lane change) is far more useful than just ‘Unknown’ in autonomous driving. To bridge this gap, we introduce TARO, a novel detection framework that not only identifies unknown objects but also classifies them into coarse parent categories within a semantic hierarchy. TARO employs a unique architecture with a sparsemax-based head for modeling objectness, a hierarchy-guided relabeling component that provides auxiliary supervision, and a classification module that learns hierarchical relationships. Experiments show TARO can categorize up to 29.9% of unknowns into meaningful coarse classes, significantly reduce confusion between unknown and known classes, and achieve competitive performance in both unknown recall and known mAP. Code will be made available.
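The abstract names a sparsemax-based head but does not describe how TARO wires it into objectness modeling. As a minimal sketch of the generic sparsemax function only (the Euclidean projection of a score vector onto the probability simplex), and not of TARO's architecture, the following pure-Python version shows why it differs from softmax: low-scoring entries receive exactly zero probability. The function name and the example scores are illustrative assumptions.

```python
def sparsemax(scores):
    """Generic sparsemax: project `scores` onto the probability simplex.

    Illustrative only; how TARO uses sparsemax inside its objectness head
    is not specified in the abstract.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    sorted_scores = sorted(scores, reverse=True)
    running_sum = 0.0
    support_size, support_sum = 0, 0.0
    for k, value in enumerate(sorted_scores, start=1):
        running_sum += value
        # Entry k stays in the support while 1 + k * z_(k) > sum of the top-k scores.
        if 1.0 + k * value > running_sum:
            support_size, support_sum = k, running_sum
    # Threshold that makes the surviving entries sum to one.
    tau = (support_sum - 1.0) / support_size
    return [max(v - tau, 0.0) for v in scores]


probs = sparsemax([1.0, 0.8, 0.1])  # the lowest score is clipped to exactly 0.0
```

Unlike softmax, the output can contain exact zeros, which is one plausible reason to prefer it when a head should commit to a small set of hypotheses.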
[39] Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption cs.CVPDF
Johann-Friedrich Feiden, Tim Küchler, Denis Zavadski, Bogdan Savchynskyy, Carsten Rother
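TL;DR: The paper proposes oVDA, an online video depth estimation method that removes the batch-processing dependence of Video Depth Anything (VDA). By borrowing techniques from LLMs (caching latent features at inference and masking frames during training), it achieves low memory consumption and temporal consistency.
Details
Motivation: VDA performs strongly but relies on batch processing, so it cannot serve online applications, which limits its deployment in real-time systems such as edge devices.
Result: oVDA outperforms other online methods in both accuracy and VRAM usage, and runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device.
Insight: Techniques from LLMs can transfer to video depth estimation, improving real-time capability and memory efficiency for edge deployment.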
Abstract: Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames at training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
[40] Modern Deep Learning Approaches for Cricket Shot Classification: A Comprehensive Baseline Study cs.CV | cs.AIPDF
Sungwoo Kang
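TL;DR: This paper presents the first comprehensive baseline study of cricket shot classification, comparing seven deep learning approaches across four research paradigms. It finds that the high accuracies reported in the literature drop sharply under standardized re-implementation, and it proposes an approach combining EfficientNet-B0 with a GRU that reaches 92.25% accuracy.
Details
Motivation: Cricket shot classification is a challenging problem in sports video analysis that requires effective modeling of spatial and temporal features. The high accuracies reported in the literature are hard to reproduce in practice, so a systematic benchmark study is needed.
Result: Accuracies reported in the literature (e.g., 96% and 99.2%) drop sharply under standardized re-implementation (to 46.0% and 55.6%). The proposed EfficientNet-B0 + GRU approach reaches 92.25% accuracy.
Insight: Standardized evaluation protocols are essential in sports video analysis research; modern architectures and systematic optimization can substantially improve performance.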
Abstract: Cricket shot classification from video sequences remains a challenging problem in sports video analysis, requiring effective modeling of both spatial and temporal features. This paper presents the first comprehensive baseline study comparing seven different deep learning approaches across four distinct research paradigms for cricket shot classification. We implement and systematically evaluate traditional CNN-LSTM architectures, attention-based models, vision transformers, transfer learning approaches, and modern EfficientNet-GRU combinations on a unified benchmark. A critical finding of our study is the significant performance gap between claims in academic literature and practical implementation results. While previous papers reported accuracies of 96% (Balaji LRCN), 99.2% (IJERCSE), and 93% (Sensors), our standardized re-implementations achieve 46.0%, 55.6%, and 57.7% respectively. Our modern SOTA approach, combining EfficientNet-B0 with a GRU-based temporal model, achieves 92.25% accuracy, demonstrating that substantial improvements are possible with modern architectures and systematic optimization. All implementations follow modern MLOps practices with PyTorch Lightning, providing a reproducible research platform that exposes the critical importance of standardized evaluation protocols in sports video analysis research.
[41] Towards Safer and Understandable Driver Intention Prediction cs.CV | cs.AI | cs.HCPDF
Mukilan Karuppasamy, Shankar Gangisetty, Shyam Nandan Rai, Carlo Masone, C V Jawahar
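TL;DR: This paper addresses driver intention prediction (DIP) in autonomous driving systems. It proposes an interpretable approach and a new dataset, DAAD-X, and designs the Video Concept Bottleneck Model (VCBM) framework, which generates coherent explanations from multimodal data. It shows that transformer-based models are more interpretable on this task.
Details
Motivation: As autonomous driving systems advance, the interpretability of their decision-making is critical for safe driving. The authors argue that current deep learning methods lack a transparent understanding of human driving intent, and they therefore introduce the task of interpretable driver intention prediction.
Result: Experiments on the DAAD-X dataset show that transformer-based models are more interpretable than conventional CNN-based models. In addition, the proposed multilabel t-SNE visualization illustrates the disentanglement and causal correlation among multiple explanations.
Insight: 1. Interpretability is key to the safety of autonomous driving; 2. Transformer models are better at generating natural-language explanations; 3. Multimodal data fusion gives a more complete view for driver intention prediction.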
Abstract: Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable prediction of maneuvers before they occur, for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver’s decisions. These explanations are derived from both the driver’s eye-gaze and the ego-vehicle’s perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/
[42] Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition cs.CVPDF
Huimin Liu, Jing Gao, Daria Baran, AxelX Montout, Neill W Campbell
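TL;DR: Cattle-CLIP is a multimodal deep learning framework for cattle behaviour recognition that uses semantic cues to improve video-based feature recognition, performing well in both supervised and few-shot learning.
Details
Motivation: Cattle behaviour is an important indicator of health and well-being. Video monitoring combined with deep learning has become a mainstream approach, but existing methods fall short on data-scarce behaviour recognition.
Result: Overall accuracy reaches 96.1% in the supervised setting, with nearly 100% recall for feeding, drinking, and standing-ruminating behaviours; the model is also robust in few-shot scenarios.
Insight: Multimodal learning has strong potential in agriculture and animal behaviour analysis, maintaining high performance even when data are scarce.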
Abstract: Cattle behaviour is a crucial indicator of an individual animal's health, productivity and overall well-being. Video-based monitoring, combined with deep learning techniques, has become a mainstream approach in animal biometrics, and it can offer high accuracy in some behaviour recognition tasks. We present Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition, using semantic cues to improve the performance of video-based visual feature recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. To address the domain gap between web data used for the pre-trained model and real-world cattle surveillance footage, we introduce tailored data augmentation strategies and specialised text prompts. Cattle-CLIP is evaluated under both fully-supervised and few-shot learning scenarios, with a particular focus on data-scarce behaviour recognition, an important yet under-explored goal in livestock monitoring. To evaluate the proposed method, we release the CattleBehaviours6 dataset, which comprises six types of indoor behaviours: feeding, drinking, standing-self-grooming, standing-ruminating, lying-self-grooming and lying-ruminating. The dataset consists of 1905 clips collected from our John Oldacre Centre dairy farm research platform housing 200 Holstein-Friesian cows. Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviours, and demonstrates robust generalisation with limited data in few-shot scenarios, highlighting the potential of multimodal learning in agricultural and animal behaviour analysis.
[43] Stable Video Infinity: Infinite-Length Video Generation with Error Recycling cs.CVPDF
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
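TL;DR: Stable Video Infinity (SVI) generates infinite-length videos through Error-Recycling Fine-Tuning, addressing the error accumulation and scene repetition seen in existing long-video generation.
Details
Motivation: Existing long-video methods mitigate accumulated error with handcrafted anti-drifting techniques (e.g., modified noise schedulers or frame anchoring) but remain limited to single-prompt extrapolation, which produces homogeneous scenes. SVI argues that the root problem is the gap between the training assumption (clean data) and autoregressive inference (conditioning on error-prone, self-generated outputs).
Result: SVI scales videos to infinite length with no additional inference cost and is compatible with diverse conditions (audio, skeleton, and text streams). Its versatility and state-of-the-art performance are verified on three benchmarks.
Insight: Simulating inference-time error accumulation during training markedly improves the model's ability to correct its own errors, enabling long videos with higher temporal consistency and scene diversity.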
Abstract: We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)’s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
[44] Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation cs.CVPDF
Wangyu Wu, Xuhang Chen, Zhenhong Chen, Jing-En Jiang, Kim-Fung Tsang
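TL;DR: The paper proposes TEMA-LLM, a framework that uses large language models (LLMs) to generate semantic tags for cross-domain sequential recommendation (CDSR) and captures user preferences with a multi-attention mechanism, consistently outperforming existing baselines.
Details
Motivation: On modern e-commerce platforms, users' cross-domain behaviour patterns are complex and changing, and traditional recommender systems struggle to capture both domain-specific and cross-domain preferences, so a more capable approach is needed.
Result: TEMA-LLM outperforms existing baselines on four large-scale e-commerce datasets, supporting the effectiveness of LLM-based semantic tags and the multi-attention mechanism.
Insight: LLMs have strong potential for generating semantic tags and improving recommender systems, particularly in cross-domain recommendation, where they can substantially strengthen the modeling of user behaviour patterns.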
Abstract: Cross-Domain Sequential Recommendation (CDSR) plays a crucial role in modern consumer electronics and e-commerce platforms, where users interact with diverse services such as books, movies, and online retail products. These systems must accurately capture both domain-specific and cross-domain behavioral patterns to provide personalized and seamless consumer experiences. To address this challenge, we propose TEMA-LLM (Tag-Enriched Multi-Attention with Large Language Models), a practical and effective framework that integrates Large Language Models (LLMs) for semantic tag generation and enrichment. Specifically, TEMA-LLM employs LLMs to assign domain-aware prompts and generate descriptive tags from item titles and descriptions. The resulting tag embeddings are fused with item identifiers as well as textual and visual features to construct enhanced item representations. A Tag-Enriched Multi-Attention mechanism is then introduced to jointly model user preferences within and across domains, enabling the system to capture complex and evolving consumer interests. Extensive experiments on four large-scale e-commerce datasets demonstrate that TEMA-LLM consistently outperforms state-of-the-art baselines, underscoring the benefits of LLM-based semantic tagging and multi-attention integration for consumer-facing recommendation systems. The proposed approach highlights the potential of LLMs to advance intelligent, user-centric services in the field of consumer electronics.
[45] Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation cs.CV | cs.AIPDF
Vijay M. Galshetwar, Praful Hambarde, Prashant W. Patil, Akshay Dudhane, Sachin Chaudhary
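TL;DR: This survey comprehensively reviews image and video restoration techniques for adverse weather conditions, covering traditional methods and modern data-driven models (CNNs, transformers, diffusion models, and VLMs), and discusses future research directions.
Details
Motivation: Adverse weather (such as haze, rain, and snow) significantly degrades image and video quality, undermining the visual input that intelligent transportation systems rely on, so effective restoration techniques are needed.
Result: The survey analyzes the strengths and weaknesses of each class of methods and identifies promising directions for future work.
Insight: Future research should focus on mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks.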
Abstract: Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration
[46] Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras cs.CV | cs.AI | cs.CL | cs.LGPDF
Jindong Hong, Wencheng Zhang, Shiqin Qiao, Jianhai Chen, Jianing Qiu
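TL;DR: The study proposes HMVDx, a low-cost framework for diagnosing shoulder disorders using consumer-grade cameras and MLLMs, which substantially improves diagnostic accuracy.
Details
Motivation: The work targets the difficulty of early, accurate diagnosis of shoulder disorders in regions with scarce medical resources by providing a low-cost, easily scalable auxiliary diagnostic solution.
Result: HMVDx improves diagnostic accuracy by 79.6% compared with direct video diagnosis.
Insight: Low-cost MLLMs have considerable potential in medicine; in particular, a framework that splits the work into staged tasks can substantially improve diagnostic performance.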
Abstract: Shoulder disorders, such as frozen shoulder (a.k.a. adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework separates the two tasks of action understanding and disease diagnosis, which are handled by two MLLMs respectively. In addition to traditional evaluation indicators, this work proposes a novel metric, the Usability Index, which follows the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.
[47] Zero-shot image privacy classification with Vision-Language Models cs.CV | cs.LG | cs.MMPDF
Alina Elena Baia, Alessio Xompero, Andrea Cavallaro
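TL;DR: By establishing a zero-shot benchmark, the paper systematically evaluates vision-language models (VLMs) on image privacy classification. Despite their high compute cost and slower inference, VLMs still lag behind smaller specialized models in accuracy, although they are more robust to image perturbations.
Details
Motivation: Current work increasingly adopts general-purpose VLMs for image privacy prediction but lacks a systematic comparison against specialized models, so the actual performance of VLMs on this task needs to be evaluated.
Result: VLMs are resource-intensive and slower at inference, yet less accurate in privacy prediction than smaller specialized models; they are more robust to image perturbations.
Insight: VLMs may be less efficient than specialized models on some tasks, but their robustness advantage may matter more in certain scenarios.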
Abstract: While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
[48] Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy cs.CVPDF
Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem
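TL;DR: The paper uses discrete semantic entropy (DSE) to filter out questions that are likely to produce hallucinations, significantly improving the accuracy of black-box vision-language models (VLMs) on radiologic visual question answering (VQA).
Details
Motivation: In radiologic VQA, black-box VLMs (such as GPT-4o and GPT-4.1) can hallucinate, that is, give inaccurate answers, which harms diagnostic accuracy. The paper uses DSE to quantify semantic inconsistency and filter out high-risk questions.
Result: After filtering out high-entropy questions (DSE > 0.3), accuracy on the retained questions rises from 51.7% to 76.3% for GPT-4o and from 54.8% to 63.8% for GPT-4.1. The gains are statistically significant.
Insight: DSE is a lightweight and efficient method suited to clinical VLM applications; it reduces hallucinations and improves diagnostic reliability without access to model internals.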
Abstract: To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
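The abstract says DSE is computed from the relative frequencies of semantic clusters, but it does not give the logarithm base or any normalization. The sketch below therefore assumes plain Shannon entropy with the natural logarithm, and it assumes the bidirectional entailment step has already assigned each sampled answer a cluster id. The function names and the default threshold placement are illustrative, not the authors' implementation.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Shannon entropy (natural log, an assumption) over semantic-cluster frequencies.

    `cluster_ids` holds one label per sampled answer; answers judged
    meaning-equivalent share a label. The clustering itself is assumed done.
    """
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def keep_question(cluster_ids, threshold=0.3):
    """Keep a question only if its sampled answers are semantically consistent."""
    return discrete_semantic_entropy(cluster_ids) <= threshold


# 15 sampled answers, 14 of which agree: low entropy, so the question is kept.
consistent = keep_question([0] * 14 + [1])
# Three mutually inconsistent answers: high entropy, so the question is rejected.
inconsistent = keep_question([0, 1, 2])
```

Because only sampled outputs are needed, this kind of filter works with black-box models; the trade-off visible in the abstract is coverage, since only 334 of 706 questions were retained for GPT-4o at the 0.3 threshold.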
[49] MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding cs.CVPDF
Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
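TL;DR: This paper proposes MomentSeg, a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and Referring Video Object Segmentation (RefVOS). With Moment-Centric Sampling (MCS) and Bidirectional Anchor-updated Propagation (BAP), it substantially improves video pixel understanding.
Details
Motivation: Existing RefVOS methods either rely on handcrafted heuristic sampling, which overlooks essential temporal cues, or depend on external keyframe models, which increases system complexity. To address this, the authors propose a unified framework that jointly optimizes TSG and RefVOS and naturally incorporates key-moment grounding.
Result: Experiments indicate that MomentSeg performs strongly on RefVOS, and that the jointly optimized framework improves both temporal understanding and segmentation performance.
Insight: Jointly optimizing key-moment grounding and using a dynamic sampling strategy can substantially improve video segmentation while reducing dependence on external models and limiting error accumulation.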
Abstract: Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the start point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
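The idea behind Moment-Centric Sampling, dense sampling inside the grounded moment and sparse sampling elsewhere, can be illustrated with a small index-selection routine. This is only a sketch under stated assumptions: the abstract gives no strides or frame budgets, so `dense_stride`, `sparse_stride`, and the function name are hypothetical, and the key-moment interval is assumed to come from a separate grounding step.

```python
def moment_centric_indices(num_frames, moment_start, moment_end,
                           dense_stride=1, sparse_stride=8):
    """Pick frame indices densely inside [moment_start, moment_end] and sparsely outside.

    Hypothetical illustration; the strides are made-up defaults, not values
    from the paper.
    """
    if not 0 <= moment_start <= moment_end < num_frames:
        raise ValueError("moment interval must lie inside the video")
    # Dense coverage of the key moment preserves motion detail.
    picked = set(range(moment_start, moment_end + 1, dense_stride))
    # Sparse coverage before and after the moment preserves global context.
    picked.update(range(0, moment_start, sparse_stride))
    picked.update(range(moment_end + 1, num_frames, sparse_stride))
    return sorted(picked)


# A 32-frame clip whose key moment spans frames 10 to 13.
indices = moment_centric_indices(32, 10, 13)
```

The design choice being illustrated is that the frame budget is spent where the language query is grounded, rather than uniformly across the clip.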
[50] Spotlight on Token Perception for Multimodal Reinforcement Learning cs.CVPDF
Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He
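TL;DR: The paper studies the role of visual perception in multimodal Reinforcement Learning with Verifiable Rewards (RLVR) through a new token-perception perspective, and proposes VPPO, a policy gradient algorithm that substantially improves the multimodal reasoning of Large Vision-Language Models (LVLMs).
Details
Motivation: Existing RLVR methods tend to neglect visual perception, which limits multimodal reasoning. The paper analyzes token perception to reveal how sparse visual dependency is, and uses that analysis to design a more effective optimization strategy.
Result: On eight benchmarks, VPPO shows substantial gains over leading open-source RL-tuned models, with consistent effectiveness at both the 7B and 32B model scales.
Insight: Token perception reveals that visual dependency in multimodal RLVR is sparse, and VPPO offers an effective optimization strategy for strengthening the multimodal reasoning of LVLMs.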
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory’s advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
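The dual mechanism described in the abstract (reweighting a trajectory's advantage by its overall visual dependency, and restricting updates to perceptually pivotal tokens) can be sketched as a per-token advantage computation. Everything specific here is an assumption made for illustration: the abstract does not say how visual dependency is measured, how the trajectory-level weight is aggregated (a mean is used below), or how pivotal tokens are chosen (a top fraction is used below). The function name is hypothetical.

```python
import math


def perception_weighted_advantages(advantage, token_dependency, top_fraction=0.4):
    """Turn one trajectory-level advantage into per-token advantages.

    Hypothetical sketch: `token_dependency` holds a visual-dependency score per
    generated token; mean aggregation and top-fraction masking are assumptions.
    """
    n = len(token_dependency)
    if n == 0:
        return []
    # Mechanism 1: scale the advantage by the trajectory's overall visual dependency.
    trajectory_weight = sum(token_dependency) / n
    scaled = advantage * trajectory_weight
    # Mechanism 2: only the most visually dependent tokens receive a learning signal.
    k = max(1, math.ceil(top_fraction * n))
    ranked = sorted(range(n), key=lambda i: token_dependency[i], reverse=True)
    pivotal = set(ranked[:k])
    return [scaled if i in pivotal else 0.0 for i in range(n)]


token_advantages = perception_weighted_advantages(2.0, [0.0, 1.0, 0.0, 1.0], top_fraction=0.5)
```

In a policy gradient loss these values would multiply the per-token log-probabilities, so masked tokens contribute no gradient.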
[51] Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling cs.CV | eess.IVPDF
Tejaswi V. Panchagnula
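TL;DR: The paper finds that human visual gaze dynamics follow a Lévy walk pattern similar to animal foraging, and it trains a convolutional neural network that predicts fixation heatmaps.
Details
Motivation: Traditional visual attention models are usually based on image saliency and have not explored the spatiotemporal statistics of eye movements in depth. The paper aims to fill this gap and characterize the dynamics of human visual exploration.
Result: The experimental data show that human eye movements closely match a Lévy walk, and the CNN accurately predicts fixation hotspot regions.
Insight: Human visual exploration may be governed by natural laws that optimize the efficiency of information acquisition, offering a new perspective for attention modeling.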
Abstract: Animals often forage via Lévy walks: stochastic trajectories with heavy-tailed step lengths optimized for sparse resource environments. We show that human visual gaze follows similar dynamics when scanning images. While traditional models emphasize image-based saliency, the underlying spatiotemporal statistics of eye movements remain underexplored. Understanding these dynamics has broad applications in attention modeling and vision-based interfaces. In this study, we conducted a large-scale human subject experiment involving 40 participants viewing 50 diverse images under unconstrained conditions, recording over 4 million gaze points using a high-speed eye tracker. Analysis of these data shows that the gaze trajectory of the human eye also follows a Lévy walk akin to animal foraging. This suggests that the human eye forages for visual information in an optimally efficient manner. Further, we trained a convolutional neural network (CNN) to predict fixation heatmaps from image input alone. The model accurately reproduced salient fixation regions across novel images, demonstrating that key components of gaze behavior are learnable from visual structure alone. Our findings present new evidence that human visual exploration obeys statistical laws analogous to natural foraging and open avenues for modeling gaze through generative and predictive frameworks.
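A Lévy walk is characterized by step lengths drawn from a heavy-tailed power law, p(l) ∝ l^(-mu) for l ≥ l_min. The abstract does not describe how the authors fitted their gaze data, so the sketch below is a generic illustration rather than their analysis: it samples step lengths by inverse-transform sampling and recovers the exponent with the standard maximum-likelihood estimator for a continuous power law. The parameter values and function names are illustrative assumptions.

```python
import math
import random


def sample_levy_steps(n, mu=2.0, l_min=1.0, rng=None):
    """Draw n step lengths from p(l) ~ l^(-mu), l >= l_min, via inverse transform."""
    rng = rng or random.Random()
    # 1 - rng.random() lies in (0, 1], which avoids raising zero to a negative power.
    return [l_min * (1.0 - rng.random()) ** (-1.0 / (mu - 1.0)) for _ in range(n)]


def estimate_tail_exponent(steps, l_min=1.0):
    """Maximum-likelihood estimate of mu for a continuous power law above l_min."""
    logs = [math.log(s / l_min) for s in steps if s >= l_min]
    return 1.0 + len(logs) / sum(logs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
steps = sample_levy_steps(20000, mu=2.0, rng=rng)
mu_hat = estimate_tail_exponent(steps)  # should land close to the true exponent 2.0
```

Applied to gaze data, the step lengths would be the distances between successive gaze points or fixations; how those are segmented is a modeling choice the abstract leaves unspecified.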
[52] CapGeo: A Caption-Assisted Approach to Geometric Reasoning cs.CV | cs.AI | cs.CLPDF
Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An
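TL;DR: CapGeo is a framework that improves the geometric reasoning of multimodal large language models (MLLMs) by converting geometric figures into text descriptions, which substantially improves performance. The paper also proposes the CapGeo-Bench dataset and a keypoint-based evaluation metric for systematically assessing the quality of geometric captions.
Details
Motivation: Even the most advanced MLLMs (such as GPT-O3 and Gemini-2.5-Pro) perform poorly on geometric reasoning, which suggests that the bottleneck is understanding geometric diagrams rather than textual reasoning. Converting figures into text descriptions may therefore be an effective remedy.
Result: CapGeo substantially improves the geometric reasoning of MLLMs: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, and Claude-Opus-4 rises from 44.8% to 73.0%.
Insight: Converting geometric figures into text descriptions can relieve the visual-understanding bottleneck in MLLM geometric reasoning. High-quality captions are essential to the gains, and the keypoint-based metric in CapGeo-Bench correlates strongly with downstream CapGeo performance, giving a reliable basis for assessing captioning ability.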
Abstract: Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
[53] Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark cs.CVPDF
Jinyuan Liu, Zihang Chen, Zhu Liu, Zhiying Jiang, Long Ma
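TL;DR: The paper proposes a Progressive Prompt Fusion Network (PPFN) for thermal infrared image enhancement and builds a high-quality, multi-scenario infrared benchmark. PPFN fuses prompt pairs derived from the imaging process to modulate the model's features and, combined with a Selective Progressive Training mechanism, substantially improves enhancement under complex degradations.
Details
Motivation: Existing infrared image enhancement methods usually target a single degradation and struggle with coupled degradations, while all-in-one enhancement methods have limited effectiveness on infrared images. The authors therefore propose a progressive approach grounded in the imaging mechanism.
Result: Experiments show that the method significantly improves performance on complex degradation scenes, with an 8.76% improvement.
Insight: Prompt-pair design grounded in the imaging mechanism, together with a progressive training strategy, is key to improving infrared image enhancement.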
Abstract: We address the relatively underexplored task of thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In light of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model’s features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a Selective Progressive Training (SPT) mechanism is introduced to gradually refine the model’s handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details, but also enhances the overall contrast of the thermal image. Furthermore, we introduce a high-quality, multi-scenario infrared benchmark covering a wide range of scenarios. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76% improvement. Code is available at https://github.com/Zihang-Chen/HM-TIR.
[54] Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models cs.CVPDF
Qihang Ma, Shengyu Li, Jie Tang, Dingkang Yang, Shaodong Chen
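TL;DR: The paper proposes a dynamic chain-of-thought (CoT) strategy to improve vision-language models on multi-modal keyphrase prediction, addressing limitations of traditional methods, and validates its effectiveness experimentally.
Details
Motivation: Traditional multi-modal keyphrase prediction methods are limited in handling absence and unseen scenarios, and existing benchmarks overestimate model capability. The authors aim to use vision-language models (VLMs) to improve task performance and address these issues.
Result: Experiments show that the proposed method performs well on multiple datasets, validating the effectiveness of the dynamic CoT strategy.
Insight: The paper shows that a dynamic CoT strategy can effectively balance a model's reasoning ability and computational efficiency, offering a new technical approach for multi-modal tasks.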
Abstract: Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test data. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, zero-shot and supervised fine-tuning (SFT), to assess the lower bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the “overthinking” phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.
[55] BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception cs.CVPDF
Junyan Ye, Dongzhi Jiang, Jun He, Baichuan Zhou, Zilong Huang
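TL;DR: BLINK-Twice is a benchmark focused on reasoning over visual perception; through challenging tasks and an evaluation of multimodal large language models, it reveals the shortcomings of current models in visual reasoning.
Details
Motivation: Existing reasoning benchmarks mainly assess language-based reasoning and often treat visual input as replaceable context. BLINK-Twice fills this gap by focusing on reasoning over visual content.
Result: The evaluation shows that existing models are unstable at visual reasoning and that language-space reasoning strategies have limited effect. Repeated image observation and active visual interaction (as in models like o3) clearly help performance.
Insight: Visual reasoning needs a new paradigm; current language reasoning strategies do not transfer directly. Active visual interaction and repeated observation are key directions for improving visual reasoning.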
Abstract: Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception (“see”) and requires fine-grained observation and analytical reasoning (“observe”). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space, such as chain-of-thought or self-criticism, can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice.
[56] Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes cs.CVPDF
Yikang Zhang, Rui Fan
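TL;DR: VAD-GS is a 3D Gaussian splatting framework optimized for dynamic urban scenes; using voxel-based visibility reasoning and informative view selection, it addresses geometric distortion caused by incomplete initial point clouds.
Details
Motivation: In dynamic, unbounded urban scenes, poor-quality initial point clouds cause geometric distortion in 3D Gaussian splatting (3DGS), and existing methods cannot reconstruct missing structures.
Result: On the Waymo and nuScenes datasets, it outperforms existing 3DGS methods and significantly improves the geometric reconstruction quality of both static and dynamic objects.
Insight: Relying on reliable geometric priors (such as visibility reasoning and multi-view matching) can effectively address 3DGS initialization problems in complex scenes.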
Abstract: 3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones and are therefore incapable of reconstructing missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via patch matching-based multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Source code will be released upon publication.
[57] Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification cs.CVPDF
Jinxiang Tu, Dayong Ren, Fei Shi, Zhenhong Jia, Yahong Ren
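TL;DR: Minkowski-MambaNet is a new deep learning framework that combines Mamba's Selective State Space Model (SSM) with a Minkowski network to estimate forest biomass directly from LiDAR point clouds. Its innovation lies in effectively encoding global context and long-range dependencies, which significantly improves the accuracy of biomass estimation.
Details
Motivation: Accurate quantification of forest biomass is vital for carbon cycle monitoring. Traditional methods have difficulty modeling the long-range dependencies needed to distinguish trees, so an effective framework that estimates volume and aboveground biomass (AGB) directly from LiDAR point clouds is needed.
Result: Evaluation on Danish National Forest Inventory LiDAR data shows that Minkowski-MambaNet significantly outperforms existing methods, providing more accurate and robust biomass estimates.
Insight: The framework offers a powerful tool for large-scale forest biomass analysis and advances LiDAR-based forest inventory techniques.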
Abstract: Accurate forest biomass quantification is vital for carbon cycle monitoring. While airborne LiDAR excels at capturing 3D forest structure, directly estimating woody volume and Aboveground Biomass (AGB) from point clouds is challenging due to difficulties in modeling long-range dependencies needed to distinguish trees. We propose Minkowski-MambaNet, a novel deep learning framework that directly estimates volume and AGB from raw LiDAR. Its key innovation is integrating the Mamba model’s Selective State Space Model (SSM) into a Minkowski network, enabling effective encoding of global context and long-range dependencies for improved tree differentiation. Skip connections are incorporated to enhance features and accelerate convergence. Evaluated on Danish National Forest Inventory LiDAR data, Minkowski-MambaNet significantly outperforms state-of-the-art methods, providing more accurate and robust estimates. Crucially, it requires no Digital Terrain Model (DTM) and is robust to boundary artifacts. This work offers a powerful tool for large-scale forest biomass analysis, advancing LiDAR-based forest inventories.
[58] Utilizing dynamic sparsity on pretrained DETR cs.CVPDF
Reza Sedghi, Anand Subramoney, David Kappel
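TL;DR: The paper proposes two methods (SIBS and MGS) that exploit the inherent sparsity in the MLP layers of a pretrained DETR for efficient inference; MGS predicts sparsity dynamically with a lightweight gating mechanism and significantly reduces computation.
Details
Motivation: Transformer-based models such as DETR have low inference efficiency on vision tasks, yet their MLP layers are inherently sparse. How to exploit this sparsity without retraining the model is a key question.
Result: On the COCO dataset, MGS maintains or even improves performance while significantly reducing computation.
Insight: Dynamic sparsity prediction is more effective than static methods; a lightweight gating mechanism achieves efficient sparsification without retraining the full model.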
Abstract: Efficient inference with transformer-based models remains a challenge, especially in vision tasks like object detection. We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns. While simple, SIBS offers limited gains due to the input-dependent nature of sparsity. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR. MGS predicts dynamic sparsity using a small linear layer and achieves up to 85 to 95% activation sparsity. Experiments on the COCO dataset show that MGS maintains or even improves performance while significantly reducing computation. Our method offers a practical, input-adaptive approach to sparsification, enabling efficient deployment of pretrained vision transformers without full model retraining.
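The input-dependent gating idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the per-neuron gate layout, and the threshold are hypothetical and are not taken from the paper. The point is only that a gate decides, per input, which hidden neurons of a ReLU MLP to compute, and that skipping neurons whose activation would be zero leaves the output unchanged.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def dense_mlp(x, w1, w2):
    """Reference MLP: hidden = relu(W1 x), out = W2 hidden."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w2[o][h] * hidden[h] for h in range(len(hidden)))
            for o in range(len(w2))]


def gated_mlp(x, w1, w2, gate_w, threshold=0.0):
    """Compute only the hidden neurons whose gate score exceeds the threshold.

    gate_w is a hypothetical linear gate with one row per hidden neuron.
    Returns the output and the number of neurons actually evaluated.
    """
    active = [h for h, g in enumerate(gate_w)
              if sum(gw * xi for gw, xi in zip(g, x)) > threshold]
    hidden = {h: relu(sum(w * xi for w, xi in zip(w1[h], x))) for h in active}
    out = [sum(w2[o][h] * v for h, v in hidden.items())
           for o in range(len(w2))]
    return out, len(active)
```

Passing `gate_w=w1` turns the gate into an oracle that reproduces the dense output exactly, which is useful as a correctness check but saves nothing, since the gate then costs as much as the first layer. A practical gate would have to be much cheaper than the layer it guards, and its mispredictions would trade accuracy for speed.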
[59] Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians cs.CVPDF
Jin-Chuan Shi, Chengye Su, Jiajun Wang, Ariel Shamir, Miao Wang
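TL;DR: Mono4DEditor is a text-driven 4D scene editing framework that achieves precise local editing of dynamic scenes from monocular video through point-level localization of language-embedded Gaussian representations.
Details
Motivation: Editing 4D scenes reconstructed from monocular video is a valuable but challenging task with broad applications in content creation and virtual environments. The main difficulty is achieving semantically precise local edits in complex dynamic scenes while preserving the integrity of unedited content.
Result: Experiments show that Mono4DEditor enables high-quality text-driven editing across diverse scenes and object types while preserving the appearance and geometry of unedited regions, outperforming existing methods in both flexibility and visual fidelity.
Insight: Combining a language-embedded dynamic representation with a point-level localization strategy can significantly improve the semantic precision and locality of 4D scene editing.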
Abstract: Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.
[60] Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement cs.CVPDF
Ruirui Lin, Guoxi Huang, Nantheera Anantrasirichai
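TL;DR: DWTA-Net is a novel two-stage framework for low-light video enhancement that uses dynamic weight-based temporal aggregation to balance static and dynamic regions and improve visual quality.
Details
Motivation: Low-light video enhancement faces challenges such as noise, low contrast, and color degradation, and existing learning-based methods make poor use of temporal information. DWTA-Net aims to address these problems.
Result: On real-world low-light videos, DWTA-Net significantly suppresses noise and artifacts and delivers better visual quality than existing methods.
Insight: A dynamic weighting mechanism and a texture-adaptive loss are key to balancing detail and smoothness in video enhancement.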
Abstract: Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.
[61] D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models cs.CV | cs.LGPDF
Jisu Han, Wonjun Hwang
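TL;DR: The paper proposes D-TPT, which calibrates test-time prompt tuning in vision-language models (VLMs) by maximizing dimensional entropy, addressing the calibration error caused by dominant dimensions across modalities.
Details
Motivation: VLMs show flexible domain adaptation under test-time prompt tuning, but calibration degrades because of a single dominant dimension across modalities. The paper aims to solve this by regularizing the distribution of textual features.
Result: Experiments show that D-TPT effectively mitigates calibration error in test-time prompt tuning, improving the reliability of VLMs in real-world scenarios.
Insight: The key insight is that the high predictive sensitivity of the cross-modal dominant dimensions is a main cause of calibration error, and entropy maximization can balance the feature distribution.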
Abstract: Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
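To make the notion of an entropy over feature dimensions concrete, here is a minimal sketch under stated assumptions. The abstract does not give the exact formulation, so the definition below (entropy of each dimension's share of the total absolute feature mass) is a hypothetical stand-in chosen for clarity, not the paper's loss. It shows why a single dominant dimension yields low entropy and why pushing toward uniformity raises it.

```python
import math


def dimensional_entropy(features):
    """Entropy of the per-dimension share of absolute feature mass.

    features: list of equal-length vectors (e.g. one text feature per class).
    A value near log(num_dims) means mass is spread evenly across dimensions;
    a low value means a few dimensions dominate.
    """
    dims = len(features[0])
    mass = [sum(abs(f[d]) for f in features) for d in range(dims)]
    total = sum(mass)
    probs = [m / total for m in mass]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

In a regularizer of this kind, the negative of this quantity would be added to the tuning objective so that optimization increases the entropy; how the term is weighted against the main objective is not specified here.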
[62] Few-shot multi-token DreamBooth with LoRa for style-consistent character generation cs.CV | cs.LGPDF
Ruben Pascual, Mikel Sesma-Sara, Aranzazu Jurio, Daniel Paternain, Mikel Galar
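TL;DR: The paper proposes a few-shot method that combines a multi-token strategy with LoRA-based efficient fine-tuning to generate diverse, style-consistent characters, extending DreamBooth's use in text-to-image generation.
Details
Motivation: To address the challenge of generating diverse new characters that preserve the artistic style and visual traits of reference characters under few-shot conditions, broadening creative possibilities in animation, gaming, and related fields.
Result: The method's effectiveness is validated on five small datasets; both quantitative metrics and human evaluation show that the generated characters are high quality and style-consistent.
Insight: Combining multiple tokens with LoRA offers a new approach to few-shot style-consistent generation and extends the potential of diffusion models in artistic creation.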
Abstract: The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.
[63] A methodology for clinically driven interactive segmentation evaluation cs.CV | cs.AI | cs.LGPDF
Parhom Esmaeili, Virginia Fernandez, Pedro Borges, Eli Gibson, Sebastien Ourselin
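TL;DR: The paper proposes a clinically driven methodology for evaluating interactive segmentation; through standardized evaluation pipelines and task definitions, it reveals performance bottlenecks of existing algorithms and directions for improvement.
Details
Motivation: Current interactive segmentation evaluation is inconsistent and detached from clinical practice, which hinders fair comparison and real-world application of algorithms.
Result: Information loss, adaptive-zooming mechanisms, and consistency between validation and training significantly affect performance; the suitability of 2D versus 3D methods depends on target characteristics.
Insight: Interactive segmentation algorithms should be designed with the characteristics of real clinical data in mind; 2D and 3D methods each have strengths and weaknesses, and non-medical-domain models perform poorly in complex scenarios.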
Abstract: Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and build a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
[64] PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs cs.CV | cs.ROPDF
Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng
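TL;DR: PhysToolBench is the first benchmark dedicated to evaluating how well multimodal large language models (MLLMs) understand physical tools. It comprises three difficulty levels (tool recognition, tool understanding, and tool creation), and the evaluation finds that existing models fall short in tool understanding.
Details
Motivation: Although MLLMs perform well in embodied AI and vision-language-action models, their deeper understanding of physical tools has not been quantified, and a dedicated benchmark is needed to fill this gap.
Result: Existing MLLMs show a significant deficiency on tool understanding tasks, indicating their limitations in the physical-tool domain.
Insight: The study exposes MLLMs' weaknesses in practical physical tool use and provides directions and tools for future improvement.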
Abstract: The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool’s primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool’s operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs (spanning proprietary, open-source, and specialized embodied models, as well as backbones in VLAs) reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.
[65] Diagonal Artifacts in Samsung Images: PRNU Challenges and Solutions cs.CVPDF
David Vázquez-Padín, Fernando Pérez-González, Alejandro Martín-Del-Río
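TL;DR: The paper studies diagonal artifacts in Samsung smartphone images and their impact on PRNU-based camera source verification, proposes solutions, and discusses potential forensic applications.
Details
Motivation: Diagonal artifacts in images from certain Samsung models cause PRNU fingerprint verification to fail, so solutions are needed to ensure reliability.
Result: Raw images avoid the artifacts and improve the reliability of PRNU verification; the artifacts can also be used to reduce misdetections in HDR images and to localize synthetic bokeh regions in portrait-mode images.
Insight: Artifacts introduced by the image processing pipeline can cause PRNU verification to fail, but raw images and the properties of the artifacts themselves can serve as a solution and a forensic tool, respectively.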
Abstract: We investigate diagonal artifacts present in images captured by several Samsung smartphones and their impact on PRNU-based camera source verification. We first show that certain Galaxy S series models share a common pattern causing fingerprint collisions, with a similar issue also found in some Galaxy A models. Next, we demonstrate that reliable PRNU verification remains feasible for devices supporting PRO mode with raw capture, since raw images bypass the processing pipeline that introduces artifacts. This option, however, is not available for the mid-range A series models or in forensic cases without access to raw images. Finally, we outline potential forensic applications of the diagonal artifacts, such as reducing misdetections in HDR images and localizing regions affected by synthetic bokeh in portrait-mode images.
[66] FLOWING: Implicit Neural Flows for Structure-Preserving Morphing cs.CV | I.4.0PDF
Arthur Bizzi, Matias Grynberg, Vitor Matias, Daniel Perazzo, João Paulo Lima
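TL;DR: FLOWING proposes a method based on implicit neural flows that models warping as a differential vector flow, achieving structure-preserving morphing that is accurate for both 2D images and 3D shapes.
Details
Motivation: Standard multilayer perceptrons (MLPs) suffer from unstable training and difficult feature alignment in morphing tasks and require costly regularization. FLOWING aims to overcome these limitations with a flow-centric design and improve morphing quality.
Result: On morphing tasks for 2D images and 3D shapes, FLOWING achieves state-of-the-art morphing quality with faster convergence.
Insight: With a flow-centric design, implicit neural flows naturally address structural alignment and stability in morphing, offering a new direction for future morphing methods.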
Abstract: Morphing is a long-standing problem in vision and computer graphics, requiring a time-dependent warping for feature alignment and a blending for smooth interpolation. Recently, multilayer perceptrons (MLPs) have been explored as implicit neural representations (INRs) for modeling such deformations, due to their meshlessness and differentiability; however, extracting coherent and accurate morphings from standard MLPs typically relies on costly regularizations, which often lead to unstable training and prevent effective feature alignment. To overcome these limitations, we propose FLOWING (FLOW morphING), a framework that recasts warping as the construction of a differential vector flow, naturally ensuring continuity, invertibility, and temporal coherence by encoding structural flow properties directly into the network architectures. This flow-centric approach yields principled and stable transformations, enabling accurate and structure-preserving morphing of both 2D images and 3D shapes. Extensive experiments across a range of applications - including face and image morphing, as well as Gaussian Splatting morphing - show that FLOWING achieves state-of-the-art morphing quality with faster convergence. Code and pretrained models are available at http://schardong.github.io/flowing.
[67] TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control cs.CVPDF
Minkyoung Cho, Ruben Ohana, Christian Jacobsen, Adityan Jothi, Min-Hung Chen
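TL;DR: TC-LoRA proposes dynamic weight modulation: a hypernetwork generates LoRA adapters on the fly, removing the limitation of static conditioning during generation in conventional diffusion models and significantly improving generation quality and adherence to conditions.
Details
Motivation: Conventional diffusion models use a fixed, static conditioning strategy during generation and cannot adapt to the coarse-to-fine dynamics of the process. TC-LoRA aims to enable more flexible generation control through dynamic weight modulation.
Result: Experiments show that TC-LoRA outperforms static activation-based conditioning across multiple data domains, with significantly better generation quality and adherence to spatial conditions.
Insight: Dynamic weight modulation offers a new conditioning paradigm for diffusion models that better matches the dynamic demands of the generation process.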
Abstract: Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model’s ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by conditioning the model’s weights directly. Our framework uses a hypernetwork to generate LoRA adapters on-the-fly, tailoring weight modifications for the frozen backbone at each diffusion step based on time and the user’s condition. This mechanism enables the model to learn and execute an explicit, adaptive strategy for applying conditional guidance throughout the entire generation process. Through experiments on various data domains, we demonstrate that this dynamic, parametric control significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods. TC-LoRA establishes an alternative approach in which the model’s conditioning strategy is modified through a deeper functional adaptation of its weights, allowing control to align with the dynamic demands of the task and generative stage.
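The mechanism of conditioning weights rather than activations can be sketched in a few lines. This is a toy under stated assumptions: `toy_hypernetwork` is a hypothetical closed-form stand-in for a learned hypernetwork, the rank is fixed at 1, and the condition is a single scalar. What it does show is the general LoRA pattern the abstract describes, where a frozen weight matrix W receives a low-rank update B·A that is regenerated at every diffusion step from the time t and the user's condition.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def toy_hypernetwork(t, cond, d_out, d_in):
    """Hypothetical stand-in for a learned hypernetwork.

    Returns rank-1 LoRA factors B (d_out x 1) and A (1 x d_in) whose scale
    depends on the diffusion time t in [0, 1] and a scalar condition.
    """
    scale = (1.0 - t) * cond
    b = [[scale] for _ in range(d_out)]
    a = [[1.0 / d_in] * d_in]
    return b, a


def adapted_weight(w_frozen, t, cond):
    """Return W + B A without modifying the frozen backbone weight."""
    b, a = toy_hypernetwork(t, cond, len(w_frozen), len(w_frozen[0]))
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]
```

Because the update is recomputed per step, the same backbone can behave differently early and late in denoising, which is the property a static activation-based conditioner lacks.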
[68] FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection cs.CVPDF
Shubham Trehan, Udhav Ramachandran, Akash Rao, Ruth Scimeca, Sathyanarayanan N. Aakur
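TL;DR: FSP-DETR is a unified detection framework that supports few-shot detection, open-set recognition, and cross-task generalization within a single model, and is particularly suited to data-scarce biomedical settings.
Details
Motivation: Object detection in biomedicine is constrained by scarce labeled data and the frequent emergence of new categories, and needs a detection method that adapts flexibly to few-shot, open-set, and new-task settings.
Result: Experiments show that FSP-DETR significantly outperforms existing methods in few-shot and open-set scenarios, especially in low-shot and cross-task adaptation.
Insight: FSP-DETR demonstrates the potential of a unified detection framework for few-shot and open-set tasks, offering a flexible and efficient solution for practical biomedical applications.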
Abstract: Object detection in biomedical settings is fundamentally constrained by the scarcity of labeled data and the frequent emergence of novel or rare categories. We present FSP-DETR, a unified detection framework that enables robust few-shot detection, open-set recognition, and generalization to unseen biomedical tasks within a single model. Built upon a class-agnostic DETR backbone, our approach constructs class prototypes from original support images and learns an embedding space using augmented views and a lightweight transformer decoder. Training jointly optimizes a prototype matching loss, an alignment-based separation loss, and a KL divergence regularization to improve discriminative feature learning and calibration under scarce supervision. Unlike prior work that tackles these tasks in isolation, FSP-DETR enables inference-time flexibility to support unseen class recognition, background rejection, and cross-task adaptation without retraining. We also introduce a new ova species detection benchmark with 20 parasite classes and establish standardized evaluation protocols. Extensive experiments across ova, blood cell, and malaria detection tasks demonstrate that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors, especially in low-shot and open-set scenarios.
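Prototype-based classification with background rejection, the inference-time behaviour the abstract describes, can be sketched generically as follows. This is not the paper's model: the embeddings are assumed to come from some upstream detector, and the function names, the cosine metric, and the `reject_below` threshold are illustrative assumptions.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def build_prototypes(support):
    """support: dict mapping class label -> list of support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query, prototypes, reject_below=0.5):
    """Return the best-matching class, or 'background' if nothing is close."""
    best_label, best_sim = max(
        ((label, cosine(query, proto)) for label, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return best_label if best_sim >= reject_below else "background"
```

Adding a new class at inference time only requires adding an entry to the prototype dictionary, which is why this family of methods can support unseen classes without retraining.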
[69] Vision Language Models: A Survey of 26K Papers cs.CVPDF
Fengming Lin
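TL;DR: The paper systematically surveys 26,104 papers from CVPR, ICLR, and NeurIPS spanning 2023-2025, quantifying macro trends in multimodal vision-language models (VLMs), generative methods, and 3D/video research.
Details
Motivation: To quantify research trends in computer vision and machine learning and reveal how multimodal vision-language models, generative methods, and 3D/video research are shifting.
Result: Three major trends emerge: VLM work increasingly centers on instruction following and multi-step reasoning; diffusion-based generative methods focus on controllability, distillation, and speed; and 3D/video research shifts toward Gaussian splatting and human-centric scene understanding.
Insight: Within VLMs, parameter-efficient adaptation (such as prompting and LoRA) and lightweight vision-language bridges have become mainstream; training practice is shifting from building from scratch to instruction tuning of strong backbones.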
Abstract: We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
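The normalize-then-match labeling step can be sketched as below. The lexicon contents, the normalization rule, and the function names are assumptions for illustration; the released lexicon and the paper's phrase-protection step are not reproduced here. The sketch lowercases text, collapses punctuation to spaces, and assigns a label when any of its phrases appears as a whole-word match.

```python
import re


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def assign_labels(text, lexicon, max_labels=35):
    """Return labels whose phrases occur in the text as whole-word matches.

    lexicon: dict mapping label -> list of phrases (hypothetical contents).
    """
    padded = f" {normalize(text)} "
    hits = [label for label, phrases in lexicon.items()
            if any(f" {normalize(p)} " in padded for p in phrases)]
    return hits[:max_labels]
```

Padding with spaces enforces whole-word matching, so a phrase like "NeRF" would not match "NeRFs" unless the plural is listed too. That is one concrete way the lexicon-recall limitation mentioned in the abstract can arise.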
[70] SpaceVista: All-Scale Visual Spatial Reasoning from mm to km cs.CVPDF
Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng
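TL;DR: The paper proposes SpaceVista, an all-scale visual spatial reasoning approach that addresses dataset dependence and modeling limitations; by building the SpaceVista-1M dataset and the SpaceVista-7B model, it enables cross-scale reasoning from millimeters to kilometers.
Details
Motivation: Current spatial reasoning research focuses mainly on indoor scenes and lacks all-scale support for diverse applications such as robotics and autonomous driving. The paper aims to address two challenges: heavy dataset dependence and modeling limitations.
Result: The model performs well on 5 benchmarks, showing strong generalization across all scales and scenarios.
Insight: Using scale as an anchor effectively mitigates knowledge conflicts; multi-expert collaboration and progressive training are key to improving spatial reasoning.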
Abstract: With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as the first attempt to broaden the all-scale spatial intelligence of MLLMs to the best of our knowledge. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released on https://peiwensun2000.github.io/mm2km .
[71] VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation cs.CVPDF
Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan
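TL;DR: The paper proposes the VITA-VLA framework, which efficiently gives vision-language models (VLMs) action-execution capability by distilling knowledge from a pretrained small action model, significantly improving task success rates and reducing training cost.
Details
Motivation: Existing vision-language-action (VLA) models must be trained from scratch at high cost. The authors aim to leverage pretrained VLMs and small action models to achieve efficient action generation through knowledge distillation.
Result: Improvements of 11.8% on LIBERO and 24.5% on LIBERO-LONG, plus a 17% improvement in real-world experiments, demonstrate the framework's effectiveness and efficiency.
Insight: Distilling knowledge from a small action model can significantly improve VLA performance while greatly reducing training cost, offering a new approach to efficient robot learning.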
Abstract: Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement), which demonstrates that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
[72] StreamingVLM: Real-Time Understanding for Infinite Video Streams cs.CV | cs.AI | cs.CLPDF
Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng
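TL;DR: StreamingVLM is a vision-language model for real-time processing of infinite video streams. By optimizing the KV cache and the training strategy, it addresses the high latency and memory cost of long-video processing and performs strongly on several benchmarks.
Details
Motivation: Existing vision-language models incur high compute and memory costs on infinite video streams and cannot meet the demands of real-time, long-video understanding.
Result: StreamingVLM reaches a 66.18% win rate on Inf-Streams-Eval, sustains real-time performance at up to 8 FPS, and improves results on other benchmarks (LongVideoBench by +4.30, OVOBench Realtime by +5.96).
Insight: Optimizing the attention mechanism and the training strategy can effectively improve vision-language models on long video streams, with no additional task-specific training.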
Abstract: Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
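The cache-retention rule described in the abstract (attention sinks, a short window of recent vision tokens, a long window of recent text tokens) can be sketched as below. This is a minimal illustration, not the authors' implementation: the `Token` representation and the parameters `n_sink`, `vision_window`, and `text_window` are assumptions made for the example, and details such as position re-indexing are omitted.

```python
from typing import List, Tuple

# Hypothetical token record: (position in the stream, modality tag).
Token = Tuple[int, str]  # modality is "vision" or "text"


def retain_tokens(
    tokens: List[Token],
    n_sink: int,
    vision_window: int,
    text_window: int,
) -> List[Token]:
    """Select which cached tokens to keep.

    Keeps the first `n_sink` tokens (attention sinks), the most recent
    `vision_window` vision tokens, and the most recent `text_window`
    text tokens, returned in stream order.
    """
    sinks = tokens[:n_sink]
    rest = tokens[n_sink:]
    vision = [t for t in rest if t[1] == "vision"]
    text = [t for t in rest if t[1] == "text"]
    # Guard against the [-0:] slice, which would keep everything.
    kept_vision = vision[-vision_window:] if vision_window > 0 else []
    kept_text = text[-text_window:] if text_window > 0 else []
    return sorted(sinks + kept_vision + kept_text, key=lambda t: t[0])


if __name__ == "__main__":
    stream = [(i, "vision" if i % 2 == 0 else "text") for i in range(10)]
    print(retain_tokens(stream, n_sink=2, vision_window=1, text_window=3))
```

In this sketch the text window is set larger than the vision window, mirroring the short-vision, long-text asymmetry the abstract describes.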
cs.CL [Back]
[73] Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models cs.CL | cs.AI | cs.LGPDF
Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra
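TL;DR: The paper reveals an overlooked failure mode of test-time scaling (TTS): reducing candidate diversity markedly raises the probability of unsafe outputs. It proposes the RefDiv attack protocol to verify this effect.
Details
Motivation: TTS is conventionally assumed to gain reliability from diverse candidate responses, but the study finds that constrained diversity introduces a risk of unsafe outputs, so this assumption needs to be tested and revisited.
Result: Constraining diversity markedly raises the probability of unsafe outputs, and existing safety classifiers fail to defend effectively against adversarial inputs generated by RefDiv.
Insight: TTS reliability depends on diversity, and safety under reduced diversity must also be considered. More robust TTS strategies are needed.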
Abstract: Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently increases the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard and OpenAI Moderation API) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.
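Best-of-N, one of the two TTS strategies named in the abstract, can be sketched as follows. This is a minimal illustration under stated assumptions: `score` is a hypothetical stand-in for a reward model, and `distinct_ratio` is an illustrative diversity proxy, not the measure used by RefDiv.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (Best-of-N selection)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=score)


def distinct_ratio(candidates: List[str]) -> float:
    """Fraction of candidates that are unique: a crude diversity proxy."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return len(set(candidates)) / len(candidates)


if __name__ == "__main__":
    pool = ["a", "bbb", "cc"]
    # Toy scorer (assumption): longer strings score higher.
    print(best_of_n(pool, score=len))
    print(distinct_ratio(["x", "x", "y", "z"]))
```

When the pool collapses toward near-identical candidates, the selector has little to choose from, which is the condition the paper stress-tests.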
[74] Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech cs.CL | cs.AI | cs.SD | eess.ASPDF
Yuxin Li, Eng Siong Chng, Cuntai Guan
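TL;DR: The paper proposes HAREN-CTC, a speech-based depression detection method built on hierarchical self-supervised representations, which improves detection through multitask learning and cross-modal fusion.
Details
Motivation: Conventional speech-based depression detection struggles to extract effective features and to capture sparse, heterogeneous depressive cues. Existing methods typically use only the final layer of a self-supervised model and fail to exploit the hierarchical speech representations.
Result: The model reaches macro F1 scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming existing methods.
Insight: Hierarchical self-supervised learning and cross-modal fusion capture depressive cues more effectively, and the CTC loss helps handle temporal sparsity.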
Abstract: Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals. To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues. We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
[75] Systematic Diagnosis of Brittle Reasoning in Large Language Models cs.CLPDF
V. S. Raghu Parupudi
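TL;DR: The paper proposes a systematic diagnostic framework for assessing the brittleness of large language models in mathematical reasoning, revealing a sharp performance drop on harder tasks such as combinatorial reasoning.
Details
Motivation: Aggregate benchmark accuracy often masks specific weaknesses in LLM mathematical reasoning, so a finer-grained diagnostic method is needed to identify them.
Result: The model excels at tasks such as sequential calculation, but its accuracy drops sharply on tasks that require combinatorial reasoning.
Insight: The model's reasoning shows a nonhuman-like brittleness. Future work should target specific reasoning modes.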
Abstract: A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent “reasoning modes.” This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.
[76] Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs cs.CL | cs.AI | cs.LG | cs.MAPDF
Aneesh Jonelagadda, Christina Hahn, Haoze Zheng, Salvatore Penachio
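TL;DR: Mnemosyne is a human-inspired, unsupervised long-term memory architecture for LLMs on edge devices. Graph-structured storage and dynamic memory management markedly improve dialogue naturalness and long-term memory.
Details
Motivation: Current LLM long-term memory systems rely on brute-force context expansion or static retrieval pipelines, which cannot run efficiently on resource-constrained edge devices and handle repetitive conversations, such as those in healthcare settings, poorly.
Result: In healthcare dialogue experiments, Mnemosyne achieves a 65.8% win rate versus 31.1% for the RAG baseline, and obtains leading temporal reasoning and single-hop retrieval scores on the LoCoMo benchmark.
Insight: Mimicking the dynamic management of human memory enables efficient and natural long-term memory on edge devices, especially in longitudinal applications.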
Abstract: Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated “core summary” efficiently derived from a fixed-length subset of the memory graph to capture the user’s personality and other domain-specific long-term details such as, using healthcare application as an example, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where repetitive and semantically similar but temporally distinct conversations are limited by naive retrieval. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other same-backboned techniques. Further, the average overall score of 54.6% was second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses can be feasible with an edge-compatible and easily transferable unsupervised memory architecture.
[77] Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations cs.CL | cs.AIPDF
Yu Liu, Hanlei Shi, Haoxun Li, Yuqing Sun, Yuxuan Ding
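TL;DR: The paper proposes a multimodal emotion recognition method centered on emotion hotspots, improving the accuracy of emotion recognition in conversations through local-global fusion and cross-modal alignment.
Details
Motivation: Emotion recognition in conversations (ERC) faces sparse, localized evidence and asynchrony across modalities, so a model must focus on key emotional regions and resolve modality alignment.
Result: On standard ERC benchmarks the model outperforms baseline methods, and ablations of HGF and MoA confirm their contributions.
Insight: Hotspot-focused fusion and alignment offer a new perspective on multimodal learning and may inform the design of future ERC and other multimodal tasks.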
Abstract: Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion, and aligns modalities using a routed Mixture-of-Aligners; a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.
[78] MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation cs.CL | cs.AIPDF
Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao
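TL;DR: MMA-ASIA proposes a multilingual, multimodally aligned framework for evaluating large language models' cultural awareness and understanding in Asian contexts.
Details
Motivation: The multimodal understanding and reasoning of current large language models often degrade outside Western, high-resource settings, particularly in Asian cultural contexts. The paper aims to fill this evaluation gap.
Result: Over 79% of the benchmark questions require multi-step reasoning grounded in cultural context. Methods such as VPR reveal why models diverge across languages and modalities.
Insight: The results underscore the importance of multimodal alignment and cultural context, pointing the way toward culturally reliable multimodal language models.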
Abstract: Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs’ cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects “shortcut learning” by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.
[79] GraphGhost: Tracing Structures Behind Large Language Models cs.CLPDF
Xinnan Dai, Kai Guo, Chung-Hsiang Lo, Shenglai Zeng, Jiayuan Ding
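TL;DR: GraphGhost is a unified framework that represents neuron activations and their signal propagation in LLMs as graphs, revealing how LLMs capture structural semantics from sequential inputs and generate outputs through structurally consistent mechanisms.
Details
Motivation: Although large language models show strong reasoning abilities, the structural mechanisms behind them remain underexplored.
Result: The analysis finds both shared and model-specific reasoning behaviors in LLMs, and shows that intervening on key nodes can cause reasoning collapse.
Insight: The graph representation offers a new view of LLMs' structured reasoning mechanisms and provides tools for analysis and intervention.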
Abstract: Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, yet the structural mechanisms underlying these abilities remain underexplored. In this work, we introduce GraphGhost, a unified framework that represents neuron activations and their signal propagation as graphs, explaining how LLMs capture structural semantics from sequential inputs and generate outputs through structurally consistent mechanisms. This graph-based perspective enables us to employ graph algorithms such as PageRank to characterize the properties of LLMs, revealing both shared and model-specific reasoning behaviors across diverse datasets. We further identify the activated neurons within GraphGhost and evaluate them through structural interventions, showing that edits to key neuron nodes can trigger reasoning collapse, altering both logical flow and semantic understanding. Together, these contributions position GraphGhost as a powerful tool for analyzing, intervening in, and ultimately understanding the structural foundations of reasoning in LLMs.
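The abstract mentions applying PageRank to a graph of neuron activations. The sketch below shows standard PageRank by power iteration on a small directed graph. The node names and edges are made up for illustration, and how GraphGhost actually builds its graph from activations is not specified here.

```python
from typing import Dict, List


def pagerank(
    graph: Dict[str, List[str]],
    damping: float = 0.85,
    iters: int = 100,
) -> Dict[str, float]:
    """PageRank by power iteration on an adjacency-list graph.

    `graph[v]` lists the nodes that v points to. Nodes with no outgoing
    edges (dangling nodes) spread their rank uniformly over all nodes.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    ordered = sorted(nodes)
    n = len(ordered)
    rank = {v: 1.0 / n for v in ordered}
    for _ in range(iters):
        new_rank = {v: (1.0 - damping) / n for v in ordered}
        for v in ordered:
            out = graph.get(v, [])
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
            else:
                share = damping * rank[v] / n
                for u in ordered:
                    new_rank[u] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph: two nodes feed "c", which feeds back to "a".
    toy = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for node, score in sorted(pagerank(toy).items()):
        print(node, round(score, 4))
```

Nodes with high rank are the ones many paths flow through, which is the kind of "key node" a structural intervention would target.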
[80] Iterative LLM-Based Generation and Refinement of Distracting Conditions in Math Word Problems cs.CLPDF
Kaiqi Yang, Hang Li, Yucheng Chu, Zitao Liu, Mi Tian
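TL;DR: The paper proposes an iterative LLM-based framework that automatically generates distracting conditions for math word problems, avoiding the high cost of manual revision while keeping each problem's solution unchanged.
Details
Motivation: Existing math word problem datasets lack high-quality distracting conditions, which reduces the credibility of LLM benchmarking, and adding distracting conditions by hand is time-consuming.
Result: The method is efficient and easy to deploy, and can quickly generate high-quality math word problems with distracting conditions, addressing the limitations of existing datasets.
Insight: LLM generation can automate data augmentation, with prompt engineering ensuring the logical consistency of the data.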
Abstract: Mathematical reasoning serves as a crucial testbed for evaluating the intelligence of large language models (LLMs), and math word problems (MWPs) represent one of the most widely used formats. Most existing MWP datasets contain only the necessary information, while problems with distracting or excessive conditions are often overlooked. Prior studies have shown that popular LLMs experience a dramatic performance drop when such distracting conditions are introduced. However, available datasets of MWPs with distracting conditions remain limited, and most exhibit low difficulty and out-of-context expressions. These shortcomings make the distracting conditions easy to detect and disregard, thereby reducing the credibility of benchmarking on these datasets. Moreover, when distracting conditions are added, the reasoning process and answers may change, requiring intensive manual effort to check and rewrite solutions. To address these issues, we design an iterative framework that leverages LLMs to generate distracting conditions automatically. We develop a set of prompts to revise MWPs from multiple perspectives and cognitive levels, encouraging the creation of meaningful distracting conditions as well as suggestions for further refinement. A key advantage of our framework is the preservation of shared solutions between the original and revised problems: the LLMs are explicitly guided to generate distractions that do not alter the original solution, thus eliminating the need to produce new answers. This framework is efficient and easy to deploy, substantially reducing the effort required to generate MWPs with distracting conditions while maintaining high data quality.
[81] LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests cs.CLPDF
Juan Miguel Navarro Carranza
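TL;DR: By comparing LLM performance on original questions and their paraphrased versions, the paper reveals brittleness under surface-form changes, indicating that benchmark scores may be inflated by memorization or near-duplicate questions.
Details
Motivation: LLM benchmark scores may be overestimated because of memorized test items or near duplicates. The author uses paraphrased questions to probe true generalization.
Result: Paraphrasing causes a non-trivial drop in accuracy, consistent with earlier concerns that models are sensitive to surface-form changes.
Insight: Model performance may depend on a question's surface form rather than its underlying semantics, which points to limitations in current benchmarks and the need for more rigorous tests of generalization.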
Abstract: Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
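The accuracy gap the abstract measures can be sketched as below. The predictions and gold labels are invented for illustration; the function names are assumptions for this example and do not come from the paper's pipeline.

```python
from typing import List


def accuracy(preds: List[str], gold: List[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if not gold or len(preds) != len(gold):
        raise ValueError("preds and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def paraphrase_gap(
    orig_preds: List[str],
    para_preds: List[str],
    gold: List[str],
) -> float:
    """Accuracy on original items minus accuracy on paraphrased items.

    A positive value means the model does worse once questions are reworded.
    """
    return accuracy(orig_preds, gold) - accuracy(para_preds, gold)


if __name__ == "__main__":
    gold = ["A", "B", "C", "D"]            # made-up multiple-choice answers
    on_original = ["A", "B", "C", "A"]     # 3 of 4 correct
    on_paraphrase = ["A", "C", "C", "A"]   # 2 of 4 correct
    print(paraphrase_gap(on_original, on_paraphrase, gold))
```

Because the same gold labels are used for both conditions, the gap isolates the effect of rewording, provided the paraphrases preserve meaning.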
[82] PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction cs.CL | cs.LGPDF
Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla
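TL;DR: PARSE is a new system that uses large language models (LLMs) to optimize JSON schemas for better entity extraction. Its two components, ARCHITECT and SCOPE, handle schema optimization and reliable extraction, markedly improving accuracy and reducing errors.
Details
Motivation: Existing methods that apply LLMs directly to entity extraction treat JSON schemas as static contracts, leading to poor performance, frequent hallucinations, and unreliable agent behavior. PARSE addresses these issues by optimizing the schemas and adding reliability mechanisms.
Result: On SWDE, PARSE improves extraction accuracy by up to 64.7% and reduces extraction errors by 92% within the first retry, while keeping latency practical.
Insight: The JSON schema itself can be a target of LLM optimization, and adjusting it dynamically can significantly improve reliability and performance.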
Abstract: Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constrained decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure, contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.
[83] Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B cs.CLPDF
Nisar Ahmed, Muhammad Imran Zaman, Gulshan Saleem, Ali Hassan
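TL;DR: The paper studies how large language models (LLMs) behave differently under evaluation settings and real deployment, finding that evaluation-oriented prompts markedly lengthen reasoning and improve surface formatting, while accuracy gains are limited or inconsistent.
Details
Motivation: LLM benchmarks often rely on prompts that request explicit reasoning and strict formatting, whereas real applications need terse, contract-bound answers. The paper asks whether such evaluation-oriented prompts inflate measured performance rather than reflect real capability gains.
Result: Evaluation-oriented prompts markedly lengthen chains of thought (hundreds to more than 1,000 characters) and reduce answer-only compliance, with limited accuracy gains. Incentive wording changes the composition of errors: praising caution modestly improves accuracy, whereas praising competence shortens answers but increases risk.
Insight: Evaluation-oriented prompts may inflate a model's apparent performance rather than its actual capability. In practice, neutral phrasing or dual-framing checks, contract-aware grading, and style-delta reporting give a truer picture of deployable capability.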
Abstract: Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such “evaluation scent” inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.
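A deterministic validator of the kind the abstract describes, here for answer-only compliance on numeric answers, might look like the sketch below. The regular expression and function names are assumptions for illustration; the paper's own validators and contracts are not reproduced here.

```python
import re
from typing import List

# Assumed contract for this sketch: the whole response is a single
# integer or decimal number, optionally signed, with optional whitespace.
ANSWER_ONLY = re.compile(r"\s*-?\d+(?:\.\d+)?\s*")


def answer_only_compliant(response: str) -> bool:
    """True if the response contains nothing but a single number."""
    return ANSWER_ONLY.fullmatch(response) is not None


def compliance_rate(responses: List[str]) -> float:
    """Share of responses that satisfy the answer-only contract."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(answer_only_compliant(r) for r in responses) / len(responses)


if __name__ == "__main__":
    samples = ["42", "Let me think step by step... 42", " -3.5 \n"]
    print([answer_only_compliant(s) for s in samples])
    print(compliance_rate(samples))
```

Running the same validator on responses from the evaluation-framed and real-world-framed prompts gives the paired comparison the A/B design calls for.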
[84] From What to Why: Thought-Space Recommendation with Small Language Models cs.CL | cs.AIPDF
Prosenjit Biswas, Pervez Shaik, Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar
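TL;DR: The paper proposes PULSE, a method that uses small language models (SLMs) to generate embeddings for recommender systems. It jointly models user actions (what) and their semantic drivers (why) to improve robustness and generalization.
Details
Motivation: Large language models (LLMs) reason well but are costly to deploy. Small language models (SLMs) are efficient, but their reasoning abilities are underused in recommendation. Existing systems fail to fully exploit natural language rationales as learning signals.
Result: PULSE outperforms existing methods on multiple benchmark datasets, shows advantages in cross-domain recommendation, and performs strongly on reasoning-oriented question answering.
Insight: Modeling rationales as first-class signals improves the robustness and generalization of embeddings and points to the potential of SLMs in recommender systems.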
Abstract: Large Language Models (LLMs) have advanced recommendation capabilities through enhanced reasoning, but pose significant challenges for real-world deployment due to high inference costs. Conversely, while Small Language Models (SLMs) offer an efficient alternative, their reasoning capabilities for recommendation remain underexplored. Existing systems often use natural language rationales merely as unsupervised descriptive text, failing to harness their full potential as learning signals. In this work, our main idea is to create a common understanding of users and items across multiple domains, called Thought Space, with SLMs instead of using LLMs’ distilled knowledge. To that end we propose PULSE (Preference Understanding by Latent Semantic Embeddings), a framework that treats SLM-generated rationales as direct learning signals, supervising them with interaction histories to jointly model user actions (what) and their semantic drivers (why). Existing methods consider only interactions such as sequences and embeddings, whereas PULSE treats rationales as first-class signals. This novel design yields embeddings that are more robust and generalizable. Extensive experiments demonstrate that PULSE outperforms leading ID, Collaborative Filtering (CF), and LLM-based sequential recommendation models across multiple benchmark datasets. Furthermore, PULSE exhibits superior transferability in cross-domain recommendation and demonstrates strong performance on downstream tasks such as reasoning-oriented question answering. Our code is available at https://anonymous.4open.science/r/Thinking_PULSE-0FC5/README.md.
[85] ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection cs.CLPDF
Jingbiao Mei, Mingsheng Sun, Jinghong Chen, Pengda Qin, Yuhong Li
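TL;DR: The paper proposes ExPO-HM, which combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both a metric and a reward for reasoning quality, achieving strong hateful meme detection.
Details
Motivation: Hateful memes are a hard-to-detect form of online abuse. Most existing methods perform direct detection and output only binary predictions, lacking context and explanation. Recent Explain-then-Detect methods underperform, even falling below simple baselines. The analysis finds that models fail to generate policy-relevant cues (such as targets and attack types) and that a binary reward signal is insufficient to guide reasoning.
Result: ExPO-HM performs strongly on three hateful meme benchmarks, with F1 improvements of up to 15% and 17% over the GRPO and DPO baselines, respectively.
Insight: Explanation-driven detection, rather than simple binary alarms, is key to accurate, interpretable, and actionable hateful meme detection.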
Abstract: Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.
[86] Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression cs.CL | cs.AIPDF
Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Shaochu Zhang, Shengchao Liu
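TL;DR: The paper proposes Upfront CoT (UCoT), an efficient reasoning framework that uses upfront thought embeddings to shorten Chain-of-Thought (CoT), reducing compute cost and latency while preserving strong reasoning ability.
Details
Motivation: Long CoT incurs high compute cost and latency because of autoregressive generation, while existing compression methods either require manually designed prompts or sacrifice key reasoning details. UCoT aims to address these problems.
Result: On GSM8K, UCoT cuts the token usage of Qwen2.5-7B-Instruct by 50% while performing 3.08% better than the SOTA method.
Insight: UCoT shows that cooperation between a small and a large model can compress CoT efficiently without sacrificing reasoning ability, offering a new approach to LLM reasoning optimization.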
Abstract: Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), while long CoT suffers from high computational costs and significant latency losses owing to the autoregressive nature of generative LLMs. CoT compression aims to improve efficiency in the reasoning process by reducing output length. Previous works trade reasoning efficiency by either laborious discrete prompt designing or the construction of external compressed CoT datasets that sacrifice key reasoning details. In this work, we propose Upfront CoT (UCoT): an efficient reasoning framework with upfront thought embedding to automate CoT compression. UCoT is a cooperative workflow involving a small model (compressor) and a large model (executor). The first stage of UCoT trains compressor to generate upfront thought embeddings rich in reasoning information for the executor, avoiding the drawbacks of manually designed prompts. The second stage optimizes executor to utilize upfront thought embeddings to derive the correct answer with short reasoning, using a reward mechanism. Extensive experiments show that UCoT maintains the powerful reasoning ability of executor while significantly reducing the length of CoT. It is worth mentioning that when applying UCoT to the Qwen2.5-7B-Instruct model, the usage of tokens on GSM8K dataset is reduced by 50%, while the performance is 3.08% higher than that of the state-of-the-art (SOTA) method. The code and dataset are in supplementary material.
[87] Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning cs.CL | 68T50 | I.2.7; I.2.4PDF
Li Zhang, Matthias Grabmair, Morgan Gray, Kevin Ashley
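TL;DR: The paper studies the ability of large language models (LLMs) in hierarchical legal reasoning. It proposes a framework of three-stage reasoning tasks and finds that models do well on surface-level reasoning but poorly on hierarchical and integrated analysis.
Details
Motivation: Case-based reasoning is central to legal practice, yet LLMs' ability in this complex form of reasoning has not been adequately studied.
Result: LLMs perform well on surface-level reasoning (Task 1), but performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Incorrect responses also consume more computation than correct ones.
Insight: For LLMs on complex tasks, "thinking longer" does not mean "thinking smarter", which reveals fundamental limitations in legal reasoning.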
Abstract: Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that “thinking longer” does not always mean “thinking smarter.” Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.
[88] Coordinates from Context: Using LLMs to Ground Complex Location References cs.CL | cs.AIPDF
Tessa Masis, Brendan O’Connor
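TL;DR: The paper proposes an LLM-based strategy for complex geocoding, specifically compositional location references, and shows that a small fine-tuned LLM can match much larger off-the-shelf models.
Details
Motivation: Geocoding, the mapping of location references to actual geographic locations, is an essential task, but compositional location references are challenging to handle. The study evaluates LLMs' geospatial knowledge and reasoning abilities and proposes an improved approach.
Result: The proposed approach improves geocoding performance, and a small fine-tuned LLM performs comparably to much larger off-the-shelf models.
Insight: LLMs' geospatial reasoning can be put to effective use, and small models can also perform well on the task after fine-tuning.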
Abstract: Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs’ abilities to reason over geospatial data, we evaluate LLMs’ geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.
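[89] Measuring Moral LLM Responses in Multilingual Capacities cs.CL | cs.AI
Kimaya Basu, Savi Kolari, Allison Yu
TL;DR: The paper evaluates frontier and leading open-source LLMs along five dimensions in multilingual settings. GPT-5 performs best, while other models are less consistent across languages and categories, with the largest gaps in the Consent & Autonomy and Harm Prevention & Safety categories.
Details
Motivation: As LLMs are used widely across languages, it becomes essential to study and guardrail the morality and consistency of their multilingual responses.
Result: GPT-5 performed best on average in each of the five dimensions, while other models such as Gemini 2.5 Pro scored markedly lower in some categories.
Insight: Language shifts noticeably affect how different LLMs respond, especially in morality-related categories, which need further testing and improvement.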
Abstract: With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.
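[90] Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective cs.CL | cs.AI
Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng
TL;DR: The paper introduces CCMOR, a Chinese commonsense reasoning benchmark for evaluating LLMs on multi-step reasoning. It exposes their limitations on long-tail knowledge and knowledge-intensive reasoning, and shows that retrieval-augmented generation substantially improves performance.
Details
Motivation: Although LLMs show strong reasoning abilities, their comprehensive evaluation in Chinese-language contexts remains understudied. The paper fills this gap with a multi-hop reasoning benchmark designed for Chinese.
Result: Experiments show that current LLMs fall clearly short on Chinese long-tail knowledge and knowledge-intensive reasoning, while retrieval-augmented generation (RAG) yields significant gains.
Insight: The paper exposes gaps in LLMs' knowledge coverage in Chinese contexts and confirms the potential of retrieval augmentation for multi-hop reasoning.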
Abstract: While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs’ ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs’ ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.
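[91] MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding cs.CL
Siddeshwar Raghavan, Tanwi Mallick
TL;DR: MOSAIC is a training-free multi-agent LLM framework for complex scientific coding tasks. Its purpose-built agents self-reflect, write rationales, and debug, which improves task decomposition and error correction.
Details
Motivation: Scientific coding requires rigorous algorithms, deep domain knowledge, and domain-specific reasoning, and often involves a chain of subproblems, which conventional approaches handle poorly.
Result: On scientific coding benchmarks, MOSAIC outperforms existing approaches in accuracy, robustness, and interpretability.
Insight: Multi-agent collaboration combined with context consolidation improves performance on complex scientific tasks and suggests a new direction for automating scientific coding.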
Abstract: We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, interconnected with deep domain knowledge, and incorporate domain-specific reasoning, as well as algorithm iteration without requiring I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training-free framework with specially designed agents to self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.
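[92] The Model’s Language Matters: A Comparative Privacy Analysis of LLMs cs.CL | cs.CR
Abhishek K. Mishra, Antoine Boutet, Lucas Magnana
TL;DR: The paper studies how privacy leakage in LLMs differs across languages and finds that language structure significantly affects privacy risk. Analysing English, Spanish, French, and Italian medical corpora, it shows that Italian leaks the most, while French and Spanish are more resilient thanks to higher morphological complexity.
Details
Motivation: As LLMs handle more sensitive data in multilingual applications, their scale and linguistic variability introduce major privacy risks. Existing privacy evaluations, however, focus mostly on English.
Result: Privacy leakage scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, English shows higher membership separability under membership inference, and French and Spanish are more resilient because of their higher morphological complexity.
Insight: Morphological complexity is associated with greater resilience to leakage, which informs the design of language-aware privacy-preserving mechanisms.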
Abstract: Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. Whereas privacy has mostly been evaluated for English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.
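[93] Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs cs.CL
Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu
TL;DR: The paper proposes Search-on-Graph (SoG), a simple iterative navigation framework that lets LLMs reason over knowledge graphs (KGs) and addresses the limitations of existing methods on multi-hop knowledge question answering.
Details
Motivation: LLMs are unreliable on knowledge-intensive multi-hop questions: they miss long-tail facts and hallucinate. KGs offer structured evidence, but existing methods face fundamental trade-offs among query compilation, subgraph retrieval, and search efficiency.
Result: Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning, with a 16% improvement on Wikidata benchmarks.
Insight: Combining iterative navigation with adaptive filtering is key to efficient KG reasoning, and a simple, direct design can outperform more complex approaches.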
Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions – they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed Search function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an “observe-then-navigate” principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.
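The observe-then-navigate loop described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy graph, the `search` signature, and the `scripted_chooser` callback, which stands in for the LLM's decision at each hop.

```python
from typing import Callable, Dict, List, Optional

KG = Dict[str, Dict[str, List[str]]]

# Toy knowledge graph: entity -> relation -> target entities (illustrative only).
toy_kg: KG = {
    "Marie Curie": {"born_in": ["Warsaw"], "field": ["Physics"]},
    "Warsaw": {"capital_of": ["Poland"]},
    "Poland": {},
}

def search(kg: KG, entity: str, relation: Optional[str] = None) -> List[str]:
    """Hypothetical Search call: list an entity's relations, or follow one of them."""
    edges = kg.get(entity, {})
    if relation is None:
        return sorted(edges)
    return edges.get(relation, [])

def navigate(kg: KG, start: str,
             choose: Callable[[str, List[str]], Optional[str]],
             max_hops: int = 5) -> List[str]:
    """Walk the graph one hop at a time, observing relations before each choice."""
    path = [start]
    current = start
    for _ in range(max_hops):
        relations = search(kg, current)          # observe what is actually available
        if not relations:
            break
        relation = choose(current, relations)    # decide the next hop (an LLM in the paper)
        if relation is None:
            break
        targets = search(kg, current, relation)
        if not targets:
            break
        current = targets[0]                     # simplification: follow the first target
        path.append(current)
    return path

def scripted_chooser(entity: str, relations: List[str]) -> Optional[str]:
    """Stand-in for the LLM: a fixed preference table."""
    preferred = {"Marie Curie": "born_in", "Warsaw": "capital_of"}
    choice = preferred.get(entity)
    return choice if choice in relations else None

path = navigate(toy_kg, "Marie Curie", scripted_chooser)
print(path)
```

The chooser only ever sees relations that exist at the current entity, which is the property the abstract contrasts with pre-planned paths.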
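[94] Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models cs.CL | cs.AI | cs.CR
Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma
TL;DR: PE-CoA is a framework of five conversation patterns for constructing effective multi-turn jailbreaks. It shows how different conversation patterns expose LLM vulnerabilities and achieves state-of-the-art performance on twelve LLMs.
Details
Motivation: Existing methods rely on heuristic or ad hoc strategies and offer limited insight into underlying model weaknesses. The paper explores the relationship between conversation patterns and model vulnerabilities and proposes a systematic attack framework.
Result: PE-CoA achieves state-of-the-art performance across twelve LLMs and ten harm categories, and finds that robustness to one conversation pattern does not generalize to others.
Insight: Safety training has limitations, and defenses need to be aware of specific conversation patterns.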
Abstract: Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available on: https://github.com/Ragib-Amin-Nihal/PE-CoA
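[95] Quality Estimation Reranking for Document-Level Translation cs.CL
Krzysztof Mrozinski, Minji Kang, Ahmed Khota, Vincent Michael Sutanto, Giovanni Gatti De Giacomo
TL;DR: The paper studies quality estimation (QE) reranking for document-level machine translation and shows that it improves translation quality with both learned and LLM-based QE metrics.
Details
Motivation: QE reranking is known to be effective at the sentence level, but its application to document-level translation remains underexplored. The paper fills this gap.
Result: With 32 candidates, reranking with SLIDE improves BLEURT-20 by +5.09 and reranking with GEMBA-DA improves it by +4.30. On the longest documents (512-1024 source tokens), gains shrink to +2.34 and +1.40 but remain positive.
Insight: Document-level QE reranking has practical value, works better with more candidates, and adds little runtime overhead given suitable translation models and hardware.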
Abstract: Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
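As a rough illustration of the reranking step this abstract describes, the sketch below scores every candidate translation with a reference-free metric and keeps the best one. The `toy_qe_score` function is a placeholder invented for this example; it is not SLIDE or GEMBA-DA, and a real setup would call a learned or LLM-based QE model instead.

```python
from typing import Callable, Sequence

def rerank(source: str, candidates: Sequence[str],
           qe_score: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest quality-estimation score."""
    if not candidates:
        raise ValueError("need at least one candidate translation")
    return max(candidates, key=lambda hyp: qe_score(source, hyp))

def toy_qe_score(source: str, hypothesis: str) -> float:
    """Placeholder metric: penalise word-count mismatch between source and hypothesis."""
    return -abs(len(source.split()) - len(hypothesis.split()))

source_doc = "one two three four"
pool = ["a b", "a b c d", "a b c d e f g"]
best = rerank(source_doc, pool, toy_qe_score)
print(best)
```

Because the metric needs no reference translation, the same function applies to a pool of any size, such as the 2 or 32 candidates reported in the abstract.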
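[96] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs cs.CL | cs.CE | cs.IR
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao
TL;DR: FinAuditing is a benchmark for financial auditing tasks that evaluates how well LLMs reason over structured, multi-document financial data. It reveals systematic limitations of current LLMs on semantic, relational, and numerical consistency tasks.
Details
Motivation: LLMs perform well on unstructured text, but their ability to reason over taxonomy-driven, structured financial documents remains largely unexplored, which calls for a dedicated benchmark.
Result: Current LLMs perform inconsistently on structured multi-document tasks, with accuracy drops of up to 60-90%, exposing their limitations in financial reasoning.
Insight: FinAuditing lays a foundation for trustworthy, structure-aware financial intelligence systems while exposing the weaknesses of LLMs on structured financial data.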
Abstract: The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.
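[97] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR cs.CL | cs.AI
Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu
TL;DR: The paper proposes a multi-temperature strategy that explicitly promotes exploration during sampling by treating high-entropy reasoning tokens and low-entropy knowledge tokens differently, which improves LLM reasoning performance.
Details
Motivation: Prior approaches encourage exploration indirectly by restricting updates and do not explicitly promote exploration at the token generation stage. The authors use multiple temperatures to adjust exploration explicitly for different token types (high-entropy reasoning tokens and low-entropy knowledge tokens).
Result: On several reasoning benchmarks, the method significantly improves LLM reasoning performance.
Insight: Distinguishing token types and setting temperatures accordingly balances exploration against factual correctness.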
Abstract: Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.
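One way to picture the token-level idea in this abstract is a sampler that picks its temperature from the entropy of the next-token distribution. The sketch below is only an illustration under stated assumptions: the entropy threshold, the two temperature values, and the rule that classifies a position as a reasoning or knowledge token are all hypothetical and are not taken from the paper or its code.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_temperature(logits, threshold=1.0, t_high=1.2, t_low=0.7):
    """Assumed rule: high-entropy positions get t_high, low-entropy positions get t_low."""
    return t_high if entropy(softmax(logits)) > threshold else t_low

def sample_token(logits, rng, **kwargs):
    """Sample one token index using the entropy-dependent temperature."""
    temperature = pick_temperature(logits, **kwargs)
    probs = softmax(logits, temperature)
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token, temperature

rng = random.Random(0)
peaked_logits = [10.0, 0.0, 0.0, 0.0]   # confident, "knowledge-like" position
flat_logits = [0.0, 0.0, 0.0, 0.0]      # uncertain, "reasoning-like" position
print(sample_token(peaked_logits, rng)[1], sample_token(flat_logits, rng)[1])
```

The entropy is measured at temperature 1 so that the classification does not depend on the temperature it selects.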
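[98] A Unified Biomedical Named Entity Recognition Framework with Large Language Models cs.CL | cs.AI
Tengxiao Lv, Ling Luo, Juntao Li, Yanhua Wang, Yuchen Pan
TL;DR: The paper proposes a unified biomedical named entity recognition framework based on LLMs. It recasts the task as text generation to handle nested entities and ambiguous boundaries, uses bilingual joint fine-tuning to improve multilingual and multi-task generalization, and adds a contrastive-learning entity selector to filter spurious predictions.
Details
Motivation: Biomedical named entity recognition (BioNER) is critical for medical information extraction and knowledge discovery, but existing methods struggle with nested entities, boundary ambiguity, and cross-lingual generalization.
Result: The method achieves state-of-the-art performance on four benchmark datasets and two unseen corpora, with robust zero-shot cross-lingual generalization.
Insight: 1. Framing the task as text generation handles nested entities and ambiguous boundaries effectively; 2. Bilingual joint training markedly improves multilingual generalization; 3. The contrastive-learning entity selector reliably filters noisy predictions.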
Abstract: Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source codes are freely available at https://github.com/dreamer-tx/LLMNER.
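[99] Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors cs.CL
Xin Liu, RunSong Zhao, PengCheng Huang, XinYu Liu, JunYi Xiao
TL;DR: The paper proposes Semantic-Anchor Compression (SAC), a method that needs no autoencoding training. It selects key anchor tokens directly from the context and aggregates information into them, improving context compression for LLMs.
Details
Motivation: Existing context compression methods usually rely on autoencoding tasks, which optimize for reconstruction rather than downstream tasks and therefore weaken features useful in practice. SAC targets this mismatch.
Result: SAC outperforms existing methods across compression ratios, achieving a 1 EM improvement at 5x compression on out-of-distribution MRQA evaluation, with larger advantages at higher ratios.
Insight: SAC's core idea is to sidestep the drawbacks of autoencoding training by deriving compressed representations directly from context tokens, offering a new route to efficient LLM inference.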
Abstract: Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for a reconstruction objective that diverges from actual downstream tasks, thereby weakening the features more beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding-task-based compression to an architecture that is equipped with this compression capability a priori. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves a 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.
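[100] Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions cs.CL
Nicholas Deas, Kathleen McKeown
TL;DR: The paper studies whether LLMs' internal representations contain patterns resembling human impressions and stereotypes, which it calls “artificial impressions”. It fits linear probes on generated prompts to predict impressions under the Stereotype Content Model (SCM), then analyses how these impressions relate to model behavior and which prompt features shape them.
Details
Motivation: The goal is to understand whether LLMs' internal representations implicitly encode human-like impressions or stereotypes, and how these patterns affect generation.
Result: 1) LLMs report impressions inconsistently when prompted; 2) impressions are more consistently linearly decodable from hidden representations; 3) artificial impressions predict response quality and the use of hedging; 4) prompt content, style, and dialect features affect artificial impressions.
Insight: The study reveals bias patterns that LLMs may carry implicitly when generating, offering a new perspective for understanding stereotyped behavior and improving models.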
Abstract: We introduce and study artificial impressions: patterns in LLMs’ internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.
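[101] SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures cs.CL
Jiaming Wang, Zhe Tang, Yilin Jin, Peng Ding, Xiaoyu Li
TL;DR: The paper proposes SOP-Maze, a benchmark for evaluating LLMs on complex business procedures, and finds that current models struggle to follow procedures, handle dialogue, and carry out complex calculations.
Details
Motivation: Existing benchmarks do not adequately evaluate LLMs on complex business standard operating procedures, so a benchmark grounded in real business scenarios is needed.
Result: Nearly all state-of-the-art models struggle on SOP-Maze, with three main error types: difficulty following procedures, fragile dialogue handling, and errors in complex calculations.
Insight: Models remain clearly limited on complex business procedures, especially in logical depth and dialogue interaction.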
Abstract: As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on https://github.com/ADoublLEN/SOP-Maze.
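[102] A Human Behavioral Baseline for Collective Governance in Software Projects cs.CL | cs.AI
Mobina Noori, Mahasweta Chakraborti, Amy X Zhang, Seth Frey
TL;DR: The paper studies how open source communities describe participation and control through version-controlled governance documents. Analysing text changes across 710 projects, it finds that roles and actions grow in number and become more evenly distributed over time, while rules stay stable.
Details
Motivation: Governance structure is crucial to the success of open source projects, but methods for quantifying change in governance documents are lacking. The paper fills this gap.
Result: Roles and actions increase over time and are distributed more evenly, while the composition of rules remains stable, indicating that governance grows by expanding and balancing categories of participation rather than through major rule changes.
Insight: Governance grows through changes in roles and actions rather than in rules, which provides a baseline for assessing how authority is distributed when AI-mediated workflows arrive.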
Abstract: We study how open source communities describe participation and control through version controlled governance documents. Using a corpus of 710 projects with paired snapshots, we parse text into actors, rules, actions, and objects, then group them and measure change with entropy for evenness, richness for diversity, and Jensen Shannon divergence for drift. Projects define more roles and more actions over time, and these are distributed more evenly, while the composition of rules remains stable. These findings indicate that governance grows by expanding and balancing categories of participation without major shifts in prescriptive force. The analysis provides a reproducible baseline for evaluating whether future AI mediated workflows concentrate or redistribute authority.
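The three measures named in this abstract can be computed from category counts in a few lines. The sketch below is an illustration only: the role labels are invented, entropy is reported unnormalised in bits, and the paper may define evenness or group its categories differently.

```python
import math
from collections import Counter

def distribution(labels):
    """Relative frequency of each category label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def richness(labels):
    """Number of distinct categories."""
    return len(set(labels))

def shannon_entropy(dist):
    """Entropy in bits; higher values mean a more even distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical role mentions in two snapshots of one project's governance file.
before = ["maintainer", "maintainer", "maintainer", "contributor"]
after = ["maintainer", "maintainer", "contributor", "contributor", "reviewer", "reviewer"]

p, q = distribution(before), distribution(after)
print(richness(before), richness(after))
print(shannon_entropy(p), shannon_entropy(q))
print(js_divergence(p, q))
```

In this toy pair the later snapshot has more roles and a flatter distribution, which is the direction of change the abstract reports for roles and actions.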
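[103] MASA: LLM-Driven Multi-Agent Systems for Autoformalization cs.CL | cs.FL
Lan Zhang, Marco Valentino, André Freitas
TL;DR: MASA is a framework for building LLM-driven multi-agent systems that convert natural language into formal representations. Its design emphasizes modularity and extensibility, and experiments show its effectiveness on formal mathematics.
Details
Motivation: Translating between natural language and formal reasoning is important but hard. MASA aims to make autoformalization more efficient and reliable through multi-agent collaboration and LLMs.
Result: Use cases on real-world mathematical definitions and experiments on formal mathematics datasets demonstrate MASA's effectiveness.
Insight: Combining multi-agent systems with LLMs opens new possibilities for formal reasoning, and the modular, extensible design offers a reference for future work.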
Abstract: Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.
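[104] DARO: Difficulty-Aware Reweighting Policy Optimization cs.CL
Jingyu Zhou, Lu Ma, Hao Liang, Chengyu Shen, Bin Cui
TL;DR: The paper proposes DARO, which dynamically adjusts the loss contribution of samples at different difficulty levels. This addresses the static weighting used by existing RLVR methods and improves performance on mathematical reasoning tasks.
Details
Motivation: Existing RLVR methods (GRPO and its variants) rely on static or overly simple weighting schemes that cannot adapt to a model's evolving capabilities, which leads to imbalanced training.
Result: On six math benchmarks, DARO outperforms four baselines on Qwen2.5-Math and Llama3.1 models, converging faster and reaching better final performance.
Insight: Weighting strongly affects training balance and performance, and adapting the weights to the model's current ability is key.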
Abstract: Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model’s evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model’s learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
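[105] LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction cs.CL
Shengmin Piao, Jieun Lee, Sanghyun Park
TL;DR: LitE-SQL is a lightweight, efficient Text-to-SQL framework that achieves strong performance with far fewer parameters through vector-based schema linking and execution-guided self-correction.
Details
Motivation: LLM-based Text-to-SQL methods rely on proprietary models, which limits deployment feasibility and raises data privacy concerns. LitE-SQL offers a lightweight alternative.
Result: LitE-SQL reaches 72.10% execution accuracy on BIRD and 88.45% on Spider 1.0, matching or exceeding LLM-based methods with 2x to 30x fewer parameters.
Insight: Lightweight models can deliver high-quality Text-to-SQL generation, a practical option for privacy-sensitive and resource-constrained settings.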
Abstract: The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raises concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages (supervised fine-tuning followed by execution-guided reinforcement), enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
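Schema linking over pre-computed embeddings, as mentioned in this abstract, amounts to a nearest-neighbour lookup. The sketch below shows that lookup with cosine similarity over hand-written three-dimensional vectors. The column names, vectors, and the `retrieve_schema` helper are hypothetical; the paper's retriever uses a vector database and learned embeddings whose details are not given here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_schema(question_vec, schema_index, top_k=2):
    """Return the top_k schema items most similar to the question embedding."""
    ranked = sorted(schema_index.items(),
                    key=lambda item: cosine(question_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical pre-computed embeddings for three columns.
schema_index = {
    "orders.total": [0.9, 0.1, 0.0],
    "customers.name": [0.1, 0.9, 0.0],
    "products.sku": [0.0, 0.1, 0.9],
}
question_vec = [0.8, 0.3, 0.0]  # stand-in embedding of a question about order totals

linked = retrieve_schema(question_vec, schema_index)
print(linked)
```

Only the retrieved columns would then be passed to the SQL generator, which keeps the prompt short for a small model.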
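[106] Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation cs.CL | cs.AI | cs.LG | I.2.7; I.2.6; I.2.11
Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt
TL;DR: The paper presents Alif-1.0-8B-Instruct, a multilingual Urdu-English LLM trained on a high-quality multilingual synthetic dataset built with a modified self-instruct technique. It improves performance on Urdu tasks within a training budget of under $100.
Details
Motivation: Building high-performing LLMs for low-resource languages such as Urdu faces scarce high-quality datasets, multilingual inconsistencies, and safety concerns. Existing approaches translate available data, but the results lack quality.
Result: Alif-1.0-8B-Instruct outperforms Llama-3.1-8B-Instruct and leading multilingual models on Urdu tasks, within a training budget of under $100.
Insight: The modified self-instruct technique makes it possible to build high-performing, culturally aligned models for low-resource languages efficiently, without costly data translation and annotation.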
Abstract: Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct for Urdu-specific tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performance LLMs for low-resource languages can be developed efficiently and with cultural alignment using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.
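[107] ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability cs.CL
Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng
TL;DR: ReFIne is a training framework for more trustworthy large reasoning models. It targets interpretability, faithfulness, and reliability through structured tag-based traces, high-level planning, and related techniques.
Details
Motivation: Existing long chain-of-thought reasoning models prioritize answer accuracy and token efficiency while overlooking key dimensions of trustworthiness (interpretability, faithfulness, and reliability), which limits their usability and trust in practice.
Result: On mathematical benchmarks, ReFIne models improve interpretability by 44.0%, faithfulness by 18.8%, and reliability by 42.4%.
Insight: Reasoning models should be optimized not only for accuracy but also for the multiple dimensions of trustworthiness, which matters for real-world deployment.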
Abstract: Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation’s soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine
[108] FrameEOL: Semantic Frame Induction using Causal Language Models cs.CLPDF
Chihiro Yano, Kosuke Yamada, Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
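TL;DR: FrameEOL is a new semantic frame induction method based on causal language models (CLMs). It uses prompting and deep metric learning to obtain more suitable embeddings, and outperforms existing methods on the English and Japanese FrameNet datasets.
Details
Motivation: Although CLMs such as the GPT and Llama series perform well on many language understanding tasks, they have not yet been applied to semantic frame induction. This paper fills that gap and explores what CLMs can do on the task.
Result: The method outperforms existing approaches on the English and Japanese FrameNet datasets. For Japanese, which lacks extensive frame resources, only 5 ICL examples are enough to match the MLM-based method fine-tuned with DML.
Insight: CLMs show promise for semantic frame induction, especially for low-resource languages, where combining prompting with ICL substantially improves performance.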
Abstract: Semantic frame induction is the task of clustering frame-evoking words according to the semantic frames they evoke. In recent years, leveraging embeddings of frame-evoking words that are obtained using masked language models (MLMs) such as BERT has led to high-performance semantic frame induction. Although causal language models (CLMs) such as the GPT and Llama series succeed in a wide range of language comprehension tasks and can engage in dialogue as if they understood frames, they have not yet been applied to semantic frame induction. We propose a new method for semantic frame induction based on CLMs. Specifically, we introduce FrameEOL, a prompt-based method for obtaining Frame Embeddings that outputs One frame-name as a Label representing the given situation. To obtain embeddings more suitable for frame induction, we leverage in-context learning (ICL) and deep metric learning (DML). Frame induction is then performed by clustering the resulting embeddings. Experimental results on the English and Japanese FrameNet datasets demonstrate that the proposed methods outperform existing frame induction methods. In particular, for Japanese, which lacks extensive frame resources, the CLM-based method using only 5 ICL examples achieved comparable performance to the MLM-based method fine-tuned with DML.
[109] DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation cs.CLPDF
Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang
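TL;DR: DITING is the first comprehensive evaluation framework for web novel translation. It uses multi-dimensional assessment to address the shortcomings of existing benchmarks, and introduces AgentEval and MetricAlign to improve evaluation quality. The study finds that Chinese-trained LLMs translate better than larger foreign models.
Details
Motivation: Existing machine translation benchmarks fall short on web novel translation: they rely on surface-level metrics and fail to capture the distinctive traits of the genre.
Result: Chinese-trained LLMs (e.g., DeepSeek-V3) surpass larger foreign models in translation faithfulness and stylistic coherence.
Insight: Web novel translation requires attention to cultural specificity; multi-agent evaluation tracks human judgment more closely; and Chinese-trained LLMs have an edge on this task.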
Abstract: Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.
[110] Stronger Re-identification Attacks through Reasoning and Aggregation cs.CLPDF
Lucas Georges Gabriel Charpentier, Pierre Lison
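TL;DR: The paper proposes two complementary strategies (aggregation across re-identification orderings and reasoning models) for building stronger text re-identification attacks, which are used to assess the robustness of de-identification methods.
Details
Motivation: How well text de-identification conceals identities is hard to measure, so stronger attacks are needed to assess its robustness.
Result: Both strategies improve re-identification performance, with reasoning models helping most when the adversary has extensive background knowledge.
Insight: Evaluating de-identification must account for the adversary's reasoning ability and background knowledge; the re-identification order and the aggregation strategy are key levers.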
Abstract: Text de-identification techniques are often used to mask personally identifiable information (PII) from documents. Their ability to conceal the identity of the individuals mentioned in a text is, however, hard to measure. Recent work has shown how the robustness of de-identification methods could be assessed by attempting the reverse process of re-identification, based on an automated adversary using its background knowledge to uncover the PIIs that have been masked. This paper presents two complementary strategies to build stronger re-identification attacks. We first show that (1) the order in which the PII spans are re-identified matters, and that aggregating predictions across multiple orderings leads to improved results. We also find that (2) reasoning models can boost the re-identification performance, especially when the adversary is assumed to have access to extensive background knowledge.
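The aggregation idea in finding (1) can be illustrated with a minimal sketch. The abstract does not say how predictions from different orderings are combined, so the per-span majority vote below is an assumption, and the function name `aggregate_orderings`, the span ids, and the sample guesses are hypothetical.

```python
from collections import Counter

def aggregate_orderings(predictions_per_ordering):
    """Majority-vote each masked span across several re-identification runs.

    predictions_per_ordering: list of dicts mapping span id -> guessed string,
    one dict per ordering in which the spans were re-identified.
    Majority voting is an assumed aggregation rule, not the paper's.
    """
    votes = {}
    for predictions in predictions_per_ordering:
        for span_id, guess in predictions.items():
            votes.setdefault(span_id, Counter())[guess] += 1
    # Keep the most frequent guess for every span.
    return {span_id: counter.most_common(1)[0][0] for span_id, counter in votes.items()}

# Three hypothetical runs, each re-identifying the spans in a different order.
runs = [
    {"span_1": "Alice Martin", "span_2": "Oslo"},
    {"span_1": "Alice Martin", "span_2": "Bergen"},
    {"span_1": "Alice Marten", "span_2": "Oslo"},
]
aggregated = aggregate_orderings(runs)
print(aggregated)
```

A single ordering can propagate an early wrong guess to later spans; voting across orderings is one simple way to dampen that effect.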
[111] LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning cs.CLPDF
Changjiang Gao, Zixian Huang, Jingyang Gong, Shujian Huang, Lei Li
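TL;DR: The paper proposes a new translation-enhancement recipe, LLaMAX2, which applies layer-selective tuning using only parallel data. It substantially improves translation for low-resource languages while keeping reasoning ability comparable to the original instruct model.
Details
Motivation: LLMs enhanced for translation usually lose reasoning ability; the authors aim to resolve this trade-off with a new recipe.
Result: Translation improves markedly for low-resource languages such as Swahili (15+ spBLEU and 40+ xComet), while the models remain competitive on multilingual and reasoning tasks.
Insight: The recipe offers an efficient route to multilingual enhancement, significantly reducing complexity and improving accessibility for low-resource languages.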
Abstract: General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translation-enhanced recipe that begins with instruct models and applies layer-selective tuning only on parallel data. Following this pipeline, we introduce the Qwen3-XPlus models, which demonstrate significant improvements in translation performance across both high- and low-resource languages, achieving 15+ spBLEU and 40+ xComet in low-resource languages, like Swahili. Interestingly, training only with small parallel datasets, Qwen3-XPlus achieves an average improvement of 1+ points on 7 multilingual tasks while maintaining proficiency comparable to the Qwen3 instruct model in 15 popular reasoning datasets. This work offers a promising approach to multilingual enhancement, significantly reducing complexity and enhancing accessibility for a wider range of languages. The code and model are publicly available.
[112] DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction cs.CL | cs.AIPDF
Yiqi Li, Yusheng Liao, Zhe Chen, Yanfeng Wang, Yu Wang
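TL;DR: DICE is a lightweight framework in which small language models (SLMs) refine the outputs of large language models (LLMs) through chain-of-thought (CoT) correction so that they meet structured output requirements.
Details
Motivation: When performing reasoning tasks, LLMs often neglect strict user-specified output formats, and fine-tuning the LLMs directly is impractical because of high computational cost and limited parameter access.
Result: Experiments show that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art performance.
Insight: By combining the general capabilities of LLMs with fine-grained correction by SLMs, DICE produces structured outputs efficiently and suggests a new direction for improving LLM outputs.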
Abstract: When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs’ outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs’ broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.
[113] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning cs.CLPDF
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao
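TL;DR: The paper proposes DSPO, an improved reinforcement learning algorithm for training agents that are more stable and efficient on multi-turn search and reasoning tasks, without supervised demonstration data.
Details
Motivation: Existing approaches either rely on prompting to elicit the model's agent capabilities or hit performance ceilings and collapse when RL is applied to complex interactive tasks. DSPO aims to address these problems and unlock the model's agentic potential.
Result: Across multiple QA benchmarks, the DSPO-trained 7B model improves over a comparable previous work by 34.1%, with especially strong results on complex multi-hop QA such as HotpotQA.
Insight: With dynamic sample filtering and sequence-level optimization, DSPO shows how RL training can stay stable on complex tasks while avoiding collapse.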
Abstract: Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model’s innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable previous work by 34.1%, and even outperforms the 14B model from previous work in complex multi-hop QA such as HotpotQA by nearly 9% relative, maintaining exceptional training stability.
[114] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models cs.CL | cs.AI | cs.LGPDF
Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang
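TL;DR: The paper proposes Self-Critique and presents the first systematic study of data contamination detection in the RL post-training stage, where existing methods perform poorly.
Details
Motivation: There are no dedicated methods for detecting data contamination in RL post-training, which can invalidate LLM evaluation results. The authors conduct the first systematic study of this stage.
Result: Self-Critique significantly outperforms baselines across multiple models and contamination tasks, with an AUC improvement of up to 30%, whereas existing methods are close to random guessing on RL-phase contamination.
Insight: The output entropy distribution of a model after RL post-training can be used for contamination detection, and policy collapse provides the key signal.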
Abstract: Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model’s convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
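The key observation above concerns entropy collapse. As a small illustration of the quantity involved (not of Self-Critique itself, whose procedure the abstract does not spell out), the sketch below computes the Shannon entropy of two hypothetical next-token distributions; the probability values are invented for the example.

```python
import math

def shannon_entropy(probs):
    """Entropy in nats of a discrete next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical distributions over a four-token vocabulary.
spread = [0.25, 0.25, 0.25, 0.25]      # many continuations remain plausible
collapsed = [0.97, 0.01, 0.01, 0.01]   # the policy has narrowed to one path

h_spread = shannon_entropy(spread)
h_collapsed = shannon_entropy(collapsed)
print(h_spread, h_collapsed)
```

The collapsed distribution has much lower entropy than the uniform one, which is the kind of sparse, low-entropy behavior the abstract associates with the RL phase.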
[115] CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation cs.CLPDF
Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu
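TL;DR: CFVBench is a comprehensive video benchmark for fine-grained multimodal retrieval-augmented generation (MRAG) that addresses the limited modality coverage and format diversity of existing benchmarks. With 5,360 QA pairs built from 599 publicly available videos, it supports high-density multimodal tasks and exposes a bottleneck in how current models handle fine-grained multimodal details.
Details
Motivation: Existing MRAG benchmarks are limited in modality coverage and format diversity, so they cannot fully evaluate multimodal large language models (MLLMs), particularly their handling of fine-grained multimodal information.
Result: Experiments show that current MLLMs (including GPT5 and Gemini) struggle with fine-grained multimodal details, while the AVR framework consistently improves performance.
Insight: Capturing fine-grained multimodal information is a key bottleneck for current MLLMs; strategies based on adaptive adjustment and selective enhancement, such as AVR, are an effective way forward.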
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs.
[116] Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation cs.CL | cs.AIPDF
Xiangxu Zhang, Lei Li, Yanyun Zhou, Xiao Zhou, Yingying Zhang
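TL;DR: The paper examines the limitations of current evaluation methods for medical diagnostic LLMs and proposes DyReMe, a dynamic evaluation method that assesses model performance in a way closer to real clinical practice.
Details
Motivation: Existing medical diagnostic evaluation relies mainly on static benchmarks derived from public medical exam items, which overestimate model performance and ignore the complex scenarios and varied expression styles found in real clinical settings.
Result: Experiments show that DyReMe reflects LLM performance more realistically and reveals a significant gap between state-of-the-art models and real clinical needs.
Insight: The paper stresses the need for evaluation frameworks that are closer to real medical scenarios, so that models are trustworthy and useful in actual clinical practice.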
Abstract: Medical diagnostics is a high-stakes and complex domain that is critical to patient care. However, current evaluations of large language models (LLMs) are fundamentally misaligned with real-world clinical practice. Most of them rely on static benchmarks derived from public medical exam items, which tend to overestimate model performance and ignore the difference between textbook cases and the ambiguous, varying conditions in the real world. Recent efforts toward dynamic evaluation offer a promising alternative, but their improvements are limited to superficial perturbations and a narrow focus on accuracy. To address these gaps, we propose DyReMe, a dynamic benchmark for medical diagnostics that better reflects real clinical practice. Unlike static exam-style questions, DyReMe generates fresh, consultation-like cases that introduce distractors such as differential diagnoses and common misdiagnosis factors. It also varies expression styles to mimic diverse real-world query habits. Beyond accuracy, DyReMe evaluates LLMs on three additional clinically relevant dimensions: veracity, helpfulness, and consistency. Our experiments demonstrate that this dynamic approach yields more challenging and realistic assessments, revealing significant misalignments between the performance of state-of-the-art LLMs and real clinical practice. These findings highlight the urgent need for evaluation frameworks that better reflect the demands of trustworthy medical diagnostics.
[117] CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts cs.CL | cs.AIPDF
Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng
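TL;DR: The paper proposes CLARity, a low-cost reinforcement learning framework that uses a consistency-aware reward mechanism and an improved training pipeline to raise the reasoning consistency and accuracy of expert LLMs in data-scarce domains.
Details
Motivation: When training expert LLMs in data-scarce domains, standard outcome-based RL can degrade reasoning quality (e.g., logical consistency), and existing remedies such as large-scale process reward models are prohibitively expensive.
Result: Experiments show that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines; human evaluation also confirms gains in coherence and professionalism.
Insight: A small general-purpose LLM can effectively guide expert models through reasoning consistency, giving a low-cost and generalizable solution.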
Abstract: Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models by reasoning consistency. Our code is open sourced at: https://github.com/Infinite-set/CLARity
[118] Verifying Chain-of-Thought Reasoning via Its Computational Graph cs.CL | cs.AI | cs.LGPDF
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
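TL;DR: The paper proposes Circuit-based Reasoning Verification (CRV), which verifies the correctness of Chain-of-Thought (CoT) reasoning by analyzing structural features of its computational graph, and uses these structural fingerprints to reveal error signals and patterns.
Details
Motivation: Existing CoT verification methods (black-box or gray-box) predict reasoning correctness only from outputs or activations and offer little insight into why a computation fails, so a white-box method that analyzes the computation itself is needed.
Result: CRV's structural signatures are highly predictive of reasoning errors; errors in different tasks show up as distinct patterns, and targeted interventions on structural features can correct the model's reasoning.
Insight: A model's computation carries rich error signals; analyzing the computational graph makes it possible to move from error detection to causal understanding.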
Abstract: Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model’s latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model’s faulty reasoning. Our work shows that, by scrutinizing a model’s computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
[119] LLP: LLM-based Product Pricing in E-commerce cs.CLPDF
Hairu Wang, Sheng You, Qiheng Zhang, Xike Xie, Shuguang Han
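TL;DR: LLP is an LLM-based generative framework for pricing second-hand products. With dynamic market alignment, two-stage optimization, and confidence-based filtering, it substantially improves pricing accuracy and generalization.
Details
Motivation: Sellers on C2C e-commerce platforms (e.g., eBay) struggle to set prices; traditional static regression models generalize poorly and fail to capture market dynamics. Recent breakthroughs in LLMs suggest a new approach.
Result: Deployed on Xianyu, LLP significantly outperforms the previous pricing method: at the same 30% product coverage, the static adoption rate (SAR) rises from 40% to 72%, and SAR remains at 47% even at 90% recall.
Insight: Combining the generative ability of LLMs with domain-specific optimization and dynamic alignment can substantially improve the accuracy and generalization of pricing, and suits dynamic market environments.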
Abstract: Unlike Business-to-Consumer e-commerce platforms (e.g., Amazon), inexperienced individual sellers on Consumer-to-Consumer platforms (e.g., eBay) often face significant challenges in setting prices for their second-hand products efficiently. Therefore, numerous studies have been proposed for automating price prediction. However, most of them are based on static regression models, which suffer from poor generalization performance and fail to capture market dynamics (e.g., the price of a used iPhone decreases over time). Inspired by recent breakthroughs in Large Language Models (LLMs), we introduce LLP, the first LLM-based generative framework for second-hand product pricing. LLP first retrieves similar products to better align with the dynamic market change. Afterwards, it leverages the LLMs’ nuanced understanding of key pricing information in free-form text to generate accurate price suggestions. To strengthen the LLMs’ domain reasoning over retrieved products, we apply a two-stage optimization, supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO), on a dataset built via bidirectional reasoning. Moreover, LLP employs a confidence-based filtering mechanism to reject unreliable price suggestions. Extensive experiments demonstrate that LLP substantially surpasses existing methods while generalizing well to unseen categories. We have successfully deployed LLP on Xianyu, China’s largest second-hand e-commerce platform, significantly outperforming the previous pricing method. Under the same 30% product coverage, it raises the static adoption rate (SAR) from 40% to 72%, and maintains a strong SAR of 47% even at 90% recall.
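The confidence-based filtering step trades coverage for reliability. The following is a rough sketch of that trade-off only, assuming each suggestion carries a scalar confidence and is kept when it meets a fixed threshold; the abstract does not describe how LLP scores confidence, and the function name, field names, and numbers are hypothetical.

```python
def filter_suggestions(suggestions, min_confidence):
    """Keep price suggestions whose confidence meets the threshold.

    Returns the kept suggestions and the resulting coverage, i.e. the
    fraction of products that still receive a suggestion.
    The thresholding rule is an assumption, not LLP's actual mechanism.
    """
    kept = [s for s in suggestions if s["confidence"] >= min_confidence]
    coverage = len(kept) / len(suggestions)
    return kept, coverage

# Hypothetical model outputs for four listings.
suggestions = [
    {"item": "phone", "price": 310.0, "confidence": 0.90},
    {"item": "lamp", "price": 12.0, "confidence": 0.40},
    {"item": "bike", "price": 95.0, "confidence": 0.75},
    {"item": "sofa", "price": 140.0, "confidence": 0.20},
]
kept, coverage = filter_suggestions(suggestions, min_confidence=0.7)
print([s["item"] for s in kept], coverage)
```

Raising the threshold lowers coverage, which is why the reported adoption rates are stated at a fixed coverage level.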
[120] ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering cs.CLPDF
Francesco Maria Molfese, Luca Moroni, Ciro Porcaro, Simone Conia, Roberto Navigli
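TL;DR: The paper introduces ReTraceQA, a novel benchmark that evaluates the reasoning process of small language models (SLMs) on commonsense reasoning tasks rather than only the correctness of the final answer. In 14-24% of instances, SLMs give correct answers with flawed reasoning, which indicates that answer-only evaluation overestimates their capabilities.
Details
Motivation: Current evaluation of SLMs on commonsense reasoning is based mainly on final-answer accuracy and neglects the validity of the reasoning process, which can lead to misjudging model capabilities.
Result: SLMs give correct answers despite flawed reasoning in 14-24% of instances; when strong LLMs are used as judges for reasoning-aware evaluation, SLM performance drops significantly (by up to 25%).
Insight: Answer-only evaluation can mask real capability gaps; process-level evaluation is essential for a complete picture of model performance.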
Abstract: While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.
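The gap between answer-only and reasoning-aware scoring can be made concrete with a toy calculation. The records and field names below are hypothetical, and the sketch assumes a reasoning-aware score counts an instance only when both the final answer and the reasoning trace are judged valid, which is one plausible reading of the abstract rather than the benchmark's exact metric.

```python
def accuracy_pair(records):
    """Return (answer-only accuracy, reasoning-aware accuracy)."""
    total = len(records)
    answer_only = sum(r["answer_correct"] for r in records) / total
    # Count an instance only if the answer is right AND the trace is valid.
    reasoning_aware = sum(
        r["answer_correct"] and r["reasoning_valid"] for r in records
    ) / total
    return answer_only, reasoning_aware

# Four hypothetical judged instances.
records = [
    {"answer_correct": True, "reasoning_valid": True},
    {"answer_correct": True, "reasoning_valid": False},  # right answer, flawed trace
    {"answer_correct": False, "reasoning_valid": False},
    {"answer_correct": True, "reasoning_valid": True},
]
answer_only, reasoning_aware = accuracy_pair(records)
print(answer_only, reasoning_aware)
```

In this toy set, one of the four instances is correct only by its final answer, so the score falls from 0.75 to 0.5 once the trace is taken into account.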
[121] Logit Arithmetic Elicits Long Reasoning Capabilities Without Training cs.CLPDF
Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee
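TL;DR: The paper studies how to elicit long reasoning in large models without additional training. It proposes ThinkLogit, a decoding-time method based on logit arithmetic, and further improves it with preference optimization.
Details
Motivation: Recent studies suggest that long reasoning behaviors such as backtracking and self-correction typically require additional training; this paper asks whether they can be elicited without training.
Result: ThinkLogit and ThinkLogit-DPO achieve relative improvements in average accuracy of 24.5% and 29.1%, respectively, over five reasoning benchmarks, and the method remains effective when guider and target come from different model families.
Insight: ThinkLogit is a low-cost and effective way to improve the reasoning of large models without costly post-training.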
Abstract: Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. We then show that we can further boost its performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in average accuracy by 24.5% and 29.1%, respectively, over five reasoning benchmarks using the Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models, as guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
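The abstract does not give the logit-arithmetic formula. A common form of decoding-time logit arithmetic shifts the target model's logits by the difference between a small tuned model and its untuned counterpart; the sketch below assumes that form, and the base counterpart of the guider, the weight `alpha`, the toy vocabulary, and all logit values are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_logits(target, guider_reasoning, guider_base, alpha=1.0):
    """Assumed form: target + alpha * (guider_reasoning - guider_base)."""
    return [t + alpha * (r - b)
            for t, r, b in zip(target, guider_reasoning, guider_base)]

# Toy next-token logits over a three-token vocabulary (all values hypothetical).
vocab = ["answer", "wait", "so"]
target = [2.0, 0.0, 1.0]            # large non-reasoning model
guider_reasoning = [0.5, 2.5, 0.5]  # small reasoning model
guider_base = [1.5, 0.0, 0.5]       # small model without reasoning training

combined = guided_logits(target, guider_reasoning, guider_base)
probs = softmax(combined)
next_token = vocab[max(range(len(vocab)), key=combined.__getitem__)]
print(combined, next_token)
```

Here the guider's preference for a backtracking-style token ("wait") is transferred to the target at decoding time, without updating the target's weights.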
[122] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood cs.CLPDF
Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu
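TL;DR: The paper proposes TEPO (Token-Level Policy Optimization), which uses Markov Likelihood to link group-level rewards to token-level aggregation, improving the mathematical reasoning of LLMs while also making training more stable.
Details
Motivation: Methods such as Group Relative Policy Optimization (GRPO), when faced with sparse token rewards, often suffer entropy collapse or model collapse because of undifferentiated token-level entropy adjustments; a finer-grained method is needed.
Result: TEPO outperforms baselines on key metrics, including @k and accuracy, and sets a new state of the art on mathematical reasoning tasks.
Insight: Token-level optimization combined with Markov Likelihood is an effective way to improve performance on sparse-reward tasks while avoiding model collapse.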
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) and links group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
[123] Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph cs.CL | cs.AIPDF
Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang
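TL;DR: The paper proposes a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework that integrates multi-scale structural information in graph data, improving graph prompt tuning, especially in few-shot scenarios.
Details
Motivation: Existing graph prompt-tuning methods are confined to a single granularity (e.g., node level or subgraph level) and overlook the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics.
Result: Experiments on eight benchmark datasets show that MSGCOT outperforms the state-of-the-art single-granularity graph prompt-tuning method, particularly in few-shot scenarios.
Insight: Dynamically integrating multi-scale information improves graph prompt tuning, with the largest gains when data is scarce.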
Abstract: The “pre-train, prompt” paradigm, designed to bridge the gap between pre-training tasks and downstream objectives, has been extended from the NLP domain to the graph domain and has achieved remarkable progress. Current mainstream graph prompt-tuning methods modify input or output features using learnable prompt vectors. However, existing approaches are confined to single-granularity (e.g., node-level or subgraph-level) during prompt generation, overlooking the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics. To address this issue, we pioneer the integration of multi-scale information into graph prompt and propose a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework. Specifically, we design a lightweight, low-rank coarsening network to efficiently capture multi-scale structural features as hierarchical basis vectors for prompt generation. Subsequently, mimicking human cognition from coarse-to-fine granularity, we dynamically integrate multi-scale information at each reasoning step, forming a progressive coarse-to-fine prompt chain. Extensive experiments on eight benchmark datasets demonstrate that MSGCOT outperforms the state-of-the-art single-granularity graph prompt-tuning method, particularly in few-shot scenarios, showcasing superior performance.
[124] Active Model Selection for Large Language Models cs.CL | cs.LGPDF
Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel
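TL;DR: LLM SELECTOR is a framework for active model selection of large language models (LLMs). It adaptively selects a small set of queries to annotate, which substantially reduces annotation cost.
Details
Motivation: Existing LLM evaluation methods rely on fully annotated datasets, which makes annotation costly and inefficient.
Result: Experiments on 6 benchmarks with 151 LLMs show that LLM SELECTOR reduces annotation costs by up to 59.62%.
Insight: Active selection combined with low-cost annotation strategies enables efficient model selection.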
Abstract: We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
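The abstract does not define how a query's informativeness is measured. As a loose illustration of adaptive query selection, the sketch below uses disagreement among candidate models as a stand-in: a query on which all models agree cannot separate them, so the most contested query is annotated first. The scoring rule, the function name, and the sample outputs are assumptions, not LLM SELECTOR's actual criterion.

```python
from collections import Counter

def most_contested_query(model_outputs):
    """Pick the unlabeled query on which candidate models disagree most.

    model_outputs: dict mapping query id -> list of answers, one per model.
    Disagreement is 1 minus the share of the most common answer
    (an assumed proxy for informativeness).
    """
    def disagreement(answers):
        top_count = Counter(answers).most_common(1)[0][1]
        return 1.0 - top_count / len(answers)

    return max(model_outputs, key=lambda q: disagreement(model_outputs[q]))

# Hypothetical answers from four candidate models on three queries.
outputs = {
    "q1": ["A", "A", "A", "A"],  # unanimous: a label here separates no models
    "q2": ["A", "B", "A", "C"],  # most contested
    "q3": ["B", "B", "B", "A"],
}
pick = most_contested_query(outputs)
print(pick)
```

In a full loop, the chosen query would be labeled, the belief about which model is best updated, and the selection repeated until the annotation budget is spent.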
[125] The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach cs.CL | cs.AI | cs.LG | eess.ASPDF
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
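TL;DR: The paper compares context management strategies for end-to-end spoken dialogue state tracking (DST) with Speech-LLMs. Feeding the full spoken history gives the best performance, while attention-pooling-based compression stays competitive with a smaller context.
Details
Motivation: The paper explores how to use context more effectively in spoken dialogue state tracking and to overcome the limitations of traditional multimodal and compressed-context approaches.
Result: On the SpokenWOZ corpus, full spoken history input significantly outperforms the other approaches, while compression maintains high accuracy with a reduced context size.
Insight: Using context more fully improves performance, and attention-pooling compression is a workable trade-off for real-world deployment.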
Abstract: This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
[126] KORMo: Korean Open Reasoning Model for Everyone cs.CLPDF
Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song
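TL;DR: KORMo-10B is a fully open Korean-English bilingual large language model trained predominantly on synthetic data, with performance close to that of open-weight multilingual baselines. The study shows that synthetic data does not cause instability or degradation in large-scale pretraining, and that bilingual instruction tuning improves Korean reasoning.
Details
Motivation: To build a fully open bilingual LLM for Korean, address the scarcity of resources for non-English languages, and test whether training predominantly on synthetic data is feasible.
Result: The model performs close to open-weight multilingual baselines across benchmarks, which supports the feasibility of synthetic data and the benefit of bilingual instruction tuning.
Insight: Synthetic data can sustain large-scale pretraining without model collapse, and bilingual instruction tuning substantially improves reasoning in a low-resource language. The work sets a reproducible precedent for future multilingual LLM research.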
Abstract: This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.
[127] Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives cs.CLPDF
Xixi Wang, Jordanka Kovaceva, Miguel Costa, Shuai Wang, Francisco Camara Pereira
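TL;DR: This paper studies domain-adapted pre-trained language models (PLMs) for extracting implicit information from crash narratives, addressing both the weak performance of existing approaches on inference-heavy tasks and the privacy concerns of closed LLMs.
Details
Motivation: Crash narratives are unstructured and highly varied, so existing tools struggle to support large-scale analysis. Relying on closed large language models (LLMs) also raises privacy concerns and suffers from limited domain knowledge.
Result: Experiments on the authoritative CISS dataset show that fine-tuned compact models outperform closed LLMs, capture richer narrative details, and even correct some mislabeled annotations.
Insight: Domain-adapted PLMs perform well on implicit information extraction with low resource requirements, offering a workable option for privacy-sensitive data.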
Abstract: Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement as there are no documented tools that can batch process the unstructured, non-standardized text content written by various authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across various natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks in, for example, Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data. Additionally, these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) Crash Type for each vehicle involved in the crash event from real-world crash narratives. To bridge domain gaps, we apply fine-tuning techniques to inject task-specific knowledge into LLMs, using Low-Rank Adaptation (LoRA), and into BERT. Experiments on the authoritative real-world dataset Crash Investigation Sampling System (CISS) demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs can capture richer narrative details and even correct some mislabeled annotations in the dataset.
[128] Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic cs.CL | cs.LG | cs.LOPDF
Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki
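TL;DR: The paper examines the limited generalization of neural models in logical reasoning and proposes a hybrid architecture that combines symbolic reasoning with neural computation, yielding more efficient and robust inference.
Details
Motivation: Despite notable progress in natural language processing, neural models still generalize poorly in logical reasoning, particularly with respect to two key aspects: compositionality and recursiveness. The study aims to separate these two aspects and propose a remedy.
Result: Experiments show that the hybrid architecture retains high efficiency even with relatively small neural components, and the analysis indicates that it can address key generalization barriers in neural reasoning systems.
Insight: The paper exposes the limits of neural models in logical reasoning and the potential of neuro-symbolic hybrids, which both improve performance and suggest new directions for reasoning-system design.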
Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often conflated under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference: neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale and highlights the potential of hybrid models to effectively address key generalization barriers in neural reasoning systems.
[129] Multimodal Policy Internalization for Conversational Agents cs.CL | cs.AIPDF
Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan
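TL;DR: The paper introduces the Multimodal Policy Internalization (MPI) task, which internalizes complex multimodal policies into model parameters to reduce inference-time compute, and proposes the TriMPI training framework, which yields notable gains in end-to-end accuracy, generalization, and robustness to forgetting.
Details
Motivation: As multimodal conversational systems spread, prompt-based policy management is becoming complex and computationally expensive, which calls for a way to internalize policies into model parameters for better efficiency and accuracy.
Result: TriMPI clearly outperforms baselines in end-to-end accuracy, generalization, and robustness to forgetting, supporting the method's effectiveness.
Insight: Multimodal policy internalization is an important direction for future conversational systems; TriMPI provides datasets and a training recipe for further research.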
Abstract: Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
[130] StatEval: A Comprehensive Benchmark for Large Language Models in Statistics cs.CLPDF
Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai
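TL;DR: The paper introduces StatEval, the first comprehensive benchmark dedicated to statistics, covering problems from foundational to research level, and reveals the limits of current large language models in statistical reasoning.
Details
Motivation: Existing benchmarks do not evaluate large language models in statistics in depth; as a distinct and integrative discipline, statistics needs a dedicated evaluation tool.
Result: Closed-source models such as GPT5-mini score below 57% on research-level problems, and open-source models score significantly lower, underscoring the difficulty of statistical reasoning.
Insight: Current large language models fall well short in statistical reasoning and need further improvement; StatEval provides a rigorous benchmark for future work.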
Abstract: Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.
[131] Can We Reliably Rank Model Performance across Domains without Labeled Data? cs.CLPDF
Veronica Rammouz, Aaron Gonzalez, Carlos Cruzportillo, Adrian Tan, Nicole Beebe
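TL;DR: This paper studies how to reliably rank NLP model performance across domains without labeled data, finding that large language model-based error predictors are more consistent than drift-based or zero-shot baselines.
Details
Motivation: Reliably estimating model performance across domains without labels is an important problem. Prior methods rely on dataset similarity or predicted correctness, but it remains unclear when they are reliable.
Result: Rankings produced by large language model-based error predictors show stronger and more consistent rank correlations with true accuracy.
Insight: 1. Rankings are more reliable when performance differences across domains are larger;
2. Agreement between the error predictor's predictions and the base model's true failure patterns is critical to ranking reliability.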
Abstract: Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model’s predictions align with the base model’s true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
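The rank correlation that the abstract above relies on can be illustrated with a minimal sketch. It assumes Spearman's coefficient as the ranking measure (the abstract says only "rank correlations") and uses made-up per-domain accuracies; the function names `average_ranks` and `spearman` are hypothetical and not taken from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(estimated, true):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(estimated) != len(true) or len(estimated) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(estimated), average_ranks(true)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (vx * vy) ** 0.5


# Hypothetical numbers: label-free accuracy estimates vs. true accuracy per domain.
estimated_acc = [0.90, 0.80, 0.70]
true_acc = [0.85, 0.75, 0.60]
rho = spearman(estimated_acc, true_acc)  # both lists order the domains identically
```

A coefficient near 1 means the label-free estimates order the domains the same way the true accuracies do, which is the property the paper evaluates.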
[132] Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking cs.CL | cs.SD | eess.ASPDF
Mohammad Hossein Sameti, Sepehr Harfi Moridani, Ali Zarean, Hossein Sameti
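TL;DR: The paper proposes an accent-invariant automatic speech recognition (ASR) method based on saliency-driven spectrogram masking; masking-based augmentation makes the model more robust to diverse accents and substantially reduces word error rate (WER) in English and Persian.
Details
Motivation: Despite major progress with pre-trained Transformer-based ASR models, sensitivity to accent and dialect variation still leads to elevated WER, particularly in linguistically diverse languages such as English and Persian.
Result: Experiments show substantial WER reductions in both English and Persian, confirming the method's effectiveness.
Insight: Capturing accent cues through saliency-driven masking can make multilingual ASR systems more robust, which is especially relevant for low-resource languages.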
Abstract: Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. This enhances the robustness of ASR models against accent variability. We evaluate the method using both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR
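The masking step described in the abstract above can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the spectrogram and the saliency map are plain nested lists, the saliency values are taken as given (the paper derives them from an accent classifier), and masking the top fraction of individual cells is one simple reading of "masking the regions most influential to its predictions". The name `mask_salient` and its parameters are hypothetical.

```python
def mask_salient(spectrogram, saliency, fraction=0.1, fill=0.0):
    """Return a copy of `spectrogram` with the most salient cells set to `fill`.

    `spectrogram` and `saliency` are equal-shaped lists of rows
    (frequency bins x time frames). `fraction` is the share of cells to mask.
    """
    if len(spectrogram) != len(saliency) or any(
        len(a) != len(b) for a, b in zip(spectrogram, saliency)
    ):
        raise ValueError("spectrogram and saliency must have the same shape")
    # Collect (saliency, row, col) for every cell, then sort descending.
    cells = [
        (saliency[r][c], r, c)
        for r in range(len(spectrogram))
        for c in range(len(spectrogram[r]))
    ]
    k = int(len(cells) * fraction)
    masked = [row[:] for row in spectrogram]  # leave the input untouched
    for _, r, c in sorted(cells, reverse=True)[:k]:
        masked[r][c] = fill
    return masked


# Toy 2x2 example with hypothetical values: only the most salient cell is masked.
spec = [[1.0, 2.0], [3.0, 4.0]]
sal = [[0.1, 0.9], [0.2, 0.3]]
augmented = mask_salient(spec, sal, fraction=0.25)
```

In an augmentation pipeline of the kind the abstract describes, the masked copies would be added to the ASR training data alongside the originals.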
[133] Mitigating Overthinking through Reasoning Shaping cs.CL | cs.AIPDF
Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo
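TL;DR: The paper proposes GRSP, a step-level method that addresses the inefficiency caused by overthinking in large reasoning models (LRMs), balancing efficiency and accuracy through a length-aware weighting mechanism across segment clusters.
Details
Motivation: LRMs solve problems well but often overthink, producing excessive, meandering reasoning that inflates computational cost. Existing penalties reduce token consumption but often hurt performance because token-level supervision is too simplistic.
Result: Experiments show that GRSP substantially improves token efficiency without heavily compromising accuracy, with advantages most pronounced on harder problems. GRSP also stabilizes RL training and scales across model sizes.
Insight: The granularity of supervision is key to balancing efficiency and accuracy. GRSP's results suggest that segment-level supervision is more effective than token-level supervision and carries over across model sizes.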
Abstract: Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often cause overthinking: excessive, meandering reasoning that inflates computational cost. Prior designs of penalization in RLVR manage to reduce token consumption while often harming model performance, which arises from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with advantages that are especially pronounced on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.
[134] Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors cs.CLPDF
Yihong Liu, Raoyuan Zhao, Lena Altinger, Hinrich Schütze, Michael A. Hedderich
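TL;DR: This paper examines how typographical errors (typos) in multilingual input affect large language models (LLMs). It introduces MulTypo, a multilingual typo generation algorithm, and evaluates 18 open-source LLMs on five downstream tasks. Typos degrade performance, particularly in generative and reasoning tasks, while high-resource languages and translation from English are comparatively more robust.
Details
Motivation: Most benchmarks assume clean input and ignore the typos that real user input may contain. LLM robustness to typos across languages therefore needs to be evaluated to inform model improvements.
Result: 1. Typos clearly degrade LLM performance, especially in generative and reasoning tasks; 2. Instruction tuning improves clean-input performance but may increase brittleness under noise; 3. High-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English.
Insight: 1. The effect of typos on LLMs in multilingual settings cannot be ignored; 2. Robustness is language-dependent; 3. Noise-aware training and multilingual robustness evaluation are important directions for future work.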
Abstract: Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs – naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning – while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We make our code and data publicly available.
[135] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models cs.CL | cs.AIPDF
Chengyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang
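TL;DR: This paper proposes Sandwiched Policy Gradient (SPG), a new reinforcement learning method for masked diffusion language models that uses both an upper and a lower bound of the log-likelihood to reduce the policy gradient bias of one-sided approximations.
Details
Motivation: Diffusion large language models (dLLMs) are an efficient alternative to autoregressive models because they decode multiple tokens in parallel, but their intractable likelihood makes standard policy gradient methods hard to apply directly. One-sided approximations such as the evidence lower bound (ELBO) can introduce significant bias.
Result: SPG significantly outperforms baselines based on ELBO or one-step estimation on several tasks (GSM8K, MATH500, and others), with gains of up to 27.0%.
Insight: By combining upper and lower bounds, SPG addresses policy gradient bias while keeping the parallel-decoding advantage of diffusion models, offering a new approach to RL alignment of dLLMs.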
Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
[136] Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models cs.CLPDF
Qiguang Chen, Hanjing Li, Libo Qin, Dengyun Peng, Jinhao Liu
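TL;DR: This paper studies the limits of Diffusion Large Language Models (DLLMs) on long chain-of-thought reasoning, introduces the Parallel-Sequential Contradiction (PSC), characterizes DLLM behavior through behavioral analyses and experiments, and proposes several mitigations.
Details
Motivation: DLLMs are seen as a competitive alternative to autoregressive LLMs (ALLMs) thanks to their high throughput and effective sequential reasoning. However, parallel decoding conflicts with the causal order that rigorous reasoning often requires, which holds back their potential.
Result: Parallel scaling yields consistent improvements, while diffusion and sequential scaling are constrained by PSC; the proposed mitigations reduce PSC-induced ineffectiveness and inefficiency.
Insight: DLLMs show genuine parallelism on simple, directly decidable outputs but revert to autoregressive-like behavior as task difficulty increases; prompt design and decoding strategy strongly affect DLLM performance.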
Abstract: Recently, Diffusion Large Language Models (DLLMs) have offered high throughput and effective sequential reasoning, making them a competitive alternative to autoregressive LLMs (ALLMs). However, parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. We first identify this conflict as the core Parallel-Sequential Contradiction (PSC). Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. As task difficulty increases, they revert to autoregressive-like behavior, a limitation exacerbated by autoregressive prompting, which nearly doubles the number of decoding steps with remasking without improving quality. Moreover, PSC restricts DLLMs’ self-reflection, reasoning depth, and exploratory breadth. To further characterize PSC, we introduce three scaling dimensions for DLLMs: parallel, diffusion, and sequential. Empirically, while parallel scaling yields consistent improvements, diffusion and sequential scaling are constrained by PSC. Based on these findings, we propose several practical mitigations, parallel-oriented prompting, diffusion early stopping, and parallel scaling, to reduce PSC-induced ineffectiveness and inefficiencies.
[137] Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval cs.CLPDF
Yu Wang, Tianhao Tan, Yifei Wang
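TL;DR: The paper proposes a multi-stage framework for Multilingual Video Corpus Retrieval (mVCR) that combines multilingual semantics, domain terminology, and efficient long-form video processing.
Details
Motivation: Existing systems either compress long videos into coarse embeddings or incur prohibitive costs for fine-grained matching, so they cannot meet the retrieval needs of complex multi-hop questions over multilingual medical archives.
Result: The method achieves state-of-the-art performance on the mVCR test set, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and LLM re-ranking.
Insight: The approach offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.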
Abstract: Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
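The coarse-to-fine tree search in the abstract above can be sketched as a beam search over a tree of embedded nodes. This is a minimal illustration under stated assumptions: the embeddings are toy 2-D vectors rather than outputs of a multilingual encoder, cosine similarity is assumed as the scoring function, the final LLM re-scoring step is omitted, and the names `Node`, `cosine`, and `tree_search` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    embedding: list
    text: str = ""
    children: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def tree_search(root, query, beam=2):
    """Descend level by level, keeping only the `beam` best nodes per level.

    Returns the surviving leaf nodes (chunks), best match first. In a full
    system these leaves would be the candidates passed to a re-ranker.
    """
    frontier, leaves = [root], []
    while frontier:
        next_level = []
        for node in frontier:
            if node.children:
                next_level.extend(node.children)
            else:
                leaves.append(node)
        # Prune: irrelevant branches are dropped before their chunks are scored.
        next_level.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
        frontier = next_level[:beam]
    leaves.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
    return leaves


# Toy index: two videos, each with subtitle chunks as leaves (hypothetical data).
video_a = Node([1.0, 0.0], "video A", [Node([1.0, 0.1], "a1"), Node([0.9, 0.5], "a2")])
video_b = Node([0.0, 1.0], "video B", [Node([0.0, 1.0], "b1")])
index = Node([0.5, 0.5], "root", [video_a, video_b])

hits = tree_search(index, query=[1.0, 0.0], beam=1)
```

With `beam=1` the search never scores the chunks of video B, which is the cost saving the abstract attributes to pruning; a wider beam trades compute for recall.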
[138] A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages cs.CLPDF
Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich
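TL;DR: The paper presents the first comprehensive study of multilingual Chain-of-Thought (CoT) reasoning along three dimensions (performance, consistency, and faithfulness) and shows how language preferences affect reasoning quality and effectiveness.
Details
Motivation: Study the performance, consistency, and faithfulness of multilingual CoT reasoning, filling a gap in prior work on how intermediate reasoning steps behave in multilingual settings.
Result: Models show strong language preferences, and the quality and effectiveness of thinking traces vary substantially across languages.
Insight: Reasoning effectiveness depends heavily on the prompt language, and the asymmetry of multilingual CoT reasoning merits further study.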
Abstract: Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques – i.e., truncation and error injection – to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.
[139] WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives cs.CLPDF
Daniel Brubaker, William Sheffield, Junyi Jessy Li, Kanishka Misra
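TL;DR: The paper introduces the WUGNECTIVES dataset to study whether language models can infer attributes of novel entities from discourse connectives, finding large performance differences across connective types, with concessive connectives proving especially difficult.
Details
Motivation: Prior work focuses on how language models use world knowledge to predict discourse connectives; this paper studies the inverse problem of whether discourse connectives can inform language models about the world.
Result: Tuning a language model to show reasoning behavior yields noteworthy improvements on most connectives, but performance varies widely across connective types, and all models struggle with concessive connectives.
Insight: Discourse connectives can serve as cues that inform language models about the world, but their usefulness depends on the connective type and the model's reasoning ability, which opens new directions for studying the functional role of language cues.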
Abstract: The role of world knowledge has been particularly crucial to predict the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs’ inferences about novel entities in contexts where connectives link the entities to particular attributes. On investigating 17 different LMs at various scales, and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs’ overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at https://github.com/sheffwb/wugnectives.
[140] AutoPR: Let’s Automate Your Academic Promotion! cs.CLPDF
Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan
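TL;DR: AutoPR introduces the task of automatically promoting academic papers, along with the multimodal benchmark PRBench and the multi-agent framework PRAgent, which substantially improves promotion outcomes.
Details
Motivation: As the volume of peer-reviewed research surges, scholars increasingly rely on social media to promote papers for visibility and citations, a process that takes considerable human effort.
Result: Compared with direct LLM pipelines on PRBench, PRAgent achieves a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement.
Insight: Platform modeling and targeted promotion contribute most to the gains; AutoPR frames automated scholarly communication as a tractable, measurable problem.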
Abstract: As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.
[141] Dyna-Mind: Learning to Simulate from Experience for Better AI Agents cs.CL | cs.AI | cs.CVPDF
Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu
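TL;DR: Dyna-Mind uses a two-stage training framework (ReSim and Dyna-GRPO) to give AI agents the ability to simulate future states, improving their performance on complex interactive tasks.
Details
Motivation: Current reasoning models excel at tasks such as math and coding but perform poorly on long-horizon interactive tasks, and they need simulation ability to improve understanding and decision-making.
Result: Experiments on the Sokoban, ALFWorld, and AndroidWorld benchmarks support the effectiveness of both the simulation ability and the policy learning.
Insight: Simulation ability is key for AI agents to reason and plan effectively in complex environments.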
Abstract: Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need "vicarious trial and error", the capacity to mentally simulate alternative futures before acting, in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent’s reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent’s simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.
[142] Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models cs.CLPDF
Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu Zhang
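TL;DR: The paper proposes Mind-Paced Speaking (MPS), a dual-brain framework that separates reasoning from speech generation to balance real-time reasoning with high-quality spoken output, drastically reducing latency while improving performance over existing think-while-speaking methods.
Details
Motivation: Existing real-time spoken language models (SLMs) struggle to generate a full chain of thought while keeping latency low, which hurts real-time interaction. A human-like mechanism for thinking while speaking is needed.
Result: Under a zero-latency configuration, MPS reaches 92.8% accuracy on the Spoken-MQA mathematical reasoning task and a score of 82.5 on the URO-Bench speech conversation task, significantly outperforming existing think-while-speaking methods.
Insight: The dual-brain mechanism mimics the parallel processing of human thinking and suggests a new design for real-time SLMs.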
Abstract: Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a “Formulation Brain” for high-level reasoning to pace and guide a separate “Articulation Brain” for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.
[143] Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation cs.CL | cs.AI | cs.LGPDF
Sondos Mahmoud Bsharat, Zhiqiang Shen
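TL;DR: The paper proposes Prompting Test-Time Scaling (P-TTS), an inference-time data augmentation strategy that uses only 90 manually selected reasoning instances to systematically synthesize diverse reasoning trajectory contexts; fine-tuning on this data substantially improves large language model performance on mathematical reasoning.
Details
Motivation: Large language models reason well with chain-of-thought exemplars, but curating large reasoning datasets is laborious and resource-intensive. The authors therefore propose a low-cost data augmentation method that reduces annotation overhead.
Result: On mathematical reasoning benchmarks including AIME2024 and AIME2025, P-TTS substantially improves performance: on AIME'24, P-TTS-7B gains +26.66% and +30.00% absolute accuracy over S1 and S1.1, and P-TTS-32B gains +23.33% and +16.63%. It also improves zero-shot generalization on out-of-domain benchmarks.
Insight: P-TTS shows the potential of test-time data augmentation to elicit large language model reasoning at very low cost, making it suitable for resource-constrained or rapidly evolving domains.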
Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. We then finetune Qwen-2.5 models of various sizes on P-TTS data. Across a suite of mathematical reasoning benchmarks (AIME2024 & 25, MATH500, and GPQA-Diamond), our P-TTS-7B and 32B models outperform the prior competitive baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME’24 (7B), and +13.34% and +6.67% on AIME’25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME’24, and +26.63% and +3.33% on AIME’25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on out-of-domain reasoning benchmarks of Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
eess.IV [Back]
[144] FS-RWKV: Leveraging Frequency Spatial-Aware RWKV for 3T-to-7T MRI Translation eess.IV | cs.CVPDF
Yingtie Lei, Zimeng Li, Chi-Man Pun, Yupeng Liu, Xuhang Chen
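TL;DR: FS-RWKV is an RWKV-based framework for synthesizing 7T MRI images from 3T MRI. Its frequency-spatial-aware modules improve global context representation and detail preservation, and it outperforms existing methods.
Details
Motivation: 7T MRI is hard to deploy widely because of high cost and technical demands, while existing CNN approaches have limited spatial coverage and Transformer approaches are computationally expensive.
Result: On the UNC and BNU datasets, FS-RWKV outperforms CNN, Transformer, GAN, and RWKV baselines on both T1w and T2w modalities.
Insight: Combining the RWKV architecture with frequency-spatial processing is promising for medical image synthesis, balancing computational efficiency with feature modeling capacity.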
Abstract: Ultra-high-field 7T MRI offers enhanced spatial resolution and tissue contrast that enables the detection of subtle pathological changes in neurological disorders. However, the limited availability of 7T scanners restricts widespread clinical adoption due to substantial infrastructure costs and technical demands. Computational approaches for synthesizing 7T-quality images from accessible 3T acquisitions present a viable solution to this accessibility challenge. Existing CNN approaches suffer from limited spatial coverage, while Transformer models demand excessive computational overhead. RWKV architectures offer an efficient alternative for global feature modeling in medical image synthesis, combining linear computational complexity with strong long-range dependency capture. Building on this foundation, we propose Frequency Spatial-RWKV (FS-RWKV), an RWKV-based framework for 3T-to-7T MRI translation. To better address the challenges of anatomical detail preservation and global tissue contrast recovery, FS-RWKV incorporates two key modules: (1) Frequency-Spatial Omnidirectional Shift (FSO-Shift), which performs discrete wavelet decomposition followed by omnidirectional spatial shifting on the low-frequency branch to enhance global contextual representation while preserving high-frequency anatomical details; and (2) Structural Fidelity Enhancement Block (SFEB), a module that adaptively reinforces anatomical structure through frequency-aware feature fusion. Comprehensive experiments on UNC and BNU datasets demonstrate that FS-RWKV consistently outperforms existing CNN-, Transformer-, GAN-, and RWKV-based baselines across both T1w and T2w modalities, achieving superior anatomical fidelity and perceptual quality.
[145] SAM2-3dMed: Empowering SAM2 for 3D Medical Image Segmentation eess.IV | cs.CVPDF
Yeqing Yang, Le Xu, Lixia Tian
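TL;DR: The paper proposes SAM2-3dMed, which extends SAM2 to 3D medical image segmentation. It introduces SRPP and BD modules to address anatomical continuity and boundary precision, and significantly outperforms existing methods.
Details
Motivation: 3D medical image segmentation is critical for clinical diagnosis and treatment, but SAM2 performs poorly when applied directly because of domain gaps between video tasks and medical images, such as anatomical continuity and boundary requirements.
Result: It achieves the best performance on several organ segmentation tasks from the MSD dataset, with notable gains in segmentation overlap and boundary precision.
Insight: Video foundation models can be transferred to spatial volumetric data through domain adaptation, which opens a new direction for medical image analysis.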
Abstract: Accurate segmentation of 3D medical images is critical for clinical applications like disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2-3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module explicitly models bidirectional inter-slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self-supervised manner; 2) a Boundary Detection (BD) module enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas in the Medical Segmentation Decathlon (MSD) dataset) demonstrate that SAM2-3dMed significantly outperforms state-of-the-art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video-centric foundation models to spatial volumetric data.
[146] Rewiring Development in Brain Segmentation: Leveraging Adult Brain Priors for Enhancing Infant MRI Segmentation eess.IV | cs.CVPDF
Alemu Sisay Nigru, Michele Svanera, Austin Dibble, Connor Dalby, Mattia Savardi
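TL;DR: The paper proposes LODi, a new framework that uses prior knowledge from an adult brain MRI segmentation model to improve infant brain MRI segmentation, addressing the anatomical change and data scarcity that make infant segmentation difficult.
Details
Motivation: Infant brain MRI segmentation is essential for neurodevelopmental research and disease diagnosis, but conventional methods perform poorly because of rapidly changing anatomy, motion artifacts, and the scarcity of high-quality labeled data.
Result: Experiments show that LODi outperforms traditional supervised learning and domain-specific models on both internal and external datasets, delivering fast, accurate, age-adaptive segmentation.
Insight: Adult brain priors provide a reliable foundation for MRI segmentation across age groups and generalize well in neuroimaging analysis.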
Abstract: Accurate segmentation of infant brain MRI is critical for studying early neurodevelopment and diagnosing neurological disorders. Yet, it remains a fundamental challenge due to continuously evolving anatomy of the subjects, motion artifacts, and the scarcity of high-quality labeled data. In this work, we present LODi, a novel framework that utilizes prior knowledge from an adult brain MRI segmentation model to enhance the segmentation performance of infant scans. Given the abundance of publicly available adult brain MRI data, we pre-train a segmentation model on a large adult dataset as a starting point. Through transfer learning and domain adaptation strategies, we progressively adapt the model to the 0-2 year-old population, enabling it to account for the anatomical and imaging variability typical of infant scans. The adaptation of the adult model is carried out using weakly supervised learning on infant brain scans, leveraging silver-standard ground truth labels obtained with FreeSurfer. By introducing a novel training strategy that integrates hierarchical feature refinement and multi-level consistency constraints, our method enables fast, accurate, age-adaptive segmentation, while mitigating scanner and site-specific biases. Extensive experiments on both internal and external datasets demonstrate the superiority of our approach over traditional supervised learning and domain-specific models. Our findings highlight the advantage of leveraging adult brain priors as a foundation for age-flexible neuroimaging analysis, paving the way for more reliable and generalizable brain MRI segmentation across the lifespan.
[147] A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis eess.IV | cs.CV | cs.LGPDF
Valentin Biller, Lucas Zimmer, Can Erdur, Sandeep Nagar, Daniel Rückert
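TL;DR: The paper proposes a biophysically conditioned generative model for synthesizing 3D brain tumor MRI images that also supports healthy tissue inpainting. The architecture was adapted for the BraTS 2025 Inpainting Challenge.
Details
Motivation: MRI inpainting has broad clinical and research applications, yet no existing generative model synthesizes high-quality brain tumor MRIs conditioned on voxel-level tumor concentrations.
Result: The model reaches a PSNR of 18.5 on healthy tissue inpainting and 17.4 on tumor inpainting.
Insight: Conditioning the generative model on tumor concentration yields better anatomical consistency and spatial coherence in MRI inpainting.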
Abstract: Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
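The PSNR figures quoted above follow the standard definition, PSNR = 10 · log10(MAX² / MSE). The sketch below shows that computation on flattened voxel intensities. The abstract does not state the intensity range or masking used for evaluation, so the `data_range` default and the flat-sequence input format are assumptions made for illustration only.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized intensity sequences.

    `data_range` is the assumed maximum possible intensity (hypothetical default of 1.0).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all voxels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical volumes
    return 10.0 * math.log10(data_range ** 2 / mse)
```

A uniform error of 0.1 on a unit intensity range gives an MSE of 0.01 and therefore a PSNR of about 20 dB, which helps put values such as 18.5 and 17.4 in context.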
cs.ET [Back]
[148] When to Reason: Semantic Router for vLLM cs.ET | cs.AI | cs.CL | cs.SY | eess.SYPDF
Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo
TL;DR: The paper proposes a semantic router that dynamically decides whether an LLM query needs reasoning, cutting unnecessary compute cost and latency while improving accuracy on the MMLU-Pro benchmark by 10.2 percentage points.
Details
Motivation: Reasoning modes such as chain-of-thought and inference-time scaling substantially improve LLM accuracy but incur high latency and token usage. Many simple queries need no reasoning, so a mechanism is needed to decide dynamically whether to enable it.
Result: Within vLLM, the semantic router improves both efficiency and accuracy: MMLU-Pro accuracy rises by 10.2 percentage points, latency falls by 47.1%, and token consumption falls by 48.5%.
Insight: Dynamically routing queries balances LLM accuracy against efficiency and offers a practical optimization for open-source LLM serving systems.
Abstract: Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
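The routing idea can be illustrated with a toy sketch: score each query for how much reasoning it seems to need, then enable the reasoning mode only above a threshold. The abstract does not describe the classifier itself, so the keyword scorer, the `REASONING_HINTS` list, the threshold, and all names below are hypothetical stand-ins. None of this is the paper's model or a vLLM API.

```python
from dataclasses import dataclass

# Hypothetical cue phrases standing in for a learned semantic classifier.
REASONING_HINTS = ("prove", "derive", "step by step", "how many", "calculate", "why")


@dataclass
class RoutingDecision:
    use_reasoning: bool
    score: float


def reasoning_score(query: str) -> float:
    """Fraction of cue phrases present in the query (toy proxy for a classifier score)."""
    text = query.lower()
    hits = sum(1 for hint in REASONING_HINTS if hint in text)
    return hits / len(REASONING_HINTS)


def route(query: str, threshold: float = 0.15) -> RoutingDecision:
    """Enable the reasoning mode only when the score reaches the threshold."""
    score = reasoning_score(query)
    return RoutingDecision(use_reasoning=score >= threshold, score=score)
```

In a serving stack, the decision would select between a direct prompt and a reasoning-enabled prompt before the request reaches the model. The threshold sets the trade-off between accuracy and token cost.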
cs.CR [Back]
[149] Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects cs.CR | cs.CV | cs.LGPDF
Zirun Zhou, Zhengyang Xiao, Haochuan Xu, Jing Sun, Di Wang
TL;DR: The paper presents GoBA, a practical backdoor attack on vision-language-action (VLA) models that uses physical objects as triggers: the model executes a predefined goal action when the trigger is present, while its performance on normal inputs is unaffected.
Details
Motivation: VLA models rely on uncurated training data, which creates security risks. Existing backdoor attacks mostly assume white-box access and only cause task failure; this paper proposes a more practical threat model in which physical objects are injected into the training data as triggers.
Result: GoBA achieves the backdoor goal on 97% of inputs when the trigger is present, with no performance degradation on clean inputs. Action trajectory and trigger color are the key factors, while trigger size has little effect.
Insight: Physical objects can serve as effective backdoor triggers; action design and color choice are critical to attack success, whereas sensitivity to trigger size is low.
Abstract: Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in the presence of physical triggers. Specifically, based on a popular VLA benchmark LIBERO, we introduce BadLIBERO that incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA’s actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal in 97 percent of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at https://goba-attack.github.io/.
eess.AS [Back]
[150] Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization eess.AS | cs.CV | cs.SDPDF
Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang
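TL;DR: The paper introduces the SlideASR task, which uses visual information from slides to improve speech recognition accuracy, and proposes Visually-Anchored Policy Optimization (VAPO), a new method that optimizes the reasoning process with reinforcement learning.
Details
Motivation: In specialized settings such as academic lectures, automatic speech recognition (ASR) systems are often limited by poor recognition of domain-specific terminology. Existing methods are complex and underperform, so a more effective end-to-end solution is needed.
Result: Experiments show that VAPO significantly improves recognition accuracy for domain-specific terms and provides an effective end-to-end paradigm for SlideASR.
Insight: 1. Visual information such as slides can serve as a strong auxiliary signal for ASR; 2. Step-by-step reasoning (think, then answer) combined with reinforcement learning is an effective strategy; 3. OLLMs need structured methods to avoid degenerating into OCR systems.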
Abstract: Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model’s reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured “Look before Transcription” procedure using a
[151] Target speaker anonymization in multi-speaker recordings eess.AS | cs.CL | cs.CRPDF
Natalia Tomashenko, Junichi Yamagishi, Xin Wang, Yun Liu, Emmanuel Vincent
TL;DR: The paper studies the challenge of target speaker anonymization in multi-speaker recordings, specifically the case where only a single target speaker needs to be anonymized.
Details
Motivation: Existing research focuses on anonymizing single-speaker audio, whereas multi-speaker scenarios such as call centers require anonymizing only a specific speaker, for example the customer.
Result: The study addresses a gap in target speaker anonymization for multi-speaker scenarios and proposes improved evaluation methodologies.
Insight: Precise target speaker anonymization in multi-speaker conversations has practical value, especially where privacy requirements are high.
Abstract: Most of the existing speaker anonymization research has focused on single-speaker audio, leading to the development of techniques and evaluation metrics optimized for such conditions. This study addresses the significant challenge of speaker anonymization within multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts like call centers, where customer privacy necessitates anonymizing only the customer’s voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow us to accurately assess privacy protection and utility in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development and proposing corresponding improved evaluation methodologies.
cs.SD [Back]
[152] MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation cs.SD | cs.CV | cs.LG | eess.ASPDF
Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
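TL;DR: MMAudioSep is a generative model for video/text-queried sound separation built on a pretrained video-to-audio model. Reusing what the pretrained audio generative model has learned about the relationship between video/text and audio makes training more efficient. Experiments show that it outperforms existing baselines while retaining the original video-to-audio generation capability.
Details
Motivation: Existing sound separation models usually have to be trained from scratch, which is inefficient. MMAudioSep transfers knowledge from a pretrained model to improve training efficiency and broaden the uses of generative models.
Result: MMAudioSep outperforms existing baseline models while retaining the original video-to-audio generation function.
Insight: Pretrained generative models can be adapted to downstream sound tasks, which shows the potential of generative models to extend across the audio domain.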
Abstract: We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
cs.AI [Back]
[153] Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation cs.AI | cs.CV | cs.ROPDF
Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu
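TL;DR: UniWM is a unified, memory-augmented world model that integrates visual foresight and planning in a single multimodal autoregressive framework. It resolves the state-action misalignment of existing navigation methods and substantially improves success rate and generalization in visual navigation.
Details
Motivation: Existing visual navigation methods use modular architectures, which leads to state-action misalignment and limited adaptability in novel or dynamic scenarios. UniWM aims to solve this with a unified world model.
Result: Across four challenging benchmarks, UniWM improves navigation success rates by up to 30%, reduces trajectory errors, and shows strong zero-shot generalization on the TartanDrive dataset.
Insight: A unified model of visual foresight and planning can markedly improve navigation performance, and the memory-augmentation mechanism is essential for long-horizon reasoning.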
Abstract: Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.
[154] Auto-scaling Continuous Memory for GUI Agent cs.AI | cs.CL | cs.CV | cs.CY | cs.LGPDF
Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang
TL;DR: The paper proposes a continuous, scalable memory mechanism for GUI agents. It encodes trajectories with the vision-language model itself to cut context cost while preserving visual detail, and adds an auto-scaling data collection pipeline, substantially improving success rates on long-horizon tasks and under distribution shift.
Details
Motivation: Existing GUI agents typically compress past trajectories into text tokens, which inflates context length and loses key visual information such as exact widget size and position. The paper addresses these problems so that agents generalize better to unfamiliar interfaces and long-horizon tasks.
Result: On real-world GUI benchmarks, the agent equipped with continuous memory improves success rates on long-horizon tasks and under distribution shift. Qwen-2.5-VL-7B with continuous memory performs comparably to state-of-the-art closed-source models such as GPT-4o.
Insight: 1. Continuous memory suits GUI agents better than text memory because it preserves and exploits visual information. 2. The auto-scaling data collection pipeline grows the memory at low cost. 3. Fine-tuning a small number of parameters is enough for effective memory augmentation, avoiding large-scale training cost.
Abstract: We study how to endow GUI agents with scalable memory that helps them generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone’s input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about $4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
[155] Optimizing delivery for quick commerce factoring qualitative assessment of generated routes cs.AI | cs.CLPDF
Milon Bhattacharya, Milan Kumar
TL;DR: The paper proposes a framework based on large language models (LLMs) for evaluating delivery routes generated by vehicle routing problem (VRP) solvers, with the aim of improving logistics efficiency.
Details
Motivation: The e-commerce market in India is growing rapidly, and last-mile delivery accounts for nearly half of operational expenses. Conventional VRP solutions are limited by unstructured addresses, incomplete maps, and similar issues, so a smarter evaluation method is needed.
Result: Open-source LLMs identify routing issues with 79% accuracy and proprietary models reach up to 86%, which suggests that LLM-based evaluation is promising.
Insight: LLM-based evaluation goes beyond conventional distance and time metrics and can improve cost efficiency, delivery reliability, and sustainability, especially in developing countries.
Abstract: India’s e-commerce market is projected to grow rapidly, with last-mile delivery accounting for nearly half of operational expenses. Although vehicle routing problem (VRP) based solvers are widely used for delivery planning, their effectiveness in real-world scenarios is limited due to unstructured addresses, incomplete maps, and computational constraints in distance estimation. This study proposes a framework that employs large language models (LLMs) to critique VRP-generated routes against policy-based criteria, allowing logistics operators to evaluate and prioritise more efficient delivery plans. As an illustration of our approach, we generated, annotated, and evaluated 400 cases using large language models. Our study found that open-source LLMs identified routing issues with 79% accuracy, while proprietary reasoning models reached up to 86%. The results demonstrate that LLM-based evaluation of VRP-generated routes can be an effective and scalable layer of evaluation that goes beyond conventional distance- and time-based metrics. This has implications for improving cost efficiency, delivery reliability, and sustainability in last-mile logistics, especially for developing countries like India.
[156] Robust Heuristic Algorithm Design with LLMs cs.AI | cs.CL | cs.NIPDF
Pantea Karimi, Dany Rouhana, Pooria Namyar, Siva Kesava Reddy Kakarla, Venkat Arun
TL;DR: The paper improves the robustness and performance of LLM-generated heuristics by explaining where they underperform and fixing those weaknesses, achieving roughly 28× better worst-case performance than FunSearch.
Details
Motivation: Existing LLM-generated heuristics perform poorly in some cases, and there is no mechanism for explaining and repairing their weaknesses.
Result: The generated heuristics achieve roughly 28× better worst-case performance, also improve average performance, and maintain runtime efficiency.
Insight: Explaining and repairing the weaknesses of LLM-generated heuristics can significantly improve robustness without sacrificing runtime efficiency.
Abstract: We posit that we can generate more robust and performant heuristics if we augment approaches using LLMs for heuristic design with tools that explain why heuristics underperform and suggest how to fix them. We find even simple ideas that (1) expose the LLM to instances where the heuristic underperforms; (2) explain why they occur; and (3) specialize design to regions in the input space, can produce more robust algorithms compared to existing techniques: the heuristics we produce have a $\sim28\times$ better worst-case performance compared to FunSearch, improve average performance, and maintain the runtime.
[157] Everyone prefers human writers, including AI cs.AI | cs.CL | cs.HCPDF
Wouter Haverals, Meredith Martin
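TL;DR: Through experiments, the paper shows that both humans and AI exhibit a systematic preference for "human-authored" work when evaluating literary style, and that the bias is stronger in AI.
Details
Motivation: As AI writing tools spread, we need to understand how humans and machines evaluate literary style, a highly subjective domain that lacks objective standards.
Result: Both humans and AI show a pro-human attribution bias (humans +13.7 percentage points, AI +34.3 percentage points), with the AI bias being stronger.
Insight: AI models have absorbed human cultural biases against artificial creativity during training, and evaluators invert their assessment criteria when content carries an "AI-generated" label.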
Abstract: As AI writing tools become widespread, we need to understand how both humans and machines evaluate literary style, a domain where objective standards are elusive and judgments are inherently subjective. We conducted controlled experiments using Raymond Queneau’s Exercises in Style (1947) to measure attribution bias across evaluators. Study 1 compared human participants (N=556) and AI models (N=13) evaluating literary passages from Queneau versus GPT-4-generated versions under three conditions: blind, accurately labeled, and counterfactually labeled. Study 2 tested bias generalization across a 14$\times$14 matrix of AI evaluators and creators. Both studies revealed systematic pro-human attribution bias. Humans showed +13.7 percentage point (pp) bias (Cohen’s h = 0.28, 95% CI: 0.21-0.34), while AI models showed +34.3 percentage point bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect (P$<$0.001). Study 2 confirmed this bias operates across AI architectures (+25.8pp, 95% CI: 24.1-27.6%), demonstrating that AI systems systematically devalue creative content when labeled as “AI-generated” regardless of which AI created it. We also find that attribution labels cause evaluators to invert assessment criteria, with identical features receiving opposing evaluations based solely on perceived authorship. This suggests AI models have absorbed human cultural biases against artificial creativity during training. Our study represents the first controlled comparison of attribution bias between human and artificial evaluators in aesthetic judgment, revealing that AI systems not only replicate but amplify this human tendency.
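The effect sizes in the abstract are reported as Cohen's h, the standard effect size for a difference between two proportions: h = 2·arcsin(√p₁) − 2·arcsin(√p₂). The sketch below implements that formula. The abstract gives only the percentage-point gaps, not the underlying proportions, so any inputs passed to the function are hypothetical and will not reproduce the paper's values.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions, each in [0, 1]."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    # The arcsine transform stabilizes the variance of a proportion.
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))
```

Because of the arcsine transform, the same percentage-point gap yields a larger h near 0 or 1 than near 0.5. This is why the abstract reports h alongside the raw percentage-point gaps.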
[158] ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review cs.AI | cs.CLPDF
Gaurav Sahu, Hugo Larochelle, Laurent Charlin, Christopher Pal
TL;DR: ReviewerToo is a modular framework for studying and deploying AI-assisted peer review to make reviewing more systematic and consistent. Experiments show that AI reviewers excel at some tasks, such as fact-checking, but human experts are still needed for judgments such as assessing methodological novelty.
Details
Motivation: Peer review is central to scientific publishing but suffers from inconsistency, subjectivity, and scalability problems that AI-assisted review may help address.
Result: The AI reviewer reaches 81.8% accuracy on the accept/reject classification task (versus 83.9% for the average human reviewer), and its generated reviews are rated higher in quality than the human average by an LLM judge, but complex evaluations still require expert involvement.
Insight: AI reviewers can strengthen the review process in consistency, coverage, and fairness, but complex evaluative judgments still need human experts, so future work should target hybrid systems.
Abstract: Peer review is the cornerstone of scientific publishing, yet it suffers from inconsistencies, reviewer subjectivity, and scalability challenges. We introduce ReviewerToo, a modular framework for studying and deploying AI-assisted peer review to complement human judgment with systematic and consistent assessments. ReviewerToo supports systematic experiments with specialized reviewer personas and structured evaluation criteria, and can be partially or fully integrated into real conference workflows. We validate ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR 2025, where our experiments with the gpt-oss-120b model achieve 81.8% accuracy for the task of categorizing a paper as accept/reject compared to 83.9% for the average human reviewer. Additionally, ReviewerToo-generated reviews are rated as higher quality than the human average by an LLM judge, though still trailing the strongest expert contributions. Our analysis highlights domains where AI reviewers excel (e.g., fact-checking, literature coverage) and where they struggle (e.g., assessing methodological novelty and theoretical contributions), underscoring the continued need for human expertise. Based on these findings, we propose guidelines for integrating AI into peer-review pipelines, showing how AI can enhance consistency, coverage, and fairness while leaving complex evaluative judgments to domain experts. Our work provides a foundation for systematic, hybrid peer-review systems that scale with the growth of scientific publishing.
[159] Semantic-Condition Tuning: Fusing Graph Context with Large Language Models for Knowledge Graph Completion cs.AI | cs.CL | I.2.7PDF
Ruitong Liu, Yan Wen, Te Sun, Yunjia Wu, Pingyang Huang
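TL;DR: The paper proposes Semantic-condition Tuning (SCT), a new knowledge injection paradigm that uses a graph neural network to extract a contextual semantic condition from the local graph neighborhood and fuses it deeply with text embeddings, significantly improving performance on knowledge graph completion.
Details
Motivation: Existing prefix-tuning methods simply concatenate knowledge embeddings with text inputs, which overlooks the rich relational semantics in knowledge graphs and places an implicit reasoning burden on the large language model (LLM). The paper aims to address these problems with a more effective knowledge fusion method.
Result: On knowledge graph benchmarks, SCT significantly outperforms prefix-tuning and other baselines, demonstrating its effectiveness and robustness.
Insight: By modulating the input representation with the semantic condition extracted by the graph neural network, SCT gives the LLM a more direct and potent signal and enables more accurate knowledge reasoning.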
Abstract: Fusing Knowledge Graphs with Large Language Models is crucial for knowledge-intensive tasks like knowledge graph completion. The prevailing paradigm, prefix-tuning, simply concatenates knowledge embeddings with text inputs. However, this shallow fusion overlooks the rich relational semantics within KGs and imposes a significant implicit reasoning burden on the LLM to correlate the prefix with the text. To address these, we propose Semantic-condition Tuning (SCT), a new knowledge injection paradigm comprising two key modules. First, a Semantic Graph Module employs a Graph Neural Network to extract a context-aware semantic condition from the local graph neighborhood, guided by knowledge-enhanced relations. Subsequently, this condition is passed to a Condition-Adaptive Fusion Module, which, in turn, adaptively modulates the textual embedding via two parameterized projectors, enabling a deep, feature-wise, and knowledge-aware interaction. The resulting pre-fused embedding is then fed into the LLM for fine-tuning. Extensive experiments on knowledge graph benchmarks demonstrate that SCT significantly outperforms prefix-tuning and other strong baselines. Our analysis confirms that by modulating the input representation with semantic graph context before LLM inference, SCT provides a more direct and potent signal, enabling more accurate and robust knowledge reasoning.
[160] LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads? cs.AI | cs.CL | cs.LGPDF
Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren
TL;DR: LiveOIBench is a benchmark of 403 Olympiad-level programming problems for evaluating the coding ability of large language models. GPT-5 performs well but still falls short of top human contestants.
Details
Motivation: Existing coding benchmarks lack sufficiently challenging problems and have inadequate test case coverage, among other issues, so a more comprehensive evaluation platform is needed.
Result: GPT-5 reaches the 81.76th percentile but still trails top human contestants, who usually place above the 90th percentile; the open-weight model GPT-OSS-120B reaches only the 60th percentile.
Insight: Strong reasoning models prioritize precise problem analysis over excessive exploration, so future models should emphasize structured analysis.
Abstract: Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of the performance of top human contestants, who usually place above the 90th percentile. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results will be made publicly available on our website.
cs.LG [Back]
[161] Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction cs.LG | cs.AI | cs.CV | cs.DC | cs.GR | cs.MMPDF
Motahare Mounesan, Sourya Saha, Houchao Gan, Md. Nurul Absur, Saptarshi Debroy
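TL;DR: The paper proposes a reinforcement learning-based edge resource management framework for multi-view 3D reconstruction in dynamic environments that balances latency against reconstruction quality.
Details
Motivation: Real-time multi-view 3D reconstruction is critical in applications such as fire rescue, but the dynamic and unpredictable nature of edge resources (for example, degraded image quality and unstable networks) threatens the reliability of reconstruction.
Result: Experiments show that the framework effectively balances end-to-end latency and reconstruction quality, improving application reliability.
Insight: Reinforcement learning can handle dynamic resource management in edge environments and offers a reference for similar tasks.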
Abstract: Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
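The abstract mentions two online Q-learning agents, one for camera selection and one for server selection. A minimal tabular Q-learning agent is sketched below to show the kind of update such an agent would perform. The state encoding, action sets, reward, and hyperparameters are not given in the abstract, so everything here (`QAgent`, `alpha`, `gamma`, `epsilon`) is a generic illustration, not the authors' design.

```python
import random
from collections import defaultdict


class QAgent:
    """Generic tabular Q-learning agent with epsilon-greedy exploration."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.rng = random.Random(seed)

    def act(self, state):
        """Pick a random action with probability epsilon, else the greedy one."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One-step Q-learning update toward reward + gamma * max_a Q(next_state, a)."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

In a two-agent setup like the one described, one instance could choose among camera subsets and another among edge servers, each updating from a shared reward that trades latency against reconstruction quality. That reward shaping is an assumption here.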
[162] Limitations of Normalization in Attention Mechanism cs.LG | cs.AI | cs.CLPDF
Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
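TL;DR: This paper studies the limitations of normalization in attention mechanisms. Theoretical and empirical analysis shows that softmax normalization falls short at distinguishing informative tokens, and the paper points to directions for future improvement.
Details
Motivation: Normalization in attention mechanisms (e.g., softmax) is widely used in deep learning models, but its limitations have not been thoroughly studied. The paper aims to fill this gap by examining how normalization affects a model's selective ability and the challenges it poses during training.
Result: (1) As the number of selected tokens increases, the model's ability to distinguish informative tokens declines and selection tends toward uniform; (2) softmax causes training instability at low temperatures.
Insight: Current softmax normalization has limitations in attention mechanisms; more robust normalization and selection strategies are needed to improve model performance and training stability.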
Abstract: This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model’s selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model’s ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
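A toy calculation gives some intuition for why softmax weights flatten as more tokens compete. It is related to, but not a reproduction of, the paper's bounds: the fixed logit `gap`, the all-zero distractor logits, and the helper name `top_weight` are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_weight(n_tokens, gap=2.0, temperature=1.0):
    """Weight on one 'informative' token whose logit exceeds the
    other n_tokens - 1 distractor logits by a fixed gap."""
    logits = [gap] + [0.0] * (n_tokens - 1)
    return softmax(logits, temperature)[0]


for n in (4, 64, 1024):
    # With a bounded logit gap, the informative token's share shrinks as n grows.
    print(n, top_weight(n), top_weight(n, temperature=0.25))
```

Lowering the temperature restores the informative token's share in this toy setting, but it also sharpens the softmax, which is the regime where the abstract reports gradient-sensitivity problems.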
[163] Efficient Bayesian Inference from Noisy Pairwise Comparisons cs.LG | cs.CV
Till Aczel, Lucas Theis, Wattenhofer Roger
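TL;DR: BBQ is a Bayesian Bradley-Terry variant that explicitly models rater quality and removes unreliable participants, yielding faster convergence and more robust evaluation of generative models.
Details
Motivation: Existing Bradley-Terry methods either ignore rater variability or lack convergence guarantees, which leaves evaluations noisy. BBQ aims to address this and improve the reliability and efficiency of generative model evaluation.
Result: Experiments show that BBQ converges faster, gives better-calibrated uncertainty estimates, and produces more robust rankings, including with noisy raters.
Insight: Explicitly modeling rater quality can substantially improve the reliability of generative model evaluation, which is especially useful for crowdsourced or low-cost evaluation.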
Abstract: Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
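For readers unfamiliar with the baseline that BBQ extends, the sketch below fits a plain Bradley-Terry model with the classic minorization-maximization update. It is not BBQ: there is no rater-quality model, no Bayesian prior, and no EM step. The win counts and the function names are hypothetical.

```python
def bt_prob(s_i, s_j):
    """Bradley-Terry probability that item i beats item j."""
    return s_i / (s_i + s_j)


def fit_bradley_terry(wins, n_items, iters=200):
    """Fit item strengths from pairwise win counts.

    wins[(i, j)] is the number of times item i beat item j.
    Uses the standard MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j).
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(c for (a, _), c in wins.items() if a == i)
            denom = 0.0
            for (a, b), c in wins.items():
                if a == i:
                    denom += c / (p[i] + p[b])
                elif b == i:
                    denom += c / (p[i] + p[a])
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Rescale so strengths stay comparable across iterations.
        scale = n_items / sum(new_p)
        p = [x * scale for x in new_p]
    return p


# Hypothetical comparison counts among three generative models.
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
strengths = fit_bradley_terry(wins, n_items=3)
print(strengths, bt_prob(strengths[0], strengths[2]))
```

In this plain model every comparison counts equally; the abstract's point is that down-weighting unreliable raters changes how much each comparison contributes.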
[164] STaTS: Structure-Aware Temporal Sequence Summarization via Statistical Window Merging cs.LG | cs.CV
Disharee Bhowmick, Ranjith Ramanathan, Sathyanarayanan N. Aakur
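TL;DR: STaTS is a lightweight, unsupervised time series summarization framework that adaptively compresses time series via statistical window merging, preserving core dynamics while substantially reducing computational cost.
Details
Motivation: Time series data often contain latent temporal structure, but existing methods typically treat all time steps equally, leading to inefficiency, poor robustness, and limited scalability.
Result: Across 150+ datasets, STaTS achieves up to 30x compression while retaining 85-90% of full-model performance, and it is more robust under noise.
Insight: STaTS can serve as a general-purpose preprocessor that integrates with existing time series encoders without retraining, improving efficiency and robustness.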
Abstract: Time series data often contain latent temporal structure, transitions between locally stationary regimes, repeated motifs, and bursts of variability, that are rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1, ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
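The idea of BIC-based change-point detection followed by per-segment summarization can be sketched in a few lines. This is a simplified, single-resolution stand-in and not the STaTS algorithm: it assumes a univariate series, a piecewise-constant Gaussian mean model, a parameter count of 4 for the split model, and mean summaries only. The function names and the sample series are hypothetical.

```python
import math


def _rss(xs):
    """Residual sum of squares around the segment mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def _bic(rss, n, k):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(max(rss, 1e-12) / n) + k * math.log(n)


def segment(xs, start=0, min_len=3):
    """Recursively split xs wherever a two-mean model has lower BIC
    than a single-mean model. Returns half-open (start, end) index pairs."""
    n = len(xs)
    if n < 2 * min_len:
        return [(start, start + n)]
    best_t, best_bic = None, _bic(_rss(xs), n, k=2)
    for t in range(min_len, n - min_len + 1):
        bic_two = _bic(_rss(xs[:t]) + _rss(xs[t:]), n, k=4)
        if bic_two < best_bic:
            best_t, best_bic = t, bic_two
    if best_t is None:
        return [(start, start + n)]
    left = segment(xs[:best_t], start, min_len)
    right = segment(xs[best_t:], start + best_t, min_len)
    return left + right


def summarize(xs, min_len=3):
    """Replace each detected segment with its mean (one token per segment)."""
    return [sum(xs[a:b]) / (b - a) for a, b in segment(xs, min_len=min_len)]


series = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 5.0, 5.1, 4.9, 5.05, 4.95, 5.0]
segments = segment(series)
tokens = summarize(series)
print(segments, tokens)
```

Here twelve samples collapse to two summary tokens; a multi-resolution detector and richer summaries such as GMMs, as named in the abstract, would replace the simple pieces used above.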
[165] Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings cs.LG | cs.AI | cs.CL
Shikun Liu, Haoyu Wang, Mufei Li, Pan Li
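TL;DR: This paper proposes a structure-aware text embedding approach that integrates structural relations (such as hyperlinks or citations) directly into the LLM's internal encoding process rather than relying on traditional post-hoc aggregation. Its two main methods (sequential concatenation and parallel caching) outperform baselines across multiple tasks, and the analysis reveals key trade-offs.
Details
Motivation: Existing text embedding models usually ignore structural information in text (such as hyperlinks or citations), even though it provides important context in many real-world datasets. The paper explores the potential of integrating this structure during the model's internal encoding.
Result: Experiments show that the structure-aware methods outperform text-only and post-hoc aggregation baselines on retrieval, clustering, classification, and recommendation. Sequential concatenation performs better with noisy, moderate-length contexts, while parallel caching suits long, high-signal contexts.
Insight: Integrating structural information directly into the language model's internal encoding can significantly improve embedding quality. The effectiveness of the proposed techniques for handling noisy data also provides practical tools.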
Abstract: Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM’s internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
[166] Diagnosing and Mitigating System Bias in Self-Rewarding RL cs.LG | cs.CL
Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng
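TL;DR: The paper proposes RLER, which ensembles diverse models and adapts reward interpolation and rollout selection to address system bias in self-rewarding RL; its performance approaches RLVR with markedly better stability.
Details
Motivation: Self-rewarding RL (RLIR) relies on the model assigning its own rewards, which introduces system bias and instability and hinders sustainable scaling on unlabeled data.
Result: RLER improves on RLIR by 13.6% and is only 3.6% below RLVR, and it scales stably on unlabeled data.
Insight: The system-bias metrics show that different bias types affect convergence and stability differently; ensembling can effectively balance bias and performance.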
Abstract: Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
[167] Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs cs.LG | cs.AI | cs.CL
Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
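TL;DR: The paper introduces the problem of multimodal prompt optimization and proposes the MPO framework, which jointly optimizes multimodal prompts and uses a Bayesian selection strategy to improve the performance of multimodal large language models.
Details
Motivation: Existing prompt optimization methods are confined to the text modality and fail to exploit the full potential of multimodal large language models (MLLMs).
Result: Experiments show that MPO outperforms text-only optimization methods across modalities including images, videos, and molecules.
Insight: Multimodal prompt optimization is a key step toward realizing the potential of MLLMs, going beyond the limits of conventional text prompts.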
Abstract: Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
[168] HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness cs.LG | cs.CL
Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun Li, Jiaqing Liang
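TL;DR: HINT is an adaptive hinting framework that supplies heuristic hints rather than direct answers, improving exploration efficiency and training stability in reinforcement learning.
Details
Motivation: When reinforcement learning is used to strengthen the chain-of-thought reasoning of large language models, high task difficulty often leads to sparse rewards and inefficient training. Prevalent methods such as GRPO often fail when task difficulty exceeds the model's capacity, and remedies that add off-policy guidance can misguide policy updates because of low training affinity (a distributional mismatch between the external guidance and the model's policy).
Result: On mathematical reasoning tasks, HINT consistently outperforms existing methods, with more stable learning and greater data efficiency, and achieves state-of-the-art results across models of various scales.
Insight: Heuristic hints work better than direct answers: they preserve the model's autonomous reasoning while markedly improving training stability and data efficiency.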
Abstract: Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model’s capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates. In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model’s policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data efficiency. Code is available on GitHub.
cs.DB
[169] HES-SQL: Hybrid Reasoning for Efficient Text-to-SQL with Structural Skeleton Guidance cs.DB | cs.AI | cs.CL
Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu
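TL;DR: HES-SQL is a hybrid training framework that combines SFT and GRPO, improving Text-to-SQL through structural skeleton guidance and execution-efficiency optimization.
Details
Motivation: To improve the semantic accuracy and computational efficiency of Text-to-SQL generation and provide a more robust solution for natural language interfaces to databases.
Result: Execution accuracy reaches 79.14% on BIRD and 54.9% on KaggleDBQA, with efficiency gains of 11% to 20%.
Insight: Execution-informed reinforcement learning balances semantic accuracy and computational efficiency, offering a new paradigm for structured generation tasks.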
Abstract: We present HES-SQL, a novel hybrid training framework that advances Text-to-SQL generation through the integration of thinking-mode-fused supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Our approach introduces three key innovations: (1) a skeleton-completeness scoring mechanism that enhances preference alignment between generated queries and optimal SQL structures; (2) a query-latency-aware reward system that incentivizes the generation of computationally efficient SQL queries; (3) a self-distillation process for thinking-mode completion that prevents degradation of the model’s reasoning capabilities. This framework enables hybrid thinking models to switch between reasoning and non-reasoning modes while improving SQL query accuracy and execution efficiency. Experimental evaluation, conducted on MySQL 8.0 and SQLite 3.42 under controlled single-user conditions, demonstrates that HES-SQL achieves competitive performance with execution accuracies of 79.14% and 54.9% on the BIRD and KaggleDBQA benchmarks, respectively. Query latency is measured as the end-to-end execution time of generated queries on the DBMS, averaged over multiple runs to mitigate variance. Efficiency gains range from 11% to 20% relative to supervised baselines. Our results establish a new paradigm for Text-to-SQL systems that effectively balances semantic accuracy with computational efficiency through execution-informed reinforcement learning (RL). The proposed methodology has significant implications for developing robust natural language interfaces to databases and can be extended to broader structured generation tasks requiring both correctness and efficiency optimization.
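The latency measurement described in the abstract (end-to-end execution time averaged over multiple runs) can be sketched against an in-memory SQLite database. The `orders` table, the sample query, and in particular the `latency_aware_reward` formula are hypothetical; the abstract does not specify how the paper combines correctness and latency into a reward.

```python
import sqlite3
import time


def mean_latency(conn, sql, runs=5):
    """Average end-to-end execution time of a query over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the query fully executes
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def latency_aware_reward(is_correct, latency, baseline_latency, weight=0.2):
    """Hypothetical reward: correctness first, then a bounded bonus or
    penalty for running faster or slower than a baseline query."""
    if not is_correct:
        return 0.0
    speedup = (baseline_latency - latency) / baseline_latency
    return 1.0 + weight * max(-1.0, min(1.0, speedup))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    [(float(i),) for i in range(1000)],
)

latency = mean_latency(conn, "SELECT COUNT(*) FROM orders WHERE amount > 500")
print(latency, latency_aware_reward(True, latency, baseline_latency=latency))
```

Gating the latency term on correctness keeps a fast but wrong query from outscoring a slow but correct one, which is one plausible way to balance the two objectives the abstract names.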