cs.CV

[1] Review of Hallucination Understanding in Large Language and Vision Models cs.CV | cs.AI

Zhengyi Ho, Siyuan Liang, Dacheng Tao

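TL;DR: This survey reviews hallucination in large language and vision models, proposes a unified multi-level framework for analyzing text and image hallucinations, and shows that hallucinations follow predictable patterns tied to data distributions and biases.

Details

Motivation: Hallucinations in large language and vision models can propagate errors and cause financial losses in real-world applications, yet current understanding of them remains fragmented.

Result: Hallucinations are found to often stem from predictable patterns in data distributions and from biases the model inherits.

Insight: A systematic understanding of the root causes of hallucination helps in developing more robust generative AI solutions.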

Abstract: The widespread adoption of large language and vision models in real-world applications has made urgent the need to address hallucinations – instances where models produce incorrect or nonsensical outputs. These errors can propagate misinformation during deployment, leading to both financial and operational harm. Although much research has been devoted to mitigating hallucinations, our understanding of them is still incomplete and fragmented. Without a coherent understanding of hallucinations, proposed solutions risk mitigating surface symptoms rather than underlying causes, limiting their effectiveness and generalizability in deployment. To tackle this gap, we first present a unified, multi-level framework for characterizing both image and text hallucinations across diverse applications, aiming to reduce conceptual fragmentation. We then link these hallucinations to specific mechanisms within a model’s lifecycle, using a task-modality interleaved approach to promote a more integrated understanding. Our investigations reveal that hallucinations often stem from predictable patterns in data distributions and inherited biases. By deepening our understanding, this survey provides a foundation for developing more robust and effective solutions to hallucinations in real-world generative AI systems.


[2] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations cs.CV | cs.AI

Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong

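TL;DR: This paper studies the robustness of Vision-Language-Action (VLA) models under multi-modal perturbations and proposes RobustVLA, which substantially improves performance through offline robust optimization and input consistency enforcement.

Details

Motivation: Existing VLA models handle visual perturbations reasonably well but overlook multi-modal perturbations in actions, instructions, environments, and observations, which limits their use in real-world settings.

Result: On LIBERO, RobustVLA achieves absolute gains over baselines of 12.6% (pi0 backbone) and 10.4% (OpenVLA backbone) across 17 perturbations, runs inference 50.6x faster than existing visual-robust VLAs, and gains 10.4% under mixed perturbations. On FR5 robot tasks, it achieves an absolute gain of 65.6%.

Insight: 1. Actions are the most fragile modality; 2. visually robust VLAs do not carry their robustness over to other modalities; 3. a diffusion-based action head markedly improves robustness.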

Abstract: In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate that RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. RobustVLA is particularly effective on a real-world FR5 robot with limited demonstrations, showing absolute gains of 65.6% under perturbations of four modalities.
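
The bandit formulation in the abstract can be illustrated with the classic UCB1 rule, where each arm is a perturbation type and the reward is how much harm that perturbation causes. This is a minimal sketch under stated assumptions, not RobustVLA's implementation: the function names, the exploration constant `c`, and the toy harm rates in `toy_harm` are hypothetical, and how harm is actually measured is not specified in the abstract.

```python
import math
import random


def ucb_select_perturbation(harm_fn, n_arms, n_rounds, c=2.0, seed=0):
    """Run UCB1 over perturbation types; return (pull counts, mean harm estimates)."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once before using confidence bounds
        else:
            # Pick the arm with the highest upper confidence bound on harm.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]),
            )
        reward = harm_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
    return counts, means


def toy_harm(arm, rng):
    """Hypothetical stand-in: 1.0 if the perturbation broke the rollout, else 0.0."""
    true_harm = [0.2, 0.5, 0.8]  # assumed failure rates, for illustration only
    return 1.0 if rng.random() < true_harm[arm] else 0.0


counts, means = ucb_select_perturbation(toy_harm, n_arms=3, n_rounds=2000)
most_harmful = max(range(3), key=lambda a: counts[a])
print(counts, most_harmful)
```

Under this rule, the arm that is pulled most often is the one the algorithm currently believes to be most harmful, so training effort concentrates on it while weaker perturbations are still sampled occasionally.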


[3] Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models cs.CV | cs.AI

Junjie Li, Ziao Wang, Jianghong Ma, Xiaofeng Zhang

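TL;DR: The paper proposes CADC, a capability-oriented data curation framework that optimizes instruction-tuning data for vision-language models by analyzing the model's intrinsic capabilities rather than relying on task heuristics.

Details

Motivation: Instruction tuning of existing vision-language models (VLMs) is unstable, and data selection methods are mostly black-box heuristics that ignore the model's intrinsic capabilities, wasting resources and degrading performance.

Result: With only 5% of the original data, CADC surpasses full-data training on multimodal benchmarks, supporting the view that intrinsic capabilities are the basic units of model learning.

Insight: Intrinsic capabilities are a key factor governing instruction tuning; data curation should start from a capability perspective rather than from traditional task-oriented black-box methods.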

Abstract: Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the instruction-tuning data budget often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.


[4] Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness cs.CV | cs.AI

Yuchen Song, Andong Chen, Wenxin Zhu, Kehai Chen, Xuefeng Bai

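TL;DR: The paper proposes C$^3$B, a new comic-based, multicultural, multitask, and multilingual cultural awareness benchmark with over 2000 images and over 18000 QA pairs, for evaluating the cultural awareness of multimodal large language models.

Details

Motivation: Current cultural awareness benchmarks lack progressive difficulty in their task design and lack cross-lingual tasks. They also tend to use real-world images, each typically containing a single culture, which makes them relatively easy for multimodal large language models.

Result: Evaluation on 11 open-source multimodal large language models shows a significant gap between these models and human performance, indicating that C$^3$B is challenging for current models.

Insight: With comic-based multicultural scenes and multi-level task design, C$^3$B effectively evaluates models' cultural awareness and points to directions for future research.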

Abstract: Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images, each of which typically contains a single culture, making these benchmarks relatively easy for MLLMs. To address this, we propose C$^3$B (Comics Cross-Cultural Benchmark), a novel multicultural, multitask, and multilingual cultural awareness benchmark. C$^3$B comprises over 2000 images and over 18000 QA pairs, constructed on three tasks of progressive difficulty, from basic visual recognition to higher-level cultural conflict understanding, and finally to cultural content generation. We conducted evaluations on 11 open-source MLLMs, revealing a significant performance gap between MLLMs and humans. The gap demonstrates that C$^3$B poses substantial challenges for current MLLMs, encouraging future research to advance the cultural awareness capabilities of MLLMs.


[5] Beyond the Prompt: Gender Bias in Text-to-Image Models, with a Case Study on Hospital Professions cs.CV | cs.AI | I.2 ARTIFICIAL INTELLIGENCE

Franck Vandewiele, Remi Synave, Samuel Delepoulle, Remi Cozot

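TL;DR: This paper studies gender representation bias in six state-of-the-art text-to-image (TTI) models, using hospital professions as a case study, and finds that gender stereotypes are pervasive while behavior and prompt sensitivity vary across models.

Details

Motivation: As TTI models spread through professional, educational, and creative settings, the social biases embedded in their outputs become increasingly prominent. The paper aims to expose systematic gender-representation bias in these models and to recommend improvements.

Result: All models exhibit gender stereotypes (nurses are exclusively women, surgeons predominantly men), but sensitivity to prompt wording differs markedly across models.

Insight: Gender bias in TTI models is both systematic and model-specific; prompt design and model defaults are critical to the diversity of generated outputs.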

Abstract: Text-to-image (TTI) models are increasingly used in professional, educational, and creative contexts, yet their outputs often embed and amplify social biases. This paper investigates gender representation in six state-of-the-art open-weight models: HunyuanImage 2.1, HiDream-I1-dev, Qwen-Image, FLUX.1-dev, Stable-Diffusion 3.5 Large, and Stable-Diffusion-XL. Using carefully designed prompts, we generated 100 images for each combination of five hospital-related professions (cardiologist, hospital director, nurse, paramedic, surgeon) and five portrait qualifiers (“”, corporate, neutral, aesthetic, beautiful). Our analysis reveals systematic occupational stereotypes: all models produced nurses exclusively as women and surgeons predominantly as men. However, differences emerge across models: Qwen-Image and SDXL enforce rigid male dominance, HiDream-I1-dev shows mixed outcomes, and FLUX.1-dev skews female in most roles. HunyuanImage 2.1 and Stable-Diffusion 3.5 Large also reproduce gender stereotypes but with varying degrees of sensitivity to prompt formulation. Portrait qualifiers further modulate gender balance, with terms like corporate reinforcing male depictions and beautiful favoring female ones. Sensitivity varies widely: Qwen-Image remains nearly unaffected, while FLUX.1-dev, SDXL, and SD3.5 show strong prompt dependence. These findings demonstrate that gender bias in TTI models is both systematic and model-specific. Beyond documenting disparities, we argue that prompt wording plays a critical role in shaping demographic outcomes. The results underscore the need for bias-aware design, balanced defaults, and user guidance to prevent the reinforcement of occupational stereotypes in generative AI.


[6] Reinforcement Learning-Based Prompt Template Stealing for Text-to-Image Models cs.CV | cs.AI

Xiaotian Zou

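TL;DR: This paper exposes a security vulnerability of prompt templates in multimodal large language model (MLLM) workflows and proposes RLStealer, a reinforcement-learning-based framework that efficiently steals prompt templates from a small number of example images, with strong performance at low cost.

Details

Motivation: With the adoption of text-to-image models and the rise of prompt trading markets, prompt template theft has become an under-studied security risk; the paper aims to expose and address it.

Result: On public benchmarks, RLStealer achieves state-of-the-art performance, cuts attack cost to under 13% of that of baseline methods, and generalizes across image styles.

Insight: The study highlights the security threat in prompt trading and lays the groundwork for developing protective standards for the future MLLM marketplace.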

Abstract: Multimodal Large Language Models (MLLMs) have transformed text-to-image workflows, allowing designers to create novel visual concepts with unprecedented speed. This progress has given rise to a thriving prompt trading market, where curated prompts that induce trademark styles are bought and sold. Although commercially attractive, prompt trading also introduces a largely unexamined security risk: the prompts themselves can be stolen. In this paper, we expose this vulnerability and present RLStealer, a reinforcement-learning-based prompt inversion framework that recovers a prompt template from only a small set of example images. RLStealer treats template stealing as a sequential decision-making problem and employs multiple similarity-based feedback signals as reward functions to effectively explore the prompt space. Comprehensive experiments on publicly available benchmarks demonstrate that RLStealer achieves state-of-the-art performance while reducing the total attack cost to under 13% of that required by existing baselines. Our further analysis confirms that RLStealer can effectively generalize across different image styles to efficiently steal unseen prompt templates. Our study highlights an urgent security threat inherent in prompt trading and lays the groundwork for developing protective standards in the emerging MLLM marketplace.


[7] Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations cs.CV | cs.AI

Sihao Ding, Santosh Vasa, Aditi Ramadwar

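TL;DR: The paper proposes EDCT, an automated verification method for checking whether explanations generated by vision-language models faithfully reflect the causal factors behind their predictions.

Details

Motivation: Explanations generated by vision-language models (VLMs) may sound plausible yet be unreliable, posing technical and governance risks, so a method is needed to verify their faithfulness.

Result: Across 120 OK-VQA examples and multiple VLMs, EDCT reveals substantial faithfulness gaps and produces regulator-aligned audit artifacts.

Insight: Explanations from current VLMs can have serious faithfulness problems; EDCT offers a practical automated verification tool that helps improve model transparency and trustworthiness.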

Abstract: Vision-Language Models (VLMs) often produce fluent Natural Language Explanations (NLEs) that sound convincing but may not reflect the causal factors driving predictions. This mismatch of plausibility and faithfulness poses technical and governance risks. We introduce Explanation-Driven Counterfactual Testing (EDCT), a fully automated verification procedure for a target VLM that treats the model’s own explanation as a falsifiable hypothesis. Given an image-question pair, EDCT: (1) obtains the model’s answer and NLE, (2) parses the NLE into testable visual concepts, (3) generates targeted counterfactual edits via generative inpainting, and (4) computes a Counterfactual Consistency Score (CCS) using LLM-assisted analysis of changes in both answers and explanations. Across 120 curated OK-VQA examples and multiple VLMs, EDCT uncovers substantial faithfulness gaps and provides regulator-aligned audit artifacts indicating when cited concepts fail causal tests.


[8] HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling cs.CV | cs.AI

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu

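TL;DR: The paper proposes HiDe, a training-free Hierarchical Decoupling Framework that uses Token-wise Attention Decoupling and Layout-Preserving Decoupling to address complex background interference in high-resolution MLLMs, setting a new SOTA.

Details

Motivation: Difficulty recognizing small objects in high-resolution images is usually attributed to perceptual limits, but the authors find that the real problem is complex background interference. Existing "zoom in" strategies work poorly, so a new method is needed to remove the interference and improve performance.

Result: HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K (e.g., Qwen2.5-VL 7B and InternVL3 8B reach 92.1% and 91.6% on V*Bench, respectively) and uses 75% less memory than the previous training-free approach.

Insight: Background interference, rather than object size as conventionally assumed, is the key factor limiting high-resolution MLLM performance; decoupling helps the model capture key information.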

Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use “zoom in” strategies for better detail, our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this “zoom in” operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://github.com/Tennine2077/HiDe.


[9] FSDENet: A Frequency and Spatial Domains based Detail Enhancement Network for Remote Sensing Semantic Segmentation cs.CV | cs.AI

Jiahao Fu, Yinfeng Yu, Liejun Wang

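TL;DR: FSDENet combines the frequency and spatial domains, using the FFT and wavelet transforms to enhance semantic segmentation of remote sensing images, and performs especially well at boundaries and in regions of grayscale variation.

Details

Motivation: To resolve the semantic edge ambiguity caused by grayscale variations (e.g., shadows and low-contrast regions) in remote sensing image segmentation.

Result: Achieves SOTA performance on four datasets: LoveDA, Vaihingen, Potsdam, and iSAID.

Insight: Combining global frequency-domain information with multi-scale spatial features improves the robustness of remote sensing image segmentation in complex scenes.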

Abstract: To fully leverage spatial information for remote sensing image segmentation and address semantic edge ambiguities caused by grayscale variations (e.g., shadows and low-contrast regions), we propose the Frequency and Spatial Domains based Detail Enhancement Network (FSDENet). Our framework employs spatial processing methods to extract rich multi-scale spatial features and fine-grained semantic details. By effectively integrating global and frequency-domain information through the Fast Fourier Transform (FFT) in global mappings, the model’s capability to discern global representations under grayscale variations is significantly strengthened. Additionally, we utilize Haar wavelet transform to decompose features into high- and low-frequency components, leveraging their distinct sensitivity to edge information to refine boundary segmentation. The model achieves dual-domain synergy by integrating spatial granularity with frequency-domain edge sensitivity, substantially improving segmentation accuracy in boundary regions and grayscale transition zones. Comprehensive experimental results demonstrate that FSDENet achieves state-of-the-art (SOTA) performance on four widely adopted datasets: LoveDA, Vaihingen, Potsdam, and iSAID.
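
The Haar decomposition mentioned in the abstract can be illustrated with a one-level 2D transform that splits an image into one low-frequency band and three high-frequency bands. This is a minimal pure-Python sketch, not FSDENet's implementation: the function name `haar_2d`, the sub-band names, and the toy image are assumptions made for illustration, and the network applies the transform to features rather than raw pixels.

```python
def haar_2d(image):
    """One-level orthonormal 2D Haar transform of an even-sized grayscale image.

    `image` is a list of rows. Returns four sub-bands at half resolution:
    approx (low-frequency), horiz_diff, vert_diff, diag_diff (high-frequency).
    """
    approx, horiz_diff, vert_diff, diag_diff = [], [], [], []
    for i in range(0, len(image), 2):
        rows = ([], [], [], [])
        for j in range(0, len(image[0]), 2):
            a, b = image[i][j], image[i][j + 1]          # top-left, top-right
            c, d = image[i + 1][j], image[i + 1][j + 1]  # bottom-left, bottom-right
            rows[0].append((a + b + c + d) / 2)  # local average (scaled)
            rows[1].append((a - b + c - d) / 2)  # left minus right: vertical edges
            rows[2].append((a + b - c - d) / 2)  # top minus bottom: horizontal edges
            rows[3].append((a - b - c + d) / 2)  # diagonal detail
        approx.append(rows[0])
        horiz_diff.append(rows[1])
        vert_diff.append(rows[2])
        diag_diff.append(rows[3])
    return approx, horiz_diff, vert_diff, diag_diff


# Toy image with a vertical edge between the first and second columns.
image = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
approx, horiz_diff, vert_diff, diag_diff = haar_2d(image)
print(approx, horiz_diff, vert_diff, diag_diff)
```

Only `horiz_diff` is non-zero at the block containing the edge, which shows why high-frequency sub-bands are a natural signal for refining boundaries.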


[10] Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving cs.CV | cs.AI | cs.RO

Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang

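TL;DR: The paper proposes Max-V1, a new end-to-end autonomous driving framework that recasts trajectory planning as next waypoint prediction and achieves high-performance trajectory prediction with a single-pass generative vision-language model.

Details

Motivation: Existing autonomous driving methods usually rely on multi-stage pipelines or complex model designs, which add computational burden and limit generalization. The paper aims for a lean yet effective framework that predicts trajectories directly through language-style generation, reducing complexity.

Result: Achieves SOTA performance on nuScenes, improving on baselines by over 30%, and generalizes well to cross-domain datasets.

Insight: Simplifying trajectory prediction through language-style generation can improve efficiency and generalization, offering a new research direction for autonomous driving.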

Abstract: In this work, we reconceptualize autonomous driving as a generalized language and formulate the trajectory planning task as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the VLM (Vision-Language Model) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework highly amenable to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. Due to these empirical strengths, this work introduces a model enabling fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.


[11] OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding cs.CV

Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu

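TL;DR: OIG-Bench is a benchmark for evaluating how well multimodal large language models (MLLMs) understand One-Image Guides, with a dataset built through a semi-automated, multi-agent collaborative annotation pipeline.

Details

Motivation: Although MLLMs show strong multimodal understanding, their ability to understand One-Image Guides, a visual format combining text, imagery, and symbols, has not been sufficiently studied, which motivates a dedicated evaluation tool.

Result: Qwen2.5-VL-72B performs best, with an overall accuracy of 77%, but all models show notable weaknesses in semantic understanding and logical reasoning. The multi-agent annotation system outperforms all MLLMs on image captioning.

Insight: 1. Current MLLMs still struggle with complex visual-text relationships; 2. multi-agent collaborative annotation offers an efficient tool for future dataset construction; 3. OIG-Bench provides a useful reference for improving MLLM understanding.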

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are a visual format that combines text, imagery, and symbols to present reorganized and structured information for easier comprehension; they are designed specifically for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships. We also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG-Bench.


[12] Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning cs.CV | cs.AI | cs.LG

Chenhui Xu, Fuxun Yu, Michael J. Bianco, Jacob Kovarskiy, Raphael Tang

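TL;DR: Geo-R1 is a post-training framework focused on geospatial reasoning that strengthens the geospatial reasoning of vision-language models (VLMs) by combining two stages, thinking scaffolding and elevating.

Details

Motivation: Existing VLMs perform poorly on geospatial reasoning tasks, and manually annotating reasoning data is costly. Geo-R1 aims to improve geospatial reasoning at low cost using automatically generated chain-of-thought data and reinforcement learning.

Result: Geo-R1 achieves state-of-the-art performance on multiple geospatial reasoning benchmarks, and the model is released on an open platform.

Insight: Geo-R1 shows that synthetic data and reinforcement learning can effectively improve geospatial reasoning while avoiding costly human annotation, offering a new approach for similar tasks.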

Abstract: We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a “geospatial thinking paradigm” via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal: teaching models to capture and reconcile features across modalities, and harnessing reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining / supervised finetuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at https://huggingface.co/miniHui/Geo-R1.
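
The elevating stage builds on GRPO, whose core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative advantage computation and is not taken from Geo-R1: the binary reward (1.0 when a hypothetical cross-view pairing prediction is correct) and the use of the population standard deviation are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps keeps the division finite when all rewards in the group are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one cross-view pairing query:
# 1.0 if the predicted pairing is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the advantage is relative to the group, a verifiable right-or-wrong reward is enough to produce a learning signal, with no separately trained value model.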


[13] Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks cs.CV | cs.LG

Hanjiang Hu, Bowei Li, Ziwei Wang, Tianhao Wei, Casidhe Hutchison

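TL;DR: This paper proposes enhancing the certifiable semantic robustness of deep neural networks through robust pruning; it analyzes neuron stability and variance and introduces a new metric and pruning strategy.

Details

Motivation: Deep neural networks are widely used in vision and robotics applications, but verifying their robustness against semantic transformation perturbations is hampered by over-parameterization, which hurts tightness and scalability.

Result: On a robust keypoint detection task under brightness and contrast perturbations, the method outperforms baselines in certified robustness and efficiency.

Insight: Pruning not only reduces over-parameterization but also improves semantic robustness by retaining highly robust neurons. The Wasserstein distance loss helps concentrate pruned neurons across layers.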

Abstract: Deep neural networks have been widely adopted in many vision and robotics applications with visual inputs. It is essential to verify their robustness against semantic transformation perturbations, such as brightness and contrast. However, current certified training and robustness certification methods face the challenge of over-parameterization, which hinders tightness and scalability due to the over-complicated neural networks. To this end, we first analyze the stability and variance of layers and neurons against input perturbation, showing that certifiable robustness can be indicated by a fundamental Unbiased and Smooth Neuron metric (USN). Based on USN, we introduce a novel neural network pruning method that removes neurons with low USN and retains those with high USN, thereby preserving model expressiveness without over-parameterization. To further enhance this pruning process, we propose a new Wasserstein distance loss to ensure that pruned neurons are more concentrated across layers. We validate our approach through extensive experiments on the challenging robust keypoint detection task, which involves realistic brightness and contrast perturbations, demonstrating that our method achieves superior robustness certification performance and efficiency compared to baselines.


[14] EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations cs.CV | cs.AI | cs.RO

Jiayi Liu, Jiaming Zhou, Ke Ye, Kun-Yu Lin, Allan Wang

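TL;DR: The paper introduces EgoTraj-Bench, the first real-world benchmark pairing noisy first-person observations with bird's-eye-view future trajectories, and proposes BiFlow, a dual-stream flow matching model (with the EgoAnchor mechanism) that markedly improves the robustness of trajectory prediction.

Details

Motivation: Existing trajectory prediction methods are usually trained on idealized observation histories and ignore the perceptual noise inherent in first-person views (such as occlusions, ID switches, and tracking drift), leaving models insufficiently robust in deployment.

Result: Experiments show that BiFlow reduces minADE and minFDE by 10-15% on average, achieving SOTA performance with higher robustness.

Insight: The paper underscores the importance of real-world first-person perceptual noise and shows that jointly denoising and forecasting can markedly improve the robustness of trajectory prediction models.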

Abstract: Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume idealized observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird’s-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion by leveraging a shared latent representation. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for developing trajectory forecasting systems truly resilient to the challenges of real-world, ego-centric perception.
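
The minADE and minFDE metrics reported above are standard in multi-modal trajectory forecasting: given K candidate futures, each metric keeps the error of the best candidate. The sketch below follows the common convention of taking the two minima independently; the function name and the toy trajectories are assumptions for illustration, and the benchmark's exact evaluation protocol (for example, the value of K) is not stated in the abstract.

```python
import math


def min_ade_fde(candidates, ground_truth):
    """Return (minADE, minFDE) over K candidate trajectories.

    candidates: list of K trajectories, each a list of (x, y) points.
    ground_truth: list of (x, y) points with the same length as each candidate.
    """
    ades, fdes = [], []
    for traj in candidates:
        dists = [math.dist(p, g) for p, g in zip(traj, ground_truth)]
        ades.append(sum(dists) / len(dists))  # average displacement error
        fdes.append(dists[-1])                # final displacement error
    return min(ades), min(fdes)


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away from the ground truth
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],  # only the last point is off
]
min_ade, min_fde = min_ade_fde(preds, gt)
print(min_ade, min_fde)
```

In this toy case the second candidate is best on both metrics, with an average error of one third of a unit and a final error of one unit.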


[15] David and Goliath in Medical Vision: Convolutional Networks vs Biomedical Vision Language Models cs.CV | cs.AI

Ran Tong, Jiaqi Liu, Su Liu, Jiexi Xu, Lanruo Wang

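TL;DR: This paper compares a lightweight supervised CNN with the zero-shot medical vision-language model BiomedCLIP on pneumonia and tuberculosis detection, and finds that with simple decision threshold calibration BiomedCLIP can surpass or approach the supervised CNN.

Details

Motivation: Accurate automated interpretation of medical images is critical. The paper examines the performance gap between supervised CNNs and zero-shot VLMs in medical image analysis, aiming to show how to realize the VLM's potential.

Result: After calibration, BiomedCLIP reaches an F1-score of 0.8841 on pneumonia detection (above the CNN's 0.8803) and improves from 0.4812 to 0.7684 on tuberculosis detection (close to the CNN's 0.7834).

Insight: Zero-shot VLMs hold great potential for medical tasks but need threshold calibration to realize it, which offers a simple and effective optimization direction for future work.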

Abstract: The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN’s 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline’s 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
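
Decision threshold calibration, as described in the abstract, amounts to sweeping candidate thresholds over validation scores and keeping the one that maximizes the target metric. The following is a minimal sketch of that idea, not the paper's code: the function names, the toy scores and labels, and the fixed 0.5 default used for comparison are all hypothetical.

```python
def f1_score(labels, preds):
    """Binary F1 from parallel lists of 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(scores, labels):
    """Pick the threshold (among observed scores) that maximizes validation F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:  # strict inequality keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical zero-shot "disease" scores on a validation set; all fall below 0.5,
# so a fixed 0.5 cut-off would never predict the positive class.
val_scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
val_labels = [0, 0, 1, 0, 1, 1]

threshold, val_f1 = calibrate_threshold(val_scores, val_labels)
default_f1 = f1_score(val_labels, [1 if s >= 0.5 else 0 for s in val_scores])
print(threshold, val_f1, default_f1)
```

The chosen threshold should then be frozen and applied to a separate test set, since tuning and reporting on the same split would overstate the gain.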


[16] PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents cs.CV

Zikang Liu, Junyi Li, Wayne Xin Zhao, Dawei Gao, Yaliang Li

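TL;DR: PAL-UI is a novel framework that addresses the memory limitations of vision-based GUI agents on long-horizon tasks by actively looking back at past observations, significantly improving performance on mobile GUI navigation.

Details

Motivation: GUI agents powered by multimodal large language models (MLLMs) face memory limitations on long-horizon tasks; conventional approaches either truncate history or rely on simple textual summaries, which can lose visual details critical for future decisions.

Result: Experiments show that PAL-UI significantly outperforms baselines and prior methods on mobile GUI navigation and exhibits strong cross-domain generalization without additional training.

Insight: Active memory retrieval can markedly improve the long-horizon planning ability of vision-based GUI agents.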

Abstract: Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) promise human-like interaction with software applications, yet long-horizon tasks remain challenging due to memory limitations. Existing approaches either truncate history or rely on simple textual summaries, which risk losing critical information when past visual details become necessary for future decisions. In this paper, we propose PAL-UI (Planning with Active Look-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool that allows the agent to recall specific historical screenshots during planning. We curate a step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories and train PAL-UI-3B and PAL-UI-7B models based on Qwen2.5-VL. Extensive experiments demonstrate that PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings. Moreover, PAL-UI exhibits strong cross-domain generalization, achieving notable improvements in web navigation without additional training. Our work highlights the potential of active memory retrieval for long-horizon planning capabilities of vision-based GUI agents.


[17] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration cs.CV

Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia

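TL;DR: BindWeave is a unified framework for subject-consistent video generation via cross-modal integration, using an MLLM-DiT architecture to tackle prompt parsing in multi-subject scenes.

Details

Motivation: Existing video generation models struggle to maintain subject consistency and complex spatial relationships in multi-subject scenes; BindWeave aims to fill this gap.

Result: On the OpenS2V benchmark, BindWeave outperforms existing open-source and commercial models in subject consistency, naturalness, and text relevance.

Insight: Cross-modal reasoning and disentanglement effectively improve the understanding of complex semantics in multi-subject video generation.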

Abstract: Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.


[18] VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors cs.CV

Atif Belal, Heitor R. Medeiros, Marco Pedersoli, Eric Granger

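TL;DR: This paper proposes VLOD-TTA, a test-time adaptation framework that improves vision-language object detectors (VLODs) under domain shift. With an IoU-weighted entropy objective and image-conditioned prompt selection, it significantly improves detection with YOLO-World and Grounding DINO.

Details

Motivation: VLODs excel at zero-shot recognition but degrade under domain shift. The paper addresses this with a test-time adaptation framework.

Result: Across diverse distribution shifts (stylized domains, driving scenes, low-light conditions, and common corruptions), VLOD-TTA significantly improves the performance of YOLO-World and Grounding DINO.

Insight: Test-time adaptation (TTA) can effectively mitigate the performance drop of VLODs in unseen domains, with spatial coherence and prompt compatibility being particularly important.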

Abstract: Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts – including stylized domains, driving scenes, low-light conditions, and common corruptions – shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
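
One plausible reading of the IoU-weighted entropy objective is that each proposal's prediction entropy is weighted by how much the proposal overlaps with others, so isolated boxes contribute little. The sketch below implements that reading only; the weighting by mean IoU with the other proposals, the normalization, and all names are assumptions, and the paper's actual formulation may differ.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def iou_weighted_entropy(boxes, class_probs):
    """Entropy averaged over proposals, weighted by each box's mean IoU with the rest."""
    n = len(boxes)
    weights = []
    for i in range(n):
        overlaps = [iou(boxes[i], boxes[j]) for j in range(n) if j != i]
        weights.append(sum(overlaps) / len(overlaps) if overlaps else 0.0)
    total = sum(weights)
    if total == 0:
        return 0.0  # no overlapping proposals, nothing to adapt on
    return sum(w * entropy(p) for w, p in zip(weights, class_probs)) / total


# Two overlapping proposals with uncertain predictions and one isolated, confident box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
loss = iou_weighted_entropy(boxes, probs)
print(loss)
```

In this toy case the isolated box receives zero weight, so the loss equals the entropy of the two clustered proposals; minimizing it at test time would sharpen predictions only where proposals agree spatially.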


[19] MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles cs.CVPDF

Yuheng Ji, Huajie Tan, Cheng Chi, Yijie Xu, Yuting Zhao

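TL;DR: MathSticks is a new benchmark for visual symbolic compositional reasoning (VSCR) that uses matchstick puzzles to test visual perception, symbolic manipulation, and arithmetic consistency. It includes text-guided and purely visual settings. An evaluation of 14 vision-language models finds that existing models perform poorly on many tasks while humans perform very well.

Details

Motivation: Current vision-language models are limited on complex compositional reasoning tasks, particularly those that require handling visual and symbolic information together. MathSticks aims to fill this gap with a systematic testbed.

Result: Among the 14 models, closed-source models handle only simple cases, open-source models do even worse in the purely visual setting, and humans exceed 90% accuracy.

Insight: MathSticks highlights the shortcomings of current models on compositional reasoning, especially on tasks that combine vision and symbols. Future models need stronger cross-modal integration.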

Abstract: We introduce MathSticks, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision–language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90% accuracy. These findings establish MathSticks as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at https://github.com/Yuheng2000/MathSticks.
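
To make the task concrete, here is a minimal sketch of checking a one-stick correction. It assumes the conventional seven-segment rendering of digits (segments a to g) and keeps operators fixed, whereas the benchmark also covers operator variation and two-stick moves; the names `SEGMENTS`, `holds`, and `is_one_move_fix` are hypothetical and not from the MathSticks code.

```python
# Conventional seven-segment layout (an assumption about the digit rendering):
# a = top, b = top right, c = bottom right, d = bottom,
# e = bottom left, f = top left, g = middle.
SEGMENTS = {
    "0": "abcdef", "1": "bc", "2": "abdeg", "3": "abcdg", "4": "bcfg",
    "5": "acdfg", "6": "acdefg", "7": "abc", "8": "abcdefg", "9": "abcdfg",
}

def holds(equation):
    """Check arithmetic consistency of an equation like '6+3=9'."""
    lhs, rhs = equation.split("=")
    for op in "+-":
        if op in lhs:
            a, b = lhs.split(op)
            result = int(a) + int(b) if op == "+" else int(a) - int(b)
            return result == int(rhs)
    return int(lhs) == int(rhs)

def is_one_move_fix(source, candidate):
    """True if candidate is a valid equation reachable by moving one stick.

    Conservation rule: exactly one segment is removed somewhere and exactly
    one is added somewhere (possibly in a different digit).
    """
    if len(source) != len(candidate):
        return False
    removed = added = 0
    for s, c in zip(source, candidate):
        if s in SEGMENTS and c in SEGMENTS:
            before, after = set(SEGMENTS[s]), set(SEGMENTS[c])
            removed += len(before - after)
            added += len(after - before)
        elif s != c:
            return False  # operators are held fixed in this sketch
    return removed == 1 and added == 1 and holds(candidate)
```

For example, `6+4=4` becomes `0+4=4` by moving the middle stick of the 6 to its top-right position, and `6+9=3` becomes `6+3=9` by moving one stick between two digits.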


[20] Normal-Abnormal Guided Generalist Anomaly Detection cs.CV | cs.AIPDF

Yuexin Wang, Xiaolei Wang, Yizheng Gong, Jimin Xiao

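TL;DR: This paper proposes NAGL, a new framework that uses both normal and anomalous samples as references to improve generalist anomaly detection (GAD).

Details

Motivation: Existing GAD methods rely only on normal samples as references and overlook the valuable information in anomalous samples that are often available in practice, which limits cross-domain anomaly detection performance.

Result: On multiple benchmarks, the method significantly outperforms existing GAD methods.

Insight: Adding anomalous samples enriches the reference information and improves the accuracy and efficiency of cross-domain anomaly detection, opening a new direction for GAD.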

Abstract: Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work represents the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at https://github.com/JasonKyng/NAGL.


[21] Affordance-Guided Diffusion Prior for 3D Hand Reconstruction cs.CVPDF

Naru Suzuki, Takehiko Ohkawa, Tatsuro Banno, Jihyun Lee, Ryosuke Furuta

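TL;DR: The paper proposes an affordance-guided diffusion prior for reconstructing 3D hand poses under severe occlusion, using textual descriptions of hand-object interactions to generate more accurate poses.

Details

Motivation: Under severe occlusion, conventional methods struggle to reconstruct 3D hand poses accurately. Humans resolve such ambiguities by using an object's function and shape (its affordances); inspired by this, the paper proposes a generative prior that incorporates affordances.

Result: On the HOGraspNet dataset, affordance-guided refinement significantly outperforms recent regression methods and diffusion-based refinement that lacks contextual reasoning.

Insight: Affordance-based contextual information can significantly improve generated 3D hand poses under occlusion, showing the potential of generative models for pose reconstruction.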

Abstract: How can we reconstruct 3D hand poses when large portions of the hand are heavily occluded by itself or by objects? Humans often resolve such ambiguities by leveraging contextual knowledge – such as affordances, where an object’s shape and function suggest how the object is typically grasped. Inspired by this observation, we propose a generative prior for hand pose refinement guided by affordance-aware textual descriptions of hand-object interactions (HOI). Our method employs a diffusion-based generative model that learns the distribution of plausible hand poses conditioned on affordance descriptions, which are inferred from a large vision-language model (VLM). This enables the refinement of occluded regions into more accurate and functionally coherent hand poses. Extensive experiments on HOGraspNet, a 3D hand-affordance dataset with severe occlusions, demonstrate that our affordance-guided refinement significantly improves hand pose estimation over both recent regression methods and diffusion-based refinement lacking contextual reasoning.


[22] Efficient Multi-modal Large Language Models via Progressive Consistency Distillation cs.CVPDF

Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang

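TL;DR: EPIC is a progressive consistency distillation framework that decomposes the feature-space perturbations caused by visual token compression and introduces token consistency and layer consistency distillation, improving the efficiency and training of multimodal large models under token compression.

Details

Motivation: Visual tokens consume substantial compute in multimodal large models. Existing methods compress tokens to improve efficiency but overlook the feature-space perturbations and increased training difficulty that compression causes.

Result: Experiments show that EPIC is effective, robust, and generalizes well.

Insight: Decomposing the perturbations and following a progressive learning strategy can significantly reduce training difficulty and improve the model's adaptation to token compression.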

Abstract: Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model’s parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.


[23] Assessing Foundation Models for Mold Colony Detection with Limited Training Data cs.CVPDF

Henrik Pichler, Janis Keuper, Matthew Copping

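TL;DR: This paper studies how foundation models such as MaskDINO perform on mold colony detection with limited training data, and finds that they can compete with traditional methods such as YoloV9 using only a small amount of annotated data.

Details

Motivation: Mold colony detection in microbiology usually depends on large annotated datasets, which are time-consuming and labor-intensive to produce. The paper explores whether foundation models can match traditional methods with far fewer annotations.

Result: Fine-tuned on only 150 images, MaskDINO approaches the performance of an extensively trained YoloV9 model, and with 25 images it remains reliable on about 70% of samples.

Insight: Foundation models such as MaskDINO perform well in few-shot settings and can substantially reduce annotation needs, speeding up the development and iteration of automated microbiological systems. This supports their adoption in practice.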

Abstract: The process of quantifying mold colonies on Petri dish samples is of critical importance for the assessment of indoor air quality, as high colony counts can indicate potential health risks and deficiencies in ventilation systems. Conventionally, the automation of such a labor-intensive process, as well as other tasks in microbiology, relies on the manual annotation of large datasets and the subsequent extensive training of models like YoloV9. To demonstrate that exhaustive annotation is no longer a prerequisite when tackling a new vision task, we compile a representative dataset of 5000 Petri dish images annotated with bounding boxes, simulating both a traditional data collection approach as well as few-shot and low-shot scenarios with well curated subsets with instance level masks. We benchmark three vision foundation models against traditional baselines on task specific metrics, reflecting realistic real-world requirements. Notably, MaskDINO attains near-parity with an extensively trained YoloV9 model while fine-tuned on only 150 images, retaining competitive performance with as few as 25 images, still being reliable on approximately 70% of the samples. Our results show that data-efficient foundation models can match traditional approaches with only a fraction of the required data, enabling earlier development and faster iterative improvement of automated microbiological systems with a higher upper-bound performance than traditional models would achieve.


[24] Arbitrary Generative Video Interpolation cs.CVPDF

Guozhen Zhang, Haiguang Wang, Chunyu Wang, Yuan Zhou, Qinglin Lu

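TL;DR: The paper proposes ArbInterp, a flexible generative video frame interpolation framework that supports interpolation at any timestamp and of any length, addressing the limits of existing methods in adjusting frame rate and sequence duration.

Details

Motivation: Existing generative video frame interpolation methods can only generate a fixed number of intermediate frames and cannot flexibly adjust the frame rate or total duration, which limits their usefulness across varied practical applications.

Result: Experiments show that ArbInterp outperforms prior methods on multi-scale frame interpolation (2x to 32x), with higher fidelity and more seamless spatiotemporal continuity.

Insight: Decoupling appearance and motion, combined with segment-wise generation, enables more flexible and higher-quality video frame interpolation.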

Abstract: Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesize a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2x to 32x) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity. Project website: https://mcg-nju.github.io/ArbInterp-Web/.
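
The idea of aligning rotary positions with normalized timestamps can be sketched as follows. This is not the paper's TaRoPE implementation: it is a generic rotary position embedding evaluated at fractional positions, under the assumption that a timestamp t in [0, 1] maps linearly to position t * span between the start and end frames. The names `rope_rotate`, `timestamp_positions`, and `span` are hypothetical.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Apply a rotary embedding to a feature vector at a (possibly fractional) position.

    Consecutive feature pairs (x[2i], x[2i+1]) are rotated by position * inv_freq[i].
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]  # must be even
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angle = position * inv_freq
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def timestamp_positions(timestamps, span):
    """Map normalized timestamps in [0, 1] to continuous temporal positions (assumed linear)."""
    return np.asarray(timestamps, dtype=float) * span
```

Because rotary embeddings make attention scores depend only on position differences, frames placed at fractional positions keep a consistent temporal relationship to the endpoint frames, which is one plausible reason a timestamp-aware variant allows arbitrary interpolation factors.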


[25] Color Models in Image Processing: A Review and Experimental Comparison cs.CVPDF

Muragul Muratbekova, Nuray Toganas, Ayan Igali, Maksat Shagyrov, Elnara Kadyrgali

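TL;DR: This paper reviews color models used in image processing and compares them experimentally. It finds that the HS* family of color models aligns best with human visual perception, and identifies limitations of existing models and directions for future research.

Details

Motivation: Color representation is essential in computer vision and human-computer interaction, and the choice of color model strongly affects application results. The paper aims to provide a comprehensive review and experimental evaluation to help researchers understand and choose suitable color models.

Result: The HS* family performs best in the experiments and matches human visual perception most closely. The experiments also reveal limitations of existing models, such as device dependency and chromatic consistency issues.

Insight: HS* models are a preferred choice for color processing because of their consistency with human perception. Future work needs to further address device dependency and computational efficiency.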

Abstract: Color representation is essential in computer vision and human-computer interaction. There are multiple color models available. The choice of a suitable color model is critical for various applications. This paper presents a review of color models and spaces, analyzing their theoretical foundations, computational properties, and practical applications. We explore traditional models such as RGB, CMYK, and YUV, perceptually uniform spaces like CIELAB and CIELUV, and fuzzy-based approaches as well. Additionally, we conduct a series of experiments to evaluate color models from various perspectives, like device dependency, chromatic consistency, and computational complexity. Our experimental results reveal gaps in existing color models and show that the HS* family is the most aligned with human perception. The review also identifies key strengths and limitations of different models and outlines open challenges and future directions. This study provides a reference for researchers in image processing, perceptual computing, digital media, and any other color-related field.
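
As a small illustration of why hue-based models separate "which color" from "how bright", the sketch below converts RGB values with Python's standard `colorsys` module. It does not reproduce the paper's experiments; the helper name `describe` is hypothetical.

```python
import colorsys

def describe(rgb):
    """Return hue (degrees), HSV saturation and value, and HSL lightness for an RGB triple in [0, 1]."""
    r, g, b = rgb
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    _, lightness, _ = colorsys.rgb_to_hls(r, g, b)  # colorsys returns (h, l, s)
    return {
        "hue_deg": round(h * 360, 1),
        "saturation": round(s, 3),
        "value": round(v, 3),
        "lightness": round(lightness, 3),
    }

bright_red = describe((1.0, 0.0, 0.0))
dark_red = describe((0.5, 0.0, 0.0))
```

Both reds share the same hue and saturation and differ only in the value and lightness channels, whereas in RGB the same change alters the red channel itself.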


[26] Multi-level Dynamic Style Transfer for NeRFs cs.CVPDF

Zesheng Li, Shuaibo Li, Wei Ma, Jianwei Guo, Hongbin Zha

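TL;DR: MDS-NeRF is a multi-level dynamic style transfer method that redesigns the NeRF pipeline for stylization and improves 3D style transfer with a dynamic style injection module and a multi-level feature adaptor.

Details

Motivation: Existing NeRF style transfer methods usually integrate style statistics into the original NeRF pipeline, which leads to suboptimal content preservation and artistic stylization, so a more effective approach is needed.

Result: Experiments show that MDS-NeRF performs strongly on 3D style transfer, preserving multi-scale spatial structure while effectively transferring style characteristics.

Insight: By redesigning the NeRF pipeline and introducing dynamic style injection, MDS-NeRF significantly improves the quality and flexibility of style transfer.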

Abstract: As the application of neural radiance fields (NeRFs) in various 3D vision tasks continues to expand, numerous NeRF-based style transfer techniques have been developed. However, existing methods typically integrate style statistics into the original NeRF pipeline, often leading to suboptimal results in both content preservation and artistic stylization. In this paper, we present multi-level dynamic style transfer for NeRFs (MDS-NeRF), a novel approach that reengineers the NeRF pipeline specifically for stylization and incorporates an innovative dynamic style injection module. Particularly, we propose a multi-level feature adaptor that helps generate a multi-level feature grid representation from the content radiance field, effectively capturing the multi-scale spatial structure of the scene. In addition, we present a dynamic style injection module that learns to extract relevant style features and adaptively integrates them into the content patterns. The stylized multi-level features are then transformed into the final stylized view through our proposed multi-level cascade decoder. Furthermore, we extend our 3D style transfer method to support omni-view style transfer using 3D style references. Extensive experiments demonstrate that MDS-NeRF achieves outstanding performance for 3D style transfer, preserving multi-scale spatial structures while effectively transferring stylistic characteristics.


[27] LVLMs as inspectors: an agentic framework for category-level structural defect annotation cs.CVPDF

Sheng Jiang, Yuanmin Ning, Bingxi Huang, Peiyin Chen, Zhaohui Chen

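TL;DR: The paper proposes ADPT, an autonomous defect annotation framework based on large vision-language models (LVLMs). Using semantic pattern matching and an iterative self-questioning refinement mechanism, it produces high-quality structural defect annotation datasets without manual supervision.

Details

Motivation: Manual annotation of structural defects is costly and inefficient, so an automated, efficient, low-cost method is needed.

Result: Experiments show that ADPT reaches up to 98% accuracy in distinguishing defective from non-defective images, 85%-98% annotation accuracy across four defect categories under class-balanced settings, and 80%-92% accuracy on class-imbalanced datasets.

Insight: ADPT offers a scalable and cost-effective solution for building high-fidelity structural defect datasets, supporting downstream tasks such as transfer learning and domain adaptation.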

Abstract: Automated structural defect annotation is essential for ensuring infrastructure safety while minimizing the high costs and inefficiencies of manual labeling. A novel agentic annotation framework, Agent-based Defect Pattern Tagger (ADPT), is introduced that integrates Large Vision-Language Models (LVLMs) with a semantic pattern matching module and an iterative self-questioning refinement mechanism. By leveraging optimized domain-specific prompting and a recursive verification process, ADPT transforms raw visual data into high-quality, semantically labeled defect datasets without any manual supervision. Experimental results demonstrate that ADPT achieves up to 98% accuracy in distinguishing defective from non-defective images, and 85%-98% annotation accuracy across four defect categories under class-balanced settings, with 80%-92% accuracy on class-imbalanced datasets. The framework offers a scalable and cost-effective solution for high-fidelity dataset construction, providing strong support for downstream tasks such as transfer learning and domain adaptation in structural damage assessment.


[28] Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation cs.CVPDF

Yunbo Xu, Xuesong Zhang, Jia Li, Zhenzhen Hu, Richang Hong

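TL;DR: The paper proposes COFA, an online feature augmentation strategy that disentangles foreground and background features to improve vision-language navigation (VLN), and experiments demonstrate its effectiveness.

Details

Motivation: In vision-language navigation, the foreground provides semantic cues and the background carries spatial connectivity information, but current methods have not fully explored using the two separately.

Result: On the REVERIE and R2R datasets, COFA significantly improves the generalization of baseline models and reaches state-of-the-art performance.

Insight: Disentangling and dynamically augmenting foreground and background features is key to improving VLN performance.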

Abstract: Following language instructions, vision-language navigation (VLN) agents are tasked with navigating unseen environments. While augmenting multifaceted visual representations has propelled advancements in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background encompasses spatial connectivity information. Inspired by this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) with alternative foreground and background features to facilitate navigable generalization. Specifically, we first leverage semantically-enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of the baseline and attains state-of-the-art performance.


[29] Robust Context-Aware Object Recognition cs.CVPDF

Klara Janouskova, Cristian Gavrus, Jiri Matas

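TL;DR: The paper proposes RCOR, an approach that jointly achieves robustness and context-awareness. It treats localization as part of recognition, decouples object-centric and context-aware modeling, and combines them with non-parametric fusion to improve model performance.

Details

Motivation: Standard supervised learning tends to make models over-rely on background information (known as shortcut learning), which limits robustness in real-world deployment. Existing methods usually address this by suppressing the background, at the cost of context information.

Result: Without fine-tuning, RCOR significantly improves model performance on datasets such as ImageNet-1k, and it stands out in complex scenes.

Insight: Localization can serve as a key aid to recognition. Decoupled modeling and non-parametric fusion make it possible to use both object-centric and contextual information, improving robustness and generalization.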

Abstract: In visual recognition, both the object of interest (referred to as foreground, FG, for simplicity) and its surrounding context (background, BG) play an important role. However, standard supervised learning often leads to unintended over-reliance on the BG, known as shortcut learning of spurious correlations, limiting model robustness in real-world deployment settings. In the literature, the problem is mainly addressed by suppressing the BG, sacrificing context information for improved generalization. We propose RCOR – Robust Context-Aware Object Recognition – the first approach that jointly achieves robustness and context-awareness without compromising either. RCOR treats localization as an integral part of recognition to decouple object-centric and context-aware modelling, followed by a robust, non-parametric fusion. It improves the performance of both supervised models and VLM on datasets with both in-domain and out-of-domain BG, even without fine-tuning. The results confirm that localization before recognition is now possible even in complex scenes as in ImageNet-1k.


[30] UCD: Unconditional Discriminator Promotes Nash Equilibrium in GANs cs.CVPDF

Mengfei Xia, Nan Xue, Jiapeng Zhu, Yujun Shen

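TL;DR: The paper proposes an unconditional discriminator (UCD) that removes the condition input from the discriminator so that it extracts more comprehensive features, which promotes Nash equilibrium in GAN training and significantly improves generation quality.

Details

Motivation: In practice, GAN training struggles to converge and often falls into mode collapse, because the condition input to the discriminator (D) introduces redundant shortcuts that hinder meaningful knowledge extraction.

Result: On ImageNet-64, UCD achieves an FID of 1.47, surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models.

Insight: Removing the condition input from the discriminator can significantly improve GAN training, avoiding mode collapse and promoting Nash equilibrium, and offers a new direction for improving GANs.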

Abstract: Adversarial training turns out to be the key to one-step generation, especially for Generative Adversarial Network (GAN) and diffusion model distillation. Yet in practice, GAN training hardly converges properly and struggles with mode collapse. In this work, we quantitatively analyze the extent of Nash equilibrium in GAN training, and conclude that redundant shortcuts created by inputting the condition in $D$ disable meaningful knowledge extraction. We thereby propose to employ an unconditional discriminator (UCD), in which $D$ is enforced to extract more comprehensive and robust features with no condition injection. In this way, $D$ is able to leverage better knowledge to supervise $G$, which promotes Nash equilibrium in GAN literature. Theoretical guarantee on compatibility with vanilla GAN theory indicates that UCD can be implemented in a plug-in manner. Extensive experiments confirm the significant performance improvements with high efficiency. For instance, we achieved 1.47 FID on the ImageNet-64 dataset, surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models. The code will be made publicly available.


[31] Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset cs.CV | cs.LGPDF

Yannick Hauri, Luca A. Lanzendörfer, Till Aczel

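TL;DR: The paper introduces the task of virtual fashion photo-shoots, which transforms standardized garment images into contextually rich editorial imagery, and builds a large-scale dataset of garment-lookbook pairs.

Details

Motivation: Traditional fashion image generation tasks such as virtual try-on are confined to simple scenes and cannot capture the dynamism and storytelling of editorial fashion imagery. The paper aims to fill this gap with a new task and dataset.

Result: The dataset has three quality levels (high, medium, and low) with 10,000, 50,000, and 300,000 pairs respectively, providing rich material for model training.

Insight: The dataset supports traditional tasks and can also drive more creative, narrative fashion image generation.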

Abstract: Fashion image generation has so far focused on narrow tasks such as virtual try-on, where garments appear in clean studio environments. In contrast, editorial fashion presents garments through dynamic poses, diverse locations, and carefully crafted visual narratives. We introduce the task of virtual fashion photo-shoot, which seeks to capture this richness by transforming standardized garment images into contextually grounded editorial imagery. To enable this new direction, we construct the first large-scale dataset of garment-lookbook pairs, bridging the gap between e-commerce and fashion media. Because such pairs are not readily available, we design an automated retrieval pipeline that aligns garments across domains, combining visual-language reasoning with object-level localization. We construct a dataset with three garment-lookbook pair accuracy levels: high quality (10,000 pairs), medium quality (50,000 pairs), and low quality (300,000 pairs). This dataset offers a foundation for models that move beyond catalog-style generation and toward fashion imagery that reflects creativity, atmosphere, and storytelling.


[32] Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack cs.CVPDF

Nanxiang Jiang, Zhaoxin Fan, Enhan Kang, Daiheng Gao, Yun Zhou

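TL;DR: This paper proposes ReFlux, a concept attack method for the latest rectified flow-based text-to-image (T2I) framework, designed to assess the robustness of concept erasure strategies.

Details

Motivation: Current T2I diffusion models raise safety concerns because they can generate harmful content. Existing concept erasure methods have limited effectiveness when applied to next-generation rectified flow transformers such as Flux. The paper aims to address this.

Result: Experiments show that ReFlux effectively evaluates the robustness of concept erasure strategies, providing a reliable benchmark for related research.

Insight: Existing concept erasure techniques applied to rectified flow transformers rely on attention localization, which is the source of their limitations; an attack that targets this phenomenon is markedly more effective.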

Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, most existing approaches and corresponding attack evaluations are tailored to Stable Diffusion (SD) and exhibit limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. In this work, we present ReFlux, the first concept attack method specifically designed to assess the robustness of concept erasure in the latest rectified flow-based T2I framework. Our approach is motivated by the observation that existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization. Building on this insight, we propose a simple yet effective attack strategy that specifically targets this property. At its core, a reverse-attention optimization strategy is introduced to effectively reactivate suppressed signals while stabilizing attention. This is further reinforced by a velocity-guided dynamic that enhances the robustness of concept reactivation by steering the flow matching process, and a consistency-preserving objective that maintains the global layout and preserves unrelated content. Extensive experiments consistently demonstrate the effectiveness and efficiency of the proposed attack method, establishing a reliable benchmark for evaluating the robustness of concept erasure strategies in rectified flow transformers.


[33] OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding cs.CVPDF

Jieer Ouyang, Xiaoneng Xiang, Zheng Wang, Yangkai Ding

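TL;DR: OTTER is an open-set, multimodal multi-label tagging framework that combines the strengths of predefined categories and user-driven open tags, using a multi-head attention architecture for dynamic and semantically consistent tagging.

Details

Motivation: Current multi-label tagging methods usually perform well on predefined labels but lack flexibility for open labels. OTTER aims to combine closed-set stability with open-vocabulary flexibility to meet the needs of multimodal tagging.

Result: OTTER performs strongly on two benchmark datasets, with overall F1 scores of 0.81 and 0.75 and near-perfect F1 on open-set labels (0.99 and 0.97), while remaining competitive on predefined labels.

Insight: OTTER shows how to balance closed-set stability with open-vocabulary flexibility in multimodal tagging, offering a new approach to dynamic tag generation.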

Abstract: We introduce OTTER, a unified open-set multi-label tagging framework that harmonizes the stability of a curated, predefined category set with the adaptability of user-driven open tags. OTTER is built upon a large-scale, hierarchically organized multi-modal dataset, collected from diverse online repositories and annotated through a hybrid pipeline combining automated vision-language labeling with human refinement. By leveraging a multi-head attention architecture, OTTER jointly aligns visual and textual representations with both fixed and open-set label embeddings, enabling dynamic and semantically consistent tagging. OTTER consistently outperforms competitive baselines on two benchmark datasets: it achieves an overall F1 score of 0.81 on Otter and 0.75 on Favorite, surpassing the next-best results by margins of 0.10 and 0.02, respectively. OTTER attains near-perfect performance on open-set labels, with F1 of 0.99 on Otter and 0.97 on Favorite, while maintaining competitive accuracy on predefined labels. These results demonstrate OTTER’s effectiveness in bridging closed-set consistency with open-vocabulary flexibility for multi-modal tagging applications.


[34] Beyond one-hot encoding? Journey into compact encoding for large multi-class segmentation cs.CV | eess.IVPDF

Aaron Kujawa, Thomas Booth, Tom Vercauteren

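TL;DR: The paper proposes binary encoding as a replacement for one-hot encoding to reduce the compute and memory requirements of large multi-class segmentation, but on a medical image segmentation task its performance still falls short of the state of the art.

Details

Motivation: The compute and memory requirements of one-hot encoding grow linearly with the number of classes, which becomes costly when there are many classes, so the paper explores more compact encodings to reduce resource use.

Result: Binary encodings perform worse (DSC 39.3 to 73.8) than one-hot encoding (DSC 82.4), but improve computational efficiency.

Insight: Binary encoding reduces resource requirements, but its segmentation performance on medical images still needs improvement. The paper reports these negative results to encourage future research.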

Abstract: This work presents novel methods to reduce computational and memory requirements for medical image segmentation with a large number of classes. We curiously observe challenges in maintaining state-of-the-art segmentation performance with all of the explored options. Standard learning-based methods typically employ one-hot encoding of class labels. The computational complexity and memory requirements thus increase linearly with the number of classes. We propose a family of binary encoding approaches instead of one-hot encoding to reduce the computational complexity and memory requirements to logarithmic in the number of classes. In addition to vanilla binary encoding, we investigate the effects of error-correcting output codes (ECOCs), class weighting, hard/soft decoding, class-to-codeword assignment, and label embedding trees. We apply the methods to the use case of whole brain parcellation with 108 classes based on 3D MRI images. While binary encodings have proven efficient in so-called extreme classification problems in computer vision, we faced challenges in reaching state-of-the-art segmentation quality with binary encodings. Compared to one-hot encoding (Dice Similarity Coefficient (DSC) = 82.4 (2.8)), we report reduced segmentation performance with the binary segmentation approaches, achieving DSCs in the range from 39.3 to 73.8. Informative negative results all too often go unpublished. We hope that this work inspires future research of compact encoding strategies for large multi-class segmentation tasks.
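
The logarithmic saving can be made concrete with a short sketch of vanilla binary label encoding and hard decoding. The bit order, the 0.5 threshold, and the function names are assumptions for illustration; the paper's ECOC, soft decoding, and codeword assignment variants are not shown.

```python
import math
import numpy as np

def num_bits(num_classes):
    """Output channels needed for a plain binary code."""
    return max(1, math.ceil(math.log2(num_classes)))

def encode(labels, num_classes):
    """Map integer class labels to binary codewords (most significant bit first)."""
    labels = np.asarray(labels, dtype=np.int64)
    shifts = np.arange(num_bits(num_classes) - 1, -1, -1)
    return ((labels[..., None] >> shifts) & 1).astype(np.float32)

def hard_decode(bit_probs):
    """Threshold per-bit probabilities at 0.5 and read the bits back as a class index."""
    bits = (np.asarray(bit_probs) >= 0.5).astype(np.int64)
    weights = 1 << np.arange(bits.shape[-1] - 1, -1, -1)
    return (bits * weights).sum(axis=-1)
```

For the 108-class parcellation task in the abstract, this needs 7 output channels instead of 108. Seven bits give 128 codewords, so 20 of them correspond to no class, and a single wrong bit can flip the prediction to an unrelated or invalid class.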


[35] Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation cs.CVPDF

Jinchang Zhang, Zijun Li, Jiakai Lin, Guoyu Lu

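TL;DR: The paper proposes open-vocabulary object detection for event cameras via vision-language knowledge distillation, combined with a hybrid SNN-CNN framework that adaptively slices the event stream and preserves key temporal information.

Details

Motivation: Event cameras offer high speed and low latency but lack texture and color information, which makes open-vocabulary object detection challenging. Existing methods generalize poorly to novel objects, and vision-language models such as CLIP cannot be applied directly to event data.

Result: The method effectively addresses open-vocabulary detection on event data, avoids the information loss caused by fixed-group segmentation, and generalizes well to novel objects.

Insight: With knowledge distillation and adaptive event slicing, effective open-vocabulary object detection is possible on event streams that lack color information, while key temporal dynamics are preserved.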

Abstract: Event cameras offer advantages in object detection tasks due to high-speed response, low latency, and robustness to motion blur. However, event cameras lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined categories, limiting their ability to generalize to novel objects, where encountering previously unseen objects is common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework that leverages CLIP’s semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as inputs to a teacher model, guiding the event-based student model to learn CLIP’s rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs while inheriting CLIP’s broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid spiking neural network (SNN) and convolutional neural network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.


[36] ProtoMask: Segmentation-Guided Prototype Learning cs.CVPDF

Steffen Meinert, Philipp Schlinge, Nils Strodthoff, Martin Atzmueller

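TL;DR: ProtoMask is a segmentation-guided prototype learning method that uses segmentation masks to restrict saliency maps to semantic regions, improving the truthfulness of the mapping between prototypes and input space.

Details

Motivation: Existing prototype-based methods usually rely on post-hoc saliency techniques to explain prototype semantics, but the reliability and quality of these techniques have been questioned. ProtoMask aims to reduce visualization uncertainty using segmentation foundation models.

Result: On three fine-grained classification datasets, ProtoMask shows competitive performance against other popular models along with distinctive explainability features.

Insight: Introducing segmentation not only improves interpretability but can also strengthen prototype learning, offering a new approach for XAI.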

Abstract: XAI has gained considerable importance in recent years. Methods based on prototypical case-based reasoning have shown a promising improvement in explainability. However, these methods typically rely on additional post-hoc saliency techniques to explain the semantics of learned prototypes. Multiple critiques have been raised about the reliability and quality of such techniques. For this reason, we study the use of prominent image segmentation foundation models to improve the truthfulness of the mapping between embedding and input space. We aim to restrict the computation area of the saliency map to a predefined semantic image patch to reduce the uncertainty of such visualizations. To perceive the information of an entire image, we use the bounding box from each generated segmentation mask to crop the image. Each mask results in an individual input in our novel model architecture named ProtoMask. We conduct experiments on three popular fine-grained classification datasets with a wide set of metrics, providing a detailed overview of explainability characteristics. The comparison with other popular models demonstrates competitive performance and unique explainability features of our model. https://github.com/uos-sis/quanproto


[37] Graph Integrated Multimodal Concept Bottleneck Model cs.CVPDF

Jiakai Lin, Jinchang Zhang, Guoyu Lu

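TL;DR: MoE-SGT is a multimodal concept bottleneck model that combines graph structure with a Mixture of Experts (MoE) module, improving performance and interpretability by explicitly modeling relationships among concepts and dynamically allocating reasoning tasks.

Details

Motivation: Existing concept bottleneck models (CBMs) are usually single-modal and ignore structured relationships among concepts, which limits their performance on complex reasoning tasks.

Result: MoE-SGT achieves higher accuracy than other concept bottleneck networks on multiple datasets.

Insight: Combining graph structure with dynamic task allocation can significantly improve a model's complex reasoning ability and adaptability.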

Abstract: With growing demand for interpretability in deep learning, especially in high stakes domains, Concept Bottleneck Models (CBMs) address this by inserting human understandable concepts into the prediction pipeline, but they are generally single modal and ignore structured concept relationships. To overcome these limitations, we present MoE-SGT, a reasoning driven framework that augments CBMs with a structure injecting Graph Transformer and a Mixture of Experts (MoE) module. We construct answer-concept and answer-question graphs for multimodal inputs to explicitly model the structured relationships among concepts. Subsequently, we integrate Graph Transformer to capture multi level dependencies, addressing the limitations of traditional Concept Bottleneck Models in modeling concept interactions. However, it still encounters bottlenecks in adapting to complex concept patterns. Therefore, we replace the feed forward layers with a Mixture of Experts (MoE) module, enabling the model to have greater capacity in learning diverse concept relationships while dynamically allocating reasoning tasks to different sub experts, thereby significantly enhancing the model’s adaptability to complex concept reasoning. MoE-SGT achieves higher accuracy than other concept bottleneck networks on multiple datasets by modeling structured relationships among concepts and utilizing a dynamic expert selection mechanism.


[38] Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs cs.CVPDF

Sanghwan Kim, Rui Xiao, Stephan Alaniz, Yongqin Xian, Zeynep Akata

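TL;DR: This paper proposes a training-free framework that uses the intrinsic uncertainty of a multimodal large language model (MLLM) as a guidance signal to improve performance on complex visual tasks. By scoring candidate visual inputs by response uncertainty, the model can autonomously focus on the most salient data.

Details

Motivation: Existing MLLMs perform poorly on fine-grained perception tasks (such as identifying small objects in high-resolution images or locating key moments in long videos) and typically need complicated task-specific fine-tuning, which limits generalization and increases model complexity.

Result: Experiments show that on three complex visual tasks the method performs on par with specially fine-tuned methods, validating the generality of using intrinsic uncertainty to improve multimodal performance.

Insight: Changes in the model's output entropy serve as an effective indicator of how relevant visual information is, and can significantly improve MLLM performance on fine-grained tasks without additional training.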

Abstract: Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or finding key moments in long videos. Existing works typically rely on complicated, task-specific fine-tuning, which limits their generalizability and increases model complexity. In this work, we propose an effective, training-free framework that uses an MLLM’s intrinsic uncertainty as a proactive guidance signal. Our core insight is that a model’s output entropy decreases when presented with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data. We apply this simple principle to three complex visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned methods. Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
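The scoring idea in the abstract (lower output entropy suggests more relevant visual input) can be illustrated with a minimal sketch. The function names and the example distributions below are hypothetical; the paper's actual scoring mechanism may differ in how the response distribution is obtained and aggregated.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_most_relevant(candidate_distributions):
    """Return (index, scores): the candidate with the lowest response entropy.

    Each entry is assumed to be the model's answer distribution when it is
    shown one candidate visual input (for example, one image crop).
    """
    scores = [entropy(p) for p in candidate_distributions]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical answer distributions for three image crops.
crops = [
    [0.25, 0.25, 0.25, 0.25],  # uninformative crop: maximally uncertain
    [0.70, 0.10, 0.10, 0.10],  # partially relevant crop
    [0.97, 0.01, 0.01, 0.01],  # crop that likely contains the queried object
]
best, scores = pick_most_relevant(crops)
print(best, [round(s, 3) for s in scores])
```

The same selection rule would apply to video segments instead of crops; only the way candidates are generated changes.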


[39] Deep learning motion correction of quantitative stress perfusion cardiovascular magnetic resonance cs.CVPDF

Noortje I. P. Schueler, Nathan C. K. Wong, Richard J. Crawley, Josien P. W. Pluim, Amedeo Chiribiri

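TL;DR: The paper proposes an unsupervised deep learning motion correction method for quantitative stress perfusion cardiovascular magnetic resonance (CMR) imaging that significantly improves the speed and robustness of motion correction.

Details

Motivation: Traditional registration-based motion correction is slow and sensitive to acquisition variability, limiting its robustness and scalability in quantitative perfusion imaging.

Result: Compared with the traditional method, the deep learning approach significantly improved temporal smoothness (p<0.001), achieved comparable or better myocardial alignment, reduced motion artifacts in myocardial perfusion maps, and cut processing time 15-fold.

Insight: Trained on multivendor data, the method generalizes across sequences and may facilitate broader clinical adoption of quantitative perfusion imaging.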

Abstract: Background: Quantitative stress perfusion cardiovascular magnetic resonance (CMR) is a powerful tool for assessing myocardial ischemia. Motion correction is essential for accurate pixel-wise mapping but traditional registration-based methods are slow and sensitive to acquisition variability, limiting robustness and scalability. Methods: We developed an unsupervised deep learning-based motion correction pipeline that replaces iterative registration with efficient one-shot estimation. The method corrects motion in three steps and uses robust principal component analysis to reduce contrast-related effects. It aligns the perfusion series and auxiliary images (arterial input function and proton density-weighted series). Models were trained and validated on multivendor data from 201 patients, with 38 held out for testing. Performance was assessed via temporal alignment and quantitative perfusion values, compared to a previously published registration-based method. Results: The deep learning approach significantly improved temporal smoothness of time-intensity curves (p<0.001). Myocardial alignment (Dice = 0.92 (0.04) and 0.91 (0.05)) was comparable to the baseline and superior to before registration (Dice = 0.80 (0.09), p<0.001). Perfusion maps showed reduced motion, with lower standard deviation in the myocardium (0.52 (0.39) ml/min/g) compared to baseline (0.55 (0.44) ml/min/g). Processing time was reduced 15-fold. Conclusion: This deep learning pipeline enables fast, robust motion correction for stress perfusion CMR, improving accuracy across dynamic and auxiliary images. Trained on multivendor data, it generalizes across sequences and may facilitate broader clinical adoption of quantitative perfusion imaging.


[40] DEAP DIVE: Dataset Investigation with Vision transformers for EEG evaluation cs.CVPDF

Annemarie Hoffsommer, Helen Schneider, Svetlana Pavlitska, J. Marius Zöllner

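TL;DR: This paper studies how subsets of EEG signals from the DEAP dataset can be used for emotion prediction, converting EEG data into scaleograms with the Continuous Wavelet Transform and achieving high accuracy with a vision transformer (ViT) model.

Details

Motivation: Traditional emotion prediction methods (such as self-assessment and facial expression analysis) can be subjective or ambiguous, whereas EEG signals offer a more direct and unbiased data source. Because a full EEG measurement is complex and costly, the authors aim to achieve similar results with low-cost EEG devices.

Result: The model reaches 91.57% accuracy when predicting 4 emotion quadrants (high/low combinations of arousal and valence), compared with state-of-the-art results of 96.9% obtained with 32 channels.

Insight: Reducing the number of EEG channels (from 32 to 12) does not substantially hurt predictive performance, which opens the door to low-cost EEG devices.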

Abstract: Accurately predicting emotions from brain signals has the potential to achieve goals such as improving mental health, human-computer interaction, and affective computing. Emotion prediction through neural signals offers a promising alternative to traditional methods, such as self-assessment and facial expression analysis, which can be subjective or ambiguous. Measurements of brain activity via electroencephalogram (EEG) provide a more direct and unbiased data source. However, conducting a full EEG is a complex, resource-intensive process, leading to the rise of low-cost EEG devices with simplified measurement capabilities. This work examines how subsets of EEG channels from the DEAP dataset can be used for sufficiently accurate emotion prediction with low-cost EEG devices, rather than fully equipped EEG measurements. Using the Continuous Wavelet Transform to convert EEG data into scaleograms, we trained a vision transformer (ViT) model for emotion classification. The model achieved over 91.57% accuracy in predicting 4 quadrants (high/low per arousal and valence) with only 12 measuring points (also referred to as channels). Our work clearly shows that a significant reduction of input channels still yields strong results compared to state-of-the-art results of 96.9% with 32 channels. Training scripts to reproduce our results can be found here: https://gitlab.kit.edu/kit/aifb/ATKS/public/AutoSMiLeS/DEAP-DIVE.
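The four target classes are the high/low combinations of valence and arousal. A minimal sketch of that labeling follows, assuming self-reported ratings on a 1 to 9 scale split at the midpoint 5; the threshold and the function name are assumptions for illustration and are not stated in the abstract.

```python
def quadrant(valence, arousal, threshold=5.0):
    """Map a (valence, arousal) rating pair to one of four quadrant labels.

    The threshold of 5.0 is an assumed midpoint split, not a value
    taken from the paper.
    """
    v = "high" if valence > threshold else "low"
    a = "high" if arousal > threshold else "low"
    return f"{v}-valence/{a}-arousal"

# Hypothetical ratings for three trials.
for v, a in [(7.0, 3.0), (2.5, 8.0), (6.1, 6.4)]:
    print(v, a, quadrant(v, a))
```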


[41] Extreme Blind Image Restoration via Prompt-Conditioned Information Bottleneck cs.CV | cs.AI | cs.LGPDF

Hongeun Kim, Bryan Sangwoo Kim, Jong Chul Ye

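TL;DR: This paper proposes a new framework for Extreme Blind Image Restoration (EBIR) that decomposes the ELQ-to-HQ restoration process and uses Information Bottleneck theory to stabilize training, significantly improving restoration quality.

Details

Motivation: Existing Blind Image Restoration (BIR) methods perform poorly under extreme degradation (such as severe compounded degradations) because the massive domain gap leads to artifacts and loss of detail in restored images.

Result: Extensive experiments under severe degradation regimes show that the method significantly improves restoration quality and reduces artifacts and detail loss.

Insight: Decomposing the restoration process and introducing Information Bottleneck theory can effectively mitigate the massive domain gap caused by extreme degradation, while also allowing existing models to be strengthened without fine-tuning.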

Abstract: Blind Image Restoration (BIR) methods have achieved remarkable success but falter when faced with Extreme Blind Image Restoration (EBIR), where inputs suffer from severe, compounded degradations beyond their training scope. Directly learning a mapping from extremely low-quality (ELQ) to high-quality (HQ) images is challenging due to the massive domain gap, often leading to unnatural artifacts and loss of detail. To address this, we propose a novel framework that decomposes the intractable ELQ-to-HQ restoration process. We first learn a projector that maps an ELQ image onto an intermediate, less-degraded LQ manifold. This intermediate image is then restored to HQ using a frozen, off-the-shelf BIR model. Our approach is grounded in information theory; we provide a novel perspective of image restoration as an Information Bottleneck problem and derive a theoretically-driven objective to train our projector. This loss function effectively stabilizes training by balancing a low-quality reconstruction term with a high-quality prior-matching term. Our framework enables Look Forward Once (LFO) for inference-time prompt refinement, and supports plug-and-play strengthening of existing image restoration models without need for finetuning. Extensive experiments under severe degradation regimes provide a thorough analysis of the effectiveness of our work.


[42] Defect Segmentation in OCT scans of ceramic parts for non-destructive inspection using deep learning cs.CVPDF

Andrés Laveda-Martínez, Natalia P. García-de-la-Puente, Fernando García-Torres, Niels Møller Israelsen, Ole Bang

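TL;DR: This paper presents a deep learning system based on the U-Net architecture for defect segmentation in OCT scans of ceramic parts, achieving high-accuracy defect detection (Dice score 0.979) and demonstrating its practicality for non-destructive inspection.

Details

Motivation: Ceramic manufacturing relies on non-destructive testing (NDT) to ensure part quality, and Optical Coherence Tomography (OCT) provides high-resolution internal imaging. However, manually analyzing OCT images is time-consuming and error-prone, so an automated defect detection system is needed.

Result: The system performs well at defect detection, with a Dice score of 0.979 that outperforms comparable studies. An inference time of 18.98 seconds per volume supports efficient automated quality control.

Insight: Deep learning-based OCT image analysis enables efficient and accurate defect detection, offering a viable route to automating non-destructive testing.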

Abstract: Non-destructive testing (NDT) is essential in ceramic manufacturing to ensure the quality of components without compromising their integrity. In this context, Optical Coherence Tomography (OCT) enables high-resolution internal imaging, revealing defects such as pores, delaminations, or inclusions. This paper presents an automatic defect detection system based on Deep Learning (DL), trained on OCT images with manually segmented annotations. A neural network based on the U-Net architecture is developed, evaluating multiple experimental configurations to enhance its performance. Post-processing techniques enable both quantitative and qualitative evaluation of the predictions. The system achieves a Dice score of 0.979, outperforming comparable studies. The inference time of 18.98 seconds per volume supports its viability for detecting inclusions, enabling more efficient, reliable, and automated quality control.
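For readers unfamiliar with the reported metric, the Dice score between a predicted mask A and a reference mask B is 2|A∩B| / (|A| + |B|). The sketch below computes it on flattened binary masks; the example masks are made up, and returning 1.0 for two empty masks is a convention assumed here.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly (assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 6-pixel masks: 2 overlapping defect pixels, 3 marked in each.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 4))
```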


[43] Multi-Objective Task-Aware Predictor for Image-Text Alignment cs.CV | cs.AIPDF

Eunki Kim, Na Min An, James Thorne, Hyunjung Shim

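TL;DR: The paper proposes MULTI-TAP, a multi-objective task-aware predictor for evaluating image-text alignment that produces both an overall score and fine-grained multi-objective scores, while addressing the shortcomings of existing methods in alignment with human judgments, long-sequence processing, inference efficiency, and multi-objective scoring.

Details

Motivation: Existing image-text alignment evaluation methods do not comprehensively account for human preferences across multiple dimensions, and fall short particularly in multi-objective scoring and inference efficiency. A method is therefore needed that meets multi-dimensional scoring needs while remaining efficient.

Result: MULTI-TAP outperforms existing baselines (such as VisionREWARD) while being more efficient; at 7-8B parameters, its performance is on par with G-VEval, the GPT-4o-based predictor.

Insight: The strong multi-objective results of this lightweight approach indicate that the hidden states of a pre-trained model can support efficient fine-grained scoring. The new dataset also provides an important resource for studying multi-dimensional human preferences.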

Abstract: Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on contexts or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors that lack at least one of these key properties: (1) Alignment with human judgments, (2) Long-sequence processing, (3) Inference efficiency, and (4) Applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi and single-objective scoring. MULTI-TAP can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLM). We show that MULTI-TAP is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics and even on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP performs better than VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
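The ridge regression layer mentioned in the abstract has a standard closed-form solution, which helps explain why it is cheap to train on frozen features. As a sketch, write H (n x d) for the frozen hidden states of n training samples, Y (n x k) for the human scores on k objectives, and lambda > 0 for the regularization strength. These symbols are notation assumed here, not taken from the paper.

```latex
\hat{W}
  = \arg\min_{W} \; \lVert H W - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2
  = \left( H^\top H + \lambda I_d \right)^{-1} H^\top Y
```

Scoring a new sample with hidden state h then reduces to the single matrix product h^T W, so no gradient-based fine-tuning of the LVLM is involved in this step.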


[44] ZQBA: Zero Query Black-box Adversarial Attack cs.CVPDF

Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio

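TL;DR: ZQBA is a zero-query black-box adversarial attack that uses DNN feature maps to generate adversarial samples without multiple queries or training a surrogate model.

Details

Motivation: Existing black-box adversarial attacks require multiple queries or training diffusion models, which limits their practicality. ZQBA addresses this by generating adversarial samples directly from DNN representations.

Result: Experiments show that ZQBA outperforms state-of-the-art single-query black-box attacks on the CIFAR and Tiny ImageNet datasets while keeping perturbations imperceptible, as assessed by SSIM and qualitative visual evaluation.

Insight: ZQBA shows that DNN representations can be used to generate adversarial samples efficiently, underscoring the vulnerability of DNNs in real-world settings.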

Abstract: Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at https://github.com/Joana-Cabral/ZQBA.
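The abstract states only that feature maps obtained from a DNN are added to clean images. The sketch below shows one way such an addition could look on a single-channel image; the min-max normalization, the `epsilon` scaling factor, and the clipping to [0, 1] are assumptions made for illustration and are not confirmed by the source.

```python
def add_feature_perturbation(image, feature_map, epsilon=0.03):
    """Add a scaled feature map to a clean image (both same-shape 2D lists).

    Pixel values are assumed to lie in [0, 1]. The feature map is min-max
    normalized to [0, 1] and scaled by epsilon before being added.
    """
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    adversarial = []
    for img_row, feat_row in zip(image, feature_map):
        new_row = []
        for pixel, feat in zip(img_row, feat_row):
            # A constant feature map carries no structure and adds nothing.
            norm = (feat - lo) / span if span > 0 else 0.0
            # Clip so the result stays a valid image.
            new_row.append(min(1.0, max(0.0, pixel + epsilon * norm)))
        adversarial.append(new_row)
    return adversarial

# Hypothetical 2x2 image and feature map.
image = [[0.5, 0.99], [0.0, 0.2]]
feature_map = [[0.0, 4.0], [2.0, 1.0]]
adv = add_feature_perturbation(image, feature_map, epsilon=0.1)
print(adv)
```

A small `epsilon` keeps the change visually subtle, which is the property the abstract evaluates with SSIM.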


[45] Uncertainty-Aware Concept Bottleneck Models with Enhanced Interpretability cs.CV | cs.AIPDF

Haifei Zhang, Patrick Barry, Eduardo Brandao

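TL;DR: This paper proposes an uncertainty-aware Concept Bottleneck Model (CBM) that learns binary class-level concept prototypes to enhance interpretability and robustness.

Details

Motivation: Although CBMs offer a semantically meaningful and interpretable classification pipeline, their predictive performance is often lower than that of end-to-end convolutional neural networks, and the propagation of uncertainty from concept predictions to final labels remains underexplored.

Result: The method enhances interpretability while maintaining predictive performance, and improves robustness to uncertain inputs through conformal prediction.

Insight: Combining uncertainty measures with interpretable classification rules can significantly increase the practical value of CBMs.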

Abstract: In the context of image classification, Concept Bottleneck Models (CBMs) first embed images into a set of human-understandable concepts, followed by an intrinsically interpretable classifier that predicts labels based on these intermediate representations. While CBMs offer a semantically meaningful and interpretable classification pipeline, they often sacrifice predictive performance compared to end-to-end convolutional neural networks. Moreover, the propagation of uncertainty from concept predictions to final label decisions remains underexplored. In this paper, we propose a novel uncertainty-aware and interpretable classifier for the second stage of CBMs. Our method learns a set of binary class-level concept prototypes and uses the distances between predicted concept vectors and each class prototype as both a classification score and a measure of uncertainty. These prototypes also serve as interpretable classification rules, indicating which concepts should be present in an image to justify a specific class prediction. The proposed framework enhances both interpretability and robustness by enabling conformal prediction for uncertain or outlier inputs based on their deviation from the learned binary class-level concept prototypes.
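The prototype-distance idea can be sketched as follows: each class has a binary concept vector, and the distance from the predicted concept vector to each prototype serves both as a classification score and as an uncertainty signal. The concept names, prototypes, and the mean absolute distance used below are hypothetical choices; the paper's learned prototypes, distance function, and conformal prediction step are not reproduced here.

```python
def classify_with_prototypes(concept_probs, prototypes):
    """Return (predicted_label, distances) using mean absolute distance.

    concept_probs: predicted concept activations in [0, 1].
    prototypes: dict mapping class label -> binary concept vector.
    """
    distances = {
        label: sum(abs(c - p) for c, p in zip(concept_probs, proto)) / len(proto)
        for label, proto in prototypes.items()
    }
    predicted = min(distances, key=distances.get)
    return predicted, distances

# Hypothetical concepts: [has_red_plumage, has_long_beak, has_webbed_feet]
prototypes = {
    "cardinal": [1, 0, 0],
    "heron": [0, 1, 0],
    "duck": [0, 0, 1],
}
predicted, distances = classify_with_prototypes([0.9, 0.2, 0.1], prototypes)
print(predicted, {k: round(v, 3) for k, v in distances.items()})
```

A prototype also reads as a rule (here, "cardinal" requires red plumage and neither other concept), and an input far from every prototype could be flagged as uncertain or as an outlier.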


[46] MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent Prompts cs.CV | cs.AIPDF

Yifan Shen, Yangyang Shu, Hye-young Paik, Yulei Sui

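TL;DR: MetaLogic is a new evaluation framework that detects semantic inconsistencies in text-to-image (T2I) models by generating image pairs from prompts that are logically equivalent but grammatically different.

Details

Motivation: Current T2I models can produce semantically inconsistent images when the input prompt undergoes minor linguistic variations, exposing weaknesses in reasoning and generalization. MetaLogic is proposed to address this problem.

Result: Experiments show that even state-of-the-art T2I models such as Flux.dev and DALLE-3 have misalignment rates of 59% and 71%, respectively. MetaLogic is efficient and scalable, and uncovers logical inconsistencies that existing metrics overlook.

Insight: MetaLogic reveals the limits of T2I models' logical understanding and highlights the importance of evaluating semantic consistency. It provides a practical tool for model debugging and refinement and a new evaluation direction for future research.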

Abstract: Recent advances in text-to-image (T2I) models, especially diffusion-based architectures, have significantly improved the visual quality of generated images. However, these models continue to struggle with a critical limitation: maintaining semantic consistency when input prompts undergo minor linguistic variations. Despite being logically equivalent, such prompt pairs often yield misaligned or semantically inconsistent images, exposing a lack of robustness in reasoning and generalisation. To address this, we propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images. MetaLogic leverages metamorphic testing, generating image pairs from prompts that differ grammatically but are semantically identical. By directly comparing these image pairs, the framework identifies inconsistencies that signal failures in preserving the intended meaning, effectively diagnosing robustness issues in the model’s logic understanding. Unlike existing evaluation methods that compare a generated image to a single prompt, MetaLogic evaluates semantic equivalence between paired images, offering a scalable, ground-truth-free approach to identifying alignment failures. It categorises these alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples that can be used for model debugging and refinement. We evaluate MetaLogic across multiple state-of-the-art T2I models and reveal consistent robustness failures across a range of logical constructs. We find that even the SOTA text-to-image models like Flux.dev and DALLE-3 demonstrate a 59 percent and 71 percent misalignment rate, respectively. Our results show that MetaLogic is not only efficient and scalable, but also effective in uncovering fine-grained logical inconsistencies that are overlooked by existing evaluation metrics.
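The metamorphic testing loop described in the abstract can be sketched in a few lines: generate an image for each prompt in a logically equivalent pair, compare the two images, and count the pairs that disagree. Everything below is a toy stand-in. `fake_generate` is a hypothetical "model" whose image is just a count of depicted entities, and it deliberately drops anything after "as well as" to mimic entity omission; a real pipeline would call a T2I model and an image comparison method instead.

```python
from collections import Counter

ENTITIES = ("cat", "dog", "bird")

def fake_generate(prompt):
    """Toy stand-in for a T2I model: returns a Counter of depicted entities."""
    visible = prompt.split(" as well as ")[0]  # simulated omission bug
    return Counter(w for w in visible.replace(",", " ").split() if w in ENTITIES)

def misalignment_rate(prompt_pairs, generate):
    """Fraction of equivalent prompt pairs whose generated images differ."""
    failures = [(a, b) for a, b in prompt_pairs if generate(a) != generate(b)]
    return len(failures) / len(prompt_pairs), failures

# Logically equivalent prompt pairs (no ground-truth images are needed).
pairs = [
    ("a cat and a dog", "a dog and a cat"),
    ("a cat and a bird", "a cat as well as a bird"),
]
rate, failures = misalignment_rate(pairs, fake_generate)
print(rate, failures)
```

The failing pairs double as counterexamples for debugging, which matches the role the abstract assigns to surfaced alignment errors.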


[47] Solar PV Installation Potential Assessment on Building Facades Based on Vision and Language Foundation Models cs.CV | cs.AIPDF

Ruyu Liu, Dongxu Zhuang, Jianhua Zhang, Arega Getaneh Abate, Per Sieverts Nielsen

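TL;DR: The paper proposes SF-SPA, an automated framework for assessing the solar photovoltaic (PV) installation potential of building facades, using computer vision and artificial intelligence techniques to address perspective correction, semantic segmentation, and PV layout optimization.

Details

Motivation: The solar PV potential of urban building facades is underused, and conventional assessment is inefficient because of complex geometries and semantic components.

Result: Area estimation error is 6.2% ± 2.8%, assessment takes about 100 seconds per building, and simulation results confirm the method's reliability.

Insight: By combining AI techniques with an LLM, SF-SPA offers an efficient automated tool for urban energy planning and BIPV deployment.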

Abstract: Building facades represent a significant untapped resource for solar energy generation in dense urban environments, yet assessing their photovoltaic (PV) potential remains challenging due to complex geometries and semantic components. This study introduces SF-SPA (Semantic Facade Solar-PV Assessment), an automated framework that transforms street-view photographs into quantitative PV deployment assessments. The approach combines computer vision and artificial intelligence techniques to address three key challenges: perspective distortion correction, semantic understanding of facade elements, and spatial reasoning for PV layout optimization. Our four-stage pipeline processes images through geometric rectification, zero-shot semantic segmentation, Large Language Model (LLM) guided spatial reasoning, and energy simulation. Validation across 80 buildings in four countries demonstrates robust performance with mean area estimation errors of 6.2% ± 2.8% compared to expert annotations. The automated assessment requires approximately 100 seconds per building, a substantial gain in efficiency over manual methods. Simulated energy yield predictions confirm the method’s reliability and applicability for regional potential studies, urban energy planning, and building-integrated photovoltaic (BIPV) deployment. Code is available at: https://github.com/CodeAXu/Solar-PV-Installation


[48] From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation cs.CVPDF

Fan Yang, Zhiyang Chen, Yousong Zhu, Xin Li, Jinqiao Wang

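TL;DR: TrajVLM-Gen is a two-stage vision-language framework that combines trajectory prediction with video generation to produce videos whose motion obeys physical laws.

Details

Motivation: Existing video generation models often produce motion that violates real-world physics and lacks consistency.

Result: The method achieves FVD scores of 545 on UCF-101 and 539 on MSR-VTT, outperforming existing methods.

Insight: Guiding video generation with predicted trajectories can improve the physical consistency of generated videos.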

Abstract: Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.


[49] What You See is What You Ask: Evaluating Audio Descriptions cs.CV | cs.AI | cs.CLPDF

Divy Kala, Eshika Khandelwal, Makarand Tapaswi

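TL;DR: This paper proposes ADQA, a benchmark for evaluating how well audio descriptions (ADs) help Blind and Low Vision (BLV) users understand the story and visual details, and exposes the subjectivity of AD writing and the shortcomings of current AD generation methods.

Details

Motivation: Existing research on automatic AD generation focuses mainly on short clips and evaluates against a single reference AD, overlooking the subjectivity of AD writing. By analyzing two independent AD tracks for the same movies, the authors quantify this subjectivity and point out the limitations of short-clip evaluation.

Result: ADQA shows that current AD generation methods lag far behind human-authored ADs, underscoring the importance of evaluating longer segments.

Insight: AD writing is highly subjective, so evaluation should be based on longer coherent segments rather than short clips; future AD generation research needs to better serve the actual needs of BLV users.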

Abstract: Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.


[50] PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset cs.CVPDF

Thomas Campagnolo, Ezio Malis, Philippe Martinet, Gaetan Bahl

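TL;DR: PhraseStereo is the first open-vocabulary stereo image segmentation dataset; it extends phrase-region segmentation to stereo image pairs and exploits depth and geometric cues.

Details

Motivation: Current phrase grounding research is largely limited to single-view images and neglects the rich geometric cues available in stereo vision.

Result: PhraseStereo provides stereo image pairs with aligned segmentation masks and phrase annotations, laying a foundation for research at the intersection of language, vision, and 3D perception.

Insight: Depth information from stereo image pairs can give multimodal learning more precise and context-aware grounding.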

Abstract: Understanding how natural language phrases correspond to specific regions in images is a key challenge in multimodal semantic segmentation. Recent advances in phrase grounding are largely limited to single-view images, neglecting the rich geometric cues available in stereo vision. To address this, we introduce PhraseStereo, the first dataset that brings phrase-region segmentation to stereo image pairs. PhraseStereo builds upon the PhraseCut dataset by leveraging GenStereo to generate accurate right-view images from existing single-view data, enabling the extension of phrase grounding into the stereo domain. This new setting introduces unique challenges and opportunities for multimodal learning, particularly in leveraging depth cues for more precise and context-aware grounding. By providing stereo image pairs with aligned segmentation masks and phrase annotations, PhraseStereo lays the foundation for future research at the intersection of language, vision, and 3D perception, encouraging the development of models that can reason jointly over semantics and geometry. The PhraseStereo dataset will be released online upon acceptance of this work.


[51] NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution cs.CVPDF

Xiangtao Kong, Rongyuan Wu, Shuaizheng Liu, Lingchen Sun, Lei Zhang

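TL;DR: NSARM is a robust real-world image super-resolution framework based on an autoregressive model. Its two-stage training strategy (a transformation network followed by end-to-end fine-tuning) improves image quality and robustness to inputs while keeping inference efficient.

Details

Motivation: Existing Real-ISR methods either rely on slow diffusion sampling or produce lower-quality, less robust results. Autoregressive models such as Infinity show efficient, high-quality generation but have not yet been applied to super-resolution. This paper aims to use the strengths of autoregressive models to address these problems.

Result: NSARM outperforms existing Real-ISR methods in both quantitative and qualitative evaluations, producing higher-quality images while keeping inference efficient. Robustness to input quality and generalization are markedly improved.

Insight: Autoregressive models show promise for Real-ISR, with better efficiency and robustness than diffusion models. The two-stage training strategy is key to the model's adaptability and performance.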

Abstract: Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but at the price of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations and reduces robustness to inputs with varying degradations. Recent visual autoregressive (AR) models, such as pre-trained Infinity, can provide strong T2I generation capabilities while offering superior efficiency by using the bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by an end-to-end full-model fine-tuning. Such a comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it demonstrates much higher robustness to the quality of input images, showing stronger generalization performance. Project page: https://github.com/Xiangtaokong/NSARM


[52] Feature Identification for Hierarchical Contrastive Learning cs.CV | cs.AIPDF

Julius Ott, Nastassia Vysotskaya, Huawei Sun, Lorenzo Servadei, Robert Wille

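TL;DR: This paper proposes two novel hierarchical contrastive learning (HMLC) methods, one based on a Gaussian Mixture Model (G-HMLC) and one on an attention mechanism (A-HMLC), which capture hierarchy-specific features and model inter-class relationships to improve hierarchical classification.

Details

Motivation: Conventional classification approaches often neglect the inherent relationships between classes at different hierarchy levels and thus miss important supervisory signals. To address this, the paper designs two hierarchical contrastive learning methods that better capture hierarchical structure.

Result: On CIFAR100 and ModelNet40, the HMLC methods achieve state-of-the-art performance in linear evaluation, with accuracy 2 percentage points higher than existing methods.

Insight: The paper's main strength is explicitly modeling inter-class relationships and imbalanced class distributions through hierarchical contrastive learning, which has broad potential in complex hierarchical classification tasks.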

Abstract: Hierarchical classification is a crucial task in many applications, where objects are organized into multiple levels of categories. However, conventional classification approaches often neglect inherent inter-class relationships at different hierarchy levels, thus missing important supervisory signals. We therefore propose two novel hierarchical contrastive learning (HMLC) methods. The first leverages a Gaussian Mixture Model (G-HMLC), and the second uses an attention mechanism to capture hierarchy-specific features (A-HMLC), imitating human processing. Our approach explicitly models inter-class relationships and imbalanced class distribution at higher hierarchy levels, enabling fine-grained clustering across all hierarchy levels. On the competitive CIFAR100 and ModelNet40 datasets, our method achieves state-of-the-art performance in linear evaluation, outperforming existing hierarchical contrastive learning methods by 2 percentage points in terms of accuracy. The effectiveness of our approach is backed by both quantitative and qualitative results, highlighting its potential for applications in computer vision and beyond.


[53] Can World Models Benefit VLMs for World Dynamics? cs.CV | cs.AI | cs.CL | cs.LGPDF

Kevin Zhang, Kuangzhi Ge, Xiaowei Chi, Renrui Zhang, Shaojun Shi

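TL;DR: This paper examines whether generative world models can replace the conventional vision encoder paradigm for general-purpose multimodal understanding, and proposes Dynamic Vision Aligner (DyVA), which significantly improves spatial reasoning.

Details

Motivation: Given the strong performance of generative world models on video data, it is natural to ask whether they can be used for general multimodal tasks. The paper explores the potential of world model priors in vision-language models.

Result: DyVA surpasses both open-source and proprietary baselines on a suite of visual reasoning tasks, achieving state-of-the-art or comparable performance.

Insight: The internalization of motion consistency inherited from video pre-training is the key factor behind the gains of world models on vision-language tasks.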

Abstract: Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundational models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we strive to investigate the capabilities when world model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder to perform a single denoising step and treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders can capture latents useful for downstream understanding that show distinctions from conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to WorldLM’s inherited motion-consistency internalization from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.


[54] Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model cs.CV | cs.AIPDF

Hyun-kyu Ko, Youbin Kim, Jihyeon Park, Dongheok Park, Gyeongjin Kang

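TL;DR: The paper proposes Gather-Scatter Mamba (GSM), a hybrid architecture that combines a selective state space model (Mamba) with spatial context aggregation for efficient spatio-temporal modeling in video super-resolution.

Details

Motivation: Traditional RNNs in video super-resolution suffer from vanishing gradients, poor parallelism, and slow inference, while the quadratic complexity of Transformers limits them on long sequences. Mamba offers a linear-complexity alternative but lacks the ability to model spatial dependencies.

Result: GSM efficiently reduces occlusion artifacts in video super-resolution and improves spatio-temporal modeling.

Insight: 1. Combining Mamba with self-attention balances complexity and modeling capacity; 2. Feature alignment is critical for propagating spatio-temporal information.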

Abstract: State Space Models (SSMs), most notably RNNs, have historically played a central role in sequential modeling. Although attention mechanisms such as Transformers have since dominated due to their ability to model global context, their quadratic complexity and limited scalability make them less suited for long sequences. Video super-resolution (VSR) methods have traditionally relied on recurrent architectures to propagate features across frames. However, such approaches suffer from well-known issues including vanishing gradients, lack of parallelism, and slow inference speed. Recent advances in selective SSMs like Mamba offer a compelling alternative: by enabling input-dependent state transitions with linear-time complexity, Mamba mitigates these issues while maintaining strong long-range modeling capabilities. Despite this potential, Mamba alone struggles to capture fine-grained spatial dependencies due to its causal nature and lack of explicit context aggregation. To address this, we propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation. Furthermore, we introduce Gather-Scatter Mamba (GSM), an alignment-aware mechanism that warps features toward a center anchor frame within the temporal window before Mamba propagation and scatters them back afterward, effectively reducing occlusion artifacts and ensuring effective redistribution of aggregated information across all frames. The official implementation is provided at: https://github.com/Ko-Lani/GSMamba.


[55] AI-CNet3D: An Anatomically-Informed Cross-Attention Network with Multi-Task Consistency Fine-tuning for 3D Glaucoma Classification cs.CV | cs.LGPDF

Roshan Kenia, Anfei Li, Rishabh Srivastava, Kaveri A. Thakoor

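TL;DR: The paper proposes AI-CNet3D, a new deep learning model that combines cross-attention with a 3D CNN to extract key features from OCT volumes for glaucoma classification, and demonstrates superior performance and computational efficiency.

Details

Motivation: The conventional practice of condensing 3D OCT volumes into 2D reports loses key structural details, limiting the accuracy of glaucoma diagnosis.

Result: The model outperforms state-of-the-art attention and convolutional models on two large datasets while using far fewer parameters (a 100-fold reduction compared to other attention mechanisms).

Insight: Anatomically informed cross-attention can improve both the accuracy and interpretability of glaucoma classification while remaining computationally efficient.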

Abstract: Glaucoma is a progressive eye disease that leads to optic nerve damage, causing irreversible vision loss if left untreated. Optical coherence tomography (OCT) has become a crucial tool for glaucoma diagnosis, offering high-resolution 3D scans of the retina and optic nerve. However, the conventional practice of condensing information from 3D OCT volumes into 2D reports often results in the loss of key structural details. To address this, we propose a novel hybrid deep learning model that integrates cross-attention mechanisms into a 3D convolutional neural network (CNN), enabling the extraction of critical features from the superior and inferior hemiretinas, as well as from the optic nerve head (ONH) and macula, within OCT volumes. We introduce Channel Attention REpresentations (CAREs) to visualize cross-attention outputs and leverage them for consistency-based multi-task fine-tuning, aligning them with Gradient-Weighted Class Activation Maps (Grad-CAMs) from the CNN’s final convolutional layer to enhance performance, interpretability, and anatomical coherence. We have named this model AI-CNet3D (AI-'See'-Net3D) to reflect its design as an Anatomically-Informed Cross-attention Network operating on 3D data. By dividing the volume along two axes and applying cross-attention, our model enhances glaucoma classification by capturing asymmetries between the hemiretinal regions while integrating information from the optic nerve head and macula. We validate our approach on two large datasets, showing that it outperforms state-of-the-art attention and convolutional models across all key metrics. Finally, our model is computationally efficient, reducing the parameter count one hundredfold compared to other attention mechanisms while maintaining high diagnostic performance and comparable GFLOPS.


[56] Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification cs.CV | cs.CY | cs.HC [PDF]

Yucheng Lu, Hubert Dariusz Zając, Veronika Cheplygina, Amelia Jiménez-Sánchez

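TL;DR: By surveying how machine learning researchers make transfer-learning decisions for medical image classification, the study shows that source-dataset selection relies on intuition rather than systematic principles, and identifies task dependence, community practice, and dataset properties as influencing factors.

Details

Motivation: Transfer learning is crucial for medical image classification, yet source datasets are usually chosen by researcher intuition without systematic principles, which can affect algorithm generalizability and patient outcomes.

Result: Source-dataset choices are task-dependent, similarity ratings do not always agree with expected performance, and participants used ambiguous terminology.

Insight: Clearer definitions and HCI tools are needed to support systematic source-dataset selection; the study offers practical guidance for transfer learning.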

Abstract: Transfer learning is crucial for medical imaging, yet the selection of source datasets, which can affect the generalizability of algorithms and thus patient outcomes, often relies on researchers’ intuition rather than systematic principles. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-centered HCI perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data-embedding) or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional “more similar is better” view. Participants often used ambiguous terminology, which suggests a need for clearer definitions and HCI tools to make them explicit and usable. By clarifying these heuristics, this work provides practical insights for more systematic source selection in transfer learning.


[57] InfVSR: Breaking Length Limits of Generic Video Super-Resolution cs.CV [PDF]

Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen

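TL;DR: InfVSR proposes an autoregressive one-step diffusion paradigm that addresses the inefficiency and poor scalability of video super-resolution (VSR) on long videos, delivering high-quality super-resolution at high speed.

Details

Motivation: Real-world videos often contain thousands of frames, and existing VSR methods are inefficient and scale poorly on long sequences.

Result: InfVSR achieves state-of-the-art quality on long-video VSR with up to a 58x speed-up over existing methods such as MGLD-VSR.

Insight: Long-video processing must balance efficiency and quality; pairing autoregression with diffusion models is an effective way past current limits.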

Abstract: Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which novelly reformulates VSR as an autoregressive-one-step-diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be available at https://github.com/Kai-Liu001/InfVSR.
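The rolling KV-cache mentioned in the abstract can be pictured as a fixed-size window over past chunks, so that memory stays bounded however long the video runs. The following is a minimal toy sketch under that assumption, not the InfVSR implementation: the class name `RollingKVCache`, the `max_chunks` parameter, and the string placeholders standing in for key/value tensors are all hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Toy rolling cache: keeps keys/values for only the most recent chunks.

    Hypothetical illustration; real caches hold attention tensors per layer.
    """

    def __init__(self, max_chunks):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.keys = deque(maxlen=max_chunks)
        self.values = deque(maxlen=max_chunks)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def context(self):
        """Return the cached keys and values visible to the next chunk."""
        return list(self.keys), list(self.values)


cache = RollingKVCache(max_chunks=2)
seen = []
for chunk_id in range(4):
    keys, _ = cache.context()      # causal: a chunk only sees earlier chunks
    seen.append(keys)
    cache.append(f"k{chunk_id}", f"v{chunk_id}")
```

In this toy run the fourth chunk sees only `k1` and `k2`: the oldest entry has been evicted, which is what keeps the cost per chunk constant in a streaming setting.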


[58] JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation cs.CV [PDF]

Siheng Wan, Zhengtao Yao, Zhengdao Li, Junhao Dong, Yanshu Li

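TL;DR: JEPA-T is a unified multimodal framework that encodes images and text into discrete visual and textual tokens processed by a joint-embedding predictive Transformer, with cross-attention added to strengthen the fusion of text and visual information.

Details

Motivation: Existing text-to-image (T2I) generation methods are mostly token-centric architectures trained with self-supervision, but effectively fusing text with visual tokens during generation remains a challenge.

Result: Experiments on ImageNet-1K show that JEPA-T is data-efficient, generalizes to open vocabularies, and outperforms non-fusion and late-fusion baselines.

Insight: Combining late architectural fusion with objective-level alignment balances conditioning strength against backbone generality in token-based T2I.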

Abstract: Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is available at https://github.com/justin-herry/JEPA-T.git


[59] A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features cs.CV [PDF]

Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann

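TL;DR: FastForward builds a scene representation and localizes a query image in a single feed-forward pass, estimating camera pose from a collection of features anchored in 3D space and sharply reducing mapping time.

Details

Motivation: Existing visual localization methods spend a long time building the scene representation; FastForward aims for competitive accuracy much faster, to meet real-time and practical needs.

Result: With minimal map preparation time, FastForward reaches accuracy comparable to state-of-the-art methods and generalizes to unseen outdoor scenes.

Insight: Representing features from multiple images as a collection anchored in 3D space is key to efficient camera pose estimation; the method has clear advantages in speed and generalization.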

Abstract: Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences towards the practicability of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work raises the question whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At the core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward utilizes these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.


[60] Visual Self-Refinement for Autoregressive Models cs.CV [PDF]

Jiamian Wang, Ziqi Zhou, Chaithanya Kumar Mummadi, Sohail Dianat, Majid Rabbani

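TL;DR: The paper proposes a plug-and-play visual self-refinement module for autoregressive models that strengthens spatial-correspondence modeling within generated visual sequences and thereby improves generation quality.

Details

Motivation: Autoregressive models excel at sequence modeling, but the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal generations.

Result: Experiments show that the method markedly improves generation quality and helps the model produce more semantically consistent results.

Insight: The paper highlights the importance of global context and relationship modeling in autoregressive models and offers a new way to resolve the conflict between visual signals and sequential modeling.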

Abstract: Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the complex spatial correspondence modeling within the generated visual sequence. This module operates as a post-pretraining step to jointly refine all generated tokens of the autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationships across tokens, our method mitigates error accumulation in sequential generation. Experiments demonstrate that the proposed method improves the generation quality, enhancing the model’s ability to produce semantically consistent results.


[61] SoftCFG: Uncertainty-guided Stable Guidance for Visual autoregressive Model cs.CV [PDF]

Dongli Xu, Aleksei Tiulpin, Matthew B. Blaschko

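TL;DR: SoftCFG is an uncertainty-guided stable guidance method that improves visual generation quality in autoregressive models by addressing the guidance-diminishing and over-guidance problems of conventional Classifier-Free Guidance (CFG).

Details

Motivation: Autoregressive models perform well at image generation, but with conventional CFG the guidance signal either fades during decoding or interferes too strongly, hurting the visual coherence of generated images.

Result: Experiments show that SoftCFG significantly improves image quality and achieves state-of-the-art FID on ImageNet 256 among autoregressive models.

Insight: Uncertainty-guided weighted allocation of perturbations effectively balances conflicts between text guidance and visual context, and Step Normalization is key to stabilizing long-sequence generation.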

Abstract: Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256 among autoregressive models.
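For readers unfamiliar with the baseline, standard CFG applied to token logits is usually written as guided = uncond + scale * (cond - uncond). The sketch below illustrates only that standard baseline; it does not implement SoftCFG's certainty-weighted perturbations or Step Normalization, whose details are not given in the abstract. The function names and the example logits are illustrative assumptions.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance on logits (baseline, not SoftCFG)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits over a three-token vocabulary.
cond = [2.0, 1.0, 0.0]      # conditioned on the class or text prompt
uncond = [1.0, 1.0, 1.0]    # unconditional prediction

guided = cfg_logits(cond, uncond, scale=3.0)
probs = softmax(guided)
```

With `scale=1.0` the guided logits equal the conditional ones; larger scales push probability mass further toward tokens the condition favors. If `cond` and `uncond` become nearly identical later in decoding, the guidance term shrinks toward zero regardless of scale, which is the "guidance diminishing" behaviour the abstract describes.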


[62] TextCAM: Explaining Class Activation Map with Text cs.CV | cs.AI | cs.LG [PDF]

Qiming Zhao, Xingjian Li, Xiaoyu Cao, Xiaolong Wu, Min Xu

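TL;DR: TextCAM is a novel explanation framework that combines class activation maps (CAM) with natural language to give richer semantic explanations of deep vision models' predictions.

Details

Motivation: Deep neural networks (DNNs) have achieved remarkable success in many domains, but their black-box nature limits trust in high-stakes applications. CAM and its variants only highlight spatial regions and offer no semantic explanation; TextCAM is proposed to address this.

Result: Experiments on ImageNet, CLEVR, and CUB show that TextCAM's explanations are faithful to model predictions, improve human understanding, detect spurious correlations, and preserve model fidelity.

Insight: TextCAM shows that combining vision and language models can yield more interpretable explanations for deep neural networks, improving transparency and trustworthiness.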

Abstract: Deep neural networks (DNNs) have achieved remarkable success across domains but remain difficult to interpret, limiting their trustworthiness in high-stakes applications. This paper focuses on deep vision models, for which a dominant line of explainability methods is Class Activation Mapping (CAM) and its variants, which highlight spatial regions that drive predictions. We find that CAM provides little semantic insight into what attributes underlie these activations. To address this limitation, we propose TextCAM, a novel explanation framework that enriches CAM with natural language. TextCAM combines the precise spatial localization of CAM with the semantic alignment of vision-language models (VLMs). Specifically, we derive channel-level semantic representations using CLIP embeddings and linear discriminant analysis, and aggregate them with CAM weights to produce textual descriptions of salient visual evidence. This yields explanations that jointly specify where the model attends and what visual attributes likely support its decision. We further extend TextCAM to organize feature channels into semantically coherent groups, enabling more fine-grained visual-textual explanations. Experiments on ImageNet, CLEVR, and CUB demonstrate that TextCAM produces faithful and interpretable rationales that improve human understanding, detect spurious correlations, and preserve model fidelity.
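One way to picture "aggregating channel-level semantic representations with CAM weights" is as a weighted sum of per-channel embedding vectors, followed by a nearest-text lookup. The toy sketch below makes that assumption explicit; it is not the TextCAM pipeline (no CLIP, no linear discriminant analysis), and the 2-D vectors, the vocabulary, and all names are hypothetical.

```python
import math


def weighted_sum(vectors, weights):
    """Combine per-channel vectors using the given (CAM-like) weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-D "semantic" vectors, one per feature channel.
channel_semantics = [[1.0, 0.0], [0.0, 1.0]]
cam_weights = [0.9, 0.1]            # channel importance for the predicted class

# Hypothetical text embeddings living in the same space.
vocab = {"striped": [1.0, 0.1], "furry": [0.1, 1.0]}

query = weighted_sum(channel_semantics, cam_weights)
ranked = sorted(vocab, key=lambda word: cosine(query, vocab[word]), reverse=True)
```

Here the heavily weighted first channel dominates the aggregate, so the attribute word closest to it is ranked first. A real system would use learned embeddings of far higher dimension.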


[63] POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency cs.CV | cs.MM [PDF]

Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

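TL;DR: POVQA is a data-efficient video question answering method that aligns large vision-language models using temporal pooling and lightweight supervision, markedly improving answer accuracy and reasoning quality.

Details

Motivation: Current video question answering methods typically need context windows of 1500+ frames yet cover only about 50 seconds of video, giving low information efficiency at high computational cost.

Result: On the ReasonVQA dataset, F1 rises from 0.212 to 0.543, with BLEU-4 and ROUGE-L also improving substantially; cross-pooling evaluation and zero-shot tests on another dataset indicate strong robustness.

Insight: Temporal pooling with lightweight optimization compresses video information efficiently and improves question answering performance; reasoning prompts further improve the model's ability to explain its answers.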

Abstract: Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since DeepMind introduced Flamingo. Recent advancements in large context/long video question answering have allowed VQA tasks to use context windows of 1500+ frames. However, this amounts to only 50 seconds of video footage if no significant information is to be lost. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential and Ramp pooling and fine-tune QWEN-2.5-VL 7B with a supervised two-turn target comprising reasoning and a final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA, consisting of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation of SFT + DPO on various pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness in summarizing temporal evidence. Similar observations were made in zero-shot evaluation on TVQA.
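Temporal pooling of this kind can be sketched as a normalized weighted average over the frames in each one-second window. The abstract names Weighted Average, Exponential, and Ramp pooling but does not define their weights, so the weight schedules below are assumptions chosen for illustration, as are the function names and the tiny two-pixel "frames".

```python
def pool_frames(frames, weights):
    """Collapse frames (equal-length lists of pixel values) into one frame
    using a normalized weighted average."""
    total = sum(weights)
    n_pixels = len(frames[0])
    return [
        sum(w * f[i] for w, f in zip(weights, frames)) / total
        for i in range(n_pixels)
    ]


def ramp_weights(n):
    """Assumed ramp schedule: weights grow linearly toward the last frame."""
    return [float(k + 1) for k in range(n)]


def exponential_weights(n, decay=0.5):
    """Assumed exponential schedule: the last frame has weight 1 and older
    frames decay geometrically."""
    return [decay ** (n - 1 - k) for k in range(n)]


# Three toy frames with two pixels each, brightening over time.
frames = [[0.0, 0.0], [30.0, 60.0], [60.0, 120.0]]

uniform = pool_frames(frames, [1.0, 1.0, 1.0])
ramp = pool_frames(frames, ramp_weights(3))
expo = pool_frames(frames, exponential_weights(3))
```

Schedules that favor recent frames pull the pooled image toward the end of the window. As a rough sense of scale, 1500 frames covering 50 seconds implies 30 frames per second, so one pooled image per second would let the same 1500-image budget span about 1500 seconds, or 25 minutes.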


[64] ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning cs.CV [PDF]

Yuxiang Guo, Jiang Liu, Ze Wang, Hao Chen, Ximeng Sun

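TL;DR: ImageDoctor is a unified multi-aspect evaluation framework for text-to-image (T2I) models that scores image quality along four complementary dimensions (plausibility, semantic alignment, aesthetics, and overall quality) and provides pixel-level flaw heatmaps.

Details

Motivation: Existing methods typically quantify generated-image quality with a single scalar and cannot give comprehensive, interpretable feedback on image quality.

Result: ImageDoctor aligns strongly with human preference across multiple datasets; used as a reward model, it improves generation quality by 10% over scalar-based reward models.

Insight: Multi-dimensional evaluation and pixel-level feedback guide T2I model optimization more comprehensively.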

Abstract: The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a “look-think-predict” paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality – achieving an improvement of 10% over scalar-based reward models.


[65] Code2Video: A Code-centric Paradigm for Educational Video Generation cs.CV | cs.AI | cs.CL | cs.HC | cs.MM [PDF]

Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou

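TL;DR: The paper presents Code2Video, a framework that generates educational videos from executable Python code, combining planning, coding, and vision-language-model refinement, and outperforming direct code generation in educational settings.

Details

Motivation: Generative models have advanced pixel-space video synthesis but struggle to meet educational videos' demands for disciplinary knowledge, precise visual structure, and coherent transitions; a more controllable rendering environment is needed.

Result: Code2Video improves on direct code generation by 40% and produces videos comparable to human-crafted tutorials.

Insight: A code-centric paradigm makes educational video generation more controllable and interpretable, and multi-agent collaboration with vision-language models markedly improves quality.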

Abstract: While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions, limiting their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLMs) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and particularly, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.


[66] Secure and reversible face anonymization with diffusion models cs.CV | cs.LG [PDF]

Pol Labarbarie, Vincent Itier, William Puech

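TL;DR: The paper proposes a secure, reversible face anonymization method based on diffusion models that combines a secret key with a facial mask to produce high-quality anonymized images, and recovers the original face given the correct key.

Details

Motivation: Face images processed by computer vision algorithms can leak sensitive information, and existing anonymization methods struggle to deliver high-quality generation, security, and reversibility at once.

Result: The anonymized faces are high quality and less visually similar to the originals, and only authorized parties holding the correct key can recover the original face.

Insight: Combining the generative power of diffusion models with a secret-key security mechanism strikes a better balance between privacy protection and identity authentication.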

Abstract: Face images processed by computer vision algorithms contain sensitive personal information that malicious actors can capture without consent. These privacy and security risks highlight the need for effective face anonymization methods. Current methods struggle to offer a good trade-off between a secure scheme with high-quality image generation and reversibility for later person authentication. Diffusion-based approaches produce high-quality anonymized images but lack the secret key mechanism to ensure that only authorized parties can reverse the process. In this paper, we introduce, to our knowledge, the first secure, high-quality reversible anonymization method based on a diffusion model. We propose to combine the secret key with the latent face representation of the diffusion model. To preserve identity-irrelevant features, generation is constrained by a facial mask, maintaining high-quality images. By using a deterministic forward and backward diffusion process, our approach enforces that the original face can be recovered with the correct secret key. We also show that the proposed method produces anonymized faces that are less visually similar to the original faces, compared with previous work.


[67] KeySG: Hierarchical Keyframe-Based 3D Scene Graphs cs.CV | cs.RO [PDF]

Abdelrhman Werby, Dennis Rotondi, Fabio Scaparro, Kai O. Arras

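TL;DR: KeySG is a keyframe-based hierarchical 3D scene graph framework that augments nodes with multi-modal information and uses a vision-language model (VLM) to extract scene information, addressing the graph-size and semantic limitations of earlier methods.

Details

Motivation: Conventional 3D scene graph methods are limited in semantic relationships and scalability, and cannot support machine reasoning and navigation in complex human-centered environments.

Result: Across four benchmarks (including 3D object segmentation and complex query retrieval), KeySG outperforms existing methods on most metrics, demonstrating its semantic richness and efficiency.

Insight: The hierarchical structure and the use of keyframes increase the expressiveness of 3D scene graphs while reducing compute and storage costs.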

Abstract: In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM’s context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLM to extract scene information, alleviating the need to explicitly model relationship edges between objects, enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across four distinct benchmarks – including 3D object segmentation and complex query retrieval – KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.


[68] Instant4D: 4D Gaussian Splatting in Minutes cs.CV [PDF]

Zhanpeng Luo, Haoxi Ran, Li Lu

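TL;DR: Instant4D is a monocular reconstruction system based on 4D Gaussian splatting that processes uncalibrated casual video within minutes, greatly improving the efficiency of dynamic scene reconstruction.

Details

Motivation: Dynamic view synthesis has advanced, but reconstructing scenes from uncalibrated casual video remains challenging because of slow optimization and complex parameter estimation.

Result: On the Dycheck dataset a single video is reconstructed within 10 minutes; training time for a typical 200-frame video drops sharply while performance stays competitive.

Insight: A 4D Gaussian representation is efficient and practical for dynamic scene reconstruction and suits fast processing of uncalibrated video.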

Abstract: Dynamic view synthesis has seen significant advances, yet reconstructing scenes from uncalibrated, casual video remains challenging due to slow optimization and complex parameter estimation. In this work, we present Instant4D, a monocular reconstruction system that leverages a native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors. Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30x speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. Our method reconstructs a single video within 10 minutes, whether from the Dycheck dataset or a typical 200-frame video. We further apply our model to in-the-wild videos, showcasing its generalizability. Our project website is published at https://instant4d.github.io/.


[69] Strategic Fusion of Vision Language Models: Shapley-Credited Context-Aware Dawid-Skene for Multi-Label Tasks in Autonomous Driving cs.CV | cs.RO [PDF]

Yuxiang Feng, Keyang Zhang, Hassane Ouchouid, Ashwil Kaniamparambil, Ioannis Souflas

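TL;DR: The paper proposes a game-theoretic fusion method, a context-aware Dawid-Skene model with Shapley-value credit, for fusing vision-language models on multi-label tasks in autonomous driving, with significant performance gains.

Details

Motivation: Vision-language models (VLMs) are increasingly used in autonomous driving, but hallucination undermines their reliability. The paper addresses reliability in multi-label tasks through a fusion method.

Result: Relative to the best single model, Hamming distance falls by 23%, and Macro-F1 and Micro-F1 improve by 55% and 47% respectively.

Insight: The method improves multi-model fusion while preserving signals that only a single model gets right, and suits autonomous driving in dynamic environments.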

Abstract: Large vision-language models (VLMs) are increasingly used in autonomous-vehicle (AV) stacks, but hallucination limits their reliability in safety-critical pipelines. We present Shapley-credited Context-Aware Dawid-Skene with Agreement, a game-theoretic fusion method for multi-label understanding of ego-view dashcam video. It learns per-model, per-label, context-conditioned reliabilities from labelled history and, at inference, converts each model’s report into an agreement-guardrailed log-likelihood ratio that is combined with a contextual prior and a public reputation state updated via Shapley-based team credit. The result is calibrated, thresholdable posteriors that (i) amplify agreement among reliable models, (ii) preserve uniquely correct single-model signals, and (iii) adapt to drift. To specialise general VLMs, we curate 1,000 real-world dashcam clips with structured annotations (scene description, manoeuvre recommendation, rationale) via an automatic pipeline that fuses HDD ground truth, vehicle kinematics, and YOLOv11 + BoT-SORT tracking, guided by a three-step chain-of-thought prompt; three heterogeneous VLMs are then fine-tuned with LoRA. We evaluate with Hamming distance, Micro-Macro-F1, and average per-video latency. Empirically, the proposed method achieves a 23% reduction in Hamming distance, 55% improvement in Macro-F1, and 47% improvement in Micro-F1 when comparing with the best single model, supporting VLM fusion as a calibrated, interpretable, and robust decision-support component for AV pipelines.
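The core of Dawid-Skene-style fusion can be sketched for one binary label: each model's report is converted to a log-likelihood ratio from its estimated reliability, and the ratios are added to the prior log-odds. The sketch below shows only this basic step, assuming fixed per-model sensitivity and specificity; it omits the paper's context conditioning, agreement guardrails, and Shapley-based reputation updates, and the function names and reliability values are hypothetical.

```python
import math


def report_llr(report, sensitivity, specificity):
    """Log-likelihood ratio of one model's binary report for a label.

    sensitivity = P(report 1 | label present)
    specificity = P(report 0 | label absent)
    """
    if report:
        return math.log(sensitivity / (1.0 - specificity))
    return math.log((1.0 - sensitivity) / specificity)


def fuse(reports, reliabilities, prior=0.5):
    """Posterior probability that the label is present, assuming the models
    err independently given the true label."""
    logit = math.log(prior / (1.0 - prior))
    for report, (sens, spec) in zip(reports, reliabilities):
        logit += report_llr(report, sens, spec)
    return 1.0 / (1.0 + math.exp(-logit))


# One reliable model and two weak ones (illustrative values).
reliabilities = [(0.9, 0.9), (0.6, 0.6), (0.6, 0.6)]

p_agree = fuse([1, 1, 1], reliabilities)   # all three report the label
p_split = fuse([1, 0, 0], reliabilities)   # only the reliable model reports it
```

In the split case the posterior is still about 0.8: the reliable model's evidence (odds of 9 to 1) outweighs the two weak dissenters (odds of 4 to 9 combined). This is the sense in which reliability-weighted fusion can preserve a uniquely correct single-model signal that a majority vote would discard.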


[70] EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory cs.CV [PDF]

Jiahao Wang, Luoxin Ye, TaiMing Lu, Junfei Xiao, Jiahan Zhang

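TL;DR: EvoWorld is a world model that couples panoramic video generation with an evolving 3D memory, using explicit 3D geometric guidance to improve the spatial consistency and visual realism of generated video.

Details

Motivation: Inspired by humans' ability to mentally explore and replay 3D environments, EvoWorld aims to emulate this ability for spatially consistent long-horizon world modeling.

Result: Experiments show that EvoWorld outperforms existing methods in visual realism and spatial consistency, especially in long-horizon exploration.

Insight: The explicit 3D memory is the key innovation: it supplies rich spatial cues to video generation and addresses long-horizon consistency.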

Abstract: Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long-horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine-grained view control, then evolves the scene’s 3D reconstruction using a feedforward plug-and-play transformer, and finally synthesizes futures by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state-of-the-art methods that only synthesize videos, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long-range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real-world scenarios, with particular emphasis on loop-closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long-horizon spatially consistent world modeling.


[71] IMAGEdit: Let Any Subject Transform cs.CV [PDF]

Fei Shen, Weihao Xu, Rui Yan, Dong Zhang, Xiangbo Shu

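TL;DR: IMAGEdit is a training-free framework for multi-subject video editing that uses multimodal conditioning and precise mask sequences, with no finetuning or retraining.

Details

Motivation: Existing video editing methods suffer from insufficient prompt-side multimodal conditioning and entangled mask boundaries, which limits their applicability to multi-subject video editing.

Result: On the MSVBench benchmark, IMAGEdit outperforms existing methods, confirming its generalization ability and editing quality.

Insight: IMAGEdit's generality and compatibility let it plug into a range of mask-driven video generation models, advancing the field of video editing.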

Abstract: In this paper, we present IMAGEdit, a training-free framework for editing any number of video subjects that manipulates the appearances of multiple designated subjects while preserving non-target regions, without finetuning or retraining. We achieve this by providing robust multimodal conditioning and precise mask sequences through a prompt-guided multimodal alignment module and a prior-based mask retargeting module. We first leverage large models’ understanding and generation capabilities to produce multimodal information and mask motion sequences for multiple subjects of various types. Then, the obtained prior mask sequences are fed into a pretrained mask-driven video generation model to synthesize the edited video. With strong generalization capability, IMAGEdit remedies insufficient prompt-side multimodal conditioning and overcomes mask boundary entanglement in videos with any number of subjects, thereby significantly expanding the applicability of video editing. More importantly, IMAGEdit is compatible with any mask-driven video generation model, significantly improving overall performance. Extensive experiments on our newly constructed multi-subject benchmark MSVBench verify that IMAGEdit consistently surpasses state-of-the-art methods. Code, models, and datasets are publicly available at https://github.com/XWH-A/IMAGEdit.


cs.CL [Back]

[72] TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding cs.CL [PDF]

Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Ken Fukuda, Teruko Mitamura

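TL;DR: TAMA is a tool-augmented multimodal agent that performs multimodal reasoning with multimedia-returning tools in a training-free setting, improving vision-language models on procedural activity understanding.

Details

Motivation: Procedural activity assistants have broad potential in daily life and professional settings, but system development for them remains underexplored; the TAMA framework is proposed to fill this gap.

Result: On the ProMQA-Assembly dataset, TAMA improves the performance of vision-language models, especially GPT-5 and MiMo-VL.

Insight: TAMA's design advances the thinking-with-images paradigm for multimodal tasks and supports the development of procedural activity assistants.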

Abstract: Procedural activity assistants potentially support humans in a variety of settings, from our daily lives, e.g., cooking or assembling flat-pack furniture, to professional situations, e.g., manufacturing or biological experiments. Despite these potential use cases, system development tailored to such assistants is still underexplored. In this paper, we propose a novel framework, called TAMA, a Tool-Augmented Multimodal Agent, for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset, ProMQA-Assembly, show that our approach can improve the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of two features that characterize our framework, multimedia-returning tools and agentic flexible tool selection. We believe our proposed framework and experimental results facilitate the thinking-with-images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.


[73] PrimeX: A Dataset of Worldview, Opinion, and Explanation cs.CLPDF

Rik Koncel-Kedziorski, Brihi Joshi, Tim Paek

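TL;DR: PrimeX is a dataset of worldview, opinion, and explanation data intended to help language models align better with individual users.

Details

Motivation: As language models are adopted more widely, individual users' belief systems need to be better represented to the model in order to improve alignment.

Result: The analysis shows the value of belief explanations and worldview information for personalizing language models, and opens avenues for further study in both NLP and psychological research.

Insight: PrimeX offers a new resource for studying how individual belief systems affect language model alignment, bridging NLP and psychology.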

Abstract: As the adoption of language models advances, so does the need to better represent individual users to the model. Are there aspects of an individual’s belief system that a language model can utilize for improved alignment? Following prior research, we investigate this question in the domain of opinion prediction by developing PrimeX, a dataset of public opinion survey data from 858 US residents with two additional sources of belief information: written explanations from the respondents for why they hold specific opinions, and the Primal World Belief survey for assessing respondent worldview. We provide an extensive initial analysis of our data and show the value of belief explanations and worldview for personalizing language models. Our results demonstrate how the additional belief information in PrimeX can benefit both the NLP and psychological research communities, opening up avenues for further study.


[74] Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It cs.CL | cs.AIPDF

Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh

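TL;DR: The paper introduces the notion of personalized reasoning, shows the limitations of current large language models (LLMs) on just-in-time personalization, and presents the PREFDISCO evaluation methodology, which exposes weaknesses in models' interactive capabilities.

Details

Motivation: Current LLM development treats task-solving and preference alignment as separate challenges, so models fail to meet user needs in just-in-time scenarios where no interaction history exists, for example under cold-start conditions or privacy constraints.

Result: Across 21 frontier models and 10 tasks, 29.0% of naive personalization attempts yield worse preference alignment than generic responses, while generic responses also fail to serve individual user needs. This suggests that personalized reasoning requires dedicated development.

Insight: Personalized reasoning is a measurable research frontier. Current LLMs have fundamental limitations in interactive capability, and further work is needed for domains such as education and healthcare where personalization is critical.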

Abstract: Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user’s needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don’t know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly – a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs’ interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.


[75] TASER: Translation Assessment via Systematic Evaluation and Reasoning cs.CL | cs.AIPDF

Monishwaran Maheswaran, Marco Carini, Christian Federmann, Tony Diaz

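TL;DR: TASER is a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. With systematic, step-by-step evaluation, it achieves state-of-the-art performance on the WMT24 Metrics Shared Task.

Details

Motivation: Existing automated translation quality metrics lack transparency and interpretability. TASER uses the explicit reasoning of LRMs to address this while providing more accurate assessment.

Result: On the WMT24 Metrics Shared Task: 1) at the system level, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings; 2) at the segment level, it remains competitive, and its reference-free variant ranks first among all reference-free approaches.

Insight: 1) The explicit reasoning of LRMs improves both accuracy and interpretability; 2) structured prompting templates work better with LRMs than the open-ended approaches that proved optimal for traditional LLMs; 3) reasoning depth is related to evaluation quality.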

Abstract: We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.


[76] Judging with Confidence: Calibrating Autoraters to Preference Distributions cs.CLPDF

Zhuohang Li, Xiaowei Li, Chengyu Huang, Guowang Li, Katayoon Goshvadi

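TL;DR: The paper proposes a framework for calibrating probabilistic autoraters so that they model the preference distribution of a target population, improving their reliability and calibration.

Details

Motivation: Aligning large language models (LLMs) with human values increasingly relies on other LLMs acting as autoraters, but training on discrete preference labels forces a single ground truth onto tasks that are often subjective, ambiguous, or nuanced, which limits reliability. A method that models the full preference distribution is needed.

Result: Autoraters finetuned with a distribution-matching objective produce verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, while preserving performance on objective tasks.

Insight: Modeling the full preference distribution improves autorater reliability, especially on subjective and ambiguous tasks. This suggests that LLM alignment should focus more on distribution modeling than on discrete labels.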

Abstract: The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or “autoraters”. However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.


[77] Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction cs.CL | cs.AIPDF

Zhexiong Liu, Diane Litman

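TL;DR: The paper proposes IR-Tuning, a layer-wise parameter-efficient fine-tuning method for LLMs that targets revision intention prediction, addressing LLMs' weaknesses on classification and the scarcity of annotated data.

Details

Motivation: Large language models (LLMs) excel at text generation but fall short on text classification tasks such as revision intention prediction, and annotated revision data is scarce.

Result: IR-Tuning outperforms several layer-wise PEFT baselines across diverse text revisions and remains effective on small revision corpora.

Insight: Dynamic layer selection combined with parameter-efficient fine-tuning can improve LLM classification ability, particularly when data is scarce.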

Abstract: Large Language Models (LLMs) have shown extraordinary success across various text generation tasks; however, their potential for simple yet essential text classification remains underexplored, as LLM pre-training tends to emphasize generation over classification. While LLMs with instruction tuning can transform classification into a generation task, they often struggle to categorize nuanced texts. One such example is text revision, which involves nuanced edits between pairs of texts. Although simply fine-tuning LLMs for revision classification seems plausible, it requires a large amount of revision annotations, which are exceptionally expensive and scarce in the community. To address this issue, we introduce a plug-and-play layer-wise parameter-efficient fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of important LLM layers that are dynamically selected based on their gradient norm distribution, while freezing those of redundant layers. Extensive experiments suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse text revisions, while achieving fast convergence, low GPU memory consumption, and effectiveness on small revision corpora.
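The layer-selection idea can be illustrated with a minimal sketch. The abstract only states that important layers are selected dynamically from the gradient norm distribution; the z-score rule, the function name `select_layers`, and the sample values below are illustrative assumptions, not the paper's actual criterion.

```python
import statistics


def select_layers(grad_norms, z_threshold=0.0):
    """Return indices of layers to fine-tune; all other layers stay frozen.

    Assumption (not from the paper): a layer counts as "important" when its
    gradient norm lies more than z_threshold standard deviations above the mean.
    """
    mean = statistics.fmean(grad_norms)
    std = statistics.pstdev(grad_norms)
    if std == 0:
        # All layers look identical, so there is no basis for freezing any.
        return list(range(len(grad_norms)))
    return [i for i, g in enumerate(grad_norms) if (g - mean) / std > z_threshold]


# Hypothetical per-layer gradient norms collected from a few training batches.
norms = [0.1, 0.2, 2.0, 1.5, 0.05]
print(select_layers(norms))
```

In a real training loop the selection would presumably be recomputed periodically, since the abstract describes it as dynamic.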


[78] CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage cs.CLPDF

Bowen Wei, Yuan Shen Tay, Howard Liu, Jinhao Pan, Kun Luo

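TL;DR: CORTEX is a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate to make decisions more transparent and accurate.

Details

Motivation: Security Operations Centers (SOCs) are overloaded with tens of thousands of alerts per day. Classical detection pipelines are brittle and context-poor, while single-model LLM approaches struggle with noisy enterprise data and offer limited transparency.

Result: Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over single-agent LLMs.

Insight: Dividing work among specialized agents helps with noisy data and transparency, suggesting a direction for applying LLMs to highly complex tasks.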

Abstract: Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts, with only a small fraction corresponding to genuine attacks. This overload creates alert fatigue, leading to overlooked threats and analyst burnout. Classical detection pipelines are brittle and context-poor, while recent LLM-based approaches typically rely on a single model to interpret logs, retrieve context, and adjudicate alerts end-to-end – an approach that struggles with noisy enterprise data and offers limited transparency. We propose CORTEX, a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate over real evidence: a behavior-analysis agent inspects activity sequences, evidence-gathering agents query external systems, and a reasoning agent synthesizes findings into an auditable decision. To support training and evaluation, we release a dataset of fine-grained SOC investigations from production environments, capturing step-by-step analyst actions and linked tool outputs. Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over state-of-the-art single-agent LLMs.


[79] TokMem: Tokenized Procedural Memory for Large Language Models cs.CLPDF

Zijun Wu, Yongchang Hao, Lili Mou

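TL;DR: TokMem is a tokenized procedural memory that gives large language models an efficient way to specify tasks and recall knowledge, avoiding the inefficiency of repeated prompting.

Details

Motivation: Large language models (LLMs) rely heavily on prompts, but prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse. TokMem is proposed to address these problems.

Result: On 1,000 atomic-recall tasks and on function-calling tasks, TokMem outperforms retrieval-augmented generation while avoiding repeated context overhead, and outperforms fine-tuning with far fewer parameters.

Insight: TokMem offers a scalable, modular alternative to prompt engineering and fine-tuning.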

Abstract: Large language models rely heavily on prompts to specify tasks, recall knowledge and guide reasoning. However, this reliance is inefficient as prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse. We introduce TokMem, a tokenized procedural memory that stores recurring procedures as compact, trainable embeddings. Each memory token encodes both an address to a procedure and a control signal that steers generation, enabling targeted behavior with constant-size overhead. To support continual adaptation, TokMem keeps the backbone model frozen, allowing new procedures to be added without interfering with existing ones. We evaluate TokMem on 1,000 tasks for atomic recall, and on function-calling tasks for compositional recall, where it consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and outperforms fine-tuning with far fewer parameters. These results establish TokMem as a scalable and modular alternative to prompt engineering and fine-tuning, offering an explicit procedural memory for LLMs.


[80] LongCodeZip: Compress Long Context for Code Language Models cs.CL | cs.SEPDF

Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, Xiaodong Gu

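TL;DR: LongCodeZip is a dual-stage compression framework designed for code language models. Coarse-grained and fine-grained compression substantially shorten the context without degrading task performance.

Details

Motivation: Code generation requires long contexts, where high API costs and latency are major bottlenecks, and existing compression methods overlook code structure, leading to suboptimal performance.

Result: Across multiple tasks, LongCodeZip achieves up to a 5.6x compression ratio without degrading performance, making it suitable for large-scale code scenarios.

Insight: Code-specific compression outperforms general-purpose methods, and long contexts can be handled efficiently with structure-aware techniques.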

Abstract: Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
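The two quantities the abstract relies on, perplexity and subset selection under a token budget, can be sketched as follows. This is a minimal illustration under stated assumptions: relevance scores are supplied by the caller (the paper derives them from perplexity with respect to the instruction), the budget is fixed rather than adaptive, and treating the selection as a 0/1 knapsack is an assumption, as are the names `perplexity` and `select_blocks`.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a token sequence from its natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_blocks(blocks, budget):
    """Pick blocks maximizing total relevance within a token budget.

    blocks: list of (name, token_count, relevance) tuples (hypothetical format).
    Solved as a 0/1 knapsack by dynamic programming; returns the chosen names
    in their original order so the compressed code keeps its layout.
    """
    n = len(blocks)
    # best[i][b] = highest relevance using the first i blocks within b tokens.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, cost, value) in enumerate(blocks, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if cost <= b and best[i - 1][b - cost] + value > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + value
    # Walk the table backwards to recover which blocks were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, cost, _ = blocks[i - 1]
            chosen.append(name)
            b -= cost
    return chosen[::-1]


# Hypothetical blocks of one retained function: (name, tokens, relevance).
blocks = [("imports", 10, 0.2), ("helper", 40, 0.9), ("main", 50, 1.0), ("tests", 60, 0.5)]
print(select_blocks(blocks, budget=100))
```

The dynamic program costs O(n × budget) time, which is acceptable for block counts within a single function; a greedy relevance-per-token heuristic would be a cheaper, non-optimal alternative.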


[81] Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews cs.CLPDF

Koki Ryu, Hitomi Yanaka

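TL;DR: The paper studies off-the-shelf large language models (LLMs) on rating prediction, finds that user-written reviews significantly improve performance, and shows that having the model first generate a hypothetical review improves it further.

Details

Motivation: Personalizing LLM outputs to user preferences is an active research area, but prior work has focused on classification or ranking and has not considered rating prediction, a task that requires both language and mathematical reasoning.

Result: User reviews significantly improve LLM rating prediction, reaching performance comparable to traditional matrix factorization. Reviews of concrete items are more effective than general preference descriptions.

Insight: The paper shows the importance of user reviews for rating prediction, points to LLMs as a promising approach to the cold-start problem, and demonstrates their potential on regression tasks.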

Abstract: Personalizing the outputs of large language models (LLMs) to align with individual user preferences is an active research area. However, previous studies have mainly focused on classification or ranking tasks and have not considered Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning to be solved effectively. This task has significant industrial applications, but the utilization of LLMs remains underexplored, particularly regarding the capabilities of off-the-shelf LLMs. This study investigates the performance of off-the-shelf LLMs on rating prediction, providing different in-context information. Through comprehensive experiments with eight models across three datasets, we demonstrate that user-written reviews significantly improve the rating prediction performance of LLMs. This result is comparable to traditional methods like matrix factorization, highlighting the potential of LLMs as a promising solution for the cold-start problem. We also find that the reviews for concrete items are more effective than general preference descriptions that are not based on any specific item. Furthermore, we discover that prompting LLMs to first generate a hypothetical review enhances the rating prediction performance. Our code is available at https://github.com/ynklab/rating-prediction-with-reviews.


[82] Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains cs.CLPDF

Yawen Xue, Masaya Tsunokake, Yuta Koreeda, Ekant Muljibhai Amin, Takashi Sumiyoshi

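TL;DR: The paper studies agent fine-tuning with distilled reasoning trajectories for LLMs in a specialized technical microdomain (Hitachi's JP1 middleware), improving decision-making accuracy and search efficiency.

Details

Motivation: Agentic LLMs mostly perform multi-step reasoning through in-context learning, which leads to lengthy inputs and high computational cost, and the effectiveness of agent fine-tuning in technical microdomains remains unclear. The paper explores it for JP1 middleware.

Result: On JP1 certification exam questions, the method achieves a 14% performance improvement over the base model, demonstrating the potential of agent fine-tuning in complex microdomains.

Insight: Agent fine-tuning that combines domain-specific data with distilled reasoning trajectories can improve LLM reasoning and practical usefulness in technical microdomains.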

Abstract: Agentic large language models (LLMs) have become prominent for autonomously interacting with external environments and performing multi-step reasoning tasks. Most approaches leverage these capabilities via in-context learning with few-shot prompts, but this often results in lengthy inputs and higher computational costs. Agent fine-tuning offers an alternative by enabling LLMs to internalize procedural reasoning and domain-specific knowledge through training on relevant data and demonstration trajectories. While prior studies have focused on general domains, their effectiveness in specialized technical microdomains remains unclear. This paper explores agent fine-tuning for domain adaptation within Hitachi’s JP1 middleware, a microdomain for specialized IT operations. We fine-tuned LLMs using JP1-specific datasets derived from domain manuals and distilled reasoning trajectories generated by LLMs themselves, enhancing decision making accuracy and search efficiency. During inference, we used an agentic prompt with retrieval-augmented generation and introduced a context-answer extractor to improve information relevance. On JP1 certification exam questions, our method achieved a 14% performance improvement over the base model, demonstrating the potential of agent fine-tuning for domain-specific reasoning in complex microdomains.


[83] Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations cs.CLPDF

Pengzhou Cheng, Lingzhong Dong, Zeng Wu, Zongru Wu, Xiangru Tang

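TL;DR: The paper proposes Agent-ScanKit, a framework that quantifies the memorization and reasoning of multimodal agents through three orthogonal probing paradigms (visual-guided, text-guided, and structure-guided), and finds that existing models rely mostly on mechanical memorization rather than systematic reasoning.

Details

Motivation: Although the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs) have improved, their reliability on complex or out-of-domain tasks remains limited, raising the question of whether existing agents reason spuriously.

Result: Across five public GUI benchmarks and 18 multimodal agents, mechanical memorization often outweighs systematic reasoning. Most models act mainly as retrievers of training-aligned knowledge and generalize poorly.

Insight: Multimodal agents need robust reasoning modeling for real-world scenarios, and the findings inform the development of reliable agents.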

Abstract: Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. In five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.


[84] MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance cs.CL | cs.AIPDF

Xingjian Zhao, Zhe Xu, Luozhijie Jin, Yang Wang, Hanfu Chen

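TL;DR: MOSS-Speech is a true speech-to-speech large language model that works without text guidance. A modality-based layer-splitting architecture preserves the knowledge and reasoning of the pretrained text LLM.

Details

Motivation: Spoken dialogue systems usually rely on cascaded pipelines (transcription, text processing, speech synthesis), which discard paralinguistic cues and limit expressivity. Recent end-to-end methods reduce latency and preserve these cues better, but still depend on text intermediates, which creates a bottleneck.

Result: The model achieves state-of-the-art results on spoken question answering and speech-to-speech performance comparable to existing text-guided systems, while maintaining competitive text performance.

Insight: The work narrows the gap between text-guided and direct speech generation and offers a new paradigm for expressive, efficient end-to-end speech interaction.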

Abstract: Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.


[85] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs cs.CL | cs.AIPDF

Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao

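TL;DR: Graph2Eval is a knowledge graph-based framework that automatically generates multimodal tasks to evaluate agents on multi-step interaction in dynamic environments.

Details

Motivation: Static datasets and existing LLM-based synthetic data methods cannot adequately assess agents on dynamic tasks and multi-step interaction, particularly in multimodal and web environments.

Result: Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction.

Insight: Knowledge graphs are an effective basis for generating diverse tasks, and dynamically generated tasks assess agents' actual capabilities more realistically.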

Abstract: As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents’ reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.


[86] Copy-Paste to Mitigate Large Language Model Hallucinations cs.CL | cs.AIPDF

Yongchao Long, Xian Wu, Yingying Zhang, Xianbin Wen, Yuxi Zhou

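TL;DR: The paper proposes CopyPasteLLM, obtained through two-stage high-copying response preference training, which substantially reduces large language model (LLM) hallucinations and performs best on several datasets.

Details

Motivation: Retrieval-Augmented Generation (RAG) provides contextual grounding, but LLMs may still produce answers that are unfaithful to the context (hallucinations), which undermines reliability. The authors observe an inverse correlation between copying degree and hallucination, and therefore propose high-copying training.

Result: CopyPasteLLM achieves the best performance on FaithEval, ConFiQA, and PubMedQA, with 12.2% to 24.5% accuracy improvements on FaithEval over the best baseline, using only 365 training samples (1/50th of the baseline data).

Insight: CopyPasteLLM reduces hallucinations by recalibrating reliance on internal parametric knowledge rather than on external knowledge, indicating that high-copying responses improve reliability.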

Abstract: While Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to generate contextually grounded responses, contextual faithfulness remains challenging as LLMs may not consistently trust provided context, leading to hallucinations that undermine reliability. We observe an inverse correlation between response copying degree and context-unfaithful hallucinations on RAGTruth, suggesting that higher copying degrees reduce hallucinations by fostering genuine contextual belief. We propose CopyPasteLLM, obtained through two-stage high-copying response preference training. We design three prompting methods to enhance copying degree, demonstrating that high-copying responses achieve superior contextual faithfulness and hallucination control. These approaches enable a fully automated pipeline that transforms generated responses into high-copying preference data for training CopyPasteLLM. On FaithEval, ConFiQA and PubMedQA, CopyPasteLLM achieves best performance in both counterfactual and original contexts, remarkably with 12.2% to 24.5% accuracy improvements on FaithEval over the best baseline, while requiring only 365 training samples – 1/50th of baseline data. To elucidate CopyPasteLLM’s effectiveness, we propose the Context-Parameter Copying Capturing algorithm. Interestingly, this reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation. All codes are available at https://github.com/longyongchao/CopyPasteLLM


[87] JoyAgent-JDGenie: Technical Report on the GAIA cs.CLPDF

Jiarun Liu, Shiyue Xu, Shangkun Liu, Yang Li, Wen Liu

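TL;DR: JoyAgent-JDGenie proposes a generalist agent architecture that improves robustness and adaptability on complex tasks through multi-agent collaboration, a hierarchical memory system, and a refined tool suite.

Details

Motivation: Existing LLM agent systems often focus on isolated improvements and lack a unifying system-level design. The authors address this by integrating multi-agent collaboration, hierarchical memory, and tools.

Result: On a comprehensive benchmark, the framework outperforms open-source baselines and approaches the performance of proprietary systems.

Insight: System-level integration is key to scalable, resilient, and adaptive AI assistants, and multi-agent and hierarchical designs suit diverse tasks.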

Abstract: Large Language Models are increasingly deployed as autonomous agents for complex real-world tasks, yet existing systems often focus on isolated improvements without a unifying design for robustness and adaptability. We propose a generalist agent architecture that integrates three core components: a collective multi-agent framework combining planning and execution agents with critic model voting, a hierarchical memory system spanning working, semantic, and procedural layers, and a refined tool suite for search, code execution, and multimodal parsing. Evaluated on a comprehensive benchmark, our framework consistently outperforms open-source baselines and approaches the performance of proprietary systems. These results demonstrate the importance of system-level integration and highlight a path toward scalable, resilient, and adaptive AI assistants capable of operating across diverse domains and tasks.


[88] EuroSpeech: A Multilingual Speech Corpus cs.CL | cs.LGPDF

Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer

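TL;DR: EuroSpeech introduces a scalable pipeline for building multilingual speech datasets from parliamentary recordings, which substantially improves speech recognition performance after fine-tuning.

Details

Motivation: Existing multilingual speech datasets cover many languages but contain too little data for most of them, so trained models perform poorly on the majority of supported languages. EuroSpeech addresses this with a large-scale, high-quality dataset.

Result: The pipeline extracts over 61k hours of aligned speech. Fine-tuning an existing ASR model on the dataset reduces word error rate by 41.8% on average over baselines.

Insight: Parliamentary recordings are a valuable source for building high-quality multilingual speech datasets and can support speech processing for low-resource languages.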

Abstract: Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
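For readers unfamiliar with the metric, the sketch below shows the standard word error rate (word-level edit distance divided by reference length) and a relative reduction between two WER values. It assumes whitespace tokenization and that the reported 41.8% figure is a relative reduction; the function names and example sentences are illustrative, not from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction by which the fine-tuned WER is lower than the baseline WER."""
    return (baseline_wer - finetuned_wer) / baseline_wer


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(relative_reduction(0.20, 0.10))
```

Production evaluations usually normalize casing and punctuation before scoring; that step is omitted here for brevity.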


[89] Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum cs.CL | cs.LGPDF

Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

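TL;DR: The paper examines the limitations of the negative log likelihood (NLL) objective in supervised fine-tuning (SFT) and studies a family of probability-based objectives, showing that the best objective depends on where the model sits on the model-capability continuum.

Details

Motivation: SFT with the standard NLL objective often generalizes poorly, especially when the model already encodes task-relevant priors and the supervision is long and noisy.

Result: Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$) outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails.

Insight: The effectiveness of an objective depends strongly on model capability, which provides a principled basis for adapting the objective to the model.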

Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
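To see why objectives such as $-p$ and $-p^{10}$ downweight low-probability tokens relative to NLL, it helps to compare how strongly each per-token loss responds to a change in the target token's probability $p$. The sketch below does this numerically; the function names and the sample probabilities are illustrative assumptions and are not taken from the paper's code.

```python
import math


def nll(p: float) -> float:
    """Negative log likelihood of a target token with probability p."""
    return -math.log(p)


def neg_p(p: float) -> float:
    """The -p objective named in the abstract."""
    return -p


def neg_p_pow(p: float, k: int = 10) -> float:
    """The -p^k objective; k=10 matches the abstract's example."""
    return -(p ** k)


def grad_wrt_p(loss, p: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d(loss)/dp."""
    return (loss(p + eps) - loss(p - eps)) / (2 * eps)


for p in (0.05, 0.5, 0.95):
    print(p, grad_wrt_p(nll, p), grad_wrt_p(neg_p, p), grad_wrt_p(neg_p_pow, p))
```

Analytically, the derivatives are $-1/p$ for NLL, $-1$ for $-p$, and $-10p^{9}$ for $-p^{10}$, so NLL pushes hardest on tokens the model considers unlikely, while $-p^{10}$ almost ignores them. These are derivatives with respect to $p$ only; the gradient with respect to logits carries an additional softmax factor.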


[90] GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness cs.CLPDF

Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, Chien-Sheng Wu

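TL;DR: The paper proposes GUI-KV, a KV cache compression method for GUI agents built on vision-language models. It uses spatial saliency guidance and temporal redundancy scoring to prune the cache and improve efficiency.

Details

Motivation: GUI agents are inefficient because they process long sequences of high-resolution screenshots, and existing cache-compression methods do not account for the spatial and temporal redundancy of GUIs.

Result: In a 5-screenshot setting on AgentNetBench, GUI-KV reduces decoding FLOPs by 38.9% and increases step accuracy by 4.1% over the full-cache baseline. Across benchmarks, it closely matches full-cache accuracy at modest budgets.

Insight: In GUI workloads, attention sparsity is uniformly high across all transformer layers, so a simple uniform budget allocation outperforms more complex layer-varying schemes.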

Abstract: Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they also face an inefficiency challenge as they process long sequences of high-resolution screenshots and solve long-horizon tasks, making inference slow, costly and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal as they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames’ keys onto the current frame’s key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
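The two scoring ideas can be sketched in plain Python on toy vectors. This is only an illustration of the geometry described in the abstract: the additive combination with weight `alpha`, the use of the full span of the current keys as the subspace, and all function names are assumptions, not the paper's exact formulation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def saliency_score(attention: float, hidden_state, alpha: float = 1.0) -> float:
    """Attention score augmented with the L2 norm of the token's hidden state.

    alpha is a hypothetical mixing weight.
    """
    return attention + alpha * math.sqrt(dot(hidden_state, hidden_state))


def orthonormal_basis(vectors, tol: float = 1e-10):
    """Gram-Schmidt basis for the subspace spanned by the given key vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def redundancy(prev_key, basis) -> float:
    """Fraction of a previous-frame key's energy inside the current key subspace.

    Values near 1 mean the old key adds little new information and is a
    candidate for pruning; values near 0 mean it is largely novel.
    """
    total = dot(prev_key, prev_key)
    if total == 0:
        return 0.0
    return sum(dot(prev_key, b) ** 2 for b in basis) / total


current_keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy keys for the current frame
basis = orthonormal_basis(current_keys)
print(redundancy([2.0, 3.0, 0.0], basis), redundancy([0.0, 0.0, 5.0], basis))
```

A practical implementation would operate on batched tensors and likely use a low-rank approximation of the key subspace rather than an explicit Gram-Schmidt pass.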


[91] ThinkBrake: Mitigating Overthinking in Tool Reasoning cs.CL

Minjae Oh, Sangjun Song, Seungkyu Lee, Sungmin Jo, Yohan Jo

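TL;DR: The paper proposes ThinkBrake, a training-free decoding heuristic that addresses overthinking by small reasoning models (SRMs) during tool use and cuts the number of reasoning tokens.

Details

Motivation: SRMs tend to overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. This leads to inefficiency and redundant reasoning.

Result: Across BFCL's single-turn, non-live, and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.

Insight: Overthinking in tool reasoning causes substantial redundant computation; a simple decoding heuristic such as ThinkBrake can mitigate it and improve reasoning efficiency.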

Abstract: Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject the end-of-thinking token at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potential redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between the end-of-thinking token and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL’s single-turn, non-live, and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.
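The stopping rule described above can be sketched as a small check run at each sentence boundary. This is only an illustration: the token name `<end_think>`, the dictionary input format, and the threshold value are hypothetical, and the abstract does not specify how "small" the margin must be.

```python
def should_stop(logprobs, end_token="<end_think>", threshold=0.1):
    """Return True when the end-of-thinking token is nearly as likely as the
    current top token.

    logprobs: dict mapping candidate next tokens to log-probabilities,
    evaluated at a sentence boundary (input format is an assumption).
    """
    top_token = max(logprobs, key=logprobs.get)
    end_logprob = logprobs.get(end_token, float("-inf"))
    margin = logprobs[top_token] - end_logprob
    return margin <= threshold
```

For example, `should_stop({"<end_think>": -0.70, "So": -0.69})` fires because the margin is about 0.01, whereas a distribution that puts the end token far below the top token lets reasoning continue.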


[92] Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation cs.CL

Yubo Xie, Chenkai Wang, Zongyang Ma, Fahui Miao

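TL;DR: The paper introduces the CHIME dataset and evaluates how well large language models understand Chinese Internet memes, finding that models perform poorly on culturally and linguistically nuanced memes and on tracing their origins.

Details

Motivation: The study asks whether large language models truly understand rapidly spreading viral content (memes), particularly in the Chinese cultural context.

Result: Models can explain the meanings of some memes, but their performance declines on culturally and linguistically nuanced ones; on the fill-in-the-blank multiple-choice task they remain below human level.

Insight: Large language models are still limited in meme understanding, especially on culture- and language-dependent tasks; the dataset can facilitate further research.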

Abstract: Large language models (LLMs) are trained on vast amounts of text from the Internet, but do they truly understand the viral content that rapidly spreads online – commonly known as memes? In this paper, we introduce CHIME, a dataset for CHinese Internet Meme Explanation. The dataset comprises popular phrase-based memes from the Chinese Internet, annotated with detailed information on their meaning, origin, example sentences, types, etc. To evaluate whether LLMs understand these memes, we designed two tasks. In the first task, we assessed the models’ ability to explain a given meme, identify its origin, and generate appropriate example sentences. The results show that while LLMs can explain the meanings of some memes, their performance declines significantly for culturally and linguistically nuanced meme types. Additionally, they consistently struggle to provide accurate origins for the memes. In the second task, we created a set of multiple-choice questions (MCQs) requiring LLMs to select the most appropriate meme to fill in a blank within a contextual sentence. While the evaluated models were able to provide correct answers, their performance remains noticeably below human levels. We have made CHIME public and hope it will facilitate future research on computational meme understanding.


[93] ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards cs.CL

Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen

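TL;DR: ReSeek proposes a self-correcting framework that introduces a dense, instructive reward function and a JUDGE action, letting search agents dynamically identify and correct errors during an episode and improving task success rate and path faithfulness.

Details

Motivation: Existing RL-based search agents often rely on sparse or rule-based rewards, so an agent can commit to an erroneous path without the ability to recover, which hurts task performance.

Result: Experiments show that agents trained with ReSeek significantly outperform existing baselines.

Insight: Dense rewards and a dynamic self-correction mechanism can improve the performance and robustness of search agents.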

Abstract: Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. ReSeek is intuitively reasonable and practically simple, and extensive experiments show that agents trained with it significantly outperform SOTA baselines in task success rate and path faithfulness.


[94] CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs cs.CL

Li Li, Ziyi Wang, Yongliang Wu, Jianfei Cai, Xu Yang

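TL;DR: The paper proposes CoT Vectors, compact low-cost representations that encode task-general, multi-step reasoning knowledge to improve the reasoning of large language models (LLMs), as an alternative to costly approaches such as in-context learning and fine-tuning.

Details

Motivation: Existing CoT implementations, such as in-context learning and fine-tuning, are costly and inefficient, which motivates a cheaper and more efficient way to strengthen LLM reasoning.

Result: Across diverse benchmarks and models, CoT Vectors outperform existing baselines and reach performance comparable to parameter-efficient fine-tuning methods while requiring fewer trainable parameters.

Insight: The effectiveness of CoT Vectors varies with latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs.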

Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher-student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.


[95] MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation cs.CL

Jinlan Fu, Shenzhen Huangfu, Hao Fei, Yichong Huang, Xiaoyu Shen

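TL;DR: MCM-DPO is a multifaceted cross-modal direct preference optimization method for alt-text generation that addresses noisy user annotations and insufficient sensitivity to context.

Details

Motivation: Alt-text generation suffers from noisy and inconsistent annotations, and large vision-language models are insufficiently sensitive to contextual information. Supervised fine-tuning relies on accurate annotations and therefore performs poorly on user-generated data.

Result: Experiments show that MCM-DPO consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation.

Insight: Preference optimization methods such as DPO suit settings with noisy annotations; multifaceted cross-modal preference learning can improve generation performance.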

Abstract: The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs’ insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO
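For readers unfamiliar with the baseline that MCM-DPO extends, the standard DPO objective for a single preference pair can be written in a few lines. The sketch below shows only that standard loss; the `multi_facet_loss` helper, which averages the loss over several preference dimensions, is a hypothetical stand-in, since the abstract does not say how MCM-DPO combines its textual, visual, and cross-modal terms.

```python
import math

def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Arguments are sequence log-probabilities under the policy (pol_*) and
    the frozen reference model (ref_*).
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return neg_log_sigmoid(beta * margin)

def multi_facet_loss(pairs, beta=0.1):
    """Hypothetical aggregation: unweighted mean of the DPO loss over
    preference pairs drawn from different facets."""
    return sum(dpo_loss(*p, beta=beta) for p in pairs) / len(pairs)
```

The loss equals log 2 when the policy and reference agree, and falls below that as the policy assigns relatively more probability to the chosen response.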


[96] Facilitating Cognitive Accessibility with LLMs: A Multi-Task Approach to Easy-to-Read Text Generation cs.CL | cs.AI

François Ledoyen, Gaël Dias, Jeremie Pantin, Alexis Lechervy, Fabrice Maurel

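TL;DR: The paper investigates the potential of large language models (LLMs) to generate Easy-to-Read (ETR) text automatically, using multi-task learning (MTL) over text summarization, text simplification, and ETR generation. It explores two strategies, multi-task retrieval-augmented generation (RAG) and parameter-efficient fine-tuning (MTL-LoRA), and the experiments show the benefit of the multi-task approach.

Details

Motivation: Simplifying complex texts matters especially for people with cognitive impairments, but producing ETR text by hand is time-consuming and resource-intensive; the authors aim to automate the process with LLMs.

Result: Multi-task setups outperform single-task baselines across all configurations; the RAG-based strategy generalizes to out-of-domain settings, while MTL-LoRA performs best in in-domain settings.

Insight: Multi-task learning combines complementary information from different tasks and improves ETR generation; RAG helps the model generalize, while MTL-LoRA offers parameter-efficient fine-tuning.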

Abstract: Simplifying complex texts is essential for ensuring equitable access to information, especially for individuals with cognitive impairments. The Easy-to-Read (ETR) initiative offers a framework for making content accessible to the neurodivergent population, but the manual creation of such texts remains time-consuming and resource-intensive. In this work, we investigate the potential of large language models (LLMs) to automate the generation of ETR content. To address the scarcity of aligned corpora and the specificity of ETR constraints, we propose a multi-task learning (MTL) approach that trains models jointly on text summarization, text simplification, and ETR generation. We explore two different strategies: multi-task retrieval-augmented generation (RAG) for in-context learning, and MTL-LoRA for parameter-efficient fine-tuning. Our experiments with Mistral-7B and LLaMA-3-8B, based on ETR-fr, a new high-quality dataset, demonstrate the benefits of multi-task setups over single-task baselines across all configurations. Moreover, results show that the RAG-based strategy enables generalization in out-of-domain settings, while MTL-LoRA outperforms all learning strategies within in-domain configurations.


Harethah Abu Shairah, Somayah AlHarbi, Abdulaziz AlHussein, Sameer Alsabea, Omar Shaqaqi

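TL;DR: ALARB is a benchmark dataset for Arabic legal argument reasoning, comprising over 13K commercial court cases from Saudi Arabia, for evaluating the multistep reasoning of large language models in the Arabic legal domain.

Details

Motivation: Existing Arabic benchmarks lack datasets that target multistep reasoning in open-ended contexts, especially in the legal domain. ALARB fills this gap and aims to improve the legal reasoning of Arabic large language models.

Result: After instruction tuning on ALARB, a 12B-parameter model improves significantly on verdict prediction and Arabic verdict generation, reaching a level comparable to GPT-4o.

Insight: ALARB shows the importance of domain-specific datasets and tasks for improving large language model performance and opens new directions for AI applications in the Arabic legal domain.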

Abstract: We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset’s utility for instruction tuning. Notably, we show that instruction-tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.


[98] Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese cs.CL

Jenny Kunz, Iben Nyholm Debess, Annika Simonsen

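TL;DR: The paper studies how to adapt small, efficient LLMs to Faroese, a low-resource language, through transfer from related languages and parameter-efficient tuning such as LoRA. Language transfer proves crucial, but the best source language and tuning method depend on the task.

Details

Motivation: Faroese is a low-resource North Germanic language that lacks evaluation data and adapted models. The study explores how related languages (such as Icelandic and Danish) and different tuning methods (LoRA and full fine-tuning) can improve model performance.

Result: (1) Icelandic enhances linguistic accuracy, whereas Danish boosts comprehension; (2) LoRA improves linguistic acceptability, while full fine-tuning yields stronger comprehension and better preserves model capabilities during downstream fine-tuning.

Insight: For low-resource languages, the choice of source language and tuning method has to be weighed per task; language relatedness and the task objective are the key factors.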

Abstract: We investigate how to adapt small, efficient LLMs to Faroese, a low-resource North Germanic language. Starting from English models, we continue pre-training on related Scandinavian languages, either individually or combined via merging, before fine-tuning on Faroese. We compare full fine-tuning with parameter-efficient tuning using LoRA, evaluating their impact on both linguistic accuracy and text comprehension. Due to the lack of existing Faroese evaluation data, we construct two new minimal-pair benchmarks from adapted and newly collected datasets and complement them with human evaluations by Faroese linguists. Our results demonstrate that transfer from related languages is crucial, though the optimal source language depends on the task: Icelandic enhances linguistic accuracy, whereas Danish boosts comprehension. Similarly, the choice between full fine-tuning and LoRA is task-dependent: LoRA improves linguistic acceptability and slightly increases human evaluation scores on the base model, while full fine-tuning yields stronger comprehension performance and better preserves model capabilities during downstream fine-tuning.


[99] Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation cs.CL

Yanming Sun, Runzhe Zhan, Chi Seng Cheang, Han Wu, Xuebo Liu

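TL;DR: The paper studies the vulnerability of retrieval-augmented LLM-based machine translation (REAL-MT) under noisy retrieval and proposes a noise synthesis framework and evaluation metrics. Low-resource language pairs are affected more by noise, and large reasoning models (LRMs) are even more easily misled by it.

Details

Motivation: Although REAL-MT shows promise for knowledge-intensive tasks such as idiomatic translation, its reliability under noisy retrieval contexts has not been studied adequately.

Result: Low-resource language pairs degrade more severely under noise; LRMs show no improvement in error correction and are even more susceptible to noise; attention shifts away from the source idiom toward noisy content, while confidence increases despite declining accuracy.

Insight: Current approaches are limited: mitigation strategies improve robustness under noise at the cost of performance in clean contexts, which underscores the need for self-verifying integration mechanisms.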

Abstract: REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.


[100] ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs cs.CL | I.2.7

Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor

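TL;DR: ManagerBench is a new benchmark for evaluating the trade-off between safety and pragmatism in autonomous LLM decision-making. Current frontier LLMs perform poorly in these scenarios, either choosing pragmatic but harmful options or becoming overly safe and ineffective.

Details

Motivation: As LLMs evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Existing safety benchmarks focus mainly on harmful content generation and overlook the harmful actions an agent may take to reach an operational goal.

Result: Frontier LLMs perform poorly on the safety-pragmatism trade-off: many choose harmful options to advance their operational goals, while others are overly safe and ineffective. Models' harm assessments align with human judgments, but their prioritization is flawed.

Insight: The core cause of the poor performance is flawed prioritization rather than an inability to perceive harm. ManagerBench provides a challenging benchmark for a core component of agentic behavior.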

Abstract: As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model’s pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models’ harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at https://github.com/technion-cs-nlp/ManagerBench.


[101] Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs cs.CL | cs.AI | cs.IR

Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang

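TL;DR: The paper proposes Erasable Reinforcement Learning (ERL), a framework that identifies, erases, and regenerates faulty steps in the reasoning chain to improve the reliability of search-augmented large language models (LLMs) in multi-hop reasoning.

Details

Motivation: Although search-augmented LLMs are capable, their reliability in multi-hop reasoning is limited by decomposition errors, missing retrieval, and reasoning errors. A single failure at any stage can derail the final answer.

Result: On HotpotQA, MuSiQue, 2Wiki, and Bamboogle, the 3B model gains +8.48% EM and +11.56% F1 and the 7B model gains +5.38% EM and +7.22% F1 over previous SOTA results.

Insight: ERL offers a robust approach to multi-step reasoning in LLMs that helps prevent error propagation.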

Abstract: While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
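The identify, erase, and regenerate cycle can be pictured as a simple loop over a reasoning chain. The sketch below illustrates only that correction loop and says nothing about the reinforcement-learning training itself; `is_faulty` and `regenerate` are hypothetical callbacks, and replacing just the faulty step while keeping later steps is an assumption about what "in place" means.

```python
def erase_and_regenerate(steps, is_faulty, regenerate, max_rounds=5):
    """Repeatedly find the first faulty step and rewrite it in place.

    steps:       list of reasoning steps (strings here).
    is_faulty:   callable(prefix, step) -> bool (hypothetical detector).
    regenerate:  callable(prefix) -> replacement step (hypothetical).
    """
    steps = list(steps)  # work on a copy
    for _ in range(max_rounds):
        bad = next(
            (i for i, s in enumerate(steps) if is_faulty(steps[:i], s)),
            None,
        )
        if bad is None:
            break  # no faulty step left
        steps[bad] = regenerate(steps[:bad])  # erase and rewrite in place
    return steps
```

The `max_rounds` cap keeps the loop finite even if the regenerated step is flagged again.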


[102] HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation cs.CL

Loris Bergeron, Ioana Buhnila, Jérôme François, Radu State

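TL;DR: HalluGuard is a 4B-parameter small reasoning model for mitigating hallucinations in retrieval-augmented generation; it classifies document-claim pairs, produces evidence-grounded justifications, and rivals larger models.

Details

Motivation: Large language models (LLMs) excel at many NLP tasks but remain prone to hallucinations, which limits trust in real-world applications and motivates a small, efficient solution.

Result: On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy, on par with larger specialized models; over the full benchmark it reaches 75.7%.

Insight: With synthetic data and preference-based fine-tuning, a small model can detect hallucinations effectively while staying efficient, which makes practical deployment feasible.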

Abstract: Large Language Models (LLMs) excel in many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling specialized models, MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%) while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, matching larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and datasets under Apache 2.0 upon acceptance.
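Balanced accuracy (BAcc), the metric reported above, is the mean of per-class recall, so it is not inflated when grounded claims far outnumber hallucinated ones. A minimal sketch of the standard definition follows; the label strings are illustrative only.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

With three grounded claims of which two are recognized and one hallucinated claim that is caught, recall is 2/3 and 1, giving a balanced accuracy of 5/6, while plain accuracy would be 3/4.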


[103] Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving cs.CL | cs.AI

Shunfeng Zheng, Yudi Zhang, Meng Fang, Zihan Zhang, Zhitan Wu

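TL;DR: The paper explores retrieval-augmented generation (RAG) for solving Olympiad-level physics problems, introduces the high-quality multimodal dataset PhoPile, and shows how RAG can improve the physics reasoning of foundation models.

Details

Motivation: Inspired by the way students prepare for competitions by reviewing past problems, the authors investigate whether RAG can enhance foundation models' reasoning on advanced physics problems.

Result: Integrating retrieval with physics corpora can improve model performance, while also exposing open challenges.

Insight: The paper highlights the importance of multimodal data and retrieval for complex reasoning tasks and points to directions for future research.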

Abstract: Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but its capacity for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.


[104] Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks cs.CL

Eileen Pan, Anna Seo Gyeong Choi, Maartje ter Hoeve, Skyler Seto, Allison Koenecke

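TL;DR: The paper studies the performance drop of large language models (LLMs) on question answering in non-"standard" English dialects, finds that a few specific grammatical rules account for most of the drop, and calls for bias mitigation research targeting those rules.

Details

Motivation: LLMs are widely used in natural language processing but perform worse on under-represented English dialects. The paper analyzes the causes of this gap, in particular the effect of individual grammatical rules.

Result: Accuracy drops by up to 20% on questions rewritten in non-"standard" dialects, and three specific grammar rules (existential "it", zero copula, and y'all) explain the majority of the degradation.

Insight: The paper exposes the limits of LLMs with respect to linguistic diversity and stresses bias mitigation focused on individual, high-impact grammatical structures as a clear direction for future work.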

Abstract: Large language models (LLMs) are ubiquitous in modern day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying “standard” American English language questions as non-“standard” dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-“standard” English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential “it”, zero copula, and y’all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.


[105] Syntax-Guided Diffusion Language Models with User-Integrated Personalization cs.CL | stat.ME

Ruqian Zhang, Yijiao Zhang, Juan Shen, Zhongyi Zhu, Annie Qu

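TL;DR: The paper proposes a syntax-guided diffusion language model that integrates structural supervision and personalized conditioning to enhance text quality, diversity, and controllability.

Details

Motivation: The outputs of large language models tend to be generic, with insufficient structural diversity, which limits personalized expression.

Result: Experiments show that the approach outperforms existing methods in fluency, diversity, and stylistic fidelity.

Insight: Incorporating syntactic information into the generation process lets the model better capture the lexical and structural characteristics of a style; the shared representation mechanism improves personalization and generalization.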

Abstract: Large language models have made revolutionary progress in generating human-like text, yet their outputs often tend to be generic, exhibiting insufficient structural diversity, which limits personalized expression. Recent advances in diffusion models have opened new opportunities for improving language generation beyond the limitations of autoregressive paradigms. In this work, we propose a syntax-guided diffusion language model that integrates structural supervision and personalized conditioning to enhance text quality, diversity, and controllability. We introduce a cascaded framework that generates syntactic guidance before conditional text generation, and further generalize it to a novel noncascaded architecture for better alignment between structure and content. By incorporating syntactic information in the generating process, the proposed model better captures the lexical and structural characteristics of stylistic sentence construction. To enable fine-grained personalization, we develop a shared representation mechanism that facilitates information integration across users, supporting both faithful stylistic generation and generalizable zero-shot inference. Extensive experiments on multiple tasks demonstrate the superiority of our approach in fluency, diversity, and stylistic fidelity. Further qualitative analyses highlight its interpretability and flexibility in learning personalized patterns.


[106] Research on the Integration of Embodied Intelligence and Reinforcement Learning in Textual Domains cs.CL

Haonan Wang, Junfeng Sun, Mingjia Zhao, Wei Liu

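TL;DR: The paper proposes a text-processing model that integrates embodied intelligence and reinforcement learning, drawing on the perception and action strengths of the former and the decision optimization capability of the latter; it is reported to perform well on a range of text tasks.

Details

Motivation: To make text processing more intelligent by combining the perception and action capabilities of embodied intelligence with the decision optimization capability of reinforcement learning.

Result: The model is reported to be effective across a wide range of text processing tasks, indicating application potential.

Insight: Combining embodied intelligence with reinforcement learning offers a new approach to more intelligent text processing.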

Abstract: This article addresses embodied intelligence and reinforcement learning integration in the field of text processing, aiming to enhance text handling with more intelligence on the basis of embodied intelligence’s perception and action superiority and reinforcement learning’s decision optimization capability. Through detailed theoretical explanation and experimental exploration, a novel integration model is introduced. This model has been demonstrated to be very effective in a wide range of text processing tasks, validating its applicative potential.


[107] Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review cs.CL

Sukairaj Hafiz Imam, Tadesse Destaw Belay, Kedir Yassin Husse, Ibrahim Said Ahmad, Idris Abdulmumin

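TL;DR: This systematic literature review (SLR) surveys research on automatic speech recognition (ASR) for African low-resource languages, focusing on datasets, models and training methods, evaluation techniques, and challenges, and recommends future directions.

Details

Motivation: Africa has more than 2000 languages, yet they are severely underrepresented in ASR research and applications, which hinders digital inclusion. The review aims to fill this gap and advance ASR for African languages.

Result: (1) Fewer than 15% of studies provide reproducible materials; (2) dataset licensing is unclear; (3) self-supervised and transfer learning are promising but hindered by limited pre-training data; (4) evaluation relies mostly on WER, with minimal use of linguistically informed metrics, which limits applicability to tonal and morphologically rich languages.

Insight: (1) Community-driven initiatives and methodological advances indicate a pathway for improvement; (2) sustainable development requires stakeholder partnership, ethically balanced datasets, lightweight models, and active benchmarking; (3) future work should address dialect coverage and resource availability.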

Abstract: ASR has achieved remarkable global progress, yet African low-resource languages remain severely underrepresented, producing barriers to digital inclusion across a continent with more than 2000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, and challenges, and recommends future directions. We employ the PRISMA 2020 procedures and search DBLP, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025. We include studies related to ASR datasets, models or metrics for African languages, while excluding non-African, duplicate, and low-quality studies (score <3/5). We screen 71 out of 2,062 records and we record a total of 74 datasets across 111 languages, encompassing approximately 11,206 hours of speech. Fewer than 15% of studies provided reproducible materials, and dataset licensing is not clear. Self-supervised and transfer learning techniques are promising, but are hindered by limited pre-training data, inadequate coverage of dialects, and the availability of resources. Most of the researchers use Word Error Rate (WER), with very minimal use of linguistically informed scores such as Character Error Rate (CER) or Diacritic Error Rate (DER), and thus with limited application in tonal and morphologically rich languages. The existing evidence on ASR systems is inconsistent, hindered by issues like dataset availability, poor annotations, licensing uncertainties, and limited benchmarking. Nevertheless, the rise of community-driven initiatives and methodological advancements indicates a pathway for improvement. Sustainable development for this area will also include stakeholder partnership, creation of ethically well-balanced datasets, use of lightweight modelling techniques, and active benchmarking.
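The difference between WER and CER that the review points to is easy to see in code. The sketch below uses the standard definitions (Levenshtein distance over words or characters, divided by the reference length); whitespace tokenization is an assumption, and DER is not shown because it depends on language-specific diacritic handling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate with whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate over raw characters, spaces included."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A single wrong suffix on a long word, as in `wer("unbelievable", "unbelievably")`, counts as a full word error (WER 1.0) but only one character in twelve under CER, which is why WER alone can overstate errors for morphologically rich languages.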


[108] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models cs.CL | cs.AI | cs.LG

David Anugraha, Shou-Yi Hung, Zilu Tang, Annie En-Shiun Lee, Derry Tanti Wijaya

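TL;DR: The paper introduces mR3, a rubric-agnostic reward reasoning model trained on 72 languages; through a study of data and curriculum selection it achieves effective multilingual reward modeling and outperforms much larger models.

Details

Motivation: LLM-based judges do not generalize well to non-English settings, and it is unclear what constitutes effective multilingual training for them, which motivates a study of how to build high-quality multilingual reward models.

Result: mR3 attains state-of-the-art performance on multilingual reward model benchmarks while being up to 9x smaller than the models it surpasses.

Insight: Integrating target-language reasoning datasets and choosing data effectively are crucial to the performance of multilingual reward models.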

Abstract: Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Our models, data, and code are available as open source at https://github.com/rubricreward/mr3.


[109] Pay-Per-Search Models are Abstention Models cs.CL

Mustafa Omer Gul, Claire Cardie, Tanya Goyal

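TL;DR: The paper proposes MASH, a training framework that elicits abstention from LLMs through selective help-seeking (such as search tool use). MASH uses reinforcement learning with a pay-per-search reward and improves both answer accuracy and selective abstention.

Details

Motivation: Current LLMs cannot reliably recognize the boundaries of their parametric knowledge and often hallucinate answers to questions beyond them. Humans, in contrast, recognize their limits and either abstain or seek external help. MASH aims to obtain similar abstention behavior from LLMs through help-seeking.

Result: In experiments on three knowledge-intensive QA datasets, MASH improves answer accuracy by 7.6% on multi-hop datasets and distinguishes answerable from unanswerable questions, behaving analogously to specialized abstention approaches.

Insight: Training an LLM to seek external help selectively yields abstention as a by-product, without pre-determining knowledge boundaries. This offers a new route to more reliable and useful LLMs.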

Abstract: LLMs cannot reliably recognize their parametric knowledge boundaries and often hallucinate answers to outside-of-boundary questions. In contrast, humans recognize their limitations and can either seek external help for such questions or abstain. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e. search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while simultaneously rewarding answer accuracy. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, MASH improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention – it can distinguish between unanswerable/answerable questions and selectively generate responses for answerable questions – showcasing behavior analogous to specialized abstention approaches. We emphasize that contrary to prior abstention methods, MASH does not require pre-determining knowledge boundaries to construct training data. Instead, MASH’s abstentions are a by-product of training for the auxiliary selective help-seeking task. Overall, we show that MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions.
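The key idea, rewarding accuracy while charging for each search, can be sketched as a scalar reward. The exact reward used by MASH is not given in the abstract, so the linear form, the `search_cost` value, and the function names below are assumptions.

```python
def pay_per_search_reward(is_correct, num_searches, search_cost=0.1):
    """Accuracy reward minus a fixed penalty per search call
    (linear form and cost value are assumptions)."""
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward - search_cost * num_searches

def abstains(num_searches):
    """Read any help-seeking as a proxy for abstention: the model signals
    that the question lies outside its parametric knowledge."""
    return num_searches > 0
```

Under such a reward, answering correctly from parametric knowledge scores highest, a correct answer after searching scores slightly lower, and a wrong answer without searching scores worse than a searched correct one, which pushes the policy to search only when it needs to.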


[110] Backdoor Attacks Against Speech Language Models cs.CL | cs.CR | cs.SD

Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal

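TL;DR: The paper presents the first systematic study of audio backdoor attacks against speech language models, shows high attack success rates across multiple speech encoders and tasks, and proposes a fine-tuning-based defense.

Details

Motivation: As large language models (LLMs) and their multimodal extensions become more popular, a cascaded model can inherit vulnerabilities from each of its components, yet backdoor attacks in the audio domain remain understudied.

Result: Attack success rates range from 90.76% to 99.41%, indicating that speech language models are highly susceptible to backdoor attacks.

Insight: The vulnerability of speech language models is concentrated in specific stages of the pipeline, and fine-tuning can mitigate the threat posed by poisoned pretrained encoders.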

Abstract: Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate its effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.


[111] Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare cs.CL | cs.AI | cs.CY | cs.HC

Zhengliang Shi, Ruotian Ma, Jen-tse Huang, Xinbei Ma, Xingyu Chen

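TL;DR: The paper introduces the Social Welfare Function (SWF) Benchmark for evaluating how LLMs allocate scarce societal resources. It finds that mainstream LLMs lean utilitarian by default and that their allocation strategies are fragile under social-influence framing and output-length constraints.

Details

Motivation: LLMs are increasingly used in high-stakes decisions, but the principles and values that guide them when allocating societal resources remain largely unexamined. A dedicated benchmark is needed to evaluate and steer this behavior.

Result: General conversational ability is a poor predictor of allocation skill; most LLMs favor utilitarian outcomes at the expense of fairness; and allocation strategies are easily perturbed by output-length constraints and social-influence framing.

Insight: Deploying current LLMs as societal decision-makers carries risk; targeted alignment and specialized benchmarks are needed to keep them aligned with societal values.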

Abstract: Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model’s general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
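
The two quantities the benchmark trades off can be computed as follows. This is a minimal sketch using the textbook definitions of the Gini coefficient and return on investment; how the benchmark itself aggregates them is not stated in the abstract, and the function names are illustrative.

```python
def gini(values):
    """Gini coefficient: 0 means perfect equality, values near 1 mean high inequality."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Mean absolute difference over all ordered pairs, normalized by 2 * mean.
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * total)


def return_on_investment(total_return, total_cost):
    """Standard ROI: net gain divided by cost (assumed definition)."""
    return (total_return - total_cost) / total_cost
```

An allocator that gives everything to the single most productive recipient maximizes the second quantity while driving the first toward its upper end, which is the tension the benchmark is built around.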


[112] GRAD: Generative Retrieval-Aligned Demonstration Sampler for Efficient Few-Shot Reasoning cs.CL | cs.AI | cs.LG

Oussama Gabouj, Kamel Charaf, Ivan Zakazov, Nicolas Baldwin, Robert West

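TL;DR: The paper proposes GRAD, a dynamic demonstration-generation method in which an LLM is trained to produce concise, input-specific demonstrations. It improves few-shot reasoning under token budgets, outperforms static retrieval-augmented baselines, and performs well on mathematical reasoning and STEM questions.

Details

Motivation: Conventional retrieval-augmented generation (RAG) relies on static databases, which can yield demonstrations irrelevant to the input. Improving few-shot reasoning and adaptability calls for generating demonstrations dynamically.

Result: GRAD outperforms baselines on mathematical reasoning and STEM questions (physics, chemistry, computer science) and generalizes to out-of-distribution (OOD) domains.

Insight: Dynamically generated demonstrations outperform static retrieval, and demonstrations generated by smaller models can effectively guide larger ones, which suggests a route to few-shot learning in resource-constrained settings.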

Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks, but their effectiveness often depends on the quality of the provided context. Retrieval-Augmented Generation (RAG) enriches prompts with external information, but its reliance on static databases constrains adaptability and can result in irrelevant demonstrations. In this work, we propose a Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach where an LLM is trained to generate input-specific concise demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens used per demonstration and the number of tokens used for the final output. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM questions, highlighting GRAD’s robust generalization to out-of-distribution (OOD) domains such as physics, chemistry, and computer science. Furthermore, we show that demonstrations generated by trained smaller models can effectively guide larger target models, reducing training costs while maintaining competitive accuracy. Overall, this work introduces a scalable demonstration generator model, presenting the first step toward a dynamic few-shot learning paradigm in resource-constrained settings. We release the code used for the project.


[113] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity cs.CL | cs.AI

Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz

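TL;DR: The paper proposes Verbalized Sampling (VS), a training-free prompting strategy that asks the model to verbalize a probability distribution over a set of responses, mitigating the mode collapse that arises during post-training alignment.

Details

Motivation: The paper identifies typicality bias in preference data, where annotators systematically favor familiar text, as a fundamental cause of reduced diversity (mode collapse) after post-training alignment.

Result: VS significantly improves performance on creative writing, dialogue simulation, open-ended QA, and synthetic data generation without sacrificing factual accuracy or safety; in creative writing, diversity increases by 1.6-2.1x over direct prompting.

Insight: 1. Data-level bias is a central driver of mode collapse; 2. VS is a simple and effective inference-time remedy; 3. More capable models benefit more from VS.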

Abstract: Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., “Generate 5 jokes about coffee and their corresponding probabilities”). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
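
A rough sketch of how a Verbalized Sampling prompt might be built and consumed is shown below. The prompt wording follows the example quoted in the abstract; the `response | probability` output format, the parser, and the final sampling step are assumptions for illustration, since the abstract does not specify how the verbalized distribution is used downstream.

```python
import random
import re


def build_vs_prompt(task: str, k: int = 5) -> str:
    """Ask for k responses together with verbalized probabilities."""
    return f"Generate {k} {task} and their corresponding probabilities."


def parse_candidates(text: str):
    """Parse lines of the assumed form '<response> | <probability>'."""
    candidates = []
    for line in text.strip().splitlines():
        match = re.match(r"^(.*)\|\s*([0-9]*\.?[0-9]+)\s*$", line)
        if match:
            candidates.append((match.group(1).strip(), float(match.group(2))))
    return candidates


def sample_response(candidates, rng: random.Random) -> str:
    """Draw one response in proportion to its verbalized probability."""
    responses, weights = zip(*candidates)
    return rng.choices(responses, weights=weights, k=1)[0]
```

For example, `build_vs_prompt("jokes about coffee")` reproduces the prompt from the abstract, and the model's reply would then be passed through `parse_candidates` before sampling.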


[114] Energy-Regularized Sequential Model Editing on Hyperspheres cs.CL

Qingyuan Liu, Jia-Chen Gu, Yunzhi Yao, Hong Wang, Nanyun Peng

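TL;DR: The paper proposes SPHERE, a sequential model-editing method regularized by hyperspherical energy (HE) that stabilizes neuron weight distributions to mitigate the performance degradation caused by sequential edits.

Details

Motivation: Large language models (LLMs) need continual updates to stay in sync with real-world knowledge. Model editing is a lightweight alternative to retraining, but sequential editing destabilizes representations and induces catastrophic forgetting. The paper aims to understand and mitigate this problem.

Result: Experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by 16.41% on average while better preserving general model performance.

Insight: Hyperspherical uniformity is key to model stability and knowledge retention, and HE stability is crucial for avoiding editing failures.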

Abstract: Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
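
To make the HE quantity concrete, the sketch below computes one common form of hyperspherical energy: the sum of inverse pairwise distances between unit-normalized neuron weight vectors. Which exact variant (exponent, normalization) the paper uses is not stated in the abstract, so the exponent `s` and the function names are assumptions.

```python
import math


def normalize(vector):
    """Project a weight vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def hyperspherical_energy(weights, s: float = 1.0) -> float:
    """Sum of ||w_i - w_j||^(-s) over unordered pairs of normalized neurons.

    Lower energy means the neurons are spread more uniformly on the sphere.
    Identical directions give zero distance and raise ZeroDivisionError.
    """
    unit = [normalize(w) for w in weights]
    energy = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            energy += math.dist(unit[i], unit[j]) ** (-s)
    return energy
```

Tracking this value before and after each edit would give the kind of HE fluctuation signal that the abstract correlates with editing failures.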


cs.LG

[115] Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space cs.LG | cs.AI | cs.CL | cs.NE

Houjun Liu, Shikhar Murty, Christopher D. Manning, Róbert Csordás

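TL;DR: The paper proposes Thoughtbubbles, an unsupervised method that lets transformers perform parallel adaptive computation in latent space to scale inference-time compute.

Details

Motivation: Existing methods scale inference-time compute by emitting explicit chain-of-thought tokens, which cannot be applied during pretraining and is limited to serial generation. Thoughtbubbles addresses both limitations by learning to compute in parallel in latent space.

Result: On OpenWebText and peS2o, Thoughtbubbles outperforms standard decoder LMs and non-adaptive parallel computation approaches in perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA.

Insight: Because the method is implicit, adaptive computation can be learned from pretraining onward, which points toward unifying train-time and test-time behavior for reasoning models.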

Abstract: Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a “bubble” of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.


[116] The data-quality illusion: Rethinking Classifier-based quality filtering for LLM Pretraining cs.LG | cs.CL

Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi

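TL;DR: The paper analyzes Classifier-based Quality Filtering (CQF) in depth. Although CQF improves downstream task performance, it does not necessarily improve language modeling on the high-quality dataset. The authors expose CQF's limitations experimentally and question whether it measures data quality in a meaningful way.

Details

Motivation: Large models are pretrained on datasets of mixed quality, which makes data filtering a key step. CQF is a popular filtering method, but whether it captures data quality is open to question, and the authors set out to examine it.

Result: CQF does not necessarily improve language modeling on the high-quality dataset, which the authors attribute to CQF implicitly filtering the high-quality dataset as well.

Insight: CQF may not be an ideal measure of data quality, and better measures are needed. The diversity of high-quality data may also matter more than simple score-based filtering.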

Abstract: Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier’s score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
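
The CQF procedure described above reduces to scoring every document and keeping the top-scoring fraction. The sketch below shows that selection step only; `score_fn` stands in for the trained binary classifier, and both it and the `keep_fraction` default are hypothetical placeholders.

```python
def classifier_quality_filter(documents, score_fn, keep_fraction: float = 0.1):
    """Keep the top-scoring fraction of documents.

    score_fn is a placeholder for a classifier trained to separate
    pretraining data from a small high-quality set; higher means
    'more like the high-quality set'.
    """
    ranked = sorted(documents, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

Because the retained set depends only on the classifier's ranking, anything the classifier scores low is dropped, including text that resembles parts of the high-quality set, which is consistent with the implicit filtering effect the authors describe.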


[117] It Takes Two: Your GRPO Is Secretly DPO cs.LG | cs.CL

Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang

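TL;DR: The paper reveals a theoretical connection between GRPO and DPO and studies GRPO with only two rollouts per group (2-GRPO), which substantially reduces computational overhead while matching the performance of 16-GRPO.

Details

Motivation: GRPO is commonly believed to need a large group size for stable training, which is computationally expensive. By reframing GRPO as a form of contrastive learning, the paper uncovers its connection to DPO and investigates whether the minimal group size is feasible.

Result: 2-GRPO performs on par with 16-GRPO while using only 1/8 of the rollouts and reducing training time by over 70%.

Insight: A contrastive-learning reformulation can substantially cut the computational overhead of RL algorithms without sacrificing performance.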

Abstract: Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO’s empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
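
One informal way to see why the two-rollout case resembles a preference pair is to compute group-relative advantages directly. The sketch below uses the usual mean-and-standard-deviation normalization; the use of the population standard deviation and the `eps` constant are assumptions, and the paper's formal argument is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each rollout's reward by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, assumed for illustration
    return [(r - mean) / (std + eps) for r in rewards]
```

With two rollouts whose rewards differ, the advantages come out as roughly +1 and -1, so one response is pushed up and the other pushed down, much like a chosen and rejected pair. When both rewards are equal, both advantages are zero and the pair contributes no gradient.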


[118] GEM: A Gym for Agentic LLMs cs.LG | cs.AI | cs.CL

Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu

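TL;DR: The paper introduces GEM, an open-source environment simulator for experience-based training of large language models. It provides a standardized environment-agent interface, a diverse suite of environments and tools, and benchmarks comparing several RL algorithms.

Details

Motivation: As LLM training shifts from static datasets to experience-based learning, a standardized and efficient framework is needed for environment-agent interaction.

Result: Using GEM, the authors benchmark PPO, GRPO, and REINFORCE across environments and demonstrate the advantages of ReBN.

Insight: GEM serves not only as a training environment but also as a convenient evaluation toolkit for future agentic LLM research.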

Abstract: The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating how to use GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which – unlike GRPO – is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit besides a training environment. We hope this framework can help accelerate future agentic LLM research.
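
The abstract names REINFORCE with Return Batch Normalization (ReBN) but does not spell out the computation. A plausible reading, sketched below as an assumption rather than GEM's actual implementation, is to compute per-turn discounted returns and then normalize them across the whole batch.

```python
import statistics


def discounted_returns(rewards, gamma: float = 1.0):
    """Per-turn return: reward at this turn plus discounted future rewards."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def return_batch_normalization(batch_rewards, gamma: float = 1.0, eps: float = 1e-8):
    """Normalize every per-turn return by the batch-wide mean and std (assumed form)."""
    per_episode = [discounted_returns(ep, gamma) for ep in batch_rewards]
    flat = [g for ep in per_episode for g in ep]
    mean, std = statistics.fmean(flat), statistics.pstdev(flat)
    return [[(g - mean) / (std + eps) for g in ep] for ep in per_episode]
```

Because each turn keeps its own return, this form can use dense per-turn rewards, whereas a single group-level score per trajectory cannot distinguish between turns.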


[119] A Practitioner’s Guide to Multi-turn Agentic Reinforcement Learning cs.LG | cs.AI | cs.CL

Ruiyi Wang, Prithviraj Ammanabrolu

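TL;DR: The paper studies what works and what does not when training LLM agents with multi-turn reinforcement learning, organizes the design space into three pillars (environment, reward, and policy), and distills a training recipe.

Details

Motivation: Existing frameworks and definitions for multi-turn agentic RL are fragmented, with no systematic analysis of which design choices matter.

Result: Even simple environments provide signal on how well an agent generalizes to more complex tasks; the effect of reward density depends heavily on the choice of RL algorithm; and the paper shows how to find the optimal SFT-to-RL training ratio under a fixed budget.

Insight: Co-design across environment, reward, and policy is critical to agent performance, and the paper offers a practical training recipe to guide it.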

Abstract: We study what actually works and what doesn’t for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter-related pillars – environment, reward, and policy – and empirically derive a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software engineering style tasks. (i) For the environment, we analyze the impacts of task complexity in terms of sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability are highly dependent on the choice of RL algorithm. (iii) And for the agent’s policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods in addition to showing how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
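
The abstract contrasts biased estimators (PPO, GRPO) with the unbiased RLOO estimator. As a brief illustration, the leave-one-out baseline that RLOO is named after can be written as below; this is the generic form of the baseline, not code from the paper's repository.

```python
def rloo_advantages(rewards):
    """Advantage of each sample against the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

Each sample's baseline excludes its own reward, so the baseline is independent of the sample it is subtracted from, which is what keeps the resulting gradient estimate unbiased.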


[120] Prompt Curriculum Learning for Efficient LLM Post-Training cs.LG | cs.CL

Zhaolin Gao, Joongwon Kim, Wen Sun, Thorsten Joachims, Sid Wang

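TL;DR: The paper proposes Prompt Curriculum Learning (PCL), a lightweight reinforcement learning algorithm that selects intermediate-difficulty prompts with a learned value model, avoiding costly rollouts and substantially speeding up training.

Details

Motivation: RL post-training of language models is sensitive to batching and prompt selection strategies, and is inefficient. PCL aims for a better balance between performance and efficiency by dynamically selecting intermediate-difficulty prompts.

Result: When training on MATH and DeepScaleR, PCL identifies intermediate-difficulty prompts 12.1x and 16.9x faster, respectively, than rollout-based filtering, and it either achieves the highest performance or reaches comparable performance in significantly less time than baselines.

Insight: Intermediate-difficulty prompts are critical for RL training, and selecting them dynamically substantially improves training efficiency while maintaining strong model performance.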

Abstract: We introduce Prompt Curriculum Learning (PCL), a lightweight reinforcement learning (RL) algorithm that selects intermediate-difficulty prompts using a learned value model to post-train language models. Since post-training LLMs via RL remains sensitive to batching and prompt selection strategies, we first conduct a series of systematic experiments where we (1) determine the optimal training batch size that balances generation efficiency and gradient quality and (2) establish the importance of focusing on prompts of intermediate difficulty for the policy. We build upon these results to design PCL, which identifies prompts of intermediate difficulty for the current policy in an on-policy manner by using a value model that is concurrently updated based on the current policy. By focusing on informative prompts that yield high effective ratios, PCL achieves either the highest performance or requires significantly less time to reach comparable performance to its counterparts. Compared to rollout-based filtering methods, PCL avoids costly rollouts and achieves $12.1\times$ and $16.9\times$ faster speed on identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively. We further demonstrate that our value model accurately predicts prompt difficulty and allows PCL to focus on progressively more challenging prompts during RL. Our results present a new methodology that delivers improved tradeoff between upper-bound performance and efficiency for reasoning-focused RL.
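
The selection step can be sketched as ranking prompts by how close the value model's predicted success rate is to an intermediate target. The abstract does not state the exact selection rule, so the `target=0.5` threshold and the `value_fn` callable below are assumptions for illustration.

```python
def select_intermediate_prompts(prompts, value_fn, batch_size: int, target: float = 0.5):
    """Pick the prompts whose predicted success rate is closest to the target.

    value_fn is a placeholder for the learned value model: it maps a prompt
    to the current policy's predicted probability of solving it.
    """
    ranked = sorted(prompts, key=lambda prompt: abs(value_fn(prompt) - target))
    return ranked[:batch_size]
```

No rollouts are needed to make this choice, since only the value model is queried, which is where the speedup over rollout-based filtering would come from.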


[121] Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards cs.LG | cs.AI | cs.CL

Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu

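TL;DR: The paper proposes a unified framework that standardizes process reward model training and applies a multi-objective alignment method to align a model simultaneously across verifiable and non-verifiable reward domains, with fine-grained user control at inference time.

Details

Motivation: Aligning LLMs to human preferences is inherently multidimensional, yet existing pipelines collapse heterogeneous signals into a single objective, leading to inefficient training and little user control.

Result: On math reasoning, value alignment, and multi-turn dialogue, the framework improves performance across multiple objectives simultaneously, minimizes cross-objective trade-offs, and enables flexible user control at inference time.

Insight: Vectorized rewards combined with multi-objective alignment better balance conflicts between objectives and suggest a new approach to model alignment.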

Abstract: Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single optimizable objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards (mathematical accuracy), non-verifiable subjective preferences (human values), and complex interactive scenarios (multi-turn AI tutoring dialogues). Such multi-objective reinforcement learning setups are often plagued by the individual objectives being at odds with each other, resulting in inefficient training and little user control during inference. We propose a unified framework that: (i) standardizes process reward model (PRM) training across both verifiable and non-verifiable settings to better supervise models’ chain-of-thought reasoning; (ii) performs multi-objective alignment by training the LLM with our Multi-Action-Head DPO (MAH-DPO) and a vectorized reward where the dimensions of the vector correspond to the various objectives instead of a single scalar; and (iii) demonstrates how such a system provides fine-grained inference-time user control. Experiments across math reasoning, value alignment, and multi-turn dialogue show that our framework improves performance across multiple objectives simultaneously, while minimizing cross-objective trade-offs and enabling flexible inference time user control. The code can be found at https://github.com/pearls-lab/multiobj-align.
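
For readers unfamiliar with the DPO objective that MAH-DPO builds on by name, the standard single-pair DPO loss is sketched below. How MAH-DPO attaches this loss to multiple action heads or to the vectorized reward is not described in the abstract and is not modeled here; the `beta` default is an arbitrary illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

The loss falls as the policy raises the chosen response's log-probability relative to the reference model more than it raises the rejected one's.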


[122] BroRL: Scaling Reinforcement Learning via Broadened Exploration cs.LG | cs.CL

Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui

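TL;DR: BroRL is a complementary way to scale reinforcement learning by increasing the number of rollouts per example. It addresses the performance plateau that ProRL reaches as training steps increase and delivers continued gains.

Details

Motivation: ProRL scales RL by increasing the number of training steps, but performance plateaus after thousands of steps. BroRL broadens exploration by increasing the number of rollouts per example to get past this plateau.

Result: BroRL continues to improve models that saturated after 3K ProRL training steps and achieves state-of-the-art results for the 1.5B model across diverse benchmarks.

Insight: In RL, broadening exploration (more rollouts per example) is an effective way to scale model capability, and it keeps yielding gains after scaling training steps has saturated.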

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work ProRL has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration, which yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis allowing us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of unsampled terms diminishes, ensuring overall correct-mass expansion. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size N, corresponding to ample exploration, guarantees an increase in the probability mass of all correct tokens. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving state-of-the-art results for the 1.5B model across diverse benchmarks.


[123] Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment cs.LG | cs.AI | cs.CV

Suhyeon Lee, Jong Chul Ye

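TL;DR: PromptLoop is a modular RL framework that refines prompts step by step using latent feedback, improving diffusion model alignment with better generalization and robustness.

Details

Motivation: RL-based fine-tuning of diffusion models struggles with generalization, composability, and robustness against reward hacking, while most existing prompt refinement methods are feed-forward and do not fully exploit the sequential nature of RL.

Result: Experiments show that PromptLoop optimizes rewards effectively, generalizes to unseen models, composes orthogonally with existing alignment methods, and mitigates over-optimization and reward hacking.

Insight: Refining prompts dynamically from latent feedback is an effective alignment strategy that shows the benefits of a modular design for diffusion model alignment.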

Abstract: Despite the recent progress, reinforcement learning (RL)-based fine-tuning of diffusion models often struggles with generalization, composability, and robustness against reward hacking. Recent studies have explored prompt refinement as a modular alternative, but most adopt a feed-forward approach that applies a single refined prompt throughout the entire sampling trajectory, thereby failing to fully leverage the sequential nature of reinforcement learning. To address this, here we introduce PromptLoop, a plug-and-play RL framework that incorporates latent feedback into step-wise prompt refinement. Rather than modifying diffusion model weights, a multimodal large language model (MLLM) is trained with RL to iteratively update prompts based on intermediate latent states of diffusion models. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment. Extensive experiments across diverse reward functions and diffusion backbones demonstrate that PromptLoop (i) achieves effective reward optimization, (ii) generalizes seamlessly to unseen models, (iii) composes orthogonally with existing alignment methods, and (iv) mitigates over-optimization and reward hacking.


[124] Rehearsal-free and Task-free Online Continual Learning With Contrastive Prompt cs.LG | cs.CV

Aopeng Wang, Ke Deng, Yongli Ren, Jun Luo

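TL;DR: The paper proposes a rehearsal-free and task-free online continual learning method (F2OCL) that combines prompt learning with an NCM classifier to address catastrophic forgetting.

Details

Motivation: Existing online continual learning (OCL) methods often store samples in a rehearsal buffer or rely on task boundaries, but storing samples can raise data security or privacy concerns, and task boundaries are hard to identify in practice. The study therefore explores OCL without stored samples or task identities.

Result: Extensive experiments on two benchmarks demonstrate the effectiveness of the method in mitigating catastrophic forgetting.

Insight: The paper highlights the potential of prompt learning for continual learning and offers a viable technical path to OCL without stored samples or task identities.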

Abstract: The main challenge of continual learning is catastrophic forgetting. Because data is processed in a single pass, online continual learning (OCL) is one of the most difficult continual learning scenarios. To address catastrophic forgetting in OCL, some existing studies use a rehearsal buffer to store samples and replay them later in the learning process; other studies do not store samples but assume a sequence of learning tasks so that the task identities can be explored. However, storing samples may raise data security or privacy concerns, and it is not always possible to identify the boundaries between learning tasks in one pass of data processing. This motivates us to investigate rehearsal-free and task-free OCL (F2OCL). By integrating prompt learning with an NCM classifier, this study tackles catastrophic forgetting without storing samples and without using task boundaries or identities. Extensive experimental results on two benchmarks demonstrate the effectiveness of the proposed method.
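
Assuming NCM here denotes a nearest-class-mean classifier, the reason it fits a rehearsal-free, single-pass setting can be shown with a short sketch: each class keeps only a running mean of its features, so no samples need to be stored. The class below is a generic illustration and not the paper's implementation; the prompt-learning component that produces the features is omitted.

```python
import math


class NearestClassMean:
    """Classify a feature vector by the closest running class mean."""

    def __init__(self):
        self.means = {}
        self.counts = {}

    def update(self, feature, label):
        """Fold one sample into its class mean, then discard the sample."""
        count = self.counts.get(label, 0) + 1
        mean = self.means.get(label, [0.0] * len(feature))
        self.means[label] = [m + (x - m) / count for m, x in zip(mean, feature)]
        self.counts[label] = count

    def predict(self, feature):
        """Return the label whose mean is nearest in Euclidean distance."""
        return min(self.means, key=lambda label: math.dist(self.means[label], feature))
```

New classes can appear at any point in the stream without a task boundary, since `update` simply creates a new mean the first time a label is seen.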


eess.IV

[125] Enhancing Safety in Diabetic Retinopathy Detection: Uncertainty-Aware Deep Learning Models with Rejection Capabilities eess.IV | cs.AI | cs.CV

Madhushan Ramalingam, Yaish Riaz, Priyanthi Rajamanoharan, Piyumi Dasanayaka

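TL;DR: The paper studies uncertainty-aware deep learning models for diabetic retinopathy detection and introduces a rejection mechanism that withholds low-confidence predictions to make clinical diagnosis safer.

Details

Motivation: Early diagnosis of diabetic retinopathy is critical, but existing deep learning models give no explicit indication of prediction confidence, which creates uncertainty in clinical decision-making.

Result: The results show a trade-off between accuracy and caution; uncertainty estimation and selective rejection improve the model's reliability in safety-critical diagnostic settings.

Insight: In safety-critical domains such as medical diagnosis, uncertainty awareness and a rejection mechanism are needed to balance coverage against reliability.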

Abstract: Diabetic retinopathy (DR) is a major cause of visual impairment, and effective treatment options depend heavily on timely and accurate diagnosis. Deep learning models have demonstrated great success identifying DR from retinal images. However, relying only on predictions made by models, without any indication of model confidence, creates uncertainty and poses significant risk in clinical settings. This paper investigates an alternative in uncertainty-aware deep learning models, including a rejection mechanism to reject low-confidence predictions, contextualized by deferred decision-making in clinical practice. The results show there is a trade-off between prediction coverage and coverage reliability. The Variational Bayesian model adopted a more conservative strategy when predicting DR, subsequently rejecting the uncertain predictions. The model is evaluated by means of important performance metrics such as Accuracy on accepted predictions, the proportion of accepted cases (coverage), the rejection-ratio, and Expected Calibration Error (ECE). The findings also demonstrate a clear trade-off between accuracy and caution, establishing that the use of uncertainty estimation and selective rejection improves the model’s reliability in safety-critical diagnostic use cases.
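
The evaluation metrics listed in the abstract can be computed from per-case confidences and correctness flags. The sketch below assumes a simple confidence threshold as the rejection rule and equal-width bins for ECE; the paper's actual rejection criterion and binning are not given in the abstract.

```python
def selective_metrics(confidences, correct, threshold: float):
    """Coverage, rejection ratio, and accuracy on the accepted predictions."""
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return {"coverage": coverage,
            "rejection_ratio": 1.0 - coverage,
            "accepted_accuracy": accuracy}


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(conf for conf, _ in items) / len(items)
            acc = sum(ok for _, ok in items) / len(items)
            ece += len(items) / len(confidences) * abs(acc - avg_conf)
    return ece
```

Raising the threshold lowers coverage and typically raises accuracy on the accepted cases, which is the accuracy-versus-caution trade-off the abstract reports.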


[126] Deep Learning Approaches with Explainable AI for Differentiating Alzheimer Disease and Mild Cognitive Impairment eess.IV | cs.AI | cs.CV | cs.LG | stat.AP | stat.ML

Fahad Mostafa, Kannon Hossain, Hafiz Khan

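TL;DR: The paper proposes a hybrid deep learning ensemble framework combined with explainable AI techniques to differentiate Alzheimer's disease from mild cognitive impairment, and reports strong classification performance on the ADNI dataset.

Details

Motivation: Early and accurate diagnosis of Alzheimer's disease is critical for clinical intervention, but mild cognitive impairment, its prodromal stage, involves subtle structural changes that are hard to distinguish. A high-performing and interpretable method is needed to support diagnosis.

Result: On the ADNI dataset, the method reaches 99.21% accuracy for Alzheimer's disease vs. mild cognitive impairment and 91.0% for mild cognitive impairment vs. normal controls, outperforming conventional transfer learning and baseline ensemble methods.

Insight: The study shows the potential of deep learning for diagnosing neurodegenerative disease; explainable AI techniques make the model more transparent and help identify structural biomarkers in support of clinical decisions.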

Abstract: Early and accurate diagnosis of Alzheimer Disease is critical for effective clinical intervention, particularly in distinguishing it from Mild Cognitive Impairment, a prodromal stage marked by subtle structural changes. In this study, we propose a hybrid deep learning ensemble framework for Alzheimer Disease classification using structural magnetic resonance imaging. Gray and white matter slices are used as inputs to three pretrained convolutional neural networks (ResNet50, NASNet, and MobileNet), each fine-tuned through an end-to-end process. To further enhance performance, we incorporate a stacked ensemble learning strategy with a meta-learner and weighted averaging to optimally combine the base models. Evaluated on the Alzheimer Disease Neuroimaging Initiative dataset, the proposed method achieves state-of-the-art accuracy of 99.21% for Alzheimer Disease vs. Mild Cognitive Impairment and 91.0% for Mild Cognitive Impairment vs. Normal Controls, outperforming conventional transfer learning and baseline ensemble methods. To improve interpretability in image-based diagnostics, we integrate Explainable AI techniques, namely Gradient-weighted Class Activation Mapping, which generates heatmaps and attribution maps that highlight critical regions in gray and white matter slices, revealing structural biomarkers that influence model decisions. These results highlight the framework’s potential for robust and scalable clinical decision support in neurodegenerative disease diagnostics.
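
The weighted-averaging step of the ensemble can be illustrated in a few lines. This sketch covers only that step: the weights are hypothetical, and the stacked meta-learner mentioned in the abstract is not modeled.

```python
def weighted_average_ensemble(prob_vectors, weights):
    """Combine per-model class probabilities with a normalized weighted average.

    prob_vectors holds one probability vector per base model; weights holds
    one non-negative weight per model (illustrative values, not the paper's).
    """
    total_weight = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors)) / total_weight
        for c in range(n_classes)
    ]
```

For instance, three base models would each contribute one probability vector per image, and the class with the highest combined probability would be the ensemble's prediction.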


[127] DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction eess.IV | cs.CV | cs.LG

Yucheng Xing, Ling Huang, Jingying Ma, Ruping Hong, Jiangdong Qiu

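TL;DR: DPsurv is a dual-prototype evidential fusion network for survival prediction from whole-slide images that is uncertainty-aware and interpretable; its effectiveness and reliability are validated on five public datasets.

Details

Motivation: Existing whole-slide image survival analysis methods generally lack interpretability and overlook predictive uncertainty; DPsurv aims to address both.

Result: On five public datasets, DPsurv achieves the highest mean concordance index (C-index) and the lowest mean integrated Brier score, validating its effectiveness and reliability.

Insight: DPsurv's transparent design makes the model more trustworthy and clinically applicable, and suggests a new approach to cancer prognosis analysis.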

Abstract: Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. Experiments on five publicly available datasets achieve the highest mean concordance index and the lowest mean integrated Brier score, validating the effectiveness and reliability of DPsurv. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.
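
For reference, the concordance index reported above can be computed as follows. This is a minimal sketch of Harrell's C-index with the usual handling of censoring and tied risk scores; it is a generic definition, not the paper's evaluation code.

```python
def concordance_index(times, events, risks) -> float:
    """Fraction of comparable pairs ranked correctly by predicted risk.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] is truthy) and a shorter time than subject j. It is
    concordant when subject i also has the higher risk score; tied
    risk scores count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk.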


[128] Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study eess.IV | cs.CV | cs.CYPDF

Kiran Nijjer, Ryan Bui, Derek Jiu, Adnan Ahmed, Peter Wang

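TL;DR: The paper analyzes skin tone bias in the large language model SkinGPT-4 for skin disease diagnosis, proposes fine-tuning strategies to reduce that bias, and validates the approach through clinical evaluation.

Details

Motivation: Existing dermatology diagnosis models such as SkinGPT-4 are trained on data dominated by lighter skin tones, which lowers diagnostic accuracy for darker skin tones. This bias can harm healthcare equity.

Result: The fine-tuned models perform better on fairness metrics: average demographic parity rises from 0.10 to 0.75, and the best model's parity scores for Fitzpatrick types I-VI range from 0.76 to 0.90.

Insight: Large language models exhibit bias in medical applications; targeted fine-tuning and fairness evaluation can effectively improve their inclusiveness and accuracy.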

Abstract: SkinGPT-4, a large vision-language model, leverages annotated skin disease images to augment clinical workflows in underserved communities. However, its training dataset predominantly represents lighter skin tones, limiting diagnostic accuracy for darker tones. Here, we evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases, including eczema, allergic-contact dermatitis, and psoriasis using the open-sourced SCIN dataset. We leveraged the SkinGPT-4 backbone to develop finetuned models for custom skin disease classification tasks and explored bias mitigation strategies. Clinical evaluation by board-certified dermatologists on six relevant skin diseases from 300 SCIN cases assessed images for diagnostic accuracy, informativity, physician utility, and patient utility. Model fairness metrics, including demographic parity and equalized odds, were calculated across skin tones. SkinGPT-4 achieved an average demographic parity of 0.10 across Fitzpatrick types, with notable differences of 0.10-0.15 between lightest and darkest tones across evaluation metrics. Model hallucinations in artifacts and anatomy occurred at a rate of 17.8. Our customized models achieved average F1, precision, and AUROC of 0.75, 0.78, and 0.78 across visually similar disease pairs. Fairness analysis showed an average demographic parity of 0.75, with a maximum disparity of 0.21 across skin tones. The best model achieved parity scores of 0.83, 0.83, 0.76, 0.89, 0.90, and 0.90 for Fitzpatrick I-VI, indicating robust fairness. Large language models such as SkinGPT-4 showed weaker performance on darker tones. Model biases exist across evaluation criteria, and hallucinations may affect diagnostic efficacy. These findings demonstrate the efficacy of training accurate, fair models using existing backbones for custom skin disease classification.
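
As a rough illustration of the fairness metric named above, the sketch below computes per-group positive prediction rates and a demographic parity ratio. The abstract does not state which formulation of demographic parity the authors use, so the ratio form (lowest group rate divided by highest) is an assumption, as are the function names and the toy labels.

```python
from collections import defaultdict


def positive_rates(groups, predictions):
    """Return the fraction of positive predictions for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, pred in zip(groups, predictions):
        counts[group][0] += int(pred == 1)
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def demographic_parity_ratio(groups, predictions):
    """Lowest group positive rate divided by the highest (1.0 = parity)."""
    rates = positive_rates(groups, predictions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group receives positive predictions
    return min(rates.values()) / highest


def demographic_parity_difference(groups, predictions):
    """Gap between the highest and lowest group positive rates (0.0 = parity)."""
    rates = positive_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    # Hypothetical predictions for two Fitzpatrick groups.
    groups = ["I", "I", "I", "I", "VI", "VI", "VI", "VI"]
    preds = [1, 1, 1, 0, 1, 0, 0, 0]
    print(positive_rates(groups, preds))
    print(demographic_parity_ratio(groups, preds))
    print(demographic_parity_difference(groups, preds))
```

The ratio form rewards values close to 1, which matches the direction of the scores quoted in the abstract, but the exact definition should be checked against the paper.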


[129] Variable Rate Image Compression via N-Gram Context based Swin-transformer eess.IV | cs.CV | cs.MMPDF

Priyanka Mudgal, Feng Liu

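TL;DR: This paper proposes an N-gram context-based Swin Transformer for learned image compression. It achieves variable-rate compression with a single model and improves high-resolution reconstruction quality by enlarging the receptive field.

Details

Motivation: Because of its restricted receptive field, the existing Swin Transformer neglects larger regions during high-resolution image reconstruction, which degrades reconstruction quality. The paper proposes an improved method that increases context awareness and variable-rate compression performance.

Result: Experiments show that the method outperforms existing variable-rate learned image compression techniques, with a -5.86% improvement in BD-Rate, and that it improves the quality of regions of interest (ROI) in images.

Insight: Combining an N-gram context mechanism with the Swin Transformer effectively expands the receptive field and improves high-resolution reconstruction quality, which is particularly useful for object-focused applications such as industrial vision.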

Abstract: This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, leading to a -5.86% improvement in BD-Rate over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.


[130] A Fast and Precise Method for Searching Rectangular Tumor Regions in Brain MR Images eess.IV | cs.CVPDF

Hidenori Takeshima, Shuki Maruyama

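TL;DR: This paper proposes a fast and precise method for searching rectangular tumor regions in brain MR images. It combines a segmentation network with a fast search method driven by a user-controllable search metric, improving both speed and precision.

Details

Motivation: Fast, precise localization of tumor regions in brain MR images is important for diagnosis. Conventional methods are slow and insufficiently accurate, so a more efficient technique is needed.

Result: The 3D full search takes only 8 seconds, 100-500 times faster than the conventional computation (11-40 minutes). The proposed search metric yields higher tumor fractions than the conventional metric and prefers cubes, whereas the conventional metric prefers oblongs.

Insight: Combining an efficient segmentation network with a fast search method is an effective way to solve region localization problems in medical image analysis. The user-controllable metric adds flexibility, which suits practical clinical use.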

Abstract: Purpose: To develop a fast and precise method for searching rectangular regions in brain tumor images. Methods: The authors propose a new method for searching rectangular tumor regions in brain MR images. The proposed method consisted of a segmentation network and a fast search method with a user-controllable search metric. As the segmentation network, the U-Net whose encoder was replaced by the EfficientNet was used. In the fast search method, summed-area tables were used for accelerating sums of voxels in rectangular regions. Use of the summed-area tables enabled exhaustive search of the 3D offset (3D full search). The search metric was designed for giving priority to cubes over oblongs, and assigning better values for higher tumor fractions even if they exceeded target tumor fractions. The proposed computation and metric were compared with those used in a conventional method using the Brain Tumor Image Segmentation dataset. Results: When the 3D full search was used, the proposed computation (8 seconds) was 100-500 times faster than the conventional computation (11-40 minutes). When the user-controllable parts of the search metrics were changed variously, the tumor fractions of the proposed metric were higher than those of the conventional metric. In addition, the conventional metric preferred oblongs whereas the proposed metric preferred cubes. Conclusion: The proposed method is promising for implementing fast and precise search of rectangular tumor regions, which is useful for brain tumor diagnosis using MRI systems. The proposed computation reduced processing times of the 3D full search, and the proposed metric improved the quality of the assigned rectangular tumor regions.
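
The summed-area table idea mentioned in the Methods can be sketched as follows. This is a generic 3D summed-area table in plain Python, not the authors' implementation; the function names, the half-open box convention, and the toy volume are assumptions made for illustration. After one pass to build the table, the sum of any axis-aligned box costs eight lookups, which is what makes an exhaustive search over 3D offsets affordable.

```python
def summed_volume_table(vol):
    """Build S with S[z][y][x] = sum of vol over [0,z) x [0,y) x [0,x).

    The table carries one layer of zero padding on each axis so that
    box sums need no boundary checks.
    """
    depth, height, width = len(vol), len(vol[0]), len(vol[0][0])
    S = [[[0] * (width + 1) for _ in range(height + 1)]
         for _ in range(depth + 1)]
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                # 3D inclusion-exclusion over the seven neighbouring prefixes.
                S[z + 1][y + 1][x + 1] = (
                    vol[z][y][x]
                    + S[z][y + 1][x + 1] + S[z + 1][y][x + 1] + S[z + 1][y + 1][x]
                    - S[z][y][x + 1] - S[z][y + 1][x] - S[z + 1][y][x]
                    + S[z][y][x]
                )
    return S


def box_sum(S, z0, y0, x0, z1, y1, x1):
    """Sum over the half-open box [z0,z1) x [y0,y1) x [x0,x1) in O(1)."""
    return (
        S[z1][y1][x1]
        - S[z0][y1][x1] - S[z1][y0][x1] - S[z1][y1][x0]
        + S[z0][y0][x1] + S[z0][y1][x0] + S[z1][y0][x0]
        - S[z0][y0][x0]
    )


def tumor_fraction(S, z0, y0, x0, z1, y1, x1):
    """Fraction of voxels labelled tumor (1) inside the box, for a binary mask."""
    volume = (z1 - z0) * (y1 - y0) * (x1 - x0)
    return box_sum(S, z0, y0, x0, z1, y1, x1) / volume


if __name__ == "__main__":
    # Hypothetical 2x2x2 binary segmentation mask.
    mask = [[[0, 1], [1, 1]], [[0, 0], [1, 1]]]
    table = summed_volume_table(mask)
    print(tumor_fraction(table, 0, 0, 0, 2, 2, 2))
```

In a search loop, the table would be built once from the segmentation output and `tumor_fraction` evaluated for every candidate offset and box size.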


[131] U-DFA: A Unified DINOv2-Unet with Dual Fusion Attention for Multi-Dataset Medical Segmentation eess.IV | cs.AI | cs.CVPDF

Zulkaif Sajjad, Furqan Shaukat, Junaid Mir

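TL;DR: U-DFA is a unified DINOv2-Unet architecture that integrates a Local-Global Fusion Adapter (LGFA) to improve medical image segmentation, achieving state-of-the-art performance on multiple datasets.

Details

Motivation: Existing approaches that combine CNNs with transformers fuse local and global features poorly, while vision-language models (VLMs) and foundation models suffer from a domain gap and high computational cost on medical imaging tasks.

Result: State-of-the-art performance on the Synapse and ACDC datasets with only 33% of the trainable model parameters.

Insight: The LGFA module effectively fuses local and global features while reducing computational cost, providing a scalable solution for medical image segmentation across multiple modalities.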

Abstract: Accurate medical image segmentation plays a crucial role in overall diagnosis and is one of the most essential tasks in the diagnostic pipeline. CNN-based models, despite their extensive use, suffer from a local receptive field and fail to capture the global context. A common approach that combines CNNs with transformers attempts to bridge this gap but fails to effectively fuse the local and global features. With the recent emergence of VLMs and foundation models, they have been adapted for downstream medical imaging tasks; however, they suffer from an inherent domain gap and high computational cost. To this end, we propose U-DFA, a unified DINOv2-Unet encoder-decoder architecture that integrates a novel Local-Global Fusion Adapter (LGFA) to enhance segmentation performance. LGFA modules inject spatial features from a CNN-based Spatial Pattern Adapter (SPA) module into frozen DINOv2 blocks at multiple stages, enabling effective fusion of high-level semantic and spatial features. Our method achieves state-of-the-art performance on the Synapse and ACDC datasets with only 33% of the trainable model parameters. These results demonstrate that U-DFA is a robust and scalable framework for medical image segmentation across multiple modalities.


cs.RO [Back]

[132] VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators cs.RO | cs.CVPDF

Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge

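TL;DR: VLA-RFT proposes a world-model-based reinforcement fine-tuning framework. A data-driven controllable simulator lowers sample requirements and improves the generalization and robustness of Vision-Language-Action models, surpassing strong supervised baselines with fewer than 400 fine-tuning steps.

Details

Motivation: Vision-Language-Action (VLA) models rely on imitation learning, which leads to compounding errors and degraded performance under distribution shift. Reinforcement learning (RL) can mitigate these issues but faces costly real-world interaction or large sim-to-real gaps.

Result: With fewer than 400 fine-tuning steps, VLA-RFT outperforms supervised baselines and maintains stable performance under perturbed conditions.

Insight: World-model-driven reinforcement fine-tuning is an efficient post-training paradigm that can improve the generalization and robustness of VLA models.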

Abstract: Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.


[133] Hybrid Training for Vision-Language-Action Models cs.RO | cs.AI | cs.CV | cs.LGPDF

Pietro Mazzaglia, Cansu Sancaktar, Markus Peschl, Daniel Dijkman

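TL;DR: This paper proposes Hybrid Training (HyT), a framework that learns to conditionally predict a diverse set of outputs so that a Vision-Language-Action model can, at inference time, either generate chain-of-thought (CoT) or predict actions directly, improving performance without increasing inference time.

Details

Motivation: In robotic tasks, CoT improves performance but lengthens inference, which hurts real-time use. HyT aims to resolve this tension by letting the model learn from CoT during training while skipping it at inference.

Result: In simulated and real-world experiments, HyT shows improved performance while reducing inference time, supporting its effectiveness.

Insight: The study indicates that CoT generation is not a strict prerequisite for performance gains; HyT's flexible design balances performance and efficiency.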

Abstract: Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model’s generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent’s actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.


[134] HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy cs.RO | cs.CVPDF

Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim

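TL;DR: HAMLET is a framework that turns Vision-Language-Action (VLA) models into history-aware policies, using moment tokens that compactly encode each timestep and a lightweight memory module to improve performance on long-horizon tasks.

Details

Motivation: Existing VLA models ignore historical context, yet robotic tasks often depend on it. HAMLET aims to address this and improve performance on history-dependent tasks.

Result: On history-dependent real-world tasks, the average success rate reaches 76.4%, surpassing the baseline by 47.2%. Performance also improves on RoboCasa Kitchen (64.1% to 66.4%) and LIBERO (95.6% to 97.7%).

Insight: Historical context is crucial for robotic tasks, and HAMLET shows how to incorporate it into VLA models efficiently.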

Abstract: Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.


cs.MA [Back]

[135] Stochastic Self-Organization in Multi-Agent Systems cs.MA | cs.CL | cs.LGPDF

Nurbek Tastan, Samuel Horvath, Karthik Nandakumar

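TL;DR: The paper proposes SelfOrg, a framework that dynamically adapts the communication structure of a multi-agent system for efficient collaboration. It uses an approximation of the Shapley value to assess agent contributions and builds a directed acyclic graph (DAG) to regulate information propagation, without additional supervision or training. Experiments show significant gains over existing methods with weak LLM backends.

Details

Motivation: Existing collaboration mechanisms for multi-agent systems (MAS) typically rely on fixed topologies or external LLM judges, which adds complexity. The goal is to improve collaboration efficiency through dynamic communication optimization, especially with weak LLMs.

Result: SelfOrg performs robustly with both strong and weak LLM backends, with significant gains in the weak regime. Theoretical analysis shows that multiple agents increase the probability of a correct answer.

Insight: Dynamic communication structures can effectively improve collaboration in multi-agent systems; Shapley-value assessment and DAG construction are key to efficient information propagation.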

Abstract: Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM. However, this potential can only be realized when the collaboration mechanism between agents is optimized. Specifically, optimizing the communication structure between agents is critical for fruitful collaboration. Most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or employ external LLM judges, thereby adding to the complexity. In this work, we introduce a response-conditioned framework that adapts communication on-the-fly. Agents independently generate responses to the user query and assess peer contributions using an approximation of the Shapley value. A directed acyclic graph (DAG) is then constructed to regulate the propagation of the responses among agents, which ensures stable and efficient message transmission from high-contributing agents to others. This graph is dynamically updated based on the agent responses from the previous collaboration round. Since the proposed framework enables the self-organization of agents without additional supervision or training, we refer to it as SelfOrg. The SelfOrg framework goes beyond task- and query-level optimization and takes into account the stochastic nature of agent responses. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse. We also theoretically show that multiple agents increase the chance of correctness and that the correct responses naturally dominate the information flow.
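
To make the contribution-assessment step more concrete, the sketch below estimates Shapley values by sampling random orderings of the agents. The abstract only says that SelfOrg uses "an approximation of the Shapley value", so this generic permutation-sampling estimator is an illustrative stand-in rather than the paper's method; `shapley_estimate`, the `value` callback, and the toy weights are all hypothetical.

```python
import random


def shapley_estimate(players, value, num_permutations=2000, seed=0):
    """Monte Carlo Shapley values via random permutations.

    players -- iterable of hashable player (agent) identifiers
    value   -- function mapping a frozenset of players to a number
    """
    rng = random.Random(seed)
    order = list(players)
    totals = {p: 0.0 for p in order}
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = []
        previous = value(frozenset())
        for p in order:
            coalition.append(p)
            current = value(frozenset(coalition))
            # Marginal contribution of p given the players before it.
            totals[p] += current - previous
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical additive game: each agent adds a fixed amount of value.
    weights = {"agent_a": 1, "agent_b": 2, "agent_c": 3}

    def coalition_value(members):
        return sum(weights[m] for m in members)

    print(shapley_estimate(weights, coalition_value))
```

In an additive game each agent's Shapley value equals its own weight, which gives a quick sanity check; for general value functions the estimates carry sampling noise, but they always sum to the value of the full coalition minus that of the empty one.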


q-bio.QM [Back]

[136] Behavioural Classification in C. elegans: a Spatio-Temporal Analysis of Locomotion q-bio.QM | cs.CVPDF

Nemanja Antonic, Monika Scholz, Aymeric Vellinger, Euphrasie Ramahefarivo, Elio Tuci

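TL;DR: The paper presents a method that extracts behavioural units from C. elegans locomotion without requiring a clear view of the entire worm body, and defines those units with an unsupervised automatic pipeline, avoiding the bias of predefined assumptions.

Details

Motivation: Current methods require a clear view of the entire worm body, which is hard to obtain in high-density conditions. A method that does not need full-body information is therefore needed to extract behavioural units and better study how social context influences individual behaviour.

Result: Spatio-temporal locomotory patterns can be extracted even from single-point tracking, and these patterns are a fundamental element of behavioural classification. The method is evaluated by how closely the movement of a simulated worm matches that of a natural worm.

Insight: Spatio-temporal locomotory patterns remain identifiable in high-density conditions, and the unsupervised automatic method avoids the bias of hand-designed behavioural units, giving a more objective tool for behavioural analysis.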

Abstract: The 1 mm roundworm C. elegans is a model organism used in many sub-areas of biology to investigate different types of biological processes. In order to complement the in vivo analysis with computer-based investigations, several methods have been proposed to simulate the worm behaviour. These methods extract discrete behavioural units from the flow of the worm movements using different types of tracking techniques. Nevertheless, these techniques require a clear view of the entire worm body, which is not always achievable. For example, this happens in high density worm conditions, which are particularly informative to understand the influence of the social context on the single worm behaviour. In this paper, we illustrate and evaluate a method to extract behavioural units from recordings of C. elegans movements which do not necessarily require a clear view of the entire worm body. Moreover, the behavioural units are defined by an unsupervised automatic pipeline which frees the process from predefined assumptions that inevitably bias the behavioural analysis. The behavioural units resulting from the automatic method are interpreted by comparing them with hand-designed behavioural units. The effectiveness of the automatic method is evaluated by measuring the extent to which the movement of a simulated worm, with an agent-based model, matches the movement of a natural worm. Our results indicate that spatio-temporal locomotory patterns emerge even from single point worm tracking. Moreover, we show that such patterns represent a fundamental aspect of the behavioural classification process.


cs.GR [Back]

[137] Motion In-Betweening for Densely Interacting Characters cs.GR | cs.CVPDF

Xiaotang Zhang, Ziyi Chang, Qianhui Men, Hubert P. H. Shum

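TL;DR: This paper proposes a motion in-betweening method for densely interacting characters. Cross-space modeling and adversarial learning address long-horizon motion synthesis for interacting characters while maintaining motion quality and interaction stability.

Details

Motivation: Traditional motion in-betweening methods focus mainly on single characters; extending them to densely interacting characters raises challenges in spatial-temporal correspondence and natural transitions. This paper aims to solve this problem and synthesize long-horizon motion in which two characters interact naturally.

Result: Experiments show that the method generates realistic, controllable, long-horizon interactive motion (such as boxing and dancing), supported by quantitative evaluations and user studies.

Insight: In-betweening for interacting characters must account for both spatial-temporal correspondence and motion quality; adversarial learning and latent-space refinement are effective means to that end.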

Abstract: Motion in-betweening is the problem of synthesizing movement between keyposes. Traditional research focused primarily on single characters. Extending it to densely interacting characters is highly challenging, as it demands precise spatial-temporal correspondence between the characters to maintain the interaction, while creating natural transitions towards predefined keyposes. In this research, we present a method for long-horizon interaction in-betweening that enables two characters to engage and respond to one another naturally. To effectively represent and synthesize interactions, we propose a novel solution called Cross-Space In-Betweening, which models the interactions of each character across different conditioning representation spaces. We further observe that the significantly increased constraints in interacting characters heavily limit the solution space, leading to degraded motion quality and diminished interaction over time. To enable long-horizon synthesis, we present two solutions to maintain long-term interaction and motion quality, thereby keeping synthesis in the stable region of the solution space. We first sustain interaction quality by identifying periodic interaction patterns through adversarial learning. We further maintain the motion quality by learning to refine the drifted latent space and prevent pose error accumulation. We demonstrate that our approach produces realistic, controllable, and long-horizon in-between motions of two characters with dynamic boxing and dancing actions across multiple keyposes, supported by extensive quantitative evaluations and user studies.


[138] ReSWD: ReSTIR’d, not shaken. Combining Reservoir Sampling and Sliced Wasserstein Distance for Variance Reduction cs.GR | cs.CV | cs.LGPDF

Mark Boss, Andreas Engelhardt, Simon Donné, Varun Jampani

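TL;DR: ReSWD combines Weighted Reservoir Sampling with the Sliced Wasserstein Distance, adaptively retaining informative projection directions to reduce variance, which yields stable gradients and faster convergence.

Details

Motivation: Computing the Wasserstein distance is too costly for high-dimensional distributions. The Sliced Wasserstein Distance (SWD) scales better, but its Monte Carlo estimator has high variance, which causes noisy gradients and slow convergence.

Result: On synthetic benchmarks and real-world tasks such as color correction and diffusion guidance, ReSWD outperforms standard SWD and other variance-reduction baselines.

Insight: Incorporating reservoir sampling can effectively stabilize SWD optimization, making it suitable for tasks that need efficient distribution matching.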

Abstract: Distribution matching is central to many vision and graphics tasks, where the widely used Wasserstein distance is too costly to compute for high dimensional distributions. The Sliced Wasserstein Distance (SWD) offers a scalable alternative, yet its Monte Carlo estimator suffers from high variance, resulting in noisy gradients and slow convergence. We introduce Reservoir SWD (ReSWD), which integrates Weighted Reservoir Sampling into SWD to adaptively retain informative projection directions in optimization steps, resulting in stable gradients while remaining unbiased. Experiments on synthetic benchmarks and real-world tasks such as color correction and diffusion guidance show that ReSWD consistently outperforms standard SWD and other variance reduction baselines. Project page: https://reservoirswd.github.io/
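
For context, the plain Monte Carlo SWD estimator whose variance ReSWD targets can be sketched as below. This is the standard baseline estimator only; it contains no reservoir sampling and is not the authors' code. The function name, the equal-sample-size restriction, and the toy point sets are assumptions made for illustration.

```python
import math
import random


def sliced_wasserstein_sq(xs, ys, num_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    xs, ys -- equal-length lists of points (tuples of the same dimension)
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Draw a direction uniformly on the unit sphere (normalized Gaussian).
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both point sets onto the direction and sort them.
        px = sorted(sum(a * b for a, b in zip(p, direction)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, direction)) for p in ys)
        # In 1D, optimal transport pairs the sorted samples in order.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / num_projections


if __name__ == "__main__":
    # Hypothetical 2D point sets: ys is xs shifted by (3, 4).
    xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    ys = [(x + 3.0, y + 4.0) for x, y in xs]
    print(sliced_wasserstein_sq(xs, ys))
```

Each run averages over a fresh set of random directions, so the estimate fluctuates with the seed and with `num_projections`; that fluctuation is the variance the abstract refers to.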


[139] Audio Driven Real-Time Facial Animation for Social Telepresence cs.GR | cs.CV | cs.LG | cs.SDPDF

Jiye Lee, Chenghui Li, Linh Tran, Shih-En Wei, Jason Saragih

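TL;DR: The paper presents an audio-driven real-time facial animation system. Using a diffusion model and a novel architecture, it produces high-quality 3D facial animation with low latency (<15 ms GPU time), suited to social interaction in virtual reality.

Details

Motivation: Existing facial animation techniques usually trade off real-time performance against quality, and struggle to meet the low-latency, high-realism requirements of social interaction in virtual reality.

Result: The system improves facial animation accuracy over existing offline baselines with 100 to 1000 times faster inference, and is validated in scenarios such as multilingual speech.

Insight: The system demonstrates the potential of diffusion models for real-time facial animation, and its architectural innovations address the challenges of real-time processing, providing a new tool for social interaction in virtual reality.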

Abstract: We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.


cs.AI [Back]

[140] ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models cs.AI | cs.CLPDF

Dongqi Zheng

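TL;DR: ARS is a training-free reasoning suppression method that dynamically monitors and suppresses redundant reasoning steps to make LRLMs more efficient while maintaining or improving accuracy.

Details

Motivation: Large Reasoning Language Models (LRLMs) excel at complex tasks but are computationally inefficient because of "overthinking". Existing methods struggle to balance reasoning quality against efficiency.

Result: On mathematical reasoning benchmarks, ARS reduces tokens, latency, and energy by up to 53%, 46.1%, and 57.9% respectively, while maintaining or improving accuracy.

Insight: ARS shows that dynamically adjusting reasoning steps can improve efficiency while avoiding the performance loss of static suppression methods.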

Abstract: Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose Adaptive Reasoning Suppression (ARS), a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves up to 53%, 46.1%, and 57.9% in token, latency and energy reduction, while maintaining or improving accuracy.


[141] ACON: Optimizing Context Compression for Long-horizon LLM Agents cs.AI | cs.CLPDF

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz

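TL;DR: ACON proposes a context compression optimization framework for long-horizon LLM agents that substantially reduces memory usage while preserving task performance.

Details

Motivation: As LLM agents are increasingly deployed in dynamic environments, the cost and efficiency problems of long contexts become a key challenge. Existing compression methods mainly target single-step tasks or narrow applications and lack support for long-horizon tasks.

Result: On AppWorld, OfficeBench, and Multi-objective QA, ACON reduces memory usage (peak tokens) by 26-54% while largely preserving task performance. Distilled compressors preserve over 95% of accuracy, and smaller LMs improve by up to 46% as long-horizon agents.

Insight: ACON shows that context compression and distillation can balance efficiency and performance for LLMs on long-horizon tasks, offering a new direction for agent deployment.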

Abstract: Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.


[142] Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling cs.AI | cs.CLPDF

Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych

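TL;DR: The paper proposes SMDS, a model-agnostic method for automatically discovering feature manifolds in language models, revealing the geometric structure of these manifolds and their functional role in reasoning.

Details

Motivation: Existing methods focus on discovering specific geometries for specific features and therefore lack generality. SMDS aims to discover feature manifolds automatically in order to understand how language models encode and use concepts.

Result: In a temporal reasoning case study, SMDS identifies varied geometric structures such as circles, lines, and clusters. These structures are stable across model families and sizes, actively support reasoning, and reshape dynamically in response to context changes.

Insight: Feature manifolds not only encode the properties of concepts but also dynamically support reasoning, suggesting that language models perform entity-based reasoning over structured representations.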

Abstract: The linear representation hypothesis states that language models (LMs) encode concepts as directions in their latent space, forming organized, multidimensional manifolds. Prior efforts focus on discovering specific geometries for specific features, and thus lack generalization. We introduce Supervised Multi-Dimensional Scaling (SMDS), a model-agnostic method to automatically discover feature manifolds. We apply SMDS to temporal reasoning as a case study, finding that different features form various geometric structures such as circles, lines, and clusters. SMDS reveals many insights on these structures: they consistently reflect the properties of the concepts they represent; are stable across model families and sizes; actively support reasoning in models; and dynamically reshape in response to context changes. Together, our findings shed light on the functional role of feature manifolds, supporting a model of entity-based reasoning in which LMs encode and transform structured representations.


[143] VIRTUE: Visual-Interactive Text-Image Universal Embedder cs.AI | cs.CVPDF

Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji

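TL;DR: VIRTUE is a novel visual-interactive text-image universal embedder. By combining the strengths of a segmentation model and a vision-language model, it embeds user-specified regions precisely and achieves state-of-the-art performance on the large-scale SCaR benchmark.

Details

Motivation: Existing embedding models lack visual-interactive capabilities and cannot handle user-specified regions of interest (e.g., point, bounding box, mask), which limits their use for localized grounding of user intent and in multimodal tasks.

Result: Improvements of 3.1%-8.5% across 36 universal MMEB tasks and 15.2%-20.3% across five visual-interactive SCaR tasks.

Insight: Region-level embeddings enabled by visual interaction can markedly improve performance in complex scenes and offer a new direction for multimodal tasks.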

Abstract: Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves a state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.


[144] Batch-CAM: Introduction to better reasoning in convolutional deep learning models cs.AI | cs.CV | 68 | I.2; I.4PDF

Giacomo Ignesti, Davide Moroni, Massimo Martinelli

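TL;DR: This paper proposes Batch-CAM, a training paradigm that combines a batch implementation of the Grad-CAM algorithm with a prototypical reconstruction loss to improve the performance and interpretability of deep learning models.

Details

Motivation: In high-stakes fields such as healthcare, the transparency and interpretability of deep learning models are essential. Traditional methods often struggle to balance accuracy and interpretability, so a new training paradigm is needed.

Result: Experiments show that Batch-CAM improves both accuracy and image reconstruction quality while reducing training and inference times.

Insight: Batch-CAM offers an effective way to build more transparent, explainable, and trustworthy AI systems, especially in fields that demand both high accuracy and interpretability.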

Abstract: Understanding the inner workings of deep learning models is crucial for advancing artificial intelligence, particularly in high-stakes fields such as healthcare, where accurate explanations are as vital as precision. This paper introduces Batch-CAM, a novel training paradigm that fuses a batch implementation of the Grad-CAM algorithm with a prototypical reconstruction loss. This combination guides the model to focus on salient image features, thereby enhancing its performance across classification tasks. Our results demonstrate that Batch-CAM achieves a simultaneous improvement in accuracy and image reconstruction quality while reducing training and inference times. By ensuring models learn from evidence-relevant information, this approach makes a relevant contribution to building more transparent, explainable, and trustworthy AI systems.
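
As background for the Grad-CAM component, the sketch below computes a standard Grad-CAM map for a single image from precomputed feature maps and gradients. It shows only the well-known base algorithm: channel weights are the spatial means of the gradients, and the map is the ReLU of the weighted sum of feature maps. Batch-CAM's batched variant and its prototypical reconstruction loss are not reproduced here; the function name and the toy tensors are assumptions made for illustration.

```python
def grad_cam(activations, gradients):
    """Standard Grad-CAM map for one image.

    activations -- feature maps A[k][i][j] from a convolutional layer
    gradients   -- d(class score)/dA[k][i][j], same shape as activations
    Returns an H x W map: ReLU(sum_k alpha_k * A[k]).
    """
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # alpha_k: global average of the gradients over spatial positions.
    alphas = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(channels)
    ]
    # Weighted sum over channels, followed by ReLU.
    return [
        [
            max(0.0, sum(alphas[k] * activations[k][i][j]
                         for k in range(channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]


if __name__ == "__main__":
    # Hypothetical 2-channel, 2x2 feature maps and gradients.
    acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
    for row in grad_cam(acts, grads):
        print(row)
```

In practice the activations and gradients come from a framework's autograd hooks on a chosen convolutional layer, and the map is upsampled to the input resolution before being overlaid on the image.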


cs.MM [Back]

[145] Object-AVEdit: An Object-level Audio-Visual Editing Model cs.MM | cs.AI | cs.CV | cs.SD | eess.ASPDF

Youquan Fu, Ruiyang Si, Hongfa Wang, Dongzhan Zhou, Jiacheng Sun

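TL;DR: Object-AVEdit proposes an object-level audio-visual editing model based on the inversion-regeneration paradigm, addressing the weakness of existing models at cross-modal object-level operations.

Details

Motivation: Current audio and video editing models struggle with object-level cross-modal editing, especially adding, replacing, and removing objects while preserving the structural information of the source instances.

Result: Experiments show strong results on object-level editing tasks, and the audio generation model also achieves advanced performance.

Insight: Cross-modal alignment and holistic optimization enable finer-grained object-level audio-visual editing.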

Abstract: There is a high demand for audio-visual editing in video post-production and the film making field. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to perform object addition, replacement, and removal across both audio and visual modalities, while preserving the structural information of the source instances during the editing process. In this paper, we present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model, bridging the gap in object-controllability between audio and current video generation models. Meanwhile, to achieve better structural information preservation and object-level editing effect, we propose an inversion-regeneration holistically-optimized editing algorithm, ensuring both information retention during the inversion and better regeneration effect. Extensive experiments demonstrate that our editing model achieves advanced results in audio-video object-level editing tasks with fine audio-visual semantic alignment. In addition, our audio generation model also achieves advanced performance. More results on our project page: https://gewu-lab.github.io/Object_AVEdit-website/.


eess.SP

[146] WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities eess.SP | cs.AI | cs.CL | cs.LG | q-bio.NC

Ziyi Zeng, Zhenyang Cai, Yixi Cai, Xidong Wang, Junying Chen

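TL;DR: This paper proposes WaveMind, a multimodal large language model for EEG signals that aligns EEG with textual and visual modalities to achieve generalized interpretation of brain signals, and introduces a new dataset, WaveMind-Instruct-338k, for instruction tuning.

Details

Motivation: Multimodal analysis of EEG signals is challenging because they simultaneously encode cognitive processes and intrinsic neural states, which makes cross-modal representation learning inefficient. The authors aim to improve generalized interpretation by aligning EEG with the semantic space of other modalities.

Result: The model performs well across four downstream tasks and supports flexible, open-ended conversation, offering a valuable approach for general-purpose EEG models and neuroscience research.

Insight: Multimodal alignment of EEG signals offers a new direction for cross-modal learning; a unified semantic space makes complex brain activity easier to interpret.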

Abstract: Electroencephalography (EEG) interpretation using multimodal large language models (MLLMs) offers a novel approach for analyzing brain signals. However, the complex nature of brain activity introduces critical challenges: EEG signals simultaneously encode both cognitive processes and intrinsic neural states, creating a mismatch in EEG paired-data modality that hinders effective cross-modal representation learning. Through a pivot investigation, we uncover complementary relationships between these modalities. Leveraging this insight, we propose mapping EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. To fully enable conversational capabilities, we further introduce WaveMind-Instruct-338k, the first cross-task EEG dataset for instruction tuning. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations across four downstream tasks, thereby offering valuable insights for both neuroscience research and the development of general-purpose EEG models.


cs.SD

[147] Unpacking Musical Symbolism in Online Communities: Content-Based and Network-Centric Approaches cs.SD | cs.CL | cs.CY | cs.MM | eess.AS

Kajwan Ziaoddini

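TL;DR: Combining content analysis with a lightweight network perspective, this paper studies how musical symbolism circulates in online communities, revealing a decline in energy, a rise in danceability, and systematic associations between mood and genre.

Details

Motivation: To explore how musical symbolism is produced and circulated in online communities, combining music content analysis with lyric network analysis to reveal the relationship between musical features and community culture.

Result: Energy declines while danceability rises; mood varies by genre (R&B is the most positive); pronouns make up a large share of the lyrics.

Insight: Mainstream musical preference is shifting toward relaxed yet rhythmically engaging productions, reflecting changes in community culture and the influence of commercialization.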

Abstract: This paper examines how musical symbolism is produced and circulated in online communities by combining content-based music analysis with a lightweight network perspective on lyrics. Using a curated corpus of 275 chart-topping songs enriched with audio descriptors (energy, danceability, loudness, liveness, valence, acousticness, speechiness, popularity) and full lyric transcripts, we build a reproducible pipeline that (i) quantifies temporal trends in sonic attributes, (ii) models lexical salience and co-occurrence, and (iii) profiles mood by genre. We find a decade-long decline in energy (79 -> 58) alongside a rise in danceability (59 -> 73); valence peaks in 2013 (63) and dips in 2014-2016 (42) before partially recovering. Correlation analysis shows strong coupling of energy with loudness (r = 0.74) and negative associations for acousticness with both energy (r = -0.54) and loudness (r = -0.51); danceability is largely orthogonal to other features (|r| < 0.20). Lyric tokenization (>114k tokens) reveals a pronoun-centric lexicon “I/you/me/my” and a dense co-occurrence structure in which interpersonal address anchors mainstream narratives. Mood differs systematically by style: R&B exhibits the highest mean valence (96), followed by K-Pop/Pop (77) and Indie/Pop (70), whereas Latin/Reggaeton is lower (37) despite high danceability. Read through a subcultural identity lens, these patterns suggest the mainstreaming of previously peripheral codes and a commercial preference for relaxed yet rhythmically engaging productions that sustain collective participation without maximal intensity. Methodologically, we contribute an integrated MIR-plus-network workflow spanning summary statistics, correlation structure, lexical co-occurrence matrices, and genre-wise mood profiling that is robust to modality sparsity and suitable for socially aware recommendation or community-level diffusion studies.
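
The correlation figures quoted in the abstract (for example r = 0.74 between energy and loudness) are Pearson coefficients computed over per-song audio descriptors. The following is a minimal sketch of that computation, not the authors' pipeline; the function name `pearson_r` and the toy values are assumptions made up for illustration and are not data from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy, made-up descriptor values (NOT from the paper's corpus).
toy_energy = [50.0, 60.0, 70.0, 80.0]
toy_loudness = [-9.0, -7.5, -6.0, -4.5]
r = pearson_r(toy_energy, toy_loudness)  # perfectly linear toy data
```

Applied to each pair of descriptor columns, a function like this yields the kind of correlation matrix the abstract summarizes.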


[148] When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models cs.SD | cs.CL

Chen-An Li, Tzu-Han Lin, Hung-yi Lee

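TL;DR: Large audio-language models (LALMs) are fragile in real-world noisy environments. The study finds that irrelevant audio (such as silence or noise) significantly interferes with text reasoning tasks: even when the audio is unnecessary, it lowers accuracy and increases prediction volatility.

Details

Motivation: To examine how distracting audio affects LALMs on tasks where audio is irrelevant, in particular model robustness in noisy real-world settings.

Result: Irrelevant audio clearly reduces accuracy and increases prediction volatility, and silence interferes as much as noise. Larger models are more robust, but the problem persists across models. Among mitigation strategies, self-consistency is effective but computationally expensive.

Insight: Cross-modal interference is a key challenge for LALM robustness, underscoring the need for efficient fusion strategies that preserve reasoning performance under irrelevant inputs.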

Abstract: Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.
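
Self-consistency, the mitigation the abstract reports as effective but costly, generally means sampling several answers and keeping the majority. The sketch below illustrates that general idea under stated assumptions: `self_consistent_answer` and `toy_sampler` are hypothetical names, the sampler merely replays a fixed sequence in place of a real stochastic model call, and the disagreement rate is one simple proxy for volatility rather than the paper's metric.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def self_consistent_answer(
    sample: Callable[[], str], n_samples: int = 5
) -> Tuple[str, float]:
    """Draw n_samples answers; return (majority answer, disagreement rate)."""
    answers = [sample() for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    # Share of samples that disagree with the majority answer.
    return winner, 1 - count / n_samples


# Hypothetical stand-in for a stochastic LALM call (assumption for
# illustration): it replays a fixed answer sequence.
_replay = cycle(["B", "B", "C", "B", "A"])


def toy_sampler() -> str:
    return next(_replay)


answer, disagreement = self_consistent_answer(toy_sampler, n_samples=5)
```

The extra computation noted in the abstract is visible here: each question costs `n_samples` model calls instead of one.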


[149] Hearing the Order: Investigating Selection Bias in Large Audio-Language Models cs.SD | cs.CL

Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen

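TL;DR: This paper investigates whether large audio-language models (LALMs) exhibit order bias in selection tasks and shows experimentally that all evaluated models are affected. Reordering options can change performance substantially (by up to 24%), and permutation-based strategies can mitigate the bias.

Details

Motivation: LALMs are widely used in tasks with ordered options, but it is unclear whether their predictions depend on option order. Such a bias would undermine model reliability, so a systematic study is needed.

Result: Experiments show that reordering options can cause performance fluctuations of up to 24% and even change model rankings. Permutation strategies mitigate the bias in most cases.

Insight: Current LALM evaluation practices may be unreliable because of order bias, which researchers should take into account. Permutation strategies are a potential solution, but further research is needed.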

Abstract: Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.
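
One common form of permutation-based mitigation is to query the model under every ordering of the options and vote over the option texts rather than their positions. The abstract does not specify which variants the authors study, so the sketch below is only an illustration of that general idea; `permutation_vote` and the position-biased `toy_model` are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List

# A model takes a question and an ordered option list, and returns the
# index of the option it picks (hypothetical interface for illustration).
Model = Callable[[str, List[str]], int]


def permutation_vote(model: Model, question: str, options: List[str]) -> str:
    """Query the model on every option ordering; majority-vote on option text."""
    votes: Counter = Counter()
    for order in permutations(options):
        picked = model(question, list(order))
        votes[order[picked]] += 1  # map the position back to the option text
    return votes.most_common(1)[0][0]


def toy_model(question: str, options: List[str]) -> int:
    # Hypothetical position-biased model: it only "notices" the right answer
    # in the first two slots, otherwise it defaults to the first option.
    for i, opt in enumerate(options[:2]):
        if opt == "Paris":
            return i
    return 0


options = ["Rome", "Madrid", "Paris"]
single = options[toy_model("Capital of France?", options)]
voted = permutation_vote(toy_model, "Capital of France?", options)
```

In this toy setup the single query falls for the position bias while the vote recovers the intended answer. Enumerating all orderings costs n! queries, so a practical variant would likely sample a subset of permutations.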


cs.IR

[150] Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs cs.IR | cs.AI | cs.CL

Roksana Goworek, Olivia Macmillan-Scott, Eda B. Özyiğit

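TL;DR: This paper surveys the evolution of cross-lingual information retrieval (CLIR), from early translation-based methods to current techniques based on embeddings and multilingual large language models (LLMs). It summarizes core components, evaluation methods, and resources, and points out future directions.

Details

Motivation: CLIR addresses the challenge of retrieving documents across languages. Traditional methods rely on translation and have limitations. With the emergence of multilingual LLMs, embedding and generative techniques offer new solutions that call for systematic review and outlook.

Result: The survey indicates that methods based on embeddings and multilingual LLMs improve CLIR performance significantly, particularly in answer generation and semantic alignment.

Insight: 1. Aligning representations across languages is a core challenge for multilingual LLMs; 2. Future CLIR systems need to be more robust, inclusive, and adaptable; 3. Data imbalance needs to be addressed through more balanced resource allocation.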

Abstract: Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Research in this area has typically framed the task as monolingual retrieval augmented by translation, treating retrieval methods and cross-lingual capabilities in isolation. Both monolingual and cross-lingual retrieval usually follow a pipeline of query expansion, ranking, re-ranking and, increasingly, question answering. Recent advances, however, have shifted from translation-based methods toward embedding-based approaches and leverage multilingual large language models (LLMs), for which aligning representations across languages remains a central challenge. The emergence of cross-lingual embeddings and multilingual LLMs has introduced a new paradigm, offering improved retrieval performance and enabling answer generation. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques. It presents a structured account of core CLIR components, evaluation practices, and available resources. Persistent challenges such as data imbalance and linguistic variation are identified, while promising directions are suggested for advancing equitable and effective cross-lingual information retrieval. By situating CLIR within the broader landscape of information retrieval and multilingual language processing, this work not only reviews current capabilities but also outlines future directions for building retrieval systems that are robust, inclusive, and adaptable.