cs.CV [Total: 68]
cs.CL [Total: 48]
cs.RO [Total: 2]
quant-ph [Total: 1]
cs.AI [Total: 8]
eess.IV [Total: 10]
cs.SE [Total: 3]
cs.LG [Total: 12]
cs.GR [Total: 2]
cs.MA [Total: 1]
eess.AS [Total: 3]

cs.CV [Back]

[1] EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices cs.CVPDF

Shaibal Saha, Lanyu Xu

TL;DR: EfficientQuant针对CNN-Transformer混合模型提出了一种高效的后训练量化方法，通过分块量化策略显著降低边缘设备的延迟和内存需求。

Details

Motivation: 混合模型在CV任务中表现优异，但其资源需求高，难以在边缘设备上部署。现有的后训练量化方法对混合模型支持有限。

Result: 在ImageNet-1K数据集上实现了2.5×到8.7×的延迟降低，且精度损失最小，同时在实际边缘设备上展现了低延迟和高内存效率。

Insight: 针对混合模型不同组件的特性设计分块量化策略，可以显著提升量化效率和实用性。

Abstract: Hybrid models that combine convolutional and transformer blocks offer strong performance in computer vision (CV) tasks but are resource-intensive for edge deployment. Although post-training quantization (PTQ) can help reduce resource demand, its application to hybrid models remains limited. We propose EfficientQuant, a novel structure-aware PTQ approach that applies uniform quantization to convolutional blocks and $log_2$ quantization to transformer blocks. EfficientQuant achieves $2.5 \times - 8.7 \times$ latency reduction with minimal accuracy loss on the ImageNet-1K dataset. It further demonstrates low latency and memory efficiency on edge devices, making it practical for real-world deployment.

[2] Image-Based Method For Measuring And Classification Of Iron Ore Pellets Using Star-Convex Polygons cs.CVPDF

Artem Solomko, Oleg Kartashev, Andrey Golov, Mikhail Deulin, Vadim Valynkin

TL;DR: 论文提出了一种基于StarDist算法的图像测量与分类方法，用于检测铁矿球团的质量问题，解决了传统方法在密集和不稳定环境中的不足。

Details

Motivation: 铁矿球团的质量控制中，尺寸分布和分类是关键指标，而传统方法如ViT、Mask R-CNN等在密集和不稳定环境中的表现不佳，亟需更准确的解决方案。

Result: 新方法显著提高了铁矿球团的尺寸测量和分类准确性，优于传统算法。

Insight: 医学领域的算法（如StarDist）可迁移到工业场景，解决传统计算机视觉方法在复杂环境中的局限性。

Abstract: We would like to present a comprehensive study on the classification of iron ore pellets, aimed at identifying quality violations in the final product, alongside the development of an innovative imagebased measurement method utilizing the StarDist algorithm, which is primarily employed in the medical field. This initiative is motivated by the necessity to accurately identify and analyze objects within densely packed and unstable environments. The process involves segmenting these objects, determining their contours, classifying them, and measuring their physical dimensions. This is crucial because the size distribution and classification of pellets such as distinguishing between nice (quality) and joint (caused by the presence of moisture or indicating a process of production failure) types are among the most significant characteristics that define the quality of the final product. Traditional algorithms, including image classification techniques using Vision Transformer (ViT), instance segmentation methods like Mask R-CNN, and various anomaly segmentation algorithms, have not yielded satisfactory results in this context. Consequently, we explored methodologies from related fields to enhance our approach. The outcome of our research is a novel method designed to detect objects with smoothed boundaries. This advancement significantly improves the accuracy of physical dimension measurements and facilitates a more precise analysis of size distribution among the iron ore pellets. By leveraging the strengths of the StarDist algorithm, we aim to provide a robust solution that addresses the challenges posed by the complex nature of pellet classification and measurement.

[3] Gender Fairness of Machine Learning Algorithms for Pain Detection cs.CV | cs.LGPDF

Dylan Green, Yuting Shang, Jiaee Cheong, Yang Liu, Hatice Gunes

TL;DR: 该论文研究了机器学习算法在疼痛检测中的性别公平性，发现尽管ViT模型表现最佳，但所有模型均存在性别偏见，强调了在医疗自动化系统中平衡准确性与公平性的必要性。

Details

Motivation: 自动疼痛检测在医疗领域潜力巨大，但现有算法在不同人口统计群体（如性别）中的公平性研究不足。论文旨在填补这一空白，分析算法在性别上的公平性问题。

Result: ViT表现最佳，但所有模型均存在性别偏见。结果凸显了准确性与公平性之间的权衡问题。

Insight: 论文揭示了医疗自动化系统中算法偏见的普遍性，呼吁采用公平感知技术来减少偏见，尤其是在涉及敏感特征的场景中。

Abstract: Automated pain detection through machine learning (ML) and deep learning (DL) algorithms holds significant potential in healthcare, particularly for patients unable to self-report pain levels. However, the accuracy and fairness of these algorithms across different demographic groups (e.g., gender) remain under-researched. This paper investigates the gender fairness of ML and DL models trained on the UNBC-McMaster Shoulder Pain Expression Archive Database, evaluating the performance of various models in detecting pain based solely on the visual modality of participants’ facial expressions. We compare traditional ML algorithms, Linear Support Vector Machine (L SVM) and Radial Basis Function SVM (RBF SVM), with DL methods, Convolutional Neural Network (CNN) and Vision Transformer (ViT), using a range of performance and fairness metrics. While ViT achieved the highest accuracy and a selection of fairness metrics, all models exhibited gender-based biases. These findings highlight the persistent trade-off between accuracy and fairness, emphasising the need for fairness-aware techniques to mitigate biases in automated healthcare systems.

[4] JAFAR: Jack up Any Feature at Any Resolution cs.CV | eess.IVPDF

Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord

TL;DR: JAFAR是一种轻量级且灵活的特征上采样器，能够将任何基础视觉编码器的低分辨率空间特征提升到任意目标分辨率，并通过注意力模块和SFT调制实现语义对齐。

Details

Motivation: 基础视觉编码器的低分辨率输出难以满足下游任务对高分辨率模态的需求，因此需要一种通用的特征上采样方法。

Result: 实验表明，JAFAR在恢复细粒度空间细节方面优于现有方法，并在多种下游任务中表现优异。

Insight: 低分辨率和小上采样比的训练可以泛化到更高分辨率的上采样任务，表明语义对齐是提升分辨率的关键。

Abstract: Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io

[5] Autonomous Computer Vision Development with Agentic AI cs.CV | cs.AI | cs.MAPDF

Jin Kim, Muhammad Wahi-Anwa, Sangyun Park, Shawn Shin, John M. Hoffman

TL;DR: 论文提出了一种基于Agentic AI的自主计算机视觉开发方法，利用LLM（大语言模型）从自然语言提示中自动化生成和配置医学图像分析工具，并在胸片分割任务中展示了高效性能。

Details

Motivation: 传统计算机视觉应用的开发需要数据科学家手动配置工具和流程，耗时且复杂。本研究旨在通过Agentic AI实现自动化工具配置和任务分解，降低开发门槛。

Result: 在50张胸片上的实验结果显示，系统对肺部、心脏和肋骨的分割Dice分数分别达到0.96、0.82和0.83，展示了高效且精准的自动化能力。

Insight: 研究证明了Agentic AI在计算机视觉任务中的潜力，能够替代传统数据科学家的部分工作，为自动化工具配置和任务规划提供了新思路。

Abstract: Agentic Artificial Intelligence (AI) systems leveraging Large Language Models (LLMs) exhibit significant potential for complex reasoning, planning, and tool utilization. We demonstrate that a specialized computer vision system can be built autonomously from a natural language prompt using Agentic AI methods. This involved extending SimpleMind (SM), an open-source Cognitive AI environment with configurable tools for medical image analysis, with an LLM-based agent, implemented using OpenManus, to automate the planning (tool configuration) for a particular computer vision task. We provide a proof-of-concept demonstration that an agentic system can interpret a computer vision task prompt, plan a corresponding SimpleMind workflow by decomposing the task and configuring appropriate tools. From the user input prompt, “provide sm (SimpleMind) config for lungs, heart, and ribs segmentation for cxr (chest x-ray)”), the agent LLM was able to generate the plan (tool configuration file in YAML format), and execute SM-Learn (training) and SM-Think (inference) scripts autonomously. The computer vision agent automatically configured, trained, and tested itself on 50 chest x-ray images, achieving mean dice scores of 0.96, 0.82, 0.83, for lungs, heart, and ribs, respectively. This work shows the potential for autonomous planning and tool configuration that has traditionally been performed by a data scientist in the development of computer vision applications.

[6] FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation cs.CV | cs.LG | eess.IVPDF

Ebenezer Tarubinga, Jenifer Kalafatovich

TL;DR: 本文提出了一种名为FARCLUSS的框架，通过模糊伪标签、不确定性动态加权、自适应类别再平衡和轻量对比正则化，解决了半监督语义分割中未标记数据的低效利用问题，显著提升了性能。

Details

Motivation: 半监督语义分割（SSSS）面临未标记数据的低效利用、类别不平衡偏见的加剧以及对预测不确定性的忽视等问题。现有方法通常通过严格阈值丢弃不确定区域，而本文旨在将不确定性转化为学习资源。

Result: 在基准测试中，该方法显著优于现有技术，尤其在少数类和模糊区域的细分上表现优异。

Insight: 将不确定性转化为学习资源，并通过动态调整和正则化方法解决类别不平衡问题，是提升半监督语义分割性能的关键。

Abstract: Semi-supervised semantic segmentation (SSSS) faces persistent challenges in effectively leveraging unlabeled data, such as ineffective utilization of pseudo-labels, exacerbation of class imbalance biases, and neglect of prediction uncertainty. Current approaches often discard uncertain regions through strict thresholding favouring dominant classes. To address these limitations, we introduce a holistic framework that transforms uncertainty into a learning asset through four principal components: (1) fuzzy pseudo-labeling, which preserves soft class distributions from top-K predictions to enrich supervision; (2) uncertainty-aware dynamic weighting, that modulate pixel-wise contributions via entropy-based reliability scores; (3) adaptive class rebalancing, which dynamically adjust losses to counteract long-tailed class distributions; and (4) lightweight contrastive regularization, that encourage compact and discriminative feature embeddings. Extensive experiments on benchmarks demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements in the segmentation of under-represented classes and ambiguous regions.

[7] On the development of an AI performance and behavioural measures for teaching and classroom management cs.CV | H.5; J.4; I.2.7; I.2.10PDF

Andreea I. Niculescu, Jochen Ehnen, Chen Yi, Du Jiawei, Tay Chiat Pin

TL;DR: 这篇论文介绍了一个为期两年的研究项目，旨在开发基于AI的课堂动态分析工具，通过多模态传感器数据捕捉教师行为，支持教师发展。关键成果包括一个经过整理的视听数据集、新颖的行为测量方法以及一个概念验证的教学回顾仪表盘。

Details

Motivation: 研究动机是通过AI技术减轻教师手动分析课堂互动的负担，并提供客观的、非评判性的反馈，帮助教师改进教学策略。

Result: 初步评估显示，系统具有清晰性和可用性，其自动化的非评判性分析减少了手动工作量，促进了教师的积极反思。

Insight: 这项研究展示了AI在教育领域的潜力，尤其是通过客观数据支持教师反思和改进教学策略，同时为跨文化教育研究提供了方法论借鉴。

Abstract: This paper presents a two-year research project focused on developing AI-driven measures to analyze classroom dynamics, with particular emphasis on teacher actions captured through multimodal sensor data. We applied real-time data from classroom sensors and AI techniques to extract meaningful insights and support teacher development. Key outcomes include a curated audio-visual dataset, novel behavioral measures, and a proof-of-concept teaching review dashboard. An initial evaluation with eight researchers from the National Institute for Education (NIE) highlighted the system’s clarity, usability, and its non-judgmental, automated analysis approach – which reduces manual workloads and encourages constructive reflection. Although the current version does not assign performance ratings, it provides an objective snapshot of in-class interactions, helping teachers recognize and improve their instructional strategies. Designed and tested in an Asian educational context, this work also contributes a culturally grounded methodology to the growing field of AI-based educational analytics.

[8] AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation cs.CVPDF

Chao Liang, Jianwen Jiang, Wang Liao, Jiaqi Yang, Zerong zheng

TL;DR: AlignHuman提出了一种通过分段时间步偏好优化（TPO）和专家对齐模块（LoRAs）来联合优化动作自然度和视觉保真度的框架，显著提升了音频驱动的人体动画生成效果。

Details

Motivation: 现有的人体动画生成方法在动作自然度和视觉保真度之间存在权衡，难以同时优化。

Result: 实验表明AlignHuman在保持生成质量的同时，将推理NFEs从100降低到30，实现了3.3倍加速。

Insight: 去噪过程的时间步特性可以分阶段优化，动作自然度和视觉保真度的优化可以解耦。

Abstract: Recent advancements in human video generation and animation tasks, driven by diffusion models, have achieved significant progress. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose \textbf{AlignHuman}, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy to jointly optimize these competing objectives. Our key insight stems from an analysis of the denoising process across timesteps: (1) early denoising timesteps primarily control motion dynamics, while (2) fidelity and human structure can be effectively managed by later timesteps, even if early steps are skipped. Building on this observation, we propose timestep-segment preference optimization (TPO) and introduce two specialized LoRAs as expert alignment modules, each targeting a specific dimension in its corresponding timestep interval. The LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Extensive experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3$\times$ speedup (from 100 NFEs to 30 NFEs) with minimal impact on generation quality. Homepage: \href{https://alignhuman.github.io/}{https://alignhuman.github.io/}

[9] 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks cs.CVPDF

Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu

TL;DR: 该论文介绍了3D-RAD，一个基于CT扫描的大规模3D医学视觉问答（Med-VQA）数据集，支持多种诊断任务和多时相分析，旨在推动3D医学视觉理解的研究。

Details

Motivation: 现有Med-VQA研究主要集中于2D影像且任务多样性有限，无法满足临床中对3D影像和多时相分析的需求。

Result: 实验表明现有VLM模型（尤其是医学VLM）在多时相任务中泛化能力有限，微调能显著提升性能。

Insight: 3D-RAD为3D医学视觉问答研究提供了新基准，揭示了当前模型在复杂真实诊断任务中的不足，为未来研究指明了方向。

Abstract: Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available at https://github.com/Tang-xiaoxiao/M3D-RAD.

[10] LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs cs.CV | cs.LGPDF

Melvin Wong, Yueming Lyu, Thiago Rios, Stefan Menzel, Yew-Soon Ong

TL;DR: LLM-to-Phy3D 是一种基于物理约束的在线3D对象生成方法，通过协同视觉和物理评估的反馈循环，增强了大型语言模型（LLMs）生成物理可行3D对象的能力。

Details

Motivation: 现有的LLM-to-3D模型缺乏物理知识，导致生成的对象与真实物理约束脱节，这在工程设计领域尤为重要。本文旨在填补这一空白。

Result: 在车辆设计优化任务中，LLM-to-Phy3D将物理可行性提高了4.5%至106.7%，优于传统LLM-to-3D模型。

Insight: LLM-to-Phy3D展示了在工程设计等物理AI应用中的潜力，为生成式AI在实际场景中的落地提供了新思路。

Abstract: The emergence of generative artificial intelligence (GenAI) and large language models (LLMs) has revolutionized the landscape of digital content creation in different modalities. However, its potential use in Physical AI for engineering design, where the production of physically viable artifacts is paramount, remains vastly underexplored. The absence of physical knowledge in existing LLM-to-3D models often results in outputs detached from real-world physical constraints. To address this gap, we introduce LLM-to-Phy3D, a physically conform online 3D object generation that enables existing LLM-to-3D models to produce physically conforming 3D objects on the fly. LLM-to-Phy3D introduces a novel online black-box refinement loop that empowers large language models (LLMs) through synergistic visual and physics-based evaluations. By delivering directional feedback in an iterative refinement process, LLM-to-Phy3D actively drives the discovery of prompts that yield 3D artifacts with enhanced physical performance and greater geometric novelty relative to reference objects, marking a substantial contribution to AI-driven generative design. Systematic evaluations of LLM-to-Phy3D, supported by ablation studies in vehicle design optimization, reveal various LLM improvements gained by 4.5% to 106.7% in producing physically conform target domain 3D designs over conventional LLM-to-3D models. The encouraging results suggest the potential general use of LLM-to-Phy3D in Physical AI for scientific and engineering applications.

[11] Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels cs.CV | cs.HCPDF

Jonathan Grizou, Carlos de la Torre-Ortiz, Tuukka Ruotsalo

TL;DR: 论文提出了一个自校准BCI框架CURSOR，无需标签数据或预训练的解码器，即可从EEG和图像数据中恢复未知的心理目标。实验表明，CURSOR能预测与人类感知判断相关的图像相似度分数，并生成接近目标的心理图像。

Details

Motivation: 传统BCI系统依赖标签数据或预训练解码器，限制了在无标签场景下的应用。本文旨在开发一种自校准方法，直接在无监督条件下恢复心理目标。

Result: 在自然图像实验中，CURSOR能生成接近心理目标的图像（用户研究验证，N=53），且预测的相似度分数与人类感知判断一致。

Insight: 无监督BCI系统在恢复心理目标方面具有潜力，自校准方法可为未来脑机接口研究提供新方向。

Abstract: We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N=53).

[12] SLRNet: A Real-Time LSTM-Based Sign Language Recognition System cs.CV | 68T07 (Artificial Intelligence), 68U10 (Image Processing)PDF

Sharvari Kamble

TL;DR: 本文介绍了一种基于LSTM的实时手语识别系统SLRNet，结合MediaPipe Holistic技术和LSTM网络，通过摄像头实时识别美国手语（ASL）字母和功能词，验证准确率达86.7%。

Details

Motivation: 解决听力障碍群体与社会的沟通障碍，提供了一种硬件无关的实时手语识别方案。

Result: 在验证集上达到86.7%的准确率。

Insight: LSTM网络在处理时序手势数据时表现优异，硬件无关的设计使得系统更具包容性。

Abstract: Sign Language Recognition (SLR) plays a crucial role in bridging the communication gap between the hearing-impaired community and society. This paper introduces SLRNet, a real-time webcam-based ASL recognition system using MediaPipe Holistic and Long Short-Term Memory (LSTM) networks. The model processes video streams to recognize both ASL alphabet letters and functional words. With a validation accuracy of 86.7%, SLRNet demonstrates the feasibility of inclusive, hardware-independent gesture recognition.

[13] Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search cs.CVPDF

Linhao Yu, Xinguang Ji, Yahui Liu, Fanheng Kong, Chenxi Sun

TL;DR: 论文提出了一个名为AutoCaption的自动框架，通过蒙特卡洛树搜索（MCTS）生成多样化的视频描述句子，以全面评估多模态大语言模型（MLLMs）的视频字幕能力。基于此构建的MCTS-VCB基准测试显示，Gemini-1.5-Pro表现最优，同时AutoCaption生成的数据显著提升了模型性能。

Details

Motivation: 现有视频字幕评估基准存在关键问题，如关键点不足或同质化、数据创建成本高、评估范围有限。为了解决这些问题，需要一种自动化的方法来生成多样且全面的视频描述。

Result: Gemini-1.5-Pro在MCTS-VCB上的F1得分最高（71.2）。使用AutoCaption生成数据微调的InternVL2.5-8B，在MCTS-VCB和DREAM-1K上分别提升了25.0%和16.3%。

Insight: AutoCaption通过MCTS生成的多样化描述能更全面地评估MLLMs的视频理解能力，同时其生成的数据可以显著提升模型性能，为未来视频字幕任务提供了新的基准和优化方向。

Abstract: Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (\textit{i.e.}, key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects’ attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.

[14] Digitization of Document and Information Extraction using OCR cs.CV | cs.IRPDF

Rasha Sinha, Rekha B S

TL;DR: 论文提出了一种结合OCR与LLM的框架，用于从扫描和数字文档中提取结构化信息，显著提升了传统方法的灵活性和语义准确性。

Details

Motivation: 从文档中准确提取信息是一个关键任务，尤其是在处理扫描图像和原生数字格式混合的情况下。传统方法在灵活性和语义理解上存在不足，因此需要一种更智能的解决方案。

Result: 实验结果表明，该方法在准确性、布局识别和处理速度上优于传统的基于规则和模板的方法。

Insight: 结合OCR和LLM可以显著提升文档信息提取的语义精度和灵活性，适用于多种文档类型。

Abstract: Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character Recognition (OCR) techniques with Large Language Models (LLMs) to deliver structured outputs enriched by contextual understanding and confidence indicators. Scanned files are processed using OCR engines, while digital files are interpreted through layout-aware libraries. The extracted raw text is subsequently analyzed by an LLM to identify key-value pairs and resolve ambiguities. A comparative analysis of different OCR tools is presented to evaluate their effectiveness concerning accuracy, layout recognition, and processing speed. The approach demonstrates significant improvements over traditional rule-based and template-based methods, offering enhanced flexibility and semantic precision across different document categories

[15] VIBE: Can a VLM Read the Room? cs.CV | cs.LGPDF

Tania Chakraborty, Eylon Caplan, Dan Goldwasser

TL;DR: 本文提出了一种新的任务，即视觉社交-语用推理（Visual Social-Pragmatic Inference），以解决VLMs在理解非语言社交线索方面的局限性，并构建了一个高质量数据集来测试多种VLMs的性能。

Details

Motivation: 虽然LLMs在文本领域取得了显著进展，但它们无法捕捉非语言线索在社会行为中的重要作用。VLMs虽具备潜力，但其在社交推理方面的能力尚未得到充分研究。

Result: 通过实验，揭示了VLMs在视觉社交推理方面的局限性，并提供了基准性能数据。

Insight: VLMs在社交推理领域的潜力尚未被充分挖掘，未来的研究需要进一步改进模型以更好地理解和推断非语言社交线索。

Abstract: Understanding human social behavior such as recognizing emotions and the social dynamics causing them is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially account for this gap, however their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high quality dataset to test the abilities of a VLM for this task and benchmark the performance of several VLMs on it.

[16] Test-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning cs.CV | cs.AIPDF

Ji Young Byun, Young-Jin Park, Navid Azizan, Rama Chellappa

TL;DR: 该论文提出了一种零样本框架，通过测试时间缩放增强大型语言模型（LLMs）在医学影像诊断中的推理能力，避免了监督微调的需求。

Details

Motivation: 医学影像诊断中，监督微调因数据稀缺和高标注成本难以实现，而现有的LLMs在视觉问答任务中的推理能力尚未充分探索。

Result: 在多模态医学影像数据上验证了方法的有效性，显著提升了诊断准确性和可靠性。

Insight: 无偏提示结合测试时间缩放可以提升LLMs的诊断可靠性，展现了零样本方法在医学影像领域的潜力。

Abstract: As a cornerstone of patient care, clinical decision-making significantly influences patient outcomes and can be enhanced by large language models (LLMs). Although LLMs have demonstrated remarkable performance, their application to visual question answering in medical imaging, particularly for reasoning-based diagnosis, remains largely unexplored. Furthermore, supervised fine-tuning for reasoning tasks is largely impractical due to limited data availability and high annotation costs. In this work, we introduce a zero-shot framework for reliable medical image diagnosis that enhances the reasoning capabilities of LLMs in clinical settings through test-time scaling. Given a medical image and a textual prompt, a vision-language model processes a medical image along with a corresponding textual prompt to generate multiple descriptions or interpretations of visual features. These interpretations are then fed to an LLM, where a test-time scaling strategy consolidates multiple candidate outputs into a reliable final diagnosis. We evaluate our approach across various medical imaging modalities – including radiology, ophthalmology, and histopathology – and demonstrate that the proposed test-time scaling strategy enhances diagnostic accuracy for both our and baseline methods. Additionally, we provide an empirical analysis showing that the proposed approach, which allows unbiased prompting in the first stage, improves the reliability of LLM-generated diagnoses and enhances classification accuracy.

[17] BrainMAP: Multimodal Graph Learning For Efficient Brain Disease Localization cs.CV | cs.LG | cs.NEPDF

Nguyen Linh Dan Le, Jing Ren, Ciyuan Peng, Chengyao Xie, Bowen Li

TL;DR: BrainMAP是一种新型的多模态图学习框架，专注于高效定位和提取受神经退行性疾病影响的脑区域，显著降低了计算开销。

Details

Motivation: 现有基于图的方法通常无法在完整连接组中定位驱动神经退行性病变的特定脑区域，且多模态脑图模型计算复杂度高，难以在资源受限的设备中应用。

Result: BrainMAP在计算效率和预测准确性上均优于现有方法。

Insight: 专注于疾病相关的子图可以显著降低计算复杂度，同时跨模态的动态融合机制有助于提升模型性能。

Abstract: Recent years have seen a surge in research focused on leveraging graph learning techniques to detect neurodegenerative diseases. However, existing graph-based approaches typically lack the ability to localize and extract the specific brain regions driving neurodegenerative pathology within the full connectome. Additionally, recent works on multimodal brain graph models often suffer from high computational complexity, limiting their practical use in resource-constrained devices. In this study, we present BrainMAP, a novel multimodal graph learning framework designed for precise and computationally efficient identification of brain regions affected by neurodegenerative diseases. First, BrainMAP utilizes an atlas-driven filtering approach guided by the AAL atlas to pinpoint and extract critical brain subgraphs. Unlike recent state-of-the-art methods, which model the entire brain network, BrainMAP achieves more than 50% reduction in computational overhead by concentrating on disease-relevant subgraphs. Second, we employ an advanced multimodal fusion process comprising cross-node attention to align functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI) data, coupled with an adaptive gating mechanism to blend and integrate these modalities dynamically. Experimental results demonstrate that BrainMAP outperforms state-of-the-art methods in computational efficiency, without compromising predictive accuracy.

[18] Enhanced Vehicle Speed Detection Considering Lane Recognition Using Drone Videos in California cs.CV | cs.LGPDF

Amirali Ataee Naeini, Ashkan Teymouri, Ghazaleh Jafarsalehi, Michael Zhang

TL;DR: 该论文提出了一种改进的YOLOv11模型，用于从无人机视频中精确检测车道内的车辆速度，并分类车辆类型（轿车和重型车辆），以满足加州交通监控的特定需求。

Details

Motivation: 加州车辆数量增加，但交通系统不足和测速摄像头稀疏，导致需要更有效的车辆速度检测方法。特别是监控HOV车道速度和区分不同限速的车辆类型（如普通车辆和重型车辆）的需求推动了这项研究。

Result: 微调后的YOLOv11模型在MAE（0.97 mph）和MSE（0.94 mph²）上表现优异，验证了其在车辆速度和分类检测中的高效性。

Insight: 通过引入车道识别和车辆分类，该研究不仅提升了速度检测的精度，还为交通监控提供了更实用的解决方案，特别是在HOV车道和重型车辆限速管理中具有重要意义。

Abstract: The increase in vehicle numbers in California, driven by inadequate transportation systems and sparse speed cameras, necessitates effective vehicle speed detection. Detecting vehicle speeds per lane is critical for monitoring High-Occupancy Vehicle (HOV) lane speeds, distinguishing between cars and heavy vehicles with differing speed limits, and enforcing lane restrictions for heavy vehicles. While prior works utilized YOLO (You Only Look Once) for vehicle speed detection, they often lacked accuracy, failed to identify vehicle lanes, and offered limited or less practical classification categories. This study introduces a fine-tuned YOLOv11 model, trained on almost 800 bird’s-eye view images, to enhance vehicle speed detection accuracy which is much higher compare to the previous works. The proposed system identifies the lane for each vehicle and classifies vehicles into two categories: cars and heavy vehicles. Designed to meet the specific requirements of traffic monitoring and regulation, the model also evaluates the effects of factors such as drone height, distance of Region of Interest (ROI), and vehicle speed on detection accuracy and speed measurement. Drone footage collected from Northern California was used to assess the proposed system. The fine-tuned YOLOv11 achieved its best performance with a mean absolute error (MAE) of 0.97 mph and mean squared error (MSE) of 0.94 $\text{mph}^2$, demonstrating its efficacy in addressing challenges in vehicle speed detection and classification.

[19] Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models cs.CV | cs.LGPDF

Yuwen Tan, Boqing Gong

TL;DR: 该论文提出将机器遗忘从数据追踪提升到知识追踪，以适应基础模型的多样需求，并更贴近人脑遗忘机制。

Details

Motivation: 当前的数据追踪机器遗忘方法无法满足基础模型（FMs）的多样化遗忘需求，且与人类认知中的遗忘机制不符。

Result: 展示了知识追踪机器遗忘在基础模型中的可行性和潜在优势。

Insight: 知识追踪机器遗忘更贴近实际需求和人脑遗忘机制，为机器遗忘研究提供了新方向。

Abstract: Machine unlearning removes certain training data points and their influence on AI models (e.g., when a data owner revokes their decision to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., having no access to FMs’ massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm.

[20] TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy cs.CV | cs.AIPDF

Héctor Carrión, Yutong Bai, Víctor A. Hernández Castro, Kishan Panaganti, Ayush Zenith

TL;DR: 论文提出了一个时空道路图像数据集STRIDE，并结合TARDIS模型，实现了对动态环境的建模与智能体行为任务的高效执行。

Details

Motivation: 真实世界环境具有动态性和时空复杂性，需要构建数据集和模型来模拟这些特性，以提升智能体的环境理解与决策能力。

Result: 在可控图像合成、自主控制和地理定位等任务中表现优异，展示了强大的时空建模能力。

Insight: 统一时空建模能为通用智能体的环境理解和决策提供新方向。

Abstract: World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360-degree panoramic imagery into rich interconnected observation, state and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing. These results suggest a promising direction towards sophisticated generalist agents–capable of understanding and manipulating the spatial and temporal aspects of their material environments–with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at https://huggingface.co/datasets/Tera-AI/STRIDE.

[21] HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation cs.CV | eess.IVPDF

Aaron Banze, Timothée Stassin, Nassim Ait Ali Braham, Rıdvan Salih Kuzu, Simon Besnard

TL;DR: 该论文提出了一个全球分布的森林地上生物量（AGB）估计基准数据集HyBiomass，结合高光谱影像和激光雷达数据，用于评估地理空间基础模型（Geo-FMs）。实验表明Geo-FMs表现优于基线U-Net，并揭示了数据集大小和视觉Transformer中token patch尺寸的重要性。

Details

Motivation: 现有基准数据集多为分类或分割任务，且局限于特定地理区域。HyBiomass填补了全球性、回归任务的空白，推动Geo-FMs在高光谱影像（HSI）中的应用研究。

Result: Geo-FMs在多数情况下表现优于U-Net，尤其是在数据集较大时。视觉Transformer的token patch尺寸对像素级回归任务至关重要。

Insight: 1. 数据集规模显著影响Geo-FMs表现；2. 视觉Transformer的设计需针对任务需求优化；3. 全球数据集有助于研究Geo-FMs的地理偏差和泛化能力。

Abstract: Comprehensive evaluation of geospatial foundation models (Geo-FMs) requires benchmarking across diverse tasks, sensors, and geographic regions. However, most existing benchmark datasets are limited to segmentation or classification tasks, and focus on specific geographic areas. To address this gap, we introduce a globally distributed dataset for forest aboveground biomass (AGB) estimation, a pixel-wise regression task. This benchmark dataset combines co-located hyperspectral imagery (HSI) from the Environmental Mapping and Analysis Program (EnMAP) satellite and predictions of AGB density estimates derived from the Global Ecosystem Dynamics Investigation lidars, covering seven continental regions. Our experimental results on this dataset demonstrate that the evaluated Geo-FMs can match or, in some cases, surpass the performance of a baseline U-Net, especially when fine-tuning the encoder. We also find that the performance difference between the U-Net and Geo-FMs depends on the dataset size for each region and highlight the importance of the token patch size in the Vision Transformer backbone for accurate predictions in pixel-wise regression tasks. By releasing this globally distributed hyperspectral benchmark dataset, we aim to facilitate the development and evaluation of Geo-FMs for HSI applications. Leveraging this dataset additionally enables research into geographic bias and generalization capacity of Geo-FMs. The dataset and source code will be made publicly available.

[22] GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset cs.CVPDF

Sahar Nasirihaghighi, Negin Ghamsarian, Leonie Peschek, Matteo Munari, Heinrich Husslein

TL;DR: 本文介绍了GynSurg，一个目前最大且最多样化的妇科腹腔镜手术数据集，支持多任务标注，旨在推动手术场景理解和动作识别的深度学习研究。

Details

Motivation: 现有妇科腹腔镜手术数据集规模小、任务单一或标注不充分，无法满足端到端工作流分析的需求。

Result: GynSurg展示了高质量和多样性，为深度学习模型提供了强大支持。

Insight: 大规模、多任务标注数据集对推动手术智能辅助系统的研发至关重要。

Abstract: Recent advances in deep learning have transformed computer-assisted intervention and surgical video analysis, driving improvements not only in surgical training, intraoperative decision support, and patient outcomes, but also in postoperative documentation and surgical discovery. Central to these developments is the availability of large, high-quality annotated datasets. In gynecologic laparoscopy, surgical scene understanding and action recognition are fundamental for building intelligent systems that assist surgeons during operations and provide deeper analysis after surgery. However, existing datasets are often limited by small scale, narrow task focus, or insufficiently detailed annotations, limiting their utility for comprehensive, end-to-end workflow analysis. To address these limitations, we introduce GynSurg, the largest and most diverse multi-task dataset for gynecologic laparoscopic surgery to date. GynSurg provides rich annotations across multiple tasks, supporting applications in action recognition, semantic segmentation, surgical documentation, and discovery of novel procedural insights. We demonstrate the dataset quality and versatility by benchmarking state-of-the-art models under a standardized training protocol. To accelerate progress in the field, we publicly release the GynSurg dataset and its annotations

[23] A Watermark for Auto-Regressive Image Generation Models cs.CVPDF

Yihan Wu, Xuehao Cui, Ruibo Chen, Georgios Milis, Heng Huang

TL;DR: 该论文提出了一种针对自回归图像生成模型的水印方法C-reweight，解决了传统的统计水印方法因重标记不匹配而在图像生成中失效的问题，同时保持图像质量并提高了可检测性。

Details

Motivation: 随着图像生成模型的快速发展，其被滥用于深度伪造、钓鱼攻击等场景的风险增加，因此需要一种有效的真实性验证机制。传统的统计水印方法在图像生成模型中因重标记不匹配问题而效果不佳。

Result: 在主流图像生成平台上的实验表明，C-reweight不仅保持了生成图像的视觉质量，还显著优于现有的无失真水印技术。

Insight: 聚类策略在水印技术中的应用为解决图像生成中的重标记不匹配问题提供了新思路，为图像生成的可靠性和安全性提供了保障。

Abstract: The rapid evolution of image generation models has revolutionized visual content creation, enabling the synthesis of highly realistic and contextually accurate images for diverse applications. However, the potential for misuse, such as deepfake generation, image based phishing attacks, and fabrication of misleading visual evidence, underscores the need for robust authenticity verification mechanisms. While traditional statistical watermarking techniques have proven effective for autoregressive language models, their direct adaptation to image generation models encounters significant challenges due to a phenomenon we term retokenization mismatch, a disparity between original and retokenized sequences during the image generation process. To overcome this limitation, we propose C-reweight, a novel, distortion-free watermarking method explicitly designed for image generation models. By leveraging a clustering-based strategy that treats tokens within the same cluster equivalently, C-reweight mitigates retokenization mismatch while preserving image fidelity. Extensive evaluations on leading image generation platforms reveal that C-reweight not only maintains the visual quality of generated images but also improves detectability over existing distortion-free watermarking techniques, setting a new standard for secure and trustworthy image synthesis.

[24] Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images cs.CVPDF

Xianlu Li, Nicolas Nadisic, Shaoguang Huang, Nikos Deligiannis, Aleksandra Pižurica

TL;DR: 该论文提出了一种基于基表示的可扩展的上下文保留深度聚类方法，用于高光谱图像（HSI）的无监督分析。通过联合优化局部和非局部结构约束，实现了高效且可扩展的聚类。

Details

Motivation: 现有的模型感知深度子空间聚类方法通常采用两阶段框架，计算复杂度高（O(n^2)），且仅单独考虑局部或非局部结构约束，无法有效监督整个聚类过程。

Result: 实验表明，该方法在真实数据集上优于现有技术，同时具有更高的计算效率。

Insight: 联合优化局部和非局部结构约束可以显著提升聚类性能，同时通过降低计算复杂度（O(n)）使方法适用于大规模高光谱图像分析。

Abstract: Subspace clustering has become widely adopted for the unsupervised analysis of hyperspectral images (HSIs). Recent model-aware deep subspace clustering methods often use a two-stage framework, involving the calculation of a self-representation matrix with complexity of O(n^2), followed by spectral clustering. However, these methods are computationally intensive, generally incorporating solely either local or non-local spatial structure constraints, and their structural constraints fall short of effectively supervising the entire clustering process. We propose a scalable, context-preserving deep clustering method based on basis representation, which jointly captures local and non-local structures for efficient HSI clustering. To preserve local structure (i.e., spatial continuity within subspaces), we introduce a spatial smoothness constraint that aligns clustering predictions with their spatially filtered versions. For non-local structure (i.e., spectral continuity), we employ a mini-cluster-based scheme that refines predictions at the group level, encouraging spectrally similar pixels to belong to the same subspace. Notably, these two constraints are jointly optimized to reinforce each other. Specifically, our model is designed as an one-stage approach in which the structural constraints are applied to the entire clustering process. The time and space complexity of our method is O(n), making it applicable to large-scale HSI data. Experiments on real-world datasets show that our method outperforms state-of-the-art techniques. Our code is available at: https://github.com/lxlscut/SCDSC

[25] Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation cs.CV | cs.AIPDF

Xiaoxin Lu, Ranran Haoran Zhang, Yusen Zhang, Rui Zhang

TL;DR: 该论文提出了一种新的框架，用于生成和改进文本-图像计划，解决了模态间一致性和视觉步骤连贯性的问题，并在多种骨干模型上验证了其有效性。

Details

Motivation: 现有研究主要集中在LLM的文本计划生成能力上，而文本-图像计划的生成潜力尚未充分挖掘，这需要解决模态对齐和视觉连贯性两大挑战。

Result: 实验表明该方法在多种骨干模型（如Mistral-7B、Gemini-1.5和GPT-4o）上优于基线，提升了模态一致性和连贯性。

Insight: 通过迭代优化和视觉信息提取，可以显著提升多模态计划的生成质量，支持任务完成的实用性和可扩展性。

Abstract: People get informed of a daily task plan through diverse media involving both texts and images. However, most prior research only focuses on LLM’s capability of textual plan generation. The potential of large-scale models in providing text-image plans remains understudied. Generating high-quality text-image plans faces two main challenges: ensuring consistent alignment between two modalities and keeping coherence among visual steps. To address these challenges, we propose a novel framework that generates and refines text-image plans step-by-step. At each iteration, our framework (1) drafts the next textual step based on the prediction history; (2) edits the last visual step to obtain the next one; (3) extracts PDDL-like visual information; and (4) refines the draft with the extracted visual information. The textual and visual step produced in stage (4) and (2) will then serve as inputs for the next iteration. Our approach offers a plug-and-play improvement to various backbone models, such as Mistral-7B, Gemini-1.5, and GPT-4o. To evaluate the effectiveness of our approach, we collect a new benchmark consisting of 1,100 tasks and their text-image pair solutions covering 11 daily topics. We also design and validate a new set of metrics to evaluate the multimodal consistency and coherence in text-image plans. Extensive experiment results show the effectiveness of our approach on a range of backbone models against competitive baselines. Our code and data are available at https://github.com/psunlpgroup/MPlanner.

[26] Dynamic Double Space Tower cs.CV | cs.AIPDF

Weikai Sun, Shijie Song, Han Wang

TL;DR: 论文提出了一种动态双向空间塔（Dynamic Double Space Tower）方法，用于增强视觉问答（VQA）任务中的跨模态推理能力和空间关系理解。

Details

Motivation: 现有方法在处理复杂推理场景时，因跨模态交互不足和实体空间关系捕捉能力有限而表现不佳，需要更有效的解决方案。

Result: 该方法在多种多模态模型中表现优异，训练出的July模型仅用3B参数即在空间关系问答数据集上达到SOTA。

Insight: 从‘看图像’到‘感知和组织图像内容’的转变，为多模态模型提供了更高效的空间关系处理方法。

Abstract: The Visual Question Answering (VQA) task requires the simultaneous understanding of image content and question semantics. However, existing methods often have difficulty handling complex reasoning scenarios due to insufficient cross-modal interaction and capturing the entity spatial relationships in the image.\cite{huang2023adaptive}\cite{liu2021comparing}\cite{guibas2021adaptive}\cite{zhang2022vsa}We studied a brand-new approach to replace the attention mechanism in order to enhance the reasoning ability of the model and its understanding of spatial relationships.Specifically, we propose a dynamic bidirectional spatial tower, which is divided into four layers to observe the image according to the principle of human gestalt vision. This naturally provides a powerful structural prior for the spatial organization between entities, enabling the model to no longer blindly search for relationships between pixels but make judgments based on more meaningful perceptual units. Change from “seeing images” to “perceiving and organizing image content”.A large number of experiments have shown that our module can be used in any other multimodal model and achieve advanced results, demonstrating its potential in spatial relationship processing.Meanwhile, the multimodal visual question-answering model July trained by our method has achieved state-of-the-art results with only 3B parameters, especially on the question-answering dataset of spatial relations.

[27] Stop learning it all to mitigate visual hallucination, Focus on the hallucination target cs.CV | cs.AIPDF

Dokyoon Yoon, Youngsook Song, Woomyong Park

TL;DR: 本文提出了一种名为\mymethod的偏好学习方法，通过聚焦于幻觉发生的目标区域，减少多模态大语言模型（MLLMs）在视觉语言任务中产生的幻觉现象。实验表明，该方法显著降低了幻觉问题，提升了模型的可靠性。

Details

Motivation: 多模态大语言模型在视觉语言任务中经常出现幻觉问题，生成与输入图像无关的对象信息，影响了模型的实用性。为了解决这一问题，本文提出了一种针对幻觉目标的偏好学习方法。

Result: 实验结果表明，\mymethod在多个视觉幻觉任务中显著减少了幻觉现象，同时未降低模型的整体性能，提升了MLLMs的可靠性。

Insight: 通过限制学习范围、专注于幻觉目标区域，可以有效减少模型的幻觉问题，而不影响其整体性能。这为解决MLLMs的可靠性问题提供了新的思路。

Abstract: Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues, generating information about objects that are not present in input images during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications requiring accurate object identification. To address this challenge, we propose \mymethod,\ a preference learning approach that mitigates hallucinations by focusing on targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., objects present in the images and the corresponding chunk positions in responses affected by hallucinations). By applying a preference learning method restricted to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations. This allows the model to produce more factual responses by concentrating solely on relevant information. Experimental results demonstrate that \mymethod\ effectively reduces hallucinations across multiple vision hallucination tasks, improving the reliability and performance of MLLMs without diminishing overall performance.

[28] TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models cs.CVPDF

Ziyang Luo, Nian Liu, Xuguang Yang, Salman Khan, Rao Muhammad Anwer

TL;DR: TAViS提出了一种新框架，通过文本桥接设计耦合多模态基础模型（ImageBind）和分割基础模型（SAM2），解决了音频-视觉分割中的跨模态对齐问题。

Details

Motivation: 现有方法难以有效对齐音频和视觉模态，且多模态基础模型与分割模型的结合存在知识迁移和监督不足的挑战。

Result: TAViS在单一源、多源、语义数据集上表现优异，并且在零样本设置中表现突出。

Insight: 文本作为桥接媒介可以有效解决跨模态对齐问题，为多模态任务提供新的解决思路。

Abstract: Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that \textbf{couples} the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, semantic datasets, and excels in zero-shot settings.

[29] On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving cs.CV | cs.LGPDF

Pedram MohajerAnsari, Amir Salarpour, Michael Kühr, Siyu Huang, Mohammad Hamad

TL;DR: 本文提出了用于自动驾驶感知的Vehicle Vision Language Models (V2LMs)，展示了其在对抗攻击下的优越鲁棒性，与传统DNN相比，性能下降显著更低。

Details

Motivation: 自动驾驶依赖的DNN容易受到攻击导致错误分类，传统防御方法如对抗训练会牺牲模型在正常情况下的性能且难以泛化。

Result: 传统DNN在攻击下性能下降33%至46%，而V2LMs平均仅下降不到8%。Tandem Mode在保持鲁棒性的同时更节省内存。

Insight: V2LMs为自动驾驶感知系统提供了一种更安全、更鲁棒的解决方案，可能成为未来研究的重要方向。

Abstract: Autonomous vehicles (AVs) rely on deep neural networks (DNNs) for critical tasks such as traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD). However, these models are vulnerable to attacks that can cause misclassifications and compromise safety. Traditional defense mechanisms, including adversarial training, often degrade benign accuracy and fail to generalize against unseen attacks. In this work, we introduce Vehicle Vision Language Models (V2LMs), fine-tuned vision-language models specialized for AV perception. Our findings demonstrate that V2LMs inherently exhibit superior robustness against unseen attacks without requiring adversarial training, maintaining significantly higher accuracy than conventional DNNs under adversarial conditions. We evaluate two deployment strategies: Solo Mode, where individual V2LMs handle specific perception tasks, and Tandem Mode, where a single unified V2LM is fine-tuned for multiple tasks simultaneously. Experimental results reveal that DNNs suffer performance drops of 33% to 46% under attacks, whereas V2LMs maintain adversarial accuracy with reductions of less than 8% on average. The Tandem Mode further offers a memory-efficient alternative while achieving comparable robustness to Solo Mode. We also explore integrating V2LMs as parallel components to AV perception to enhance resilience against adversarial threats. Our results suggest that V2LMs offer a promising path toward more secure and resilient AV perception systems.

[30] FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes cs.CVPDF

Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang

TL;DR: FAME是一个轻量级的时空网络，旨在通过捕捉不同换脸模型特有的生成伪影，实现对Deepfake视频的模型溯源任务。

Details

Motivation: 随着换脸Deepfake视频的增加，数字安全、隐私和媒体完整性面临威胁，而现有研究多聚焦于二分类检测，模型溯源任务仍未被充分探索。

Result: 在DFDM、FaceForensics++和FakeAVCeleb数据集上，FAME在准确性和运行效率上均优于现有方法。

Insight: 模型溯源任务为Deepfake检测提供更细粒度分析，时空特征可以有效区分不同生成模型的输出。

Abstract: The widespread emergence of face-swap Deepfake videos poses growing risks to digital security, privacy, and media integrity, necessitating effective forensic tools for identifying the source of such manipulations. Although most prior research has focused primarily on binary Deepfake detection, the task of model attribution – determining which generative model produced a given Deepfake – remains underexplored. In this paper, we introduce FAME (Fake Attribution via Multilevel Embeddings), a lightweight and efficient spatio-temporal framework designed to capture subtle generative artifacts specific to different face-swap models. FAME integrates spatial and temporal attention mechanisms to improve attribution accuracy while remaining computationally efficient. We evaluate our model on three challenging and diverse datasets: Deepfake Detection and Manipulation (DFDM), FaceForensics++, and FakeAVCeleb. Results show that FAME consistently outperforms existing methods in both accuracy and runtime, highlighting its potential for deployment in real-world forensic and information security applications.

[31] Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation cs.CVPDF

Tung-Long Vuong, Hoang Phan, Vy Vo, Anh Bui, Thanh-Toan Do

TL;DR: 论文提出了一种新方法，通过利用视觉和文本嵌入的几何特性，优化无监督域自适应（UDA）中的提示学习，从而增强伪标签质量并改善目标域中的视觉-文本对齐。

Details

Motivation: 研究动机在于现有基于多模态预训练模型（如CLIP）的UDA方法在目标域中视觉嵌入分布可能偏离预训练模型，导致伪标签质量下降，限制了模型性能。

Result: 实验和消融研究验证了方法的有效性，展示了在性能和目标提示表示质量上的显著提升。

Insight: 研究揭示了预训练多模态模型中视觉和文本嵌入存在强聚类行为，并将其转化为优化目标域对齐的实用策略。

Abstract: Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains can deviate from the visual embedding distribution in the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution to reinforce these pseudo-labels and facilitate target-prompt learning, by exploiting the geometry of visual and text embeddings - an aspect that is overlooked by existing methods. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models. Building on optimal transport theory, we transform this insight into a novel strategy to enforce the clustering property in text embeddings, further enhancing the alignment in the target domain. Our experiments and ablation studies validate the effectiveness of the proposed approach, demonstrating superior performance and improved quality of target prompts in terms of representation.

[32] Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs cs.CV | cs.CL | cs.LGPDF

Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan

TL;DR: 论文提出了一种名为Manager的轻量级插件，用于在双塔VLM和MLLM中自适应整合单模态专家的多层级语义知识，显著提升了多模态任务性能。

Details

Motivation: 现有双塔VLM（如BridgeTower）在单模态表示利用、灵活语义知识提取以及对高分辨率数据的适应性上存在不足，需改进多模态对齐与融合。

Result: 在4个下游VL任务上超越了基线模型；LLaVA-OV-Manager在20个数据集上显著提升了多分辨率图像的零样本性能。

Insight: Manager和多网格算法分别从深度和广度两个正交视角增强视觉表示，两者的协同作用可缓解多网格算法引起的语义模糊问题。

Abstract: Two-Tower Vision–Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it \textit{(i)} suffers from ineffective layer-by-layer utilization of unimodal representations, \textit{(ii)} restricts the flexible exploitation of different levels of unimodal semantic knowledge, and \textit{(iii)} is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.

[33] GNSS-inertial state initialization by distance residuals cs.CVPDF

Samuel Cerezo, Javier Civera

TL;DR: 该论文提出了一种新的GNSS-惯性初始化策略，通过延迟使用全局GNSS测量数据，转而利用GNSS相对距离残差，直到有足够信息估计GNSS与惯性坐标系之间的变换，从而获得更准确和鲁棒的初始化结果。

Details

Motivation: 在传感器平台的初始化中，有限的初始测量数据往往导致较差的初始估计，甚至可能在非线性优化中陷入局部极小值。这促使了本文提出一种改进的方法，以提高初始化的准确性和鲁棒性。

Result: 在EuRoC和GVINS数据集上的实验表明，该方法比从开始就使用全局GNSS数据的策略表现更优，初始化结果更准确和鲁棒。

Insight: 通过延迟使用全局测量数据并利用局部信息，可以有效提升初始化阶段的精度和鲁棒性，尤其是在信息有限的场景下。

Abstract: Initializing the state of a sensorized platform can be challenging, as a limited set of initial measurements often carry limited information, leading to poor initial estimates that may converge to local minima during non-linear optimization. This paper proposes a novel GNSS-inertial initialization strategy that delays the use of global GNSS measurements until sufficient information is available to accurately estimate the transformation between the GNSS and inertial frames. Instead, the method initially relies on GNSS relative distance residuals. To determine the optimal moment for switching to global measurements, we introduce a criterion based on the evolution of the Hessian matrix singular values. Experiments on the EuRoC and GVINS datasets show that our approach consistently outperforms the naive strategy of using global GNSS data from the start, yielding more accurate and robust initializations.

[34] FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation cs.CV | cs.AI | cs.LGPDF

Zhuguanyu Wu, Shihe Wang, Jiayi Zhang, Jiaxin Chen, Yunhong Wang

TL;DR: FIMA-Q提出了一种基于Fisher信息矩阵近化的后训练量化方法，显著提升了低比特量化下Vision Transformers的精度。

Details

Motivation: 现有后训练量化方法在面对Vision Transformers时，尤其在低比特量化下，存在显著的精度下降问题。

Result: 实验表明，FIMA-Q在低比特量化下显著优于现有方法，提升了ViTs的量化精度。

Insight: Fisher信息矩阵的合理应用可以更准确地指导量化过程，尤其适用于复杂架构如Vision Transformers。

Abstract: Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization. To address these shortcomings, we analyze the prevailing Hessian-guided quantization loss, and uncover certain limitations of conventional Hessian approximations. By following the block-wise reconstruction framework, we propose a novel PTQ method for ViTs, dubbed FIMA-Q. Specifically, we firstly establish the connection between KL divergence and FIM, which enables fast computation of the quantization loss during reconstruction. We further propose an efficient FIM approximation method, namely DPLR-FIM, by employing the diagonal plus low-rank principle, and formulate the ultimate quantization loss. Our extensive experiments, conducted across various vision tasks with representative ViT-based architectures on public datasets, demonstrate that our method substantially promotes the accuracy compared to the state-of-the-art approaches, especially in the case of low-bit quantization. The source code is available at https://github.com/ShiheWang/FIMA-Q.

[35] Linearly Solving Robust Rotation Estimation cs.CV | cs.RO | cs.SY | eess.SYPDF

Yinlong Liu, Tianyu Huang, Zhi-Xin Yang

TL;DR: 该论文提出了一种新的线性模型拟合方法来解决旋转估计问题，揭示了旋转运动的对偶结构，并提出了一种基于投票的方法，表现出对噪声和异常值的卓越鲁棒性，且支持GPU加速。

Details

Motivation: 旋转估计在计算机视觉和机器人任务中至关重要，但传统方法通常是非线性、非凸优化问题，需要复杂设计。本文旨在提供一种更简单、更鲁棒的方法来解决这一问题。

Result: 在噪声和异常值（99%）严重的大规模（10^6）旋转估计问题中，该方法能在0.5秒内找到满意解。实验验证了其有效性和鲁棒性。

Insight: 旋转运动的对偶结构为线性求解提供了新视角，结合投票机制和GPU加速，显著提升了鲁棒性和效率。

Abstract: Rotation estimation plays a fundamental role in computer vision and robot tasks, and extremely robust rotation estimation is significantly useful for safety-critical applications. Typically, estimating a rotation is considered a non-linear and non-convex optimization problem that requires careful design. However, in this paper, we provide some new perspectives that solving a rotation estimation problem can be reformulated as solving a linear model fitting problem without dropping any constraints and without introducing any singularities. In addition, we explore the dual structure of a rotation motion, revealing that it can be represented as a great circle on a quaternion sphere surface. Accordingly, we propose an easily understandable voting-based method to solve rotation estimation. The proposed method exhibits exceptional robustness to noise and outliers and can be computed in parallel with graphics processing units (GPUs) effortlessly. Particularly, leveraging the power of GPUs, the proposed method can obtain a satisfactory rotation solution for large-scale($10^6$) and severely corrupted (99$%$ outlier ratio) rotation estimation problems under 0.5 seconds. Furthermore, to validate our theoretical framework and demonstrate the superiority of our proposed method, we conduct controlled experiments and real-world dataset experiments. These experiments provide compelling evidence supporting the effectiveness and robustness of our approach in solving rotation estimation problems.

[36] EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment cs.CV | eess.IVPDF

Zhaoyang Wang, Wen Lu, Jie Li, Lihuo He, Maoguo Gong

TL;DR: EyeSim-VQA 是一个基于自由能量引导的视频质量评估框架，通过双分支结构和生物启发设计，实现了对视频质量的全局和细粒度评估，并在多个基准测试中取得了优异表现。

Details

Motivation: 传统的图像质量评估（IQA）中的自由能量修复机制在视频质量评估（VQA）中尚未充分探索，视频的动态特性和预训练模型限制带来了新挑战。

Result: 在五个公共 VQA 基准测试中表现优于或匹敌现有方法，同时通过生物启发设计提升了模型的可解释性。

Insight: 多模态特征融合与生物启发机制可显著提升视频质量评估的准确性和鲁棒性。

Abstract: Free-energy-guided self-repair mechanisms have shown promising results in image quality assessment (IQA), but remain under-explored in video quality assessment (VQA), where temporal dynamics and model constraints pose unique challenges. Unlike static images, video content exhibits richer spatiotemporal complexity, making perceptual restoration more difficult. Moreover, VQA systems often rely on pre-trained backbones, which limits the direct integration of enhancement modules without affecting model stability. To address these issues, we propose EyeSimVQA, a novel VQA framework that incorporates free-energy-based self-repair. It adopts a dual-branch architecture, with an aesthetic branch for global perceptual evaluation and a technical branch for fine-grained structural and semantic analysis. Each branch integrates specialized enhancement modules tailored to distinct visual inputs-resized full-frame images and patch-based fragments-to simulate adaptive repair behaviors. We also explore a principled strategy for incorporating high-level visual features without disrupting the original backbone. In addition, we design a biologically inspired prediction head that models sweeping gaze dynamics to better fuse global and local representations for quality prediction. Experiments on five public VQA benchmarks demonstrate that EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods, while offering improved interpretability through its biologically grounded design.

[37] DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs cs.CV | cs.AI | cs.CLPDF

Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen

TL;DR: DaMO是一种高效的视频语言多模态模型，专为精确的时间推理设计，通过层次化双流架构和四阶段训练范式，显著提升了时间对齐和推理能力。

Details

Motivation: 现有视频LLM在细粒度时间推理和特定视频时刻的响应上表现不足，尤其是在有限监督下。

Result: 在时间定位和视频问答基准测试中，DaMO表现优于现有方法，尤其在需要精确时间对齐的任务中。

Insight: 层次化架构和渐进式训练是提升视频LLM时间推理能力的关键，同时数据增强可缓解监督不足问题。

Abstract: Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

[38] VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories? cs.CVPDF

Jiachen Yu, Yufei Zhan, Ziheng Wu, Yousong Zhu, Jinqiao Wang

TL;DR: 该论文提出了VFaith-Bench，首个评估多模态大语言模型（MLLMs）视觉推理能力及其对视觉信息忠实度的基准，并设计了一套自动编辑视觉线索的流程。

Details

Motivation: 当前MLLMs在复杂任务中表现优异，但其推理过程中对视觉信息的依赖程度尚不明确，缺乏定量分析工具。

Result: 在VFaith-Bench上测试了主流模型，揭示了其推理能力的来源，发现部分模型更依赖记忆而非真实视觉信息。

Insight: 1. MLLMs的推理能力并非完全依赖于视觉信息；2. 设计的编辑方法可以有效区分模型是否真正‘看到’图像内容。

Abstract: Recent extensive works have demonstrated that by introducing long CoT, the capabilities of MLLMs to solve complex problems can be effectively enhanced. However, the reasons for the effectiveness of such paradigms remain unclear. It is challenging to analysis with quantitative results how much the model’s specific extraction of visual cues and its subsequent so-called reasoning during inference process contribute to the performance improvements. Therefore, evaluating the faithfulness of MLLMs’ reasoning to visual information is crucial. To address this issue, we first present a cue-driven automatic and controllable editing pipeline with the help of GPT-Image-1. It enables the automatic and precise editing of specific visual cues based on the instruction. Furthermore, we introduce VFaith-Bench, the first benchmark to evaluate MLLMs’ visual reasoning capabilities and analyze the source of such capabilities with an emphasis on the visual faithfulness. Using the designed pipeline, we constructed comparative question-answer pairs by altering the visual cues in images that are crucial for solving the original reasoning problem, thereby changing the question’s answer. By testing similar questions with images that have different details, the average accuracy reflects the model’s visual reasoning ability, while the difference in accuracy before and after editing the test set images effectively reveals the relationship between the model’s reasoning ability and visual perception. We further designed specific metrics to expose this relationship. VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. We conducted in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series/reasoning models on VFaith-Bench, further investigating the underlying factors of their reasoning capabilities.

[39] Camera-based method for the detection of lifted truck axles using convolutional neural networks cs.CVPDF

Bachir Tchana Tankeu, Mohamed Bouteldja, Nicolas Grignard, Bernard Jacob

TL;DR: 该论文提出了一种基于YOLOv8s卷积神经网络的方法，用于通过垂直交通方向的摄像头图像检测卡车抬升的轴，展现出高精度和实时性。

Details

Motivation: 现有技术如动态称重系统难以准确分类抬升轴的车辆，且缺乏有效的商用或技术方法。

Result: 方法达到87%的精度和91.7%的召回率，推理时间1.4毫秒，适合实时应用。

Insight: 通过增加数据集或图像增强方法有可能进一步提升性能。

Abstract: The identification and classification of vehicles play a crucial role in various aspects of the control-sanction system. Current technologies such as weigh-in-motion (WIM) systems can classify most vehicle categories but they struggle to accurately classify vehicles with lifted axles. Moreover, very few commercial and technical methods exist for detecting lifted axles. In this paper, as part of the European project SETO (Smart Enforcement of Transport Operations), a method based on a convolutional neural network (CNN), namely YOLOv8s, was proposed for the detection of lifted truck axles in images of trucks captured by cameras placed perpendicular to the direction of traffic. The performance of the proposed method was assessed and it was found that it had a precision of 87%, a recall of 91.7%, and an inference time of 1.4 ms, which makes it well-suited for real time implantations. These results suggest that further improvements could be made, potentially by increasing the size of the datasets and/or by using various image augmentation methods.

[40] EasyARC: Evaluating Vision Language Models on True Visual Reasoning cs.CV | cs.LGPDF

Mert Unsal, Aylin Akkus

TL;DR: EasyARC是一个新的视觉语言基准测试，旨在评估模型在真实视觉推理任务中的表现，强调多图像、多步骤推理和自校正能力。

Details

Motivation: 现有基准测试主要关注视觉提取与文本推理的结合，缺乏对视觉与语言复杂交互的真实推理能力的评估。

Result: EasyARC为视觉语言模型的真实推理能力和测试时扩展能力设定了新标准。

Insight: 视觉语言模型在复杂视觉推理任务中仍存在明显不足，需要通过改进基准测试来推动其发展。

Abstract: Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning, and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models. We open-source our benchmark dataset and evaluation code.

[41] A$^2$LC: Active and Automated Label Correction for Semantic Segmentation cs.CV | cs.AIPDF

Youjin Jeon, Kyusik Cho, Suhan Woo, Euntai Kim

TL;DR: A$^2$LC 提出了一种新颖的主动和自动标签校正框架，用于语义分割任务。它通过整合自动校正阶段和自适应平衡的获取函数，显著提升了标签校正的效率，同时优化了对尾部类别的关注。

Details

Motivation: 语义分割任务中，手动像素级标注成本高且易出错，而现有的主动标签校正方法在效率和效果上仍有不足。A$^2$LC 旨在通过自动化校正和优化样本选择，进一步提升效率和性能。

Result: 在 Cityscapes 和 PASCAL VOC 2012 上的实验表明，A$^2$LC 仅用 20% 的预算即超越现有方法，并在相同预算下性能提升 27.23%。

Insight: 通过整合自动化和主动学习的优势，A$^2$LC 不仅提升了标注效率，还显著改善了模型对尾部类别的识别能力。

Abstract: Active Label Correction (ALC) has emerged as a promising solution to the high cost and error-prone nature of manual pixel-wise annotation in semantic segmentation, by selectively identifying and correcting mislabeled data. Although recent work has improved correction efficiency by generating pseudo-labels using foundation models, substantial inefficiencies still remain. In this paper, we propose Active and Automated Label Correction for semantic segmentation (A$^2$LC), a novel and efficient ALC framework that integrates an automated correction stage into the conventional pipeline. Specifically, the automated correction stage leverages annotator feedback to perform label correction beyond the queried samples, thereby maximizing cost efficiency. In addition, we further introduce an adaptively balanced acquisition function that emphasizes underrepresented tail classes and complements the automated correction mechanism. Extensive experiments on Cityscapes and PASCAL VOC 2012 demonstrate that A$^2$LC significantly outperforms previous state-of-the-art methods. Notably, A$^2$LC achieves high efficiency by outperforming previous methods using only 20% of their budget, and demonstrates strong effectiveness by yielding a 27.23% performance improvement under an equivalent budget constraint on the Cityscapes dataset. The code will be released upon acceptance.

[42] Wi-CBR: WiFi-based Cross-domain Behavior Recognition via Multimodal Collaborative Awareness cs.CV | eess.SPPDF

Ruobei Zhang, Shengeng Tang, Huan Yan, Xiang Zhang, Richang Hong

TL;DR: 本文提出了一种基于WiFi的多模态协作感知方法（Wi-CBR），通过融合相位数据和多普勒频移（DFS）数据，实现高效的特征交互与融合，提升行为识别精度。

Details

Motivation: 现有WiFi行为识别方法通常仅关注单一数据类型，忽略了多模态特征的交互与融合，限制了识别性能。

Result: 在Widar3.0和XRF55数据集上，方法在域内和跨域实验中均表现出优越性能。

Insight: 多模态数据（相位+DFS）的协作感知能有效提升WiFi行为识别的鲁棒性和泛化能力。

Abstract: WiFi-based human behavior recognition aims to recognize gestures and activities by analyzing wireless signal variations. However, existing methods typically focus on a single type of data, neglecting the interaction and fusion of multiple features. To this end, we propose a novel multimodal collaborative awareness method. By leveraging phase data reflecting changes in dynamic path length and Doppler Shift (DFS) data corresponding to frequency changes related to the speed of gesture movement, we enable efficient interaction and fusion of these features to improve recognition accuracy. Specifically, we first introduce a dual-branch self-attention module to capture spatial-temporal cues within each modality. Then, a group attention mechanism is applied to the concatenated phase and DFS features to mine key group features critical for behavior recognition. Through a gating mechanism, the combined features are further divided into PD-strengthen and PD-weaken branches, optimizing information entropy and promoting cross-modal collaborative awareness. Extensive in-domain and cross-domain experiments on two large publicly available datasets, Widar3.0 and XRF55, demonstrate the superior performance of our method.

[43] SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation cs.CVPDF

Xu Wang, Shengeng Tang, Lechao Cheng, Feng Li, Shuo Wang

TL;DR: SignAligner通过整合文本驱动的姿态模态生成、在线协作修正和多模态视频合成，显著提升了手语生成的准确性和表现力。

Details

Motivation: 手语生成面临复杂姿态（手势、表情、身体动作）的挑战，现有方法难以实现自然且多样化的生成。

Result: 实验表明，SignAligner显著提升了生成手语视频的准确性和表现力。

Insight: 跨模态注意力机制和动态损失权重是实现多模态协同修正的关键，确保语义连贯和动作一致。

Abstract: Sign language generation aims to produce diverse sign representations based on spoken language. However, achieving realistic and naturalistic generation remains a significant challenge due to the complexity of sign language, which encompasses intricate hand gestures, facial expressions, and body movements. In this work, we introduce PHOENIX14T+, an extended version of the widely-used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, Hamer and Smplerx. We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven pose modalities co-generation, online collaborative correction of multimodality, and realistic sign video synthesis. First, by incorporating text semantics, we design a joint sign language generator to simultaneously produce posture coordinates, gesture actions, and body movements. The text encoder, based on a Transformer architecture, extracts semantic features, while a cross-modal attention mechanism integrates these features to generate diverse sign language representations, ensuring accurate mapping and controlling the diversity of modal features. Next, online collaborative correction is introduced to refine the generated pose modalities using a dynamic loss weighting strategy and cross-modal attention, facilitating the complementarity of information across modalities, eliminating spatiotemporal conflicts, and ensuring semantic coherence and action consistency. Finally, the corrected pose modalities are fed into a pre-trained video generation network to produce high-fidelity sign language videos. Extensive experiments demonstrate that SignAligner significantly improves both the accuracy and expressiveness of the generated sign videos.

[44] Evaluating Fairness and Mitigating Bias in Machine Learning: A Novel Technique using Tensor Data and Bayesian Regression cs.CV | cs.AI | cs.LGPDF

Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos

TL;DR: 本文提出了一种新的评估机器学习公平性的技术，特别针对图像分类任务中的肤色问题，避免使用标注，并通过概率分布和统计距离度量捕捉公平性细节。同时提出了一种基于贝叶斯回归的训练方法以减轻偏置。

Details

Motivation: 肤色在计算机视觉中以张量数据表示，而非传统的类别或数值特征，现有公平性研究多忽略其特殊性。本文旨在填补这一空白，提出更细粒度的公平性评估方法。

Result: 能更细粒度地捕捉公平性细节，并在肤色分类任务中实现更公平的模型性能。

Insight: 肤色作为连续张量数据需要更复杂的公平性评估方法，而非简单分类；贝叶斯回归可有效建模颜色距离并减轻偏置。

Abstract: Fairness is a critical component of Trustworthy AI. In this paper, we focus on Machine Learning (ML) and the performance of model predictions when dealing with skin color. Unlike other sensitive attributes, the nature of skin color differs significantly. In computer vision, skin color is represented as tensor data rather than categorical values or single numerical points. However, much of the research on fairness across sensitive groups has focused on categorical features such as gender and race. This paper introduces a new technique for evaluating fairness in ML for image classification tasks, specifically without the use of annotation. To address the limitations of prior work, we handle tensor data, like skin color, without classifying it rigidly. Instead, we convert it into probability distributions and apply statistical distance measures. This novel approach allows us to capture fine-grained nuances in fairness both within and across what would traditionally be considered distinct groups. Additionally, we propose an innovative training method to mitigate the latent biases present in conventional skin tone categorization. This method leverages color distance estimates calculated through Bayesian regression with polynomial functions, ensuring a more nuanced and equitable treatment of skin color in ML models.

[45] DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation cs.CV | cs.AI | cs.LGPDF

Emre Kavak, Tom Nuno Wolf, Christian Wachinger

TL;DR: 论文提出了一种名为DISCO的方法，通过条件距离相关性来解决深度学习中的偏置问题，确保模型仅依赖因果相关信号进行预测。

Details

Motivation: 在预测任务中，模型可能利用与目标无关的虚假信号（如光照条件）进行预测，导致不可靠的结果。这种偏置行为需要通过因果框架加以抑制。

Result: DISCO在多种偏置缓解实验中表现优异，成为传统基于核方法的有效替代。

Insight: 1. 条件独立性是解决模型偏置的关键；2. 反因果框架为分析偏置提供了新的理论工具；3. 距离相关性在优化模型鲁棒性中具有潜力。

Abstract: During prediction tasks, models can use any signal they receive to come up with the final answer - including signals that are causally irrelevant. When predicting objects from images, for example, the lighting conditions could be correlated to different targets through selection bias, and an oblivious model might use these signals as shortcuts to discern between various objects. A predictor that uses lighting conditions instead of real object-specific details is obviously undesirable. To address this challenge, we introduce a standard anti-causal prediction model (SAM) that creates a causal framework for analyzing the information pathways influencing our predictor in anti-causal settings. We demonstrate that a classifier satisfying a specific conditional independence criterion will focus solely on the direct causal path from label to image, being counterfactually invariant to the remaining variables. Finally, we propose DISCO, a novel regularization strategy that uses conditional distance correlation to optimize for conditional independence in regression tasks. We can show that DISCO achieves competitive results in different bias mitigation experiments, deeming it a valid alternative to classical kernel-based methods.

[46] Prohibited Items Segmentation via Occlusion-aware Bilayer Modeling cs.CVPDF

Yunhan Ren, Ruihuang Li, Lingbo Liu, Changwen Chen

TL;DR: 该论文提出了一种针对X射线图像中违禁物品的分割方法，通过结合Segment Anything Model（SAM）和遮挡感知的双层掩码解码器模块，解决了违禁物品与自然物体的表征差异以及物品重叠问题。

Details

Motivation: X射线图像中违禁物品的实例分割因物品与自然物体的表征差异及严重重叠而具有挑战性。

Result: 在遮挡标注数据集上的实验证明了方法的有效性。

Insight: 通过显式建模遮挡关系和利用SAM的泛化能力，可以显著提升X射线图像中违禁物品的分割性能。

Abstract: Instance segmentation of prohibited items in security X-ray images is a critical yet challenging task. This is mainly caused by the significant appearance gap between prohibited items in X-ray images and natural objects, as well as the severe overlapping among objects in X-ray images. To address these issues, we propose an occlusion-aware instance segmentation pipeline designed to identify prohibited items in X-ray images. Specifically, to bridge the representation gap, we integrate the Segment Anything Model (SAM) into our pipeline, taking advantage of its rich priors and zero-shot generalization capabilities. To address the overlap between prohibited items, we design an occlusion-aware bilayer mask decoder module that explicitly models the occlusion relationships. To supervise occlusion estimation, we manually annotated occlusion areas of prohibited items in two large-scale X-ray image segmentation datasets, PIDray and PIXray. We then reorganized these additional annotations together with the original information as two occlusion-annotated datasets, PIDray-A and PIXray-A. Extensive experimental results on these occlusion-annotated datasets demonstrate the effectiveness of our proposed method. The datasets and codes are available at: https://github.com/Ryh1218/Occ

[47] Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning cs.CVPDF

Chendi Ge, Xin Wang, Zeyang Zhang, Hong Chen, Jiapei Fan

TL;DR: 本文提出了一种动态Mixture of Curriculum LoRA Experts (D-MoLE)方法，用于多模态大语言模型(MLLM)的持续指令调优，解决了任务架构冲突和模态不平衡问题，显著提升了性能。

Details

Motivation: 现有多模态大语言模型的持续学习通常采用固定架构，难以适应新任务，且存在任务架构冲突和模态不平衡问题。

Result: D-MoLE在实验中平均性能提升15%，显著优于现有基线方法。

Insight: 通过动态架构调整和模态课程，模型能够更灵活地适应新任务并平衡模态更新，为MLLM持续学习提供了新思路。

Abstract: Continual multimodal instruction tuning is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks. However, most existing methods adopt a fixed architecture, struggling with adapting to new tasks due to static model capacity. We propose to evolve the architecture under parameter budgets for dynamic task adaptation, which remains unexplored and imposes two challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which automatically evolves MLLM’s architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer-wisely to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15% average improvement over the best baseline. To the best of our knowledge, this is the first study of continual learning for MLLMs from an architectural perspective.

Libin Lan, Hongxing Li, Zunhui Xia, Juan Zhou, Xiaofei Zhu

TL;DR: 该论文提出了一种跨模态聚类引导的负采样方法（CM-CGNS），通过改进负样本选择和引入跨模态掩码图像重建模块，解决了现有医学图像与报告自监督学习中的局限性，显著提升了模型的表示能力和下游任务性能。

Details

Motivation: 现有的医学图像与报告自监督学习方法存在三个主要问题：负样本选择不足、忽视局部细节以及忽略低层次特征的重要性。这些问题限制了模型的表现力。

Result: 在五个下游数据集上的分类、检测和分割任务中，该方法在多个指标上优于现有技术，验证了其优越性。

Insight: 通过改进负样本选择和结合低层次特征，跨模态自监督学习能够更有效地捕捉医学图像中的关键细节，提升下游任务的性能。

Abstract: Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations. 1) neglecting the selection of negative samples, resulting in the scarcity of hard negatives and the inclusion of false negatives; 2) focusing on global feature extraction, but overlooking the fine-grained local details that are crucial for medical image recognition tasks; and 3) contrastive learning primarily targets high-level features but ignoring low-level details which are essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method with two-fold ideas. First, it extends the k-means clustering used for local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model’s cross-modal information interaction capabilities and retains low-level image features essential for downstream tasks. By well handling the aforementioned limitations, the proposed CM-CGNS can learn effective and robust medical visual representations suitable for various recognition tasks. Extensive experimental results on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.

[49] Predicting Patient Survival with Airway Biomarkers using nn-Unet/Radiomics cs.CV | cs.LGPDF

Zacharia Mesbah, Dhruv Jain, Tsiry Mayet, Romain Modzelewski, Romain Herault

TL;DR: 论文提出了一种三阶段方法，通过nn-Unet分割气道结构，提取放射组学特征，并用SVM分类预测肺纤维化患者的生存率。

Details

Motivation: 研究旨在评估气道相关影像生物标志物对肺纤维化患者生存结果的预测价值。

Result: 分割任务得分为0.8601，分类任务得分为0.7346。

Insight: 气道结构和周围区域的放射组学特征可能具有重要的生存预测价值。

Abstract: The primary objective of the AIIB 2023 competition is to evaluate the predictive significance of airway-related imaging biomarkers in determining the survival outcomes of patients with lung fibrosis.This study introduces a comprehensive three-stage approach. Initially, a segmentation network, namely nn-Unet, is employed to delineate the airway’s structural boundaries. Subsequently, key features are extracted from the radiomic images centered around the trachea and an enclosing bounding box around the airway. This step is motivated by the potential presence of critical survival-related insights within the tracheal region as well as pertinent information encoded in the structure and dimensions of the airway. Lastly, radiomic features obtained from the segmented areas are integrated into an SVM classifier. We could obtain an overall-score of 0.8601 for the segmentation in Task 1 while 0.7346 for the classification in Task 2.

[50] Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets cs.CV | cs.AI | I.2.0PDF

MingZe Tang, Madiha Kazi

TL;DR: 该研究比较了不同模型（如全连接网络、卷积网络和Transformer）在小规模COCO数据集上的人类动作识别任务中的表现，发现Vision Transformer（ViT）在测试集上表现最优，并通过SHAP和LeGrad可视化技术解释了其成功的原因。

Details

Motivation: 研究旨在探索不同模型在人类动作识别任务中的表现差异，尤其是在数据有限的情况下，并试图通过可视化技术揭示模型决策的依据。

Result: ViT取得了90%的平均测试准确率，显著优于CNN（约35%）和CLIP模型（约62-64%），且ANOVA结果表明这些差异具有统计显著性。

Insight: ViT能够利用人类姿态相关的局部区域（如下肢动作）进行决策，而传统CNN模型则可能被背景纹理干扰，这表明Transformer在数据效率和解译性方面具有优势。

Abstract: This study explores human action recognition using a three-class subset of the COCO image corpus, benchmarking models from simple fully connected networks to transformer architectures. The binary Vision Transformer (ViT) achieved 90% mean test accuracy, significantly exceeding multiclass classifiers such as convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%). A one-way ANOVA (F = 61.37, p < 0.001) confirmed these differences are statistically significant. Qualitative analysis with SHAP explainer and LeGrad heatmaps indicated that the ViT localizes pose-specific regions (e.g., lower limbs for walking or running), while simpler feed-forward models often focus on background textures, explaining their errors. These findings emphasize the data efficiency of transformer representations and the importance of explainability techniques in diagnosing class-specific failures.

[51] MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space cs.CV | cs.AIPDF

Anshul Singh, Chris Biemann, Jan Strich

TL;DR: 该论文提出了一种名为MTabVQA的新基准，用于评估视觉语言模型在多表格视觉数据上的推理能力，并通过大规模指令调整数据集MTabVQA-Instruct提升了模型的性能。

Details

Motivation: 现有的视觉语言模型在单表格或非视觉数据上表现良好，但在多表格视觉数据上的推理能力仍有不足，尤其是在复杂的多跳推理任务中。MTabVQA填补了这一空白。

Result: 实验表明，现有的视觉语言模型在多表格推理任务上表现不佳，但通过MTabVQA-Instruct的指令调整后性能显著提升。

Insight: 多表格视觉推理是一个具有挑战性的任务，指令调整技术可以显著提升模型在此类任务上的表现。

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to interpret robustly and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios like web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: they don’t assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. We introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering to bridge that gap. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive benchmark results for state-of-the-art VLMs on MTabVQA, revealing significant performance limitations. We further investigate post-training techniques to enhance these reasoning abilities and release MTabVQA-Instruct, a large-scale instruction-tuning dataset. Our experiments show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning. Code and dataset (https://huggingface.co/datasets/mtabvqa/MTabVQA-Eval) are available online (https://anonymous.4open.science/r/MTabVQA-EMNLP-B16E).

Libin Lan, Hongxing Li, Zunhui Xia, Yudong Zhang

TL;DR: DMAF-Net提出了一种动态模态感知融合网络，通过动态平衡缺失模态的影响和模态贡献，提升了不完整多模态医学图像分割的性能。

Details

Motivation: 现有方法依赖完整模态的假设，无法动态平衡模态贡献，忽略了模态间的结构关系，导致在真实临床场景中表现不佳。

Result: 在BraTS2020和MyoPS2020数据集上，DMAF-Net优于现有方法。

Insight: 动态平衡模态贡献和缺失模态的影响是提升不完整多模态分割性能的关键，同时需关注模态间的结构关系。

Abstract: Incomplete multi-modal medical image segmentation faces critical challenges from modality imbalance, including imbalanced modality missing rates and heterogeneous modality contributions. Due to their reliance on idealized assumptions of complete modality availability, existing methods fail to dynamically balance contributions and neglect the structural relationships between modalities, resulting in suboptimal performance in real-world clinical scenarios. To address these limitations, we propose a novel model, named Dynamic Modality-Aware Fusion Network (DMAF-Net). The DMAF-Net adopts three key ideas. First, it introduces a Dynamic Modality-Aware Fusion (DMAF) module to suppress missing-modality interference by combining transformer attention with adaptive masking and weight modality contributions dynamically through attention maps. Second, it designs a synergistic Relation Distillation and Prototype Distillation framework to enforce global-local feature alignment via covariance consistency and masked graph attention, while ensuring semantic consistency through cross-modal class-specific prototype alignment. Third, it presents a Dynamic Training Monitoring (DTM) strategy to stabilize optimization under imbalanced missing rates by tracking distillation gaps in real-time, and to balance convergence speeds across modalities by adaptively reweighting losses and scaling gradients. Extensive experiments on BraTS2020 and MyoPS2020 demonstrate that DMAF-Net outperforms existing methods for incomplete multi-modal medical image segmentation. Extensive experiments on BraTS2020 and MyoPS2020 demonstrate that DMAF-Net outperforms existing methods for incomplete multi-modal medical image segmentation. Our code is available at https://github.com/violet-42/DMAF-Net.

[53] Quizzard@INOVA Challenge 2025 – Track A: Plug-and-Play Technique in Interleaved Multi-Image Model cs.CV | cs.CL | cs.MMPDF

Dinh Viet Cuong, Hoang-Bao Le, An Pham Ngoc Nguyen, Liting Zhou, Cathal Gurrin

TL;DR: 该论文展示了LLaVA-NeXT-interleave模型在22个数据集上的卓越性能，并通过添加Dense Channel Integration (DCI)连接器比较其与标准模型的性能，发现标准模型在视觉密集型任务中表现更优，而DCI版本在语义一致性要求更高的任务中更具优势。

Details

Motivation: 探索插拔式技术在多图像交错任务中的应用，并验证LLaVA-NeXT-interleave模型在多任务和多模态场景下的性能潜力。

Result: 标准模型在视觉密集型任务（如VISION、NLVR2和Fashion200K）中表现最佳；DCI增强版本在需要语义一致性或结构化变化理解的任务（如MIT-States_PropertyCoherence和SlideVQA）中表现更优。

Insight: 强大的基础模型与插拔式技术的结合在多图像交错任务中具有巨大潜力，不同任务可能需要不同的模块增强。

Abstract: This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of the LLaVA-NeXT-interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to the LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for Interleave tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.

[54] MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution cs.CVPDF

Linfeng He, Meiqin Liu, Qi Tang, Chao Yao, Yao Zhao

TL;DR: MambaVSR是首个基于状态空间模型的视频超分辨率框架，通过内容感知扫描机制和动态时空交互优化非局部依赖建模，性能和效率显著优于现有方法。

Details

Motivation: 现有视频超分辨率方法（如光流或Transformer）在应对大位移和长视频序列时效果不佳，MambaVSR旨在通过状态空间模型解决这些问题。

Result: 在REDS数据集上PSNR提升0.58 dB，参数量减少55%，显著优于基于Transformer的方法。

Insight: 状态空间模型在视频任务中具有潜力，动态内容感知机制可显著提升非局部依赖建模的效率和效果。

Abstract: Video super-resolution (VSR) faces critical challenges in effectively modeling non-local dependencies across misaligned frames while preserving computational efficiency. Existing VSR methods typically rely on optical flow strategies or transformer architectures, which struggle with large motion displacements and long video sequences. To address this, we propose MambaVSR, the first state-space model framework for VSR that incorporates an innovative content-aware scanning mechanism. Unlike rigid 1D sequential processing in conventional vision Mamba methods, our MambaVSR enables dynamic spatiotemporal interactions through the Shared Compass Construction (SCC) and the Content-Aware Sequentialization (CAS). Specifically, the SCC module constructs intra-frame semantic connectivity graphs via efficient sparse attention and generates adaptive spatial scanning sequences through spectral clustering. Building upon SCC, the CAS module effectively aligns and aggregates non-local similar content across multiple frames by interleaving temporal features along the learned spatial order. To bridge global dependencies with local details, the Global-Local State Space Block (GLSSB) synergistically integrates window self-attention operations with SSM-based feature propagation, enabling high-frequency detail recovery under global dependency guidance. Extensive experiments validate MambaVSR’s superiority, outperforming the Transformer-based method by 0.58 dB PSNR on the REDS dataset with 55% fewer parameters.

[55] CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection cs.CV | cs.LGPDF

Byeongchan Lee, John Won, Seunghyun Lee, Jinwoo Shin

TL;DR: CLIPFUSION结合CLIP的判别能力和扩散模型的生成能力，提出一种协同方法用于异常检测，显著优于基线方法。

Details

Motivation: 异常检测因异常定义模糊、类型多样且数据稀缺而复杂，需一种能捕捉多层次特征的模型。

Result: 在MVTec-AD和VisA数据集上，CLIPFUSION在异常分割和分类任务中表现优异。

Insight: 多模态与多模型融合为解决异常检测的多面挑战提供了可扩展方案。

Abstract: Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defect), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Specifically, the CLIP-based discriminative model excels at capturing global features, while the diffusion-based generative model effectively captures local details, creating a synergistic and complementary approach. Notably, we introduce a methodology for utilizing cross-attention maps and feature maps extracted from diffusion models specifically for anomaly detection. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods, achieving outstanding performance in both anomaly segmentation and classification. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.

[56] AgentSense: Virtual Sensor Data Generation Using LLM Agent in Simulated Home Environments cs.CV | cs.HCPDF

Zikang Leng, Megha Thukral, Yaqi Liu, Hrudhai Rajasekhar, Shruthi K. Hiremath

TL;DR: AgentSense提出了一种利用大型语言模型（LLM）生成虚拟传感器数据的管道，通过模拟家居环境和多样化用户行为，解决智能家居中人类活动识别（HAR）系统缺乏大规模多样化标注数据的问题。

Details

Motivation: 开发稳健且泛化能力强的智能家居HAR系统面临的主要挑战是缺乏大规模、多样化的标注数据集。由于家庭布局、传感器配置和用户行为的多样性，数据收集变得复杂且昂贵。

Result: 实验结果表明，利用虚拟传感器数据不仅能显著提升HAR系统的性能，且少量真实数据结合虚拟数据的训练效果可媲美仅使用完整真实数据集的效果。

Insight: 虚拟数据生成可以缓解标注数据匮乏的问题，尤其适用于数据收集困难的领域。AgentSense展示了虚拟数据在智能家居领域的潜力，为HAR系统的开发提供了新思路。

Abstract: A major obstacle in developing robust and generalizable smart home-based Human Activity Recognition (HAR) systems is the lack of large-scale, diverse labeled datasets. Variability in home layouts, sensor configurations, and user behavior adds further complexity, as individuals follow varied routines and perform activities in distinct ways. Building HAR systems that generalize well requires training data that captures the diversity across users and environments. To address these challenges, we introduce AgentSense, a virtual data generation pipeline where diverse personas are generated by leveraging Large Language Models. These personas are used to create daily routines, which are then decomposed into low-level action sequences. Subsequently, the actions are executed in a simulated home environment called VirtualHome that we extended with virtual ambient sensors capable of recording the agents activities as they unfold. Overall, AgentSense enables the generation of rich, virtual sensor datasets that represent a wide range of users and home settings. Across five benchmark HAR datasets, we show that leveraging our virtual sensor data substantially improves performance, particularly when real data are limited. Notably, models trained on a combination of virtual data and just a few days of real data achieve performance comparable to those trained on the entire real datasets. These results demonstrate and prove the potential of virtual data to address one of the most pressing challenges in ambient sensing, which is the distinct lack of large-scale, annotated datasets without requiring any manual data collection efforts.

[57] Real-Time Feedback and Benchmark Dataset for Isometric Pose Evaluation cs.CV | cs.AI | cs.HCPDF

Abhishek Jaiswal, Armeet Singh Luthra, Purav Jangir, Bhavya Garg, Nisheeth Srivastava

TL;DR: 这篇论文提出了一种实时反馈系统，用于评估等长运动姿势，并发布了迄今为止最大的多类等长运动视频数据集。通过基准测试和新型评估指标，推动了智能家庭健身系统的可行性。

Details

Motivation: 等长运动因其便捷性和对器材的低依赖性受到欢迎，但缺乏专业指导可能导致姿势错误、受伤和训练效果不佳。为解决这一问题，作者开发了实时反馈系统。

Result: 系统实现了对等长运动姿势的实时评估和反馈，增强了智能家庭健身系统的可行性。

Insight:

Abstract: Isometric exercises appeal to individuals seeking convenience, privacy, and minimal dependence on equipments. However, such fitness training is often overdependent on unreliable digital media content instead of expert supervision, introducing serious risks, including incorrect posture, injury, and disengagement due to lack of corrective feedback. To address these challenges, we present a real-time feedback system for assessing isometric poses. Our contributions include the release of the largest multiclass isometric exercise video dataset to date, comprising over 3,600 clips across six poses with correct and incorrect variations. To support robust evaluation, we benchmark state-of-the-art models-including graph-based networks-on this dataset and introduce a novel three-part metric that captures classification accuracy, mistake localization, and model confidence. Our results enhance the feasibility of intelligent and personalized exercise training systems for home workouts. This expert-level diagnosis, delivered directly to the users, also expands the potential applications of these systems to rehabilitation, physiotherapy, and various other fitness disciplines that involve physical motion.

[58] Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation cs.CV | cs.AI | cs.CY | cs.LGPDF

Divyanshu Mishra, Mohammadreza Salehi, Pramit Saha, Olga Patey, Aris T. Papageorghiou

TL;DR: 论文提出了一种名为DISCOVR的自监督双分支框架，用于心脏超声视频表示学习，通过聚类和在线图像编码器结合，实现了高质量的视频表示。

Details

Motivation: 目前的SSL方法在自然图像和视频中表现优异，但在心脏超声等医学影像中面临挑战，如相似样本多、低PSNR输入和临床相关特征易失真等问题。

Result: 在六个心脏超声数据集上的实验表明，DISCOVR在零样本和线性探测任务中优于现有方法，且分割迁移性能优异。

Insight: 通过蒸馏损失结合时间动态和空间语义，DISCOVR成功提升了心脏超声视频的表示质量，证明了跨模态知识传递的有效性。

Abstract: Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, and achieves superior segmentation transfer.

[59] GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers cs.CVPDF

Guang Liang, Xinyao Liu, Jianxin Wu

TL;DR: GPLQ是一种高效、快速且通用的ViT量化方法，通过两阶段策略（先量化激活再量化权重）显著提升了量化效率，同时保持了模型性能。

Details

Motivation: 现有的ViT量化方法（如PTQ和QAT）存在精度下降、计算成本高或泛化能力差等问题，亟需一种更高效且实用的解决方案。

Result: GPLQ比现有QAT快100倍，内存占用低于FP32训练，4-bit量化模型在ImageNet和下游任务上的性能接近FP32模型。

Insight: 激活量化是关键，且需保持模型原始优化’basin’以维持泛化能力，两阶段策略实现了效率和性能的平衡。

Abstract: Vision Transformers (ViTs) are essential in computer vision but are computationally intensive, too. Model quantization, particularly to low bit-widths like 4-bit, aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and lacking of open-source codebase. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model’s original optimization basin'' to maintain generalization. Consequently, GPLQ employs a sequential activation-first, weights-later’’ strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature mimicking loss in only 1 epoch to keep it stay in the same ``basin’’, thereby preserving generalization. Stage 2 quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers memory footprint to levels even below FP32 training, and achieves 4-bit model performance that is highly competitive with FP32 models in terms of both accuracy on ImageNet and generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.

[60] Teleoperated Driving: a New Challenge for 3D Object Detection in Compressed Point Clouds cs.CV | cs.NI | eess.IVPDF

Filippo Bragato, Michael Neri, Paolo Testolina, Marco Giordani, Federica Battisti

TL;DR: 该论文探讨了在压缩点云数据中进行3D物体检测的挑战，特别是针对远程驾驶（TD）场景。作者利用扩展的SELMA数据集，评估了压缩算法和物体检测器的性能，并分析了其对V2X网络的影响。

Details

Motivation: 随着互联设备和传感器的普及，远程驾驶成为可能。然而，如何在压缩的点云数据中高效检测3D物体以支持安全的远程驾驶，是一个亟待解决的问题。

Result: 实验结果表明，压缩算法和检测器在远程驾驶场景中的表现直接影响V2X网络的数据速率和延迟，符合3GPP对TD应用的要求。

Insight: 该研究突出了在远程驾驶中高效处理压缩点云数据的重要性，并为未来的研究提供了数据集和性能评估基准。

Abstract: In recent years, the development of interconnected devices has expanded in many fields, from infotainment to education and industrial applications. This trend has been accelerated by the increased number of sensors and accessibility to powerful hardware and software. One area that significantly benefits from these advancements is Teleoperated Driving (TD). In this scenario, a controller drives safely a vehicle from remote leveraging sensors data generated onboard the vehicle, and exchanged via Vehicle-to-Everything (V2X) communications. In this work, we tackle the problem of detecting the presence of cars and pedestrians from point cloud data to enable safe TD operations. More specifically, we exploit the SELMA dataset, a multimodal, open-source, synthetic dataset for autonomous driving, that we expanded by including the ground-truth bounding boxes of 3D objects to support object detection. We analyze the performance of state-of-the-art compression algorithms and object detectors under several metrics, including compression efficiency, (de)compression and inference time, and detection accuracy. Moreover, we measure the impact of compression and detection on the V2X network in terms of data rate and latency with respect to 3GPP requirements for TD applications.

[61] Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation cs.CV | cs.CLPDF

Xintong Wang, Jingheng Pan, Yixiao Liu, Xiaohu Zhao, Chenyang Lyu

TL;DR: 该论文系统地研究了多语言视觉语言翻译（VLT），提出了AibTrans数据集、Density-Aware Evaluation评估方法，并揭示了现有大视觉语言模型的局限性。

Details

Motivation: 当前大型视觉语言模型在多语言VLT任务中缺乏系统评估，现有数据集在语义和文化保真度上存在不足，亟需研究数据质量、模型架构和评估指标的全面改进。

Result: 研究发现：1. 高资源语言对微调会降低跨语言性能；2. 平衡多语言微调策略能提升模型适应性而不牺牲泛化能力。

Insight: 多语言VLT任务中，数据质量和评估方法需要更严格的设计，同时模型微调需平衡多语言能力以避免性能退化。

Abstract: Vision-Language Translation (VLT) is a challenging task that requires accurately recognizing multilingual text embedded in images and translating it into the target language with the support of visual context. While recent Large Vision-Language Models (LVLMs) have demonstrated strong multilingual and visual understanding capabilities, there is a lack of systematic evaluation and understanding of their performance on VLT. In this work, we present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics. (1) We identify critical limitations in existing datasets, particularly in semantic and cultural fidelity, and introduce AibTrans – a multilingual, parallel, human-verified dataset with OCR-corrected annotations. (2) We benchmark 11 commercial LVLMs/LLMs and 6 state-of-the-art open-source models across end-to-end and cascaded architectures, revealing their OCR dependency and contrasting generation versus reasoning behaviors. (3) We propose Density-Aware Evaluation to address metric reliability issues under varying contextual complexity, introducing the DA Score as a more robust measure of translation quality. Building upon these findings, we establish a new evaluation benchmark for VLT. Notably, we observe that fine-tuning LVLMs on high-resource language pairs degrades cross-lingual performance, and we propose a balanced multilingual fine-tuning strategy that effectively adapts LVLMs to VLT without sacrificing their generalization ability.

[62] Vision-based Lifting of 2D Object Detections for Automated Driving cs.CV | cs.LGPDF

Hendrik Königshof, Kun Li, Christoph Stiller

TL;DR: 本文提出了一种低成本方案，通过2D检测算法结合2D CNN处理的升维方法，实现基于视觉的3D目标检测，适用于自动驾驶。

Details

Motivation: 自动驾驶中廉价的车载摄像头普及，但现有3D检测器依赖昂贵的LiDAR数据。研究旨在提出基于视觉的低成本替代方案。

Result: 在KITTI 3D检测基准上表现接近SOTA图像方法，运行时间仅为三分之一。

Insight: 通过优化计算流程，基于视觉的3D检测可在低成本设备上实现高效性能，为自动驾驶提供了实用替代方案。

Abstract: Image-based 3D object detection is an inevitable part of autonomous driving because cheap onboard cameras are already available in most modern cars. Because of the accurate depth information, currently, most state-of-the-art 3D object detectors heavily rely on LiDAR data. In this paper, we propose a pipeline which lifts the results of existing vision-based 2D algorithms to 3D detections using only cameras as a cost-effective alternative to LiDAR. In contrast to existing approaches, we focus not only on cars but on all types of road users. To the best of our knowledge, we are the first using a 2D CNN to process the point cloud for each 2D detection to keep the computational effort as low as possible. Our evaluation on the challenging KITTI 3D object detection benchmark shows results comparable to state-of-the-art image-based approaches while having a runtime of only a third.

[63] SphereDrag: Spherical Geometry-Aware Panoramic Image Editing cs.CVPDF

Zhiao Feng, Xuewei Li, Junjie Yang, Yuxin Peng, Xi Li

TL;DR: SphereDrag 是一个针对全景图像编辑的新框架，通过结合球面几何知识解决了边界不连续性、轨迹变形和像素密度不均等问题，显著提升了编辑的准确性和可控性。

Details

Motivation: 全景图像编辑由于球面几何和投影畸变带来的边界不连续性、轨迹变形和像素密度不均等问题，尚未得到充分探索。SphereDrag 旨在解决这些挑战，实现更准确的编辑。

Result: 实验表明，SphereDrag 在几何一致性和图像质量上显著优于现有方法，相对提升高达 10.5%。

Insight: 球面几何知识在全景图像编辑中至关重要，SphereDrag 通过几何感知的方法提升了编辑效果，为未来研究提供了新方向。

Abstract: Image editing has made great progress on planar images, but panoramic image editing remains underexplored. Due to their spherical geometry and projection distortions, panoramic images present three key challenges: boundary discontinuity, trajectory deformation, and uneven pixel density. To tackle these issues, we propose SphereDrag, a novel panoramic editing framework utilizing spherical geometry knowledge for accurate and controllable editing. Specifically, adaptive reprojection (AR) uses adaptive spherical rotation to deal with discontinuity; great-circle trajectory adjustment (GCTA) tracks the movement trajectory more accurate; spherical search region tracking (SSRT) adaptively scales the search range based on spherical location to address uneven pixel density. Also, we construct PanoBench, a panoramic editing benchmark, including complex editing tasks involving multiple objects and diverse styles, which provides a standardized evaluation framework. Experiments show that SphereDrag gains a considerable improvement compared with existing methods in geometric consistency and image quality, achieving up to 10.5% relative improvement.

[64] Evaluating Sensitivity Parameters in Smartphone-Based Gaze Estimation: A Comparative Study of Appearance-Based and Infrared Eye Trackers cs.CV | cs.HCPDF

Nishan Gunawardena, Gough Yumu Lui, Jeewani Anupama Ginige, Bahman Javadi

TL;DR: 本研究比较了基于智能手机的深度学习视线追踪算法与商业红外线眼动仪Tobii Pro Nano的性能，探讨了在多因素条件下视线估计的可行性。

Details

Motivation: 研究旨在验证基于外观的视线估计方法在真实移动使用场景中的表现，并分析影响其性能的关键敏感因素。

Result: 深度学习模型的平均误差为17.76毫米，略高于红外线眼动仪的16.53毫米，但在光照不足、戴眼镜或高龄参与者中表现更易受影响。

Insight: 研究展示了基于外观的视线估计在移动设备上的潜力，同时为不同使用条件下的视线估计系统评估提供了参考框架。

Abstract: This study evaluates a smartphone-based, deep-learning eye-tracking algorithm by comparing its performance against a commercial infrared-based eye tracker, the Tobii Pro Nano. The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions. Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were systematically analysed. The appearance-based algorithm integrates a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images. Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured using Euclidean distance. The deep learning model produced a mean error of 17.76 mm, compared to 16.53 mm for the Tobii Pro Nano. While overall accuracy differences were small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions among participants using glasses and in older age groups. Device-specific and positional factors also influenced tracking performance. These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.

[65] How Visual Representations Map to Language Feature Space in Multimodal LLMs cs.CV | cs.LGPDF

Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda

TL;DR: 论文提出了一种方法论框架，通过冻结大型语言模型（LLM）和视觉Transformer（ViT），仅训练线性适配器来研究视觉与语言表征的对齐机制，揭示了视觉表征如何逐步与语言特征空间对齐。

Details

Motivation: 多模态推理的有效性依赖于视觉与语言表征的对齐，但目前对其机制缺乏深入理解。研究旨在探索视觉语言模型（VLMs）如何实现这种对齐。

Result: 实验表明，视觉表征在LLM的中后层逐渐与语言特征对齐，揭示了早期层存在的表征不匹配问题。

Insight: 当前基于适配器的架构可能未最优地支持跨模态表征学习，需进一步优化对齐机制。

Abstract: Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. We introduce a methodological framework that deliberately maintains a frozen large language model (LLM) and a frozen vision transformer (ViT), connected solely by training a linear adapter during visual instruction tuning. This design is fundamental to our approach: by keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data. Consequently, the linear adapter must map visual features directly into the LLM’s existing representational space rather than allowing the language model to develop specialized visual understanding through fine-tuning. Our experimental design uniquely enables the use of pre-trained sparse autoencoders (SAEs) of the LLM as analytical probes. These SAEs remain perfectly aligned with the unchanged language model and serve as a snapshot of the learned language feature-representations. Through systematic analysis of SAE reconstruction error, sparsity patterns, and feature SAE descriptions, we reveal the layer-wise progression through which visual representations gradually align with language feature representations, converging in middle-to-later layers. This suggests a fundamental misalignment between ViT outputs and early LLM layers, raising important questions about whether current adapter-based architectures optimally facilitate cross-modal representation learning.

[66] Simple Radiology VLLM Test-time Scaling with Thought Graph Traversal cs.CVPDF

Yue Yao, Zelin Wen, Yan Tong, Xinyu Tian, Xuqing Li

TL;DR: 该论文提出了一种无需额外训练的轻量级方法——Thought Graph Traversal（TGT），通过医学先验知识引导视觉语言大模型（VLLM）按医学逻辑顺序推理，结合动态调整推理深度的策略，显著提升了放射学报告生成的性能。

Details

Motivation: 测试时扩展（test-time scaling）是一种无需额外训练即可提升视觉语言大模型（VLLM）推理性能的方法。论文旨在探索如何将其应用于放射学报告生成，以生成更准确和一致的报告。

Result: 在标准放射学报告生成基准测试中，该方法优于基线提示方法，生成的报告更准确一致，同时通过可追溯的推理路径揭示了数据集偏差。

Insight: 1. 医学先验知识的引入可以显著提升模型的逻辑推理能力；2. 动态调整推理深度是一种有效的测试时扩展策略；3. 可追溯的推理路径有助于理解模型行为和数据偏差。

Abstract: Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to radiology report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model’s inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at https://github.com/glerium/Thought-Graph-Traversal.

[67] VGR: Visual Grounded Reasoning cs.CV | cs.AI | cs.CLPDF

Jiacong Wang, Zijiang Kang, Haochen Wang, Haiyong Jiang, Jiawen Li

TL;DR: VGR提出了一种新颖的多模态大语言模型（MLLM），通过结合视觉区域检测和语言推理，提升了复杂视觉推理任务中的性能。

Details

Motivation: 现有方法多依赖于纯语言空间的推理，存在语言偏差且局限于数学或科学领域，难以处理需要全面理解图像细节的复杂视觉推理任务。

Result: 在LLaVA-NeXT-7B基线模型上，VGR仅使用30%的图像令牌数，但在MMStar、AI2D和ChartQA等基准上分别实现了+4.1、+7.1和+12.9的性能提升。

Insight: 通过结合视觉区域检测和语言推理，VGR能够更高效地处理多模态任务，减少语言偏差，同时显著提升性能。

Abstract: In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer the question or reasoning solely on the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we conduct a large-scale SFT dataset called VGR -SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference and a replay stage is introduced to integrates the corresponding regions into the reasoning process, enhancing multimodel comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and a +12.9 improvement on ChartQA.

[68] Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale cs.CVPDF

Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, Minsu Cho

TL;DR: Affogato提出了一个大规模基准数据集和简单有效的视觉语言模型，用于开放词汇的affordance grounding任务，解决了数据稀缺和细粒度定位的挑战。

Details

Motivation: 现有affordance grounding任务面临细粒度定位、多义性交互区域和大规模数据稀缺的问题，亟需解决方案以推动智能体与环境的交互能力。

Result: 模型在现有2D和3D基准测试中表现优异，尤其在开放词汇跨域泛化方面效果显著。

Insight: 通过自动化数据生成和大规模标注，有效解决了affordance grounding任务中的数据瓶颈，为后续研究提供了重要资源和基线方法。

Abstract: Affordance grounding-localizing object regions based on natural language descriptions of interactions-is a critical challenge for enabling intelligent agents to understand and interact with their environments. However, this task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained with the Affogato dataset achieve promising performance on the existing 2D and 3D benchmarks, and notably, exhibit effectiveness in open-vocabulary cross-domain generalization. The Affogato dataset is shared in public: https://huggingface.co/datasets/project-affogato/affogato

cs.CL [Back]

[69] Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation cs.CL | cs.AIPDF

Jiayu Yao, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi

TL;DR: 该论文首次系统地研究了多模态检索增强生成（RAG）系统中的位置偏差问题，揭示了证据位置对性能的显著影响，并提出了一种量化指标PSI_p和可视化框架以分析注意力分配模式。

Details

Motivation: 多模态RAG系统在知识密集和开放域任务中十分重要，但其性能对证据顺序高度敏感，导致不稳定和偏差，亟需研究位置偏差的影响。

Result: 实验结果显示位置偏差在多模态RAG中更明显，且呈U型准确性曲线，偏差随检索范围对数增加。

Insight: 多模态RAG系统的偏差问题需要证据重排序或去偏策略，以提高其可靠性和公平性。

Abstract: Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index ($PSI_p$) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems.

[70] A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes cs.CLPDF

Hieu Nghiem, Hemanth Reddy Singareddy, Zhuqi Miao, Jivan Lamichhane, Abdulaziz Ahmed

TL;DR: 论文提出了一种基于大语言模型（LLM）的流水线，用于从临床笔记中提取系统回顾（ROS）实体，结合开源和商业LLM，显著降低了识别错误率，适用于资源有限的医疗环境。

Details

Motivation: 临床笔记中的系统回顾（ROS）实体提取需求日益增长，但传统方法成本高且效率低。研究旨在开发一种经济高效的自动化解决方案。

Result: 实验显示，基于ChatGPT的流水线表现最佳（实体范围错误率28.2%，状态/系统错误率14.5%），开源LLM表现接近（错误率范围：30.5%-36.7%和24.3%-27.3%）。

Insight: 开源LLM为资源有限的医疗机构提供了经济高效的本地部署方案，同时保持了接近商业模型的性能，是替代商业模型的可行选择。

Abstract: Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS sections using SecTag, followed by few-shot LLMs to identify ROS entity spans, their positive/negative status, and associated body systems. We implemented the pipeline using open-source LLMs (Mistral, Llama, Gemma) and ChatGPT. The evaluation was conducted on 36 general medicine notes containing 341 annotated ROS entities. Results: When integrating ChatGPT, the pipeline achieved the lowest error rates in detecting ROS entity spans and their corresponding statuses/systems (28.2% and 14.5%, respectively). Open-source LLMs enable local, cost-efficient execution of the pipeline while delivering promising performance with similarly low error rates (span: 30.5-36.7%; status/system: 24.3-27.3%). Discussion and Conclusion: Our pipeline offers a scalable and locally deployable solution to reduce ROS documentation burden. Open-source LLMs present a viable alternative to commercial models in resource-limited healthcare environments.

Bumjin Park, Jinsil Lee, Jaesik Choi

TL;DR: 论文揭示了大型语言模型（LLM）在模态表达（如“必须”或“应该”）影响下，倾向于将非义务场景判断为义务的现象，称为Deontological Keyword Bias（DKB），并提出了一种缓解策略。

Details

Motivation: LLMs在道德和伦理推理中的表现日益重要，但其判断标准常不明确，尤其是在涉及义务判断时。作者发现模态表达会显著影响LLMs的判断，引发偏差。

Result: 超过90%的常识场景在模态表达下被误判为义务，且这一现象在不同LLM家族和问题类型中一致存在。缓解策略能有效减少偏差。

Insight: 语言框架（如模态表达）对LLMs的规范性决策有重大影响，未来对齐研究需关注此类语言偏见的缓解。

Abstract: Large language models (LLMs) are increasingly engaging in moral and ethical reasoning, where criteria for judgment are often unclear, even for humans. While LLM alignment studies cover many areas, one important yet underexplored area is how LLMs make judgments about obligations. This work reveals a strong tendency in LLMs to judge non-obligatory contexts as obligations when prompts are augmented with modal expressions such as must or ought to. We introduce this phenomenon as Deontological Keyword Bias (DKB). We find that LLMs judge over 90% of commonsense scenarios as obligations when modal expressions are present. This tendency is consist across various LLM families, question types, and answer formats. To mitigate DKB, we propose a judgment strategy that integrates few-shot examples with reasoning prompts. This study sheds light on how modal expressions, as a form of linguistic framing, influence the normative decisions of LLMs and underscores the importance of addressing such biases to ensure judgment alignment.

[72] Targeted control of fast prototyping through domain-specific interface cs.CLPDF

Yu-Zhe Shi, Mingchen Liu, Hanlu Ma, Qiao Xu, Huamin Qu

TL;DR: 本文提出了一种领域特定接口架构，通过自然语言指令实现原型模型的精准控制，解决了设计师语言与建模语言之间的抽象层级、语义精度和词汇范围不匹配问题。

Details

Motivation: 工业设计师希望通过自然语言指令直观地配置和调整原型模型，而无需依赖复杂的建模命令。目前大语言模型在此领域的潜力未完全发挥，主要因设计师语言与建模语言之间存在多重不匹配问题。

Result: 机器评估和多人实验表明，该接口能作为大语言模型的辅助模块，实现对原型模型的精准有效控制。

Insight: 领域特定接口架构可弥补自然语言与建模语言之间的鸿沟，为大语言模型在快速原型设计中的应用提供了新思路。

Abstract: Industrial designers have long sought a natural and intuitive way to achieve the targeted control of prototype models – using simple natural language instructions to configure and adjust the models seamlessly according to their intentions, without relying on complex modeling commands. While Large Language Models have shown promise in this area, their potential for controlling prototype models through language remains partially underutilized. This limitation stems from gaps between designers’ languages and modeling languages, including mismatch in abstraction levels, fluctuation in semantic precision, and divergence in lexical scopes. To bridge these gaps, we propose an interface architecture that serves as a medium between the two languages. Grounded in design principles derived from a systematic investigation of fast prototyping practices, we devise the interface’s operational mechanism and develop an algorithm for its automated domain specification. Both machine-based evaluations and human studies on fast prototyping across various product design domains demonstrate the interface’s potential to function as an auxiliary module for Large Language Models, enabling precise and effective targeted control of prototype models.

[73] CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention cs.CL | cs.AI | cs.CVPDF

Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang

TL;DR: 论文提出了CLAIM方法，通过跨语言注意力干预缓解大型视觉语言模型中的多语言目标幻觉问题，无需额外训练即可显著减少幻觉现象。

Details

Motivation: 大型视觉语言模型在多语言场景下容易出现与视觉输入不一致的回应（幻觉问题），现有方法通常依赖资源密集的预训练或微调，需要更高效的解决方案。

Result: 在POPE和MME基准测试中，CLAIM平均改善了13.56%和21.75%的性能，尤其在西班牙语中提升高达30%。

Insight: 研究发现，多语言注意力差异在中间层最为显著，这些层对多语言场景的视觉感知能力至关重要。

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observing the disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method by aligning attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on the POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.

[74] CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling cs.CLPDF

Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu

TL;DR: 这篇论文提出了 CyclicReflex，一种通过动态调整反射令牌（reflection tokens）的频率和位置来优化大型推理模型（LRMs）性能的解码策略。相比现有方法，CyclicReflex 显著提升了模型在多个数学推理任务上的表现。

Details

Motivation: 大型推理模型（LRMs）使用反射令牌引导多步推理，但现有方法对反射令牌的使用缺乏动态调控，可能导致过度使用（over-reflection）或不足使用（under-reflection），从而影响模型性能。

Result: 在 MATH500、AIME2024/2025 和 AMC2023 等任务上，CyclicReflex 在不同规模的模型（1.5B-8B）中表现优于标准解码方法和最新方法如 TIP 和 S1。

Insight: 将反射令牌的动态调度类比为优化中的学习率调度，可能是提升推理模型性能的新方向。

Abstract: Large reasoning models (LRMs), such as OpenAI’s o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens or textual segments that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as “reflection tokens” (e.g., “wait”, “but”, “alternatively”). In this work, we treat reflection tokens as a “resource” and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand and manage this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a decoding strategy that dynamically modulates reflection token logits using a position-dependent triangular waveform. Experiments on MATH500, AIME2024/2025, and AMC2023 demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and more recent approaches such as TIP (thought switching penalty) and S1. Codes are available at https://github.com/OPTML-Group/CyclicReflex.

[75] RoE-FND: A Case-Based Reasoning Approach with Dual Verification for Fake News Detection via LLMs cs.CLPDF

Yuzhou Yang, Yangming Zhou, Zhiying Zhu, Zhenxing Qian, Xinpeng Zhang

TL;DR: 该论文提出了RoE-FND框架，通过结合大语言模型（LLMs）与经验学习，将基于证据的假新闻检测重构为逻辑推理任务，解决了现有方法的噪声证据选择、泛化瓶颈和决策不透明等问题。

Details

Motivation: 在线欺骗性内容的泛滥需要鲁棒的假新闻检测系统。现有的基于证据的方法存在噪声证据选择、泛化瓶颈和决策过程不透明的缺陷，而使用LLMs的方法又引入了新的挑战，如幻觉推理和结论偏见。

Result: 实证结果表明，RoE-FND在三个数据集上的表现优于现有方法，具有更好的泛化能力和有效性。

Insight: 将LLMs与经验学习结合，可以有效解决假新闻检测中的噪声和泛化问题，同时通过双通道验证机制提高了决策的透明度和可靠性。

Abstract: The proliferation of deceptive content online necessitates robust Fake News Detection (FND) systems. While evidence-based approaches leverage external knowledge to verify claims, existing methods face critical limitations: noisy evidence selection, generalization bottlenecks, and unclear decision-making processes. Recent efforts to harness Large Language Models (LLMs) for FND introduce new challenges, including hallucinated rationales and conclusion bias. To address these issues, we propose \textbf{RoE-FND} (\textbf{\underline{R}}eason \textbf{\underline{o}}n \textbf{\underline{E}}xperiences FND), a framework that reframes evidence-based FND as a logical deduction task by synergizing LLMs with experiential learning. RoE-FND encompasses two stages: (1) \textit{self-reflective knowledge building}, where a knowledge base is curated by analyzing past reasoning errors, namely the exploration stage, and (2) \textit{dynamic criterion retrieval}, which synthesizes task-specific reasoning guidelines from historical cases as experiences during deployment. It further cross-checks rationales against internal experience through a devised dual-channel procedure. Key contributions include: a case-based reasoning framework for FND that addresses multiple existing challenges, a training-free approach enabling adaptation to evolving situations, and empirical validation of the framework’s superior generalization and effectiveness over state-of-the-art methods across three datasets.

[76] MANBench: Is Your Multimodal Model Smarter than Human? cs.CLPDF

Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang

TL;DR: MANBench是一个双语（英语和中文）的多模态基准测试，包含1,314个问题，覆盖九类任务，用于评估多模态大语言模型（MLLMs）是否能在多模态任务中超越人类表现。

Details

Motivation: 随着MLLMs的快速发展，亟需一个全面的基准测试来比较其与人类在多模态任务中的表现差异。

Result: MLLMs在知识类和文本-图像理解任务中表现优异，但在跨模态推理（如Transmorphic Understanding等）和复杂任务（如拼图和空间想象）上落后于人类。

Insight: 即使先进的MLLMs也未能全面达到人类水平，尤其在复杂推理和跨模态整合方面仍有显著差距。

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at https://github.com/micdz/MANBench.

[77] SAGE:Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs cs.CLPDF

Aditi, Hyunwoo Park, Sicheol Sung, Yo-Sub Han, Sang-Ki Ko

TL;DR: 论文SAGE提出了一种基于开源大语言模型（LLMs）的方法，用于从自然语言规范中提取符合逻辑约束的上下文无关文法（CCFGs），并通过奖励引导的强化学习提升文法质量。

Details

Motivation: 从自然语言规范中生成有效的文法一直是竞争性编程问题中的关键挑战，尤其是在有限监督下。

Result: SAGE在文法有效性和测试效果上优于17种开闭源LLM，分别提升了15.92%和12.34%。

Insight: 结合LLM的生成能力和强化学习的优化机制，可以有效提升文法提取的质量和泛化性。

Abstract: Grammar-based test case generation has proven effective for competitive programming problems, but generating valid and general grammars from natural language specifications remains a key challenge, especially under limited supervision. Context-Free Grammars with Counters (CCFGs) have recently been introduced as a formalism to represent such specifications with logical constraints by storing and reusing counter values during derivation. In this work, we explore the use of open-source large language models (LLMs) to induce CCFGs from specifications using a small number of labeled examples and verifiable reward-guided reinforcement learning. Our approach first fine-tunes an open-source LLM to perform specification-to-grammar translation, and further applies Group Relative Policy Optimization (GRPO) to enhance grammar validity and generality. We also examine the effectiveness of iterative feedback for open and closed-source LLMs in correcting syntactic and semantic errors in generated grammars. Experimental results show that our approach SAGE achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness, improving over the state-of-the-art by 15.92%p in grammar validity and 12.34%p in test effectiveness. We provide our implementation and dataset at the following anonymous repository:https://anonymous.4open.science/r/SAGE-5714

[78] PRISM: A Transformer-based Language Model of Structured Clinical Event Data cs.CL | cs.AIPDF

Lionel Levine, John Santerre, Alex S. Young, T. Barry Levine, Francis Campion

TL;DR: PRISM是一种基于Transformer的语言模型，旨在模拟临床决策过程的序列化进展。它将临床轨迹建模为事件序列，并通过自回归训练预测患者诊断路径中的下一步动作，显著提升了预测准确性。

Details

Motivation: 传统临床诊断方法通常依赖孤立的分类任务，忽略了序列化的决策过程。PRISM旨在通过建模临床事件的序列化依赖关系，更好地模拟真实世界中的诊断推理。

Result: 实验显示，PRISM在下一个事件预测任务中显著优于随机基线，生成的序列能够反映现实的临床决策路径和实验室结果发展趋势。

Insight: PRISM展示了生成式语言模型在结构化医疗数据中的应用潜力，为临床决策支持、模拟和教育提供了新工具。未来的研究方向包括进一步提升模型的解释性和实际临床集成。

Abstract: We introduce PRISM (Predictive Reasoning in Sequential Medicine), a transformer-based architecture designed to model the sequential progression of clinical decision-making processes. Unlike traditional approaches that rely on isolated diagnostic classification, PRISM frames clinical trajectories as tokenized sequences of events - including diagnostic tests, laboratory results, and diagnoses - and learns to predict the most probable next steps in the patient diagnostic journey. Leveraging a large custom clinical vocabulary and an autoregressive training objective, PRISM demonstrates the ability to capture complex dependencies across longitudinal patient timelines. Experimental results show substantial improvements over random baselines in next-token prediction tasks, with generated sequences reflecting realistic diagnostic pathways, laboratory result progressions, and clinician ordering behaviors. These findings highlight the feasibility of applying generative language modeling techniques to structured medical event data, enabling applications in clinical decision support, simulation, and education. PRISM establishes a foundation for future advancements in sequence-based healthcare modeling, bridging the gap between machine learning architectures and real-world diagnostic reasoning.

[79] RedDebate: Safer Responses through Multi-Agent Red Teaming Debates cs.CLPDF

Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu

TL;DR: RedDebate提出了一种多智能体辩论框架，利用LLMs之间的对抗性论证来主动识别和减少其不安全行为，结合长期记忆模块，相比传统方法取得显著改进。

Details

Motivation: 现有AI安全方法依赖昂贵的人工评估或单模型评估，面临扩展性和监管风险。RedDebate旨在通过多智能体辩论和自动化红队测试解决这些问题。

Result: 辩论框架单独使用可减少17.7%的不安全行为，结合长期记忆模块后提升至23.5%。

Insight: 多智能体协作和长期记忆是提升AI安全性的有效途径，无需人工干预即可动态改进模型行为。

Abstract: We propose RedDebate, a novel multi-agent debate framework that leverages adversarial argumentation among Large Language Models (LLMs) to proactively identify and mitigate their own unsafe behaviours. Existing AI safety methods often depend heavily on costly human evaluations or isolated single-model assessment, both subject to scalability constraints and oversight risks. RedDebate instead embraces collaborative disagreement, enabling multiple LLMs to critically examine one another’s reasoning, and systematically uncovering unsafe blind spots through automated red-teaming, and iteratively improve their responses. We further integrate distinct types of long-term memory that retain learned safety insights from debate interactions. Evaluating on established safety benchmarks such as HarmBench, we demonstrate the proposed method’s effectiveness. Debate alone can reduce unsafe behaviours by 17.7%, and when combined with long-term memory modules, achieves reductions exceeding 23.5%. To our knowledge, RedDebate constitutes the first fully automated framework that combines multi-agent debates with red-teaming to progressively enhance AI safety without direct human intervention.(Github Repository: https://github.com/aliasad059/RedDebate)

[80] Customizing Speech Recognition Model with Large Language Model Feedback cs.CL | cs.SD | eess.ASPDF

Shaoshi Ling, Guoli Ye

TL;DR: 提出了一种基于强化学习的无监督域自适应方法，利用大语言模型（LLM）的反馈提升语音识别模型在罕见命名实体和域不匹配情况下的性能。

Details

Motivation: 现有的自动语音识别（ASR）系统在通用转录任务上表现良好，但在罕见命名实体和域不匹配情况下表现欠佳。而大语言模型（LLM）在广泛领域表现优异。

Result: 相比传统的自训练方法，该方法在实体词错误率上提升了21%。

Insight: LLM的反馈能有效弥补ASR模型在域不匹配和罕见实体识别上的不足，强化学习为此提供了高效的优化途径。

Abstract: Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly the named entities affected by domain mismatch, through feedback from a LLM. Given contextual information, our framework employs a LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement on entity word error rate over conventional self-training methods.

[81] Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey cs.CL | cs.AIPDF

Jiachen Zhu, Menghui Zhu, Renting Rui, Rong Shan, Congmin Zheng

TL;DR: 本文通过进化视角系统地分析了当前LLM为基础的AI智能体的评估方法，提出了区分AI智能体与LLM聊天机器人的五个关键方面，并分类整理了现有评估基准，为研究者提供了实用的参考框架。

Details

Motivation: 由于现有评估框架混淆了LLM聊天机器人与AI智能体的区别，研究者难以选择合适的评估基准，因此需要一种系统化的分析方法来填补这一空白。

Result: 提出了一个全面的分析框架和实用的参考表格，帮助研究者选择和应用合适的评估基准。

Insight: 未来的评估方法需要从环境、智能体、评估者和指标四个关键视角进行综合考量，以推动这一领域的持续发展。

Abstract: The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on external environments driving forces, and resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.

[82] Graph-based RAG Enhancement via Global Query Disambiguation and Dependency-Aware Reranking cs.CL | cs.AI | cs.IRPDF

Ningyuan Li, Junrui Liu, Yi Shan, Minghui Huang, Tong Li

TL;DR: PankRAG 是一种图基的RAG增强框架，通过全局查询消歧和依赖感知的重排序机制改进现有方法的局限性。

Details

Motivation: 现有基于图的RAG方法因仅依赖实体级提取，可能忽略潜在关键信息或关系，导致检索结果不相关或遗漏重要知识，增加幻觉风险。

Result: 实验表明，PankRAG在多个基准测试中优于现有方法，验证了其鲁棒性和泛化性。

Insight: 结合结构化推理和依赖关系分析可显著提升RAG系统的准确性和可靠性。

Abstract: Contemporary graph-based retrieval-augmented generation (RAG) methods typically begin by extracting entities from user queries and then leverage pre-constructed knowledge graphs to retrieve related relationships and metadata. However, this pipeline’s exclusive reliance on entity-level extraction can lead to the misinterpretation or omission of latent yet critical information and relations. As a result, retrieved content may be irrelevant or contradictory, and essential knowledge may be excluded, exacerbating hallucination risks and degrading the fidelity of generated responses. To address these limitations, we introduce PankRAG, a framework that combines a globally aware, hierarchical query-resolution strategy with a novel dependency-aware reranking mechanism. PankRAG first constructs a multi-level resolution path that captures both parallel and sequential interdependencies within a query, guiding large language models (LLMs) through structured reasoning. It then applies its dependency-aware reranker to exploit the dependency structure among resolved sub-questions, enriching and validating retrieval results for subsequent sub-questions. Empirical evaluations demonstrate that PankRAG consistently outperforms state-of-the-art approaches across multiple benchmarks, underscoring its robustness and generalizability.

[83] History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM cs.CLPDF

Andrew Kiruluta, Andreas Lemos, Priscilla Burity

TL;DR: 这篇论文提出了CAGSR-vLLM-MTC框架，扩展了自监督交叉注意力引导强化学习（CAGSR）方法，利用高性能vLLM运行时处理多轮对话和思维链推理任务。

Details

Motivation: 动机是解决传统单轮对话和简单推理任务的限制，扩展方法以支持更复杂的多轮互动和分步推理场景。

Result: 结果表明该方法有效支持多轮对话和链式推理，同时避免了注意力崩溃的问题。

Insight: 未来方向包括多参与者对话和层次化推理的进一步研究。

Abstract: We present CAGSR-vLLM-MTC, an extension of our Self-Supervised Cross-Attention-Guided Reinforcement (CAGSR) framework, now implemented on the high-performance vLLM runtime, to address both multi-turn dialogue and chain-of-thought reasoning. Building upon our original single-turn approach, we first instrumented vLLM’s C++/CUDA kernels to asynchronously capture per-layer, per-head cross-attention weights during generation. We then generalized our self-supervised reward function to accumulate attention signals over entire conversation histories and intermediate chain-of-thought steps. We discuss practical trade-offs, including an entropy-based clamping mechanism to prevent attention collapse on early context, and outline future directions for multi-party dialogues and hierarchical reasoning.

[84] Enhancing Large Language Models for Mobility Analytics with Semantic Location Tokenization cs.CL | cs.AIPDF

Yile Chen, Yicheng Tao, Yue Jiang, Shuai Liu, Han Yu

TL;DR: QT-Mob是一种新型框架，通过语义丰富的位置标记化和多目标微调增强大语言模型（LLM）在移动性分析中的表现。

Details

Motivation: 现有方法在位置语义表示和移动信号建模方面存在不足，QT-Mob通过改进这两点提升LLM在移动性分析中的能力。

Result: 在三个真实数据集上，QT-Mob在下一个位置预测和移动恢复任务中表现优异，优于现有方法。

Insight: 语义标记化和多目标微调的结合为LLM在移动性分析中的泛化能力提供了新思路。

Abstract: The widespread adoption of location-based services has led to the generation of vast amounts of mobility data, providing significant opportunities to model user movement dynamics within urban environments. Recent advancements have focused on adapting Large Language Models (LLMs) for mobility analytics. However, existing methods face two primary limitations: inadequate semantic representation of locations (i.e., discrete IDs) and insufficient modeling of mobility signals within LLMs (i.e., single templated instruction fine-tuning). To address these issues, we propose QT-Mob, a novel framework that significantly enhances LLMs for mobility analytics. QT-Mob introduces a location tokenization module that learns compact, semantically rich tokens to represent locations, preserving contextual information while ensuring compatibility with LLMs. Furthermore, QT-Mob incorporates a series of complementary fine-tuning objectives that align the learned tokens with the internal representations in LLMs, improving the model’s comprehension of sequential movement patterns and location semantics. The proposed QT-Mob framework not only enhances LLMs’ ability to interpret mobility data but also provides a more generalizable approach for various mobility analytics tasks. Experiments on three real-world dataset demonstrate the superior performance in both next-location prediction and mobility recovery tasks, outperforming existing deep learning and LLM-based methods.

[85] AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models cs.CL | cs.AI | cs.LGPDF

Jaeho Lee, Atharv Chowdhary

TL;DR: AssertBench 是一个评估大语言模型（LLMs）在用户提出对立主张时自我坚持能力的基准，通过构建对立框架提示来衡量模型的逻辑一致性。

Details

Motivation: 现有基准测试关注事实一致性和修辞鲁棒性，但忽略了用户对事实的对立主张如何影响模型同意的方向性问题。

Result: 基准通过分层结果（中性提示下的准确性）隔离框架引起的变化，衡量模型是否坚持其事实判断。

Insight: 模型应保持一致性，而非根据用户主张调整判断，AssertBench 为评估模型的自我坚持能力提供了新工具。

Abstract: Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model’s agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model’s underlying factual knowledge by stratifying results based on the model’s accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM’s ability to “stick to its guns” when presented with contradictory user assertions about the same fact. The complete source code is available at https://github.com/achowd32/assert-bench.

[86] Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions cs.CL | cs.AIPDF

Kun Zhang, Le Wu, Kui Yu, Guangyi Lv, Dacao Zhang

TL;DR: 这篇综述论文系统回顾了大语言模型（LLMs）的鲁棒性问题，分类讨论了对抗性鲁棒性、分布外鲁棒性，并总结了相关评估方法，同时提出了未来研究方向。

Details

Motivation: 随着LLMs的广泛应用，其鲁棒性问题日益突出。为了确保模型在意外场景（如对抗提示、分布外数据）下仍能生成正确稳定的内容，本文旨在提供全面的术语和方法总结，促进相关研究。

Result: 论文梳理了LLM鲁棒性的核心问题和方法，为社区提供了全面的参考框架和资源链接。

Insight: 1. LLM鲁棒性研究需关注对抗性攻击和分布外泛化能力；2. 评估工具和数据集是推动研究的关键；3. 未来需结合多模态和实际应用场景深化研究。

Abstract: Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With the rapid development and wild-range applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent contents, but also ensure the correctness and stability of generated content when dealing with unexpeted application scenarios (e.g., toxic prompts, limited noise domain data, outof-distribution (OOD) applications, etc). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbated inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc; 2) OOD Robustness: dealing with the unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.

[87] Manifesto from Dagstuhl Perspectives Workshop 24352 – Conversational Agents: A Framework for Evaluation (CAFE) cs.CL | cs.HC | cs.IRPDF

Christine Bauer, Li Chen, Nicola Ferro, Norbert Fuhr, Avishek Anand

TL;DR: 这篇论文提出了一个用于评估对话信息访问（CONIAC）系统的框架CAFE，包含六个主要组成部分，旨在为对话代理的评估提供系统化的方法。

Details

Motivation: 现有的对话代理评估方法缺乏标准化和系统性，需要一个全面框架来统一评估标准，提升对话信息访问系统的性能和用户体验。

Result: CAFE框架成功定义了对话代理评估的六个关键组成部分，为未来的研究和实践提供了标准化工具。

Insight: 对话代理的评估需要多维度的综合方法，CAFE框架的提出填补了这一领域的空白，强调了用户任务和系统目标的紧密结合。

Abstract: During the workshop, we deeply discussed what CONversational Information ACcess (CONIAC) is and its unique features, proposing a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system’s stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.

[88] Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks cs.CL | cs.AIPDF

Tzu-Ling Lin, Wei-Chih Chen, Teng-Fang Hsiao, Hou-I Liu, Ya-Hsin Yeh

TL;DR: 这篇论文研究了大型语言模型（LLMs）作为自动审稿人时的对抗攻击脆弱性，揭示了文本篡改如何扭曲LLM的评估结果，并提出了一些缓解策略。

Details

Motivation: 学术评审的质量至关重要，但论文提交量的增加给审稿人带来了巨大负担。LLMs有望辅助这一过程，但其对对抗攻击的脆弱性引发了可靠性担忧。

Result: 研究显示，LLMs在面对对抗攻击时表现出显著脆弱性，文本篡改会显著扭曲其评估结果。

Insight: 论文强调了解决对抗性风险的重要性，以确保AI技术能够强化而非损害学术交流的完整性。

Abstract: Peer review is essential for maintaining academic quality, but the increasing volume of submissions places a significant burden on reviewers. Large language models (LLMs) offer potential assistance in this process, yet their susceptibility to textual adversarial attacks raises reliability concerns. This paper investigates the robustness of LLMs used as automated reviewers in the presence of such attacks. We focus on three key questions: (1) The effectiveness of LLMs in generating reviews compared to human reviewers. (2) The impact of adversarial attacks on the reliability of LLM-generated reviews. (3) Challenges and potential mitigation strategies for LLM-based review. Our evaluation reveals significant vulnerabilities, as text manipulations can distort LLM assessments. We offer a comprehensive evaluation of LLM performance in automated peer reviewing and analyze its robustness against adversarial attacks. Our findings emphasize the importance of addressing adversarial risks to ensure AI strengthens, rather than compromises, the integrity of scholarly communication.

[89] KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations cs.CL | cs.AIPDF

Junyu Liu, Kaiqi Yan, Tianyang Wang, Qian Niu, Momoko Nagai-Tanima

TL;DR: KokushiMD-10 是一个多模态基准测试，涵盖日本十类国家医疗执照考试，用于评估大语言模型（LLMs）在医疗领域的表现。它包含11588个真实考题和专家标注的解读，支持文本和视觉推理评估。

Details

Motivation: 现有基准测试多为英语、基于文本且集中于医学领域，难以全面评估医疗知识和多模态推理。KokushiMD-10通过多语言、多模态的考试题目填补了这一空白。

Result: 尽管模型表现有希望，但尚无模型在所有领域达到通过标准，凸显了医疗AI的挑战。KokushiMD-10为多语言和多模态医疗任务提供了评估基础。

Insight: KokushiMD-10的推出强调了多模态推理和多语言能力在医疗AI中的重要性，为未来模型优化提供了方向。

Abstract: Recent advances in large language models (LLMs) have demonstrated notable performance in medical licensing exams. However, comprehensive evaluation of LLMs across various healthcare roles, particularly in high-stakes clinical scenarios, remains a challenge. Existing benchmarks are typically text-based, English-centric, and focus primarily on medicines, which limits their ability to assess broader healthcare knowledge and multimodal reasoning. To address these gaps, we introduce KokushiMD-10, the first multimodal benchmark constructed from ten Japanese national healthcare licensing exams. This benchmark spans multiple fields, including Medicine, Dentistry, Nursing, Pharmacy, and allied health professions. It contains over 11588 real exam questions, incorporating clinical images and expert-annotated rationales to evaluate both textual and visual reasoning. We benchmark over 30 state-of-the-art LLMs, including GPT-4o, Claude 3.5, and Gemini, across both text and image-based settings. Despite promising results, no model consistently meets passing thresholds across domains, highlighting the ongoing challenges in medical AI. KokushiMD-10 provides a comprehensive and linguistically grounded resource for evaluating and advancing reasoning-centric medical AI across multilingual and multimodal clinical tasks.

[90] ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research cs.CL | cs.AI | cs.IRPDF

Junyong Lin, Lu Dai, Ruiqian Han, Yijie Sui, Ruilin Wang

TL;DR: ScIRGen 是一个科学问答和检索数据集生成框架，旨在解决现有数据集无法满足真实科研需求的问题。通过学术论文增强数据集表示，结合认知分类学生成高质量问题，并利用LLM的困惑度筛选答案，最终生成了一个大规模的、真实的科学检索增强生成（RAG）数据集ScIRGen-Geo。实验表明，现有方法在处理复杂问题时仍有不足。

Details

Motivation: 现有的科学检索和问答数据集通常处理简单问题，与真实科研中的复杂信息需求不符。科研人员需要更贴近实际研究任务的数据集来支持他们的工作。

Result: 生成了61k的科学QA数据集ScIRGen-Geo。基准测试显示，现有方法在处理复杂问题时仍有困难。

Insight: 1. 真实的科研问题通常复杂且隐含在任务中，而现有数据集多为显式简单问题。
2. 结合认知分类学和LLM可以有效提升合成问题的质量。
3. 困惑度可以作为筛选合成答案的有效指标。

Abstract: Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA & retrieval that more accurately reflects the information needs of professional science researchers, and uses it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework by employing cognitive taxonomy to ensure the quality of synthesized questions. We also design a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgment of answers’ validity. Collectively, these methodologies culminated in the creation of the 61k QA dataset, ScIRGen-Geo. We benchmarked representative methods on the ScIRGen-Geo dataset for their question-answering and retrieval capabilities, finding out that current methods still suffer from reasoning from complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.

[91] Stronger Language Models Produce More Human-Like Errors cs.CL | cs.AIPDF

Andrew Keenan Richardson, Ryan Othniel Kearns, Sean Moss, Vincent Wang-Mascianica, Philipp Koralus

TL;DR: 论文研究发现，随着语言模型能力的提升，其错误的模式越来越接近人类推理中的典型错误，挑战了模型规模扩大自然带来规范理性的观点。

Details

Motivation: 探究语言模型在能力提升过程中是否逐渐趋近人类推理模式，尤其是错误模式。

Result: 模型能力越强（以Chatbot Arena分数衡量），其错误答案中符合ETR预测的人类谬误比例越高（相关系数ρ=0.360，p=0.0265）。

Insight: 语言模型的能力提升并非带来绝对的规范性理性，而是更接近人类认知，包括典型的偏见和局限性。

Abstract: Do language models converge toward human-like reasoning patterns as they improve? We provide surprising evidence that while overall reasoning capabilities increase with model sophistication, the nature of errors increasingly mirrors predictable human reasoning fallacies: a previously unobserved inverse scaling phenomenon. To investigate this question, we apply the Erotetic Theory of Reasoning (ETR), a formal cognitive framework with empirical support for predicting human reasoning outcomes. Using the open-source package PyETR, we generate logical reasoning problems where humans predictably err, evaluating responses from 38 language models across 383 reasoning tasks. Our analysis indicates that as models advance in general capability (as measured by Chatbot Arena scores), the proportion of their incorrect answers that align with ETR-predicted human fallacies tends to increase ($\rho = 0.360, p = 0.0265$). Notably, as we observe no correlation between model sophistication and logical correctness on these tasks, this shift in error patterns toward human-likeness occurs independently of error rate. These findings challenge the prevailing view that scaling language models naturally obtains normative rationality, suggesting instead a convergence toward human-like cognition inclusive of our characteristic biases and limitations, as we further confirm by demonstrating order-effects in language model reasoning.

[92] Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK cs.CL | cs.AIPDF

Carlos Garcia-Fernandez, Luis Felipe, Monique Shotande, Muntasir Zitu, Aakash Tripathi

TL;DR: CHECK是一个持续学习框架，通过结合结构化临床数据库和信息论分类器，显著减少LLM在医疗领域的幻觉，效果显著，达到临床安全阈值。

Details

Motivation: LLM在医疗领域应用前景广阔，但幻觉问题阻碍其临床使用，亟需解决。

Result: 将LLama3.3-70B-Instruct的幻觉率从31%降至0.3%，USMLE通过率提升至92.1%。

Insight: CHECK为高风险领域中的LLM安全部署提供了可扩展方案，尤其在医疗领域潜力巨大。

Abstract: Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases with a classifier grounded in information theory to detect both factual and reasoning-based hallucinations. Evaluated on 1500 questions from 100 pivotal clinical trials, CHECK reduced LLama3.3-70B-Instruct hallucination rates from 31% to 0.3% - making an open source model state of the art. Its classifier generalized across medical benchmarks, achieving AUCs of 0.95-0.96, including on the MedQA (USMLE) benchmark and HealthBench realistic multi-turn medical questioning. By leveraging hallucination probabilities to guide GPT-4o’s refinement and judiciously escalate compute, CHECK boosted its USMLE passing rate by 5 percentage points, achieving a state-of-the-art 92.1%. By suppressing hallucinations below accepted clinical error thresholds, CHECK offers a scalable foundation for safe LLM deployment in medicine and other high-stakes domains.

[93] A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data cs.CL | cs.AI | cs.SD | eess.ASPDF

Cheng Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng

TL;DR: 论文提出了一种自改进框架，通过合成TTS数据增强ASR性能，仅需未标注数据即可显著提升模型表现。

Details

Motivation: 在低资源或特定领域场景中，标注数据稀缺，传统ASR改进方法受限。利用未标注数据和合成数据提升性能是主要动机。

Result: 在台湾普通话上验证，Whisper-large-v2改进为Twister，错误率显著降低。

Insight: 合成数据可弥补标注不足，闭环自改进为低资源ASR提供了新途径。

Abstract: We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrated the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from the AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. Results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provides a practical pathway for improving ASR performance in low-resource or domain-specific settings.

[94] Large Language Models and Emergence: A Complex Systems Perspective cs.CL | cs.AI | cs.LG | cs.NEPDF

David C. Krakauer, John W. Krakauer, Melanie Mitchell

TL;DR: 本文探讨了大型语言模型（LLM）是否展现出涌现能力，并从复杂系统角度分析了其是否具备涌现智能。

Details

Motivation: 涌现现象是复杂科学中的核心概念，研究LLM是否具备涌现能力及智能有助于理解其潜在机制与局限性。

Result: 文章指出LLM确实展现出某些涌现能力，但其智能属性尚需进一步验证。

Insight: 理解LLM的涌现特性有助于揭示其背后的复杂机制，并为未来研究提供新方向。

Abstract: Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with lower-dimensional effective variables and theories. This is captured by the idea “more is different”. Intelligence is a consummate emergent property manifesting increasingly efficient – cheaper and faster – uses of emergent capabilities to solve problems. This is captured by the idea “less is more”. In this paper, we first examine claims that Large Language Models exhibit emergent capabilities, reviewing several approaches to quantifying emergence, and secondly ask whether LLMs possess emergent intelligence.

[95] Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models cs.CLPDF

Chong Shao, Douglas Snyder, Chiran Li, Bowen Gu, Kerry Ngan

TL;DR: 该论文研究了利用开源和专有的大型语言模型（LLMs）从电子健康记录（EHR）中提取药物信息并分类其状态，重点是无需人工标注的扩展性。GPT-4在零样本设置下表现最佳，开源模型如Llama-3.1-70B-Instruct也表现出色。

Details

Motivation: 电子健康记录中的药物停药信息对患者安全至关重要，但由于信息常隐藏在非结构化文本中，提取困难。研究希望通过LLMs实现高效、可扩展的药物信息提取。

Result: GPT-4在零样本下表现最佳（F1分数：药物提取94.0%，停药分类78.1%），开源模型如Llama-3.1-70B-Instruct在部分任务中表现接近。few-shot学习普遍提升性能，但CoT效果不一致。

Insight: 1）通用领域LLMs（如GPT-4）优于医学专用模型；2）开源模型可替代专有系统；3）few-shot可进一步提升LLM能力，而CoT的改进有限。

Abstract: Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered by information being buried in unstructured notes. This study aims to evaluate the capabilities of advanced open-sourced and proprietary large language models (LLMs) in extracting medications and classifying their medication status from EHR notes, focusing on their scalability on medication information extraction without human annotation. We collected three EHR datasets from diverse sources to build the evaluation benchmark. We evaluated 12 advanced LLMs and explored multiple LLM prompting strategies. Performance on medication extraction, medication status classification, and their joint task (extraction then classification) was systematically compared across all experiments. We found that LLMs showed promising performance on the medication extraction and discontinuation classification from EHR notes. GPT-4o consistently achieved the highest average F1 scores in all tasks under zero-shot setting - 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely, Llama-3.1-70B-Instruct achieved the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs demonstrated lower performance compared to advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains. LLMs demonstrate strong potential for medication extraction and discontinuation identification on EHR notes, with open-sourced models offering scalable alternatives to proprietary systems and few-shot can further improve LLMs’ capability.

[96] RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? cs.CL | cs.AIPDF

Santiago Góngora, Ignacio Sastre, Santiago Robaina, Ignacio Remersaro, Luis Chiruzzo

TL;DR: 该论文探讨了RETUYT-INCO团队在BEA 2025共享任务中使用轻量级模型（参数少于1B）的竞争性表现，证明了在计算资源有限的情况下，小模型仍能取得与资源丰富的团队相近的结果。

Details

Motivation: 研究动机是为了展示在计算资源受限的环境（如全球南方的研究机构）中，使用轻量级模型是否仍能达到与资源丰富的团队相近的性能。

Result: 尽管与其他优胜团队的精确F1分数存在6.46至13.13的差距，但轻量级模型仍表现出竞争力，尤其是在低预算或无GPU环境下。

Insight: 轻量级模型在资源受限的条件下具有实际应用价值，证明了计算资源的有限性并非模型性能的决定性因素。

Abstract: In this paper, we present the RETUYT-INCO participation at the BEA 2025 shared task. Our participation was characterized by the decision of using relatively small models, with fewer than 1B parameters. This self-imposed restriction tries to represent the conditions in which many research labs or institutions are in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive self-imposed setting, our models managed to stay competitive with the rest of teams that participated in the shared task. According to the $exact\ F_1$ scores published by the organizers, the performance gaps between our models and the winners were as follows: $6.46$ in Track 1; $10.24$ in Track 2; $7.85$ in Track 3; $9.56$ in Track 4; and $13.13$ in Track 5. Considering that the minimum difference with a winner team is $6.46$ points – and the maximum difference is $13.13$ – according to the $exact\ F_1$ score, we find that models with a size smaller than 1B parameters are competitive for these tasks, all of which can be run on computers with a low-budget GPU or even without a GPU.

[97] No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning cs.CL | cs.AIPDF

Kushagra Dixit, Abhishek Rajgaria, Harshavardhan Kalalbandi, Dan Roth, Vivek Gupta

TL;DR: 本文研究了大型语言模型在时序表推理任务中的提示技术，发现不同的提示方法在不同场景下表现不一，没有普遍最优的方法。为此，作者提出了自适应提示框架SEAR，显著提升了性能。

Details

Motivation: 时序表推理是大型语言模型的重要挑战，但目前缺乏对各种提示技术的系统性研究，且现有方法在不同表结构和上下文中的表现差异较大，亟需一种适应性强的方法。

Result: SEAR在所有表类型中均优于基线方法，且表结构重构能进一步提升模型推理能力。

Insight: 提示技术的有效性受表结构、实体类型和问题复杂度等因素影响，自适应方法是解决此类问题的关键。

Abstract: Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective prompting techniques to extract relevant insights. Despite existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, the performance of these models varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting technique across diverse table types to determine optimal approaches for different scenarios. We find that performance varies based on entity type, table structure, requirement of additional context and question complexity, with NO single method consistently outperforming others. To mitigate these challenges, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts based on context characteristics and integrates a structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to other baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances model’s reasoning.

[98] Learning a Continue-Thinking Token for Enhanced Test-Time Scaling cs.CL | cs.LGPDF

Liran Ringel, Elad Tolochinsky, Yaniv Romano

TL;DR: 本文提出一种通过训练专用“继续思考”标记来提升测试时扩展的方法，相比固定标记方法，在数学基准测试中取得了更显著的性能提升。

Details

Motivation: 测试时扩展可通过增加推理步骤提升语言模型性能，但现有方法采用固定标记（如“Wait”）限制性能提升，因此探索学习专用标记的潜力。

Result: 在GSM8K等数学基准测试上，学习标记方法比基线模型和固定标记方法分别提升了4.2%和2.9%的绝对准确率。

Insight: 专用学习标记可灵活引导模型扩展推理，其优化嵌入比固定标记更有效，为测试时扩展提供了新思路。

Abstract: Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing ““ with “Wait”) can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned “<|continue-thinking|>” token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., “Wait”) for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model’s accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.

[99] Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning cs.CL | cs.AIPDF

Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis

TL;DR: 该论文首次系统性地研究了课程学习在语言模型预训练中的应用，展示了其在提升训练效率和模型性能方面的潜力。

Details

Motivation: 尽管课程学习在多个机器学习领域中表现出提升训练效率和泛化能力的潜力，但其在语言模型预训练中的应用尚不充分，因此作者进行了系统性研究。

Result: 课程学习在训练早期和中期显著加快了收敛速度，当作为预热策略使用时，能带来最高3.5%的性能提升。

Insight: 数据排序对大规模预训练至关重要，压缩比、词汇多样性和可读性是有效的难度信号，课程学习能够显著提升数据效率和模型性能。

Abstract: Curriculum learning has shown promise in improving training efficiency and generalization in various machine learning domains, yet its potential in pretraining language models remains underexplored, prompting our work as the first systematic investigation in this area. We experimented with different settings, including vanilla curriculum learning, pacing-based sampling, and interleaved curricula-guided by six difficulty metrics spanning linguistic and information-theoretic perspectives. We train models under these settings and evaluate their performance on eight diverse benchmarks. Our experiments reveal that curriculum learning consistently improves convergence in early and mid-training phases, and can yield lasting gains when used as a warmup strategy with up to $3.5%$ improvement. Notably, we identify compression ratio, lexical diversity, and readability as effective difficulty signals across settings. Our findings highlight the importance of data ordering in large-scale pretraining and provide actionable insights for scalable, data-efficient model development under realistic training scenarios.

[100] Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models cs.CLPDF

Peilin Wu, Jinho D. Choi

TL;DR: 论文提出了一种基于文本的说话人分割（SD）新方法，通过句子级说话人变化检测，避免了音频质量与说话人相似性的问题。

Details

Motivation: 传统音频SD系统受限于音频质量和说话人相似性，本文探索了仅用文本实现SD的可能性，并验证其效果。

Result: 基于多样化对话数据集的实验表明，文本SD（尤其是MPM）在短对话中表现优于音频SD系统。

Insight: 语言特征和语义理解对SD系统至关重要，未来可探索多模态和语义特征结合的路径。

Abstract: We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and speaker similarity, our approach utilizes the dialogue transcript alone. Two models are developed: the Single Prediction Model (SPM) and the Multiple Prediction Model (MPM), both of which demonstrate significant improvements in identifying speaker changes, particularly in short conversations. Our findings, based on a curated dataset encompassing diverse conversational scenarios, reveal that the text-based SD approach, especially the MPM, performs competitively against state-of-the-art audio-based SD systems, with superior performance in short conversational contexts. This paper not only showcases the potential of leveraging linguistic features for SD but also highlights the importance of integrating semantic understanding into SD systems, opening avenues for future research in multimodal and semantic feature-based diarization.

[101] The Biased Samaritan: LLM biases in Perceived Kindness cs.CL | cs.CYPDF

Jack H Fagan, Ruhaan Juyaal, Amy Yue-Ming Yu, Siya Pun

TL;DR: 该论文提出了一种评估大型语言模型（LLM）人口统计偏差的新方法，聚焦于模型对不同性别、种族和年龄群体的感知善意。研究发现，模型倾向于将基准人口视为白人青年或中年男性，而非基准群体表现更愿意帮助他人。

Details

Motivation: 大型语言模型在多个领域广泛应用，但其偏差问题尚未完全解决。研究旨在量化评估不同LLM对不同人口统计特征的偏见，为开发者和用户提供客观参考。

Result: 研究发现，模型将基准人口视为白人青年或中年男性，而非基准群体表现更愿意帮助他人。方法成功区分了这两种常被混淆的偏差。

Insight: 揭示了LLM对人口统计特征的隐含偏见，强调了在模型开发和应用中考虑这些偏差的重要性。研究为未来模型的偏差校正提供了依据。

Abstract: While Large Language Models (LLMs) have become ubiquitous in many fields, understanding and mitigating LLM biases is an ongoing issue. This paper provides a novel method for evaluating the demographic biases of various generative AI models. By prompting models to assess a moral patient’s willingness to intervene constructively, we aim to quantitatively evaluate different LLMs’ biases towards various genders, races, and ages. Our work differs from existing work by aiming to determine the baseline demographic identities for various commercial models and the relationship between the baseline and other demographics. We strive to understand if these biases are positive, neutral, or negative, and the strength of these biases. This paper can contribute to the objective assessment of bias in Large Language Models and give the user or developer the power to account for these biases in LLM output or in training future LLMs. Our analysis suggested two key findings: that models view the baseline demographic as a white middle-aged or young adult male; however, a general trend across models suggested that non-baseline demographics are more willing to help than the baseline. These methodologies allowed us to distinguish these two biases that are often tangled together.

[102] Curriculum-Guided Layer Scaling for Language Model Pretraining cs.CLPDF

Karanpartap Singh, Neil Band, Ehsan Adeli

TL;DR: 本文提出了一种名为Curriculum-Guided Layer Scaling (CGLS)的框架，通过在训练过程中逐步增加模型层数（渐进式堆叠）并同步提升数据难度，以提高语言模型预训练的计算效率。实验表明，CGLS在100M和1.2B参数规模下显著提升了模型在问答和下游任务中的性能。

Details

Motivation: 受人类认知发展的启发，作者希望通过同步增加数据难度和模型规模来提高预训练效率，从而解决大规模语言模型预训练的高成本问题。

Result: 在100M参数规模下，CGLS在PIQA和ARC问答基准上优于基线方法；在1.2B规模下，CGLS的零样本性能在多下游任务中表现更优。

Insight: 渐进式堆叠与数据课程学习的同步结合是一种简单但有效的策略，可为知识密集型和推理任务提供更好的泛化能力。

Abstract: As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.

[103] Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards cs.CL | cs.AIPDF

Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate

TL;DR: Agent-RLVR 是一个通过在复杂的代理环境中引入指导和环境奖励来改进强化学习训练效果的框架，显著提升了软件工程任务中语言模型的性能。

Details

Motivation: 在复杂的多步代理环境中，传统强化学习方法（如 RLVR）因奖励稀疏而效果不佳，需要一种新的方法来有效训练模型。

Result: 在 SWE-Bench Verified 任务上，模型的 pass@1 性能从 9.4% 提升到 22.4%，进一步通过测试时奖励模型训练提升至 27.8%。

Insight: 代理指导和环境探索的结合为复杂环境中强化学习的有效性提供了新思路，有望扩展到更广泛的现实任务中。

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent’s errors and environmental interactions, emulate a teacher’s guidance, enabling the agent to navigate difficult solution spaces and promotes active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, shown by further boosting pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.

[104] KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models cs.CL | cs.AIPDF

Taeeun Kim, Semin Jeong, Youngsook Song

TL;DR: 该研究提出了KoGEC系统，利用预训练翻译模型（NLLB）进行韩语语法错误纠正，性能优于GPT-4和HCX-3，并通过Chrome扩展实现用户可访问性。

Details

Motivation: 韩语语法错误纠正（GEC）缺乏高效且专门化的解决方案，现有大型语言模型（如GPT-4）在特定任务上可能不够专注。

Result: 微调后的KoGEC模型在韩语GEC任务中优于GPT-4和HCX-3，尤其在标点错误纠正上表现更均衡。

Insight: 小型任务特定模型在专业NLP任务中可超越大型通用模型；词汇扩展可能降低性能。

Abstract: This research introduces KoGEC, a Korean Grammatical Error Correction system using pre--trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4 and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an “LLM as judge” method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found it to decrease model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.

[105] Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards cs.CLPDF

Jaehoon Yun, Jiwoong Sohn, Jungwoo Park, Hyunjae Kim, Xiangru Tang

TL;DR: 论文提出了Med-PRM，一种通过检索增强生成逐步骤验证推理过程的医疗推理模型框架，显著提升了医疗问答和诊断任务的性能。

Details

Motivation: 现有的大语言模型在临床决策中难以定位和纠正推理过程中的具体错误，这在医疗领域尤为重要，因为准确的推理是诊断和治疗的关键。

Result: 在五个医疗QA基准和两个开放诊断任务上达到最优性能，将基础模型性能提升高达13.50%，并在小规模模型上首次实现MedQA超过80%的准确率。

Insight: Med-PRM证明了通过逐步验证推理步骤可以显著提升模型的医疗推理能力，尤其是在小规模模型上的成功应用，展示了其在资源受限环境中的潜力。

Abstract: Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, with improving the performance of base models by up to 13.50% using Med-PRM. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: https://med-prm.github.io/

[106] On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval cs.CLPDF

Seongbo Jang, Seonghyeon Lee, Dongha Lee, Hwanjo Yu

TL;DR: 该论文研究了多模态对话响应检索的集成方法效果，提出了基于两步法和端到端法的三种集成方法，并通过实验表明端到端方法在性能上媲美两步法，且参数共享策略能提升性能并减少参数量。

Details

Motivation: 多模态聊天机器人成为对话系统的研究热点，但如何有效集成多模态响应（如文本和图像）仍是一个挑战。本文旨在探索和比较不同集成方法的效果。

Result: 实验表明，端到端方法性能与两步法相当，参数共享策略不仅减少参数量，还能通过跨任务和模态知识迁移提升性能。

Insight: 端到端方法在多模态任务中具有潜力，参数共享是优化多模态学习的有效手段。

Abstract: Multimodal chatbots have become one of the major topics for dialogue systems in both research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves comparable performance without an intermediate step in the two-step approach. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance by transferring knowledge across the subtasks and the modalities.

[107] SceneGram: Conceptualizing and Describing Tangrams in Scene Context cs.CLPDF

Simeon Junker, Sina Zarrieß

TL;DR: 本文介绍了SceneGram数据集，分析了场景上下文对人类概念化抽象形状（如七巧板）的影响，并发现多模态LLMs在概念化丰富性上不如人类。

Details

Motivation: 探索场景上下文如何影响人类对抽象形状的概念化和命名，以及多模态大语言模型在此任务上的表现。

Result: 人类在概念化时表现出高度多样性和场景依赖性，而多模态LLMs的表现缺乏这种丰富性。

Insight: 场景上下文对概念化至关重要，多模态LLMs需要进一步优化以更好地模拟人类的认知多样性。

Abstract: Research on reference and naming suggests that humans can come up with very different ways of conceptualizing and referring to the same object, e.g. the same abstract tangram shape can be a “crab”, “sink” or “space ship”. Another common assumption in cognitive science is that scene context fundamentally shapes our visual perception of objects and conceptual expectations. This paper contributes SceneGram, a dataset of human references to tangram shapes placed in different scene contexts, allowing for systematic analyses of the effect of scene context on conceptualization. Based on this data, we analyze references to tangram shapes generated by multimodal LLMs, showing that these models do not account for the richness and variability of conceptualizations found in human references.

[108] LoRA-Gen: Specializing Large Language Model via Online LoRA Generation cs.CL | cs.AIPDF

Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge

TL;DR: LoRA-Gen提出了一种新框架，利用云端大模型生成LoRA参数，以提升边缘侧模型在特定任务上的性能和效率。

Details

Motivation: 现有大语言模型在领域特定任务（尤其是小型边缘侧模型）上的表现和效率仍有不足，需要一种更灵活的知识转移方法。

Result: 在TinyLLaMA-1.1B上实现了2.1倍加速，并在Gemma-2B上达到10.1倍的压缩比，性能优于传统LoRA微调。

Insight: 通过云端生成LoRA参数是一种高效的模型专门化方法，尤其适合资源受限的边缘设备。

Abstract: Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, which achieves competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.

[109] LLMs for Sentence Simplification: A Hybrid Multi-Agent prompting Approach cs.CLPDF

Pratibha Zunjare, Michael Hsiao

TL;DR: 提出了一种结合多智能体架构和高级提示的混合方法，用于句子简化任务，效果显著优于单智能体方法。

Details

Motivation: 解决复杂句子简化为逻辑连贯且语义保留的简单句的挑战，利用大语言模型提升任务效果。

Result: 实验结果显示，该方法在视频游戏设计应用中简化了70%的复杂句子，远超单智能体方法的48%。

Insight: 多智能体协作和高级提示的结合可显著提升句子简化任务的语义和逻辑保留能力。

Abstract: This paper addresses the challenge of transforming complex sentences into sequences of logical, simplified sentences while preserving semantic and logical integrity with the help of Large Language Models. We propose a hybrid approach that combines advanced prompting with multi-agent architectures to enhance the sentence simplification process. Experimental results show that our approach was able to successfully simplify 70% of the complex sentences written for video game design application. In comparison, a single-agent approach attained a 48% success rate on the same task.

[110] Configurable Preference Tuning with Rubric-Guided Synthetic Data cs.CL | cs.AIPDF

Víctor Gallego

TL;DR: CPT提出了一种动态调整语言模型行为的新框架，通过使用基于结构化评分准则合成的偏好数据，实现细粒度控制，无需重新训练即可调整输出。

Details

Motivation: 传统的人类反馈模型（如DPO）通常假设偏好是静态且单一的，限制了模型的适应性。CPT旨在通过可配置的偏好调整解决这一问题。

Result: 实验表明，CPT能够灵活调整模型行为，适应多样化偏好，同时保持高性能。相关代码和数据已开源。

Insight: 利用合成数据和评分准则可以高效实现模型行为的动态控制，为AI对齐提供了更灵活的解决方案。

Abstract: Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at https://github.com/vicgalle/configurable-preference-tuning

[111] The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference cs.CLPDF

Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí

TL;DR: 论文探讨了深度学习推理中混合精度矩阵乘法的优化方法，针对现代硬件架构设计了新的微内核和数据布局，显著提升了性能。

Details

Motivation: 随着深度学习对计算效率的需求增长，传统的高精度浮点运算（如FP64）被低精度格式（如FP16、BF16、整数8/16位）取代。硬件架构也随之演进，支持混合精度运算，但传统方法未能充分利用新硬件的能力。

Result: 实验表明，混合精度整数算术在三种代表性CPU架构上比浮点实现具有显著的性能优势。

Insight: 论文揭示了深度学习推理驱动的矩阵乘法优化进入新阶段，开启了混合精度运算的“寒武纪大爆发”时代。

Abstract: Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication gemm, a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response to this, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today’s specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization-driven by the demands of DL inference on heterogeneous architectures, marking what we term as the “Cambrian period” for matrix multiplication.

[112] DART: Distilling Autoregressive Reasoning to Silent Thought cs.CLPDF

Nan Jiang, Ziming Wu, De-Chuan Zhan, Fuming Lai, Shaobing Lian

TL;DR: DART提出了一个自蒸馏框架，将自回归的CoT推理转化为非自回归的Silent Thought，显著提升了计算效率。

Details

Motivation: 现有的自回归推理方法（如Chain-of-Thought，CoT）会导致显著的计算开销，限制了在延迟敏感应用中的部署。

Result: 实验表明，DART在保持推理性能的同时，显著提升了效率，适用于延迟敏感的应用场景。

Insight: DART通过自蒸馏和非自回归技术，为高效推理提供了一种可行的替代方案，展示了在保持性能的同时降低计算开销的潜力。

Abstract: Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose \textbf{DART} (\textbf{D}istilling \textbf{A}utoregressive \textbf{R}easoning to Silent \textbf{T}hought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART achieves comparable reasoning performance to existing baselines while offering significant efficiency gains, serving as a feasible alternative for efficient reasoning.

[113] Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks? cs.CLPDF

Simeon Junker, Manar Ali, Larissa Koch, Sina Zarrieß, Hendrik Buschmeier

TL;DR: 论文研究了多模态大语言模型在简单参考解析任务中的语言能力，发现尽管任务对人类来说很简单，但对现有模型仍构成挑战，尤其是在语用能力方面。

Details

Motivation: 探讨多模态大语言模型在简单抽象视觉刺激（如色块和颜色网格）下的语用能力，验证其是否具备与人类相似的上下文依赖解释能力。

Result: 研究发现，即使任务对人类来说非常简单，多模态大语言模型在语用能力方面仍表现出显著不足。

Insight: 简单的参考解析任务可以作为评估多模态模型语用能力的有效指标，未来研究需进一步提升模型的上下文理解能力。

Abstract: We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today’s language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.

[114] Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback cs.CLPDF

Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi

TL;DR: 这篇论文发现，即使在大语言模型（LLMs）获得近乎完美的外部反馈的情况下，它们仍难以完全整合这些反馈并改正错误，作者将这种现象称为“反馈摩擦”，并探索了可能的缓解方法及其原因。

Details

Motivation: 尽管LLMs被认为可以通过外部反馈改进其回答，但其是否能够彻底整合这些反馈尚不明确。论文旨在通过系统性实验验证LLMs的反馈整合能力。

Result: 在近乎理想的条件下，LLMs仍表现出对反馈的抵抗性，缓解策略虽有改进但未能达到目标性能。

Insight: 反馈摩擦的存在表明LLMs的反馈整合能力存在深层限制，未来的自我改进研究需解决这一核心问题。

Abstract: Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs’ ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.

[115] Improving Large Language Model Safety with Contrastive Representation Learning cs.CL | cs.AI | cs.LGPDF

Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin

TL;DR: 论文提出了一种通过对比表示学习（CRL）提升大语言模型（LLM）安全性的防御框架，利用三元组损失和对抗性硬负采样分离良性及有害表示，实验表明其优于现有方法，并在不损害标准性能的情况下提升了鲁棒性。

Details

Motivation: 大语言模型（LLM）具有强大能力但易受对抗攻击，现有防御方法泛化能力不足，因此需要一种更高效的防御策略。

Result: 实验证明该方法在多个模型上优于现有防御方法，提升了对抗输入和嵌入空间攻击的鲁棒性。

Insight: 对比表示学习是提升模型安全性的有效途径，通过特征分离可同时保持模型的正常性能。

Abstract: Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense

[116] code_transformed: The Influence of Large Language Models on Code cs.CL | cs.AI | cs.LG | cs.SEPDF

Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi

TL;DR: 该论文首次研究了大型语言模型（LLMs）对代码风格的影响，通过对19,000多个GitHub仓库的分析，发现LLMs对命名规范、复杂性和可维护性等代码风格产生了显著影响。

Details

Motivation: 随着大型语言模型的快速发展，代码生成能力正在改变编程实践。论文旨在探讨LLMs是否改变了代码风格，并量化这种转变的特征。

Result: LLMs的生成代码风格（如命名规范）在真实项目中逐渐普及，但其具体影响比例难以精确估计。

Insight: LLMs对代码风格的影响显著且可测量，未来可能进一步改变开发者的编程习惯。

Abstract: Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.

cs.RO [Back]

[117] Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving cs.RO | cs.CVPDF

Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull

TL;DR: Poutine是一个专为端到端自动驾驶设计的30亿参数视觉语言模型（VLM），通过两阶段训练（自监督的视觉-语言-轨迹预训练和强化学习微调），在长尾驾驶场景中表现出色，赢得了2025 Waymo挑战赛的第一名。

Details

Motivation: 解决长尾驾驶场景中的自动驾驶问题，提升端到端自动驾驶的鲁棒性和泛化能力。

Result: Poutine在Waymo测试集上达到RFS 7.99，显著领先其他方法，获得2025 Waymo挑战赛第一名。

Insight: 视觉-语言-轨迹预训练结合轻量级强化学习微调是实现鲁棒自动驾驶的有效途径。

Abstract: We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios. Poutine is trained in two stages. To obtain strong base driving capabilities, we train Poutine-Base in a self-supervised vision-language-trajectory (VLT) next-token prediction fashion on 83 hours of CoVLA nominal driving and 11 hours of Waymo long-tail driving. Accompanying language annotations are auto-generated with a 72B-parameter VLM. Poutine is obtained by fine-tuning Poutine-Base with Group Relative Policy Optimization (GRPO) using less than 500 preference-labeled frames from the Waymo validation set. We show that both VLT pretraining and RL fine-tuning are critical to attain strong driving performance in the long-tail. Poutine-Base achieves a rater-feedback score (RFS) of 8.12 on the validation set, nearly matching Waymo’s expert ground-truth RFS. The final Poutine model achieves an RFS of 7.99 on the official Waymo test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. These results highlight the promise of scalable VLT pre-training and lightweight RL fine-tuning to enable robust and generalizable autonomy.

[118] Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation cs.RO | cs.AI | cs.CVPDF

Shizhe Chen, Ricardo Garcia, Paul Pacaud, Cordelia Schmid

TL;DR: 论文提出Gondola模型，结合多视角图像和历史计划，生成包含文本和分割掩码的下一个动作计划，显著提升了机器人操作任务中的泛化能力。

Details

Motivation: 机器人操作在面对未见过的物体、环境和多样化语言指令时，泛化能力不足。现有的基于LLM的方法在多视角输入和精确对象定位上表现不佳。

Result: 在GemBench数据集的四个泛化级别（新放置、刚性物体、铰接物体和长时程任务）上，Gondola均优于现有基于LLM的方法。

Insight: 多视角输入和对象分割掩码的引入显著提升了视觉语言规划模型的泛化能力，特别是在复杂任务中表现突出。

Abstract: Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large language models (LLMs) for planning and action execution. While promising, these methods often fall short in generating grounded plans in visual environments. Although efforts have been made to perform visual instructional tuning on LLMs for robotic manipulation, existing methods are typically constrained by single-view image input and struggle with precise object grounding. In this work, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. To support the training of Gondola, we construct three types of datasets using the RLBench simulator, namely robot grounded planning, multi-view referring expression and pseudo long-horizon task datasets. Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset, including novel placements, rigid objects, articulated objects and long-horizon tasks.

quant-ph [Back]

[119] HQFNN: A Compact Quantum-Fuzzy Neural Network for Accurate Image Classification quant-ph | cs.CV | cs.LGPDF

Jianhong Yao, Yangming Guo

TL;DR: 该论文提出了一种新型的量子-模糊神经网络HQFNN，用于图像分类，结合了模糊推理的透明度和量子电路的高效性，显著减少了参数量并具有抗噪声能力。

Details

Motivation: 传统深度学习在噪声处理和模型解释性上表现不足，模糊推理和量子计算提供了潜在的改进方向，因此结合两者的优势提出了HQFNN。

Result: 在标准图像基准测试中，HQFNN性能优于经典、模糊增强和纯量子基线模型，抗噪声能力强，且电路深度随输入维度呈亚线性增长。

Insight: HQFNN为未来量子原生模糊学习框架提供了模板，是一种紧凑、可解释且抗噪声的视觉骨干网络替代方案。

Abstract: Deep learning vision systems excel at pattern recognition yet falter when inputs are noisy or the model must explain its own confidence. Fuzzy inference, with its graded memberships and rule transparency, offers a remedy, while parameterized quantum circuits can embed features in richly entangled Hilbert spaces with striking parameter efficiency. Bridging these ideas, this study introduces a innovative Highly Quantized Fuzzy Neural Network (HQFNN) that realises the entire fuzzy pipeline inside a shallow quantum circuit and couples the resulting quantum signal to a lightweight CNN feature extractor. Each image feature is first mapped to a single qubit membership state through repeated angle reuploading. Then a compact rule layer refines these amplitudes, and a clustered CNOT defuzzifier collapses them into one crisp value that is fused with classical features before classification. Evaluated on standard image benchmarks, HQFNN consistently surpasses classical, fuzzy enhanced and quantum only baselines while using several orders of magnitude fewer trainable weights, and its accuracy degrades only marginally under simulated depolarizing and amplitude damping noise, evidence of intrinsic robustness. Gate count analysis further shows that circuit depth grows sublinearly with input dimension, confirming the model’s practicality for larger images. These results position the model as a compact, interpretable and noise tolerant alternative to conventional vision backbones and provide a template for future quantum native fuzzy learning frameworks.

cs.AI [Back]

[120] A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects cs.AI | cs.CL | I.2.7PDF

Guanglin Niu, Bo Li, Yangguang Lin

TL;DR: 这份调查论文系统地总结了任务导向的知识图谱推理（KGR）的现状、应用和前景，强调其在认知智能系统中的重要性，并提出了未来研究方向。

Details

Motivation: 尽管已有许多关于知识图谱推理（KGR）的研究，但缺乏一个从任务中心视角全面总结KGR的系统性综述，尤其是涵盖下游应用和更具挑战性的推理范式。

Result: 论文总结了KGR的关键趋势，并指出未来发展方向，如结合LLMs和多模态信息。

Insight: KGR的研究不仅需要关注推理技术本身，还需与具体下游应用结合，同时探索新技术如LLMs在多模态和少样本推理中的潜力。

Abstract: Knowledge graphs (KGs) have emerged as a powerful paradigm for structuring and leveraging diverse real-world knowledge, which serve as a fundamental technology for enabling cognitive intelligence systems with advanced understanding and reasoning capabilities. Knowledge graph reasoning (KGR) aims to infer new knowledge based on existing facts in KGs, playing a crucial role in applications such as public security intelligence, intelligent healthcare, and financial risk assessment. From a task-centric perspective, existing KGR approaches can be broadly classified into static single-step KGR, static multi-step KGR, dynamic KGR, multi-modal KGR, few-shot KGR, and inductive KGR. While existing surveys have covered these six types of KGR tasks, a comprehensive review that systematically summarizes all KGR tasks particularly including downstream applications and more challenging reasoning paradigms remains lacking. In contrast to previous works, this survey provides a more comprehensive perspective on the research of KGR by categorizing approaches based on primary reasoning tasks, downstream application tasks, and potential challenging reasoning tasks. Besides, we explore advanced techniques, such as large language models (LLMs), and their impact on KGR. This work aims to highlight key research trends and outline promising future directions in the field of KGR.

[121] Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables cs.AI | cs.CLPDF

Yitong Zhou, Mingyue Cheng, Qingyang Mao, Yucong Luo, Qi Liu

TL;DR: 该论文提出了ChemTable，一个针对化学表格的多模态基准测试，用于评估大语言模型在表格识别与理解任务上的表现，发现现有模型在化学领域的推理任务上存在显著局限。

Details

Motivation: 现有基准测试未能充分覆盖化学表格的多模态和领域特定复杂性，限制了多模态大语言模型在化学科学理解中的应用。

Result: 模型在基础布局解析上表现尚可，但在描述性和推理性QA任务上显著低于人类水平，开源与闭源模型之间存在明显性能差距。

Insight: 化学领域的表格理解具有挑战性，ChemTable为科学推理的进步提供了一个严格的基准测试。

Abstract: Chemical tables encode complex experimental knowledge through symbolic expressions, structured variables, and embedded molecular graphics. Existing benchmarks largely overlook this multimodal and domain-specific complexity, limiting the ability of multimodal large language models to support scientific understanding in chemistry. In this work, we introduce ChemTable, a large-scale benchmark of real-world chemical tables curated from the experimental sections of literature. ChemTable includes expert-annotated cell polygons, logical layouts, and domain-specific labels, including reagents, catalysts, yields, and graphical components and supports two core tasks: (1) Table Recognition, covering structure parsing and content extraction; and (2) Table Understanding, encompassing both descriptive and reasoning-oriented question answering grounded in table structure and domain semantics. We evaluated a range of representative multimodal models, including both open-source and closed-source models, on ChemTable and reported a series of findings with practical and conceptual insights. Although models show reasonable performance on basic layout parsing, they exhibit substantial limitations on both descriptive and inferential QA tasks compared to human performance, and we observe significant performance gaps between open-source and closed-source models across multiple dimensions. These results underscore the challenges of chemistry-aware table understanding and position ChemTable as a rigorous and realistic benchmark for advancing scientific reasoning.

[122] Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning cs.AI | cs.CL | cs.HCPDF

Liying Wang, Ph. D., Daffodil Carrington, M. S., Daniil Filienko

TL;DR: 该研究探索了基于大语言模型（LLM）的对话代理为家庭护理人员提供心理支持的能力，结合了问题解决疗法（PST）、动机访谈（MI）和行为链分析（BCA）。实验发现，结合Few-Shot和检索增强生成（RAG）技术的模型表现最佳，提升了共情和治疗联盟。

Details

Motivation: 家庭护理人员面临心理健康挑战，传统支持资源有限。研究旨在验证LLM能否提供共情且个性化的心理支持。

Result: 模型在共情和治疗联盟方面表现优异，参与者认可其情感验证和策略建议的能力，但评估效率仍有改进空间。

Insight: LLM在心理支持领域潜力巨大，但需平衡深度评估与高效建议。Few-Shot和RAG是实现个性化支持的关键技术。

Abstract: Family caregivers often face substantial mental health challenges due to their multifaceted roles and limited resources. This study explored the potential of a large language model (LLM)-powered conversational agent to deliver evidence-based mental health support for caregivers, specifically Problem-Solving Therapy (PST) integrated with Motivational Interviewing (MI) and Behavioral Chain Analysis (BCA). A within-subject experiment was conducted with 28 caregivers interacting with four LLM configurations to evaluate empathy and therapeutic alliance. The best-performing models incorporated Few-Shot and Retrieval-Augmented Generation (RAG) prompting techniques, alongside clinician-curated examples. The models showed improved contextual understanding and personalized support, as reflected by qualitative responses and quantitative ratings on perceived empathy and therapeutic alliances. Participants valued the model’s ability to validate emotions, explore unexpressed feelings, and provide actionable strategies. However, balancing thorough assessment with efficient advice delivery remains a challenge. This work highlights the potential of LLMs in delivering empathetic and tailored support for family caregivers.

[123] RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning cs.AI | cs.CLPDF

Yu Wang, Shiwan Zhao, Ming Fan, Zhihu Wang, Yubo Zhang

TL;DR: RAG+ 是一个改进版的检索增强生成框架，通过显式融入应用感知推理，提升知识密集型任务中的大语言模型表现。

Details

Motivation: 现有的 RAG 框架通常忽视了知识应用的认知步骤，导致检索到的事实与任务特定推理之间存在鸿沟。

Result: 在数学、法律和医学领域的实验中，RAG+ 平均提升 3-5% 的性能，复杂场景下峰值提升达 7.5%，显著优于标准 RAG。

Insight: 通过将检索与可操作的应用结合，RAG+ 提供了一个更具认知基础的框架，推动大语言模型朝着更可解释和更强大的方向发展。

Abstract: The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.

[124] VLM@school – Evaluation of AI image understanding on German middle school knowledge cs.AI | cs.CL | cs.CVPDF

René Peinl, Vincent Tischler

TL;DR: 该论文提出了一个新颖的德语基准数据集，用于评估视觉语言模型（VLMs）在结合视觉推理与学科背景知识任务上的表现。数据集基于德国中学课程的真实内容，包含2000多个开放式问题，覆盖9个领域。评估结果显示，现有模型的整体准确率低于45%，尤其在音乐、数学和对抗性问题中表现不佳。

Details

Motivation: 现有的英语基准数据集通常依赖于人为制造的难题或脱离上下文的问题，而缺乏结合真实学科背景的多模态评测。作者希望通过基于德国中学课程的任务，更真实地评估VLMs的综合能力。

Result: 最佳模型的整体准确率低于45%，在音乐、数学和对抗性问题上表现较差，表明现有模型在真实多模态任务中的能力有限。

Insight: 中学级任务是评测VLMs的有效途径，尤其是在非英语上下文中。当前模型的成功在流行基准与真实多模态理解之间存在明显差距。

Abstract: This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarial crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.

[125] On the Performance of LLMs for Real Estate Appraisal cs.AI | cs.CL | cs.LGPDF

Margot Geerts, Manon Reusens, Bart Baesens, Seppe vanden Broucke, Jochen De Weerdt

TL;DR: 本文研究了大型语言模型（LLMs）在房地产估价中的表现，通过优化的上下文学习策略（ICL），LLMs能够生成具有竞争力的房价估计。研究发现LLMs在解释性和互动性上优于传统机器学习模型，但在空间推理和价格区间预测上存在局限性。

Details

Motivation: 房地产市场中存在严重的信息不对称问题，LLMs有望通过其强大的语言理解和生成能力，为房地产估价提供更加透明和可解释的解决方案。

Result: LLMs能够有效利用房屋特征（如面积和设施）生成有意义的估价，尽管在空间推理和价格区间预测上表现不佳。与传统模型相比，LLMs更易于解释和交互。

Insight: LLMs为房地产估价提供了新的可能性，尤其是在提高透明度和解释性方面，但需要进一步优化以解决其在空间推理和价格区间预测上的局限性。

Abstract: The real estate market is vital to global economies but suffers from significant information asymmetry. This study examines how Large Language Models (LLMs) can democratize access to real estate insights by generating competitive and interpretable house price estimates through optimized In-Context Learning (ICL) strategies. We systematically evaluate leading LLMs on diverse international housing datasets, comparing zero-shot, few-shot, market report-enhanced, and hybrid prompting techniques. Our results show that LLMs effectively leverage hedonic variables, such as property size and amenities, to produce meaningful estimates. While traditional machine learning models remain strong for pure predictive accuracy, LLMs offer a more accessible, interactive and interpretable alternative. Although self-explanations require cautious interpretation, we find that LLMs explain their predictions in agreement with state-of-the-art models, confirming their trustworthiness. Carefully selected in-context examples based on feature similarity and geographic proximity, significantly enhance LLM performance, yet LLMs struggle with overconfidence in price intervals and limited spatial reasoning. We offer practical guidance for structured prediction tasks through prompt optimization. Our findings highlight LLMs’ potential to improve transparency in real estate appraisal and provide actionable insights for stakeholders.

[126] Towards a Cascaded LLM Framework for Cost-effective Human-AI Decision-Making cs.AI | cs.CLPDF

Claudio Fanconi, Mihaela van der Schaar

TL;DR: 该论文提出了一种级联LLM框架，用于平衡预测正确性、推理成本和决策信心，通过分层任务委托和在线学习机制优化人机协作决策。

Details

Motivation: 解决人机协作决策中如何平衡预测准确性、成本和决策信心的挑战，提出一种自适应任务分配的框架。

Result: 在多个QA任务（如ARC和MedQA）上，该框架在多数情况下优于单模型基线，同时降低成本并提供可解释的决策弃权机制。

Insight: 级联策略能有效结合不同成本与能力的模型和人类专家，通过动态决策优化人机协作效率。

Abstract: Effective human-AI decision-making balances three key factors: the \textit{correctness} of predictions, the \textit{cost} of knowledge and reasoning complexity, and the confidence about whether to \textit{abstain} automated answers or involve human experts. In this work, we present a cascaded LLM decision framework that adaptively delegates tasks across multiple tiers of expertise – a base model for initial candidate answers, a more capable and knowledgeable (but costlier) large model, and a human expert for when the model cascade abstains. Our method proceeds in two stages. First, a deferral policy determines whether to accept the base model’s answer or regenerate it with the large model based on the confidence score. Second, an abstention policy decides whether the cascade model response is sufficiently certain or requires human intervention. Moreover, we incorporate an online learning mechanism in the framework that can leverage human feedback to improve decision quality over time. We demonstrate this approach to general question-answering (ARC-Easy and ARC-Challenge) and medical question-answering (MedQA and MedMCQA). Our results show that our cascaded strategy outperforms in most cases single-model baselines in accuracy while reducing cost and providing a principled way to handle abstentions.

[127] Schema-R1: A reasoning training approach for schema linking in Text-to-SQL Task cs.AI | cs.CL | cs.DBPDF

Wuzhenghong Wen, Su Pan, yuwei Sun

TL;DR: 论文提出了Schema-R1，一种基于强化学习的模式链接模型，通过构造高质量推理样本、监督微调和规则强化学习训练三步，显著提升Text-to-SQL任务中模式链接的推理能力，准确率提升10%。

Details

Motivation: 当前模式链接模型采用死记硬背的微调方法，过度优化地面真值结果而牺牲推理能力，亟需一种能提升推理能力的训练方法。

Result: 实验表明Schema-R1在过滤准确率上优于现有方法10%。

Insight: 强化学习可通过高质量样本和规则优化提升模式链接的推理能力，而非仅依赖地面真值。

Abstract: Schema linking is a critical step in Text-to-SQL task, aiming to accurately predict the table names and column names required for the SQL query based on the given question. However, current fine-tuning approaches for schema linking models employ a rote-learning paradigm, excessively optimizing for ground truth schema linking outcomes while compromising reasoning ability. This limitation arises because of the difficulty in acquiring a high-quality reasoning sample for downstream tasks. To address this, we propose Schema-R1, a reasoning schema linking model trained using reinforcement learning. Specifically, Schema-R1 consists of three key steps: constructing small batches of high-quality reasoning samples, supervised fine-tuning for cold-start initialization, and rule-based reinforcement learning training. The final results demonstrate that our method effectively enhances the reasoning ability of the schema linking model, achieving a 10% improvement in filter accuracy compared to the existing method. Our code is available at https://github.com/hongWin/Schema-R1/.

eess.IV [Back]

[128] ADAgent: LLM Agent for Alzheimer’s Disease Analysis with Collaborative Coordinator eess.IV | cs.CVPDF

Wenlong Hou, Gangqian Yang, Ye Du, Yeung Lau, Lihao Liu

TL;DR: ADAgent是一个基于大型语言模型的AI代理，专门用于阿尔茨海默病分析，通过协作协调器集成多模态数据，提升诊断和预后的准确性。

Details

Motivation: 现有方法通常依赖单一模态数据，而医疗专家采用多模态方法，因此需要一种能够处理多模态或缺失输入、整合多种先进方法的系统以提升AD相关任务性能。

Result: 实验表明ADAgent优于SOTA方法，多模态诊断准确率提升2.7%，多模态预后提升0.7%，MRI和PET诊断任务也有显著改进。

Insight: 通过协作协调器集成多模态数据和多种工具，可以显著提升AD分析的性能，为医疗AI的应用提供了新思路。

Abstract: Alzheimer’s disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single-modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches process multi-modal data, they are limited to specific tasks with a small set of input modalities and cannot handle arbitrary combinations. This highlights the need for a system that can address diverse AD-related tasks, process multi-modal or missing input, and integrate multiple advanced methods for improved performance. In this paper, we propose ADAgent, the first specialized AI agent for AD analysis, built on a large language model (LLM) to address user queries and support decision-making. ADAgent integrates a reasoning engine, specialized medical tools, and a collaborative outcome coordinator to facilitate multi-modal diagnosis and prognosis tasks in AD. Extensive experiments demonstrate that ADAgent outperforms SOTA methods, achieving significant improvements in accuracy, including a 2.7% increase in multi-modal diagnosis, a 0.7% improvement in multi-modal prognosis, and enhancements in MRI and PET diagnosis tasks.

[129] Vector Representations of Vessel Trees eess.IV | cs.CV | cs.GRPDF

James Batten, Michiel Schaap, Matthew Sinclair, Ying Bai, Ben Glocker

TL;DR: 该论文提出了一种名为VeTTA的新型框架，用于学习树状几何数据（如3D血管网络）的向量表示，通过两个Transformer自编码器分阶段训练，实现了低内存消耗和高保真重建。

Details

Motivation: 传统的3D卷积模型在处理树状几何数据（如血管网络）时面临高内存消耗的问题，且难以保持拓扑结构的一致性。论文旨在开发一种更高效、精确的建模方法。

Result: 实验表明，该方法在2D合成树和3D冠状动脉数据上实现了高保真重建和准确的拓扑保持，同时大幅降低了GPU内存需求。

Insight: Transformer自编码器在树状结构数据建模中表现优异，递归解码确保了生成的结构的合法性，为医学图像分析提供了一种高效工具。

Abstract: We introduce a novel framework for learning vector representations of tree-structured geometric data focusing on 3D vascular networks. Our approach employs two sequentially trained Transformer-based autoencoders. In the first stage, the Vessel Autoencoder captures continuous geometric details of individual vessel segments by learning embeddings from sampled points along each curve. In the second stage, the Vessel Tree Autoencoder encodes the topology of the vascular network as a single vector representation, leveraging the segment-level embeddings from the first model. A recursive decoding process ensures that the reconstructed topology is a valid tree structure. Compared to 3D convolutional models, this proposed approach substantially lowers GPU memory requirements, facilitating large-scale training. Experimental results on a 2D synthetic tree dataset and a 3D coronary artery dataset demonstrate superior reconstruction fidelity, accurate topology preservation, and realistic interpolations in latent space. Our scalable framework, named VeTTA, offers precise, flexible, and topologically consistent modeling of anatomical tree structures in medical imaging.

[130] DiffPR: Diffusion-Based Phase Reconstruction via Frequency-Decoupled Learning eess.IV | cs.CVPDF

Yi Zhang

TL;DR: DiffPR提出了一个基于扩散模型的两阶段频率解耦框架，通过消除高频跳跃连接和利用扩散模型恢复高频细节，显著提升了相位重建的质量。

Details

Motivation: 传统端到端U-Net在定量相位成像（QPI）中存在过度平滑问题，倾向于低频内容而忽略高频细节。研究表明频谱偏差是由高频跳跃连接加剧的。

Result: 在四个QPI数据集上，DiffPR的PSNR提升1.1 dB，MAE降低11%，显著提升了膜脊和斑点图案的清晰度。

Insight: 消除高频跳跃连接并结合扩散模型是一种有效解决传统相位重建网络中频谱偏差的方法。

Abstract: Oversmoothing remains a persistent problem when applying deep learning to off-axis quantitative phase imaging (QPI). End-to-end U-Nets favour low-frequency content and under-represent fine, diagnostic detail. We trace this issue to spectral bias and show that the bias is reinforced by high-level skip connections that feed high-frequency features directly into the decoder. Removing those deepest skips thus supervising the network only at a low resolution significantly improves generalisation and fidelity. Building on this insight, we introduce DiffPR, a two-stage frequency-decoupled framework. Stage 1: an asymmetric U-Net with cancelled high-frequency skips predicts a quarter-scale phase map from the interferogram, capturing reliable low-frequency structure while avoiding spectral bias. Stage 2: the upsampled prediction, lightly perturbed with Gaussian noise, is refined by an unconditional diffusion model that iteratively recovers the missing high-frequency residuals through reverse denoising. Experiments on four QPI datasets (B-Cell, WBC, HeLa, 3T3) show that DiffPR outperforms strong U-Net baselines, boosting PSNR by up to 1.1 dB and reducing MAE by 11 percent, while delivering markedly sharper membrane ridges and speckle patterns. The results demonstrate that cancelling high-level skips and delegating detail synthesis to a diffusion prior is an effective remedy for the spectral bias that limits conventional phase-retrieval networks.

[131] Joint Denoising of Cryo-EM Projection Images using Polar Transformers eess.IV | cs.CV | cs.LGPDF

Joakim Andén, Justus Sagemüller

TL;DR: 该论文提出了一种基于Transformer的神经网络架构，用于同时聚类、对齐和去噪冷冻电镜（cryo-EM）投影图像，显著降低了高噪声环境下的误差。

Details

Motivation: 传统去噪方法在高噪声环境（如冷冻电镜图像）中效果有限，而现有的深度学习方法未充分利用数据的冗余信息。论文旨在结合深度学习和传统分类平均方法，提升去噪性能。

Result: 在合成数据上的实验表明，该方法比单图像深度学习方法显著降低了相对MSE（45%）。

Insight: 结合传统方法的冗余信息利用和现代深度学习的灵活性，可以显著提升高噪声环境下的去噪性能。

Abstract: Deep neural networks~~(DNNs) have proven powerful for denoising, but they are ultimately of limited use in high-noise settings, such as for cryogenic electron microscopy~~(cryo-EM) projection images. In this setting, however, datasets contain a large number of projections of the same molecule, each taken from a different viewing direction. This redundancy of information is useful in traditional denoising techniques known as class averaging methods, where images are clustered, aligned, and then averaged to reduce the noise level. We present a neural network architecture based on transformers that extends these class averaging methods by simultaneously clustering, aligning, and denoising cryo-EM images. Results on synthetic data show accurate denoising performance using this architecture, reducing the relative mean squared error (MSE) single-image DNNs by $45%$ at a signal-to-noise (SNR) of $0.03$.

Chunlei Li, Yilei Shi, Haoxi Hu, Jingliang Hu, Xiao Xiang Zhu

TL;DR: 该论文提出了一种新方法，通过调整Stable Diffusion模型，将其用于CT盲超分辨率任务，解决了医学图像分辨率提升与辐射剂量之间的权衡问题。

Details

Motivation: 高分辨率CT成像对医疗诊断至关重要，但会增加辐射暴露。现有深度学习方法在应对复杂退化模式和有限医学数据时表现不佳，而预训练的扩散模型（如Stable Diffusion）在生成细节方面表现出色，因此将其用于CT超分辨率任务具有潜力。

Result: 实验表明，该方法在CT盲超分辨率任务中优于现有方法，能够在降低辐射剂量的情况下实现高质量成像。

Insight: 1. 预训练扩散模型可用于医学图像超分辨率任务；2. 结合文本描述可以提升生成结果的真实性和细节；3. 这种方法为医学成像中的辐射剂量优化提供了新思路。

Abstract: High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion models, particularly Stable Diffusion, have demonstrated remarkable capabilities in synthesizing fine details across various vision tasks. Motivated by this, we propose a novel framework that adapts Stable Diffusion for CT blind super-resolution. We employ a practical degradation model to synthesize realistic low-quality images and leverage a pre-trained vision-language model to generate corresponding descriptions. Subsequently, we perform super-resolution using Stable Diffusion with a specialized controlling strategy, conditioned on both low-resolution inputs and the generated text descriptions. Extensive experiments show that our method outperforms existing approaches, demonstrating its potential for achieving high-quality CT imaging at reduced radiation doses. Our code will be made publicly available.

[133] FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution eess.IV | cs.CVPDF

Zhaoyang Wang, Jie Li, Wen Lu, Lihuo He, Maoguo Gong

TL;DR: FCA2提出了一种基于压缩感知的自动编码器方法，用于高效且模块化的压缩视频超分辨率任务，显著减少推理时间并保持性能。

Details

Motivation: 现有CVSR模型存在推理时间长、训练流程复杂、依赖辅助信息等问题，且无法充分利用帧间差异，难以满足当前需求。

Result: 性能与当前SOTA相当或更优，推理时间显著减少。

Insight: 压缩感知技术可以有效降低计算复杂度并提升时序信息提取效率，模块化设计增强方法的适应性。

Abstract: State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. As video frame rates continue to increase, the diminishing inter-frame differences further expose the limitations of traditional frame-to-frame information exploitation methods, which are inadequate for addressing current video super-resolution (VSR) demands. To overcome these challenges, we propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames. The proposed modular architecture is designed for seamless integration with existing VSR frameworks, ensuring strong adaptability and transferability across diverse applications. Experimental results demonstrate that our method achieves performance on par with, or surpassing, the current SOTA models, while significantly reducing inference time. By addressing key bottlenecks in CVSR, our work offers a practical and efficient pathway for advancing VSR technology. Our code will be publicly available at https://github.com/handsomewzy/FCA2.

[134] Brain Network Analysis Based on Fine-tuned Self-supervised Model for Brain Disease Diagnosis eess.IV | cs.CVPDF

Yifei Tang, Hongjie Jiang, Changhong Jing, Hieu Pham, Shuqiang Wang

TL;DR: 该论文提出了一种基于自监督学习的微调脑网络模型，用于脑疾病诊断，通过多维度扩展脑区特征并增强模型的泛化能力。

Details

Motivation: 现有脑网络基础模型研究局限于单一维度，限制了其在神经科学中的广泛应用。论文旨在通过多维度扩展脑区特征，提升脑网络模型的诊断性能。

Result: 下游实验表明，该模型在脑疾病诊断中表现出优越性能。

Insight: 多维度特征扩展和自监督预训练能显著提升脑网络模型的泛化能力和诊断效果，为脑网络分析研究提供了新方向。

Abstract: Functional brain network analysis has become an indispensable tool for brain disease analysis. It is profoundly impacted by deep learning methods, which can characterize complex connections between ROIs. However, the research on foundation models of brain network is limited and constrained to a single dimension, which restricts their extensive application in neuroscience. In this study, we propose a fine-tuned brain network model for brain disease diagnosis. It expands brain region representations across multiple dimensions based on the original brain network model, thereby enhancing its generalizability. Our model consists of two key modules: (1)an adapter module that expands brain region features across different dimensions. (2)a fine-tuned foundation brain network model, based on self-supervised learning and pre-trained on fMRI data from thousands of participants. Specifically, its transformer block is able to effectively extract brain region features and compute the inter-region associations. Moreover, we derive a compact latent representation of the brain network for brain disease diagnosis. Our downstream experiments in this study demonstrate that the proposed model achieves superior performance in brain disease diagnosis, which potentially offers a promising approach in brain network analysis research.

[135] Framework of a multiscale data-driven digital twin of the muscle-skeletal system eess.IV | cs.CVPDF

Martina Paccini, Simone Cammarasana, Giuseppe Patanè

TL;DR: 本文提出了一种名为MS-DT的多尺度生物力学数字孪生框架，用于个性化评估和治疗肌肉骨骼系统疾病。该框架整合了多种数据源，并通过实时可视化平台为临床和研究提供支持。

Details

Motivation: 肌肉骨骼疾病（MSDs）是全球残疾的主要原因之一，需要先进的个性化诊断和治疗工具。数字孪生（DT）范式因其能整合异构数据源而成为解决这一问题的有效途径。

Result: MS-DT能够精确提取脊柱运动和肌肉功能的动态特征，为脊柱生物力学监测和康复提供了全面的工具。

Insight: 数字孪生技术在个性化医疗领域具有巨大潜力，能够通过多尺度数据集成和高保真建模提升诊断和治疗效果。

Abstract: Musculoskeletal disorders (MSDs) are a leading cause of disability worldwide, requiring advanced diagnostic and therapeutic tools for personalised assessment and treatment. Effective management of MSDs involves the interaction of heterogeneous data sources, making the Digital Twin (DT) paradigm a valuable option. This paper introduces the Musculoskeletal Digital Twin (MS-DT), a novel framework that integrates multiscale biomechanical data with computational modelling to create a detailed, patient-specific representation of the musculoskeletal system. By combining motion capture, ultrasound imaging, electromyography, and medical imaging, the MS-DT enables the analysis of spinal kinematics, posture, and muscle function. An interactive visualisation platform provides clinicians and researchers with an intuitive interface for exploring biomechanical parameters and tracking patient-specific changes. Results demonstrate the effectiveness of MS-DT in extracting precise kinematic and dynamic tissue features, offering a comprehensive tool for monitoring spine biomechanics and rehabilitation. This framework provides high-fidelity modelling and real-time visualization to improve patient-specific diagnosis and intervention planning.

[136] Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution eess.IV | cs.CVPDF

Zhangkai Ni, Yang Zhang, Wenhan Yang, Hanli Wang, Shiqi Wang

TL;DR: 本文提出了一种基于结构相似性启发的展开方法（SSIU），用于轻量级图像超分辨率（SR），通过结合数据驱动和模型驱动的优势，实现高效且紧凑的模型设计。

Details

Motivation: 当前数据驱动的超分辨率方法通常通过增加网络深度或使用Transformer注意力机制来扩展感受野，但会显著增加模型复杂度。而基于展开范式的模型驱动方法在保持模型紧凑性的同时表现优异，因此本文旨在结合两者的优势。

Result: 实验表明，SSIU在性能上优于当前最先进模型，同时参数更少、内存消耗更低。

Insight: 结合数据驱动和模型驱动方法，通过展开优化函数和模块化设计，能够在保持轻量化的同时提升超分辨率性能。

Abstract: Major efforts in data-driven image super-resolution (SR) primarily focus on expanding the receptive field of the model to better capture contextual information. However, these methods are typically implemented by stacking deeper networks or leveraging transformer-based attention mechanisms, which consequently increases model complexity. In contrast, model-driven methods based on the unfolding paradigm show promise in improving performance while effectively maintaining model compactness through sophisticated module design. Based on these insights, we propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity, aiming to combine the strengths of both data-driven and model-driven approaches. Our model operates progressively following the unfolding paradigm. Each iteration consists of multiple Mixed-Scale Gating Modules (MSGM) and an Efficient Sparse Attention Module (ESAM). The former implements comprehensive constraints on features, including a structural similarity constraint, while the latter aims to achieve sparse activation. In addition, we design a Mixture-of-Experts-based Feature Selector (MoE-FS) that fully utilizes multi-level feature information by combining features from different steps. Extensive experiments validate the efficacy and efficiency of our unfolding-inspired network. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption. Our code will be available at: https://github.com/eezkni/SSIU

[137] MindGrab for BrainChop: Fast and Accurate Skull Stripping for Command Line and Browser eess.IV | cs.AI | cs.CV | cs.NEPDF

Armina Fani, Mike Doan, Isabelle Le, Alex Fedorov, Malte Hoffmann

TL;DR: MindGrab是一种高效的深度学习模型，用于头部图像的多模态颅骨剥离，性能优于传统方法且资源需求更低。

Details

Motivation: 开发一种参数和内存高效的深度学习模型，用于多模态头部图像的颅骨剥离，以提升计算效率和硬件兼容性。

Result: MindGrab在Dice评分上显著优于传统方法（95.9 ± 1.6），同时参数减少95%，推理速度快2倍以上。

Insight: 通过合成数据和高效架构设计，可以在资源受限的设备上实现高性能医学图像分析。

Abstract: We developed MindGrab, a parameter- and memory-efficient deep fully-convolutional model for volumetric skull-stripping in head images of any modality. Its architecture, informed by a spectral interpretation of dilated convolutions, was trained exclusively on modality-agnostic synthetic data. MindGrab was evaluated on a retrospective dataset of 606 multimodal adult-brain scans (T1, T2, DWI, MRA, PDw MRI, EPI, CT, PET) sourced from the SynthStrip dataset. Performance was benchmarked against SynthStrip, ROBEX, and BET using Dice scores, with Wilcoxon signed-rank significance tests. MindGrab achieved a mean Dice score of 95.9 with standard deviation (SD) 1.6 across modalities, significantly outperforming classical methods (ROBEX: 89.1 SD 7.7, P < 0.05; BET: 85.2 SD 14.4, P < 0.05). Compared to SynthStrip (96.5 SD 1.1, P=0.0352), MindGrab delivered equivalent or superior performance in nearly half of the tested scenarios, with minor differences (<3% Dice) in the others. MindGrab utilized 95% fewer parameters (146,237 vs. 2,566,561) than SynthStrip. This efficiency yielded at least 2x faster inference, 50% lower memory usage on GPUs, and enabled exceptional performance (e.g., 10-30x speedup, and up to 30x memory reduction) and accessibility on a wider range of hardware, including systems without high-end GPUs. MindGrab delivers state-of-the-art accuracy with dramatically lower resource demands, supported in brainchop-cli (https://pypi.org/project/brainchop/) and at brainchop.org.

cs.SE [Back]

[138] CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs cs.SE | cs.CL | cs.CY | cs.LGPDF

Hanxi Guo, Siyuan Cheng, Kaiyuan Zhang, Guangyu Shen, Xiangyu Zhang

TL;DR: CodeMirage是一个多语言基准测试，用于检测来自生产级LLM的AI生成和改写代码，弥补了现有基准在语言覆盖率和模型能力上的不足。

Details

Motivation: LLM在软件开发中的广泛应用带来了代码抄袭、许可证违规和不安全程序传播等风险，需要强大的检测工具。然而，现有基准测试覆盖语言有限且依赖能力不足的生成模型。

Result: 提出了9个关键发现，揭示了当前检测器的优缺点，并为未来工作指明挑战。

Insight: CodeMirage为开发鲁棒且通用的AI生成代码检测器提供了严格的测试平台。

Abstract: Large language models (LLMs) have become integral to modern software development, producing vast amounts of AI-generated source code. While these models boost programming productivity, their misuse introduces critical risks, including code plagiarism, license violations, and the propagation of insecure programs. As a result, robust detection of AI-generated code is essential. To support the development of such detectors, a comprehensive benchmark that reflects real-world conditions is crucial. However, existing benchmarks fall short – most cover only a limited set of programming languages and rely on less capable generative models. In this paper, we present CodeMirage, a comprehensive benchmark that addresses these limitations through three major advancements: (1) it spans ten widely used programming languages, (2) includes both original and paraphrased code samples, and (3) incorporates outputs from ten state-of-the-art production-level LLMs, including both reasoning and non-reasoning models from six major providers. Using CodeMirage, we evaluate ten representative detectors across four methodological paradigms under four realistic evaluation configurations, reporting results using three complementary metrics. Our analysis reveals nine key findings that uncover the strengths and weaknesses of current detectors, and identify critical challenges for future work. We believe CodeMirage offers a rigorous and practical testbed to advance the development of robust and generalizable AI-generated code detectors.

[139] LeanExplore: A search engine for Lean 4 declarations cs.SE | cs.AI | cs.CL | cs.IR | cs.LG | cs.LO | I.2.6; H.3.3; I.2.3PDF

Justin Asher

TL;DR: LeanExplore是一个为Lean 4设计的搜索引擎，支持通过语义和形式化方式搜索声明，整合了多种排名策略，并提供了网站和Python API接口，便于LLMs集成。

Details

Motivation: 随着Lean 4生态系统的扩展，用户难以高效导航其庞大的库，需要一种工具来帮助搜索和理解其声明。

Result: 提供了可访问的网站和API，支持下载数据库，并易于与LLMs集成，增强了Lean 4的工作流程和AI驱动的数学研究。

Insight: 通过结合语义与形式化搜索，LeanExplore为Lean 4用户和AI助手提供了高效的声明查询工具，推动了数学研究的自动化。

Abstract: The expanding Lean 4 ecosystem poses challenges for navigating its vast libraries. This paper introduces LeanExplore, a search engine for Lean 4 declarations. LeanExplore enables users to semantically search for statements, both formally and informally, across select Lean 4 packages (including Batteries, Init, Lean, Mathlib, PhysLean, and Std). This search capability is powered by a hybrid ranking strategy, integrating scores from a multi-source semantic embedding model (capturing conceptual meaning from formal Lean code, docstrings, AI-generated informal translations, and declaration titles), BM25+ for keyword-based lexical relevance, and a PageRank-based score reflecting declaration importance and interconnectedness. The search engine is accessible via a dedicated website (https://www.leanexplore.com/) and a Python API (https://github.com/justincasher/lean-explore). Furthermore, the database can be downloaded, allowing users to self-host the service. LeanExplore integrates easily with LLMs via the model context protocol (MCP), enabling users to chat with an AI assistant about Lean declarations or utilize the search engine for building theorem-proving agents. This work details LeanExplore’s architecture, data processing, functionalities, and its potential to enhance Lean 4 workflows and AI-driven mathematical research

[140] LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? cs.SE | cs.AI | cs.CL | cs.LGPDF

Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu

TL;DR: LiveCodeBench Pro评估了LLM在竞争编程中的表现，发现前沿模型在中等和难题上表现有限，尤其是在算法推理和复杂案例分析方面仍落后于人类专家。高成绩主要依赖实现精确性和工具辅助，而非推理能力。

Details

Motivation: 研究动机在于验证LLM是否真正超越人类精英在竞争编程中的表现，并发现其局限性。

Result: 结果显示，最优模型在中等难度问题中仅53%通过率，难题中0%通过率，表明推理能力不足。

Insight: LLM在实现型问题中表现较好，但在复杂推理和案例分析中表现较差，且自信地生成错误解释，暴露了推理能力的短板。

Abstract: Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.

cs.LG [Back]

[141] Developing a Dyslexia Indicator Using Eye Tracking cs.LG | cs.AI | cs.CL | cs.CV | cs.HCPDF

Kevin Cogan, Vuong M. Ngo, Mark Roantree

TL;DR: 该论文提出了一种结合眼动追踪技术和机器学习的新方法，用于早期识别阅读障碍（dyslexia）。通过分析眼动模式（如注视时长和扫视不规律性），并结合随机森林分类器，达到了88.58%的准确率。

Details

Motivation: 现有诊断阅读障碍的方法通常成本高昂且不易获取，亟需创新、非侵入性的解决方案。眼动追踪技术因其成本低廉和易于操作，成为潜在替代方案。

Result: 随机森林分类器的准确率为88.58%，眼动追踪技术能有效识别阅读障碍，包括临界症状。

Insight: 眼动追踪结合机器学习为阅读障碍提供了一种高精度、低成本的非侵入性诊断方法，适用于临床研究和实际应用。

Abstract: Dyslexia, affecting an estimated 10% to 20% of the global population, significantly impairs learning capabilities, highlighting the need for innovative and accessible diagnostic methods. This paper investigates the effectiveness of eye-tracking technology combined with machine learning algorithms as a cost-effective alternative for early dyslexia detection. By analyzing general eye movement patterns, including prolonged fixation durations and erratic saccades, we proposed an enhanced solution for determining eye-tracking-based dyslexia features. A Random Forest Classifier was then employed to detect dyslexia, achieving an accuracy of 88.58%. Additionally, hierarchical clustering methods were applied to identify varying severity levels of dyslexia. The analysis incorporates diverse methodologies across various populations and settings, demonstrating the potential of this technology to identify individuals with dyslexia, including those with borderline traits, through non-invasive means. Integrating eye-tracking with machine learning represents a significant advancement in the diagnostic process, offering a highly accurate and accessible method in clinical research.

[142] Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models cs.LG | cs.AI | cs.CLPDF

Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer

TL;DR: 论文提出一种任务对齐提示方法（zero-shot-s²），提升视觉语言模型（VLM）在零样本检测AI生成图像中的性能，无需微调即可显著改进检测效果。

Details

Motivation: 随着AI生成图像的真实性提升，误用风险增加。传统监督检测依赖大规模数据集且泛化能力差，而预训练的视觉语言模型具有潜力，但需要更有效的方法激发其能力。

Result: 在多种数据集和16种生成模型上验证，zero-shot-s²表现优于链式思维提示，尤其在模型尺度变化时仍保持稳健性，且自我一致性进一步提升性能。

Insight: 任务对齐提示能更有效激发VLM的潜在能力，为AI生成图像检测提供了一种简单、通用且可解释的无监督方法。

Abstract: As image generators produce increasingly realistic images, concerns about potential misuse continue to grow. Supervised detection relies on large, curated datasets and struggles to generalize across diverse generators. In this work, we investigate the use of pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. While off-the-shelf VLMs exhibit some task-specific reasoning and chain-of-thought prompting offers gains, we show that task-aligned prompting elicits more focused reasoning and significantly improves performance without fine-tuning. Specifically, prefixing the model’s response with the phrase ``Let’s examine the style and the synthesis artifacts’’ – a method we call zero-shot-s$^2$ – boosts Macro F1 scores by 8%-29% for two widely used open-source models. These gains are consistent across three recent, diverse datasets spanning human faces, objects, and animals with images generated by 16 different models – demonstrating strong generalization. We further evaluate the approach across three additional model sizes and observe improvements in most dataset-model combinations – suggesting robustness to model scale. Surprisingly, self-consistency, a behavior previously observed in language reasoning, where aggregating answers from diverse reasoning paths improves performance, also holds in this setting. Even here, zero-shot-s$^2$ scales better than chain-of-thought in most cases – indicating that it elicits more useful diversity. Our findings show that task-aligned prompts elicit more focused reasoning and enhance latent capabilities in VLMs, like the detection of AI-generated images – offering a simple, generalizable, and explainable alternative to supervised methods. Our code is publicly available on github: https://github.com/osome-iu/Zero-shot-s2.git.

[143] CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models cs.LG | cs.AI | cs.CLPDF

Aneesh Komanduri, Karuna Bhaila, Xintao Wu

TL;DR: 该论文提出了一个名为CausalVLBench的基准测试，用于评估大型视觉语言模型（LVLMs）在视觉因果推理任务中的表现。

Details

Motivation: 尽管大型语言模型（LLMs）在因果推理任务中表现出色，但它们在多模态（视觉语言）领域的因果推理能力尚未得到充分研究。本文旨在填补这一空白。

Result: 实验揭示了现有LVLMs在视觉因果推理任务中的局限性，为进一步改进提供了方向。

Insight: 当前的LVLMs在视觉因果推理任务中存在明显不足，未来研究需探索新的方法和范式以提升其能力。

Abstract: Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. Our CausalVLBench encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms in improving the visual causal reasoning abilities of LVLMs.

[144] Tversky Neural Networks: Psychologically Plausible Deep Learning with Differentiable Tversky Similarity cs.LG | cs.AI | cs.CL | cs.CV | 68 | I.2.0; I.2.4; I.2.6; I.2.7; I.4.7; I.4.10; I.5.1; F.1.1PDF

Moussa Koulako Bala Doumbouya, Dan Jurafsky, Christopher D. Manning

TL;DR: 该论文提出一种基于Tversky相似性的可微分神经网络架构，替代传统几何相似性模型，提升模型性能并增强可解释性。

Details

Motivation: 传统深度学习中基于几何的相似性模型心理不真实，而Tversky的集合特征相似性理论更符合人类认知，但此前因离散集合操作难以与深度学习结合。

Result: 在NABirds图像分类任务中，Tversky投影层相对基线提升24.7%准确率；GPT-2在PTB上的困惑度降低7.5%，参数减少34.8%。

Insight: Tversky相似性不仅提升了模型性能，还提供了心理学理论支持的可解释性，为深度学习中的相似性建模提供了新范式。

Abstract: Work in psychology has highlighted that the geometric model of similarity standard in deep learning is not psychologically plausible because its metric properties such as symmetry do not align with human perception. In contrast, Tversky (1977) proposed an axiomatic theory of similarity based on a representation of objects as sets of features, and their similarity as a function of common and distinctive features. However, this model has not been used in deep learning before, partly due to the challenge of incorporating discrete set operations. We develop a differentiable parameterization of Tversky’s similarity that is learnable through gradient descent, and derive neural network building blocks such as the Tversky projection layer, which unlike the linear projection layer can model non-linear functions such as XOR. Through experiments with image recognition and language modeling, we show that the Tversky projection layer is a beneficial replacement for the linear projection layer, which employs geometric similarity. On the NABirds image classification task, a frozen ResNet-50 adapted with a Tversky projection layer achieves a 24.7% relative accuracy improvement over the linear layer adapter baseline. With Tversky projection layers, GPT-2’s perplexity on PTB decreases by 7.5%, and its parameter count by 34.8%. Finally, we propose a unified interpretation of both projection layers as computing similarities of input stimuli to learned prototypes, for which we also propose a novel visualization technique highlighting the interpretability of Tversky projection layers. Our work offers a new paradigm for thinking about the similarity model implicit in deep learning, and designing networks that are interpretable under an established theory of psychological similarity.

[145] ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models cs.LG | cs.AI | cs.CLPDF

Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen

TL;DR: 论文提出了一种名为ADAMIX的自适应混合精度差分压缩框架，通过量化误差优化为大型语言模型（LLM）提供高效压缩，显著优于现有基线方法。

Details

Motivation: 在多租户服务等场景中，大量基于同一基础模型微调的LLM需要部署。现有差分压缩方法在高压缩比下性能不佳或依赖经验性位分配方案，缺乏理论支持。

Result: 在AIME2024和GQA任务中，ADAMIX分别比最佳基线Delta-CoMe高出22.3%和6.1%，显著提升压缩性能。

Insight: 通过理论驱动的量化误差优化和混合精度分配，ADAMIX实现了在高压缩比下的性能优势，为LLM部署提供了高效解决方案。

Abstract: Large language models (LLMs) achieve impressive performance on various knowledge-intensive and complex reasoning tasks in different domains. In certain scenarios like multi-tenant serving, a large number of LLMs finetuned from the same base model are deployed to meet complex requirements for users. Recent works explore delta-compression approaches to quantize and compress the delta parameters between the customized LLM and the corresponding base model. However, existing works either exhibit unsatisfactory performance at high compression ratios or depend on empirical bit allocation schemes. In this work, we propose ADAMIX, an effective adaptive mixed-precision delta-compression framework. We provide a mathematical derivation of quantization error to motivate our mixed-precision compression strategy and formulate the optimal mixed-precision bit allocation scheme as the solution to a 0/1 integer linear programming problem. Our derived bit allocation strategy minimizes the quantization error while adhering to a predefined compression ratio requirement. Experimental results on various models and benchmarks demonstrate that our approach surpasses the best baseline by a considerable margin. On tasks like AIME2024 and GQA, where the norm of $\Delta \mathbf{W}$ is large and the base model lacks sufficient ability, ADAMIX outperforms the best baseline Delta-CoMe by 22.3% and 6.1% with 7B models, respectively.

[146] LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model cs.LG | cs.AI | cs.CLPDF

Pradyut Sekhsaria, Marcel Mateos Salles, Hai Huang, Randall Balestriero

TL;DR: 论文揭示了LoRA微调方法的一个潜在风险：模型可能仅依赖少量虚假标记（spurious tokens）完成任务，甚至可能被恶意操控。

Details

Motivation: 虽然LoRA等参数高效的微调方法在资源效率上表现优异，但其潜在的灾难性失败（如对虚假标记的过度依赖）尚未被充分研究。

Result: SSTI可以仅用单个标记操控模型决策；LoRA秩的高低影响模型对虚假标记的依赖程度。

Insight: LoRA的高效率可能以牺牲鲁棒性为代价，未来设计需权衡效率与安全性。

Abstract: Parameter Efficient FineTuning (PEFT), such as Low-Rank Adaptation (LoRA), aligns pre-trained Large Language Models (LLMs) to particular downstream tasks in a resource-efficient manner. Because efficiency has been the main metric of progress, very little attention has been put in understanding possible catastrophic failures. We uncover one such failure: PEFT encourages a model to search for shortcut solutions to solve its fine-tuning tasks. When very small amount of tokens, e.g., one token per prompt, are correlated with downstream task classes, PEFT makes any pretrained model rely predominantly on that token for decision making. While such spurious tokens may emerge accidentally from incorrect data cleaning, it also opens opportunities for malevolent parties to control a model’s behavior from Seamless Spurious Token Injection (SSTI). In SSTI, a small amount of tokens correlated with downstream classes are injected by the dataset creators. At test time, the finetuned LLM’s behavior can be controlled solely by injecting those few tokens. We apply SSTI across models from three families (Snowflake Arctic, Apple OpenELM, and Meta LLaMA-3) and four diverse datasets (IMDB, Financial Classification, CommonSense QA, and Bias in Bios). Our findings reveal three astonishing behaviors. First, as few as a single token of SSTI is sufficient to steer a model’s decision making. Second, for light SSTI, the reliance on spurious tokens is proportional to the LoRA rank. Lastly, with aggressive SSTI, larger LoRA rank values become preferable to small rank values as it makes the model attend to non-spurious tokens, hence improving robustness.

[147] Bias Amplification in RAG: Poisoning Knowledge Retrieval to Steer LLMs cs.LG | cs.CL | cs.CRPDF

Linlin Wang, Tianqing Zhu, Laiqiao Qin, Longxiang Gao, Wanlei Zhou

TL;DR: 该论文探讨了检索增强生成（RAG）系统中存在的偏见放大问题，并提出了一种名为BRRA的攻击框架，通过操纵RAG系统放大语言模型的偏见。同时提出了一种双阶段防御机制缓解攻击影响。

Details

Motivation: 现有研究主要关注RAG系统中投毒攻击对模型输出质量的影响，而忽视了其对模型偏见的放大作用。例如，在涉及性别中性查询时，被攻击的RAG系统可能优先检索强化性别刻板印象的文档。

Result: 实验表明，BRRA攻击能显著增强模型在多个维度上的偏见，同时提出的双阶段防御机制能有效缓解攻击的影响。

Insight: RAG系统中的投毒攻击不仅影响输出质量，还会直接放大模型偏见，表明需关注RAG系统的公平性问题，安全性与偏见控制需共同考虑。

Abstract: In Large Language Models, Retrieval-Augmented Generation (RAG) systems can significantly enhance the performance of large language models by integrating external knowledge. However, RAG also introduces new security risks. Existing research focuses mainly on how poisoning attacks in RAG systems affect model output quality, overlooking their potential to amplify model biases. For example, when querying about domestic violence victims, a compromised RAG system might preferentially retrieve documents depicting women as victims, causing the model to generate outputs that perpetuate gender stereotypes even when the original query is gender neutral. To show the impact of the bias, this paper proposes a Bias Retrieval and Reward Attack (BRRA) framework, which systematically investigates attack pathways that amplify language model biases through a RAG system manipulation. We design an adversarial document generation method based on multi-objective reward functions, employ subspace projection techniques to manipulate retrieval results, and construct a cyclic feedback mechanism for continuous bias amplification. Experiments on multiple mainstream large language models demonstrate that BRRA attacks can significantly enhance model biases in dimensions. In addition, we explore a dual stage defense mechanism to effectively mitigate the impacts of the attack. This study reveals that poisoning attacks in RAG systems directly amplify model output biases and clarifies the relationship between RAG system security and model fairness. This novel potential attack indicates that we need to keep an eye on the fairness issues of the RAG system.

[148] RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer cs.LG | cs.AI | cs.CVPDF

Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng

TL;DR: 论文通过实验发现，多模态Transformer的自注意力机制会偏好某一模态，导致动态适应性下降。作者提出RollingQ方法，通过旋转查询来平衡注意力分配，恢复合作动态。

Details

Motivation: 多模态学习在融合不同模态信息时面临挑战，尤其是当模态质量差异较大时。现有动态融合策略（如注意力机制）的动态适应性会逐渐减弱，导致模型偏好某一模态。

Result: 实验验证了RollingQ的有效性，恢复合作动态有助于提升多模态Transformer的综合能力。

Insight: 多模态学习中动态适应性下降是一个潜在问题，简单的查询旋转策略可以有效恢复模型对不同模态的均衡关注。

Abstract: Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as attention mechanism in Transformers, aim to address such challenge by adaptively emphasizing modalities based on the characteristics of input data. However, through amounts of carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely-used self-attention models diminishes. Model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating attention mechanism’s dynamic properties. To revive adaptability, we propose a simple yet effective method Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ and the restoration of cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.

[149] TreeRL: LLM Reinforcement Learning with On-Policy Tree Search cs.LG | cs.CLPDF

Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang

TL;DR: TreeRL是将树搜索引入LLM强化学习的框架，通过策略内树搜索和中间监督提升推理任务性能，避免单独训练奖励模型的问题，并在数学和代码推理任务中表现优异。

Details

Motivation: 现有LLM强化学习通常采用独立链采样策略或单独训练奖励模型，存在探索不足和分布失配问题。TreeRL通过树搜索提供密集的策略内过程奖励，改进这些问题。

Result: 在数学和代码推理基准测试中，TreeRL显著优于传统链式RL（ChainRL），验证了树搜索在LLM强化学习中的潜力。

Insight: 树搜索不仅适用于传统RL任务，也能显著提升LLM推理能力，中间监督和高效搜索策略是关键创新点。

Abstract: Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for a separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLM. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.

[150] Visual Pre-Training on Unlabeled Images using Reinforcement Learning cs.LG | cs.CVPDF

Dibya Ghosh, Sergey Levine

TL;DR: 该论文提出了一种通过强化学习（RL）在无标注图像上进行视觉预训练的方法，将图像变换视为动态系统中的行为，学习类似于值函数（value function）的特征表示。

Details

Motivation: 传统的自监督学习方法（如基于图像裁剪的一致性学习）与强化学习中的值函数学习具有相似性，作者希望通过强化学习的框架改进无标注图像的特征表示学习。

Result: 实验表明，该方法在EpicKitchens（视频数据）、COCO（场景数据）和CC12M（网络爬取数据）等无标注数据集上学习到了更具表达力的特征表示。

Insight: 通过强化学习的奖励机制，可以利用弱标注或精选图像进一步引导特征学习，为无监督和弱监督学习提供了新的可能性。

Abstract: In reinforcement learning (RL), value-based algorithms learn to associate each observation with the states and rewards that are likely to be reached from it. We observe that many self-supervised image pre-training methods bear similarity to this formulation: learning features that associate crops of images with those of nearby views, e.g., by taking a different crop or color augmentation. In this paper, we complete this analogy and explore a method that directly casts pre-training on unlabeled image data like web crawls and video frames as an RL problem. We train a general value function in a dynamical system where an agent transforms an image by changing the view or adding image augmentations. Learning in this way resembles crop-consistency self-supervision, but through the reward function, offers a simple lever to shape feature learning using curated images or weakly labeled captions when they exist. Our experiments demonstrate improved representations when training on unlabeled images in the wild, including video data like EpicKitchens, scene data like COCO, and web-crawl data like CC12M.

[151] SIMSHIFT: A Benchmark for Adapting Neural Surrogates to Distribution Shifts cs.LG | cs.CV | physics.comp-phPDF

Paul Setinek, Gianluca Galletti, Thomas Gross, Dominik Schnürer, Johannes Brandstetter

TL;DR: SIMSHIFT 是一个专注于神经网络代理在分布偏移下适应性的基准数据集和评估套件，旨在解决 PDE 代理在未见配置中性能下降的问题。

Details

Motivation: PDE 的神经网络代理在未见的问题配置（如新材料类型或结构尺寸）中性能显著下降，而领域适应（DA）技术在视觉和语言处理中已被广泛用于未见配置的泛化。

Result: 实验表明，分布偏移下神经网络代理建模存在挑战，但 DA 在模拟中具有潜力，揭示了工业场景中实现稳健代理的未解决问题。

Insight: DA 技术可为工业模拟中的神经网络代理提供分布偏移适应性，但稳健性仍是一个开放问题。

Abstract: Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on unseen problem configurations, such as novel material types or structural dimensions. Meanwhile, Domain Adaptation (DA) techniques have been widely used in vision and language processing to generalize from limited information about unseen configurations. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks: hot rolling, sheet metal forming, electric motor design and heatsink design. Second, we extend established domain adaptation methods to state of the art neural surrogates and systematically evaluate them. These approaches use parametric descriptions and ground truth simulations from multiple source configurations, together with only parametric descriptions from target configurations. The goal is to accurately predict target simulations without access to ground truth simulation data. Extensive experiments on SIMSHIFT highlight the challenges of out of distribution neural surrogate modeling, demonstrate the potential of DA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at https://github.com/psetinek/simshift

[152] EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction cs.LG | cs.AI | cs.CVPDF

Hsi-Che Lin, Yu-Chu Yu, Kai-Po Chang, Yu-Chiang Frank Wang

TL;DR: EMLoC是一个基于模拟器的高效内存微调框架，通过LoRA校正，使模型微调在推理所需的内存预算内完成。它结合了轻量级模拟器和新型补偿算法，实现了低成本的高效模型适应。

Details

Motivation: 大规模基础模型的微调对大多数用户来说内存开销巨大，限制了其在实际应用中的普及。EMLoC的提出旨在解决这一问题，实现低成本的高效微调。

Result: 实验表明，EMLoC在多个数据集和模态上优于基线方法，并在不量化的情况下支持大模型高效微调。

Insight: 通过轻量级模拟器和补偿机制的结合，EMLoC为资源受限的用户提供了高效微调的新思路，推动了基础模型的普及应用。

Abstract: Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific light-weight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning then is performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which thus can be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model on a single 24GB consumer GPU-bringing efficient and practical model adaptation to individual users.

cs.GR [Back]

[153] Anti-Aliased 2D Gaussian Splatting cs.GR | cs.CVPDF

Mae Younes, Adnane Boukhayma

TL;DR: 该论文提出了一种抗锯齿的2D高斯溅射方法（AA-2DGS），解决了传统2DGS在不同采样率下渲染时的严重锯齿问题。

Details

Motivation: 传统的2D高斯溅射（2DGS）在训练时的采样率与渲染时的采样率不同时，会出现严重的锯齿问题，限制了其在实际应用中的灵活性。

Result: 该方法在保持2DGS几何优点的同时，显著提升了不同尺度下的渲染质量，有效消除了高频锯齿。

Insight: 通过约束频率内容和局部空间滤波，可以显著提升2D高斯溅射的渲染质量，为其在复杂场景中的应用提供了可能。

Abstract: 2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an antialiased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.

[154] CGVQM+D: Computer Graphics Video Quality Metric and Dataset cs.GR | cs.CVPDF

Akshay Jindal, Nabil Sadaka, Manu Mathew Thomas, Anton Sochenov, Anton Kaplanyan

TL;DR: 该论文提出了一个专注于高级渲染技术引入的失真的视频质量数据集 CGVQM+D，并开发了一种新的全参考视频质量度量标准 CGVQM，优于现有方法。

Details

Motivation: 现有的视频和图像质量数据集主要研究自然视频和传统失真，而合成内容和现代渲染失真的感知仍未充分探索。

Result: 现有全参考质量度量标准在这些新失真上表现不佳（最大 Pearson 相关系数为 0.78），而 CGVQM 显著优于这些方法。

Insight: 预训练的 3D CNN 的特征空间能有效捕捉人类对视频质量的感知，为质量度量标准的改进提供了新方向。

Abstract: While existing video and image quality datasets have extensively studied natural videos and traditional distortions, the perception of synthetic content and modern rendering artifacts remains underexplored. We present a novel video quality dataset focused on distortions introduced by advanced rendering techniques, including neural supersampling, novel-view synthesis, path tracing, neural denoising, frame interpolation, and variable rate shading. Our evaluations show that existing full-reference quality metrics perform sub-optimally on these distortions, with a maximum Pearson correlation of 0.78. Additionally, we find that the feature space of pre-trained 3D CNNs aligns strongly with human perception of visual quality. We propose CGVQM, a full-reference video quality metric that significantly outperforms existing metrics while generating both per-pixel error maps and global quality scores. Our dataset and metric implementation is available at https://github.com/IntelLabs/CGVQM.

cs.MA [Back]

[155] AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction cs.MA | cs.CL | cs.CVPDF

Syeda Kisaa Fatima, Tehreem Zubair, Noman Ahmed, Asifullah Khan

TL;DR: 该论文提出了一种名为LUCID-MA的多智能体框架，通过AI代理的协作分析犯罪数据，包括时空模式分析、反馈改进和趋势预测，完全离线运行并支持自主迭代学习。

Details

Motivation: 传统犯罪数据分析需要大量人工干预且难以迭代优化，作者希望通过多智能体协作框架实现自动化和可扩展的分析，同时保护数据隐私。

Result: LUCID-MA展示了在多智能体协作下犯罪数据的高效分析和预测能力，同时验证了离线模式下代理的自我学习能力。

Insight: AutoGen式的多智能体框架在社会科学领域具有潜力，能够实现自主、可扩展和迭代的数据分析，同时兼顾隐私保护。

Abstract: This paper introduces LUCID-MA (Learning and Understanding Crime through Dialogue of Multiple Agents), an innovative AI powered framework where multiple AI agents collaboratively analyze and understand crime data. Our system that consists of three core components: an analysis assistant that highlights spatiotemporal crime patterns, a feedback component that reviews and refines analytical results and a prediction component that forecasts future crime trends. With a well-designed prompt and the LLaMA-2-13B-Chat-GPTQ model, it runs completely offline and allows the agents undergo self-improvement through 100 rounds of communication with less human interaction. A scoring function is incorporated to evaluate agent’s performance, providing visual plots to track learning progress. This work demonstrates the potential of AutoGen-style agents for autonomous, scalable, and iterative analysis in social science domains maintaining data privacy through offline execution.

eess.AS [Back]

[156] PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding eess.AS | cs.AI | cs.CL | cs.SDPDF

Jiajun He, Tomoki Toda

TL;DR: 该论文提出了PMF-CEC方法，通过音素增强的多模态融合改进ASR错误纠正能力，尤其针对发音相似但拼写不同的罕见词问题，并引入保留概率机制减少过检测问题。

Details

Motivation: 现有的ASR后处理方法ED-CEC在处理发音相似但拼写不同的罕见词时效果不佳，同时存在错误检测模块过检测的问题。因此，需要一种改进方法来解决这些问题。

Result: 在五个数据集上的实验显示，PMF-CEC在保持合理推理速度的同时进一步降低了偏置词错误率，对同音异义词的纠正表现更好，且优于其他上下文偏置方法。

Insight: 音素信息可以显著提升ASR后处理对同音异义词的区分能力，而保留概率机制能有效减少过检测问题，提升模型在实际应用中的鲁棒性和效率。

Abstract: End-to-end automatic speech recognition (ASR) models often struggle to accurately recognize rare words. Previously, we introduced an ASR postprocessing method called error detection and context-aware error correction (ED-CEC), which leverages contextual information such as named entities and technical terms to improve the accuracy of ASR transcripts. Although ED-CEC achieves a notable success in correcting rare words, its accuracy remains low when dealing with rare words that have similar pronunciations but different spellings. To address this issue, we proposed a phoneme-augmented multimodal fusion method for context-aware error correction (PMF-CEC) method on the basis of ED-CEC, which allowed for better differentiation between target rare words and homophones. Additionally, we observed that the previous ASR error detection module suffers from overdetection. To mitigate this, we introduced a retention probability mechanism to filter out editing operations with confidence scores below a set threshold, preserving the original operation to improve error detection accuracy. Experiments conducted on five datasets demonstrated that our proposed PMF-CEC maintains reasonable inference speed while further reducing the biased word error rate compared with ED-CEC, showing a stronger advantage in correcting homophones. Moreover, our method outperforms other contextual biasing methods, and remains valuable compared with LLM-based methods in terms of faster inference and better robustness under large biasing lists.

[157] Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling eess.AS | cs.CL | cs.CY | cs.SD | stat.AP | K.4; J.4; I.2PDF

Tahiya Chowdhury, Veronica Romero

TL;DR: 这篇论文探讨了开源音频分析工具在提取语音特征时的可靠性问题，特别是在自闭症青少年群体中的应用。研究发现不同工具（如OpenSMILE和Praat）提取的特征存在显著差异，可能影响模型的性能和公平性。

Details

Motivation: 机器学习模型依赖于从音频记录中提取的特征，但开源工具缺乏验证，可能在不同上下文和群体中引入偏差，尤其是在临床应用中。

Result: 结果显示特征在不同工具间存在显著差异，这些差异影响了模型的性能，特别是在不同上下文和群体中的公平性。

Insight: 在临床应用中，使用开源工具时需进行领域相关验证，以确保特征提取的可靠性和模型公平性。

Abstract: Machine learning-based behavioral models rely on features extracted from audio-visual recordings. The recordings are processed using open-source tools to extract speech features for classification models. These tools often lack validation to ensure reliability in capturing behaviorally relevant information. This gap raises concerns about reproducibility and fairness across diverse populations and contexts. Speech processing tools, when used outside of their design context, can fail to capture behavioral variations equitably and can then contribute to bias. We evaluate speech features extracted from two widely used speech analysis tools, OpenSMILE and Praat, to assess their reliability when considering adolescents with autism. We observed considerable variation in features across tools, which influenced model performance across context and demographic groups. We encourage domain-relevant verification to enhance the reliability of machine learning models in clinical applications.

[158] Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts eess.AS | cs.AI | cs.CL | cs.SDPDF

Lingyun Gao, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

TL;DR: 该论文提出了一种利用Whisper和指令调优的大型语言模型（LLMs）通过提示（prompting）来改进儿童语音识别和阅读错误检测的多模态方法。

Details

Motivation: 自动朗读评估可以为教师提供高效评分支持，但目前相关研究较少。作者旨在利用音频和文本资源知识提升儿童语音识别和阅读错误检测的效果。

Result: 最佳系统的单词错误率（WER）从基线的9.4%降至5.1%，阅读错误检测的F1分数从0.39提升到0.73，达到先进水平。

Insight: 提示策略在多模态任务中具有潜力，尤其是在结合语音识别和自然语言处理模型时，能显著提升性能。

Abstract: Automatic reading aloud evaluation can provide valuable support to teachers by enabling more efficient scoring of reading exercises. However, research on reading evaluation systems and applications remains limited. We present a novel multimodal approach that leverages audio and knowledge from text resources. In particular, we explored the potential of using Whisper and instruction-tuned large language models (LLMs) with prompts to improve transcriptions for child speech recognition, as well as their effectiveness in downstream reading mistake detection. Our results demonstrate the effectiveness of prompting Whisper and prompting LLM, compared to the baseline Whisper model without prompting. The best performing system achieved state-of-the-art recognition performance in Dutch child read speech, with a word error rate (WER) of 5.1%, improving the baseline WER of 9.4%. Furthermore, it significantly improved reading mistake detection, increasing the F1 score from 0.39 to 0.73.

Table of Contents

cs.CV [Back]

[1] EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices cs.CVPDF

[2] Image-Based Method For Measuring And Classification Of Iron Ore Pellets Using Star-Convex Polygons cs.CVPDF

[3] Gender Fairness of Machine Learning Algorithms for Pain Detection cs.CV | cs.LGPDF

[4] JAFAR: Jack up Any Feature at Any Resolution cs.CV | eess.IVPDF

[5] Autonomous Computer Vision Development with Agentic AI cs.CV | cs.AI | cs.MAPDF

[6] FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation cs.CV | cs.LG | eess.IVPDF

[7] On the development of an AI performance and behavioural measures for teaching and classroom management cs.CV | H.5; J.4; I.2.7; I.2.10PDF

[8] AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation cs.CVPDF

[9] 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks cs.CVPDF

[10] LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs cs.CV | cs.LGPDF

[11] Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels cs.CV | cs.HCPDF

[12] SLRNet: A Real-Time LSTM-Based Sign Language Recognition System cs.CV | 68T07 (Artificial Intelligence), 68U10 (Image Processing)PDF

[13] Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search cs.CVPDF

[14] Digitization of Document and Information Extraction using OCR cs.CV | cs.IRPDF

[15] VIBE: Can a VLM Read the Room? cs.CV | cs.LGPDF

[16] Test-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning cs.CV | cs.AIPDF

[17] BrainMAP: Multimodal Graph Learning For Efficient Brain Disease Localization cs.CV | cs.LG | cs.NEPDF

[18] Enhanced Vehicle Speed Detection Considering Lane Recognition Using Drone Videos in California cs.CV | cs.LGPDF

[19] Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models cs.CV | cs.LGPDF

[20] TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy cs.CV | cs.AIPDF

[21] HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation cs.CV | eess.IVPDF

[22] GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset cs.CVPDF

[23] A Watermark for Auto-Regressive Image Generation Models cs.CVPDF

[24] Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images cs.CVPDF

[25] Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation cs.CV | cs.AIPDF

[26] Dynamic Double Space Tower cs.CV | cs.AIPDF

[27] Stop learning it all to mitigate visual hallucination, Focus on the hallucination target cs.CV | cs.AIPDF

[28] TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models cs.CVPDF

[29] On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving cs.CV | cs.LGPDF

[30] FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes cs.CVPDF

[31] Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation cs.CVPDF

[32] Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs cs.CV | cs.CL | cs.LGPDF

[33] GNSS-inertial state initialization by distance residuals cs.CVPDF

[34] FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation cs.CV | cs.AI | cs.LGPDF

[35] Linearly Solving Robust Rotation Estimation cs.CV | cs.RO | cs.SY | eess.SYPDF

[36] EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment cs.CV | eess.IVPDF

[37] DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs cs.CV | cs.AI | cs.CLPDF

[38] VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories? cs.CVPDF

[39] Camera-based method for the detection of lifted truck axles using convolutional neural networks cs.CVPDF

[40] EasyARC: Evaluating Vision Language Models on True Visual Reasoning cs.CV | cs.LGPDF

[41] A$^2$LC: Active and Automated Label Correction for Semantic Segmentation cs.CV | cs.AIPDF

[42] Wi-CBR: WiFi-based Cross-domain Behavior Recognition via Multimodal Collaborative Awareness cs.CV | eess.SPPDF

[43] SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation cs.CVPDF

[44] Evaluating Fairness and Mitigating Bias in Machine Learning: A Novel Technique using Tensor Data and Bayesian Regression cs.CV | cs.AI | cs.LGPDF

[45] DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation cs.CV | cs.AI | cs.LGPDF

[46] Prohibited Items Segmentation via Occlusion-aware Bilayer Modeling cs.CVPDF

[47] Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning cs.CVPDF

[48] Cross-Modal Clustering-Guided Negative Sampling for Self-Supervised Joint Learning from Medical Images and Reports cs.CVPDF

[49] Predicting Patient Survival with Airway Biomarkers using nn-Unet/Radiomics cs.CV | cs.LGPDF

[50] Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets cs.CV | cs.AI | I.2.0PDF

[51] MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space cs.CV | cs.AIPDF

[52] DMAF-Net: An Effective Modality Rebalancing Framework for Incomplete Multi-Modal Medical Image Segmentation cs.CVPDF

[53] Quizzard@INOVA Challenge 2025 – Track A: Plug-and-Play Technique in Interleaved Multi-Image Model cs.CV | cs.CL | cs.MMPDF

[54] MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution cs.CVPDF

[55] CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection cs.CV | cs.LGPDF

[56] AgentSense: Virtual Sensor Data Generation Using LLM Agent in Simulated Home Environments cs.CV | cs.HCPDF

[57] Real-Time Feedback and Benchmark Dataset for Isometric Pose Evaluation cs.CV | cs.AI | cs.HCPDF

[58] Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation cs.CV | cs.AI | cs.CY | cs.LGPDF

[59] GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers cs.CVPDF

[60] Teleoperated Driving: a New Challenge for 3D Object Detection in Compressed Point Clouds cs.CV | cs.NI | eess.IVPDF

[61] Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation cs.CV | cs.CLPDF

[62] Vision-based Lifting of 2D Object Detections for Automated Driving cs.CV | cs.LGPDF

[63] SphereDrag: Spherical Geometry-Aware Panoramic Image Editing cs.CVPDF

[64] Evaluating Sensitivity Parameters in Smartphone-Based Gaze Estimation: A Comparative Study of Appearance-Based and Infrared Eye Trackers cs.CV | cs.HCPDF

[65] How Visual Representations Map to Language Feature Space in Multimodal LLMs cs.CV | cs.LGPDF

[66] Simple Radiology VLLM Test-time Scaling with Thought Graph Traversal cs.CVPDF

[67] VGR: Visual Grounded Reasoning cs.CV | cs.AI | cs.CLPDF

[68] Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale cs.CVPDF

cs.CL [Back]

[69] Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation cs.CL | cs.AIPDF

[70] A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes cs.CLPDF

[71] Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models cs.CLPDF

[72] Targeted control of fast prototyping through domain-specific interface cs.CLPDF

[73] CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention cs.CL | cs.AI | cs.CVPDF

[74] CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling cs.CLPDF

[75] RoE-FND: A Case-Based Reasoning Approach with Dual Verification for Fake News Detection via LLMs cs.CLPDF

[76] MANBench: Is Your Multimodal Model Smarter than Human? cs.CLPDF

[77] SAGE:Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs cs.CLPDF