Table of Contents

cs.CV [Back]

[1] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration cs.CV | stat.MLPDF

Alessio Spagnoletti, Andrés Almansa, Marcelo Pereyra

TL;DR: LVTINO是一种基于Video Consistency Models (VCMs)的零样本或即插即用视频修复逆求解器,实现高保真且时序一致的视频恢复。

Details

Motivation: 现有方法在视频修复中直接逐帧应用图像潜在扩散模型(LDMs)会导致时序不一致,而LVTINO通过VCMs显式建模时序因果性来解决这一问题。

Result: 在多样视频逆问题实验中,LVTINO在重建质量和计算效率上均优于现有逐帧方法。

Insight: 视频修复需要显式建模时序依赖性,VCMs提供了一种高效且高质量的先验编码方式,为未来视频生成与修复研究指明了方向。

Abstract: Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency.


[2] Image Generation Based on Image Style Extraction cs.CVPDF

Shuochen Chang

TL;DR: 这篇论文提出了一种基于风格提取的三阶段图像生成方法,通过风格编码器和风格投影层实现对风格表征的精细控制,并构建了Style30k-captions数据集用于训练。

Details

Motivation: 现有文本到图像生成模型难以通过自然语言精确描述和控制细粒度风格,同时风格参考图像的引导信息难以与传统文本引导生成对齐。

Result: 实现了基于文本提示的细粒度风格控制图像生成。

Insight: 风格提取与文本表征的对齐是提升生成模型风格控制能力的关键。

Abstract: Image generation based on text-to-image generation models is a task with practical application scenarios that fine-grained styles cannot be precisely described and controlled in natural language, while the guidance information of stylized reference images is difficult to be directly aligned with the textual conditions of traditional textual guidance generation. This study focuses on how to maximize the generative capability of the pretrained generative model, by obtaining fine-grained stylistic representations from a single given stylistic reference image, and injecting the stylistic representations into the generative body without changing the structural framework of the downstream generative model, so as to achieve fine-grained controlled stylized image generation. In this study, we propose a three-stage training style extraction-based image generation method, which uses a style encoder and a style projection layer to align the style representations with the textual representations to realize fine-grained textual cue-based style guide generation. In addition, this study constructs the Style30k-captions dataset, whose samples contain a triad of images, style labels, and text descriptions, to train the style encoder and style projection layer in this experiment.


[3] EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels cs.CVPDF

Shijia Feng, Michael Wray, Walterio Mayol-Cuevas

TL;DR: 该论文提出了一个名为EvoStruggle的数据集,用于捕捉技能学习过程中挣扎行为的演变,涵盖多个任务和技能水平,并通过实验验证了时序动作定位模型在该任务中的有效性。

Details

Motivation: 现有数据集未关注技能学习过程中挣扎行为的动态演变,而这种演变对优化学习和开发辅助系统至关重要。

Result: 模型在跨任务和跨活动情况下分别达到34.56%和19.24%的平均mAP,表明挣扎行为是可迁移的概念。

Insight: 挣扎行为在不同任务中具有共性,但检测仍具挑战性,未来需进一步提升模型性能。

Abstract: The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user’s current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities – tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.


[4] SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs cs.CV | cs.AI | cs.LG | physics.comp-phPDF

Abu Bucker Siddik, Diane Oyen, Alexander Most, Michal Kucer, Ayan Biswas

TL;DR: SPUS是一种轻量级且参数高效的基础模型,用于解决广泛的偏微分方程(PDE),基于残差U-Net架构,优于现有基于复杂Transformer的模型。

Details

Motivation: 现有PDE基础模型通常基于复杂的Transformer架构,计算和参数开销大。SPUS旨在提供一种更轻量、高效的解决方案。

Result: SPUS在6种未见过的PDE任务上表现出优异泛化能力,且参数更少、微调数据需求低。

Insight: 轻量化的U-Net架构在PDE求解领域具有潜力,能够平衡性能和效率。

Abstract: We introduce Small PDE U-Net Solver (SPUS), a compact and efficient foundation model (FM) designed as a unified neural operator for solving a wide range of partial differential equations (PDEs). Unlike existing state-of-the-art PDE FMs-primarily based on large complex transformer architectures with high computational and parameter overhead-SPUS leverages a lightweight residual U-Net-based architecture that has been largely underexplored as a foundation model architecture in this domain. To enable effective learning in this minimalist framework, we utilize a simple yet powerful auto-regressive pretraining strategy which closely replicates the behavior of numerical solvers to learn the underlying physics. SPUS is pretrained on a diverse set of fluid dynamics PDEs and evaluated across 6 challenging unseen downstream PDEs spanning various physical systems. Experimental results demonstrate that SPUS using residual U-Net based architecture achieves state-of-the-art generalization on these downstream tasks while requiring significantly fewer parameters and minimal fine-tuning data, highlighting its potential as a highly parameter-efficient FM for solving diverse PDE systems.


[5] DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation cs.CVPDF

Shubhankar Borse, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli

TL;DR: DisCo是一个基于强化学习的框架,通过多样性约束优化多人生成任务中身份多样性问题,显著提升生成结果的独特性和准确性。

Details

Motivation: 当前文本到图像生成模型在处理多人提示时存在重复面孔、身份混杂和计数错误等问题,亟需一种能够直接优化身份多样性的方法。

Result: 在DiverseHumans测试集上,DisCo实现了98.6%的唯一面孔准确率和近乎完美的全局身份分布,超越了开源和商业方法。

Insight: DisCo通过无标注的强化学习方法,为多人生成任务提供了可扩展的解决方案,并为该领域设定了新的基准。

Abstract: State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts - duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread - surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.


[6] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings cs.CV | cs.AIPDF

Angel Daruna, Nicholas Meegan, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

TL;DR: GeoSURGE提出了一种结合地理层级嵌入和语义分割的视觉地理定位方法,显著提升了多个基准数据集上的性能。

Details

Motivation: 现有的视觉地理定位方法主要依赖于视觉特征,而忽略了地理信息的层级结构和语义信息,限制了定位的准确性。

Result: 在五个基准数据集的25个指标中,22个指标超越了现有最佳方法和大规模视觉语言模型。

Insight: 地理层级嵌入和语义信息的融合对提升视觉地理定位性能至关重要。

Abstract: Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.


[7] Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories cs.CV | cs.LGPDF

Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni

TL;DR: 这篇论文提出了一种名为XMAS的新方法,用于高效选择数据以微调大型视觉语言模型(LVLM)。该方法通过分析跨模态注意力矩阵的轨迹来去除冗余数据,显著提升了训练效率并保持了模型性能。

Details

Motivation: 目前的数据选择方法在大型视觉语言模型(LVLM)上表现不佳,甚至无法超越随机选择的效果。为了解决这一问题,论文提出了XMAS方法,通过跨模态对齐轨迹来选择最具信息量的数据。

Result: 实验结果显示,XMAS能够丢弃50%-85%的训练数据,同时完全保留LLaVA-1.5-7B模型在10个下游任务上的性能,并将训练速度提升1.2倍,数据缩减效果比现有基线高出30%。

Insight: 跨模态注意力矩阵的轨迹可以作为筛选高质量训练数据的有效指标,通过聚类和平衡采样,能够在保持模型性能的同时显著提升训练效率。

Abstract: Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project’s website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.


[8] Purrception: Variational Flow Matching for Vector-Quantized Image Generation cs.CV | cs.AI | cs.LGPDF

Răzvan-Andrei Matişan, Vincent Tao Hu, Grigory Bartosh, Björn Ommer, Cees G. M. Snoek

TL;DR: Purrception是一种变分流匹配方法,用于向量量化的图像生成,结合了连续传输动态和显式分类监督,提升训练效率。

Details

Motivation: 现有的图像生成方法中,连续流匹配和离散监督方法的优势未能充分结合。Purrception旨在填补这一空白,通过变分流匹配实现高效的向量量化图像生成。

Result: 在ImageNet-1k 256x256生成任务上,训练速度优于连续和离散流匹配基准,FID分数与SOTA模型相当。

Insight: 变分流匹配可以有效桥接连续传输和离散监督,提升图像生成的训练效率和性能。

Abstract: We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.


[9] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging cs.CV | cs.AIPDF

Yuxuan Ou, Ning Bi, Jiazhen Pan, Jiancheng Yang, Boliang Yu

TL;DR: 论文提出了AortaDiff,一个统一的多任务扩散框架,用于从非对比CT扫描生成合成对比增强CT图像,并同时分割主动脉腔和血栓。该方法通过结合条件扩散模型和多任务学习,避免了多阶段管道的误差累积,提升了性能和临床实用性。

Details

Motivation: 传统的对比增强CT(CECT)在评估腹部主动脉瘤(AAA)时需要碘造影剂,但其具有肾毒性、过敏风险和环境污染等问题。现有的深度学习方法采用多阶段流程,会导致误差累积且无法充分利用共享的语义和解剖结构。

Result: 在264名患者数据上,模型性能优于单任务和多阶段方法。图像合成的PSNR达25.61 dB,分割任务的腔Dice分数为0.89,血栓Dice分数为0.53,显著提升了临床测量的准确性。

Insight: 多任务联合优化和参数共享能有效提升性能,半监督策略增强了模型对临床数据的鲁棒性,为减少造影剂使用提供了新思路。

Abstract: While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.


[10] From Videos to Indexed Knowledge Graphs – Framework to Marry Methods for Multimodal Content Analysis and Understanding cs.CV | cs.AI | cs.CL | cs.IRPDF

Basem Rizk, Joel Walsh, Mark Core, Benjamin Nye

TL;DR: 本文提出了一個框架,將多模態內容分析與預訓練模型結合,將影片轉換為可查詢的時序半結構化知識圖譜,支持持續學習。

Details

Motivation: 多模態內容分析複雜且計算成本高,現有預訓練模型多用於靜態數據,而結合這些模型處理影片數據仍具挑戰性。

Result: 實現了影片到知識圖譜的轉換,提供可查詢且支持持續學習的表示。

Insight: 結合預訓練模型與知識圖譜技術,能有效處理多模態數據並支持動態知識更新。

Abstract: Analysis of multi-modal content can be tricky, computationally expensive, and require a significant amount of engineering efforts. Lots of work with pre-trained models on static data is out there, yet fusing these opensource models and methods with complex data such as videos is relatively challenging. In this paper, we present a framework that enables efficiently prototyping pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We translate this structure further to a frame-level indexed knowledge graph representation that is query-able and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.


[11] WALT: Web Agents that Learn Tools cs.CV | cs.AI | cs.LGPDF

Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan

TL;DR: WALT提出了一种通过反向工程学习网站功能的框架,将网站功能转化为可复用的工具,减少了代理对逐步推理的依赖。

Details

Motivation: 当前基于UI交互和LLM推理的Web代理方法在动态布局和长任务中表现脆弱,而人类通过高级操作(如搜索、过滤)更高效。WALT旨在模仿这一特点。

Result: 在VisualWebArena和WebArena基准测试中,WALT以更少步骤和更低LLM依赖性实现了更高的成功率。

Insight: 将网站功能抽象为工具可以显著提升Web代理的效率和鲁棒性,为浏览器自动化提供了一种通用范式。

Abstract: Web agents promise to automate complex browser tasks, but current methods remain brittle – relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites – spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.


[12] MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation cs.CVPDF

Meilong Xu, Xiaoling Hu, Shahira Abousamra, Chen Li, Chao Chen

TL;DR: 论文提出了一种半监督分割框架MATCH,通过多视角扰动预测和拓扑一致性,有效识别并保留病理图像中的语义结构,解决了密集分布对象的挑战。

Details

Motivation: 在病理图像分析中,对象密集分布且标注成本高,半监督分割需从无标注数据中捕获语义结构,但现有方法难以区分生物学结构与噪声。

Result: 实验表明MATCH减少了拓扑错误,提升了分割的鲁棒性和准确性,尤其适用于密集对象的分割任务。

Insight: 多视角扰动和拓扑一致性可有效区分真实结构与噪声,适用于标注稀缺的病理图像分析。

Abstract: In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at \href{https://github.com/Melon-Xu/MATCH}{https://github.com/Melon-Xu/MATCH}.


[13] Towards Better Optimization For Listwise Preference in Diffusion Models cs.CVPDF

Jiamu Bai, Xin Yu, Meilong Xu, Weitao Lu, Xin Pan

TL;DR: 本文提出了Diffusion-LPO框架,用于在扩散模型中优化列表偏好数据,利用Plackett-Luce模型扩展DPO目标,显著提升了模型的视觉质量和偏好对齐效果。

Details

Motivation: 现有DPO方法在扩散模型中主要依赖成对偏好数据,忽略了人类反馈中的隐含排名信息,限制了偏好表达的精确性。

Result: Diffusion-LPO在文本到图像生成、图像编辑和个性化偏好对齐任务中,均显著优于成对DPO基线。

Insight: 列表偏好数据比成对偏好包含更多信息,能更准确地捕捉人类偏好,提升模型的生成质量和对齐效果。

Abstract: Reinforcement learning from human feedback (RLHF) has proven effectiveness for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.


[14] Growing Visual Generative Capacity for Pre-Trained MLLMs cs.CV | cs.LGPDF

Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin

TL;DR: 本文提出了一种名为Bridge的纯自回归统一多模态大语言模型(MLLM),通过Mixture-of-Transformers架构增强了预训练视觉理解模型的生成能力,实现了在单个下一词预测框架中的图像理解和生成。

Details

Motivation: 当前构建支持理解和生成的统一MLLM面临挑战:混合方法虽然能生成高质量图像但破坏了自回归范式,而纯自回归方法则在语义对齐和像素级保真度之间存在权衡。

Result: 在多样化多模态基准测试中,Bridge在理解和生成任务上均取得竞争性或更优的结果,且所需训练数据和训练时间更少。

Insight: 通过语义和像素令牌的结合,Bridge在保持语言对齐的同时实现了对视觉细节的精确描述,为统一MLLM的设计提供了新思路。

Abstract: Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.


[15] Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations cs.CV | cs.AI | cs.HC | I.2.m; H.5.2PDF

Ricardo Gonzalez Penuela, Felipe Arias-Russi, Victor Capriles

TL;DR: 该论文提出了一种方法,通过历史性问题引导多模态大语言模型(MLLMs)为盲人或低视力用户生成更相关的视觉描述,避免冗长且无关的信息。

Details

Motivation: 现有的MLLMs在为盲人或低视力用户提供视觉描述时,通常生成冗长且不考虑上下文的内容,导致用户体验不佳。

Result: 评估显示,76.1%的情境中,上下文感知的描述能预测并回答用户问题,54.4%的用户偏好这种描述。

Insight: 结合历史问题能显著提升MLLMs的描述相关性,优化用户体验。

Abstract: Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually-relevant information, we developed a system that draws on historical BLV users questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users’ questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews, and data analysis are publicly available in a Github repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions .


[16] ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models cs.CV | cs.LGPDF

Krishna Teja Chitty-Venkata, Murali Emani

TL;DR: 该论文提出了ImageNet-Think-250K数据集,旨在支持具有显式推理能力的视觉语言模型(VLM)的开发。数据集包含25万张图像,附带结构化思维标记和答案,由两个先进的VLM生成。

Details

Motivation: 目前VLM在显式推理能力方面存在不足,需要大规模数据集支持其训练和评估。作者希望通过合成数据集填补这一空白,推动多模态推理研究。

Result: 数据集包含25万图像及其推理过程和最终答案,为多模态推理模型提供训练和评估资源。

Insight: 合成数据集可以填补VLM在显式推理能力上的不足,推动多模态推理研究的进一步发展。

Abstract: We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.


[17] NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems cs.CV | eess.SP | math.OCPDF

Roman Jacome, Romario Gualdrón-Hurtado, Leon Suarez, Henry Arguello

TL;DR: 本文提出了一种新颖的正则化方法NPN,通过神经网络的非线性投影来利用感知矩阵的零空间结构,改进成像逆问题的重建效果。

Details

Motivation: 成像逆问题通常依赖于手工设计的正则化器或学习模型,但这些方法忽视了感知矩阵零空间的特定结构信息。NPN方法旨在利用这些结构信息,提高重建的准确性和适应性。

Result: 实验表明,NPN方法在压缩感知、去模糊、超分辨率、计算机断层扫描和磁共振成像等多种逆问题中显著提高了重建质量。

Insight: NPN方法的成功表明,利用感知矩阵零空间的结构信息可以为成像逆问题提供更有效的正则化策略,同时与其他方法兼容。

Abstract: Imaging inverse problems aims to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinite solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose \textit{Non-Linear Projections of the Null-Space} (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix’s null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.


[18] Automated Genomic Interpretation via Concept Bottleneck Models for Medical Robotics cs.CV | q-bio.OTPDF

Zijun Li, Jinchang Zhang, Ming Zhang, Guoyu Lu

TL;DR: 该论文提出了一种自动化基因组解释模块,结合混沌游戏表示(CGR)和概念瓶颈模型(CBM),通过生物学概念生成可解释的决策,适用于医疗机器人系统。

Details

Motivation: 基因组数据的自动化解释在临床应用中至关重要,但现有方法缺乏可解释性和可靠性。论文旨在填补这一空白,为医疗自动化和机器人系统提供可靠的基础。

Result: 在HIV亚型分类任务中实现了SOTA性能,同时提供了更高的概念预测保真度和成本效益优化。

Insight: 通过生物学概念的可解释性设计,该框架为基因组医学中的自动化决策提供了可靠且高效的解决方案。

Abstract: We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state of the art classification performance, superior concept prediction fidelity, and more favorable cost benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.


[19] VLA-R1: Enhancing Reasoning in Vision-Language-Action Models cs.CV | cs.ROPDF

Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang

TL;DR: VLA-R1是增强视觉-语言-动作模型推理能力的框架,结合强化学习与策略优化,提升多任务泛化能力。

Details

Motivation: 当前VLA模型缺乏显式逐步推理,忽视了动作生成的几何与功能约束,且训练管道未强化推理质量。

Result: 在仿真与真实机器人平台上优于现有VLA方法,实现更优泛化与执行精度。

Insight: 显式推理与可验证奖励对VLA模型的几何与功能约束建模至关重要;链式思维数据能有效增强多模态推理能力。

Abstract: Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.


[20] Discrete Facial Encoding: : A Framework for Data-driven Facial Display Discovery cs.CVPDF

Minh Tran, Maksim Siniukov, Zhangyu Jin, Mohammad Soleymani

TL;DR: DFE提出了一种无监督的数据驱动框架,通过RVQ-VAE学习3D网格序列的紧凑、可解释面部表情字典,优于FACS和其他编码方法。

Details

Motivation: 现有面部表情编码系统(如FACS)覆盖率有限且标注成本高,需要一种更高效、数据驱动的替代方案。

Result: DFE在压力检测、人格预测和抑郁检测任务中优于FACS和其他模型,覆盖更广泛的面部行为。

Insight: DFE展示了无监督学习在面部表情分析中的潜力,为心理学和情感计算提供了一种可扩展的解决方案。

Abstract: Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative of compact and interpretable dictionary of facial expressions from 3D mesh sequences learned through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.


[21] Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale cs.CV | cs.ROPDF

Yongbo Chen, Yanhao Zhang, Shaifali Parashar, Liang Zhao, Shoudong Huang

TL;DR: 论文提出了一种名为Con-NRSfM的新方法,用于非刚性运动结构恢复(NRSfM),通过微分几何技术恢复局部共形尺度,提高了深度估计的精度和鲁棒性。

Details

Motivation: 传统NRSfM方法依赖严格的局部平面或线性变形假设,未能恢复共形尺度,限制了准确性和适应性。本文旨在消除这些限制。

Result: 合成和真实数据集实验表明,该方法在重建精度和鲁棒性上超越现有方法。

Insight: 解耦深度和共形尺度约束是关键创新,提升了NRSfM的灵活性;结合自监督学习为密集重建提供新思路。

Abstract: Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness. The code for the proposed method will be made publicly available on the project website: https://sites.google.com/view/con-nrsfm.


[22] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction cs.CVPDF

Jin Cao, Hongrui Wu, Ziyong Feng, Hujun Bao, Xiaowei Zhou

TL;DR: UniVerse通过将鲁棒的3D重建解耦为恢复和重建两个子任务,利用视频扩散模型实现高效的图像一致性恢复和场景重建。

Details

Motivation: 解决多视角图像不一致导致的3D重建难题,传统方法依赖密集观测且优化困难。

Result: 在合成和真实数据集上表现出强泛化能力和优异性能,且支持3D场景风格控制。

Insight: 视频扩散模型的大规模学习能力为图像恢复和3D重建提供了通用且鲁棒的先验知识。

Abstract: This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations.However, these methods rely heavily on dense observations for robustly optimizing model parameters.To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process.To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images.Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies.Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/


[23] Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning cs.CV | cs.AIPDF

Xuchen Li, Xuzhao Li, Jiahui Gao, Renjie Pi, Shiyu Hu

TL;DR: 该论文提出了一个自适应像素推理框架,通过动态决定何时需要像素级操作来解决视觉语言模型在精细视觉任务中的低效和分心问题,显著提升了性能并减少了不必要的视觉操作。

Details

Motivation: 视觉语言模型在多模态任务中表现优异,但在需要精细视觉元素理解的任务中常因信息丢失或关键区域注意力不足而表现不佳。现有方法虽然引入了像素级信息,但可能导致过度使用和低效问题。

Result: 在HR-Bench 4K上达到73.4%准确率,工具使用率仅为20.1%,相比前方法准确率提升的同时工具使用率减少66.5%。

Insight: 动态调整像素级操作的调用时机是提升视觉语言模型效率和性能的关键策略。

Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model’s own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to the previous methods.


[24] Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring cs.CVPDF

Han-Jay Shu, Wei-Ning Chiu, Shun-Ting Chang, Meng-Ping Huang, Takeshi Tohyama

TL;DR: 该论文提出了一个基于增强敏感性的风险评分(ASRS)框架,用于识别胸部X光(CXR)模型中的错误倾向病例,通过检测嵌入变化来改善医疗AI的公平性和安全性。

Details

Motivation: 深度学习模型在CXR解释中表现优异,但在不同患者亚组中准确性不均,存在隐藏的失败案例。现有方法(如置信度校准或分布外检测)难以捕捉分布内的细微错误。

Result: 高敏感性病例的召回率显著降低(-0.2至-0.3),尽管AUROC和置信度较高,表明ASRS能有效识别易错案例。

Insight: ASRS为选择性预测和临床审查提供了一种无标签方法,有助于弥补现有技术在医疗AI公平性和安全性中的不足。

Abstract: Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches – based on confidence calibration or out-of-distribution (OOD) detection – struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.


[25] FreeViS: Training-free Video Stylization with Inconsistent References cs.CVPDF

Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel

TL;DR: FreeViS是一种无需训练的视频风格化框架,通过整合多风格参考到预训练的I2V模型,解决传统帧间风格不一致及训练开销大的问题,并利用高频补偿和光流运动线索提升时间一致性。

Details

Motivation: 现有视频风格化方法存在帧间不一致或需要昂贵的训练成本,FreeViS旨在提供一种无需训练的解决方案,同时保证风格丰富和时间一致性。

Result: FreeViS在风格保真度和时间一致性上优于现有基线,获得更高人类偏好。

Insight: 高频补偿和光流运动线索是提升视频风格化时间一致性的关键,无需训练的框架更具实用性。

Abstract: Video stylization plays a key role in content creation, but it remains a challenging problem. Na"ively applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references to a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/


[26] MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs cs.CVPDF

Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning

TL;DR: 论文提出的MedQ-Bench是一个评估多模态大语言模型(MLLMs)在医学图像质量评估(IQA)能力的综合基准,通过感知和推理任务解决了现有方法的局限性。

Details

Motivation: 现有医学图像质量评估方法过于依赖标量评分,无法模拟专家的描述性推理过程,无法满足临床AI的需求。

Result: 评估了14个先进MLLMs,发现它们具备初步但不稳定的感知和推理能力,尚未达到可靠的临床使用标准。人类-AI对齐验证表明仍需优化。

Insight: MLLMs在医学IQA领域潜力巨大,但需要针对性优化。MedQ-Bench为未来研究提供了新方向,推动了MLLMs在医学图像质量评估中的应用。

Abstract: Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.


[27] Holistic Order Prediction in Natural Scenes cs.CV | cs.AI | cs.LGPDF

Pierre Musacchio, Hyunmin Lee, Jaesik Park

TL;DR: InstaFormer是一种能够在单次前向传播中预测场景中所有实例的遮挡和深度顺序的网络,仅需RGB图像输入,避免了昂贵的输入格式和推理成本。

Details

Motivation: 现代视觉模型在理解实例级几何关系时依赖昂贵的输入(如类别标签、分割掩码)和推理成本(多次前向传播)。InstaFormer旨在以单次前向传播和RGB输入解决这一问题。

Result: 通过全面基准测试和消融实验证明了方法的有效性,展现了其在预测场景实例顺序上的优势。

Insight: 对象查询与掩码描述符的交互是实现高效顺序预测的关键,单次前向传播的设计大幅降低了计算负担。

Abstract: Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, modern arts rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic amount of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.


[28] PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning cs.CV | cs.AIPDF

Raahul Krishna Durairaju, K. Saruladha

TL;DR: PyramidStyler是一个基于Transformer的神经风格迁移框架,引入金字塔位置编码和强化学习,显著提升了复杂风格和高分辨率输入的效率和质量。

Details

Motivation: 现有的CNN和Transformer模型在处理复杂风格和高分辨率输入时效率低下,亟需一种既能捕捉局部细节又能保留全局上下文的方法。

Result: 在COCO和WikiArt数据集上训练,内容和风格损失分别降低了62.6%和57.4%,推理时间仅为1.39秒。

Insight: 金字塔编码和强化学习的结合为高分辨率图像的艺术风格迁移提供了高效的解决方案。

Abstract: Neural Style Transfer (NST) has evolved from Gatys et al.’s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs–achieving 1.39 s inference–and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.


[29] LOBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction cs.CVPDF

Sheng-Hsiang Hung, Ting-Yu Yen, Wei-Fang Sun, Simon See, Shih-Hsuan Hung

TL;DR: LoBE-GS提出了一种负载均衡且高效的3D高斯泼溅框架,通过深度感知分区和优化策略解决了大规模场景重建中的负载不均和预处理开销问题,提升了训练速度和扩展性。

Details

Motivation: 现有3D高斯泼溅(3DGS)方法在大规模场景(如城市街区)中面临负载不均和预处理开销高的挑战。

Result: 在大规模城市场景中,LoBE-GS实现了比基线方法快2倍的端到端训练速度,同时保持重建质量。

Insight: 负载均衡和高效预处理是实现大规模3D高斯泼溅的关键。

Abstract: 3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework, that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians – a strong proxy for computational load – across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to $2\times$ faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.


[30] Pack and Force Your Memory: Long-form and Consistent Video Generation cs.CV | cs.AIPDF

Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu

TL;DR: 该论文提出了MemoryPack和Direct Forcing两种方法,用于解决长视频生成中的长期依赖性和误差积累问题,显著提升了视频的一致性和可靠性。

Details

Motivation: 长视频生成面临长期依赖性建模和自回归解码中误差积累的双重挑战,现有方法难以同时解决这两个问题。

Result: 该方法显著提升了长视频生成的时序一致性和可靠性,且计算效率高,复杂度线性。

Insight: 结合全局指导(如文本和图像)可以更有效地建模长期依赖关系,而改进的训练-推理对齐策略能有效减少误差积累。

Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.


[31] Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving cs.CVPDF

Cornelius Schröder, Marius-Raphael Schlüter, Markus Lienkamp

TL;DR: 该论文研究了自动驾驶中3D物体检测器的分类任务置信度校准问题,提出了一种评估全预测类别分布校准的指标,并设计了两种正则化损失项以提升校准性能。

Details

Motivation: 自动驾驶系统需要对物体检测的预测分布进行精确校准,以确保安全性和可靠性。本文关注3D物体检测器的分类置信度校准问题,强调全类别预测分布(包括主导类和非主导类)的校准重要性。

Result: 提出的全预测向量校准损失项与等渗回归结合,显著提升了CenterPoint和PillarNet的校准性能(主导类和非主导类)。DSVT-Pillar的校准则需要不同方法。

Insight: 1. 全预测分布的校准对自动驾驶至关重要,主导类和非主导类都需关注。2. 不同检测器的校准策略可能需要针对性调整。

Abstract: In autonomous systems, precise object detection and uncertainty estimation are critical for self-aware and safe operation. This work addresses confidence calibration for the classification task of 3D object detectors. We argue that it is necessary to regard the calibration of the full predictive confidence distribution over all classes and deduce a metric which captures the calibration of dominant and secondary class predictions. We propose two auxiliary regularizing loss terms which introduce either calibration of the dominant prediction or the full prediction vector as a training goal. We evaluate a range of post-hoc and train-time methods for CenterPoint, PillarNet and DSVT-Pillar and find that combining our loss term, which regularizes for calibration of the full class prediction, and isotonic regression lead to the best calibration of CenterPoint and PillarNet with respect to both dominant and secondary class predictions. We further find that DSVT-Pillar can not be jointly calibrated for dominant and secondary predictions using the same method.


[32] Leveraging Prior Knowledge of Diffusion Model for Person Search cs.CVPDF

Giyeol Kim, Sooyoung Yang, Jihyong Oh, Myungjoo Kang, Chanho Eom

TL;DR: 该论文提出了DiffPS框架,利用预训练扩散模型的先验知识改进行人搜索任务,通过三个模块解决现有方法在目标检测和重识别中的优化冲突问题,取得了CUHK-SYSU和PRW数据集上的最优性能。

Details

Motivation: 现有行人搜索方法主要基于ImageNet预训练骨干网络,难以捕捉复杂空间上下文和细粒度身份线索,且共享骨干网络导致优化冲突。

Result: 在CUHK-SYSU和PRW数据集上达到了新的最优性能。

Insight: 扩散模型的先验知识可以显著提升行人搜索任务的性能,尤其是通过多任务模块化解冲突的设计值得借鉴。

Abstract: Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.


[33] ClustViT: Clustering-based Token Merging for Semantic Segmentation cs.CV | 68T45 | I.2.10PDF

Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis

TL;DR: ClustViT通过聚类合并视觉Transformer的Token,显著降低了计算复杂度,同时在语义分割任务中保持了高精度。

Details

Motivation: 视觉Transformer(ViT)因其二次注意力复杂度限制了在实时机器人系统中的应用,尤其是在密集预测任务(如语义分割)中。

Result: 在三个数据集上减少了2.18倍的GFLOPs和1.64倍的推理时间,且保持了分割精度。

Insight: 聚类引导的Token合并是提升ViT在密集预测任务中效率的有效方法。

Abstract: Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.


[34] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs cs.CVPDF

Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao

TL;DR: 论文提出了Patch-as-Decodable Token (PaDT)的统一多模态视觉任务框架,通过Visual Reference Tokens (VRTs)直接生成文本和视觉输出,提升密集预测任务性能。

Details

Motivation: 现有MLLMs在视觉任务中依赖间接表示(如坐标文本)限制了性能,尤其是密集预测任务(如分割),需要更直接的视觉输出方式。

Result: 在四个视觉任务中达到SOTA性能,优于更大的MLLMs模型。

Insight: VRTs的动态处理和独立性设计是关键,为多模态视觉任务的统一提供新思路。

Abstract: Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with LLM’s output textual tokens. A lightweight decoder then transforms LLM’s outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest PaDT consistently achieving state-of-the-art performance, even compared with significantly larger MLLM models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.


[35] 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing cs.CVPDF

Lei Liu, Can Wang, Zhenghao Chen, Dong Xu

TL;DR: 4DGS-Craft提出了一个一致且交互式的4D高斯泼溅编辑框架,通过4D感知的InstructPix2Pix模型和多视角网格模块确保视角和时间一致性,同时通过高斯选择机制保护非编辑区域的稳定性。通过LLM模块解析用户意图,将复杂指令分解为原子操作序列。

Details

Motivation: 现有4D高斯泼溅编辑方法在视角、时间和非编辑区域一致性方面存在不足,且难以处理复杂文本指令。为解决这些问题,提出了4DGS-Craft。

Result: 该方法实现了更一致和可控的4D场景编辑,支持复杂用户指令处理。

Insight: 结合几何特征与LLM可以提升4D编辑的一致性和交互性,为动态场景编辑提供了新思路。

Abstract: Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.


[36] Pure-Pass: Fine-Grained, Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution cs.CVPDF

Junyu Wu, Jie Tang, Jie Liu, Gangshan Wu

TL;DR: Pure-Pass (PP) 是一种像素级掩码机制,通过固定颜色中心点分类像素,实现细粒度、空间灵活的掩码,从而在轻量级图像超分辨率中动态路由token混合器,提升性能。

Details

Motivation: 现有轻量级超分辨率方法如CAMixer在适应性、掩码粒度和空间灵活性方面存在不足,限制了实际部署效果。

Result: PP-ATD-light在重建质量和参数效率上优于CAMixer-ATD-light,同时计算开销相近。

Insight: 通过像素级细粒度掩码和动态路由,可以在轻量级超分辨率任务中显著提高性能。

Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.


[37] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework cs.CV | cs.AIPDF

Nanaka Hosokawa, Ryo Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu

TL;DR: GPT-4o用于牙科全景片中下颌囊肿的自动化报告生成,并提出两阶段的自校正循环结构化输出(SLSO)框架以提升准确性。实验表明SLSO在多方面优于传统CoT方法,但仍有局限性。

Details

Motivation: 传统方法在牙科影像分析中存在准确性问题,如幻觉描述和牙齿编号错误。SLSO框架旨在通过结构化输出和自校正循环提升报告生成的可靠性。

Result: SLSO在牙齿编号、牙齿移动和牙根吸收三项指标上分别提升了66.9%、33.3%和28.6%,但数据集小导致统计显著性不足。

Insight: SLSO框架通过结构化输出抑制幻觉,但多牙齿病变的识别仍需改进。未来需优化性能以实现实用化。

Abstract: In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth is limited. Nevertheless, further refinement is required to enhance overall performance and move toward a practical finding generation system.


[38] LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction cs.CV | cs.AIPDF

Mario Resino, Borja Pérez, Jaime Godoy, Abdulla Al-Kaff, Fernando García

TL;DR: LiLa-Net是一种轻量级3D自动编码器,利用LiDAR点云从真实交通环境中提取高效特征,通过简化编码器层和跳跃连接实现高性能重建。

Details

Motivation: 现有3D点云重建方法资源消耗大,需要通过轻量化设计提升效率并保持重建质量。

Result: 模型在保持性能的同时提升了重建质量,并展示了良好的泛化能力。

Insight: 轻量化设计和跳跃连接的优化是实现高效3D点云重建的关键。

Abstract: This work proposed a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments, employing only the LiDAR’s point clouds. For this purpose, we have real semi-autonomous vehicle, equipped with Velodyne LiDAR. The system leverage skip connections concept to improve the performance without using extensive resources as the state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space which allows to accurately reconstruct the original point cloud. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model demonstrates strong generalization capabilities, successfully reconstructing objects unrelated to the original traffic environment.


[39] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring cs.CVPDF

Jenna Kline, Maksim Kholiavchenko, Samuel Stevens, Nina van Tiel, Alison Zhong

TL;DR: kabr-tools是一个开源自动化多物种行为监测框架,结合无人机视频和机器学习系统,提取行为、社会和空间指标,显著提升行为数据的粒度和效率。

Details

Motivation: 传统野外观察方法是有限且耗时的,难以量化复杂多维的行为模式,因此需要一个可扩展的自动化解决方案。

Result: 相比地面观察方法,kabr-tools减少了15%的可见性损失,捕获了更高精度和行为连续性,并在三个案例研究中验证了其有效性。

Insight: 该工具为生态系统的行为研究提供了强大的自动化手段,推动了保护生物学和生态监测的进步。

Abstract: A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that, like Plains zebras, vigilance in Grevy’s zebras decreases with herd size, but, unlike Plains zebras, habitat has a negligible impact. Plains and Grevy’s zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors and observed spatial segregation between Grevy’s zebras, Plains zebras, and giraffes in mixed-species herds. By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.


[40] VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation cs.CVPDF

Arman Behnam

TL;DR: VGDM提出了一种结合视觉Transformer和扩散模型的框架,用于脑肿瘤的检测与分割,通过全局上下文推理和迭代去噪提升精度。

Details

Motivation: 传统U-Net在捕捉长距离依赖和复杂肿瘤结构方面表现有限,而扩散模型在高保真医学图像生成和分割边界细化方面展现出潜力。

Result: 实验验证表明,VGDM在Dice相似度和Hausdorff距离指标上均优于传统方法,展现了其在肿瘤分割中的潜力。

Insight: Transformer与扩散模型的结合为医学图像分割提供了一种新思路,尤其在处理复杂结构和长距离依赖时更具优势。

Abstract: Accurate detection and segmentation of brain tumors from magnetic resonance imaging (MRI) are essential for diagnosis, treatment planning, and clinical monitoring. While convolutional architectures such as U-Net have long been the backbone of medical image segmentation, their limited capacity to capture long-range dependencies constrains performance on complex tumor structures. Recent advances in diffusion models have demonstrated strong potential for generating high-fidelity medical images and refining segmentation boundaries. In this work, we propose VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation framework, a transformer-driven diffusion framework for brain tumor detection and segmentation. By embedding a vision transformer at the core of the diffusion process, the model leverages global contextual reasoning together with iterative denoising to enhance both volumetric accuracy and boundary precision. The transformer backbone enables more effective modeling of spatial relationships across entire MRI volumes, while diffusion refinement mitigates voxel-level errors and recovers fine-grained tumor details. This hybrid design provides a pathway toward improved robustness and scalability in neuro-oncology, moving beyond conventional U-Net baselines. Experimental validation on MRI brain tumor datasets demonstrates consistent gains in Dice similarity and Hausdorff distance, underscoring the potential of transformer-guided diffusion models to advance the state of the art in tumor segmentation.


[41] Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques cs.CVPDF

Walid Rabehi, Marion Le Texier, Rémi Lemoy

TL;DR: 本研究开发了一个可扩展的深度学习流水线,用于从法国历史地图中提取1925-1950年的城市足迹,填补了国家尺度数据的空白。关键创新是通过双通道U-Net方法处理历史地图的高复杂性,最终数据集准确率为73%。

Details

Motivation: 1970年代前法国城市化数据的缺失限制了定量分析。本研究旨在填补这一空白,为历史城市化研究提供高质量的国家尺度数据支持。

Result: 最终数据集的总体准确率为73%,成功捕捉了多样化的城市模式,并克服了标签和等高线等常见干扰。

Insight: 双通道方法结合数据增强可以有效提升历史地图处理的准确性;高性能计算集群是实现大规模处理的必要条件。

Abstract: Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.


[42] When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos cs.CV | cs.AIPDF

Woowon Jang, Jiwon Im, Juseung Choi, Niki Rashidian, Wesley De Neve

TL;DR: 论文分析了SAM2在手术视频中点跟踪的失败模式,发现点跟踪在手术工具上表现良好,但在解剖目标上因组织相似性和模糊边界而表现不佳。

Details

Motivation: 手术视频中的点跟踪是一种高效且低成本的交互方式,但其在复杂手术环境中的可靠性和失败模式尚未被深入理解。

Result: 点跟踪在手术工具上表现竞争性,但在解剖目标上表现较差,主要由于组织相似性和边界模糊。

Insight: 提供了改进点跟踪性能的建议,特别是在手术视频分析中选择和放置跟踪点的策略。

Abstract: Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.


[43] FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation cs.CV | 68T10PDF

Ding-Ruei Shen

TL;DR: 论文提出了一种联邦学习框架FRIEREN,结合视觉与语言模态,利用CLIP文本嵌入改进语义分割任务在无标记客户端数据上的泛化能力。

Details

Motivation: 现有的联邦学习方法通常假设客户端数据带有标记或未能充分利用现代视觉基础模型(VFM),而在实际场景中客户端数据往往是无标记的。

Result: 在合成到真实和清晰到恶劣天气的基准测试中,FRIEREN表现优异,优于现有领域泛化和适应方法。

Insight: 视觉与语言模态的结合能够显著提升联邦学习在无标记数据场景下的性能,为未来研究提供了新方向。

Abstract: Federeated Learning (FL) offers a privacy-preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from these domain shifts, particularly when client data is unlabeled. However, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server’s labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.


[44] Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting cs.CV | cs.AIPDF

Shu Zou, Xinyu Tian, Lukas Wesemann, Fabian Waschkowski, Zhaoyuan Yang

TL;DR: ASK-Hint提出了一种基于动作知识的结构化提示框架,通过细粒度提示改进视频异常检测的准确性、可解释性和泛化能力。

Details

Motivation: 现有视频异常检测方法中的提示过于抽象,忽视了细粒度的人机交互或动作语义,导致复杂异常检测效果不佳。

Result: 在UCF-Crime和XD-Violence数据集上实现了AUC的提升,并验证了框架的泛化能力和可解释性。

Insight: 提示的细粒度设计是提升视频异常检测性能的关键,ASK-Hint为无需训练的通用解决方案提供了新思路。

Abstract: Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces towards anomaly and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.


[45] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation cs.CV | cs.LGPDF

Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng

TL;DR: GeoPurify提出了一种数据高效的几何蒸馏框架,通过利用2D视觉语言模型的特征传递到3D分割,并结合几何先验知识,显著减少了训练数据需求并提升了性能。

Details

Motivation: 现有的方法在将2D视觉语言模型的特征传递到3D语义分割时存在噪声和碎片化问题,而强制几何一致性则需要大规模标注数据和昂贵训练。GeoPurify旨在解决这一问题,同时提高数据效率。

Result: 在主要3D基准测试中,GeoPurify仅使用约1.5%的训练数据即达到或超越现有最佳性能。

Insight: 几何信息在2D到3D特征传递中仍然存在,GeoPurify通过巧妙利用这些信息实现了高效的去噪和数据效率提升。

Abstract: Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify that applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure the semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our codes and checkpoints are available at https://github.com/tj12323/GeoPurify.


[46] Cross-Breed Pig Identification Using Auricular Vein Pattern Recognition: A Machine Learning Approach for Small-Scale Farming Applications cs.CV | cs.SEPDF

Emmanuel Nsengiyumvaa, Leonard Niyitegekaa, Eric Umuhoza

TL;DR: 论文提出了一种基于耳部静脉模式识别的非侵入性猪只识别方法,利用计算机视觉和机器学习技术,特别适用于小规模养殖场的混合品种猪。

Details

Motivation: 传统猪只识别方法(如耳标和微芯片)成本高、易损坏且不适用于混合品种,因此需要一种低成本、可靠的非侵入性替代方案。

Result: 系统平均处理时间为8.3秒,SVM分类精度达98.12%,验证了其高效性和实用性。

Insight: 耳部静脉模式是一种稳定且独特的生物特征,可为资源有限的农业社区提供精准养殖的低成本解决方案。

Abstract: Accurate livestock identification is a cornerstone of modern farming: it supports health monitoring, breeding programs, and productivity tracking. However, common pig identification methods, such as ear tags and microchips, are often unreliable, costly, target pure breeds, and thus impractical for small-scale farmers. To address this gap, we propose a noninvasive biometric identification approach that leverages uniqueness of the auricular vein patterns. To this end, we have collected 800 ear images from 20 mixed-breed pigs (Landrace cross Pietrain and Duroc cross Pietrain), captured using a standard smartphone and simple back lighting. A multistage computer vision pipeline was developed to enhance vein visibility, extract structural and spatial features, and generate biometric signatures. These features were then classified using machine learning models. Support Vector Machines (SVM) achieved the highest accuracy: correctly identifying pigs with 98.12% precision across mixed-breed populations. The entire process from image processing to classification was completed in an average of 8.3 seconds, demonstrating feasibility for real-time farm deployment. We believe that by replacing fragile physical identifiers with permanent biological markers, this system provides farmers with a cost-effective and stress-free method of animal identification. More broadly, the findings confirm the practicality of auricular vein biometrics for digitizing livestock management, reinforcing its potential to extend the benefits of precision farming to resource-constrained agricultural communities.


[47] MMDEW: Multipurpose Multiclass Density Estimation in the Wild cs.CVPDF

Villanelle O’Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj

TL;DR: MMDEW提出了一种多类别密度估计框架,通过Twins金字塔视觉Transformer和多尺度解码方法,提升了密集遮挡场景下的计数性能,并在生态监测等领域展示了应用潜力。

Details

Motivation: 传统基于检测的计数方法在密集和遮挡场景中效果不佳,因此需要一种多类别密度估计方法来解决这一问题。

Result: 在VisDrone和iSAID基准测试中表现优于现有方法(MAE降低33%-64%),并在生态监测数据上验证了其扩展性。

Insight: 多类别密度估计不仅适用于人群计数,还可扩展到其他领域(如生物多样性监测),为跨领域应用提供了新思路。

Abstract: Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrates superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reduction to MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method’s regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.


[48] TempoControl: Temporal Attention Guidance for Text-to-Video Models cs.CV | cs.AI | cs.LGPDF

Shira Schiber, Ofir Lindenbaum, Idan Schwartz

TL;DR: TempoControl是一种在无需重新训练或额外监督的情况下,通过优化交叉注意力图来实现文本到视频生成模型中视觉概念时间对齐的新方法。

Details

Motivation: 当前生成视频模型虽然能基于文本提示生成高质量视频,但缺乏细粒度的时间控制,无法指定视觉元素在生成序列中的具体出现时间。

Result: TempoControl在单对象和多对象的时间重排、动作和音频对齐生成等多种应用中表现出色,确保了视频的高质量和多样性。

Insight: 交叉注意力图为时间控制提供了潜力,通过优化注意力机制可以实现对生成内容的细粒度时间调控。

Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.


[49] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning cs.CV | cs.AIPDF

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu

TL;DR: 论文RewardMap通过多阶段强化学习解决细粒度视觉推理中的稀疏奖励问题,提出了难度感知奖励设计和多阶段训练框架,显著提升了模型的视觉理解和推理能力。

Details

Motivation: 细粒度视觉推理是当前多模态大语言模型(MLLMs)的核心挑战,尤其是空间推理任务中稀疏奖励和不稳定优化问题阻碍了标准强化学习的表现。

Result: 在ReasonMap和ReasonMap-Plus上的实验表明,RewardMap在各任务中均取得一致性能提升,平均提升3.47%跨6个基准测试。

Insight: 密集奖励和多阶段训练是提升细粒度视觉推理任务性能的有效策略,尤其在冷启动阶段优于传统监督微调。

Abstract: Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.


[50] DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing cs.CV | cs.AI | cs.LGPDF

Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian

TL;DR: DragFlow利用DiT的先验知识,通过区域监督和Affine变换提升基于拖拽的图像编辑性能,超越现有基线方法。

Details

Motivation: 现有基于拖拽的图像编辑方法在先验知识不足时会导致目标区域失真,而DiT的先验能力更强但未充分利用。

Result: 在DragBench-DR和ReD Bench上超越点基和区域基线方法,刷新了拖拽编辑的SOTA。

Insight: DiT的特征结构不如UNet紧凑,直接点监督不可靠,但区域监督能更一致地利用其更强先验。

Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX’s rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.


[51] From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding cs.CVPDF

Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen

TL;DR: 论文提出了一种从关键帧选择扩展到关键片段(clip)选择的方法F2C,以提升长视频理解的性能,并通过自适应分辨率策略平衡计算资源。

Details

Motivation: 现有视频大型语言模型(VLMs)因视觉标记过多而受限于上下文窗口,且稀疏帧选择忽略了关键的时间动态信息,导致运动与事件连续性推理效果不佳。

Result: 在Video-MME、LongVideoBench和MLVU基准上,F2C分别比均匀采样提升了8.1%、5.6%和10.3%。

Insight: 保持时间连贯性对视频理解至关重要,F2C提供了一种无需训练的实际解决方案,适用于大规模视频应用。

Abstract: Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the “needle in a haystack” problem: the massive number of visual tokens produced from raw video frames exhausts the model’s context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .


[52] Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities cs.CV | cs.AI | cs.LGPDF

Mario Medrano-Paredes, Carmen Fernández-González, Francisco-Javier Díaz-Pernas, Hichem Saoudi, Javier González-Alonso

TL;DR: 这篇论文比较了基于单目视频的3D人体姿态估计模型与惯性测量单元(IMU)在健康人群日常活动中的性能,发现MotionAGFormer表现最佳,同时讨论了两种技术在成本、可访问性和精度上的权衡。

Details

Motivation: 旨在评估单目视频和IMU传感器在实际场景中对人体运动捕捉的准确性,为远程医疗、运动科学和康复提供可靠的工具。

Result: MotionAGFormer表现最优,平均RMSE为9.27度±4.80度,MAE为7.86度±4.18度,Pearson相关系数为0.86±0.15,R²为0.67±0.28。

Insight: 单目视频和IMU在临床运动评估中均可行,但需权衡成本、精度和易用性;MotionAGFormer在健康成年人群中表现突出,为远程监测提供了有前景的解决方案。

Abstract: Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27\deg \pm 4.80\deg$) and MAE ($7.86\deg \pm 4.18\deg$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.


[53] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes cs.CV | cs.HCPDF

Shiyi Zhang, Dong Liang, Yihang Zhou

TL;DR: NeuroSwift是一个轻量级的跨被试框架,通过结合AutoKL和CLIP适配器,实现了对fMRI数据的视觉重建,尤其擅长复杂场景的跨被试重建。

Details

Motivation: 解决现有方法在跨被试fMRI数据中因神经表征差异和语义抽象编码导致的准确率低和计算复杂度高的问题。

Result: 在轻量级GPU(3 RTX 4090)上仅需1小时训练即可达到SOTA性能,优于现有方法。

Insight: 通过模块化设计和部分参数微调,显著提升了跨被试泛化能力,降低了计算成本。

Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain’s abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift’s CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.


[54] microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification cs.CV | cs.AIPDF

Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan

TL;DR: microCLIP提出了一种无监督的自训练框架,通过细粒度标记融合技术改进CLIP在细粒度图像分类中的表现,利用Saliency-Oriented Attention Pooling(SOAP)和动态知识聚合,显著提升了分类精度。

Details

Motivation: CLIP在细粒度分类任务中表现受限,因其依赖全局特征而忽略了局部细节。现有方法通过语言模型描述对齐CLIP的[CLS]标记,但缺乏空间精确性。microCLIP旨在通过这些局限性改进CLIP的性能。

Result: 在13个细粒度基准测试中,平均精度提升2.90%,且仅需轻量级调整。

Insight: 通过结合粗-细粒度特征和动态知识聚合,可以显著改进CLIP在细粒度任务中的表现,同时保持模型的轻量化和稳定性。

Abstract: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP’s visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion’s evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.


[55] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL cs.CV | cs.LGPDF

Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen

TL;DR: VidGuard-R1是一款基于多模态大语言模型(MLLM)和强化学习(RL)的视频真实性检测工具,通过GRPO算法优化,不仅能高精度分类AI生成视频,还能提供可解释的推理。

Details

Motivation: 随着AI生成视频技术的快速发展,社会面临虚假信息和声誉损害的挑战,亟需既准确又可解释的检测工具。

Result: 在零样本测试中表现最佳,训练后准确率超95%,并能生成精确的解释。

Insight: 模型的可解释性是检测工具的关键需求,GRPO和多模态结合为AI生成内容检测提供了新思路。

Abstract: With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.


[56] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation cs.CV | cs.AIPDF

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li

TL;DR: 论文提出了一种称为Self-Forcing++的方法,旨在解决长视频生成中的质量退化问题,无需依赖长视频监督或重新训练,生成长达4分15秒的高质量视频。

Details

Motivation: 当前扩散模型虽在图像和视频生成中表现出色,但对长视频生成的扩展计算成本过高,且现有自回归方法因误差累积导致质量下降。

Result: 实验表明,该方法能生成长达4分15秒的视频,质量和一致性均显著优于基线方法。

Insight: 研究表明,利用教师模型的局部指导可有效提升长视频生成的全局一致性,避免误差累积问题。

Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher’s capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model’s position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/


[57] Learning to Generate Object Interactions with Physics-Guided Video Diffusion cs.CV | cs.AI | cs.LGPDF

David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev

TL;DR: KineMask 是一种物理引导的视频生成方法,通过两阶段训练策略和掩码监督提升物体交互的物理真实性,同时结合低级运动控制和高级文本条件,显著优于同类模型。

Details

Motivation: 现有视频生成模型在物理真实的物体交互和物理基础控制方面表现不足,限制了其在机器人学和决策模拟中的应用。

Result: 在合成和真实场景中,KineMask 显著改善了物体交互的物理真实性,优于同类模型。

Insight: 低级运动控制和高级文本条件在视频扩散模型中具有互补作用,共同提升生成的物理真实性和复杂性。

Abstract: Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.


[58] MultiModal Action Conditioned Video Generation cs.CVPDF

Yichen Li, Antonio Torralba

TL;DR: 论文提出了一种多模态动作条件视频生成方法,通过引入精细的多模态感官数据(如本体感觉、运动感觉、力触觉等),解决了现有视频模型在精细控制方面的不足,并提出了特征学习和正则化方案以提升模拟精度和时间稳定性。

Details

Motivation: 现有视频模型缺乏精细控制能力,无法满足通用家用机器人对实时精细操作的需求。论文旨在通过多模态感官数据捕捉精确控制,以提升模拟的精细度和实用性。

Result: 实验表明,多模态感官数据提高了模拟精度并减少了时间漂移。广泛的消融研究和下游应用验证了方法的有效性。

Insight: 多模态感官数据是模拟精细互动的关键,特征对齐和因果性增强对于提升视频生成模型的实用性至关重要。

Abstract: Current video models fail as world model as they lack fine-graiend control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enables fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.


[59] VideoNSA: Native Sparse Attention Scales Video Understanding cs.CV | cs.AI | cs.LGPDF

Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan

TL;DR: VideoNSA 通过原生稀疏注意力(NSA)解决了视频理解中上下文长度限制的问题,提出了一种硬件感知的混合注意力方法,显著提升了长视频理解和时空推理能力。

Details

Motivation: 现有视频-语言模型在处理长视频时,由于上下文长度限制,往往错过关键过渡帧且难以保持长期一致性。

Result: 1. 支持128K tokens的可靠扩展;2. 在长视频理解、时空推理等任务中表现优于基线;3. 发现全局-局部注意力分配的优化比例。

Insight: 1. 任务依赖的分支使用模式;2. 可学习的稀疏注意力能诱导动态注意力汇聚;3. 固定预算下全局-局部注意力分配的重要性。

Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention help induce dynamic attention sinks.


[60] Inferring Dynamic Physical Properties from Video Foundation Models cs.CV | cs.LGPDF

Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman

TL;DR: 论文研究了从视频中预测动态物理属性的任务,包括弹性、粘性和动态摩擦,提出了新的数据集和方法,并对比了不同模型的性能。

Details

Motivation: 动态物理属性的预测需要结合时间信息,传统方法难以直接从视频中提取这些属性,因此需要探索新的方法来解决这一问题。

Result: 生成或自监督的视频基础模型表现相似,但不及传统方法;多模态大语言模型性能较弱,但通过合适提示可以提升。

Insight: 视频基础模型在动态物理属性预测任务中具有一定潜力,但仍有提升空间,尤其是多模态大语言模型的性能改进。

Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.


Mengyu Yang, Yiming Chen, Haozheng Pei, Siddhant Agarwal, Arun Balajee Vasudevan

TL;DR: 这篇论文提出了一个新任务——发声物体检测,旨在通过多模态框架从真实世界的交互中学习物体的声音,并通过对象分割掩码和slot attention视觉编码器提升性能。

Details

Motivation: 人类能够通过声音辨别物体的交互对象,受此启发,论文希望模型也能学习这种能力,区分不同物体交互产生的声音。

Result: 在新任务和现有多模态动作理解任务上都达到了最先进性能。

Insight: 对象分割和slot attention的结合能有效提升模型对多模态信息的理解能力。

Abstract: Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model’s ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model’s focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state of the art performance on our new task along with existing multimodal action understanding tasks.


cs.CL [Back]

[62] Towards Open-Ended Discovery for Low-Resource NLP cs.CL | cs.AIPDF

Bonaventure F. P. Dossou, Henri Aïdasso

TL;DR: 这篇论文呼吁在低资源自然语言处理(NLP)领域实现范式转变,从依赖静态数据集转向开放式交互式语言发现,通过动态对话学习新语言。

Details

Motivation: 当前的NLP技术依赖大规模预收集数据和集中式基础设施,这对低资源语言社区来说难以实现。论文主张通过人类-机器协作的动态学习过程来解决这一问题。

Result: 论文未提供具体实验结果,但提出了一个理论框架和未来研究方向。

Insight: 未来的语言技术应尊重和赋能社区,通过合作学习发现和保护语言多样性,这与人本AI的原则一致。

Abstract: Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world’s linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.


[63] Context Matters: Comparison of commercial large language tools in veterinary medicine cs.CL | cs.AIPDF

Tyler J Poore, Christopher J Pinard, Aleena Shabbir, Andrew Lagree, Andre Telfer

TL;DR: 本文评估了三款兽医领域的大语言模型(LLM)总结工具在标准化兽医肿瘤学记录数据集上的表现,发现Product 1表现最佳,尤其在事实准确性和时间顺序方面。

Details

Motivation: 大语言模型在临床应用中日益普及,但它们在兽医医学领域的性能尚未得到充分研究。本文旨在填补这一空白。

Result: Product 1的中位平均分数最高(4.61),且在事实准确性和时间顺序方面表现完美;评分框架的可重复性高(标准差低)。

Insight: 兽医专用的商业LLM工具效果更优,且LLM-as-a-judge评测方法具有可扩展性和可重复性。

Abstract: Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.


[64] EEFSUVA: A New Mathematical Olympiad Benchmark cs.CL | math.HOPDF

Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner

TL;DR: EEFSUVA是一个新的数学奥赛评测基准,旨在更全面地评估大型语言模型(LLM)在数学推理上的能力。该基准从东欧和前苏联国家的较少传播的区域性和国家级奥赛题目中选取,难度与国际数学奥赛(IMO)相当,但题目类型更非标准,能更真实反映模型推理能力。初步结果显示,即使是当前最先进的LLM在EEFSUVA上表现也显著下降。

Details

Motivation: 现有数学评测基准(如IMO)可能存在数据污染和题目类型单一的问题,导致高估LLM的数学推理能力。为此,需要更全面、真实的评测数据集,以准确评估模型的数学理解水平。

Result: 初步结果显示,即使是当前最先进的LLM在EEFSUVA上的表现显著低于其他奥赛风格基准,表明现有评测可能高估了模型的数学推理能力。

Insight: 评测数据集的多样性和真实性对准确评估LLM的数学推理能力至关重要。EEFSUVA的成功表明,未来的模型开发需要更广泛的评测数据支持,以避免狭隘的高估现象。

Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold medal Olympiad to graduate level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under circulated regional and national Olympiads of Eastern Europe and the countries from the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.


[65] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision cs.CL | cs.AIPDF

Dimitar Peshevski, Kiril Blazhevski, Martin Popovski, Gjorgji Madjarov

TL;DR: 论文提出了一种使用LLM生成合成数据和监督的方法,以替代人工标注数据,从而低成本地提升小模型的重排性能。

Details

Motivation: 重排任务需要大量人工标注数据,成本高且稀缺;LLM虽然性能强但计算成本高,限制了实际应用。

Result: 在MedQuAD数据集上的实验表明,该方法显著提升了领域内性能,并具有良好的领域外泛化能力。

Insight: 通过LLM生成数据和监督而非直接推理,能降低成本同时保持性能,为小模型优化提供了新思路。

Abstract: Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.


[66] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks cs.CL | cs.AIPDF

Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park

TL;DR: 论文提出了Benchmark Profiling框架,通过分解评测基准的性能到十种认知能力,揭示了当前LLM评测基准的多能力需求和潜在局限性。

Details

Motivation: 当前LLM评测基准的能力标签(如推理、常识)缺乏系统性验证,导致评测分数可能高估模型的实际能力。

Result: 发现多数评测基准依赖多种能力;相似标签的数据集能力需求不同;代码生成评测需要多能力改进;无关能力可能负面影响表现。

Insight: Benchmark Profiling揭示了评测基准的实际能力需求,为模型解释性和评测审计提供了透明工具。

Abstract: Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model’s success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.


[67] LLMRank: Understanding LLM Strengths for Model Routing cs.CLPDF

Shubham Agrawal, Prasang Gupta

TL;DR: LLMRank是一个基于提示感知的路由框架,通过提取多维度特征和轻量级代理求解器信号,优化大型语言模型(LLM)的选择,以平衡性能和效率。

Details

Motivation: 随着多样化大型语言模型的快速发展,如何在延迟和计算成本之间平衡性能成为关键挑战。

Result: LLMRank达到89.2%的Oracle性能,同时提供可解释的特征归因。

Insight: 多维度特征提取和混合排序目标是实现高效透明LLM部署的关键。

Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, latency and computational costs presents a critical deployment challenge: selecting the most suitable model for each prompt to optimize the trade-off between performance and efficiency. We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts, including task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench, comprising 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, from small efficient models to large frontier systems. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions. Extensive studies demonstrate the importance of multifaceted feature extraction and the hybrid ranking objective, highlighting the potential of feature-driven routing for efficient and transparent LLM deployment.


[68] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings cs.CL | cs.LGPDF

Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque

TL;DR: 论文提出了GRPO++方法,通过多阶段的资源高效训练流程(DermIQ-VLM),增强低资源环境下皮肤病诊断的推理能力。

Details

Motivation: 现有视觉-语言模型(VLM)在皮肤病学等复杂领域的结构化推理能力受限于数据稀缺和计算成本高的问题。

Result: 在皮肤病数据集上的初步评估显示,该方法优于标准微调方法。

Insight: 通过资源高效的多阶段训练流程,可以在低资源环境下开发可靠的专用VLM。

Abstract: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist’s diagnostic process. Our primary contribution is a modified version of Grouped Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeline as a feasible pathway for developing specialized, reliable VLMs in resource-constrained environments.


[69] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation cs.CLPDF

Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai

TL;DR: 论文提出了两个互补的数学评测基准:SKYLENAGE-ReasoningMATH(结构化诊断集)和SKYLENAGE-MATH(竞赛风格评测集),用于多层级数学能力评估。评测结果显示,当前最强模型在竞赛集上表现44%,在诊断集上表现81%,揭示了模型在不同难度和层级上的性能差异。

Details

Motivation: 现有大语言模型(LLMs)在公开数学评测集上表现接近天花板,缺乏区分前沿模型的挑战性评测。因此需要设计更具区分度的数学评测基准。

Result: 1)竞赛集中最强模型准确率为44%,高中到博士层级性能逐渐下降;2)诊断集中最强模型准确率为81%,但最难子集揭示了模型间的鲁棒性差距。

Insight: SKYLENAGE基准通过分层难度和结构化元数据,为数学推理能力评估提供了更细粒度和挑战性的参考。结果显示,当前LLMs在高层级数学任务上仍有显著提升空间。

Abstract: Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject x model and grade x model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.


[70] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI cs.CL | cs.AI | cs.IT | cs.LG | math.IT | 68T05, 03C95, 94A17, 68Q85 | I.2.0; H.1.2; H.1.1; H.1.0; F.4.0PDF

Seyma Yaman Kayadibi

TL;DR: 论文提出了一个人工年龄分数(AAS),用于量化生成式AI的记忆老化现象。AAS基于熵和信息重叠理论,通过实验验证其在语义和情景记忆中的表现。

Details

Motivation: 尽管大型语言模型(LLM)在语义记忆上表现稳定,但在情景记忆中容易因会话重置而遗忘。研究者希望建立一个理论框架来量化这种记忆老化现象。

Result: 持久会话中,模型能同时保持语义和情景记忆(AAS趋于最小值,表示年轻化状态);会话重置后,情景记忆丢失,AAS显著增加,表明记忆老化。

Insight: 1. AAS可用于评估AI系统的记忆退化;2. LLM的记忆老化结构与人类类似;3. 方法基于信息论和自动机理论,具有广泛的适用性。

Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann’s work on automata, Shannon’s theories of information and redundancy, and Turing’s behavioral approach to intelligence.


[71] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing cs.CLPDF

Yisong Xiao, Aishan Liu, Siyuan Liang, Zonghao Ying, Xianglong Liu

TL;DR: 论文提出了ARGRE框架,通过自回归奖励引导的表征编辑实现LLMs的去毒化,显著降低了毒性内容生成,同时提高了效率。

Details

Motivation: LLMs虽然在多项任务中表现优异,但容易生成有毒内容。现有的测试时去毒方法由于对毒性和非毒性输出的转换空间探索不足,导致干预不精确。

Result: 在8个广泛使用的LLMs上实验表明,ARGRE在毒性降低62.21%和推理时间减少47.58%上显著优于基线方法。

Insight: 通过显式建模毒性转换和自适应编辑,ARGRE不仅提升了去毒效果,还保持了原始模型的核心能力。

Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose \textsc{A}utoregressive \textsc{R}eward \textsc{G}uided \textsc{R}epresentation \textsc{E}diting (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.


[72] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering cs.CLPDF

Jiaqing Xie

TL;DR: 这篇论文比较了稀疏自编码器(SAE)和激活差异方法在语言模型引导中的应用。研究发现,传统SAE的top-k潜在特征可能捕获非语义信息,因此提出了聚焦单一最相关潜在特征(top-1)的方法。此外,论文提出了一种令牌级衰减的引导策略,解决了恒定SAE引导导致的退化输出问题。实验表明,SAE在数学推理任务上优于激活差异方法。

Details

Motivation: 传统稀疏自编码器的top-k潜在特征可能包含非语义冗余信息(如标点符号),而恒定SAE引导可能导致退化输出(如重复单词),这限制了它们在语言模型引导中的效果。

Result: SAE在数学推理任务上显著优于激活差异方法,并在IF-Eval任务上表现相当。引导与推理相关的SAE潜在特征能有效引发逐步数学推理行为。

Insight: 聚焦单一潜在特征(top-1)和动态衰减引导的策略能够更高效地捕捉语义信息并避免退化输出,为语言模型引导提供了新思路。

Abstract: Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.


[73] Let’s Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models’ Understanding of Sports cs.CL | cs.AIPDF

Punit Kumar Singh, Nishant Kumar, Akash Ghosh, Kunal Pasad, Khushi Soni

TL;DR: 论文提出了CultSportQA基准,用于评估语言模型对全球60个国家传统体育的理解,填补了现有评估忽视区域性和土著体育的空白。

Details

Motivation: 现有的语言模型评估主要关注全球流行体育项目,忽视了区域性和传统体育文化,导致模型在这些领域的表现缺乏评估。

Result: CultSportQA为评估模型在传统体育领域的理解和推理能力建立了新标准。

Insight: 该研究揭示了当前语言模型在理解和推理区域性体育文化方面的局限性,为未来多文化AI评估提供了方向。

Abstract: Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce \textbf{\textit{CultSportQA}}, a benchmark designed to assess LMs’ understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, \textbf{\textit{CultSportQA}} establishes a new standard for assessing AI’s ability to understand and reason about traditional sports.


[74] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs cs.CLPDF

Ruyue Liu, Rong Yin, Xiangzhen Bo, Xiaoshuai Hao, Yong Liu

TL;DR: SSTAG提出了一种结构感知的自监督学习方法,结合LLM和GNN的优势,通过双知识蒸馏框架提升文本属性图的可扩展性和泛化能力。

Details

Motivation: 当前图学习模型通常在单一数据集上训练,缺乏跨图和跨任务的泛化能力,且依赖大量标注数据。SSTAG旨在解决这些问题,利用文本作为统一表示媒介,结合LLM和GNN的优势。

Result: SSTAG在跨域迁移学习任务中优于SOTA模型,具备高可扩展性,同时降低了推理成本。

Insight: 文本可以作为图的统一表示媒介,结合LLM和GNN的优势,显著提升模型的泛化能力和效率。

Abstract: Large scale pretrained models have revolutionized Natural Language Processing (NLP) and Computer Vision (CV), showcasing remarkable cross domain generalization abilities. However, in graph learning, models are typically trained on individual graph datasets, limiting their capacity to transfer knowledge across different graphs and tasks. This approach also heavily relies on large volumes of annotated data, which presents a significant challenge in resource-constrained settings. Unlike NLP and CV, graph structured data presents unique challenges due to its inherent heterogeneity, including domain specific feature spaces and structural diversity across various applications. To address these challenges, we propose a novel structure aware self supervised learning method for Text Attributed Graphs (SSTAG). By leveraging text as a unified representation medium for graph learning, SSTAG bridges the gap between the semantic reasoning of Large Language Models (LLMs) and the structural modeling capabilities of Graph Neural Networks (GNNs). Our approach introduces a dual knowledge distillation framework that co-distills both LLMs and GNNs into structure-aware multilayer perceptrons (MLPs), enhancing the scalability of large-scale TAGs. Additionally, we introduce an in-memory mechanism that stores typical graph representations, aligning them with memory anchors in an in-memory repository to integrate invariant knowledge, thereby improving the model’s generalization ability. Extensive experiments demonstrate that SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.


[75] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning cs.CLPDF

You-Le Fang, Dong-Shan Jian, Xiang Li, Ce Meng, Ling-Shi Meng

TL;DR: LOCA提出了一种名为Logical Chain Augmentation(逻辑链增强)的新框架,通过自动补齐缺失逻辑步骤并分离科学原理与其推导过程,显著降低科学问答数据集的错误率。

Details

Motivation: 当前大型语言模型(LLMs)在一般领域表现优异,但在科学问题解决中可靠性不足,主要原因在于科学问答数据集的高错误率,尤其是答案中的逻辑跳跃和隐式推理问题。

Result: 实验表明,LOCA在具有挑战性的科学语料库上能将错误率从20%降至2%以下,显著提升了数据集质量。

Insight: LOCA展示了逻辑链显式化对科学语料库质量提升的重要性,为科学AI领域的可靠数据构建提供了可扩展的方法。

Abstract: While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.


[76] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages cs.CLPDF

Trung Duc Anh Dang, Ferdinando Pio D’Elia

TL;DR: 该论文描述了GemDetox在TextDetox CLEF 2025竞赛中的提交方案,通过改进大规模多语言模型Gemma-3,结合LoRA微调和少样本提示技术,在15种语言上实现了高效的文本去毒任务。

Details

Motivation: 社交媒体内容监管的滞后性催生了自动去毒技术的需求,尤其是在低资源语言环境下。

Result: 实验结果显示,该方法在高资源和低资源语言上均取得了最佳性能,少样本提示和CoT分别带来了0.081和0.088的性能提升。

Insight: 语言资源状态是性能最强预测因子(η²=0.667),表明低资源语言任务中数据增强和高效微调的重要性。

Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA SFT fine-tuning and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance ($\eta^2$ = 0.667, p < 0.01).


[77] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data cs.CL | stat.MLPDF

Carlo Bono, Federico Belotti, Matteo Palmonari

TL;DR: 这篇论文提出了一种高效的不确定性估计方法,用于基于LLM的表格数据实体链接任务,通过单次生成减少计算开销。

Details

Motivation: 在实际应用中,基于LLMs的实体链接不仅需要高精度预测,还需要可靠的不确定性估计。传统的多生成方法计算成本高,限制了实用性,因此需要一种更高效的方法。

Result: 在多个LLMs上进行的实验表明,该方法能有效检测低精度输出,且计算成本大幅降低。

Insight: 通过单次生成实现不确定性估计,为实际应用中高效集成LLM提供了可行方案。

Abstract: Linking textual values in tabular data to their corresponding entities in a Knowledge Base is a core task across a variety of data integration and enrichment applications. Although Large Language Models (LLMs) have shown State-of-The-Art performance in Entity Linking (EL) tasks, their deployment in real-world scenarios requires not only accurate predictions but also reliable uncertainty estimates, which require resource-demanding multi-shot inference, posing serious limits to their actual applicability. As a more efficient alternative, we investigate a self-supervised approach for estimating uncertainty from single-shot LLM outputs using token-level features, reducing the need for multiple generations. Evaluation is performed on an EL task on tabular data across multiple LLMs, showing that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, ultimately supporting a cost-effective integration of uncertainty measures into LLM-based EL workflows. The method offers a practical way to incorporate uncertainty estimation into EL workflows with limited computational overhead.


[78] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models cs.CL | cs.AI | cs.LGPDF

Mariam Mahran, Katharina Simbeck

TL;DR: 这篇论文提出了一种稀疏自编码器(SAE)与大语言模型(LLM)相结合的方法,用于分析和解释模型内部表示及训练数据中的深层结构和偏见。研究团队在简·奥斯汀的小说上训练了一个GPT风格的模型,并通过SAE发现了一些稀疏且可解释的特征,这些特征反映了小说中的关键主题和社会观念。

Details

Motivation: 随着大语言模型(LLMs)在未经过滤的大规模语料库上训练的普及,理解模型的内部表示及其从数据中学习的内容变得尤为重要。这篇论文旨在通过稀疏自编码器(SAEs)提供一种可扩展的方法,揭示模型行为及其训练数据中的深层结构和偏见。

Result: 实验结果表明,SAEs能够成功地从LLM的隐藏状态中提取出稀疏且语义明确的特征,这些特征不仅反映了训练数据的深层结构,还揭示了其中的社会偏见和主题模式。

Insight: 这项研究展示了SAEs作为一种工具的强大潜力,能够有效地帮助理解大语言模型内部的复杂表示和训练数据中的偏见。这种方法为大规模语料库的探索和模型的可解释性提供了一条新路径。

Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.


[79] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs cs.CL | cs.AIPDF

Can Lin, Zhengwang Jiang, Ling Zheng, Qi Zhao, Yuhang Zhang

TL;DR: 论文提出了RJE框架,通过检索-判断-探索的方式提升知识图谱问答效率,支持小规模LLMs表现优越,并显著减少LLM调用和令牌使用。

Details

Motivation: 现有方法依赖检索质量或专有LLMs,限制了KGQA的效果和普适性。

Result: 在专有和小规模LLMs上均优于基线,减少LLM调用和令牌使用,提升效率。

Insight: 框架设计可平衡检索质量和LLM依赖性,是KGQA领域的实用解决方案。

Abstract: Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.


[80] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse cs.CL | cs.AIPDF

Nathan Junzi Chen

TL;DR: 本文通过零样本分类方法评估大型语言模型(LLMs)的政治倾向性,揭示其普遍存在自由主义-威权主义倾向,并探讨了这种偏见对政治话语的影响。

Details

Motivation: 生成式人工智能(GAI)在政治话语中日益普及,但其训练数据偏差、人类偏见和算法缺陷可能导致政治倾向性。本文旨在量化这种倾向性及其社会影响。

Result: 结果显示所有评估的LLMs均表现出明显的自由主义-威权主义倾向,并出现推理取代和模板化拒绝等现象。

Insight: 算法的内在政治偏见可能通过人机交互渗透到公共话语中,导致政治景观的扭曲,表现为一致性或极化,具体取决于地区的现有社会政治结构。

Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region’s pre-existing socio-political structures.


[81] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture cs.CL | cs.AIPDF

Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li

TL;DR: 论文提出了TUMIX,一种并行运行多个代理的工具混合框架,通过迭代共享和优化回答提升推理能力,实验结果显示其在多个基准测试中显著优于现有方法。

Details

Motivation: 尽管工具(如代码解释器和搜索)显著增强了LLM的推理能力,但缺乏关于如何最优使用这些工具的实用指导。TUMIX旨在解决如何有效结合文本推理、编码和搜索应对多样化问题。

Result: TUMIX在多个推理基准测试中平均准确率提升3.55%,推理成本接近基线方法。它还能在达到足够置信度时停止优化,将推理成本降至49%。

Insight: 代理多样性和质量对性能至关重要,可通过LLM自动优化代理设计进一步提升。TUMIX展示了在高性能和成本之间权衡的可扩展性。

Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.


[82] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies cs.CLPDF

Maithili Kadam, Francis Ferraro

TL;DR: TAG-EQA是一个基于提示词的框架,通过将因果事件图转换为自然语言语句注入LLM输入,提升事件问答性能。结合九种提示配置,平均准确率提升5%。

Details

Motivation: 大型语言模型在通用语言任务上表现优异,但在事件问答(尤其是因果或时序推理)上表现不佳,需结构化知识增强推理。

Result: 在TORQUESTRA基准上,平均准确率提升5%,零样本设置下提升12%,图增强思维链提示有效时提升18%。

Insight: 因果事件图可在不微调LLM的情况下增强事件推理,为基于提示的问答提供灵活的结构化编码方式。

Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions-especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.


[83] A-VERT: Agnostic Verification with Embedding Ranking Targets cs.CL | cs.LG | 68T50 | I.2.7PDF

Nicolás Aguirre, Ramiro Caso, Ramiro Rodríguez Colmeiro, Mauro Santelli, Joaquín Toranzo Calderón

TL;DR: 该论文提出了一种名为A-VERT的无结构评估方法,通过语义嵌入距离匹配目标候选与任意语言模型生成文本,实现了低计算成本下的稳健分类。

Details

Motivation: 当前语言模型响应的自动评估方法要么成本过高(如LLM-as-a-Judge),要么脱离真实条件(如字符串匹配、logprob)。需要一种更高效且贴近实际的方法。

Result: 在三个数据集和三种不同语言模型架构上测试,回归得分约0.97,准确率约96%。

Insight: 语义嵌入距离是一种高效且低成本的语言模型响应评估方法,能够替代现有昂贵或不切实际的方案。

Abstract: The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification relies on methods that are too expensive (i.e. LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than $10B$ parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.


[84] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning cs.CL | q-fin.CPPDF

Mengyu Wang, Sotirios Sabanis, Miguel de Carvalho, Shay B. Cohen, Tiejun Ma

TL;DR: EQD是一种专家问题分解模型,通过两阶段微调框架和奖励函数提升领域定量推理任务的问答性能,仅需少量训练数据和A100 GPU,性能优于现有方法。

Details

Motivation: 领域特定的定量推理对LLM仍具挑战性,尤其是在需要专家知识和复杂问答的任务中。EQD旨在平衡领域知识与计算效率。

Result: 在金融领域四个数据集的评测中,EQD使QA性能提升0.6%至10.5%,优于领域调优模型和高级提示策略。

Insight: 在领域特定问答中,单个支持性问题比详细指导步骤更能提升性能。

Abstract: Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.


[85] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning cs.CL | cs.NIPDF

Haochen You, Baojing Liu

TL;DR: ReSSFormer是一种递归稀疏结构化Transformer,通过递归推理、自适应稀疏注意力和自组织编码结构解决长上下文推理和计算效率问题。

Details

Motivation: Transformer在长上下文推理、计算效率和结构泛化方面仍面临挑战,主要由于固定的层堆叠、密集注意力以及对位置编码的依赖。

Result: 在语言建模、多跳QA和结构敏感任务中,ReSSFormer在相同计算和参数预算下优于基线模型。

Insight: 递归推理和稀疏注意力机制显著提升了模型的效率和结构灵活性,适用于长上下文任务。

Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.


[86] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering cs.CLPDF

Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu

TL;DR: CLUE是一种基于隐藏状态聚类的非参数验证方法,通过分析LLM内部隐藏状态的几何特征,实现对模型输出的有效性验证。CLUE无需可训练参数,仅依赖历史经验聚类,显著提升了验证性能。

Details

Motivation: 现有方法(如基于文本的奖励模型或标定置信度)在验证LLM输出时易受限于浅层特征或模型校准不足的问题。CLUE提出直接利用隐藏状态的丰富信息作为验证基础,以统一解决这些问题。

Result: 在AIME 24/25和GPQA数据集上,CLUE显著提升top-1和多数投票准确率(如AIME 24从56.7%提升至70.0%),优于LLM-as-a-judge和置信度基线方法。

Insight: 隐藏状态蕴含丰富的语义和置信度信息,其几何可分性为验证LLM输出提供了统一且高效的信号,避免了传统方法的局限性。

Abstract: Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which would fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model’s internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present Clue (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE only summarizes each reasoning trace by an hidden state delta and classifies correctness via nearest-centroid distance to success'' and failure’’ clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).


[87] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation cs.CL | cs.AIPDF

Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu

TL;DR: 本文比较了检索增强生成(RAG)中的独立、联合和两阶段微调策略,发现它们在生成质量上表现相似,但计算成本差异显著,最佳策略取决于数据集是否包含上下文标签以及是否需要学习率网格搜索。

Details

Motivation: RAG框架在问答任务中广泛应用,但其嵌入模型和生成器模型的微调策略多样,缺乏系统性比较,导致实际应用中难以选择最优策略。

Result: 所有策略在生成质量(EM和F1)上的提升相近,但计算成本差异显著。最优策略取决于上下文标签的存在和学习率网格搜索的需求。

Insight: 实践中选择微调策略时,需权衡计算成本和数据集特点(如上下文标签),而无需过分追求单一方法。

Abstract: A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation Download PDF Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu Published: 20 Aug 2025, Last Modified: 17 Sept 2025EMNLP 2025 FindingsConference, Publication Chairs, AuthorsRevisionsBibTeXCC BY 4.0 Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning TL;DR: We evaluate and compare strategies for fine-tuning Retrieval Augmented Generation (RAG) pipelines, including independent fine-tuning, joint fine-tuning, and two-phase fine-tuning. Abstract: Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.


[88] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering cs.CL | cs.AIPDF

Lovely Yeswanth Panchumarthi, Sai Prasad Gudari, Atharva Negi, Praveen Raj Budime, Harsit Upadhya

TL;DR: 该论文提出了RAG-BioQA框架,结合检索增强生成(RAG)和领域特定微调,为生物医学领域生成证据支持的长篇问答,显著优于现有基线方法。

Details

Motivation: 生物医学文献的快速增长使得获取精准医疗信息变得困难,现有系统主要专注于简短回答,无法提供临床决策所需的全面解释。

Result: 在PubMedQA数据集上的实验表明,RAG-BioQA在BLEU、ROUGE和METEOR指标上显著优于基线模型。

Insight: 该研究展示了检索增强生成在生物医学领域的潜力,特别是在生成长篇、证据支持的答案方面的有效性。

Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.


[89] SoK: Measuring What Matters for Closed-Loop Security Agents cs.CL | cs.AIPDF

Mudita Khurana, Raunak Jain

TL;DR: 这篇论文提出了CLASP框架和CLC评分,用于衡量闭环安全代理的能力,填补了领域内缺乏统一评估标准的空白。

Details

Motivation: 网络安全领域缺乏统一的框架和方法来评估闭环安全代理的能力,导致研究分散且难以衡量实际效果。

Result: 应用CLASP分析了21个系统,揭示了能力差距;CLC评分为闭环代理提供了量化标准。

Insight: 闭环安全代理的能力评估需结合任务完成度和闭环性;统一的框架和评分有助于推动领域发展。

Abstract: Cybersecurity is a relentless arms race, with AI driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating, exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across security life cycle, a principled method for evaluating closed loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP: the Closed-Loop Autonomous Security Performance framework which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception) providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths, and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed loop benchmark. Together, CLASP and the CLC Score, provide the vocabulary, diagnostics, and measurements needed to advance both function level performance and measure closed loop security agents.


[90] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization cs.CL | cs.AIPDF

Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian

TL;DR: MDSEval是第一个针对多模态对话摘要(MDS)的元评估基准,旨在为开发高效MDS模型提供支持。

Details

Motivation: 由于MDS任务的广泛应用,需要一个强大的自动评估方法来降低成本和人力的投入,但目前的评估方法缺乏有效的基准。

Result: 基准测试揭示了现有评估方法在区分先进MLLM生成的摘要和应对各种偏差方面的局限性。

Insight: 研究中首次形式化了MDS特有的评估维度,对改进MDS模型的评估具有重要指导意义。

Abstract: Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richfulness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various bias.


[91] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol cs.CL | cs.AI | cs.MAPDF

He Zhang, Anzhou Zhang, Jian Dai

TL;DR: FOR-Prompting是一种非对称提示协议,通过角色分工(Defender、Objectioner、Host)实现自我修订,提升模型推理能力,尤其在小型模型上表现突出。

Details

Motivation: 现有推理协议(如CoT和ToT)缺乏外部提问机制以激发自我修订,FOR-Prompting填补了这一空白。

Result: 在GSM8K上优于单提示方法,与CoT相当,小型模型Llama3.2:1b提升19%。

Insight: 角色化提示协议无需额外训练即可提升模型性能,尤其适合小型模型和设备端应用。

Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe about a 22% point gain over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT 4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale model (approx. 19% accuracy improved on Llama3.2:1b for GSM8K task), highlighting promise for small models and on personal device use. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.


[92] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration? cs.CLPDF

Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet

TL;DR: 论文提出了MathLens基准测试,用于分解多模态推理的子技能(感知、推理和集成),并通过实验揭示了不同训练方法对各子技能的差异化影响。

Details

Motivation: 现有对多模态推理模型的评估主要依赖聚合准确率,掩盖了模型改进的具体细节。论文旨在提供一个更细粒度的评估框架,明确模型的感知、推理和集成能力。

Result: 实验发现:1) 强化学习主要提升感知能力,而文本监督的SFT间接提升感知;2) 推理能力需与感知同步提升;3) 集成能力最弱;4) 不同训练的鲁棒性表现相反。

Insight: 多模态推理的改进需要针对性训练各子技能,集成能力是关键瓶颈,未来研究需重点关注。

Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.


[93] Machine-interpretable Engineering Design Standards for Valve Specification cs.CL | cs.AIPDF

Anders Gjerver, Rune Frostad, Vedrana Barisic, Melinda Hodkiewicz, Caitlin Woods

TL;DR: 论文提出了一种将工程设计标准转化为模块化、可重用、机器可解释的本体的方法,用于阀门选择的语义推理和质量验证。

Details

Motivation: 尽管工业工作数字化的目标明确,但工程设计标准仍是文档主导。论文旨在通过语义技术实现设计标准的机器可解释性和自动化验证。

Result: 成功验证了阀门数据表是否符合行业标准,并展示了语义推理在设备选择中的潜力。

Insight: 基于本体的方法可推动数字化智能标准的转型,标准和行业规范的可互操作本体库对实现自动化设计流程具有重要价值。

Abstract: Engineering design processes use technical specifications and must comply with standards. Product specifications, product type data sheets, and design standards are still mainly document-centric despite the ambition to digitalize industrial work. In this paper, we demonstrate how to transform information held in engineering design standards into modular, reusable, machine-interpretable ontologies and use the ontologies in quality assurance of the plant design and equipment selection process. We use modelling patterns to create modular ontologies for knowledge captured in the text and in frequently referenced tables in International Standards for piping, material and valve design. These modules are exchangeable, as stored in a W3C compliant format, and interoperable as they are aligned with the top-level ontology ISO DIS 23726-3: Industrial Data Ontology (IDO). We test these ontologies, created based on international material and piping standards and industry norms, on a valve selection process. Valves are instantiated in semantic asset models as individuals along with a semantic representation of the environmental condition at their location on the asset. We create “functional location tags” as OWL individuals that become instances of OWL class Valve Data Sheet (VDS) specified valves. Similarly we create instances of manufacturer product type. Our approach enables automated validation that a specific VDS is compliant with relevant industry standards. Using semantic reasoning and executable design rules, we also determine whether the product type meets the valve specification. Creation of shared, reusable IDO-based modular ontologies for design standards enables semantic reasoning to be applied to equipment selection processes and demonstrates the potential of this approach for Standards Bodies wanting to transition to digitized Smart Standards.


[94] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors cs.CL | I.2.7; I.2.0PDF

Dane Williamson, Yangfeng Ji, Matthew Dwyer

TL;DR: 论文研究发现LLMs在数学问题上存在一种系统性失败模式(语法盲点),即模型在面对语义简单但表达方式不熟悉的问题时,会错误应用熟悉的推理策略。通过调整语法结构(保留语义但降低复杂性)可以显著提高正确率。

Details

Motivation: LLMs虽然在数学问题上表现出色,但面对语法偏离训练分布的问题时容易失败。研究者希望揭示这种失败的根本原因。

Result: 研究表明,许多推理错误源于结构不对齐而非概念难度,语法干预可以缓解这些错误。

Insight: LLMs的推理能力受语法表达方式影响显著,未来的模型可能需要更强的语法敏感性。

Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.


[95] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning cs.CLPDF

Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang

TL;DR: SCRIBES提出了一种基于强化学习的框架,通过利用网页间的布局相似性生成可重用的提取脚本,显著提高了半结构化数据的提取质量和下游任务性能。

Details

Motivation: 网页中的半结构化数据(如HTML表格、列表等)占事实数据的很大比例,但现有方法要么缺乏泛化能力,要么因需要逐页处理而资源消耗大。SCRIBES旨在解决这些问题,通过脚本生成实现高效、可扩展的数据提取。

Result: SCRIBES的脚本质量提升了13%,下游任务(如GPT-4o的问答)准确率提高了4%,同时降低了资源消耗。

Insight: 通过布局相似性设计奖励信号是一种高效的策略,可以显著提升提取任务的泛化能力和可扩展性。结合合成数据训练进一步优化了模型的性能。

Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.


[96] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models cs.CL | cs.CVPDF

Ece Takmaz, Lisa Bylinina, Jakub Dotlacil

TL;DR: 论文通过模型融合方法,在多模态模型中保持语言任务的性能,解决了多模态模型在语言任务中表现不佳的问题。

Details

Motivation: 现有视觉-语言模型参数庞大且依赖大数据集,远超儿童语言习得的数据量。论文旨在低资源环境下开发发展合理的多模态模型,同时避免其在语言任务中的性能下降。

Result: 多模态模型在语法为主的语言任务中表现不佳,但融合方法能在一定程度上缓解这一问题,同时保持多模态性能。

Insight: 模型融合是平衡多模态任务与语言任务性能的有效手段,尤其适用于发展合理的低资源场景。

Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in \textit{language-only} tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with \textit{model merging}, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.


[97] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey cs.CLPDF

Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh

TL;DR: 该论文系统性地介绍了奖励模型(RMs)及其在大语言模型(LLM)推理中的应用,包括架构、训练方法和评估技术,并探讨了RMs在生成引导、数据合成和强化学习微调中的关键作用。

Details

Motivation: 奖励模型在提升大语言模型的推理能力中具有重要作用,但目前缺乏对其系统性的分析和实际应用的完整调研。

Result: 论文总结了RMs在LLM推理中的实际应用效果,并指出了当前研究中存在的关键问题和改进方向。

Insight: RMs不仅是LLM微调的重要工具,还能在推理阶段优化输出选择;未来的研究需关注RMs的选择、泛化性和评估方法。

Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.


[98] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning cs.CLPDF

Qi He, Cheng Qian, Xiusi Chen, Bingxiang He, Yi R.

TL;DR: Veri-R1 是一个基于在线强化学习的框架,利用LLM与搜索引擎的交互,通过奖励信号优化其规划、检索和推理行为,显著提高了声明验证的准确性和证据得分。

Details

Motivation: 传统声明验证方法主要依赖提示工程或预设推理流程,缺乏统一的训练范式以提升核心技能。在线声明验证需要迭代证据检索和推理,从而需要更动态的方法。

Result: 实验结果显示,Veri-R1联合准确性提升高达30%,证据得分翻倍,甚至优于更大规模的模型。

Insight: 在线强化学习能够动态优化LLM在声明验证中的表现,并且奖励信号的各组成部分对结果有重要影响。

Abstract: Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM empowered claim verification.


[99] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models cs.CLPDF

Donghoon Jung, Jiwoo Choi, Songeun Chae, Seohyon Jung

TL;DR: 这篇论文采用过程导向的方法,通过叙事学视角研究大语言模型(LLMs)的作者创造力,提出基于约束的决策作为创造力评估工具,发现LLMs更注重风格而非故事要素。

Details

Motivation: 现有对LLMs创造力的评估多关注输出质量,而忽略其生成过程。本文旨在填补这一空白,通过过程导向方法分析LLMs的作者创造力。

Result: LLMs在创造力表现中明显偏向风格(Style),而非其他故事要素(如角色、事件、背景)。不同模型的创作偏好和推理特点呈现独特模式。

Insight: LLMs的创造力具有可量化和系统化的特征,过程导向方法为AI作者创造力分析提供了新工具。

Abstract: Evaluations of large language models (LLMs)’ creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI’s authorial creativity.


[100] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems cs.CL | cs.SD | eess.ASPDF

Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi

TL;DR: 这篇论文提出了SCoT框架,一种用于双工语音对话系统的流式思维链推理方法,通过交替处理用户输入和生成响应块,解决了传统方法在语义推理和延迟上的不足。

Details

Motivation: 传统端到端语音对话系统依赖语音活动检测(VAD)进行轮流对话,但VAD无法区分暂停和对话结束。双工系统虽然解决了这一问题,但在语义推理上表现较差且架构复杂。

Result: 实验表明,SCoT比现有双工方法生成更连贯、可解释的响应,同时支持更低延迟和重叠交互。

Insight: 思维链推理可以在流式对话系统中提高语义连贯性,块级处理是实现低延迟的关键。

Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets-aligned user transcripts and system responses for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.


[101] The Disparate Impacts of Speculative Decoding cs.CL | cs.AIPDF

Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto

TL;DR: 论文分析了推测解码(speculative decoding)在不同任务中带来的速度提升不均匀现象,发现其对拟合不足或代表性不足的任务速度提升较小,并提出了一种缓解策略,将公平性指标平均提升12%。

Details

Motivation: 推测解码已成为减少大型语言模型解码时间的标准技术,但其在不同任务中带来的速度提升不均可能引发不公平问题。论文旨在量化并解决这种不公平现象。

Result: 实验结果表明,提出的缓解策略将公平性指标平均提升了12%,证明了其有效性。

Insight: 推测解码的效率提升可能隐含不公平性,需针对任务特性优化策略以确保公平性。

Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, drafter'' model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observed unfairness’’ and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.


[102] RESTRAIN: From Spurious Votes to Signals – Self-Driven RL with Self-Penalization cs.CLPDF

Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh

TL;DR: RESTRAIN是一种自惩罚强化学习框架,利用无标注数据改进推理模型,避免对虚假多数投票的依赖,显著提升了推理任务的性能。

Details

Motivation: 传统基于人工标注数据的强化学习成本高且难以应对复杂任务,RESTRAIN旨在通过无监督学习利用模型自身的信号实现持续改进。

Result: 在AIME25、MMLU_STEM和GPQA-Diamond等基准上显著提升性能,最高提升140.7%,接近黄金标注训练的效果。

Insight: RESTRAIN展示了无监督强化学习在推理任务中的潜力,为无需标注数据的持续改进提供了可行路径。

Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model’s entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.


[103] Learning to Reason for Hallucination Span Detection cs.CL | cs.AI | cs.LGPDF

Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari

TL;DR: 论文提出RL4HS,一个基于强化学习的框架,通过显式推理和多步决策来解决大语言模型(LLMs)生成幻觉内容的检测问题,优于传统方法。

Details

Motivation: 大语言模型常生成幻觉内容(未支持的虚假信息),传统方法将其视为二分类任务,但实际应用中需要识别具体幻觉片段,涉及多步决策,因此探讨显式推理是否能改进检测效果。

Result: RL4HS在多个任务上超越基线方法,证明显式推理和Span级强化学习的必要性。

Insight: 显式推理能提升多步决策任务的性能,Span级奖励和强化学习是解决幻觉片段检测的有效途径。

Abstract: Large language models (LLMs) often generate hallucinations – unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.


[104] ARUQULA – An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities cs.CL | cs.AIPDF

Felix Brei, Lorenz Bühmann, Johannes Frey, Daniel Gerber, Lars-Peter Meyer

TL;DR: 该论文提出了一种基于大语言模型(LLM)的Text2SPARQL方法ARUQULA,通过结合ReAct框架和知识图谱探索工具,将自然语言问题逐步转化为SPARQL查询。

Details

Motivation: 知识图谱查询语言SPARQL对于非计算机背景用户具有较高的学习门槛,而LLM可以通过自然语言到SPARQL的转换降低这一门槛。

Result: 展示了方法的可行性,并通过分析代理行为指出了未来改进方向。

Insight: 迭代式查询生成和知识图谱探索工具的结合可以有效提升Text2SPARQL任务的表现。

Abstract: Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.


[105] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents cs.CLPDF

Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng

TL;DR: 该论文提出了一种新的评估框架,用于诊断VLM驱动的移动代理中的推理-执行差距,揭示了这些差距的普遍性及其潜在危害。

Details

Motivation: 现有研究忽视了VLM代理的推理过程(CoT)是否与真实动作一致,可能导致用户因看似合理的推理而授权有害行为。

Result: 实验表明,推理-执行差距普遍存在,执行差距(EG)更常见,且模型规模扩大虽减少差距,但EG仍较大。

Insight: 揭示了VLM代理在推理与执行之间存在系统性偏差,为开发更可靠的移动代理提供了诊断工具。

Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.


[106] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration cs.CL | cs.AI | cs.LGPDF

Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai

TL;DR: AMPO通过多教师自适应引导策略优化增强LLMs的推理能力,解决了单一引导的局限性,提升了推理多样性和性能。

Details

Motivation: 当前强化学习方法依赖单一教师或自我探索,存在模型偏见和探索受限问题,限制了推理多样性和性能。

Result: 在数学推理任务上提升4.3%,OOD任务提升12.2%,Pass@k性能显著提升,且探索多样性增强。

Insight: 多教师策略比单一强大教师更高效且可扩展,为LLMs推理能力的提升提供了新路径。

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This “guidance-on-demand” approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.


[107] Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches cs.CLPDF

Ebtesam Jaber Aljohani, Wael M. S. Yafoo

TL;DR: 该论文通过结合深度嵌入和BERT方法,提高了阿拉伯语网络欺凌检测的准确性,实验结果显示Bi-LSTM结合FastText嵌入达到98%的准确率。

Details

Motivation: 针对阿拉伯语网络欺凌检测方法的稀缺性,作者旨在通过深度学习技术填补这一空白。

Result: 实验结果表明,Bi-LSTM结合FastText嵌入表现最佳,准确率达到98%。

Insight: 词嵌入选择对模型性能影响显著,尤其在处理非英语语言时,预训练嵌入方法如FastText可能更具优势。

Abstract: Recent technological advances in smartphones and communications, including the growth of such online platforms as massive social media networks such as X (formerly known as Twitter) endangers young people and their emotional well-being by exposing them to cyberbullying, taunting, and bullying content. Most proposed approaches for automatically detecting cyberbullying have been developed around the English language, and methods for detecting Arabic-language cyberbullying are scarce. Methods for detecting Arabic-language cyberbullying are especially scarce. This paper aims to enhance the effectiveness of methods for detecting cyberbullying in Arabic-language content. We assembled a dataset of 10,662 X posts, pre-processed the data, and used the kappa tool to verify and enhance the quality of our annotations. We conducted four experiments to test numerous deep learning models for automatically detecting Arabic-language cyberbullying. We first tested a long short-term memory (LSTM) model and a bidirectional long short-term memory (Bi-LSTM) model with several experimental word embeddings. We also tested the LSTM and Bi-LSTM models with a novel pre-trained bidirectional encoder from representations (BERT) and then tested them on a different experimental models BERT again. LSTM-BERT and Bi-LSTM-BERT demonstrated a 97% accuracy. Bi-LSTM with FastText embedding word performed even better, achieving 98% accuracy. As a result, the outcomes are generalize


[108] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation cs.CL | cs.AI | cs.LGPDF

Tianyi Jiang, Yi Bin, Yujuan Ding, Kainian Zhu, Fei Ma

TL;DR: 这篇论文提出了一种名为TECA的新指标和CER机制,用于解决大语言模型(LLM)在推理过程中的“过度思考”问题,从而提升推理效率。

Details

Motivation: 大语言模型在复杂问题上展现了强大的推理能力,但在简单问题上往往生成不必要的冗长推理步骤(过度思考),影响了效率。论文旨在通过动态优化推理深度来解决这一问题。

Result: 实验表明,该方法在多种数学基准测试中显著减少了推理长度(最多减少71%),同时保持问题解决能力。

Insight: 动态调节推理深度是提升LLM效率的有效途径,TECA为量化推理过程提供了新视角。

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make them difficult to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm – Explore Briefly, Then Decide – with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.


[109] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents cs.CL | cs.AIPDF

Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang

TL;DR: InfoMosaic-Bench是一个评估工具增强代理在多源信息搜索中表现的基准测试,涵盖六个领域,要求代理结合通用搜索与领域专用工具。实验表明,当前LLM代理在处理此类任务时仍存在不足。

Details

Motivation: 现有LLM代理过度依赖开放网络搜索,但网络内容噪音大且不可靠,且许多任务需领域专用知识。MCP协议的出现让代理能访问专业工具,但其能力尚不明确。

Result: 实验显示,仅依赖网络信息的GPT-5准确率为38.2%;领域工具表现不一致;22.4%的失败源于工具使用不当。

Insight: 网络信息不足,工具增强代理需改进;领域工具效果不稳定;LLM代理在工具处理上仍有缺陷。

Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools – and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.


[110] Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective cs.CL | cs.AIPDF

Wen Yang, Junhong Wu, Chong Li, Chengqing Zong, Jiajun Zhang

TL;DR: 该论文提出了一种跨语言视角来研究推理泛化,发现英语为中心的LRMs在其他语言中的推理能力转移效果不一,并通过并行训练揭示了‘第一并行跃迁’和‘并行缩放定律’,同时指出了‘单语泛化差距’。

Details

Motivation: 探索基于强化后训练(RPT)的大型推理模型(LRMs)在多语言环境中的推理能力泛化,特别是英语为中心的模型是否能有效扩展到其他语言。

Result: 发现跨语言转移能力因初始模型和目标语言而异,并行训练显著提升性能,且遵循幂律关系。英语为中心的LRMs未能完全泛化到其他语言。

Insight: 研究表明LRMs的推理能力与人类认知不同,提出了开发更语言无关的LRMs的重要方向。

Abstract: Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: $\textit{Does the reasoning capability achieved from English RPT effectively transfer to other languages?}$ We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: $\textbf{First-Parallel Leap}$, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable $\textbf{Parallel Scaling Law}$, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as $\textbf{Monolingual Generalization Gap}$, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.


[111] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens cs.CL | cs.CVPDF

Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong

TL;DR: VLM-Lens是一个用于系统化分析、评估和解释视觉语言模型(VLMs)的工具包,支持从开源VLMs的任何层提取中间输出,并提供统一的YAML配置接口。

Details

Motivation: 由于VLMs的内部机制复杂且多样化,缺乏统一的工具支持对其中间输出的系统性分析,限制了对其内部能力的深入理解和改进。

Result: 通过两个简单实验,展示了VLMs隐藏表征在层级和目标概念上的系统性差异,验证了工具的有效性。

Insight: VLM-Lens的灵活性为社区提供了深入理解VLMs内部能力的工具,有助于加速模型的改进和优化。

Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.


[112] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data cs.CL | cs.AIPDF

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

TL;DR: F2LLM是一组基于开源数据直接微调的嵌入模型,分别在0.6B、1.7B和4B三种规模下达到SOTA性能,训练成本低且性能优异。

Details

Motivation: 现有顶级嵌入模型需要大规模对比预训练和昂贵合成数据,F2LLM旨在通过开源数据实现低成本高性能嵌入模型。

Result: F2LLM-4B在4B参数模型中排名第2,F2LLM-1.7B在1B-2B规模模型中排名第1。

Insight: 通过开源数据微调可以低成本实现高性能嵌入模型,为未来研究提供可复现的基准。

Abstract: We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.


[113] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation cs.CLPDF

Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp

TL;DR: 该论文质疑现有的竞技场式大语言模型(LLM)评估中平局的语义问题,提出平局更可能反映查询难度而非模型能力相等,并通过实验证明忽略平局的评分更新能提高预测准确性。

Details

Motivation: 传统竞技场式评估使用Elo评分系统,将平局视为模型能力相等的表现,但作者认为平局可能更多反映了查询的难易程度或客观性,而非模型能力的均等。

Result: 实验表明,忽略平局的评分更新能使预测准确度相对提升1-3%。此外,平局更容易出现在非常容易或高度客观的查询中。

Insight: 论文提出未来评分系统应重新考虑平局的语义,并在评分更新中加入查询属性的信息。

Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the “battle” a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend future rating systems to reconsider existing draw semantics and to account for query properties in rating updates.


cs.AI [Back]

[114] VaPR – Vision-language Preference alignment for Reasoning cs.AI | cs.CV | cs.LGPDF

Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu

TL;DR: 论文提出了VaPR框架,通过LLM引导的硬负样本生成解决合成偏好标注中的噪声问题,显著提升了视觉语言模型的推理性能。

Details

Motivation: 现有偏好微调方法忽略了合成偏好标注中的风格和长度偏差噪声,影响了视觉语言模型的对齐效果。

Result: 在十项基准测试中,VaPR模型平均提升LLaVA 6.5%、Qwen2VL 4.0%、Qwen2.5VL 1.5%,解决了二元问题中的“是”偏见。

Insight: 数据规模扩展持续提升性能,LLaVA在小规模数据上也能受益;开源LLM编辑器的泛化能力接近GPT-4o。

Abstract: Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing, that produces rejected responses with targeted errors, maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL & Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer “Yes” in binary questions - addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on \name, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page https://vap-r.github.io


[115] Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models cs.AI | cs.CLPDF

Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi

TL;DR: 该论文提出了一种名为AGILE的交互式学习方法,通过将拼图任务建模为环境交互过程,显著提升了视觉语言模型(VLM)的感知与推理能力。

Details

Motivation: 尽管当前大型视觉语言模型在多模态理解和推理方面取得了进展,但其核心的感知与推理能力仍然有限,特别是在简单的拼图任务中表现接近随机。这是由于高质量视觉语言数据的稀缺性和有限扩展性导致的。

Result: 实验表明,AGILE在多种复杂度的拼图任务上显著提升了性能(如2×2拼图任务的准确率从9.5%提升至82.8%),并在9个通用视觉任务上平均提升了3.1%的性能。

Insight: 该方法为多模态模型的推理与泛化能力提升开辟了新路径,同时为解决多模态强化学习数据稀缺问题提供了高效、可扩展的方案。

Abstract: Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2 $\times$ 2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets is available at https://github.com/yuzeng0-0/AGILE .


[116] Aristotle: IMO-level Automated Theorem Proving cs.AI | cs.CLPDF

Tudor Achim, Alex Best, Kevin Der, Mathïs Fédérico, Sergei Gukov

TL;DR: Aristotle是一个结合形式化验证与非正式推理的AI系统,在2025年IMO问题上达到了金牌水平的性能。

Details

Motivation: 目标是开发一个能够在高水平数学竞赛中解决问题的AI系统,结合形式化与非正式方法以提高性能。

Result: 在2025年IMO问题上表现优异,具备可扩展的自动化定理证明能力。

Insight: 形式化和非正式方法的结合可以显著提升自动化定理证明的竞赛性能。

Abstract: We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. Our system demonstrates state-of-the-art performance with favorable scaling properties for automated theorem proving.


[117] Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort cs.AI | cs.CLPDF

Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He

TL;DR: 论文提出了TRACE方法,通过量化模型的推理努力来检测隐式奖励破解行为,避免了现有监督方法的局限性,显著提升了检测效果。

Details

Motivation: 奖励破解行为(尤其是隐式)威胁大,但现有方法(如CoT监控)难以检测。因此,需要一种无需监督的方法来衡量模型的真实推理努力。

Result: TRACE在数学推理任务中比72B CoT监控器提升65%,在编程任务中比32B监控器提升30%,并能发现训练中的未知漏洞。

Insight: 隐式奖励破解可以通过量化推理努力检测,TRACE提供了一种无需监督的可扩展解决方案,适用于现有方法失效的场景。

Abstract: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model’s chain-of-thought (CoT), or implicit, where the CoT appears benign thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less `effort’ than required to achieve high reward. TRACE quantifies effort by measuring how early a model’s reasoning becomes sufficient to pass a verifier. We progressively truncate a model’s CoT at various lengths, force the model to answer, and measure the verifier-passing rate at each cutoff. A hacking model, which takes a shortcut, will achieve a high passing rate with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.


[118] VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning cs.AI | cs.CL | cs.LGPDF

Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li

TL;DR: VOGUE通过量化视觉输入的随机性,引入不确定性感知的探索信号,有效提升多模态推理的准确性和鲁棒性。

Details

Motivation: 现有的多模态大型语言模型(MLLMs)在探索性学习方面存在不足,尤其是在视觉输入的不确定性处理上表现不佳。

Result: 在两个模型规模(Qwen2.5-VL-3B/7B)上,VOGUE显著提升了视觉数学和通用推理基准的pass@1准确率,同时改善了探索衰减问题。

Insight: 视觉输入的不确定性是多模态推理的关键因素,有效利用这种不确定性可以显著提升模型的性能和鲁棒性。

Abstract: Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that still persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity and struggling to build policies robust to plausible visual variations. We introduce $\textbf{VOGUE (Visual Uncertainty Guided Exploration)}$, a novel method that shifts exploration from the output (text) to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy’s sensitivity to visual perturbations using the symmetric KL divergence between a “raw” and “noisy” branch, creating a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks, while simultaneously increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.


[119] Information Seeking for Robust Decision Making under Partial Observability cs.AI | cs.CL | cs.ROPDF

Djengo Cyun-Jyun Fang, Tsung-Wei Ke

TL;DR: 本文提出了一个名为InfoSeeker的LLM决策框架,通过整合任务导向规划和信息寻求来解决部分可观测环境中的不确定性问题,并在实验中显著优于现有方法。

Details

Motivation: 人类在部分可观测的环境中通过主动寻求信息来解决问题,但现有的LLM规划代理忽视了内部动态与实际环境之间的差异,导致决策不够鲁棒。

Result: InfoSeeker在部分可观测环境中实现了74%的性能提升,并在机器人操作和网页导航等基准测试中表现优异。

Insight: 在部分可观测的环境中,紧密整合规划和信息寻求是实现鲁棒行为的关键。

Abstract: Explicit information seeking is essential to human problem-solving in practical environments characterized by incomplete information and noisy dynamics. When the true environmental state is not directly observable, humans seek information to update their internal dynamics and inform future decision-making. Although existing Large Language Model (LLM) planning agents have addressed observational uncertainty, they often overlook discrepancies between their internal dynamics and the actual environment. We introduce Information Seeking Decision Planner (InfoSeeker), an LLM decision-making framework that integrates task-oriented planning with information seeking to align internal dynamics and make optimal decisions under uncertainty in both agent observations and environmental dynamics. InfoSeeker prompts an LLM to actively gather information by planning actions to validate its understanding, detect environmental changes, or test hypotheses before generating or revising task-oriented plans. To evaluate InfoSeeker, we introduce a novel benchmark suite featuring partially observable environments with incomplete observations and uncertain dynamics. Experiments demonstrate that InfoSeeker achieves a 74% absolute performance gain over prior methods without sacrificing sample efficiency. Moreover, InfoSeeker generalizes across LLMs and outperforms baselines on established benchmarks such as robotic manipulation and web navigation. These findings underscore the importance of tightly integrating planning and information seeking for robust behavior in partially observable environments. The project page is available at https://infoseekerllm.github.io


[120] The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models cs.AI | cs.CL | cs.CVPDF

Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, Nitesh V. Chawla, Binh T. Nguyen

TL;DR: 本文探讨了强化学习与可验证奖励(RLVR)在提升大语言模型推理能力时可能导致推理边界缩小的现象,揭示了两大关键问题并提出了一种改进的数据筛选算法。

Details

Motivation: 尽管RLVR被广泛用于提升语言模型的推理能力,但其可能导致模型的推理边界缩小,这是本文试图解决的矛盾现象。

Result: 实验证明,该方法在多个数学推理基准上显著提升了Pass@$k$性能。

Insight: RLVR的标准目标函数可能导致模型收敛于狭窄的策略,而专注于低概率问题的学习可以缓解这一问题。

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models’ reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics and reveals two critical phenomena that explain this failure. First, we expose negative interference in RLVR, where learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to the decline of Pass@$k$ performance, or the probability of generating a correct solution within $k$ attempts. Second, we uncover the winner-take-all phenomenon: RLVR disproportionately reinforces problems with high likelihood, correct solutions, under the base model, while suppressing other initially low-likelihood ones. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvement in Pass@$k$ performance. Our code is available at https://github.com/mail-research/SELF-llm-interference.


[121] InvThink: Towards AI Safety via Inverse Reasoning cs.AI | cs.CLPDF

Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal

TL;DR: InvThink提出了一种通过逆向思维提升大型语言模型(LLM)安全性的方法,通过枚举潜在危害、分析后果并生成安全响应,实现了优于基线方法的安全性改进。

Details

Motivation: 现有的安全对齐方法直接优化安全性响应,但可能牺牲模型的通用推理能力。InvThink旨在通过逆向思维系统性地考虑故障模式,同时保持模型的通用能力。

Result: 相比基线方法SafetyPrompt,InvThink实现了15.7%的有害响应减少,同时保持了通用推理能力。安全性改进与模型规模呈正相关。

Insight: 逆向思维不仅提升了LLM的安全性,还避免了安全性改进对其他能力的负面影响,为可扩展和通用化的AI安全路径提供了新思路。

Abstract: We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe response, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods. (ii) InvThink mitigates safety tax; by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks. (iii) beyond general safety tasks, InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We further implement InvThink via supervised fine-tuning, and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.


[122] Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness cs.AI | cs.CL | cs.CR | cs.CY | cs.LGPDF

Erfan Shayegani, Keegan Hines, Yue Dong, Nael Abu-Ghazaleh, Roman Lutz

TL;DR: 该论文揭示了计算机使用代理(CUAs)普遍存在的盲目目标导向性(BGD)问题,提出了三种常见的BGD模式,并通过BLIND-ACT基准测试验证了多个前沿模型的BGD发生率。研究表明,即使输入无害,BGD也会带来风险,现有干预措施效果有限。

Details

Motivation: 随着计算机使用代理(CUAs)的广泛应用,其行为安全性日益受到关注。作者发现这些代理普遍存在盲目追求目标的问题(BGD),可能导致不可行的操作或安全风险,因此需要系统性地研究和解决。

Result: 实验显示,受测模型的平均BGD发生率为80.8%,表明BGD问题普遍存在。即时干预措施虽能降低BGD水平,但风险仍然显著。定性分析揭示了执行优先偏差、思维-行动脱节和请求优先等典型失败模式。

Insight: BGD揭示了CUAs在设计和部署中的深层次风险,提示需在训练或推理阶段引入更强大的干预机制。BLIND-ACT为未来研究和缓解BGD提供了基础。

Abstract: Computer-Use Agents (CUAs) are an increasingly deployed class of agents that take actions on GUIs to accomplish user goals. In this paper, we show that CUAs consistently exhibit Blind Goal-Directedness (BGD): a bias to pursue goals regardless of feasibility, safety, reliability, or context. We characterize three prevalent patterns of BGD: (i) lack of contextual reasoning, (ii) assumptions and decisions under ambiguity, and (iii) contradictory or infeasible goals. We develop BLIND-ACT, a benchmark of 90 tasks capturing these three patterns. Built on OSWorld, BLIND-ACT provides realistic environments and employs LLM-based judges to evaluate agent behavior, achieving 93.75% agreement with human annotations. We use BLIND-ACT to evaluate nine frontier models, including Claude Sonnet and Opus 4, Computer-Use-Preview, and GPT-5, observing high average BGD rates (80.8%) across them. We show that BGD exposes subtle risks that arise even when inputs are not directly harmful. While prompting-based interventions lower BGD levels, substantial risk persists, highlighting the need for stronger training- or inference-time interventions. Qualitative analysis reveals observed failure modes: execution-first bias (focusing on how to act over whether to act), thought-action disconnect (execution diverging from reasoning), and request-primacy (justifying actions due to user request). Identifying BGD and introducing BLIND-ACT establishes a foundation for future research on studying and mitigating this fundamental risk and ensuring safe CUA deployment.


[123] Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning cs.AI | cs.CLPDF

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang

TL;DR: 本文提出了一种名为PTA-GRPO的两阶段框架,通过结合高级规划和细粒度推理优化,显著提升大语言模型(LLMs)的推理能力。

Details

Motivation: 现有的LLMs在推理任务中依赖自回归的Token级生成,缺乏全局规划,导致推理冗余、不连贯或不准确,影响了整体性能。传统方法如树搜索和强化学习(RL)计算成本高且效果不佳。

Result: 在多数学推理基准测试(如MATH、AIME等)和多种基础模型(如Qwen系列、LLaMA3.2)上验证了PTA-GRPO的稳定性和显著提升效果。

Insight: 通过分阶段规划和优化,PTA-GRPO有效解决了LLMs推理中缺乏全局规划的问题,提供了更高效和通用的推理方法。

Abstract: Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization PTA-GRPO, a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.


[124] Do AI Models Perform Human-like Abstract Reasoning Across Modalities? cs.AI | cs.CLPDF

Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W. Tsai

TL;DR: 论文探讨了AI模型在多模态抽象推理任务中的表现,揭示了仅依赖准确性评估可能高估或低估模型能力的局限性。

Details

Motivation: 研究旨在评估AI模型在抽象推理任务中是否真正理解和应用了任务设计者意图的抽象概念,而非依赖表面模式。

Result: 文本模态下模型的准确性接近人类,但规则分析显示依赖表面捷径;视觉模态下准确性下降,但规则分析揭示了潜在的抽象能力。

Insight: 仅依赖准确性评估抽象推理能力存在局限性,需结合规则分析;模型在跨模态抽象推理中仍有显著差距。

Abstract: OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models’ abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models’ rules are often based on surface-level ``shortcuts’’ and capture intended abstractions far less often than humans. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models’ output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate it in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.


[125] A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports cs.AI | cs.CLPDF

Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu

TL;DR: 本文提出了一种针对深度研究代理(DRAs)的严格基准和多维评估框架,旨在解决现有基准在评估维度、响应格式和评分机制上的不足。

Details

Motivation: AI正在从封闭式语言模型转向具备外部感知和信息整合能力的互联代理系统,深度研究代理(DRAs)代表了一种系统性能力,但现有基准无法有效评估此类系统。

Result: 实验表明主流DRAs优于增强推理模型,但仍有较大改进空间。

Insight: 该研究为DRAs的能力评估、架构优化和范式进步提供了坚实基础。

Abstract: Artificial intelligence is undergoing the paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) systematically exhibit the capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.


[126] RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems cs.AI | cs.CL | cs.LGPDF

Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov

TL;DR: 这篇论文提出了RLAD方法,通过训练大型语言模型(LLMs)发现推理问题的抽象概念,从而提高推理能力。RLAD采用两玩家强化学习框架,分别训练抽象生成器和解决方案生成器,实现结构化探索和泛化能力的提升。

Details

Motivation: 现有的大型模型在推理任务中往往难以一致地捕捉或重用过程性的知识,导致推理过程冗长且效率低下。为此,需要一种方法帮助模型学习并提出有效的推理抽象概念,从而引导更高效的推理行为。

Result: 实验表明,RLAD能够有效提升模型的推理能力和泛化性能,尤其在复杂问题上表现突出。此外,测试时增加抽象概念的生成比增加解决方案的生成更能提升性能。

Insight: 1. 抽象概念在推理任务中起到了关键的引导作用;2. 结构化探索和解耦学习信号是提升大型语言模型推理能力的有效策略;3. 计算资源分配策略对推理性能有显著影响。

Abstract: Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement “algorithmic procedures” that can be used to deduce answers to hard problems. Doing so requires realizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To address more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing multiple abstractions given a problem, followed by RL that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions at large test budgets, illustrating the role of abstractions in guiding meaningful exploration.


cs.GR [Back]

[127] MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics cs.GR | cs.CVPDF

Changmin Lee, Jihyun Lee, Tae-Kyun Kim

TL;DR: MPMAvatar是一个用于从多视角视频中创建3D人类化身的框架,结合了基于Material Point Method的物理模拟器和3D高斯渲染技术,实现了高精度、鲁棒的动态建模和逼真渲染。

Details

Motivation: 当前从视觉观察中创建的3D化身在松散衣物的物理动态建模上仍存在挑战,现有方法在精度和新动画输入的鲁棒性上表现不足。

Result: 实验表明,MPMAvatar在动态建模精度、渲染精度以及鲁棒性和效率上显著优于现有方法,并能推广到未见过的交互任务中。

Insight: 通过物理模拟与渲染技术的结合,MPMAvatar展示了在复杂动态建模和高保真渲染方面的潜力,为零样本交互任务提供了一条新途径。

Abstract: While there has been significant progress in the field of 3D avatar creation from visual observations, modeling physically plausible dynamics of humans with loose garments remains a challenging problem. Although a few existing works address this problem by leveraging physical simulation, they suffer from limited accuracy or robustness to novel animation inputs. In this work, we present MPMAvatar, a framework for creating 3D human avatars from multi-view videos that supports highly realistic, robust animation, as well as photorealistic rendering from free viewpoints. For accurate and robust dynamics modeling, our key idea is to use a Material Point Method-based simulator, which we carefully tailor to model garments with complex deformations and contact with the underlying body by incorporating an anisotropic constitutive model and a novel collision handling algorithm. We combine this dynamics modeling scheme with our canonical avatar that can be rendered using 3D Gaussian Splatting with quasi-shadowing, enabling high-fidelity rendering for physically realistic animations. In our experiments, we demonstrate that MPMAvatar significantly outperforms the existing state-of-the-art physics-based avatar in terms of (1) dynamics modeling accuracy, (2) rendering accuracy, and (3) robustness and efficiency. Additionally, we present a novel application in which our avatar generalizes to unseen interactions in a zero-shot manner-which was not achievable with previous learning-based methods due to their limited simulation generalizability. Our project page is at: https://KAISTChangmin.github.io/MPMAvatar/


[128] ROI-GS: Interest-based Local Quality 3D Gaussian Splatting cs.GR | cs.CV | 68U05, 68T45 (Primary) 68T07, 68-04 (Secondary) | I.2.10; I.3.3; I.3.5; I.3.7; I.4.5; I.4.6; I.4.8; I.4.10PDF

Quoc-Anh Bui, Gilles Rougeron, Géraldine Morin, Simone Gasparini

TL;DR: ROI-GS提出了基于兴趣的局部质量3D高斯泼溅方法,通过目标引导的相机选择和高分辨率重建,在保持实时性能的同时显著提升感兴趣区域的细节质量。

Details

Motivation: 现有3D高斯泼溅方法资源分配均匀,导致感兴趣区域的细节受限且模型体积庞大,ROI-GS通过针对性优化解决了这一问题。

Result: 实验显示ROI-GS将局部质量提升(PSNR达2.96 dB),模型体积减少约17%,训练速度更快。

Insight: ROI-GS通过资源的有针对性分配,展示了在3D重建中平衡全局和局部质量的潜力。

Abstract: We tackle the challenge of efficiently reconstructing 3D scenes with high detail on objects of interest. Existing 3D Gaussian Splatting (3DGS) methods allocate resources uniformly across the scene, limiting fine detail to Regions Of Interest (ROIs) and leading to inflated model size. We propose ROI-GS, an object-aware framework that enhances local details through object-guided camera selection, targeted Object training, and seamless integration of high-fidelity object of interest reconstructions into the global scene. Our method prioritizes higher resolution details on chosen objects while maintaining real-time performance. Experiments show that ROI-GS significantly improves local quality (up to 2.96 dB PSNR), while reducing overall model size by $\approx 17%$ of baseline and achieving faster training for a scene with a single object of interest, outperforming existing methods.


[129] Spec-Gloss Surfels and Normal-Diffuse Priors for Relightable Glossy Objects cs.GR | cs.CVPDF

Georgios Kouros, Minye Wu, Tinne Tuytelaars

TL;DR: 论文提出了一种结合微表面BRDF和高斯泼溅的可重光照框架,通过引入Spec-Gloss参数化和法向-漫反射先验,提升了高光物体的几何与材质重建质量。

Details

Motivation: 当前神经渲染方法在重建和重光照高光物体时,往往依赖于简化的BRDF模型或耦合的漫反射-高光参数化,限制了材质恢复的准确性和重光照的保真度。

Result: 实验表明,该方法在复杂高光场景中实现了高质量的几何与材质重建,并显著提升了新光照条件下的重光照真实性和一致性。

Insight: 将物理一致的BRDF模型与高斯泼溅结合,并通过先验引导优化,能够有效解决高光物体重建与重光照中的歧义问题。

Abstract: Accurate reconstruction and relighting of glossy objects remain a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restricts faithful material recovery and limits relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. A coarse-to-fine optimization of the environment map accelerates convergence and preserves high-dynamic-range specular reflections. Extensive experiments on complex, glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction, delivering substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods.


cs.MA [Back]

[130] LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science cs.MA | cs.AI | cs.CL | cs.IR | cs.LGPDF

Alireza Salemi, Mihir Parmar, Palash Goyal, Yiwen Song, Jinsung Yoon

TL;DR: 本文提出了一种基于LLM的多智能体黑板系统,用于解决数据科学中大规模异构数据湖中的信息发现问题,显著优于现有方法。

Details

Motivation: 现有单智能体系统难以应对大规模异构数据,而主从式多智能体系统需要精确了解子智能体能力,缺乏灵活性和可扩展性。

Result: 在多个基准测试中显著优于RAG和主从式多智能体范式,任务成功率和数据发现的F1分数相对提升13%-57%和9%。

Insight: 黑板架构为多智能体系统提供了可扩展且通用的通信框架,适用于大规模异构数据环境。

Abstract: The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by large, heterogeneous files in the large data lakes, while multi-agent systems designed based on a master-slave paradigm depend on a rigid central controller for task allocation that requires precise knowledge of each sub-agent’s capabilities. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models. In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents – either responsible for a partition of the data lake or general information retrieval – volunteer to respond based on their capabilities. This design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of all sub-agents’ expertise. We evaluate our method on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code to incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master-slave multi-agent paradigm, achieving between 13% to 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery over the best-performing baselines across both proprietary and open-source LLMs. Our findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent systems.


cs.LG [Back]

[131] From 2D to 3D, Deep Learning-based Shape Reconstruction in Magnetic Resonance Imaging: A Review cs.LG | cs.AI | cs.CVPDF

Emma McMillian, Abhirup Banerjee, Alfonso Bueno-Orovio

TL;DR: 这篇综述论文全面回顾了从2D MRI数据到3D形状重建的深度学习方法,重点分析了点云、网格、形状感知和体积模型四种主要方法,总结了它们的优缺点、应用范围和未来的研究方向。

Details

Motivation: 3D形状重建在医学成像中具有重要意义,但直接从2D MRI数据生成3D模型仍存在挑战。本文旨在系统总结现有方法,推动更鲁棒、通用且临床实用的深度学习解决方案。

Result: 结果表明,不同方法在重建精度、计算效率和临床应用方面各有优劣,但尚无单一方法能完全满足所有需求。

Insight: 未来研究应关注多模态数据融合和跨模态学习,以提高模型的鲁棒性和泛化能力,同时需解决数据稀缺和计算资源限制等挑战。

Abstract: Deep learning-based 3-dimensional (3D) shape reconstruction from 2-dimensional (2D) magnetic resonance imaging (MRI) has become increasingly important in medical disease diagnosis, treatment planning, and computational modeling. This review surveys the methodological landscape of 3D MRI reconstruction, focusing on 4 primary approaches: point cloud, mesh-based, shape-aware, and volumetric models. For each category, we analyze the current state-of-the-art techniques, their methodological foundation, limitations, and applications across anatomical structures. We provide an extensive overview ranging from cardiac to neurological to lung imaging. We also focus on the clinical applicability of models to diseased anatomy, and the influence of their training and testing data. We examine publicly available datasets, computational demands, and evaluation metrics. Finally, we highlight the emerging research directions including multimodal integration and cross-modality frameworks. This review aims to provide researchers with a structured overview of current 3D reconstruction methodologies to identify opportunities for advancing deep learning towards more robust, generalizable, and clinically impactful solutions.


[132] Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs cs.LG | cs.AI | cs.CLPDF

Sergey Troshin, Wafaa Mohammed, Yan Meng, Christof Monz, Antske Fokkens

TL;DR: 该论文提出了选择性采样方法,动态切换贪婪采样和高温度采样,以平衡语言模型输出的多样性和准确性。

Details

Motivation: 为了提高语言模型输出多样性,通常采用基于温度的采样方法,但这在需要高精度的任务(如数学推理)中可能导致准确性下降。论文旨在解决这一问题。

Result: 在数学推理任务上的实验表明,该方法在高温设置下仍能显著提升质量与多样性的平衡。

Insight: 采样风险的可预测性表明,动态调整采样策略能够在保持多样性的同时避免关键位置的错误。

Abstract: Diversity is an essential metric for evaluating the creativity of outputs generated by language models. Temperature-based sampling is a common strategy to increase diversity. However, for tasks that require high precision, e.g., mathematical reasoning, uncontrolled high temperature sampling, e.g., min-$p$ or top-$p$, degrades reasoning quality. We demonstrate that the loss of accuracy is caused by sampling incorrect continuations in sensitive decoding positions. To address this, in this paper, we propose \textbf{selective sampling}, a method that dynamically switches between greedy and high-temperature sampling based on a sampling risk metric. This risk metric estimates the likelihood of output errors when applying high-temperature sampling on the current token position. To predict sampling risk, we train a lightweight classifier on a small subset of verifiable problems. The trained classifier can be integrated with the base language model with minimal latency overhead. Experiments on mathematical reasoning tasks demonstrate that selective sampling enhances the quality-diversity trade-off, even in high-temperature settings.


[133] Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis cs.LG | cs.CVPDF

Han Wu, Yanming Sun, Yunhe Yang, Derek F. Wong

TL;DR: 本文提出了一种自适应门控融合网络(AGFN),通过双门融合机制动态调整多模态特征的权重,以解决传统融合方法在模态质量差异(如噪声、缺失或语义冲突)时的性能问题,显著提升了情感分析的准确性和鲁棒性。

Details

Motivation: 多模态情感分析(MSA)通常无法有效处理模态质量不一致的问题(如噪声、缺失或语义冲突),导致情感预测性能下降。本文旨在通过动态调整模态权重,提升模型对高质量模态信息的利用能力。

Result: 实验表明,AGFN在CMU-MOSI和CMU-MOSEI数据集上优于基线方法,显著提高了情感分析的准确性和鲁棒性。可视化分析显示,AGFN能够从更广泛的特征分布中学习,减少了特征位置对预测的依赖。

Insight: 动态调整模态权重可以有效缓解噪声或冲突模态的影响,提升模型的泛化能力和性能。减少特征位置与预测误差的相关性是实现鲁棒特征表示的关键。

Abstract: Multimodal sentiment analysis (MSA) leverages information fusion from diverse modalities (e.g., text, audio, visual) to enhance sentiment prediction. However, simple fusion techniques often fail to account for variations in modality quality, such as those that are noisy, missing, or semantically conflicting. This oversight leads to suboptimal performance, especially in discerning subtle emotional nuances. To mitigate this limitation, we introduce a simple yet efficient \textbf{A}daptive \textbf{G}ated \textbf{F}usion \textbf{N}etwork that adaptively adjusts feature weights via a dual gate fusion mechanism based on information entropy and modality importance. This mechanism mitigates the influence of noisy modalities and prioritizes informative cues following unimodal encoding and cross-modal interaction. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance. Visualization analysis of feature representations demonstrates that AGFN enhances generalization by learning from a broader feature distribution, achieved by reducing the correlation between feature location and prediction error, thereby decreasing reliance on specific locations and creating more robust multimodal feature representations.


[134] RLP: Reinforcement as a Pretraining Objective cs.LG | cs.AI | cs.CLPDF

Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary

TL;DR: RLP引入了一种基于信息增益的强化学习预训练目标,将探索行为融入预训练阶段,显著提升了模型的推理能力。

Details

Motivation: 当前主流方法仅在训练的最后阶段引入强化学习,而忽略其在预训练中的潜力。RLP提出在预训练阶段引入强化学习目标,以提升模型的推理能力。

Result: 在Qwen3-1.7B-Base上,RLP提升了8个数学和科学基准的平均性能19%;在Nemotron-Nano-12B-v2上,科学推理任务性能提升23%。

Insight: 强化学习的探索行为可以有效地融入预训练阶段,显著提升模型的推理能力,而无需依赖复杂的后训练阶段。

Abstract: The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective, that brings the core spirit of reinforcement learning – exploration – to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in the pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This approach yields a verifier-free dense reward signal, allowing for efficient training for the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.


[135] Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks cs.LG | cs.AI | cs.CVPDF

Bruno Corcuera, Carlos Eiras-Franco, Brais Cancela

TL;DR: 论文提出了一种无监督的动态特征选择方法(DFS),用于去除图像中的噪声或无关特征,从而增强潜在表示的性能和鲁棒性。

Details

Motivation: 视觉任务中,潜在表示常受噪声或无关特征影响,导致模型性能和泛化能力下降。因此,需要一种方法在不依赖标记数据的情况下动态选择最有用的特征。

Result: 在多项图像任务(如聚类和生成)中,DFS显著提升了模型的泛化性能,且计算成本增量极小。

Insight: 无监督动态特征选择可以高效地提升潜在表示的质量,适用于广泛的数据集和任务。

Abstract: Latent representations are critical for the performance and robustness of machine learning models, as they encode the essential features of data in a compact and informative manner. However, in vision tasks, these representations are often affected by noisy or irrelevant features, which can degrade the model’s performance and generalization capabilities. This paper presents a novel approach for enhancing latent representations using unsupervised Dynamic Feature Selection (DFS). For each instance, the proposed method identifies and removes misleading or redundant information in images, ensuring that only the most relevant features contribute to the latent space. By leveraging an unsupervised framework, our approach avoids reliance on labeled data, making it broadly applicable across various domains and datasets. Experiments conducted on image datasets demonstrate that models equipped with unsupervised DFS achieve significant improvements in generalization performance across various tasks, including clustering and image generation, while incurring a minimal increase in the computational cost.


[136] $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models cs.LG | cs.CVPDF

Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang

TL;DR: 论文提出了$ ext{G}^2$RPO框架,通过细粒度的GRPO方法改进流模型中强化学习的奖励信号,以更精确地对齐人类偏好。

Details

Motivation: 现有方法在生成模型的强化学习中存在奖励信号稀疏且狭窄的问题,导致偏好对齐效果不佳。

Result: 实验表明$ ext{G}^2$RPO在多种奖励模型中显著优于现有GRPO基线,证明了其有效性和鲁棒性。

Insight: 通过细粒度奖励评估和多尺度优势集成,可以显著提升生成模型与人类偏好的对齐效果。

Abstract: The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO ($\text{G}^2$RPO ) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our $\text{G}^2$RPO significantly outperforms existing flow-based GRPO baselines,highlighting its effectiveness and robustness.


[137] LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning cs.LG | cs.CLPDF

Weizhe Chen, Sven Koenig, Bistra Dilkina

TL;DR: 论文提出了一种基于响应长度的动态采样方法LSPO,用于优化大语言模型(LLM)在推理任务中的策略学习。

Details

Motivation: 研究者观察到现有的RLVR方法在训练LLM时存在效率不足的问题,特别是任务响应长度对学习效果的影响未得到充分利用。

Result: 实验表明,LSPO在多种基线模型和数据集上均能提升学习效果。

Insight: 响应长度信息对RLVR的动态采样具有重要作用,未来的研究可以进一步探索如何利用长度信号优化训练。

Abstract: Since the release of Deepseek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.


[138] Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression cs.LG | cs.AI | cs.CLPDF

Joykirat Singh, Justin Chih-Yao Chen, Archiki Prasad, Elias Stengel-Eskin, Akshay Nambi

TL;DR: TRAAC是一种自适应注意力压缩方法,通过调整推理长度以适应任务难度,解决了推理不足和过度推理的问题,显著提升准确率和效率。

Details

Motivation: 现有的推理模型在测试时难以动态分配计算资源,容易因推理不足或过度推理而效率低下或错误率高。

Result: TRAAC在多个任务中平均准确率提升8.4%,推理长度减少36.8%,并能泛化到非数学任务。

Insight: 任务难度校准与注意力压缩的结合是实现高效自适应推理的关键。

Abstract: Recent thinking models solve complex reasoning tasks by scaling test-time compute, but this scaling must be allocated in line with task difficulty. On one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps; but, excessively long reasoning (overthinking) can be token-inefficient, generating unnecessary steps even after reaching a correct intermediate solution. We refer to this as under-adaptivity, where the model fails to modulate its response length appropriately given problems of varying difficulty. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model’s self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty. Our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to thinking budget based on difficulty and that a combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.


[139] Continual Personalization for Diffusion Models cs.LG | cs.CVPDF

Yu-Chien Liao, Jr-Jen Chen, Chi-Pin Huang, Ci-Siang Lin, Meng-Lin Wu

TL;DR: 该论文提出了一种名为Concept Neuron Selection (CNS)的新方法,用于在扩散模型中实现增量个性化学习,避免了灾难性遗忘问题,同时保持了零样本文本到图像生成能力。

Details

Motivation: 为了解决扩散模型在增量学习中面临的灾难性遗忘和计算效率问题,论文提出了CNS方法,旨在实现高效且持续的概念个性化学习。

Result: 实验表明,CNS在单概念和多概念个性化任务中均取得了最优性能,且仅需微调少量参数。

Insight: CNS的创新在于通过神经元选择机制实现了高效且持续的个性化学习,这对于实际应用中多概念增量学习的场景具有重要意义。

Abstract: Updating diffusion models in an incremental setting would be practical in real-world applications yet computationally challenging. We present a novel learning strategy of Concept Neuron Selection (CNS), a simple yet effective approach to perform personalization in a continual learning scheme. CNS uniquely identifies neurons in diffusion models that are closely related to the target concepts. In order to mitigate catastrophic forgetting problems while preserving zero-shot text-to-image generation ability, CNS finetunes concept neurons in an incremental manner and jointly preserves knowledge learned of previous concepts. Evaluation of real-world datasets demonstrates that CNS achieves state-of-the-art performance with minimal parameter adjustments, outperforming previous methods in both single and multi-concept personalization works. CNS also achieves fusion-free operation, reducing memory storage and processing time for continual personalization.


[140] Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead cs.LG | cs.AI | cs.CLPDF

Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia

TL;DR: 论文挑战了现有的SFT-RL两阶段训练方法,发现高SFT分数并不能可靠地预测RL效果,反而可能导致更差的结果。提出了泛化损失和Pass@large k作为替代指标,显著提高了预测精度。

Details

Motivation: 现有实践中,LLM的后训练通常分为SFT和RL两个独立阶段,但高SFT分数是否能反映RL后的性能提升尚无明确证据。作者质疑了这一假设,并探索更可靠的替代指标。

Result: 泛化损失和Pass@large k显著提升了预测RL效果的准确性(R²和Spearman系数提升达0.5)。实验表明,SFT阶段的训练策略(如数据选择、训练时长)对RL效果有重大影响。

Insight: SFT阶段不应盲目追求高分,需关注数据多样性和泛化性。替代指标为LLM的后训练提供了更可靠的评估标准,对实际应用具有重要意义。

Abstract: In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as ``RL’’ below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance could lead to substantially worse outcome compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance to provide strong proxies for the RL outcome. We trained hundreds of models up to 12B-parameter with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantial higher precision, improving $R^2$ coefficient and Spearman’s rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find SFT training on unique examples for a one epoch underperforms training on half examples for two epochs, either after SFT or SFT-then-RL; With the same SFT budget, training only on short examples may lead to better SFT performance, though, it often leads to worse outcome after RL compared to training on examples with varying lengths. Evaluation tool will be open-sourced.


[141] Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction cs.LG | cs.CLPDF

Adam Filipek

TL;DR: 本文提出了稀疏查询注意力(SQA)机制,通过减少查询头的数量降低计算复杂度,适用于长序列任务。

Details

Motivation: 多头注意力(MHA)的计算复杂度与序列长度呈二次方关系,限制了其在长序列任务中的可扩展性。现有方法如MQA和GQA通过共享键和值头解决了内存带宽问题,但未减少计算FLOPs。

Result: 在32k-200k长序列任务中,SQA在预训练、微调和编码器任务中实现了显著的吞吐量提升。

Insight: SQA提供了一种新的注意力优化路径,适用于计算密集型任务,可能是构建高效和可扩展模型的强大工具。

Abstract: The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with respect to sequence length presents a significant barrier to scaling, particularly for applications involving long contexts. Prevailing solutions, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), have effectively addressed the memory bandwidth bottleneck that dominates autoregressive inference latency by sharing Key and Value projections. While highly successful, these methods do not reduce the fundamental number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path. Instead of reducing Key/Value heads, SQA reduces the number of Query heads. This architectural modification directly decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate that SQA can achieve significant throughput improvements of up to 3x in computation-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary smallscale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, suggesting its potential as a powerful tool for building more efficient and scalable models


[142] StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? cs.LG | cs.CLPDF

Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu

TL;DR: StockBench是一个专门评估LLM代理在真实股票交易环境中表现的基准测试,填补了金融领域动态交易评估的空白。

Details

Motivation: 金融领域的高风险决策与经济价值直接相关,但现有的金融基准主要测试静态知识问答,无法捕捉交易的动态迭代过程,因此需新的评估工具。

Result: 多数LLM代理未能跑赢简单的买入持有基准,但部分模型展现出更高的回报潜力和更有效的风险管理能力。

Insight: 静态金融知识的优秀表现不一定能转化为成功的交易策略,强调了开发金融LLM代理的挑战与机遇。

Abstract: Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals – including prices, fundamentals, and news – and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.


[143] ExGRPO: Learning to Reason from Experience cs.LG | cs.AI | cs.CLPDF

Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu

TL;DR: ExGRPO提出了一种基于经验价值的学习框架,通过重用和优先排序有价值的推理经验,显著提升了大规模语言模型的推理性能。

Details

Motivation: 现有RLVR方法在训练时仅使用一次经验,导致计算效率低且不稳定。ExGRPO研究了经验特性对推理模型学习动态的影响,旨在通过高效经验管理提升性能。

Result: 在1.5B-8B参数的模型上平均提升了3.5/7.6分,且在强弱模型上均实现了稳定的训练效果。

Insight: 高效的经验管理是提升RLVR可扩展性和稳定性的关键。

Abstract: Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.


[144] Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks cs.LG | cs.AI | cs.CLPDF

Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter

TL;DR: 论文提出了一种树形对话增强策略优化方法(DialTree-RPO),用于自动化发现多轮对抗攻击策略,显著提升了攻击成功率。

Details

Motivation: 当前大语言模型在多轮交互中仍易受对抗攻击,但现有方法多依赖人工或单轮攻击模板,未能充分探索复杂对话动态和多轮攻击策略。

Result: 在10个目标模型上攻击成功率(ASR)比之前方法高出25.9%,并能发现新的攻击策略。

Insight: 多轮攻击策略的自动发现揭示了LLM在复杂对话中的安全漏洞,凸显了动态对话规划的挑战性和重要性。

Abstract: Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.


cs.MM [Back]

[145] Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation cs.MM | cs.CV | cs.SD | eess.ASPDF

Chetwin Low, Weimin Wang, Calder Katyal

TL;DR: 论文提出了Ovi,一种通过双主干网络跨模态融合实现音视频生成的统一范式,避免了复杂的多阶段架构或音视频分离生成的需求。

Details

Motivation: 音视频生成通常依赖复杂的多阶段架构或音视频的分离合成,导致同步性和连贯性差。Ovi旨在通过统一的生成过程解决这一问题。

Result: 生成的视频片段具有电影级质量,包含自然的语音和准确的上下文匹配音效。

Insight: 统一的生成过程和块间融合机制显著提升了音视频生成的同步性和质量,为多模态生成提供了新的思路。

Abstract: Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code and model weights are published at https://aaxwaz.github.io/Ovi


cs.HC [Back]

[146] Development and Evaluation of an AI-Driven Telemedicine System for Prenatal Healthcare cs.HC | cs.CV | eess.IVPDF

Juan Barrientos, Michaelle Pérez, Douglas González, Favio Reyna, Julio Fajardo

TL;DR: 开发了一个基于AI的远程医疗系统,通过盲扫协议帮助助产士获取胎儿图像,并通过专家异步审查提高诊断效率。

Details

Motivation: 低收入和中等收入国家的农村地区缺乏超声诊断资源,限制了产前保健的可及性。

Result: 系统在识别标准胎儿切面方面表现良好,现场评估显示其可用性高且认知负荷低。

Insight: AI驱动的远程医疗系统可以扩展资源匮乏地区的产前保健服务,同时减轻专家的工作负担。

Abstract: Access to obstetric ultrasound is often limited in low-resource settings, particularly in rural areas of low- and middle-income countries. This work proposes a human-in-the-loop artificial intelligence (AI) system designed to assist midwives in acquiring diagnostically relevant fetal images using blind sweep protocols. The system incorporates a classification model along with a web-based platform for asynchronous specialist reviews. By identifying key frames in blind sweep studies, the AI system allows specialists to concentrate on interpretation rather than having to review entire videos. To evaluate its performance, blind sweep videos captured by a small group of soft-trained midwives using a low-cost Point-of-Care Ultrasound (POCUS) device were analyzed. The system demonstrated promising results in identifying standard fetal planes from sweeps made by non-experts. A field evaluation indicated good usability and a low cognitive workload, suggesting that it has the potential to expand access to prenatal imaging in underserved regions.


eess.IV [Back]

[147] An Efficient Quality Metric for Video Frame Interpolation Based on Motion-Field Divergence eess.IV | cs.CV | cs.MMPDF

Conall Daly, Darren Ramsook, Anil Kokaram

TL;DR: 论文提出了一种基于运动场发散的高效视频帧插值质量度量方法PSNR_DIV,解决了现有质量度量(如PSNR、SSIM、LPIPS)忽略时间一致性的问题,同时在计算效率和内存占用上显著优于FloLPIPS。

Details

Motivation: 现有的视频帧插值质量度量方法(如PSNR、SSIM、LPIPS)无法有效评估插值伪影的感知效果,而专为视频帧插值设计的FloLPIPS虽然表现较好,但计算效率低下,限制了实际应用。因此,需要一种高效且准确的度量方法。

Result: 在BVI-VFI数据集上的实验表明,PSNR_DIV比FloLPIPS的Pearson线性相关系数提高了0.09,同时速度快2.5倍,内存占用减少4倍。

Insight: 运动场发散加权技术能有效捕捉视频帧插值中的时间不一致性,且计算高效,适合作为损失函数用于训练神经网络。

Abstract: Video frame interpolation is a fundamental tool for temporal video enhancement, but existing quality metrics struggle to evaluate the perceptual impact of interpolation artefacts effectively. Metrics like PSNR, SSIM and LPIPS ignore temporal coherence. State-of-the-art quality metrics tailored towards video frame interpolation, like FloLPIPS, have been developed but suffer from computational inefficiency that limits their practical application. We present $\text{PSNR}{\text{DIV}}$, a novel full-reference quality metric that enhances PSNR through motion divergence weighting, a technique adapted from archival film restoration where it was developed to detect temporal inconsistencies. Our approach highlights singularities in motion fields which is then used to weight image errors. Evaluation on the BVI-VFI dataset (180 sequences across multiple frame rates, resolutions and interpolation methods) shows $\text{PSNR}{\text{DIV}}$ achieves statistically significant improvements: +0.09 Pearson Linear Correlation Coefficient over FloLPIPS, while being 2.5$\times$ faster and using 4$\times$ less memory. Performance remains consistent across all content categories and are robust to the motion estimator used. The efficiency and accuracy of $\text{PSNR}_{\text{DIV}}$ enables fast quality evaluation and practical use as a loss function for training neural networks for video frame interpolation tasks. An implementation of our metric is available at www.github.com/conalld/psnr-div.


[148] Median2Median: Zero-shot Suppression of Structured Noise in Images eess.IV | cs.CV | q-bio.QM | stat.MLPDF

Jianxu Wang, Ge Wang

TL;DR: Median2Median(M2M)是一个零样本去噪框架,专为结构化噪声设计,通过创新的采样策略和广义中值滤波,能在无高质量标签数据的情况下有效去除相关性噪声。

Details

Motivation: 现有去噪方法在结构化噪声(强各向异性相关噪声)下表现不佳,数据驱动方法依赖高质量标签数据且泛化性有限,而零样本方法仅适用于独立同分布噪声。M2M旨在填补这一空白。

Result: M2M在独立同分布噪声下与现有零样本方法表现相当,在相关性噪声下显著优于它们。

Insight: M2M突破了零样本去噪方法严格依赖独立同分布噪声的限制,为结构化噪声提供了高效、无需数据的解决方案。

Abstract: Image denoising is a fundamental problem in computer vision and medical imaging. However, real-world images are often degraded by structured noise with strong anisotropic correlations that existing methods struggle to remove. Most data-driven approaches rely on large datasets with high-quality labels and still suffer from limited generalizability, whereas existing zero-shot methods avoid this limitation but remain effective only for independent and identically distributed (i.i.d.) noise. To address this gap, we propose Median2Median (M2M), a zero-shot denoising framework designed for structured noise. M2M introduces a novel sampling strategy that generates pseudo-independent sub-image pairs from a single noisy input. This strategy leverages directional interpolation and generalized median filtering to adaptively exclude values distorted by structured artifacts. To further enlarge the effective sampling space and eliminate systematic bias, a randomized assignment strategy is employed, ensuring that the sampled sub-image pairs are suitable for Noise2Noise training. In our realistic simulation studies, M2M performs on par with state-of-the-art zero-shot methods under i.i.d. noise, while consistently outperforming them under correlated noise. These findings establish M2M as an efficient, data-free solution for structured noise suppression and mark the first step toward effective zero-shot denoising beyond the strict i.i.d. assumption.


[149] GFSR-Net: Guided Focus via Segment-Wise Relevance Network for Interpretable Deep Learning in Medical Imaging eess.IV | cs.CV | physics.data-anPDF

Jhonatan Contreras, Thomas Bocklitz

TL;DR: GFSR-Net是一种通过分段相关性网络指导焦点的方法,旨在提升医疗影像中深度学习的可解释性和可靠性。它利用少量人工标注引导模型关注诊断相关区域,实验表明其在保持高准确性的同时提升了显著性图的可信度。

Details

Motivation: 医疗影像深度学习的局限性在于缺乏可解释性,模型可能依赖无关区域或虚假线索,降低临床信任。GFSR-Net旨在解决这一问题,通过引导焦点提升模型的可信度和实用性。

Result: 在胸片、视网膜扫描和皮肤科图像等任务中,GFSR-Net准确性可比或优于基线,同时生成的显著性图更符合人类预期。

Insight: 少量人工标注足以引导模型注意力,显著提升可解释性;该方法通用性强,适用于多种医疗影像任务。

Abstract: Deep learning has achieved remarkable success in medical image analysis, however its adoption in clinical practice is limited by a lack of interpretability. These models often make correct predictions without explaining their reasoning. They may also rely on image regions unrelated to the disease or visual cues, such as annotations, that are not present in real-world conditions. This can reduce trust and increase the risk of misleading diagnoses. We introduce the Guided Focus via Segment-Wise Relevance Network (GFSR-Net), an approach designed to improve interpretability and reliability in medical imaging. GFSR-Net uses a small number of human annotations to approximate where a person would focus within an image intuitively, without requiring precise boundaries or exhaustive markings, making the process fast and practical. During training, the model learns to align its focus with these areas, progressively emphasizing features that carry diagnostic meaning. This guidance works across different types of natural and medical images, including chest X-rays, retinal scans, and dermatological images. Our experiments demonstrate that GFSR achieves comparable or superior accuracy while producing saliency maps that better reflect human expectations. This reduces the reliance on irrelevant patterns and increases confidence in automated diagnostic tools.


q-bio.NC [Back]

[150] Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning q-bio.NC | cs.CV | cs.LGPDF

Kathy Garcia, Leyla Isik

TL;DR: 本文研究了预训练视频模型是否能够捕捉人类社交视频中的相似性结构,并通过行为数据微调模型以对齐人类感知。

Details

Motivation: 人类能够直观地感知视觉场景中的复杂社交信号,但当前AI模型是否具备这种能力尚不明确。本文旨在解决这一问题,并提出一种对齐人类社交判断的方法。

Result: 微调后的模型在保留视频上显著提升了与人类感知的对齐度,同时增强了社交情感属性的编码能力。

Insight: 预训练视频模型在社交识别方面存在不足,而行为数据驱动的微调可以有效改进其社交感知能力。

Abstract: Humans intuitively perceive complex social signals in visual scenes, yet it remains unclear whether state-of-the-art AI models encode the same similarity structure. We study (Q1) whether modern video and language models capture human-perceived similarity in social videos, and (Q2) how to instill this structure into models using human behavioral data. To address this, we introduce a new benchmark of over 49,000 odd-one-out similarity judgments on 250 three-second video clips of social interactions, and discover a modality gap: despite the task being visual, caption-based language embeddings align better with human similarity than any pretrained video model. We close this gap by fine-tuning a TimeSformer video model on these human judgments with our novel hybrid triplet-RSA objective using low-rank adaptation (LoRA), aligning pairwise distances to human similarity. This fine-tuning protocol yields significantly improved alignment with human perceptions on held-out videos in terms of both explained variance and odd-one-out triplet accuracy. Variance partitioning shows that the fine-tuned video model increases shared variance with language embeddings and explains additional unique variance not captured by the language model. Finally, we test transfer via linear probes and find that human-similarity fine-tuning strengthens the encoding of social-affective attributes (intimacy, valence, dominance, communication) relative to the pretrained baseline. Overall, our findings highlight a gap in pretrained video models’ social recognition and demonstrate that behavior-guided fine-tuning shapes video representations toward human social perception.


[151] Uncovering Semantic Selectivity of Latent Groups in Higher Visual Cortex with Mutual Information-Guided Diffusion q-bio.NC | cs.CV | cs.LGPDF

Yule Wang, Joseph Yu, Chengrui Li, Weihan Li, Anqi Wu

TL;DR: 论文提出MIG-Vis方法,利用扩散模型结合互信息指导,揭示高级视觉皮层中神经潜在子空间对视觉语义特征的编码方式,并通过实验验证其语义选择性。

Details

Motivation: 目前的研究主要通过人工神经网络与视觉皮层的表征对齐或解码方法来分析神经编码,但这些方法间接且无法揭示神经群体的具体组织结构。因此,论文旨在解决高级视觉皮层中语义特征如何分布在神经群体中并形成结构化子空间的问题。

Result: 实验表明,MIG-Vis能够识别出具有明确语义选择性的神经潜在子空间,例如对物体姿态、类间转换和类内内容的编码。

Insight: 该方法为高级视觉皮层的结构化语义表征提供了直接的、可解释的证据,推动了对其编码机制的理解。

Abstract: Understanding how neural populations in higher visual areas encode object-centered visual information remains a central challenge in computational neuroscience. Prior works have investigated representational alignment between artificial neural networks and the visual cortex. Nevertheless, these findings are indirect and offer limited insights to the structure of neural populations themselves. Similarly, decoding-based methods have quantified semantic features from neural populations but have not uncovered their underlying organizations. This leaves open a scientific question: “how feature-specific visual information is distributed across neural populations in higher visual areas, and whether it is organized into structured, semantically meaningful subspaces.” To tackle this problem, we present MIG-Vis, a method that leverages the generative power of diffusion models to visualize and validate the visual-semantic attributes encoded in neural latent subspaces. Our method first uses a variational autoencoder to infer a group-wise disentangled neural latent subspace from neural populations. Subsequently, we propose a mutual information (MI)-guided diffusion synthesis procedure to visualize the specific visual-semantic features encoded by each latent group. We validate MIG-Vis on multi-session neural spiking datasets from the inferior temporal (IT) cortex of two macaques. The synthesized results demonstrate that our method identifies neural latent groups with clear semantic selectivity to diverse visual features, including object pose, inter-category transformations, and intra-class content. These findings provide direct, interpretable evidence of structured semantic representation in the higher visual cortex and advance our understanding of its encoding principles.


q-bio.QM [Back]

[152] A Multicentric Dataset for Training and Benchmarking Breast Cancer Segmentation in H&E Slides q-bio.QM | cs.CV | eess.IVPDF

Carlijn Lems, Leslie Tessier, John-Melle Bokhorst, Mart van Rijthoven, Witali Aswolinskiy

TL;DR: 这篇论文提出了BEETLE数据集,用于乳腺癌H&E玻片的语义分割任务。数据集包含587个样本,覆盖多种分子亚型和组织学等级,并通过多重标注策略提供了四种类别的标注。其多样性和外部评估集为标准化的模型评测提供了支持。

Details

Motivation: 现有的乳腺癌分割公共数据集缺乏形态多样性,限制了模型的泛化能力和生物标记验证的鲁棒性。为了解决这一问题,作者提出了一个多中心、多样化的数据集。

Result: 数据集已公开,提供了多样化的样本和多类标注,尤其关注现有数据集中代表性不足的形态(如导管原位癌)。

Insight: 该数据集的多样性和标准化标注可以为乳腺癌的自动化生物标记分析提供更可靠的基准,并促进模型的跨中心泛化能力。

Abstract: Automated semantic segmentation of whole-slide images (WSIs) stained with hematoxylin and eosin (H&E) is essential for large-scale artificial intelligence-based biomarker analysis in breast cancer. However, existing public datasets for breast cancer segmentation lack the morphological diversity needed to support model generalizability and robust biomarker validation across heterogeneous patient cohorts. We introduce BrEast cancEr hisTopathoLogy sEgmentation (BEETLE), a dataset for multiclass semantic segmentation of H&E-stained breast cancer WSIs. It consists of 587 biopsies and resections from three collaborating clinical centers and two public datasets, digitized using seven scanners, and covers all molecular subtypes and histological grades. Using diverse annotation strategies, we collected annotations across four classes - invasive epithelium, non-invasive epithelium, necrosis, and other - with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ and dispersed lobular tumor cells. The dataset’s diversity and relevance to the rapidly growing field of automated biomarker quantification in breast cancer ensure its high potential for reuse. Finally, we provide a well-curated, multicentric external evaluation set to enable standardized benchmarking of breast cancer segmentation models.


cs.CR [Back]

[153] Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge cs.CR | cs.CLPDF

Hui Dou, Ning Xu, Yiwen Zhang, Kaibin Wang

TL;DR: 该论文提出了一种名为RTS-Attack的方法,通过构建语义相关且包含目标毒性知识的嵌套场景,绕过大型语言模型(LLMs)的对齐防御,实现了高效且隐蔽的越狱攻击。

Details

Motivation: 尽管LLMs在多种任务中表现出色,但其对齐防御机制在语义相关的嵌套场景和毒性知识面前存在漏洞。论文旨在探索这一未被充分研究的方向并提出有效的攻击框架。

Result: 实验表明,RTS-Attack在GPT-4o、Llama3-70b和Gemini-pro等多种先进LLMs上均表现出高效的越狱能力和通用性。

Insight: LLMs的对齐防御机制在处理语义相关嵌套场景时存在潜在漏洞,这为未来防御策略的设计提供了重要启示。

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks. However, they remain exposed to jailbreak attacks, eliciting harmful responses. The nested scenario strategy has been increasingly adopted across various methods, demonstrating immense potential. Nevertheless, these methods are easily detectable due to their prominent malicious intentions. In this work, we are the first to find and systematically verify that LLMs’ alignment defenses are not sensitive to nested scenarios, where these scenarios are highly semantically relevant to the queries and incorporate targeted toxic knowledge. This is a crucial yet insufficiently explored direction. Based on this, we propose RTS-Attack (Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge), an adaptive and automated framework to examine LLMs’ alignment. By building scenarios highly relevant to the queries and integrating targeted toxic knowledge, RTS-Attack bypasses the alignment defenses of LLMs. Moreover, the jailbreak prompts generated by RTS-Attack are free from harmful queries, leading to outstanding concealment. Extensive experiments demonstrate that RTS-Attack exhibits superior performance in both efficiency and universality compared to the baselines across diverse advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro. Our complete code is available in the supplementary material. WARNING: THIS PAPER CONTAINS POTENTIALLY HARMFUL CONTENT.


[154] ZK-WAGON: Imperceptible Watermark for Image Generation Models using ZK-SNARKs cs.CR | cs.AI | cs.CVPDF

Aadarsh Anantha Ramakrishnan, Shubham Agarwal, Selvanayagam S, Kunwar Singh

TL;DR: ZK-WAGON首次提出了一种基于ZK-SNARKs的图像生成模型水印方法,通过选择性层转换和LSB隐写术,实现不可察觉的水印嵌入和可验证的来源证明,解决了传统方法的质量损失和安全性问题。

Details

Motivation: 随着图像生成模型的普及,合成媒体的真实性、所有权和滥用问题日益严重。传统水印方法存在质量下降或安全性不足的问题,亟需一种安全且不影响图像质量的解决方案。

Result: 在GAN和Diffusion模型上验证了方法的有效性,实现了高质量图像与可验证来源的结合。

Insight: ZK-SNARKs结合隐写术为生成模型的版权保护提供了一种创新且安全的解决方案,适用于多样化的模型类型。

Abstract: As image generation models grow increasingly powerful and accessible, concerns around authenticity, ownership, and misuse of synthetic media have become critical. The ability to generate lifelike images indistinguishable from real ones introduces risks such as misinformation, deepfakes, and intellectual property violations. Traditional watermarking methods either degrade image quality, are easily removed, or require access to confidential model internals - making them unsuitable for secure and scalable deployment. We are the first to introduce ZK-WAGON, a novel system for watermarking image generation models using the Zero-Knowledge Succinct Non Interactive Argument of Knowledge (ZK-SNARKs). Our approach enables verifiable proof of origin without exposing model weights, generation prompts, or any sensitive internal information. We propose Selective Layer ZK-Circuit Creation (SL-ZKCC), a method to selectively convert key layers of an image generation model into a circuit, reducing proof generation time significantly. Generated ZK-SNARK proofs are imperceptibly embedded into a generated image via Least Significant Bit (LSB) steganography. We demonstrate this system on both GAN and Diffusion models, providing a secure, model-agnostic pipeline for trustworthy AI image generation.


[155] Position: Privacy Is Not Just Memorization! cs.CR | cs.AI | cs.CL | cs.LGPDF

Niloofar Mireshghallah, Tianshi Li

TL;DR: 该立场论文指出,大型语言模型(LLM)的隐私风险远不止训练数据的逐字记忆,还包括数据收集、推理时上下文泄漏、自主代理能力以及深度推理攻击导致的监控民主化等多方面威胁。

Details

Motivation: 当前关于LLM隐私风险的讨论过度关注训练数据的逐字记忆,而忽视了其他更直接和可扩展的隐私威胁。本文旨在揭示这些被低估的威胁,并呼吁研究社区的关注。

Result: 研究发现当前隐私研究过度关注数据记忆问题,而其他更紧迫的隐私风险(如上下文泄漏和深度推理攻击)缺乏有效解决方案。

Insight: 隐私风险是多方面的,需要跨学科的解决方案,而不仅仅是技术层面的优化。

Abstract: The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remain underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle – from data collection through deployment – and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016–2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.


cs.IR [Back]

[156] Synthetic Prefixes to Mitigate Bias in Real-Time Neural Query Autocomplete cs.IR | cs.AI | cs.CL | cs.LGPDF

Adithya Rajan, Xiaoyu Liu, Prateek Verma, Vibhu Arora

TL;DR: 论文提出了一种通过生成合成前缀来减少实时神经查询自动完成系统中的展示偏差的数据中心方法。这些前缀来自未启用自动完成的用户完整查询,丰富了训练数据的多样性。

Details

Motivation: 自动完成系统中的展示偏差问题是由于用户行为受到模型建议的影响,导致训练数据存在偏差。需要通过数据干预来解决这一问题。

Result: 在用户参与度指标(如平均倒数排名)上实现了统计显著的提升。

Insight: 合成前缀不仅能提升模型的泛化能力,还为其他低延迟排序任务(如相关搜索和查询推荐)的偏差减少提供了可扩展的解决方案。

Abstract: We introduce a data-centric approach for mitigating presentation bias in real-time neural query autocomplete systems through the use of synthetic prefixes. These prefixes are generated from complete user queries collected during regular search sessions where autocomplete was not active. This allows us to enrich the training data for learning to rank models with more diverse and less biased examples. This method addresses the inherent bias in engagement signals collected from live query autocomplete interactions, where model suggestions influence user behavior. Our neural ranker is optimized for real-time deployment under strict latency constraints and incorporates a rich set of features, including query popularity, seasonality, fuzzy match scores, and contextual signals such as department affinity, device type, and vertical alignment with previous user queries. To support efficient training, we introduce a task-specific simplification of the listwise loss, reducing computational complexity from $O(n^2)$ to $O(n)$ by leveraging the query autocomplete structure of having only one ground-truth selection per prefix. Deployed in a large-scale e-commerce setting, our system demonstrates statistically significant improvements in user engagement, as measured by mean reciprocal rank and related metrics. Our findings show that synthetic prefixes not only improve generalization but also provide a scalable path toward bias mitigation in other low-latency ranking tasks, including related searches and query recommendations.


[157] Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion and Evidence-grounded Explanations cs.IR | cs.AI | cs.CLPDF

Bo Ma, LuYao Liu, Simon Lau, Chandler Yuan, and XueY Cui

TL;DR: 该论文提出了一个名为\model{}的框架,通过动态对齐、多模态融合和基于证据的解释,解决了协同过滤和大语言模型结合中的三个主要挑战。

Details

Motivation: 现有方法在结合协同过滤和大语言模型时存在静态数据无法捕捉动态用户偏好、多模态内容未被充分利用以及解释缺乏可信证据的问题,需要一种更高效的解决方案。

Result: 该方法在不显著增加计算开销的情况下,保持了高效性,适合实际部署。

Insight: 动态学习和多模态融合是提升推荐系统灵活性和解释性的关键。

Abstract: Recent research has explored using Large Language Models for recommendation tasks by transforming user interaction histories and item metadata into text prompts, then having the LLM produce rankings or recommendations. A promising approach involves connecting collaborative filtering knowledge to LLM representations through compact adapter networks, which avoids expensive fine-tuning while preserving the strengths of both components. Yet several challenges persist in practice: collaborative filtering models often use static snapshots that miss rapidly changing user preferences; many real-world items contain rich visual and audio content beyond textual descriptions; and current systems struggle to provide trustworthy explanations backed by concrete evidence. Our work introduces \model{}, a framework that tackles these limitations through three key innovations. We develop an online adaptation mechanism that continuously incorporates new user interactions through lightweight modules, avoiding the need to retrain large models. We create a unified representation that seamlessly combines collaborative signals with visual and audio features, handling cases where some modalities may be unavailable. Finally, we design an explanation system that grounds recommendations in specific collaborative patterns and item attributes, producing natural language rationales users can verify. Our approach maintains the efficiency of frozen base models while adding minimal computational overhead, making it practical for real-world deployment.


[158] LLM4Rec: Large Language Models for Multimodal Generative Recommendation with Causal Debiasing cs.IR | cs.AI | cs.CLPDF

Bo Ma, Hang Li, ZeHua Hu, XiaoFan Gui, LuYao Liu

TL;DR: LLM4Rec提出了一种基于大型语言模型的多模态生成推荐框架,融合了多模态数据、因果去偏、实时自适应学习等创新,显著提升了推荐的准确性、公平性和多样性。

Details

Motivation: 当前生成推荐系统在多模态数据处理、消除算法偏见和透明决策方面存在局限性。

Result: 在MovieLens-25M等三个基准数据集上,NDCG@10提升2.3%,多样性指标提升1.4%,同时保持计算效率。

Insight: 融合因果推断和多模态学习是提升推荐系统效果的关键。

Abstract: Contemporary generative recommendation systems face significant challenges in handling multimodal data, eliminating algorithmic biases, and providing transparent decision-making processes. This paper introduces an enhanced generative recommendation framework that addresses these limitations through five key innovations: multimodal fusion architecture, retrieval-augmented generation mechanisms, causal inference-based debiasing, explainable recommendation generation, and real-time adaptive learning capabilities. Our framework leverages advanced large language models as the backbone while incorporating specialized modules for cross-modal understanding, contextual knowledge integration, bias mitigation, explanation synthesis, and continuous model adaptation. Extensive experiments on three benchmark datasets (MovieLens-25M, Amazon-Electronics, Yelp-2023) demonstrate consistent improvements in recommendation accuracy, fairness, and diversity compared to existing approaches. The proposed framework achieves up to 2.3% improvement in NDCG@10 and 1.4% enhancement in diversity metrics while maintaining computational efficiency through optimized inference strategies.


cs.RO [Back]

[159] VENTURA: Adapting Image Diffusion Models for Unified Task Conditioned Navigation cs.RO | cs.CVPDF

Arthur Zhang, Xiangyun Meng, Luca Calliari, Dong-Ki Kim, Shayegan Omidshafiei

TL;DR: VENTURA通过微调预训练的扩散模型生成视觉路径掩码,结合轻量级行为克隆策略,实现了基于自然语言指令的多样化机器人导航,显著提升了任务表现和泛化能力。

Details

Motivation: 机器人需适应多样化的人类指令并在开放环境中安全操作。现有视觉语言模型难以直接用于导航任务,因其动作空间和预训练目标差异导致迁移困难。VENTURA旨在解决这一问题。

Result: VENTURA在真实环境中比SOTA方法提升了33%的成功率和54%的碰撞减少率,并能泛化到未见任务组合中。

Insight: 扩散模型可用于生成高层次的视觉规划,而无需直接预测低层动作;视觉掩码和行为克隆的结合为多任务导航提供了灵活的接口。

Abstract: Robots must adapt to diverse human instructions and operate safely in unstructured, open-world environments. Recent Vision-Language models (VLMs) offer strong priors for grounding language and perception, but remain difficult to steer for navigation due to differences in action spaces and pretraining objectives that hamper transferability to robotics tasks. Towards addressing this, we introduce VENTURA, a vision-language navigation system that finetunes internet-pretrained image diffusion models for path planning. Instead of directly predicting low-level actions, VENTURA generates a path mask (i.e. a visual plan) in image space that captures fine-grained, context-aware navigation behaviors. A lightweight behavior-cloning policy grounds these visual plans into executable trajectories, yielding an interface that follows natural language instructions to generate diverse robot behaviors. To scale training, we supervise on path masks derived from self-supervised tracking models paired with VLM-augmented captions, avoiding manual pixel-level annotation or highly engineered data collection setups. In extensive real-world evaluations, VENTURA outperforms state-of-the-art foundation model baselines on object reaching, obstacle avoidance, and terrain preference tasks, improving success rates by 33% and reducing collisions by 54% across both seen and unseen scenarios. Notably, we find that VENTURA generalizes to unseen combinations of distinct tasks, revealing emergent compositional capabilities. Videos, code, and additional materials: https://venturapath.github.io


[160] DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis cs.RO | cs.CVPDF

Jialin Gao, Donghao Zhou, Mingjian Liang, Lihao Liu, Chi-Wing Fu

TL;DR: DisCo-Layout提出了一种新颖的多智能体框架,通过分离和协调语义与物理优化来生成3D室内布局,解决了传统方法泛化能力差的问题,并在实验中取得了最优性能。

Details

Motivation: 传统3D室内布局生成方法受限于固定数据集,泛化能力不足;而基于LLM和VLM的方法虽然语义丰富,但缺乏灵活和鲁棒的优化,导致布局效果不佳。

Result: 实验表明DisCo-Layout生成的布局真实、连贯且泛化能力强,达到了最先进的水平。

Insight: 分离语义和物理优化并结合多智能体协作是一种有效的布局生成方法,能显著提升生成效果和灵活性。

Abstract: 3D indoor layout synthesis is crucial for creating virtual environments. Traditional methods struggle with generalization due to fixed datasets. While recent LLM and VLM-based approaches offer improved semantic richness, they often lack robust and flexible refinement, resulting in suboptimal layouts. We develop DisCo-Layout, a novel framework that disentangles and coordinates physical and semantic refinement. For independent refinement, our Semantic Refinement Tool (SRT) corrects abstract object relationships, while the Physical Refinement Tool (PRT) resolves concrete spatial issues via a grid-matching algorithm. For collaborative refinement, a multi-agent framework intelligently orchestrates these tools, featuring a planner for placement rules, a designer for initial layouts, and an evaluator for assessment. Experiments demonstrate DisCo-Layout’s state-of-the-art performance, generating realistic, coherent, and generalizable 3D indoor layouts. Our code will be publicly available.